zanmato1984 commented on issue #39332:
URL: https://github.com/apache/arrow/issues/39332#issuecomment-1870690063

   Took a look at this, here are my findings.
   
   First, some quick answers to related concerns:
   
   > If I change the `pa.string()` to a `pa.large_string()` then it works fine.
   
   Because only the `string()` type goes through the swiss join, which leverages the light-array code where the problem lives (detailed later). `large_string()` takes another code path, the regular hash join implementation, which doesn't have this problem.
   
   > Oddly, the duplicate value in `right_keys` is essential to reproduce the 
bug. If I set `right_keys = pa.array([0, 1], pa.int64() ` then it completes 
successfully.
   
   The essence of this failure is an overflow of the offsets of the resulting var-length buffer (detailed later). So the right table must contain enough duplicate keys to explode the join result and trigger the overflow. E.g., the left table has 128 rows of 8 MB strings, so if the right table contains 4 (or more) matches per left row, the result will hold 128 * 8 MB * 4 = 4 GB of string data, and a `uint32_t` offset overflows to zero at exactly that size.
   
   Now the detailed explanation of the segmentation fault (I didn't look much into the infinite-loop case, but I believe it has essentially the same cause):
   
   The result of the join contains 128 * 4 = 512 rows. Each row of the string column coming from the left table holds an 8 MB string, so the resulting string column carries 8 MB * 512 = 4 GB of data.
   
   When `ExecBatchBuilder` appends this data, it first calculates the offsets for each row. Specifically, the last (512th) row occupies the byte range `[4 GB - 8 MB, 4 GB)`, so its end offset should be written as `offsets[512] = 4 GB`. Unfortunately, in the following code
   
https://github.com/apache/arrow/blob/bcaeaa8c2d970b81249cfba019475598e3d3109f/cpp/src/arrow/compute/light_array.cc#L594
   `sum` overflows to `0` once it reaches 4 GB, because it is a `uint32_t`.
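   
   For concreteness, here is a minimal standalone sketch (not the actual `light_array.cc` code; the row size and row count are taken from the repro above) of how a `uint32_t` running sum of row lengths wraps to zero exactly at the last offset:

```cpp
#include <cstdint>
#include <iostream>
#include <vector>

int main() {
  const uint32_t kRowLength = 8u * 1024 * 1024;  // 8 MB per row
  const int kNumRows = 512;                      // 128 left rows * 4 matches each

  std::vector<uint32_t> offsets(kNumRows + 1);
  uint32_t sum = 0;  // same width as the offsets themselves
  for (int i = 0; i < kNumRows; ++i) {
    offsets[i] = sum;
    sum += kRowLength;  // unsigned wrap-around: 512 * 8 MB == 2^32 == 0
  }
  offsets[kNumRows] = sum;

  // Prints 0 instead of the expected 4294967296 (4 GB).
  std::cout << "offsets[" << kNumRows << "] = " << offsets[kNumRows] << std::endl;
  return 0;
}
```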
   
   And then when allocating the buffer for the string data, the following code
   
https://github.com/apache/arrow/blob/bcaeaa8c2d970b81249cfba019475598e3d3109f/cpp/src/arrow/compute/light_array.cc#L598
   effectively does nothing: it sees `offsets[512] == 0`, concludes the buffer needs no resizing, and leaves the data buffer at its initial size of `align(8 B + 64 B, 64 B) = 128 B`.
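   
   To illustrate the effect (hypothetical names below; this is not the real resizing code, just the behavior described above): the resize decision is driven by the last offset, so a wrapped-around `offsets[512] == 0` makes the buffer look big enough already.

```cpp
#include <cstdint>
#include <iostream>

// Hypothetical stand-in for the builder's var-length data buffer.
struct VarLenBuffer {
  uint32_t capacity = 128;  // initial size: align(8 B + 64 B, 64 B) = 128 B

  void ResizeForLastOffset(uint32_t last_offset) {
    if (last_offset > capacity) {
      capacity = last_offset;
      std::cout << "resized to " << capacity << " bytes" << std::endl;
    } else {
      // With the wrapped offset of 0 this branch is taken, so the 4 GB of
      // string data is later appended into a 128-byte buffer.
      std::cout << "no resize: last offset " << last_offset
                << " <= capacity " << capacity << std::endl;
    }
  }
};

int main() {
  VarLenBuffer buf;
  buf.ResizeForLastOffset(/*last_offset=*/0);  // the wrapped value from above
  return 0;
}
```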
   
   Later, when the string data is appended into it, the segmentation fault happens as soon as the write goes past the end of that 128-byte buffer.
   
   I'm not sure whether the API contract is upheld in this particular case, because for a string array whose elements are 8 MB each, merely >= 512 rows already exceed the valid offset range of `uint32_t`. Maybe large string is a better choice here.
   
   But I think we should at least detect this kind of overflow and report an 
explicit error in this case.
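   
   One possible shape of such a check (a sketch only, not an actual Arrow patch; the function name is made up): accumulate the offsets in 64 bits and fail explicitly once the total would no longer fit in the `uint32_t` offsets of a non-large string array.

```cpp
#include <cstdint>
#include <limits>
#include <string>
#include <vector>

// Hypothetical helper: computes offsets from row lengths, but returns an error
// instead of silently wrapping once the total exceeds the uint32_t range.
bool AccumulateOffsetsChecked(const std::vector<uint32_t>& row_lengths,
                              std::vector<uint32_t>* offsets, std::string* error) {
  offsets->assign(row_lengths.size() + 1, 0);
  uint64_t sum = 0;  // 64-bit accumulator, unlike the current uint32_t sum
  for (size_t i = 0; i < row_lengths.size(); ++i) {
    (*offsets)[i] = static_cast<uint32_t>(sum);
    sum += row_lengths[i];
    if (sum > std::numeric_limits<uint32_t>::max()) {
      *error = "var-length data exceeds 4 GB; consider large_string/large_binary";
      return false;
    }
  }
  offsets->back() = static_cast<uint32_t>(sum);
  return true;
}
```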

