Nathan-Fenner opened a new issue, #39190:
URL: https://github.com/apache/arrow/issues/39190

   ### Describe the bug, including details regarding any error messages, 
version, and platform.
   
   When a `pyarrow.Table` contains very large rows, whose total size is very close to `2**31 - 1` bytes, a `group_by` on very large `large_utf8` columns can segfault or raise an allocator exception:
   
   ```py
   import pyarrow as pa
   
   # MAX_SIZE is the largest value that can fit in a 32-bit signed integer.
   MAX_SIZE = int(2**31) - 1
   
   # Create a string whose length is very close to MAX_SIZE:
   BIG_STR_LEN = MAX_SIZE - 1
   print(f"{BIG_STR_LEN=} = 2**31 - {2**31 - BIG_STR_LEN}")
   BIG_STR = "A" * BIG_STR_LEN
   
    # Create a record batch with two rows, both containing the BIG_STR in each of their columns:
   record_batch = pa.RecordBatch.from_pydict(
       mapping={
           "id": [BIG_STR, BIG_STR],
           "other": [BIG_STR, BIG_STR],
       },
       schema=pa.schema(
           {
               "id": pa.large_utf8(),
               "other": pa.large_utf8(),
           }
       ),
   )
   
   # Create a table containing just the one RecordBatch:
   table = pa.Table.from_batches([record_batch])
   
   # Attempt to group by `id`:
   ans = table.group_by(["id"]).aggregate([("other", "max")])
   print(ans)
   ```
   
   On my M1 Mac, the output from running this program looks like:
   
   **PyArrow version: 14.0.1**
   
   ```
   BIG_STR_LEN=2147483646 = 2**31 - 2
    libc++abi: terminating due to uncaught exception of type std::bad_alloc: std::bad_alloc
    zsh: abort      python main.py
   ```
   
   (In the previous version, PyArrow 10.0.1, this was a segfault instead of just a bad_alloc exception):
   ```
    BIG_STR_LEN=2147483642 = 2**31 - 6
   zsh: segmentation fault  python main.py
   ```
   
   ---
   
   I need to emphasize that there is more than enough memory available to satisfy this operation. The problem is actually integer overflow, which I believe occurs in one or both of the following places:
   
   - In [`VarLengthKeyEncoder::AddLength`](https://github.com/apache/arrow/blob/087fc8f5d31b377916711e98024048b76eae06e8/cpp/src/arrow/compute/kernels/row_encoder_internal.h#L134) there is no check that adding the value's size to the offset does not cause the length of the buffer to overflow an `int32_t`
   - In [`GrouperImpl::Consume`](https://github.com/apache/arrow/blob/087fc8f5d31b377916711e98024048b76eae06e8/cpp/src/arrow/compute/row/grouper.cc#L433-L441) there is no check that the sums of the `offsets_batch` do not overflow an `int32_t`
   
   Overflow in signed integer arithmetic is undefined behavior in C++, but in practice it typically wraps around, leaving us with a negative `int32_t` value.
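   To make the wrap-around concrete, here is a small self-contained sketch using the lengths from the repro above (the demo does its arithmetic in `int64_t`/`uint32_t` so the demonstration itself avoids UB; the `int32_t` result assumes a two's-complement target):
   
   ```cpp
    #include <cassert>
    #include <cstdint>
    
    int main() {
        // Two key lengths near INT32_MAX, as in the repro above.
        int64_t a = 2147483646;
        int64_t b = 2147483646;
        int64_t true_sum = a + b;  // 4294967292: does not fit in int32_t
    
        // The value a wrapping 32-bit accumulation would produce
        // (computed via uint32_t so this demo itself has no UB):
        int32_t wrapped = static_cast<int32_t>(static_cast<uint32_t>(true_sum));
    
        assert(true_sum == 4294967292LL);
        assert(wrapped == -4);  // the sum wraps to a negative "length"
        return 0;
    }
   ```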
   
   Then, when we construct
   
   ```cpp
   std::vector<uint8_t> key_bytes_batch(total_length);
   ```
   
   the `total_length` is converted from `int32_t` to `uint64_t` (since `std::vector`'s sizing constructor accepts a `size_t`, which is a 64-bit unsigned integer on most modern platforms). The conversion goes like this:
   
   ```
   int32_t(-1)  ==>  int64_t(-1)  ==>  uint64_t(2**64 - 1)
   ```
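   That conversion chain can be checked directly; a minimal sketch, assuming a 64-bit `size_t`:
   
   ```cpp
    #include <cassert>
    #include <cstddef>
    #include <cstdint>
    
    int main() {
        int32_t total_length = -1;  // a wrapped (negative) length
        // std::vector<uint8_t>(total_length) implicitly performs:
        size_t n = static_cast<size_t>(static_cast<int64_t>(total_length));
        assert(n == UINT64_MAX);  // 2**64 - 1 on a 64-bit size_t
        return 0;
    }
   ```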
   
   But `2**64 - 1` bytes is obviously more memory than is available on my computer. The overflow needs to be detected sooner, before this wrapped value reaches the allocator as an impossibly large request.
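   One way the overflow could be caught earlier is a checked accumulation of the lengths. A minimal sketch (`checked_total_length` is a hypothetical helper, not Arrow's API; `__builtin_add_overflow` is a GCC/Clang builtin):
   
   ```cpp
    #include <cassert>
    #include <cstdint>
    
    // Sum int32_t lengths, reporting overflow instead of wrapping.
    // Returns false (and leaves *out untouched) if any step overflows.
    bool checked_total_length(const int32_t* lengths, int n, int32_t* out) {
        int32_t total = 0;
        for (int i = 0; i < n; ++i) {
            if (__builtin_add_overflow(total, lengths[i], &total)) {
                return false;  // overflow detected before any allocation
            }
        }
        *out = total;
        return true;
    }
    
    int main() {
        int32_t small_lens[] = {100, 200};
        int32_t big_lens[] = {2147483646, 2147483646};
        int32_t total = 0;
        assert(checked_total_length(small_lens, 2, &total) && total == 300);
        assert(!checked_total_length(big_lens, 2, &total));  // caught, no wrap
        return 0;
    }
   ```
   
   A check like this at the accumulation sites above would let the kernel return a proper error status instead of passing a wrapped value to the allocator.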
   
   ### Component(s)
   
   Python

