Nathan-Fenner opened a new issue, #39190:
URL: https://github.com/apache/arrow/issues/39190
### Describe the bug, including details regarding any error messages, version, and platform.
When a `pyarrow.Table` contains very large rows, whose total size is very close to `2**31 - 1` bytes, a segfault or an allocator exception can be raised when performing a `group_by` on very large `large_utf8` columns:
```py
import pyarrow as pa
# MAX_SIZE is the largest value that can fit in a 32-bit signed integer.
MAX_SIZE = int(2**31) - 1
# Create a string whose length is very close to MAX_SIZE:
BIG_STR_LEN = MAX_SIZE - 1
print(f"{BIG_STR_LEN=} = 2**31 - {2**31 - BIG_STR_LEN}")
BIG_STR = "A" * BIG_STR_LEN
# Create a record batch with two rows, both containing the BIG_STR in each
# of their columns:
record_batch = pa.RecordBatch.from_pydict(
    mapping={
        "id": [BIG_STR, BIG_STR],
        "other": [BIG_STR, BIG_STR],
    },
    schema=pa.schema(
        {
            "id": pa.large_utf8(),
            "other": pa.large_utf8(),
        }
    ),
)
# Create a table containing just the one RecordBatch:
table = pa.Table.from_batches([record_batch])
# Attempt to group by `id`:
ans = table.group_by(["id"]).aggregate([("other", "max")])
print(ans)
```
On my M1 Mac, the output from running this program looks like:
**PyArrow version: 14.0.1**
```
BIG_STR_LEN=2147483646 = 2**31 - 2
libc++abi: terminating due to uncaught exception of type std::bad_alloc: std::bad_alloc
zsh: abort      python main.py
```
(In the previous version, `pyarrow==10.0.1`, this was a segfault instead of a `bad_alloc` exception):
```
BIG_STR_LEN=2147483642 = 2**31 - 6
zsh: segmentation fault python main.py
```
---
I need to emphasize that there is more than enough memory available to satisfy this operation. The problem is actually caused by integer overflow, which I believe occurs in one or both of the following places:
- In
[`VarLengthKeyEncoder::AddLength`](https://github.com/apache/arrow/blob/087fc8f5d31b377916711e98024048b76eae06e8/cpp/src/arrow/compute/kernels/row_encoder_internal.h#L134)
there is no check that the size of the offset does not cause the length of the
buffer to overflow an `int32_t`
- In
[`GrouperImpl::Consume`](https://github.com/apache/arrow/blob/087fc8f5d31b377916711e98024048b76eae06e8/cpp/src/arrow/compute/row/grouper.cc#L433-L441)
there's no check that sums of the `offsets_batch` do not overflow an `int32_t`
Overflow in signed integer arithmetic is undefined behavior in C++, but
typically results in "wrap-around". The result is that we're getting a negative
`int32_t` value.
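The wrap-around can be simulated in pure Python (a sketch; `wrap_int32` is an illustrative helper emulating two's-complement `int32_t` arithmetic, not an Arrow function):

```python
INT32_MAX = 2**31 - 1

def wrap_int32(x: int) -> int:
    """Emulate C++ int32_t wrap-around (two's complement)."""
    return (x + 2**31) % 2**32 - 2**31

row_len = INT32_MAX - 1                    # one near-maximal BIG_STR row
total_length = wrap_int32(row_len + row_len)
print(total_length)                        # -4: the sum wrapped negative
```

Two rows of `2**31 - 2` bytes sum to roughly `2**32`, which wraps to a small negative `int32_t`.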
Then, when we construct
```cpp
std::vector<uint8_t> key_bytes_batch(total_length);
```
the `total_length` is converted from `int32_t` to `uint64_t` (since
`std::vector`'s length constructor accepts a `size_t`, which is `uint64_t` on
most modern computers). The conversion goes like this:
```
int32_t(-1) ==> int64_t(-1) ==> uint64_t(2**64 - 1)
```
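The same chain can be reproduced numerically in Python (illustrative only — this just mimics the C++ conversions):

```python
# int32_t(-1) -> int64_t(-1): sign extension preserves the value.
# int64_t(-1) -> uint64_t: reinterpreting as unsigned yields 2**64 - 1.
neg = -1                   # the wrapped int32_t total_length
as_uint64 = neg % 2**64    # what std::vector's size_t constructor receives
print(as_uint64)           # 18446744073709551615 == 2**64 - 1
```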
But `2**64 - 1` bytes is obviously more memory than is available on my
computer. The overflow needs to be detected sooner to prevent this
excessively-large number from being used as an impossible allocation request.
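One way to detect it sooner would be a checked addition wherever the lengths are accumulated. A hedged sketch in Python (`checked_add_int32` is a hypothetical helper, not Arrow's actual API; the C++ fix would presumably use Arrow's own overflow-checked arithmetic):

```python
INT32_MAX = 2**31 - 1

def checked_add_int32(a: int, b: int) -> int:
    """Hypothetical guard: raise instead of silently wrapping."""
    total = a + b
    if total > INT32_MAX:
        raise OverflowError("accumulated key length exceeds 2**31 - 1 bytes")
    return total

# Summing two near-maximal row lengths now fails loudly instead of
# producing a negative length that becomes a huge size_t:
try:
    checked_add_int32(INT32_MAX - 1, INT32_MAX - 1)
except OverflowError as e:
    print("detected:", e)
```

With a guard like this, the user would get a clear error (or Arrow could fall back to a wider offset type) instead of a crash.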
### Component(s)
Python