Sriniketh24 opened a new pull request, #50024: URL: https://github.com/apache/arrow/pull/50024
### Rationale

`pyarrow.repeat` (backed by `MakeArrayFromScalar` in C++) silently created an invalid array with negative offsets when the total data size (`value_size * repetition_count`) exceeded `INT32_MAX` for 32-bit offset types (`StringType`, `BinaryType`). Array creation succeeded without error, but a later validation failed with a cryptic "Negative offsets in binary array" or "non-monotonic offset" message.

### What changed

Added an early overflow check in `RepeatedArrayFactory::CreateOffsetsBuffer` that computes the total data size in `int64_t` and returns `Status::Invalid` with an actionable error message when the total would exceed the offset type's maximum. The error message suggests using `large_*` types (e.g. `large_string`, `large_binary`) for data exceeding 2 GB.

### Are these changes tested?

Yes.

- **C++ test**: `TestMakeArrayFromScalarOffsetOverflow` in `array_test.cc`, which tests string, binary, and large_string scalars
- **Python test**: `test_repeat_offset_overflow` in `test_array.py`, which verifies that `pa.repeat` raises `ArrowInvalid` on overflow

### Are there any user-facing changes?

Yes. `MakeArrayFromScalar` (and `pyarrow.repeat`) now raises `ArrowInvalid` early with a clear error message instead of silently returning a corrupt array. This is a strictly better user experience.

Closes: #36388

---

This is AI-assisted work by Claude.
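
For readers who want to see the arithmetic behind the overflow, here is a minimal standard-library Python sketch. It is not the C++ code from this PR: the helper names `wrap_int32` and `check_repeat_size` and the error text are hypothetical, chosen only for illustration. It shows how a final offset past `INT32_MAX` wraps to a negative number when stored in 32 bits, and how computing the total in a wider integer first lets the check fail early.

```python
INT32_MAX = 2**31 - 1
INT64_MAX = 2**63 - 1


def wrap_int32(value: int) -> int:
    """Reinterpret an integer as a two's-complement 32-bit value."""
    value &= 0xFFFFFFFF
    return value - 2**32 if value >= 2**31 else value


def check_repeat_size(value_size: int, length: int,
                      offset_max: int = INT32_MAX) -> int:
    """Return the total data size, or raise if the last offset would not fit.

    Hypothetical helper: it mirrors the idea of computing the total in a
    wide integer before any 32-bit offsets are written.
    """
    total = value_size * length  # Python ints are unbounded, so no wrap here
    if total > offset_max:
        # Illustrative wording only, not the message emitted by Arrow.
        raise ValueError(
            f"total data size {total} exceeds the maximum offset {offset_max}; "
            "consider a large_* type such as large_string or large_binary"
        )
    return total


if __name__ == "__main__":
    value_size, length = 1024, 2**21 + 1  # just over 2 GiB in total

    # Without a check, the last offset stored as int32 becomes negative.
    print("wrapped final offset:", wrap_int32(value_size * length))

    # With the check, the problem is reported before any array is built.
    try:
        check_repeat_size(value_size, length)
    except ValueError as exc:
        print("rejected:", exc)

    # A 64-bit offset type (as used by large_string) accepts the same size.
    print("64-bit total:", check_repeat_size(value_size, length, INT64_MAX))
```

Passing `INT64_MAX` as `offset_max` stands in for the `large_*` types, whose 64-bit offsets accommodate the same total without overflow.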
