Copilot commented on code in PR #50024:
URL: https://github.com/apache/arrow/pull/50024#discussion_r3299910011
##########
cpp/src/arrow/array/util.cc:
##########
@@ -853,6 +853,19 @@ class RepeatedArrayFactory {
template <typename OffsetType>
Status CreateOffsetsBuffer(OffsetType value_length, std::shared_ptr<Buffer>*
out) {
+ // Check that the total data size does not overflow the offset type.
+ // For 32-bit offset types (e.g. StringType, BinaryType), value_length *
length_
+ // must fit in int32_t, otherwise the offsets wrap around and produce an
invalid
+ // array with negative offsets.
+ if (value_length > 0 && length_ > 0) {
+ int64_t total_size = static_cast<int64_t>(value_length) * length_;
+ if (total_size >
static_cast<int64_t>(std::numeric_limits<OffsetType>::max())) {
+ return Status::Invalid(
+ "Cannot create array: total data size (", total_size,
Review Comment:
The overflow check computes `total_size` via
`static_cast<int64_t>(value_length) * length_`, which can overflow `int64_t`
(signed overflow is UB) for large `length_`. Also, the `for (..., offset +=
value_length)` loop increments `offset` once after the last appended offset,
which can overflow `OffsetType` even when the final stored offset is within
range. Both issues can be avoided by using a division-based overflow check and
by appending the final offset without an extra increment.
##########
cpp/src/arrow/array/util.cc:
##########
@@ -853,6 +853,19 @@ class RepeatedArrayFactory {
template <typename OffsetType>
Status CreateOffsetsBuffer(OffsetType value_length, std::shared_ptr<Buffer>*
out) {
+ // Check that the total data size does not overflow the offset type.
+ // For 32-bit offset types (e.g. StringType, BinaryType), value_length *
length_
+ // must fit in int32_t, otherwise the offsets wrap around and produce an
invalid
+ // array with negative offsets.
+ if (value_length > 0 && length_ > 0) {
Review Comment:
`CreateOffsetsBuffer()` is called *after* `CreateBufferOf()` in the
var-binary / string repeat path (and similarly for some nested types). For the
overflow cases this PR targets (e.g. repeating a 1-byte string 2^31 times),
`CreateBufferOf()` would attempt a huge `Resize()` and then loop `length_`
times appending bytes before the offset overflow is detected, which is
effectively non-terminating / can OOM. To make the new overflow check effective
and ensure the intended `Status::Invalid` is returned quickly, the offset-size
check needs to run before building the repeated values buffer (e.g.
compute/check offsets first, then allocate/copy values).
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]