vigneshsiva11 opened a new pull request, #9504:
URL: https://github.com/apache/arrow-rs/pull/9504
# Which issue does this PR close?
- Closes #7973.
# Rationale for this change
When reading Parquet byte-array columns (Utf8 / Binary) into Arrow arrays
with 32-bit offsets, the reader errors with "index overflow decoding byte
array" as soon as the accumulated string/binary data in a single batch exceeds
2 GiB (i32::MAX bytes).
With the default `batch_size` of 8,192 rows, this means any column whose
average value is larger than ~256 KiB cannot be read at all—even though the file
is perfectly valid and both pyarrow and DuckDB handle it fine by splitting
internally.
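As a back-of-envelope check on that figure (a hypothetical helper, not part of this PR): with 32-bit offsets a single batch can hold at most `i32::MAX` bytes of value data, so the largest average value that still fits a full default-sized batch is:

```rust
// Hypothetical illustration (not part of this PR): the largest average
// value length, in bytes, that fits a full batch when offsets are i32.
fn max_avg_value_len(batch_size: usize) -> usize {
    // i32::MAX bytes of value data shared across batch_size rows
    i32::MAX as usize / batch_size
}
```

For `batch_size = 8192` this comes out just under 256 KiB, matching the figure above.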
The correct fix, as discussed in the issue, is for the Parquet reader to
treat `batch_size` as a *target* rather than a hard limit and emit a smaller
`RecordBatch` whenever the next value would overflow the offset
type.
# What changes are included in this PR?
### `parquet/src/arrow/buffer/offset_buffer.rs`
- Added `OffsetBuffer::would_overflow(data_len: usize) -> bool` — an
inline, zero-allocation helper that uses `checked_add` to safely test whether
appending `data_len` bytes would exceed the representable range of offset type
`I`, without mutating any state.
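A minimal standalone sketch of the check for the `i32` case (illustrative only; the real helper is a method on `OffsetBuffer`, generic over the offset type `I`, and reads the buffer's current end offset itself):

```rust
// Illustrative sketch of the overflow check for i32 offsets.
// `current_end` is the buffer's last offset; `data_len` is the length of
// the value about to be appended. No state is mutated.
fn would_overflow_i32(current_end: usize, data_len: usize) -> bool {
    match current_end.checked_add(data_len) {
        // Sum is representable as usize: overflow iff it exceeds i32::MAX
        Some(end) => end > i32::MAX as usize,
        // The usize addition itself overflowed (e.g. data_len near usize::MAX)
        None => true,
    }
}
```

Using `checked_add` rather than plain `+` is what makes the `usize::MAX` edge case safe: a wrapping sum could otherwise land back inside the valid range and pass the check.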
### `parquet/src/arrow/array_reader/byte_array.rs`
All four byte-array decoders are updated to call `would_overflow` **before**
each `try_push`. When the check fires, the decoder breaks out of its loop and
returns the partial count. The decoder's internal position is left pointing at
the value that didn't fit, so the next `read_records()` call resumes from
exactly that value—no rows are lost, duplicated, or reordered.
| Decoder | Change |
|---|---|
| `ByteArrayDecoderPlain::read` | Check `would_overflow` before `try_push`;
fix `max_remaining_values` to subtract actual reads, not requested reads |
| `ByteArrayDecoderDeltaLength::read` | Same pattern; advance
`length_offset` / `data_offset` only by what was consumed |
| `ByteArrayDecoderDelta::read` | Check `would_overflow` inside the callback
closure; use an `overflow` flag to distinguish a clean stop from a real error |
| `ByteArrayDecoderDictionary::read` | Process one dictionary key at a time
via `decoder.read(1, …)` so `DictIndexDecoder` never advances past an
unconsumed key |
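The pattern shared by the decoders above can be sketched as follows (a simplified model over in-memory values, not the actual page-decoding code; all names here are illustrative):

```rust
// Simplified model of the check-before-push loop shared by the decoders.
// `pos` persists across calls, so the value that didn't fit is re-read by
// the next call: no rows are lost, duplicated, or reordered.
fn read_values(
    values: &[&[u8]],       // decoded values, in order (stand-in for a page)
    pos: &mut usize,        // decoder position, survives across calls
    buffer_end: &mut usize, // bytes already accumulated in the offset buffer
    max_offset: usize,      // i32::MAX as usize for 32-bit offsets
    to_read: usize,         // requested row count (the batch_size target)
) -> usize {
    let mut read = 0;
    while read < to_read && *pos < values.len() {
        let len = values[*pos].len();
        // Check *before* pushing; on overflow, stop cleanly and return the
        // partial count without advancing past the unconsumed value.
        match buffer_end.checked_add(len) {
            Some(end) if end <= max_offset => *buffer_end = end,
            _ => break,
        }
        *pos += 1;
        read += 1;
    }
    read
}
```

With a tiny `max_offset`, two consecutive calls split three values into reads of 2 and 1, mirroring how the reader emits a smaller `RecordBatch` and then resumes.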
# Are these changes tested?
Yes:
- **`test_would_overflow`** — unit test for the new helper covering both
`i32` and `i64` offset types, including the `usize::MAX` edge case.
- **`test_plain_decoder_partial_read`** — confirms that a 3-value PLAIN page
is correctly split across two `read()` calls with no data lost or
duplicated.
# Are there any user-facing changes?
No breaking changes. Users who previously hit `"index overflow decoding byte
array"` with large string/binary columns will now get their data returned
across multiple `RecordBatch`es transparently, with no API or schema changes
required.