vigneshsiva11 opened a new pull request, #9504:
URL: https://github.com/apache/arrow-rs/pull/9504
# Which issue does this PR close?
- Closes #7973.
# Rationale for this change
When reading Parquet byte-array columns (Utf8 / Binary) into Arrow arrays
with 32-bit offsets, the reader errors with "index overflow decoding byte
array" as soon as the accumulated string/binary data in a single batch exceeds
2 GiB (i32::MAX bytes).
With the default `batch_size` of 8,192 rows, this means any column whose
average value is larger than ~256 KiB cannot be read at all—even though the file
is perfectly valid and both pyarrow and DuckDB handle it fine by splitting
internally.
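As a back-of-envelope check on that figure (a hypothetical helper, not part of this PR): with 32-bit offsets a single batch can hold at most `i32::MAX` bytes of value data, so the largest average value that still fits a full default-sized batch is:

```rust
// Hypothetical illustration (not part of this PR): the largest average
// value length, in bytes, that fits a full batch when offsets are i32.
fn max_avg_value_len(batch_size: usize) -> usize {
    // i32::MAX bytes of value data shared across batch_size rows
    i32::MAX as usize / batch_size
}
```

For `batch_size = 8192` this comes out just under 256 KiB, matching the figure above.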
The correct fix, as discussed in the issue, is for the Parquet reader to
treat `batch_size` as a *target* rather than a hard limit and emit a smaller
`RecordBatch` whenever the next value would overflow the offset
type.
# What changes are included in this PR?
### `parquet/src/arrow/buffer/offset_buffer.rs`
- Added `OffsetBuffer::would_overflow(data_len: usize) -> bool` — an
inline, zero-allocation helper that uses `checked_add` to safely test whether
appending `data_len` bytes would exceed the representable range of offset type
`I`, without mutating any state.
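A minimal standalone sketch of the check for the `i32` case (illustrative only; the real helper is a method on `OffsetBuffer`, generic over the offset type `I`, and reads the buffer's current end offset itself):

```rust
// Illustrative sketch of the overflow check for i32 offsets.
// `current_end` is the buffer's last offset; `data_len` is the length of
// the value about to be appended. No state is mutated.
fn would_overflow_i32(current_end: usize, data_len: usize) -> bool {
    match current_end.checked_add(data_len) {
        // Sum is representable as usize: overflow iff it exceeds i32::MAX
        Some(end) => end > i32::MAX as usize,
        // The usize addition itself overflowed (e.g. data_len near usize::MAX)
        None => true,
    }
}
```

Using `checked_add` rather than plain `+` is what makes the `usize::MAX` edge case safe: a wrapping sum could otherwise land back inside the valid range and pass the check.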
### `parquet/src/arrow/array_reader/byte_array.rs`
All four byte-array decoders are updated to call `would_overflow` **before**
each `try_push`. When the check fires, the decoder breaks out of its loop and
returns the partial count. The decoder's internal position is left pointing at
the value that didn't fit, so the next `read_records()` call resumes from
exactly that value—no rows are lost, duplicated, or reordered.
| Decoder | Change |
|---|---|
| `ByteArrayDecoderPlain::read` | Check `would_overflow` before `try_push`;
fix `max_remaining_values` to subtract actual reads, not requested reads |
| `ByteArrayDecoderDeltaLength::read` | Same pattern; advance
`length_offset` / `data_offset` only by what was consumed |
| `ByteArrayDecoderDelta::read` | Check `would_overflow` inside the callback
closure; use an `overflow` flag to distinguish a clean stop from a real error |
| `ByteArrayDecoderDictionary::read` | Process one dictionary key at a time
via `decoder.read(1, …)` so `DictIndexDecoder` never advances past an
unconsumed key |
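The pattern shared by the decoders above can be sketched as follows (a simplified model over in-memory values, not the actual page-decoding code; all names here are illustrative):

```rust
// Simplified model of the check-before-push loop shared by the decoders.
// `pos` persists across calls, so the value that didn't fit is re-read by
// the next call: no rows are lost, duplicated, or reordered.
fn read_values(
    values: &[&[u8]],       // decoded values, in order (stand-in for a page)
    pos: &mut usize,        // decoder position, survives across calls
    buffer_end: &mut usize, // bytes already accumulated in the offset buffer
    max_offset: usize,      // i32::MAX as usize for 32-bit offsets
    to_read: usize,         // requested row count (the batch_size target)
) -> usize {
    let mut read = 0;
    while read < to_read && *pos < values.len() {
        let len = values[*pos].len();
        // Check *before* pushing; on overflow, stop cleanly and return the
        // partial count without advancing past the unconsumed value.
        match buffer_end.checked_add(len) {
            Some(end) if end <= max_offset => *buffer_end = end,
            _ => break,
        }
        *pos += 1;
        read += 1;
    }
    read
}
```

With a tiny `max_offset`, two consecutive calls split three values into reads of 2 and 1, mirroring how the reader emits a smaller `RecordBatch` and then resumes.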
# Are these changes tested?
Yes:
- **`test_would_overflow`** — unit test for the new helper covering both
`i32` and `i64` offset types, including the `usize::MAX` edge case.
- **`test_plain_decoder_partial_read`** — confirms that a 3-value PLAIN page
is correctly split across two `read()` calls with no data lost or
duplicated.
# Are there any user-facing changes?
No breaking changes. Users who previously hit `"index overflow decoding byte
array"` with large string/binary columns will now get their data returned
across multiple `RecordBatch`es transparently, with no API or schema changes
required.