vigneshsiva11 commented on issue #7973:
URL: https://github.com/apache/arrow-rs/issues/7973#issuecomment-3998738220

    I've opened a PR that implements **step 4**: the decoder now emits smaller `RecordBatch`es automatically instead of returning an error.
   
   **PR:** [Parquet] Split byte-array batches transparently when i32 offset 
would overflow #9504
   
   The fix makes `batch_size` a target rather than a hard limit. When the next 
value would cause the accumulated byte-array data to exceed `i32::MAX`, all 
four decoders (**Plain, DeltaLength, DeltaByteArray, and Dictionary**) now stop 
early and return the partial batch. The decoder's internal position is left at 
the unread value, so the next `read_records()` call resumes seamlessly; no rows 
are lost, duplicated, or reordered.
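   To make the resume behaviour concrete, here is a minimal toy model of the split logic, not the actual arrow-rs decoder; all names (`ToyDecoder`, `byte_cap`) are illustrative, and a small cap stands in for `i32::MAX`:

   ```rust
   // Toy model of the split behaviour: a decoder that stops before its
   // accumulated byte length would exceed a cap, returns a partial batch,
   // and resumes from the same position on the next call.
   struct ToyDecoder<'a> {
       values: &'a [&'a str],
       pos: usize,      // next unread value; left in place on early stop
       byte_cap: usize, // stands in for i32::MAX in the real decoders
   }

   impl<'a> ToyDecoder<'a> {
       fn read_records(&mut self, batch_size: usize) -> Vec<&'a str> {
           let mut batch = Vec::new();
           let mut bytes = 0usize;
           while batch.len() < batch_size && self.pos < self.values.len() {
               let v = self.values[self.pos];
               // Check *before* mutating, mirroring the pre-check idea.
               if bytes + v.len() > self.byte_cap {
                   break; // return the partial batch; `pos` stays put
               }
               bytes += v.len();
               batch.push(v);
               self.pos += 1;
           }
           batch
       }
   }

   fn main() {
       let values = ["aaaa", "bbbb", "cccc", "dddd"];
       let mut dec = ToyDecoder { values: &values, pos: 0, byte_cap: 10 };
       let mut batches = Vec::new();
       loop {
           let b = dec.read_records(4); // batch_size is a target, not a limit
           if b.is_empty() { break; }
           batches.push(b);
       }
       // Only two 4-byte values fit under the 10-byte cap per batch, so the
       // caller sees two batches of two; every row arrives once, in order.
       assert_eq!(batches, vec![vec!["aaaa", "bbbb"], vec!["cccc", "dddd"]]);
       println!("{:?}", batches);
   }
   ```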
   
   **Key changes:**
   
   * `OffsetBuffer::would_overflow(data_len)`: a new zero-cost inline helper that detects overflow before any mutation
   * All four byte-array decoders updated to call `would_overflow` and break 
cleanly instead of propagating an error
   * Unit tests added for the new helper and for partial-read behaviour
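   
   The pre-check idea can be sketched in a few lines; this is a hedged, standalone approximation, not the actual `OffsetBuffer::would_overflow` signature in arrow-rs, which may differ in detail:

   ```rust
   // Sketch of an overflow pre-check: test before any mutation, so a
   // failed check leaves the offset buffer completely untouched.
   fn would_overflow(current_bytes: usize, next_value_len: usize) -> bool {
       // i32 offsets cap the total byte length of a byte-array column.
       current_bytes
           .checked_add(next_value_len)
           .map_or(true, |total| total > i32::MAX as usize)
   }

   fn main() {
       // Small lengths are fine.
       assert!(!would_overflow(100, 28));
       // One byte past i32::MAX must trip the check.
       assert!(would_overflow(i32::MAX as usize, 1));
       println!("ok");
   }
   ```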
   
   This means files like the ones from **HuggingFaceTB/dclm-edu** or the 
original reproduction case (`[5068563] * 500`) will be transparently readable 
with default settings and **no schema changes required**.
   
   Would appreciate a review! 

