vigneshsiva11 commented on issue #7973:
URL: https://github.com/apache/arrow-rs/issues/7973#issuecomment-3998738220
I've opened a PR that implements **step 4**: having the decoder emit smaller
`RecordBatches` automatically rather than returning an error.
**PR:** [Parquet] Split byte-array batches transparently when i32 offset
would overflow #9504
The fix makes `batch_size` a target rather than a hard limit. When the next
value would cause the accumulated byte-array data to exceed `i32::MAX`, all
four decoders (**Plain, DeltaLength, DeltaByteArray, and Dictionary**) now stop
early and return the partial batch. The decoder's internal position is left at
the unread value, so the next `read_records()` call resumes seamlessly; no rows
are lost, duplicated, or reordered.
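To make the split-and-resume behaviour concrete, here is a minimal toy model of the loop (illustrative only, not the PR's actual code; `MAX_BYTES` stands in for `i32::MAX` so the example stays small):

```rust
// Toy model of the early-break behaviour described above; not arrow-rs code.
// MAX_BYTES stands in for i32::MAX so the example stays small.
const MAX_BYTES: usize = 10;

struct ToyDecoder {
    values: Vec<Vec<u8>>,
    pos: usize, // index of the next unread value; persists across calls
}

impl ToyDecoder {
    /// Read up to `batch_size` values, breaking early if appending the next
    /// value would push the accumulated byte length past MAX_BYTES.
    fn read_records(&mut self, batch_size: usize) -> Vec<Vec<u8>> {
        let mut batch = Vec::new();
        let mut bytes = 0;
        while batch.len() < batch_size && self.pos < self.values.len() {
            let next_len = self.values[self.pos].len();
            if bytes + next_len > MAX_BYTES && !batch.is_empty() {
                break; // partial batch; `pos` still points at the unread value
            }
            bytes += next_len;
            batch.push(self.values[self.pos].clone());
            self.pos += 1;
        }
        batch
    }
}
```

With three 4-byte values and `MAX_BYTES = 10`, the first call returns two values and the next call resumes at the third, so nothing is lost, duplicated, or reordered.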
**Key changes:**
* `OffsetBuffer::would_overflow(data_len)`: a new zero-cost inline helper that
detects overflow before any mutation
* All four byte-array decoders updated to call `would_overflow` and break
cleanly instead of propagating an error
* Unit tests added for the new helper and for partial-read behaviour
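For illustration, a helper of the shape described in the first bullet might look like this (a hedged sketch; the field names are assumptions, not the PR's exact definition):

```rust
// Sketch of an overflow-probe helper like the one described above.
// Field names are illustrative; the real OffsetBuffer lives in parquet's
// byte-array decoding internals.
struct OffsetBuffer {
    offsets: Vec<i32>, // value offsets, as in Arrow's variable-length layout
    data: Vec<u8>,     // concatenated value bytes
}

impl OffsetBuffer {
    /// Returns true if appending `data_len` more bytes would push the final
    /// offset past i32::MAX. Called before any mutation, so a decoder can
    /// break cleanly and return a partial batch instead of erroring.
    #[inline]
    fn would_overflow(&self, data_len: usize) -> bool {
        self.data
            .len()
            .checked_add(data_len)
            .map_or(true, |total| total > i32::MAX as usize)
    }
}
```

Using `checked_add` keeps the probe safe even when `data_len` itself is large enough to wrap `usize` arithmetic.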
This means files like the ones from **HuggingFaceTB/dclm-edu**, or the
original reproduction case (`[5068563] * 500`), will be transparently readable
with default settings and **no schema changes required**.
Would appreciate a review!
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]