vigneshsiva11 commented on issue #7973: URL: https://github.com/apache/arrow-rs/issues/7973#issuecomment-3853037835
Hi all, I’ve gone through the full discussion, and I agree this is an arrow-rs Parquet reader issue: the reader currently attempts to materialize `batch_size` rows into a single Utf8/Binary array, which can overflow the 32-bit offsets when the total string/binary data exceeds 2 GiB.

From my understanding, the correct fix is not to require schema changes (e.g., `LargeUtf8` or `StringView`) or smaller user-provided batch sizes, but to have the Parquet decoder emit smaller `RecordBatch`es when the target Arrow array would exceed its representable limits, similar to how pyarrow internally splits row groups.

My proposed approach:

1. Add a regression test in arrow-rs that reproduces the current overflow panic when reading a Parquet file with large string/binary data using default settings (a rough sketch is below).
2. Trace the Parquet Arrow decoding path to identify where `batch_size` rows are accumulated into a single `StringArray`.
3. Introduce logic at the Parquet reader layer to stop decoding early and emit a partial `RecordBatch` when adding more rows would overflow the offset buffer (see the splitting sketch after the test).
4. Continue decoding the remaining rows in subsequent `RecordBatch`es, treating `batch_size` as a target rather than a hard requirement.
5. Ensure this behavior only triggers when necessary, so common cases are unaffected.

I plan to start with step (1) by adding the failing test, then iterate on the batching logic once the failure is covered. Does this approach align with the intended direction for fixing this issue?
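
For step (1), here is a minimal sketch of the regression test I have in mind. It assumes the `parquet`, `arrow_array`, `arrow_schema`, and `bytes` crates available in this repo's test setup; the row count, value size, and test name are illustrative, and the exact data layout may need tuning to hit the panicking code path. Since it needs several GiB of memory, it would likely be `#[ignore]`d or feature-gated in practice:

```rust
use std::sync::Arc;

use arrow_array::{ArrayRef, RecordBatch, StringArray};
use arrow_schema::{DataType, Field, Schema};
use bytes::Bytes;
use parquet::arrow::arrow_reader::ParquetRecordBatchReaderBuilder;
use parquet::arrow::ArrowWriter;

#[test]
#[ignore] // needs several GiB of memory; explicit regression check only
fn read_large_utf8_with_default_batch_size() {
    // ~8 MiB per value, 300 rows total: > 2 GiB of string data, yet well
    // within the default batch size of 1024 rows, so the reader currently
    // tries to build one StringArray whose i32 offsets overflow.
    let value = "x".repeat(8 * 1024 * 1024);
    let schema = Arc::new(Schema::new(vec![Field::new("s", DataType::Utf8, false)]));

    let mut buffer = Vec::new();
    let mut writer = ArrowWriter::try_new(&mut buffer, schema.clone(), None).unwrap();
    for _ in 0..3 {
        let array: ArrayRef = Arc::new(StringArray::from(vec![value.as_str(); 100]));
        let batch = RecordBatch::try_new(schema.clone(), vec![array]).unwrap();
        writer.write(&batch).unwrap();
    }
    writer.close().unwrap();

    // Read back with default settings; after the fix the reader should emit
    // several smaller batches instead of panicking on offset overflow.
    let reader = ParquetRecordBatchReaderBuilder::try_new(Bytes::from(buffer))
        .unwrap()
        .build()
        .unwrap();

    let total_rows: usize = reader.map(|batch| batch.unwrap().num_rows()).sum();
    assert_eq!(total_rows, 300);
}
```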

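For step (3), a self-contained, hypothetical sketch of the splitting rule (the function name and shape are mine, not existing reader internals): given the byte lengths of the values about to be appended, it computes how many rows still fit before the `i32` offsets of a Utf8/Binary array would overflow, which is the point where the reader would flush a partial `RecordBatch` and continue:

```rust
/// Hypothetical helper, not existing reader internals: returns how many of the
/// candidate values can still be appended to the current offset buffer before
/// the final i32 offset would exceed i32::MAX.
fn rows_before_offset_overflow(current_value_bytes: usize, next_value_lens: &[usize]) -> usize {
    let mut total = current_value_bytes;
    for (i, len) in next_value_lens.iter().enumerate() {
        match total.checked_add(*len) {
            Some(new_total) if new_total <= i32::MAX as usize => total = new_total,
            // Appending this value would push the final offset past i32::MAX,
            // so the rows accumulated so far should be emitted as a smaller batch.
            _ => return i,
        }
    }
    next_value_lens.len()
}

fn main() {
    // Three 800 MiB values: two fit under i32::MAX (~2 GiB), the third does
    // not, so the reader would emit a 2-row batch and keep decoding from row 3.
    let lens = vec![800 * 1024 * 1024; 3];
    assert_eq!(rows_before_offset_overflow(0, &lens), 2);
}
```

The intent is that this check only changes behavior for byte-array columns whose data actually approaches the 2 GiB limit, so fixed-width columns and ordinary workloads keep producing full `batch_size` batches.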