vigneshsiva11 commented on issue #7973:
URL: https://github.com/apache/arrow-rs/issues/7973#issuecomment-3853037835

   Hi all,
   
   I’ve gone through the full discussion, and I agree this is an arrow-rs Parquet reader issue: the reader currently materializes `batch_size` rows into a single Utf8/Binary array, which can overflow the array's 32-bit offsets when the total string/binary data in one batch exceeds `i32::MAX` bytes (just under 2 GiB).
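   To make the limit concrete, here is a back-of-the-envelope check; the 1024-row batch and ~3 MB rows are purely illustrative numbers, not a claim about any particular file or about the reader's defaults:

```rust
fn main() {
    // Utf8/Binary arrays store offsets as i32, so the concatenated value
    // bytes of a single array cannot exceed i32::MAX (~2.14 GB).
    let offset_limit = i32::MAX as u64;

    // Illustrative: a batch of 1024 rows of ~3 MB strings already blows past it.
    let rows = 1024u64;
    let bytes_per_row = 3_000_000u64;
    assert!(rows * bytes_per_row > offset_limit); // 3_072_000_000 > 2_147_483_647
}
```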
   
   From my understanding, the correct fix is not to require schema changes 
(e.g., LargeUtf8 or StringView) or smaller user-provided batch sizes, but 
instead to have the Parquet decoder emit smaller RecordBatches when the target 
Arrow array would exceed representable limits, similar to how pyarrow 
internally splits row groups.
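   As a very rough illustration, the splitting decision itself is simple once the decoded value lengths are known. The helper below is only a sketch with a made-up name, not a proposal for the actual arrow-rs API; the real change would live inside the reader's record-batch assembly:

```rust
/// Hypothetical helper: given the decoded byte length of each value that
/// would go into the next batch, return how many leading rows can be emitted
/// before the cumulative value bytes no longer fit in an i32 offset buffer.
fn rows_that_fit(row_byte_lengths: &[usize]) -> usize {
    let mut total: usize = 0;
    for (i, len) in row_byte_lengths.iter().enumerate() {
        match total.checked_add(*len) {
            Some(t) if t <= i32::MAX as usize => total = t,
            // Stop here: the first `i` rows become a (smaller) RecordBatch
            // and the remaining rows carry over to the next one.
            _ => return i,
        }
    }
    row_byte_lengths.len()
}

fn main() {
    // Three values just under 1 GiB each: only the first two fit under
    // i32::MAX, so the reader would emit a 2-row batch and then a 1-row batch.
    let just_under_gib = (1usize << 30) - 1;
    assert_eq!(rows_that_fit(&[just_under_gib; 3]), 2);
}
```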
   
   My proposed approach is:
   1. Add a regression test in arrow-rs that reproduces the current overflow panic when reading a Parquet file with large string/binary data using default settings (a rough sketch follows this list).
   2. Trace the Parquet Arrow decoding path to identify where `batch_size` rows are accumulated into a single `StringArray`.
   3. Introduce logic at the Parquet reader layer to stop decoding early and 
emit a partial RecordBatch when adding more rows would overflow the offset 
buffer.
   4. Continue decoding the remaining rows in subsequent RecordBatches, treating `batch_size` as a target rather than a hard requirement.
   5. Ensure this behavior only triggers when necessary so there is no impact 
on common cases.
   
   I plan to start with step (1) by adding the failing test, then iterate on 
the batching logic once the failure is covered.
   
   Does this approach align with the intended direction for fixing this issue?
   

