alamb commented on issue #7973:
URL: https://github.com/apache/arrow-rs/issues/7973#issuecomment-3853234125

   > My proposed approach is
   
   Yes this approach sounds good. 
   
   > but instead to have the Parquet decoder emit smaller RecordBatches when 
the target Arrow array would exceed representable limits, similar to how 
pyarrow internally splits row groups.
   
   Yes this is my understanding too
   
   > Add a regression test in arrow-rs that reproduces the current overflow 
panic when reading a Parquet file with large string/binary data using default 
settings.
   
   
   I think the code paths will be different for different string encodings, so 
I recommend you also add tests explicitly for the string Encodings (a sketch of 
one such test follows the list below): 
https://docs.rs/parquet/latest/parquet/basic/enum.Encoding.html
   * PLAIN encoding
   * DELTA_LENGTH_BYTE_ARRAY
   * DELTA_BYTE_ARRAY
   * RLE_DICTIONARY
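   
   A reproducer along these lines might be a starting point. This is only a 
sketch under assumptions: the test name, data sizes, and writer setup are 
illustrative, and each encoding above would get its own variant (e.g. enabling 
the dictionary for RLE_DICTIONARY instead of forcing an explicit encoding):
   
   ```rust
   use std::sync::Arc;
   
   use arrow_array::{ArrayRef, RecordBatch, StringArray};
   use arrow_schema::{DataType, Field, Schema};
   use bytes::Bytes;
   use parquet::arrow::arrow_reader::ParquetRecordBatchReaderBuilder;
   use parquet::arrow::ArrowWriter;
   use parquet::basic::Encoding;
   use parquet::file::properties::WriterProperties;
   
   #[test]
   fn reading_large_strings_overflows_i32_offsets() {
       // 1024 rows of 4 MiB strings is ~4 GiB of value bytes, more than an
       // i32-offset StringArray can hold, while the reader's default batch
       // size (1024 rows) tries to decode all of them into a single batch.
       // Note: running this needs several GiB of memory.
       let value = "x".repeat(4 * 1024 * 1024);
       let schema = Arc::new(Schema::new(vec![Field::new("col", DataType::Utf8, false)]));
   
       let props = WriterProperties::builder()
           .set_dictionary_enabled(false)
           .set_encoding(Encoding::PLAIN) // swap in the other encodings per variant
           .build();
   
       let mut buffer = Vec::new();
       let mut writer = ArrowWriter::try_new(&mut buffer, schema.clone(), Some(props)).unwrap();
       // Write in small batches so the writer side itself never overflows.
       for _ in 0..256 {
           let array: ArrayRef = Arc::new(StringArray::from(vec![value.as_str(); 4]));
           let batch = RecordBatch::try_new(schema.clone(), vec![array]).unwrap();
           writer.write(&batch).unwrap();
       }
       writer.close().unwrap();
   
       // Reading back with default settings currently panics with an offset
       // overflow; after the fix it should yield several smaller batches instead.
       let reader = ParquetRecordBatchReaderBuilder::try_new(Bytes::from(buffer))
           .unwrap()
           .build()
           .unwrap();
       for batch in reader {
           batch.unwrap();
       }
   }
   ```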
   
   > Introduce logic at the Parquet reader layer to stop decoding early and 
emit a partial RecordBatch when adding more rows would overflow the offset 
buffer.
   
   Yes, this sounds good too. I think the key will be to ensure adding the 
check doesn't slow down the existing reader.
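   
   For the early-stop check itself, I imagine something roughly like the 
following. This is purely illustrative (the helper name and placement are made 
up, not the actual arrow-rs reader internals); the core of it is just checked 
arithmetic against the i32 offset limit of String/Binary arrays:
   
   ```rust
   /// Illustrative only: would the value bytes still fit in i32 offsets after
   /// appending one more value? When this returns false the decoder would stop
   /// early and emit the rows accumulated so far as a smaller RecordBatch.
   fn fits_in_i32_offsets(bytes_so_far: usize, next_value_len: usize) -> bool {
       bytes_so_far
           .checked_add(next_value_len)
           .map(|total| total <= i32::MAX as usize)
           .unwrap_or(false)
   }
   
   fn main() {
       assert!(fits_in_i32_offsets(0, 1024));
       assert!(!fits_in_i32_offsets(i32::MAX as usize, 1));
   }
   ```
   
   If a per-value branch turns out to be measurable in benchmarks, the check 
could presumably be hoisted to a coarser granularity (for example per page, 
using the page's uncompressed size) so the hot decode loop stays unchanged.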

