alamb commented on issue #7973: URL: https://github.com/apache/arrow-rs/issues/7973#issuecomment-3853234125
> My proposed approach is

Yes, this approach sounds good.

> but instead to have the Parquet decoder emit smaller RecordBatches when the target Arrow array would exceed representable limits, similar to how pyarrow internally splits row groups.

Yes, this is my understanding too.

> Add a regression test in arrow-rs that reproduces the current overflow panic when reading a Parquet file with large string/binary data using default settings.

I think the code paths will be different for different string encodings, so I recommend you also add tests explicitly for the string Encodings (https://docs.rs/parquet/latest/parquet/basic/enum.Encoding.html) — see the round-trip sketch below:

* PLAIN encoding
* DELTA_LENGTH_BYTE_ARRAY
* DELTA_BYTE_ARRAY
* RLE_DICTIONARY

> Introduce logic at the Parquet reader layer to stop decoding early and emit a partial RecordBatch when adding more rows would overflow the offset buffer.

Yes, this sounds good too. I think the key will be to ensure that adding the check doesn't slow down the existing reader.
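To illustrate why I don't expect much overhead: the check itself should amount to one integer comparison per value before the bytes are appended to the offset buffer. Something like the following, where `ByteArrayBatchState` and its fields are made-up names for illustration, not the actual reader internals:

```rust
/// Illustrative only: this is not the real arrow-rs reader state, just a
/// sketch of the bookkeeping the byte-array record reader would need.
struct ByteArrayBatchState {
    /// Total value bytes accumulated so far for the in-progress
    /// StringArray / BinaryArray offset buffer.
    value_bytes: usize,
}

impl ByteArrayBatchState {
    /// Returns true if appending `value_len` more bytes would push the i32
    /// offsets past i32::MAX, i.e. the reader should cut the current
    /// RecordBatch short and start a new one instead of panicking.
    fn would_overflow(&self, value_len: usize) -> bool {
        self.value_bytes.saturating_add(value_len) > i32::MAX as usize
    }
}
```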
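For the per-encoding regression tests, something like this round-trip helper is roughly what I have in mind (untested sketch; the column name, helper name, and tiny values are placeholders — a real test would need enough total string bytes to push the offsets past `i32::MAX`):

```rust
use std::sync::Arc;

use arrow_array::{ArrayRef, RecordBatch, StringArray};
use bytes::Bytes;
use parquet::arrow::arrow_reader::ParquetRecordBatchReaderBuilder;
use parquet::arrow::ArrowWriter;
use parquet::basic::Encoding;
use parquet::file::properties::WriterProperties;

/// Round-trips a small string column through Parquet with the given encoding
/// and reads it back with default reader settings.
fn roundtrip_with_encoding(encoding: Encoding) {
    let values: ArrayRef = Arc::new(StringArray::from(vec!["a", "bb", "ccc"]));
    let batch = RecordBatch::try_from_iter([("v", values)]).unwrap();

    // RLE_DICTIONARY cannot be requested via set_encoding; enabling the
    // dictionary is what produces dictionary-encoded data pages. For the
    // other encodings, disable the dictionary so the requested encoding is
    // actually used for the data pages.
    let builder = WriterProperties::builder();
    let props = if encoding == Encoding::RLE_DICTIONARY {
        builder.set_dictionary_enabled(true)
    } else {
        builder.set_dictionary_enabled(false).set_encoding(encoding)
    }
    .build();

    let mut buf = Vec::new();
    let mut writer = ArrowWriter::try_new(&mut buf, batch.schema(), Some(props)).unwrap();
    writer.write(&batch).unwrap();
    writer.close().unwrap();

    // Read back with default reader settings; the real regression test would
    // scale the data up and assert that the reader splits batches rather than
    // panicking on offset overflow.
    let reader = ParquetRecordBatchReaderBuilder::try_new(Bytes::from(buf))
        .unwrap()
        .build()
        .unwrap();
    let total_rows: usize = reader.map(|b| b.unwrap().num_rows()).sum();
    assert_eq!(total_rows, 3);
}
```

Calling this with each of the four encodings listed above should exercise the different decode paths.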
