[GitHub] [arrow-rs] tustvold commented on issue #4118: `ParquetRecordBatchReader` reads overlapping byte ranges

via GitHub Mon, 24 Apr 2023 10:08:24 -0700


tustvold commented on issue #4118:
URL: https://github.com/apache/arrow-rs/issues/4118#issuecomment-1520536549


   This is a consequence of #2464, which causes a ChunkReader to be created per 
page instead of per row group. That change was made to enable page-level 
predicate pushdown. We should definitely improve the documentation around 
ChunkReader and its implicit assumptions about buffering at the 
application and/or OS level. I will add this to my list.
   
   The overlapping byte ranges arise because, if the `OffsetIndex` 
isn't read, the reader doesn't know where the pages are located or even how 
many there are, only the end position of the column chunk. It therefore has to 
assume that any given page may run to the end of that range. If you enable reading the 
[PageIndex](https://docs.rs/parquet/latest/parquet/arrow/arrow_reader/struct.ArrowReaderOptions.html#method.with_page_index),
 it shouldn't perform overlapping reads (although it will then need to perform 
additional IO to read the page index).
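
   To make the reasoning above concrete, here is a toy model of the two cases. It is a sketch only and not the parquet crate's implementation: the function names, the page layout, and the byte offsets are all made up for illustration, and it uses nothing beyond the standard library.

```rust
use std::ops::Range;

/// Hypothetical model: ranges requested when page lengths are unknown.
/// The reader learns each page's start as it goes, but without an offset
/// index it can only bound every read by the end of the column chunk.
fn ranges_without_index(page_starts: &[u64], chunk_end: u64) -> Vec<Range<u64>> {
    page_starts.iter().map(|&start| start..chunk_end).collect()
}

/// Hypothetical model: ranges requested when (offset, length) is known
/// for every page, as an offset index would provide.
fn ranges_with_index(pages: &[(u64, u64)]) -> Vec<Range<u64>> {
    pages.iter().map(|&(offset, len)| offset..offset + len).collect()
}

/// True if two half-open byte ranges share at least one byte.
fn overlaps(a: &Range<u64>, b: &Range<u64>) -> bool {
    a.start < b.end && b.start < a.end
}

fn main() {
    // Made-up layout: three pages in a column chunk ending at byte 250.
    let pages = [(100u64, 50u64), (150, 70), (220, 30)];
    let chunk_end = 250;
    let starts: Vec<u64> = pages.iter().map(|p| p.0).collect();

    let blind = ranges_without_index(&starts, chunk_end);
    let exact = ranges_with_index(&pages);

    println!("without index: {:?}", blind);
    println!("with index:    {:?}", exact);
    println!("blind reads overlap: {}", overlaps(&blind[0], &blind[1]));
    println!("exact reads overlap: {}", overlaps(&exact[0], &exact[1]));
}
```

   In this model every read without the index runs to the end of the chunk, so consecutive reads overlap; with per-page offsets and lengths the reads are disjoint.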
   
   Taking a step back, I wonder if you've considered using the 
[async_reader](https://docs.rs/parquet/latest/parquet/arrow/async_reader/index.html).
 Not only does it provide a native async interface, but the 
[AsyncFileReader](https://docs.rs/parquet/latest/parquet/arrow/async_reader/trait.AsyncFileReader.html)
 interface also lends itself naturally to IO pre-fetching. There is even 
out-of-the-box integration with 
[object_store](https://docs.rs/parquet/latest/parquet/arrow/async_reader/struct.ParquetObjectReader.html).
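
   As a rough illustration of why an interface that sees the required byte ranges up front helps with pre-fetching, the sketch below merges nearby ranges into fewer, larger fetches. It assumes the caller already holds the full list of ranges; `coalesce` and `max_gap` are hypothetical names for this example and are not part of the parquet API.

```rust
use std::ops::Range;

/// Hypothetical helper: merge byte ranges that overlap or sit within
/// `max_gap` bytes of each other, so that one larger request can replace
/// several small ones.
fn coalesce(mut ranges: Vec<Range<u64>>, max_gap: u64) -> Vec<Range<u64>> {
    // Sort by start so that mergeable ranges become neighbours.
    ranges.sort_by_key(|r| r.start);
    let mut merged: Vec<Range<u64>> = Vec::new();
    for r in ranges {
        match merged.last_mut() {
            // Close enough to the previous range: extend it.
            Some(last) if r.start <= last.end.saturating_add(max_gap) => {
                last.end = last.end.max(r.end);
            }
            // Too far away (or first range): start a new fetch.
            _ => merged.push(r),
        }
    }
    merged
}

fn main() {
    // Made-up ranges, e.g. pages or column chunks a reader wants.
    let wanted = vec![0..100, 100..180, 4096..4200, 190..260];
    let fetches = coalesce(wanted, 16);
    println!("{:?}", fetches);
}
```

   The trade-off is the usual one: a larger `max_gap` means fewer requests but more bytes fetched that are never decoded.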
 
   
   
   
   
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
