[I] ArrowReaderMetadata API makes it too easy to (accidentally) make an additional object store request [arrow-rs]

via GitHub Sat, 28 Sep 2024 03:34:07 -0700


alamb opened a new issue, #6476:
URL: https://github.com/apache/arrow-rs/issues/6476


   **Is your feature request related to a problem or challenge? Please describe 
what you are trying to do.**
   
   
[ArrowReaderMetadata](https://docs.rs/parquet/latest/parquet/arrow/arrow_reader/struct.ArrowReaderMetadata.html)
 is used to read parquet files, and one major use case is to supply pre-parsed 
metadata (to avoid a second object store request on read) by providing the 
`ParquetMetaData` to 
[`ArrowReaderMetadata::try_new`](https://docs.rs/parquet/latest/parquet/arrow/arrow_reader/struct.ArrowReaderMetadata.html#method.try_new).
   
   However, the way the API is currently set up, it is easy to supply the 
`ParquetMetaData` and yet have the reader *STILL* make 2 object store requests. 
   
   This happens when the `ArrowReaderOptions` has 
[`with_page_index`](https://docs.rs/parquet/latest/parquet/arrow/arrow_reader/struct.ArrowReaderOptions.html#method.with_page_index)
 specified but the provided metadata doesn't (yet) have the page index: the 
reader then makes another request to load it.
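
The behavior described above can be illustrated with a small toy model. This is only a sketch using the standard library: `ToyMetadata`, `CountingStore`, and `prepare_reader` are hypothetical stand-ins invented for illustration and are not part of the `parquet` crate. The model counts how many object store requests the "reader" makes when it is handed pre-parsed metadata.

```rust
/// Hypothetical stand-in for `ParquetMetaData`: it only records whether
/// the page index has already been loaded.
#[derive(Debug, Clone, PartialEq)]
struct ToyMetadata {
    has_page_index: bool,
}

/// Hypothetical stand-in for an object store that counts requests.
#[derive(Debug, Default)]
struct CountingStore {
    requests: usize,
}

impl CountingStore {
    /// Simulates one request that loads the page index into `metadata`.
    fn fetch_page_index(&mut self, metadata: &mut ToyMetadata) {
        self.requests += 1;
        metadata.has_page_index = true;
    }
}

/// Models the reader setup described in the issue: if the page index was
/// requested but the supplied metadata lacks it, it is fetched implicitly.
fn prepare_reader(
    store: &mut CountingStore,
    mut metadata: ToyMetadata,
    want_page_index: bool,
) -> ToyMetadata {
    if want_page_index && !metadata.has_page_index {
        // The "surprise" extra request.
        store.fetch_page_index(&mut metadata);
    }
    metadata
}
```

In this model, passing `ToyMetadata { has_page_index: false }` with `want_page_index = true` leaves `store.requests` at 1, while passing metadata that already has the page index leaves it at 0. The caller cannot tell the two cases apart from the function signature alone, which is the misuse risk the issue describes.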
   
   This is a common source of confusion / bugs: when someone supplies the 
`ParquetMetaData` to the `ArrowReaderMetadata` they are very often trying to 
avoid a second object store request, but as it often turns out the second fetch 
happens anyway to read the page index (thus defeating the attempt at 
optimization).
   
   This is (in a roundabout way) what is happening to @progval in 
https://github.com/apache/datafusion/pull/12593, and it took me a while to debug 
what was happening while working on the 
[advanced_parquet_index.rs](https://github.com/apache/datafusion/blob/main/datafusion-examples/examples/advanced_parquet_index.rs)
 in DataFusion.
   
   
   **Describe the solution you'd like**
   I would like the API to be harder to misuse. 
   
   
   **Describe alternatives you've considered**
   For example, maybe we could make `ArrowReaderMetadata` return an error if it 
is supplied with `ParquetMetaData` that does not have the page indexes.
   
   Concretely, we could add an `ArrowReaderOptions::error_if_need_metadata` 
option (or something similar) that would change the automatic fetch/load 
behavior into an error if the reader needs the page index, and the file has a 
page index, but it isn't loaded yet into `ParquetMetaData`.
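
A minimal sketch of how such an option could behave, written against the standard library only. Nothing here exists in the `parquet` crate: `OnMissingPageIndex`, `Outcome`, `PageIndexNotLoaded`, and `resolve` are hypothetical names, and the three conditions are taken directly from the proposal above.

```rust
/// Hypothetical policy switch, standing in for the proposed
/// `error_if_need_metadata` option.
#[derive(Debug, Clone, Copy, PartialEq)]
enum OnMissingPageIndex {
    /// Current behavior: silently fetch the page index.
    Fetch,
    /// Proposed behavior: report an error instead of fetching.
    Error,
}

/// What the reader ends up doing with the supplied metadata.
#[derive(Debug, PartialEq)]
enum Outcome {
    UsedSuppliedMetadata,
    FetchedPageIndex,
}

/// Hypothetical error returned when the page index would have to be fetched.
#[derive(Debug, PartialEq)]
struct PageIndexNotLoaded;

/// Decides what to do given the three conditions named in the proposal:
/// the reader needs the page index, the file has one, and it is not loaded.
fn resolve(
    reader_needs_index: bool,
    file_has_index: bool,
    index_loaded: bool,
    policy: OnMissingPageIndex,
) -> Result<Outcome, PageIndexNotLoaded> {
    let would_fetch = reader_needs_index && file_has_index && !index_loaded;
    match (would_fetch, policy) {
        (false, _) => Ok(Outcome::UsedSuppliedMetadata),
        (true, OnMissingPageIndex::Fetch) => Ok(Outcome::FetchedPageIndex),
        (true, OnMissingPageIndex::Error) => Err(PageIndexNotLoaded),
    }
}
```

With the `Error` policy, a caller who believed they had supplied complete metadata gets an explicit failure rather than a hidden second request, while files without a page index (or readers that do not need it) are unaffected.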
   
   **Additional context**
   

