thinkharderdev opened a new issue, #2110:
URL: https://github.com/apache/arrow-rs/issues/2110

   **Is your feature request related to a problem or challenge? Please describe 
what you are trying to do.**
   
   We recently rebased our project onto the latest DataFusion and have seen a 
pretty big performance regression. The issue is that we've lost the ability to 
prefetch entire files from object storage with the new `ObjectStore` interface. 
The buffered prefetch has been moved into `ParquetRecordBatchStream`, but in a 
way that doesn't work particularly well for object storage (at least in our 
case).
   
   The main issues that we've seen are:
   
   1. Before, we would optimistically read 64 KiB from the end of the file when 
fetching metadata, but now we issue a separate range request for the 8-byte 
footer and then another for the metadata itself. Fetching 8 bytes from S3 takes 
about 80-90ms, so when we are scanning a lot of objects this adds significantly 
to the execution time.
   2. We are prefetching entire column chunks (which is better than fetching 
page-by-page), but we fetch those column chunks sequentially rather than in 
parallel.
   
   What we found was that (at least with Parquet files on the order of 
100-200MB) it was much more efficient to just fetch the entire object into 
memory. All else being equal it is of course better to read less from object 
storage, but if we can't do it in one shot (or maybe two) the cost of the extra 
GET requests significantly outweighs the benefit of fetching less data.
   
   **Describe the solution you'd like**
   
   I think there are a couple of things we can do:
   
   1. Optimistically try to fetch the metadata in one shot. We could pass a 
`metadata_size_hint` parameter to let users provide information about the 
expected metadata size, perhaps defaulting to 64 KiB as before (a sketch 
follows this list).
   2. Fetch the column chunks in parallel instead of sequentially. This could 
either be handled in `ParquetRecordBatchStream` or added to the `ObjectStore` 
trait as a `get_ranges(&self, location: &Path, ranges: Vec<Range<usize>>) -> 
Result<Vec<Bytes>>` method (see the second sketch below).
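
   To make (1) concrete, here is a rough sketch (not an existing API) of what a 
one-shot metadata fetch could look like, assuming the `object_store` crate's 
`head` and `get_range` methods; `fetch_metadata_bytes`, the hint parameter and 
the 64 KiB default are all illustrative. The common case becomes a single GET 
instead of two, and in practice the file size is often already known from the 
listing, which would save the `head` call:

```rust
use bytes::Bytes;
use object_store::{path::Path, ObjectStore};

/// 4-byte metadata length + the "PAR1" magic at the very end of the file.
const FOOTER_SIZE: usize = 8;

/// Hypothetical helper: fetch the raw Parquet metadata, ideally in one request.
async fn fetch_metadata_bytes(
    store: &dyn ObjectStore,
    location: &Path,
    metadata_size_hint: Option<usize>,
) -> object_store::Result<Bytes> {
    let file_size = store.head(location).await?.size;

    // Read the hinted suffix (64 KiB by default) in a single GET; this covers
    // the footer and, hopefully, the metadata sitting directly in front of it.
    let hint = metadata_size_hint.unwrap_or(64 * 1024).max(FOOTER_SIZE);
    let suffix_start = file_size.saturating_sub(hint);
    let suffix = store.get_range(location, suffix_start..file_size).await?;

    // The first 4 footer bytes encode the metadata length (little-endian).
    let footer = &suffix[suffix.len() - FOOTER_SIZE..];
    let metadata_len = u32::from_le_bytes(footer[..4].try_into().unwrap()) as usize;

    if metadata_len + FOOTER_SIZE <= suffix.len() {
        // The hint was large enough: the metadata is already in memory.
        let start = suffix.len() - FOOTER_SIZE - metadata_len;
        Ok(suffix.slice(start..suffix.len() - FOOTER_SIZE))
    } else {
        // The hint was too small: one extra request for the full metadata.
        let start = file_size - FOOTER_SIZE - metadata_len;
        store.get_range(location, start..file_size - FOOTER_SIZE).await
    }
}
```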
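
   And a minimal sketch of (2), written as a free function on top of the 
existing `get_range` (the proposed trait method would look much the same); it 
simply issues the range requests concurrently with 
`futures::future::try_join_all` instead of awaiting them one by one:

```rust
use std::ops::Range;

use bytes::Bytes;
use futures::future::try_join_all;
use object_store::{path::Path, ObjectStore};

/// Hypothetical default implementation of the proposed `get_ranges`: fetch all
/// requested byte ranges concurrently and return one `Bytes` per range.
async fn get_ranges(
    store: &dyn ObjectStore,
    location: &Path,
    ranges: Vec<Range<usize>>,
) -> object_store::Result<Vec<Bytes>> {
    try_join_all(ranges.into_iter().map(|r| store.get_range(location, r))).await
}
```

   A real object store implementation could of course do better than this 
default, e.g. by coalescing adjacent or overlapping ranges into a single 
request.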
   
   
   **Describe alternatives you've considered**
   
   We could leave things as they are.
   