thinkharderdev opened a new issue, #2110: URL: https://github.com/apache/arrow-rs/issues/2110
**Is your feature request related to a problem or challenge? Please describe what you are trying to do.**

We recently rebased our project onto DataFusion's latest and we've seen a pretty big performance degradation. The issue is that we've lost the ability to prefetch entire files from object storage with the new `ObjectStore` interface. The buffered prefetch has been moved into `ParquetRecordBatchStream`, but in a way that doesn't work particularly well for object storage (at least in our case). The main issues we've seen are:

1. Before, we would optimistically read 64 KiB from the end of the file when fetching metadata, but now we make a separate range request for just the footer (8 bytes) and only then fetch the metadata. Fetching 8 bytes from S3 takes about 80-90 ms, so when we are scanning a lot of objects this adds significantly to the execution time.
2. We are prefetching entire column chunks (which is better than fetching page-by-page), but we are fetching the column chunks sequentially. What we found was that (at least with parquet files on the order of 100-200 MB) it was much more efficient to just fetch the entire object into memory. All else equal it is of course better to read less from object storage, but if we can't do it in one shot (or maybe two) the cost of the extra GET requests significantly outweighs the benefit of fetching less data.

**Describe the solution you'd like**

I think there are a couple of things we can do:

1. Optimistically try to fetch the metadata in one shot. We could pass a `metadata_size_hint` in as a parameter to let users provide information about the expected size, and perhaps default it to 64 KiB as it was before (see the first sketch below).
2. Fetch the column chunks in parallel instead of sequentially. This could either be handled in `ParquetRecordBatchStream`, or added to the `ObjectStore` trait as a `get_ranges(&self, location: &Path, ranges: Vec<Range<usize>>) -> Result<Vec<Bytes>>` method (see the second sketch below).

**Describe alternatives you've considered**

We could leave things as they are.
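To make the first point a bit more concrete, here is a rough sketch (not tied to any particular API; `fetch_range` and `fetch_metadata_bytes` are placeholders, not existing functions) of what an optimistic metadata fetch with a `metadata_size_hint` could look like:

```rust
use std::ops::Range;

use bytes::Bytes;

/// Hypothetical sketch: fetch the parquet footer + metadata in (usually) one request
/// by optimistically reading `metadata_size_hint` bytes from the end of the file.
/// `fetch_range` stands in for a single range GET against object storage.
/// Assumes `file_size >= 8`.
async fn fetch_metadata_bytes<F, Fut>(
    file_size: usize,
    metadata_size_hint: Option<usize>,
    fetch_range: F,
) -> Result<Bytes, Box<dyn std::error::Error>>
where
    F: Fn(Range<usize>) -> Fut,
    Fut: std::future::Future<Output = Result<Bytes, Box<dyn std::error::Error>>>,
{
    // Default to the old behaviour of reading 64 KiB from the end of the file.
    let hint = metadata_size_hint.unwrap_or(64 * 1024).min(file_size);

    // One request for the last `hint` bytes: this usually covers both the
    // 8-byte footer and the metadata itself.
    let suffix = fetch_range(file_size - hint..file_size).await?;

    // The last 8 bytes are the footer: 4-byte little-endian metadata length
    // followed by the "PAR1" magic.
    let footer = &suffix[suffix.len() - 8..];
    let metadata_len = u32::from_le_bytes(footer[0..4].try_into().unwrap()) as usize;

    if metadata_len + 8 <= suffix.len() {
        // The hint was large enough; slice the metadata out of the suffix.
        Ok(suffix.slice(suffix.len() - 8 - metadata_len..suffix.len() - 8))
    } else {
        // Hint was too small: issue one more request for the full metadata.
        let start = file_size - 8 - metadata_len;
        fetch_range(start..file_size - 8).await
    }
}
```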
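And for the second point, a minimal sketch of how a `get_ranges`-style method could issue the range requests concurrently instead of one after another (again, `fetch_range` is just a stand-in for a single range GET; a real implementation might also coalesce adjacent ranges into fewer requests):

```rust
use std::ops::Range;

use bytes::Bytes;
use futures::future::try_join_all;

/// Hypothetical sketch of a `get_ranges`-style helper: start one GET per range
/// and await them all concurrently, so total latency is roughly one round trip
/// rather than `ranges.len()` round trips.
async fn get_ranges<F, Fut>(
    ranges: Vec<Range<usize>>,
    fetch_range: F,
) -> Result<Vec<Bytes>, Box<dyn std::error::Error>>
where
    F: Fn(Range<usize>) -> Fut,
    Fut: std::future::Future<Output = Result<Bytes, Box<dyn std::error::Error>>>,
{
    // Kick off all requests up front, then collect the results in order.
    try_join_all(ranges.into_iter().map(|r| fetch_range(r))).await
}
```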