alamb commented on issue #2110: URL: https://github.com/apache/arrow-rs/issues/2110#issuecomment-1190663293

Sorry for being late to the party. Thanks for bringing this topic up @thinkharderdev

> Optimistically try and fetch the metadata in one shot. We can pass a metadata_size_hint in as a param maybe to allow users to provide information about the expected size. And maybe it defaults to 64k as it was before.

I think this is a great idea (and I think @thinkharderdev has already done it here): https://github.com/apache/arrow-datafusion/pull/2946

> Fetch the column chunks in parallel instead of sequentially. This can either be handled in ParquetRecordBatchStream or maybe added into the ObjectStore trait by adding a `get_ranges(&self, location: &Path, ranges: Vec<Range<usize>>) -> Result<Bytes>` method.

Given that the optimal prefetch strategy is likely to vary from project to project and from object store to object store, I feel it will be challenging to put logic that works everywhere into DataFusion. Ideally, DataFusion would provide enough information to allow downstream consumers to efficiently implement whatever prefetch strategy they want.

For example, perhaps we could implement something like a `PrefetchingObjectStore` that would wrap an existing `dyn ObjectStore` and implement the concurrent download approach suggested by @thinkharderdev. If adding `get_byte_ranges` is needed to implement the required prefetching algorithm efficiently, I am all for it 👍
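For concreteness, here is a minimal sketch of what such a wrapper might look like. This is not an actual implementation from arrow-rs or DataFusion; `PrefetchingObjectStore` and `get_byte_ranges` are hypothetical names from the discussion above, and it assumes the `object_store` crate's `ObjectStore::get_range` signature (`Range<usize>` in, `Bytes` out) plus the `futures` crate for driving the fetches concurrently:

```rust
// Sketch only: a hypothetical wrapper that fans out concurrent `get_range`
// calls for a set of byte ranges, rather than fetching them sequentially.
use std::ops::Range;
use std::sync::Arc;

use bytes::Bytes;
use futures::future::try_join_all;
use object_store::{path::Path, ObjectStore, Result};

/// Hypothetical wrapper adding a concurrent multi-range fetch on top of
/// any existing `ObjectStore` implementation.
pub struct PrefetchingObjectStore {
    inner: Arc<dyn ObjectStore>,
}

impl PrefetchingObjectStore {
    pub fn new(inner: Arc<dyn ObjectStore>) -> Self {
        Self { inner }
    }

    /// Fetch all requested byte ranges concurrently, returning the bytes
    /// in the same order as the input ranges.
    pub async fn get_byte_ranges(
        &self,
        location: &Path,
        ranges: Vec<Range<usize>>,
    ) -> Result<Vec<Bytes>> {
        let fetches = ranges
            .into_iter()
            .map(|range| self.inner.get_range(location, range));
        // `try_join_all` polls all fetches concurrently and preserves order,
        // failing fast on the first error.
        try_join_all(fetches).await
    }
}
```

A per-store implementation could of course replace the naive fan-out with whatever strategy suits it, e.g. coalescing adjacent ranges into a single request before issuing the fetches.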
Sorry for being late to the party. Thanks for bringing this topic up @thinkharderdev > Optimistically try and fetch the metadata in one shot. We can pass a metadata_size_hint in as a param maybe to allow users to provide information about the expected size. And maybe it defaults to 64k as it was before. I think this is a great idea (and I think @thinkharderdev has already done it here): https://github.com/apache/arrow-datafusion/pull/2946 > Fetch the column chunks in parallel instead of sequentially. This can either be handled in ParquetRecordBatchStream or maybe added into the ObjectStore trait by adding a get_ranges(&self, location: &Path, ranges: Vec<Range<usize>>) -> Result<Bytes> method. Given that the optimal prefetch strategy is likely to vary project to project and with object store to object store, I feel like it will be challenging to put logic that works everywhere in DataFusion. Ideally in my mind DataFusion would provide enough information to allow downstream consumers to efficiently implement whatever prefetch strategy they wanted. For example, perhaps we could implement something like `PrefetchingObjectStore` that would wrap an existing `dyn ObjectStore` and that implemented the concurrent download paradigm suggested by @thinkharderdev. If adding `get_byte_ranges` is needed to allow implementing the required prefetching algorithm efficiently I am all for it 👍 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org