thinkharderdev commented on issue #2110: URL: https://github.com/apache/arrow-rs/issues/2110#issuecomment-1190698169
> Sorry for being late to the party. Thanks for bringing this topic up @thinkharderdev
>
> > Optimistically try and fetch the metadata in one shot. We can pass a metadata_size_hint in as a param maybe to allow users to provide information about the expected size. And maybe it defaults to 64k as it was before.
>
> I think this is a great idea (and I think @thinkharderdev has already done it here): [apache/arrow-datafusion#2946](https://github.com/apache/arrow-datafusion/pull/2946)
>
> > Fetch the column chunks in parallel instead of sequentially. This can either be handled in ParquetRecordBatchStream or maybe added into the ObjectStore trait by adding a get_ranges(&self, location: &Path, ranges: Vec<Range>) -> Result method.
>
> Given that the optimal prefetch strategy is likely to vary project to project and with object store to object store, I feel like it will be challenging to put logic that works everywhere in DataFusion. Ideally in my mind DataFusion would provide enough information to allow downstream consumers to efficiently implement whatever prefetch strategy they wanted.
>
> For example, perhaps we could implement something like `PrefetchingObjectStore` that would wrap an existing `dyn ObjectStore` and that implemented the concurrent download paradigm suggested by @thinkharderdev.
>
> If adding `get_byte_ranges` is needed to allow implementing the required prefetching algorithm efficiently I am all for it 👍

Yeah, I think from the point of view of `arrow-rs` this is the sensible way to go. It will, however, just push the problem one layer up the stack :). So in DataFusion we'll again have to decide whether to implement the parallel fetching directly in `ParquetFileReader` (which has many of the same problems as doing it here, in that we have to create a one-size-fits-all solution or add extra configuration to allow some flexibility) or move it into the `ObjectStore` trait itself. I'll create another issue in DataFusion to discuss that though.
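To make the wrapper idea a bit more concrete, here is a rough sketch of what a multi-range method with a sequential default and a concurrent wrapper might look like. Everything in it is hypothetical and simplified for illustration: `RangeStore`, `InMemoryStore` and `ParallelRangeStore` are made-up names, the trait is synchronous and uses scoped threads rather than async, and it is not the actual `ObjectStore` trait or its signatures.

```rust
use std::collections::HashMap;
use std::io;
use std::ops::Range;
use std::thread;

/// Hypothetical, simplified stand-in for an object store.
/// NOT the real `ObjectStore` trait: synchronous, and `location` is a plain `&str`.
pub trait RangeStore: Sync {
    /// Fetch a single byte range of the object at `location`.
    fn get_range(&self, location: &str, range: Range<usize>) -> io::Result<Vec<u8>>;

    /// Fetch several byte ranges. The default is sequential, so existing
    /// implementations keep working; stores that can do better override it.
    fn get_ranges(&self, location: &str, ranges: &[Range<usize>]) -> io::Result<Vec<Vec<u8>>> {
        ranges
            .iter()
            .map(|r| self.get_range(location, r.clone()))
            .collect()
    }
}

/// Toy in-memory store, used only to exercise the sketch.
pub struct InMemoryStore {
    objects: HashMap<String, Vec<u8>>,
}

impl InMemoryStore {
    pub fn new() -> Self {
        Self { objects: HashMap::new() }
    }

    pub fn put(&mut self, location: &str, data: Vec<u8>) {
        self.objects.insert(location.to_string(), data);
    }
}

impl RangeStore for InMemoryStore {
    fn get_range(&self, location: &str, range: Range<usize>) -> io::Result<Vec<u8>> {
        let data = self
            .objects
            .get(location)
            .ok_or_else(|| io::Error::new(io::ErrorKind::NotFound, "no such object"))?;
        // `get` returns None for an out-of-bounds or inverted range.
        data.get(range)
            .map(|bytes| bytes.to_vec())
            .ok_or_else(|| io::Error::new(io::ErrorKind::InvalidInput, "range out of bounds"))
    }
}

/// Hypothetical wrapper that issues one request per range concurrently.
pub struct ParallelRangeStore<S> {
    pub inner: S,
}

impl<S: RangeStore> RangeStore for ParallelRangeStore<S> {
    fn get_range(&self, location: &str, range: Range<usize>) -> io::Result<Vec<u8>> {
        self.inner.get_range(location, range)
    }

    fn get_ranges(&self, location: &str, ranges: &[Range<usize>]) -> io::Result<Vec<Vec<u8>>> {
        thread::scope(|scope| {
            // Start every fetch before waiting on any of them.
            let handles: Vec<_> = ranges
                .iter()
                .map(|r| {
                    let r = r.clone();
                    scope.spawn(move || self.inner.get_range(location, r))
                })
                .collect();
            // Join in submission order so results line up with `ranges`;
            // the first failed fetch becomes the overall error.
            handles
                .into_iter()
                .map(|h| h.join().expect("fetch thread panicked"))
                .collect()
        })
    }
}
```

The point of the default method is that callers such as a Parquet reader could always call `get_ranges`, while each store (or a wrapper around it) decides how much concurrency, coalescing or prefetching is appropriate. One thread per range is only for illustration; a real implementation would presumably be async and bound the number of in-flight requests.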
