[GitHub] [arrow-rs] thinkharderdev commented on issue #2110: Parallel fetching of column chunks in ParquetRecordBatchStream

GitBox Wed, 20 Jul 2022 12:59:58 -0700


thinkharderdev commented on issue #2110:
URL: https://github.com/apache/arrow-rs/issues/2110#issuecomment-1190698169


   > Sorry for being late to the party. Thanks for bringing this topic up 
@thinkharderdev
   > 
   > > Optimistically try and fetch the metadata in one shot. We can pass a 
metadata_size_hint in as a param maybe to allow users to provide information 
about the expected size. And maybe it defaults to 64k as it was before.
   > 
   > I think this is a great idea (and I think @thinkharderdev has already done 
it here): 
[apache/arrow-datafusion#2946](https://github.com/apache/arrow-datafusion/pull/2946)
   > 
   > > Fetch the column chunks in parallel instead of sequentially. This can 
either be handled in ParquetRecordBatchStream or maybe added into the 
ObjectStore trait by adding a get_ranges(&self, location: &Path, ranges: 
Vec<Range>) -> Result method.
   > 
   > Given that the optimal prefetch strategy is likely to vary project to 
project and with object store to object store, I feel like it will be 
challenging to put logic that works everywhere in DataFusion. Ideally in my 
mind DataFusion would provide enough information to allow downstream consumers 
to efficiently implement whatever prefetch strategy they wanted.
   > 
   > For example, perhaps we could implement something like 
`PrefetchingObjectStore` that would wrap an existing `dyn ObjectStore` and that 
implemented the concurrent download paradigm suggested by @thinkharderdev.
   > 
   > If adding `get_byte_ranges` is needed to allow implementing the required 
prefetching algorithm efficiently I am all for it 👍
   
   
   Yeah, I think from the point of view of `arrow-rs` this is the sensible way 
to go. It will, however, just push the problem one layer up the stack :). So in 
DataFusion we'll again have to decide whether to implement the parallel 
fetching directly in `ParquetFileReader` (which has many of the same problems 
as doing it here, in that we have to create a one-size-fits-all solution or add 
extra configuration to allow some flexibility) or move it into the 
`ObjectStore` trait itself. I'll create another issue in DataFusion to discuss 
that, though. 
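   
   To make the trade-off concrete, here is a rough sketch of what a 
range-fetching method with a parallel default could look like. Everything in it 
is an assumption for illustration: `RangeStore`, `InMemory`, and the 
`String` error type are hypothetical stand-ins, not the real `ObjectStore` 
trait, and it uses plain scoped threads from `std` where a real 
implementation would presumably be async.

```rust
use std::ops::Range;
use std::thread;

/// Hypothetical stand-in for an object store; NOT the real `ObjectStore` trait.
trait RangeStore: Sync {
    /// Fetch a single byte range of the object at `location`.
    fn get_range(&self, location: &str, range: Range<usize>) -> Result<Vec<u8>, String>;

    /// Default strategy (an assumption of this sketch): fetch every range
    /// concurrently, one scoped thread per range, and return the results in
    /// the same order as `ranges`. Implementors can override this with
    /// whatever strategy suits their backend.
    fn get_ranges(
        &self,
        location: &str,
        ranges: &[Range<usize>],
    ) -> Result<Vec<Vec<u8>>, String> {
        thread::scope(|s| {
            // Start all fetches before waiting on any of them.
            let handles: Vec<_> = ranges
                .iter()
                .cloned()
                .map(|r| s.spawn(move || self.get_range(location, r)))
                .collect();

            // Join in order; the first error (or a panicked thread) fails the call.
            handles
                .into_iter()
                .map(|h| {
                    h.join()
                        .unwrap_or_else(|_| Err("fetch thread panicked".to_string()))
                })
                .collect()
        })
    }
}

/// Toy in-memory store used only to exercise the sketch.
struct InMemory {
    data: Vec<u8>,
}

impl RangeStore for InMemory {
    fn get_range(&self, _location: &str, range: Range<usize>) -> Result<Vec<u8>, String> {
        self.data
            .get(range.clone())
            .map(|bytes| bytes.to_vec())
            .ok_or_else(|| format!("range {:?} out of bounds", range))
    }
}
```

   Calling `get_ranges` on an `InMemory` store with a few column-chunk-like 
ranges returns one buffer per range, in request order. Putting the default on 
the trait keeps the reader simple while still letting each store override the 
fetching strategy, which is the flexibility argument made above.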
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
