alamb commented on issue #2110:
URL: https://github.com/apache/arrow-rs/issues/2110#issuecomment-1190663293

   Sorry for being late to the party. Thanks for bringing this topic up, @thinkharderdev.
   
   > Optimistically try and fetch the metadata in one shot. We can pass a `metadata_size_hint` in as a param maybe to allow users to provide information about the expected size. And maybe it defaults to 64k as it was before.
   
   I think this is a great idea (and @thinkharderdev has already done it in https://github.com/apache/arrow-datafusion/pull/2946).
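
   To make that concrete, here is a minimal sketch of the one-shot strategy, paying for a second request only when the hint turns out to be too small. `fetch_range` and `fetch_metadata` are hypothetical stand-ins, not the actual reader code:

```rust
use std::ops::Range;

// Hypothetical stand-in for a ranged GET against the object store.
async fn fetch_range(_range: Range<usize>) -> Vec<u8> {
    unimplemented!("issue a ranged GET against the store")
}

/// Speculatively read the last `hint` bytes so that, in the common case,
/// the Parquet footer and metadata arrive in a single request.
async fn fetch_metadata(file_size: usize, metadata_size_hint: Option<usize>) -> Vec<u8> {
    // Default to 64k, as the reader did before.
    let hint = metadata_size_hint.unwrap_or(64 * 1024).min(file_size);
    let mut suffix = fetch_range(file_size - hint..file_size).await;

    // The Parquet footer is the last 8 bytes: a 4-byte little-endian
    // metadata length followed by the b"PAR1" magic.
    let footer = &suffix[suffix.len() - 8..];
    let metadata_len = u32::from_le_bytes(footer[..4].try_into().unwrap()) as usize;

    if metadata_len + 8 > hint {
        // The hint was too small: one extra request for the missing prefix.
        let start = file_size - metadata_len - 8;
        let mut prefix = fetch_range(start..file_size - hint).await;
        prefix.append(&mut suffix);
        suffix = prefix;
    }
    suffix
}
```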
   
   > Fetch the column chunks in parallel instead of sequentially. This can either be handled in `ParquetRecordBatchStream` or maybe added into the `ObjectStore` trait by adding a `get_ranges(&self, location: &Path, ranges: Vec<Range<usize>>) -> Result<Bytes>` method.
   
   Given that the optimal prefetch strategy is likely to vary from project to project and from object store to object store, I feel it will be challenging to put logic into DataFusion that works well everywhere. Ideally, DataFusion would provide enough information for downstream consumers to efficiently implement whatever prefetch strategy they want.
   
   For example, perhaps we could implement something like a `PrefetchingObjectStore` that wraps an existing `dyn ObjectStore` and implements the concurrent download approach suggested by @thinkharderdev (see the sketch below).
   
   If adding `get_byte_ranges` is needed to implement the required prefetching algorithm efficiently, I am all for it 👍
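
   As a rough sketch of that wrapper (assuming the inner store exposes a single-range `get_range`, as the `object_store` crate does; `PrefetchingObjectStore` and the exact `get_byte_ranges` signature here are illustrative, not an existing API):

```rust
use std::ops::Range;
use std::sync::Arc;

use bytes::Bytes;
use futures::future::try_join_all;
use object_store::{path::Path, ObjectStore, Result};

/// Wraps an inner store and answers multi-range reads by issuing the
/// individual ranged GETs concurrently instead of awaiting them one by one.
struct PrefetchingObjectStore {
    inner: Arc<dyn ObjectStore>,
}

impl PrefetchingObjectStore {
    async fn get_byte_ranges(
        &self,
        location: &Path,
        ranges: Vec<Range<usize>>,
    ) -> Result<Vec<Bytes>> {
        // One GET per range, all in flight at once.
        try_join_all(
            ranges
                .into_iter()
                .map(|range| self.inner.get_range(location, range)),
        )
        .await
    }
}
```

   Downstream consumers could then swap in a different wrapper (one that coalesces nearby ranges, caps concurrency, etc.) without any changes to DataFusion itself.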

