Re: [I] [Parquet] PushDecoder: Add a peek API to support pre-fetching [arrow-rs]

via GitHub Wed, 04 Feb 2026 13:48:35 -0800


alamb commented on issue #8668:
URL: https://github.com/apache/arrow-rs/issues/8668#issuecomment-3849920233


   > With better pre-fetching it's pretty plausible that the network would be 
saturated (I'm only focusing on bandwidth here because this is a "read all" 
case). I just wonder if "peeking" is the best API design for this. I was 
thinking more along the lines of a "scan plan" or something similar. Depending 
on the reader params, all ranges + decoders could be exposed in a single call, 
which is easily schedulable across tasks
   
   Thank you for the report @igorcalabria 
   
   Another thing I discovered while working on this code is the existing API 
https://docs.rs/parquet/latest/parquet/arrow/async_reader/struct.ParquetRecordBatchStream.html#method.next_row_group
   
   (this is a pretty thin wrapper over the push decoder `try_next_reader` API)
   
   
   That being said, it will still fetch the ranges sequentially, which is not 
ideal
   
   Another thought I had for your use case is to, as you say, create an 
individual PushDecoder for each row group (or some other RowSelection) and then 
run them all in parallel 🤔 
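
   As a rough illustration of that last idea, here is a minimal, std-only 
sketch of "one decoder per row group, all driven in parallel". Everything in it 
is hypothetical: `Step`, `RowGroupDecoder`, `drive` and `decode_all_parallel` 
are stand-ins invented for this example, not the parquet crate's types, and 
"decoding" is just summing bytes. A real version would presumably build one 
push decoder per row group and fetch the requested ranges from object storage 
in async tasks rather than threads.

```rust
use std::ops::Range;
use std::thread;

/// Hypothetical result of one decode step. It mimics the
/// "needs data / data / finished" shape of a push-style decoder,
/// but it is NOT a type from the parquet crate.
enum Step {
    NeedsData(Vec<Range<usize>>),
    Data(u64),
    Finished,
}

/// Hypothetical decoder that covers exactly one row group.
struct RowGroupDecoder {
    range: Range<usize>,
    buffered: Option<Vec<u8>>,
    done: bool,
}

impl RowGroupDecoder {
    fn new(range: Range<usize>) -> Self {
        Self { range, buffered: None, done: false }
    }

    /// Hand previously requested bytes to the decoder.
    fn push(&mut self, bytes: Vec<u8>) {
        self.buffered = Some(bytes);
    }

    /// Advance the state machine by one step; it never does IO itself.
    fn try_decode(&mut self) -> Step {
        if self.done {
            return Step::Finished;
        }
        match self.buffered.take() {
            None => Step::NeedsData(vec![self.range.clone()]),
            Some(bytes) => {
                self.done = true;
                // Stand-in for real decoding: sum the bytes.
                Step::Data(bytes.iter().map(|&b| b as u64).sum())
            }
        }
    }
}

/// Drive one decoder to completion against an in-memory "file".
fn drive(mut decoder: RowGroupDecoder, file: &[u8]) -> u64 {
    let mut total = 0;
    loop {
        match decoder.try_decode() {
            Step::NeedsData(ranges) => {
                // A real reader would issue object-store requests here.
                let mut bytes = Vec::new();
                for r in ranges {
                    bytes.extend_from_slice(&file[r]);
                }
                decoder.push(bytes);
            }
            Step::Data(value) => total += value,
            Step::Finished => return total,
        }
    }
}

/// One decoder per row group, each driven on its own scoped thread.
/// Results are returned in row-group order.
fn decode_all_parallel(file: &[u8], row_groups: &[Range<usize>]) -> Vec<u64> {
    thread::scope(|s| {
        let handles: Vec<_> = row_groups
            .iter()
            .map(|rg| {
                let decoder = RowGroupDecoder::new(rg.clone());
                s.spawn(move || drive(decoder, file))
            })
            .collect();
        handles.into_iter().map(|h| h.join().unwrap()).collect()
    })
}
```

   Because each decoder only ever asks for its own ranges, the fetches for 
different row groups no longer wait on one another. The cost is that nothing 
coordinates or coalesces requests across decoders, so a caller would still 
need to cap concurrency itself.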


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
