igorcalabria commented on issue #8668:
URL: https://github.com/apache/arrow-rs/issues/8668#issuecomment-3835573223

   We ran some tests with the push-based API and found that pre-fetching would be really beneficial for the no-filter, no-projection case (`SELECT * ...`). We did something like the following:
   
   1. "brute force" read, load all bytes from a s3 prefix into memory and then 
decode the parquet into record batches after. This reached the network limit of 
about 1Gb/s 
   2. Used the pushed based API, offloading the parquet decoding to a separate 
pool to not block the async runtime. This reached between 600-700 Mb/s. 
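
   To make option 2 concrete, here is a minimal sketch of what that drive loop can look like. The types and method names (`PushDecoder`, `DecodeResult`, `fetch_range`) are hypothetical stand-ins, not the actual arrow-rs API; the point is that fetching and decoding strictly alternate, so neither overlaps the other:
   ```rust
   use std::ops::Range;

   // Hypothetical stand-ins for the push-based decoder API; the real
   // arrow-rs types and method names may differ.
   enum DecodeResult {
       NeedsData(Vec<Range<u64>>),
       Batch(Vec<u8>), // placeholder for an Arrow RecordBatch
       Finished,
   }

   trait PushDecoder: Send + 'static {
       fn try_decode(&mut self) -> DecodeResult;
       fn push_data(&mut self, range: Range<u64>, data: Vec<u8>);
   }

   // Placeholder for an object-store range read (e.g. an S3 ranged GET).
   async fn fetch_range(range: &Range<u64>) -> Vec<u8> {
       vec![0; (range.end - range.start) as usize]
   }

   // Drive the decoder: fetch whatever ranges it asks for, then offload the
   // CPU-bound decode step to a blocking pool so the async runtime is not
   // stalled. Fetch and decode alternate instead of overlapping, which is
   // exactly where pre-fetching would win back bandwidth.
   async fn read_all<D: PushDecoder>(mut decoder: D) -> Vec<Vec<u8>> {
       let mut batches = Vec::new();
       loop {
           // Move the decoder onto a blocking thread and back out again.
           let (d, result) = tokio::task::spawn_blocking(move || {
               let r = decoder.try_decode();
               (decoder, r)
           })
           .await
           .expect("decode task panicked");
           decoder = d;
           match result {
               DecodeResult::NeedsData(ranges) => {
                   for range in ranges {
                       let data = fetch_range(&range).await; // network-bound
                       decoder.push_data(range, data);
                   }
               }
               DecodeResult::Batch(b) => batches.push(b),
               DecodeResult::Finished => return batches,
           }
       }
   }
   ```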
   
   With better pre-fetching it's pretty plausible that the network would be saturated (I'm focusing only on bandwidth here because this is a "read all" case). I just wonder if "peeking" is the best API design for this. I was thinking more along the lines of a "scan plan" or something similar. Depending on the reader params, all ranges + decoders could be exposed in a single call, which is easily schedulable across tasks:
   ```rust
   // One-shot planning: expose all work upfront
   let scan_plan = decoder.plan_batches()?;
   // Vec<ScanStep> where each step describes what to read and how to decode it
   // e.g. ScanStep { ranges: Vec<Range<u64>>, decode: DecodeHandle }
   ```
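
   To sketch why that helps with scheduling (all names below are hypothetical, building on the snippet above): because every `ScanStep` is known upfront, fetching step N+1 can overlap decoding step N simply by running the steps as a buffered stream:
   ```rust
   use std::ops::Range;

   use futures::stream::{self, StreamExt};

   // Hypothetical types matching the sketch above.
   struct DecodeHandle;
   struct ScanStep {
       ranges: Vec<Range<u64>>,
       decode: DecodeHandle,
   }

   impl DecodeHandle {
       // CPU-bound: turn raw bytes into a record batch (placeholder types).
       fn decode(self, _data: Vec<Vec<u8>>) -> Vec<u8> {
           Vec::new()
       }
   }

   // Placeholder for a concurrent object-store range read.
   async fn fetch_ranges(ranges: &[Range<u64>]) -> Vec<Vec<u8>> {
       ranges.iter().map(|r| vec![0; (r.end - r.start) as usize]).collect()
   }

   // Run the whole plan with `parallelism` steps in flight: fetching one
   // step overlaps decoding another, so the network stays busy.
   async fn run_plan(plan: Vec<ScanStep>, parallelism: usize) -> Vec<Vec<u8>> {
       stream::iter(plan)
           .map(|step| async move {
               let data = fetch_ranges(&step.ranges).await; // network-bound
               tokio::task::spawn_blocking(move || step.decode.decode(data)) // CPU-bound
                   .await
                   .expect("decode task panicked")
           })
           .buffered(parallelism)
           .collect()
           .await
   }
   ```
   Since each step is self-describing, such a plan could also be split across multiple workers rather than funnelled through a single decoder loop.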
   
   

