igorcalabria commented on issue #8668:
URL: https://github.com/apache/arrow-rs/issues/8668#issuecomment-3835573223
We ran some tests with the push-based API and found that pre-fetching would
be really beneficial for the no-filter, no-projection case (`SELECT * ...`).
We tried something like this:
1. "brute force" read, load all bytes from a s3 prefix into memory and then
decode the parquet into record batches after. This reached the network limit of
about 1Gb/s
2. Push-based read: used the push-based API, offloading the parquet decoding to a
separate pool so it doesn't block the async runtime (a sketch of that pattern follows
below). This reached 600-700 Mb/s.
With better pre-fetching it's pretty plausible that the network would be
saturated (I'm only focusing on bandwidth here because this is a "read all" case).
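
For reference, a minimal sketch of the offloading pattern from approach 2, using the existing buffered `ParquetRecordBatchReaderBuilder` as a stand-in for the push-based decoder; `decode_off_runtime` and the `anyhow` error handling are my own choices, not part of any proposed API:

```rust
use arrow::record_batch::RecordBatch;
use bytes::Bytes;
use parquet::arrow::arrow_reader::ParquetRecordBatchReaderBuilder;

// Decode an already-fetched parquet object on a blocking thread so the
// async runtime stays free to drive further network reads. Here `buf`
// holds the whole object; the push-based decoder would instead be fed
// individual byte ranges as they arrive.
async fn decode_off_runtime(buf: Bytes) -> anyhow::Result<Vec<RecordBatch>> {
    tokio::task::spawn_blocking(move || {
        let reader = ParquetRecordBatchReaderBuilder::try_new(buf)?.build()?;
        let batches = reader.collect::<Result<Vec<_>, _>>()?;
        Ok(batches)
    })
    .await?
}
```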
I just wonder if "peeking" is the best API design for this, though. I was thinking
more along the lines of a "scan plan" or something similar: depending on the reader
params, all ranges plus their decoders could be exposed in a single call, which is
easily schedulable across tasks.
```rust
// One-shot planning: expose all work upfront.
// plan_batches() would return a Vec<ScanStep>, where each step describes
// what to read and how to decode it, e.g.:
//   ScanStep { ranges: Vec<Range<u64>>, decode: DecodeHandle }
let scan_plan = decoder.plan_batches()?;
```
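
To make the scheduling idea concrete, a hedged sketch of how such a plan might be consumed; `ScanStep`, `fetch_ranges`, and `DecodeHandle::run` are hypothetical names for the proposal, not existing arrow-rs APIs:

```rust
// Each ScanStep is independent, so the fetches can be issued concurrently
// and the CPU-bound decodes fanned out across blocking tasks.
let mut handles: Vec<tokio::task::JoinHandle<anyhow::Result<RecordBatch>>> = Vec::new();
for step in scan_plan {
    handles.push(tokio::spawn(async move {
        // fetch_ranges: hypothetical helper issuing the byte-range GETs
        // for this step (e.g. coalesced reads against object storage).
        let bytes = fetch_ranges(&step.ranges).await?;
        // Offload decoding, same as in the experiment above.
        tokio::task::spawn_blocking(move || step.decode.run(bytes)).await?
    }));
}
```

The nice part is that prefetch depth and concurrency limits would then live entirely with the caller rather than inside the decoder.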