alamb commented on issue #8668: URL: https://github.com/apache/arrow-rs/issues/8668#issuecomment-3849920233
> With better pre-fetching it's pretty plausible that the network would be saturated (I'm only focusing on bandwidth here because this is a "read all" case). I just wonder if "peeking" is the best API design for this. I was thinking more along the lines of a "scan plan" or something similar. Depending on the reader params, all ranges + decoders could be exposed in a single call which is easily schedulable across tasks

Thank you for the report @igorcalabria

Another thing I discovered while working on this code is the existing API https://docs.rs/parquet/latest/parquet/arrow/async_reader/struct.ParquetRecordBatchStream.html#method.next_row_group (this is a pretty thin wrapper over the push decoder `try_next_reader` API).

That being said, it will still fetch the ranges sequentially, which is not ideal.

Another thought I had for your use case is to, as you say, create an individual PushDecoder for each row group (or some other RowSelection) and then run them all in parallel 🤔
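
To make the "one decoder per row group, run in parallel" idea a bit more concrete, here is a minimal sketch of just the fan-out pattern using only the Rust standard library. It does not use the parquet crate: `RowGroupPlan`, `fetch_and_decode`, and `decode_all_parallel` are hypothetical names, and summing bytes is a stand-in for the real work, where each task would presumably build its own push decoder restricted to one row group and feed it the byte ranges it asks for.

```rust
use std::ops::Range;
use std::thread;

/// Hypothetical description of one unit of work: a row group and the
/// byte range that would have to be fetched to decode it.
#[derive(Debug, Clone)]
struct RowGroupPlan {
    index: usize,
    byte_range: Range<usize>,
}

/// Hypothetical stand-in for "fetch the ranges, then decode them".
/// It only slices the in-memory "file" and sums the bytes, so the
/// scheduling pattern stays visible without any I/O or parquet calls.
fn fetch_and_decode(file: &[u8], plan: &RowGroupPlan) -> (usize, u64) {
    let bytes = &file[plan.byte_range.clone()];
    let checksum: u64 = bytes.iter().map(|&b| u64::from(b)).sum();
    (plan.index, checksum)
}

/// Runs one task per row group on its own scoped thread and returns the
/// results in plan order (handles are joined in the order they were spawned).
fn decode_all_parallel(file: &[u8], plans: &[RowGroupPlan]) -> Vec<(usize, u64)> {
    thread::scope(|s| {
        let handles: Vec<_> = plans
            .iter()
            .map(|plan| s.spawn(move || fetch_and_decode(file, plan)))
            .collect();
        handles
            .into_iter()
            .map(|h| h.join().expect("row group worker panicked"))
            .collect()
    })
}
```

Calling `decode_all_parallel(&file, &plans)` with one `RowGroupPlan` per row group returns one result per plan. Because every plan is independent, the fetches no longer have to happen sequentially; in an async setting the same shape would likely map to one spawned task per row group rather than one thread.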
