alamb commented on PR #6907: URL: https://github.com/apache/arrow-rs/pull/6907#issuecomment-2557712798
> So this PR does have a certain elegant simplicity to it, however, it doesn't really solve the separation of IO and compute given that `reader_factory.read_factory` potentially performs CPU-bound parquet decoding as part of late materialization / filter pushdown.

I agree it doesn't solve (nor claim to solve) the separation of IO and compute. Neither does what is currently in the repo.

> It also has no ability to be parallelised.

I don't understand the assertion that this can't be parallelized. Do you mean there is no way to have concurrent outstanding `fetch` requests? As I understand it, once the reader is returned, reading from the returned stream actually decodes the parquet data, so this PR would allow the next IO to be interleaved with actually decoding the data.

> Given that this isn't adding a host of additional complexity, I don't object to merging this in, but I wanted to flag that a solution to that problem likely will require something a bit different.

I think we could support concurrent download / decode of multiple row groups of the same file today by creating multiple `ParquetRecordBatchStream` instances (each for a different row group / set of row groups) 🤔 Maybe it doesn't need a new API.
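The multiple-stream idea above could be sketched roughly like this (not part of this PR; a hypothetical illustration assuming the parquet crate's `async` feature, a tokio runtime, and a local file for simplicity — the function name and structure are my own):

```rust
use futures::StreamExt;
use parquet::arrow::ParquetRecordBatchStreamBuilder;

/// Hypothetical sketch: build one `ParquetRecordBatchStream` per row group and
/// drive them on separate tasks, so each task's IO can overlap with the
/// others' decoding. Returns the total row count read.
async fn read_row_groups_concurrently(
    path: &str,
) -> Result<usize, Box<dyn std::error::Error + Send + Sync>> {
    // First pass just to learn how many row groups the file has.
    let file = tokio::fs::File::open(path).await?;
    let num_row_groups = ParquetRecordBatchStreamBuilder::new(file)
        .await?
        .metadata()
        .num_row_groups();

    // One stream (and one spawned task) per row group.
    let mut tasks = Vec::new();
    for rg in 0..num_row_groups {
        let file = tokio::fs::File::open(path).await?;
        let stream = ParquetRecordBatchStreamBuilder::new(file)
            .await?
            .with_row_groups(vec![rg]) // restrict this stream to one row group
            .build()?;
        tasks.push(tokio::spawn(async move {
            let mut stream = stream;
            let mut rows = 0;
            // Polling the stream performs both the fetch and the decode for
            // this row group; tasks run concurrently with each other.
            while let Some(batch) = stream.next().await {
                rows += batch?.num_rows();
            }
            Ok::<usize, parquet::errors::ParquetError>(rows)
        }));
    }

    let mut total = 0;
    for task in tasks {
        total += task.await??;
    }
    Ok(total)
}
```

A per-file `AsyncFileReader` / object store reader could be substituted for `tokio::fs::File`; the point is only that no new API seems required to get row-group-level concurrency, at the cost of re-reading the footer per stream (or sharing metadata via `ParquetRecordBatchStreamBuilder::new_with_metadata`).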
