alamb commented on PR #6907: URL: https://github.com/apache/arrow-rs/pull/6907#issuecomment-2557712798
> So this PR does have a certain elegant simplicity to it, however, it doesn't really solve the separation of IO and compute given that `reader_factory.read_factory` potentially performs CPU-bound parquet decoding as part of late materialization / filter pushdown.

I agree it doesn't solve (nor claim to solve) the separation of IO and compute. Neither does what is currently in the repo.

> It also has no ability to be parallelised.

I don't understand the assertion that this can't be parallelized. Do you mean there is no way to have concurrent outstanding `fetch` requests? As I understand it, once the reader is returned, reading from the returned stream actually decodes the parquet data, so this PR would allow the next IO to be interleaved with actually decoding the data.

> Given that this isn't adding a host of additional complexity, I don't object to merging this in, but I wanted to flag that a solution to that problem likely will require something a bit different.

I think we could support concurrent download / decode of multiple row groups of the same file today by creating multiple `ParquetRecordBatchStream` instances (each for a different row group / set of row groups) 🤔 Maybe it doesn't need a new API.
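The multiple-stream idea above could be sketched roughly like this (not part of this PR; a hypothetical illustration assuming the parquet crate's `async` feature, a tokio runtime, and a local file for simplicity — the function name and structure are my own):

```rust
use futures::StreamExt;
use parquet::arrow::ParquetRecordBatchStreamBuilder;

/// Hypothetical sketch: build one `ParquetRecordBatchStream` per row group and
/// drive them on separate tasks, so each task's IO can overlap with the
/// others' decoding. Returns the total row count read.
async fn read_row_groups_concurrently(
    path: &str,
) -> Result<usize, Box<dyn std::error::Error + Send + Sync>> {
    // First pass just to learn how many row groups the file has.
    let file = tokio::fs::File::open(path).await?;
    let num_row_groups = ParquetRecordBatchStreamBuilder::new(file)
        .await?
        .metadata()
        .num_row_groups();

    // One stream (and one spawned task) per row group.
    let mut tasks = Vec::new();
    for rg in 0..num_row_groups {
        let file = tokio::fs::File::open(path).await?;
        let stream = ParquetRecordBatchStreamBuilder::new(file)
            .await?
            .with_row_groups(vec![rg]) // restrict this stream to one row group
            .build()?;
        tasks.push(tokio::spawn(async move {
            let mut stream = stream;
            let mut rows = 0;
            // Polling the stream performs both the fetch and the decode for
            // this row group; tasks run concurrently with each other.
            while let Some(batch) = stream.next().await {
                rows += batch?.num_rows();
            }
            Ok::<usize, parquet::errors::ParquetError>(rows)
        }));
    }

    let mut total = 0;
    for task in tasks {
        total += task.await??;
    }
    Ok(total)
}
```

A per-file `AsyncFileReader` / object store reader could be substituted for `tokio::fs::File`; the point is only that no new API seems required to get row-group-level concurrency, at the cost of re-reading the footer per stream (or sharing metadata via `ParquetRecordBatchStreamBuilder::new_with_metadata`).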
