Re: [PR] WIP: Rewrite `ParquetRecordBatchStream` in terms of the PushDecoder [arrow-rs]

via GitHub Tue, 19 Aug 2025 09:39:07 -0700


alamb commented on PR #8159:
URL: https://github.com/apache/arrow-rs/pull/8159#issuecomment-3201455557


   Status report: Rewriting the async decoder to use the push decoder went well 
(though this is not overly surprising, given that the push decoder state machine 
was mostly modeled on the async record batch reader).
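
For readers unfamiliar with the pattern, here is a minimal sketch of how a
push-style decoder can be driven by an outer reader loop. All names
(`ToyPushDecoder`, `DecodeResult`, `read_all`) are hypothetical and are not the
arrow-rs API; the "decoding" is just copying bytes, and the loop is synchronous
where an async reader would await the fetch instead.

```rust
use std::ops::Range;

/// Hypothetical result type for this sketch (not the arrow-rs API).
enum DecodeResult {
    /// The decoder needs these bytes before it can make progress.
    NeedsData(Range<usize>),
    /// One decoded unit (here: just the raw bytes of the range).
    Data(Vec<u8>),
    /// Every range has been decoded.
    Finished,
}

/// Toy push decoder: it never performs IO itself, it only asks for bytes.
struct ToyPushDecoder {
    ranges: Vec<Range<usize>>, // byte ranges to decode, in order
    next: usize,               // index of the range currently being decoded
    buffered: Option<Vec<u8>>, // bytes pushed by the caller, if any
}

impl ToyPushDecoder {
    fn new(ranges: Vec<Range<usize>>) -> Self {
        Self { ranges, next: 0, buffered: None }
    }

    /// The caller hands over the bytes it fetched for the last request.
    fn push(&mut self, bytes: Vec<u8>) {
        self.buffered = Some(bytes);
    }

    /// Advance the state machine by one step.
    fn try_decode(&mut self) -> DecodeResult {
        if self.next >= self.ranges.len() {
            return DecodeResult::Finished;
        }
        match self.buffered.take() {
            Some(bytes) => {
                self.next += 1;
                DecodeResult::Data(bytes)
            }
            None => DecodeResult::NeedsData(self.ranges[self.next].clone()),
        }
    }
}

/// The outer reader owns the IO: it fetches whatever the decoder asks for.
/// Panics if a range falls outside `file`.
fn read_all(file: &[u8], ranges: Vec<Range<usize>>) -> Vec<Vec<u8>> {
    let mut decoder = ToyPushDecoder::new(ranges);
    let mut out = Vec::new();
    loop {
        match decoder.try_decode() {
            // An async reader would `.await` a range fetch here.
            DecodeResult::NeedsData(range) => decoder.push(file[range].to_vec()),
            DecodeResult::Data(batch) => out.push(batch),
            DecodeResult::Finished => return out,
        }
    }
}

fn main() {
    let batches = read_all(b"abcdefgh", vec![0..3, 3..8]);
    println!("decoded {} batches", batches.len());
}
```

Because the decoder only reports which bytes it needs, the same state machine
can sit behind a synchronous or an asynchronous reader.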
   
   I found a few items to address, but no showstoppers. Pretty much all the 
tests pass except:
   1. A few that are unit tests of the old APIs
   2. `async_reader_with_next_row_groups` (this is doable, it just needs 
another hook into the push decoder)
   
   
   Things to do:
   1. Rewrite the inner async reader tests to not use the inner reader state 
(move to IO) -- no ArrowReaderBuilder
   2. Implement `async_reader_with_next_row_groups`
   
   I also found a few things that would be very nice to fix in the push decoder 
in general:
   1. Box the `ParquetDecoderState` inner state of the decoder (to make moving 
it around faster)
   2. Remove the `file_len = 0` from the push decoder builder (the async reader 
does not know the length of the file, and it does not need to)
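
As a rough illustration of why boxing helps, here is a minimal sketch. The two
enums are hypothetical stand-ins, not the real decoder state: moving an enum
copies its full inline size, whereas moving a boxed variant copies only a
pointer.

```rust
use std::mem::size_of;

/// Hypothetical state with a large payload stored inline.
#[allow(dead_code)]
enum InlineState {
    Idle,
    Decoding([u8; 1024]),
}

/// The same state with the large payload behind a `Box`.
#[allow(dead_code)]
enum BoxedState {
    Idle,
    Decoding(Box<[u8; 1024]>),
}

/// Bytes copied whenever an `InlineState` value is moved.
fn inline_size() -> usize {
    size_of::<InlineState>()
}

/// Bytes copied whenever a `BoxedState` value is moved.
fn boxed_size() -> usize {
    size_of::<BoxedState>()
}

fn main() {
    println!("inline: {} bytes, boxed: {} bytes", inline_size(), boxed_size());
}
```

The trade-off is one heap allocation and a pointer indirection per boxed state,
which is usually cheap compared with repeatedly moving a large value through a
state machine.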
   
   I updated the description of https://github.com/apache/arrow-rs/pull/7997 to 
reflect these items and will work on them now. 
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
