Jedi18 commented on issue #8678: URL: https://github.com/apache/arrow-rs/issues/8678#issuecomment-3866072871
There is also the additional complexity that the push decoder itself uses the `ParquetRecordBatchReader`: the `RowGroupReaderBuilder` returns a `ParquetRecordBatchReader` for each row group, which I guess is motivated by the `next_row_group` API of the `ParquetRecordBatchStream`. So would the right path here be to have the `RowGroupReaderBuilder` hold the row group data in a separate struct, and have the final assembly into a `ParquetRecordBatchReader` happen in the push decoder's `try_next_reader` before returning the result? This would allow us to refactor the `ParquetRecordBatchReader` to re-use the `RowGroupReaderBuilder`.

Also, regarding predicate evaluation, one simple solution I had in mind is to add a filter stage to the `RowGroupReaderBuilder` and then, in the sync decoder, keep a vector of `RowGroupReaderBuilder`s for all row groups. We can then advance each of them to the filter stage before any decoding begins, which follows the required IO pattern of evaluating all predicates first. Does this seem like a viable solution?
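To make the proposal concrete, here is a minimal, self-contained sketch of the staged flow I have in mind. The names (`Stage`, `advance_to_filtered`, `run`) are illustrative placeholders, not the actual parquet crate API; the point is only the ordering: every builder is advanced through the filter stage before any of them decodes.

```rust
/// Illustrative stages a hypothetical per-row-group builder moves through.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Stage {
    Init,     // builder created, no IO performed yet
    Filtered, // predicates for this row group have been evaluated
    Decoded,  // record batches have been produced
}

/// Hypothetical stand-in for the proposed RowGroupReaderBuilder, holding
/// the row group data separately from any reader it eventually produces.
struct RowGroupReaderBuilder {
    row_group: usize,
    stage: Stage,
}

impl RowGroupReaderBuilder {
    fn new(row_group: usize) -> Self {
        Self { row_group, stage: Stage::Init }
    }

    /// Filter stage: evaluate predicates (placeholder for the real IO).
    fn advance_to_filtered(&mut self) {
        assert_eq!(self.stage, Stage::Init);
        self.stage = Stage::Filtered;
    }

    /// Final assembly; in the real design this is where the push decoder's
    /// try_next_reader would build the ParquetRecordBatchReader.
    fn decode(&mut self) -> usize {
        assert_eq!(self.stage, Stage::Filtered);
        self.stage = Stage::Decoded;
        self.row_group
    }
}

/// Advance *all* builders to the filter stage before decoding any of them,
/// matching the "evaluate all predicates first, then decode" IO pattern.
fn run(num_row_groups: usize) -> Vec<usize> {
    let mut builders: Vec<_> =
        (0..num_row_groups).map(RowGroupReaderBuilder::new).collect();
    for b in builders.iter_mut() {
        b.advance_to_filtered();
    }
    builders.iter_mut().map(|b| b.decode()).collect()
}
```

The internal `assert_eq!` calls encode the invariant that no builder can decode until it has passed through the filter stage, which is the property the sync decoder would rely on.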
