Jedi18 commented on issue #8678: URL: https://github.com/apache/arrow-rs/issues/8678#issuecomment-3866072871
There is also the additional complexity that the push decoder itself uses the `ParquetRecordBatchReader`: the `RowGroupReaderBuilder` returns a `ParquetRecordBatchReader` for each row group, which I guess is motivated by the `next_row_group` API of the `ParquetRecordBatchStream`. So would the right path here be to have the `RowGroupReaderBuilder` hold the row group data in a separate struct, and have the final assembly into a `ParquetRecordBatchReader` happen in the push decoder's `try_next_reader` before returning the result? This would allow us to refactor the `ParquetRecordBatchReader` to re-use the `RowGroupReaderBuilder`.

Also, regarding predicate evaluation, one simple solution I had in mind is to add a filter stage to the `RowGroupReaderBuilder` and then, in the sync decoder, keep a vector of `RowGroupReaderBuilder`s for all row groups. We can then advance each of them to the filter stage before any decoding begins, which follows the required IO pattern of evaluating all predicates first. Does this seem like a viable solution?
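To make the proposal concrete, here is a minimal, self-contained sketch of the staged flow I have in mind. The names (`Stage`, `advance_to_filtered`, `run`) are illustrative placeholders, not the actual parquet crate API; the point is only the ordering: every builder is advanced through the filter stage before any of them decodes.

```rust
/// Illustrative stages a hypothetical per-row-group builder moves through.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Stage {
    Init,     // builder created, no IO performed yet
    Filtered, // predicates for this row group have been evaluated
    Decoded,  // record batches have been produced
}

/// Hypothetical stand-in for the proposed RowGroupReaderBuilder, holding
/// the row group data separately from any reader it eventually produces.
struct RowGroupReaderBuilder {
    row_group: usize,
    stage: Stage,
}

impl RowGroupReaderBuilder {
    fn new(row_group: usize) -> Self {
        Self { row_group, stage: Stage::Init }
    }

    /// Filter stage: evaluate predicates (placeholder for the real IO).
    fn advance_to_filtered(&mut self) {
        assert_eq!(self.stage, Stage::Init);
        self.stage = Stage::Filtered;
    }

    /// Final assembly; in the real design this is where the push decoder's
    /// try_next_reader would build the ParquetRecordBatchReader.
    fn decode(&mut self) -> usize {
        assert_eq!(self.stage, Stage::Filtered);
        self.stage = Stage::Decoded;
        self.row_group
    }
}

/// Advance *all* builders to the filter stage before decoding any of them,
/// matching the "evaluate all predicates first, then decode" IO pattern.
fn run(num_row_groups: usize) -> Vec<usize> {
    let mut builders: Vec<_> =
        (0..num_row_groups).map(RowGroupReaderBuilder::new).collect();
    for b in builders.iter_mut() {
        b.advance_to_filtered();
    }
    builders.iter_mut().map(|b| b.decode()).collect()
}
```

The internal `assert_eq!` calls encode the invariant that no builder can decode until it has passed through the filter stage, which is the property the sync decoder would rely on.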
