hhhizzz opened a new pull request, #8733: URL: https://github.com/apache/arrow-rs/pull/8733
# Which issue does this PR close?

- Closes #8565
- Closes #5523

# Rationale for this change

This change improves the performance of reading Parquet files, particularly when the row selection is sparse.

# What changes are included in this PR?

This pull request improves row selection and filtering in the Parquet Arrow reader, optimizing batch reading and the handling of sparse data. The most important changes are a new mask-based row selection state, enhancements to synthetic page handling, and expanded test coverage for these features.

**Row selection and filtering improvements:**

* Introduced `RowSelectionState` in `read_plan.rs`, which dynamically chooses between a bitmap mask array and a selector queue for row selection during batch reads. This enables streaming with contiguous mask segments and reduces overhead for sparse selections.
* Updated `ParquetRecordBatchReader` to use the mask-based selection: it streams record batches using boolean masks and applies Arrow filtering to keep the selected rows. This avoids intermediate materialization and improves performance for sparse row selections.
* If the average `RowSelector` length is less than 8, the selection is automatically replaced by a bitmap mask.
* Added a benchmark to determine this threshold value (8).

**Synthetic page and definition level handling:**

A challenge with the mask-based approach is that some pages may be skipped, and due to the streaming design of the reader, it’s not always possible to determine in advance which pages will be skipped. To address this, additional logic was added to return `None` when a page is skipped, so that skipped pages are handled correctly.

Together, these improvements enhance both efficiency and correctness in row selection, filtering, and sparse data processing for the Parquet Arrow reader.

# Are these changes tested?
* Added new tests for interleaved skip/select row selections and mask-based sparse row selection, ensuring correctness of the new mask-based streaming logic and synthetic page handling.

# Are there any user-facing changes?

No

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at: [email protected]
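For readers skimming the description, the threshold rule from the "Row selection and filtering improvements" list (switch to a bitmap mask when the average selector length is below 8) can be sketched roughly as follows. This is a minimal illustration and not the PR's `RowSelectionState` code: the `Selector` and `Strategy` types and the `choose_strategy` and `to_mask` functions are hypothetical names invented here, and treating an empty selection as "keep selectors" is an assumption.

```rust
/// Hypothetical stand-in for a run-length row selector: a run of
/// `row_count` rows that are either all skipped or all selected.
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct Selector {
    pub row_count: usize,
    pub skip: bool,
}

/// Illustrative names for the two representations described in the PR.
#[derive(Debug, PartialEq)]
pub enum Strategy {
    /// Expand the selection into one boolean per row.
    Mask,
    /// Keep the run-length selector queue.
    Selectors,
}

/// Threshold quoted in the PR description (average run length in rows).
pub const AVG_SELECTOR_LEN_THRESHOLD: usize = 8;

/// Pick a representation from the average run length.
/// Assumption: an empty selection keeps the selector queue.
pub fn choose_strategy(selectors: &[Selector]) -> Strategy {
    if selectors.is_empty() {
        return Strategy::Selectors;
    }
    let total_rows: usize = selectors.iter().map(|s| s.row_count).sum();
    // average < threshold  <=>  total < threshold * count (no float division)
    if total_rows < AVG_SELECTOR_LEN_THRESHOLD * selectors.len() {
        Strategy::Mask
    } else {
        Strategy::Selectors
    }
}

/// Expand run-length selectors into a per-row boolean mask
/// (`true` = row is selected).
pub fn to_mask(selectors: &[Selector]) -> Vec<bool> {
    let mut mask = Vec::new();
    for s in selectors {
        mask.extend(std::iter::repeat(!s.skip).take(s.row_count));
    }
    mask
}
```

With four short runs covering 8 rows the average length is 2, so the sketch picks `Strategy::Mask`; two long runs covering 150 rows average 75 and stay as selectors.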
