hhhizzz opened a new pull request, #8733:
URL: https://github.com/apache/arrow-rs/pull/8733

   # Which issue does this PR close?
   
   - Closes #8565
   - Closes #5523
   
   # Rationale for this change
   
   This change improves the performance of reading Parquet files, 
particularly when a sparse row selection is applied.
   
   # What changes are included in this PR?
   This pull request introduces significant improvements to row selection and 
filtering in the Parquet Arrow reader, optimizing batch reading and handling of 
sparse data. The most important changes include a new mask-based row selection 
state, enhancements to synthetic page handling, and expanded test coverage for 
these features.
   
   **Row selection and filtering improvements:**
   
   * Introduced `RowSelectionState` in `read_plan.rs`, which dynamically 
chooses between a bitmap mask array and selector queue for efficient row 
selection during batch reads. This enables streaming with contiguous mask 
segments and reduces overhead for sparse selections. 
   * Updated `ParquetRecordBatchReader` to leverage the mask-based selection, 
streaming record batches using boolean masks and applying Arrow filtering for 
selected rows. This avoids intermediate materialization and improves 
performance for sparse row selections.
   * If the average `RowSelector` length is less than 8, the selection is 
automatically replaced by a bitmap mask.
   * Added a benchmark to determine this threshold value (8).
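
The threshold rule in the bullets above can be sketched roughly as follows. This is a minimal, standalone illustration and not the code in `read_plan.rs`: `Selector`, `Strategy`, `AVG_RUN_THRESHOLD`, and `choose_strategy` are hypothetical names, `Selector` only mirrors the shape of `RowSelector` (a run length plus a skip flag), and keeping the selector form for an empty selection is an assumption.

```rust
/// Hypothetical stand-in for `RowSelector`: a run of rows to select or skip.
#[derive(Debug, Clone, Copy, PartialEq)]
struct Selector {
    row_count: usize,
    skip: bool,
}

/// Hypothetical representation of the two selection strategies.
#[derive(Debug, PartialEq)]
enum Strategy {
    /// One boolean per row; `true` means the row is selected.
    Mask(Vec<bool>),
    /// The original run-length selectors, consumed as a queue.
    Selectors(Vec<Selector>),
}

/// Threshold taken from the PR description (average run length below 8).
const AVG_RUN_THRESHOLD: usize = 8;

fn choose_strategy(selectors: &[Selector]) -> Strategy {
    let total_rows: usize = selectors.iter().map(|s| s.row_count).sum();
    // average < threshold  <=>  total < threshold * count (avoids division).
    // Assumption: an empty selection stays in selector form.
    if !selectors.is_empty() && total_rows < AVG_RUN_THRESHOLD * selectors.len() {
        let mut mask = Vec::with_capacity(total_rows);
        for s in selectors {
            // Expand each run into `row_count` identical mask bits.
            mask.extend(std::iter::repeat(!s.skip).take(s.row_count));
        }
        Strategy::Mask(mask)
    } else {
        Strategy::Selectors(selectors.to_vec())
    }
}
```

With this sketch, three runs of lengths 2, 3, and 1 (average 2) would be expanded into a six-element mask, while two runs of lengths 100 and 50 would remain selectors.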
   
   
   **Synthetic page and definition level handling:**
   
   A challenge with the mask-based approach is that some pages may be skipped, 
and because the reader is streaming, it is not always possible to know in 
advance which pages will be skipped.
   To address this, additional logic returns `None` when a page is skipped, so 
that such cases are handled correctly.
   Together, these changes improve both efficiency and correctness in row 
selection, filtering, and sparse data processing for the Parquet Arrow reader.
   
   
   # Are these changes tested?
   
   * Added new tests for interleaved skip/select row selections and mask-based 
sparse row selection, ensuring correctness of the new mask-based streaming 
logic and synthetic page handling. 
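
The property these tests exercise can be illustrated with a small standalone sketch: selecting rows by walking skip/select runs and selecting rows through an equivalent boolean mask should yield the same output. The helpers below (`select_with_runs`, `runs_to_mask`, `select_with_mask`) are hypothetical and operate on plain slices, with each run written as a `(row_count, skip)` tuple; they are not the tests added in this PR and do not use the Arrow or Parquet APIs.

```rust
/// Select rows by walking `(row_count, skip)` runs, as a selector queue would.
fn select_with_runs(data: &[i32], runs: &[(usize, bool)]) -> Vec<i32> {
    let mut out = Vec::new();
    let mut offset = 0;
    for &(row_count, skip) in runs {
        // Clamp so that runs extending past the data do not panic.
        let end = (offset + row_count).min(data.len());
        if !skip {
            out.extend_from_slice(&data[offset..end]);
        }
        offset = end;
    }
    out
}

/// Expand the runs into one boolean per row (`true` means selected).
fn runs_to_mask(runs: &[(usize, bool)]) -> Vec<bool> {
    runs.iter()
        .flat_map(|&(row_count, skip)| std::iter::repeat(!skip).take(row_count))
        .collect()
}

/// Select rows by filtering with a boolean mask, in the spirit of a filter kernel.
fn select_with_mask(data: &[i32], mask: &[bool]) -> Vec<i32> {
    data.iter()
        .zip(mask)
        .filter(|(_, keep)| **keep)
        .map(|(v, _)| *v)
        .collect()
}
```

For an interleaved pattern such as select 1, skip 2, select 1, skip 3, select 3 over ten rows, both paths are expected to return the same rows.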
   
   
   # Are there any user-facing changes?
   No
   

