haohuaijin opened a new pull request, #10141:
URL: https://github.com/apache/arrow-rs/pull/10141

   # Which issue does this PR close?
   
   - Closes #10140 
   
   # Rationale for this change
   
   `RowSelection` currently stores selections as `Vec<RowSelector>` (16 bytes 
per selector). This is compact for long runs, but expensive for scattered 
matches. With ~35% isolated single-row hits, it uses about 11.2 bytes per input 
row. A `BooleanBuffer` uses 1 bit per input row, about 90x less memory.
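
   The figures above can be reproduced with a short back-of-the-envelope 
calculation. This is only a sketch: it assumes each isolated hit costs one 
select selector plus one skip selector, and the function names are 
illustrative rather than taken from the PR.

   ```rust
   /// Size of one `RowSelector` in bytes, as stated in the rationale.
   const SELECTOR_BYTES: f64 = 16.0;

   /// Approximate selector-backed cost per input row when `hit_fraction` of
   /// the rows are isolated single-row hits. Assumption: every isolated hit
   /// needs one select run and one skip run, i.e. two selectors.
   fn selector_bytes_per_row(hit_fraction: f64) -> f64 {
       hit_fraction * 2.0 * SELECTOR_BYTES
   }

   /// A bitmap spends one bit per input row regardless of the hit pattern.
   fn mask_bytes_per_row() -> f64 {
       1.0 / 8.0
   }

   /// How many times smaller the bitmap is than the selector list.
   fn memory_ratio(hit_fraction: f64) -> f64 {
       selector_bytes_per_row(hit_fraction) / mask_bytes_per_row()
   }
   ```

   Under that assumption, `selector_bytes_per_row(0.35)` is about 11.2 and 
`memory_ratio(0.35)` is about 89.6, which matches the rounded figures quoted 
above.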
   
   The reader can also choose the `Mask` strategy, which converts selectors 
back into a bitmap. When the caller already had a bitmap, this conversion 
round-trip is unnecessary.
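
   To make the round-trip concrete, here is a toy model of the two 
representations. It is a sketch only: `Vec<bool>` stands in for 
`BooleanBuffer`, the local `RowSelector` mirrors the shape of parquet's type, 
and the helper names are hypothetical rather than the crate's API.

   ```rust
   /// Toy stand-in for parquet's `RowSelector`: a run of rows to skip or select.
   #[derive(Debug, Clone, Copy, PartialEq, Eq)]
   struct RowSelector {
       row_count: usize,
       skip: bool,
   }

   /// Expand run-length selectors into one flag per row
   /// (`Vec<bool>` stands in for a bitmap here).
   fn selectors_to_mask(selectors: &[RowSelector]) -> Vec<bool> {
       let mut mask = Vec::new();
       for s in selectors {
           mask.extend(std::iter::repeat(!s.skip).take(s.row_count));
       }
       mask
   }

   /// Collapse a per-row mask back into run-length selectors.
   fn mask_to_selectors(mask: &[bool]) -> Vec<RowSelector> {
       let mut out: Vec<RowSelector> = Vec::new();
       for &selected in mask {
           let skip = !selected;
           // Extend the current run if this row has the same skip flag.
           if let Some(last) = out.last_mut() {
               if last.skip == skip {
                   last.row_count += 1;
                   continue;
               }
           }
           // Otherwise start a new run of length one.
           out.push(RowSelector { row_count: 1, skip });
       }
       out
   }
   ```

   A caller that starts from a mask and hands the reader selectors pays for 
`mask_to_selectors`, and the `Mask` strategy then pays for `selectors_to_mask` 
to get back to where the caller began.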
   
   # What changes are included in this PR?
   
   `RowSelection` can now be backed by either `Vec<RowSelector>` or 
`BooleanBuffer`. New public construction:
   
   ```rust
   impl RowSelection {
       pub fn from_boolean_buffer(mask: BooleanBuffer) -> Self;
   }
   impl From<BooleanBuffer> for RowSelection;
   ```
   
   Methods that can work directly on the bitmap now do so:
   
   - `iter()` streams via `BitSliceIterator`
   - `row_count` / `skipped_row_count` use `count_set_bits`
   - `selects_any` uses `set_indices().next()`
   - `trim` preserves mask backing via `BooleanBuffer::slice`
   - `intersection` / `union` on `Mask`+`Mask` use `BitAnd` / `BitOr`
   - `split_off` on a mask uses `BooleanBuffer::slice` (`O(1)`, both halves 
stay mask-backed)
   - `limit` slices at the selected-row boundary via 
`find_nth_set_bit_position`, staying mask-backed
   - `offset` finds the first selected row to keep via 
`find_nth_set_bit_position` and rebuilds only the mask buffer, avoiding 
selector materialization
   - `and_then` applies the inner selection over the mask's set positions, 
returning a mask-backed result
   - `FromIterator<RowSelection>` concatenates `BooleanBuffer`s when every 
input is mask-backed
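   
   A few of the mask-side operations listed above can be sketched on a plain 
`&[bool]` standing in for `BooleanBuffer`. This is an illustration under 
assumptions, not the PR's code: the function names are hypothetical, 
`find_nth_set` only mimics the role of `find_nth_set_bit_position`, and the 
real implementation works on packed bits.

   ```rust
   /// Number of selected rows: the count of set flags.
   fn row_count(mask: &[bool]) -> usize {
       mask.iter().filter(|&&b| b).count()
   }

   /// Position of the `n`-th set flag (0-based), or `None` if fewer are set.
   fn find_nth_set(mask: &[bool], n: usize) -> Option<usize> {
       mask.iter()
           .enumerate()
           .filter_map(|(i, &b)| b.then_some(i))
           .nth(n)
   }

   /// Keep only the first `limit` selected rows by cutting the mask just
   /// after the last row that is kept (assumed semantics for this sketch).
   fn limit(mask: &[bool], limit: usize) -> Vec<bool> {
       if limit == 0 {
           return Vec::new();
       }
       match find_nth_set(mask, limit - 1) {
           Some(pos) => mask[..=pos].to_vec(),
           None => mask.to_vec(), // fewer than `limit` rows selected
       }
   }

   /// Apply `inner` to the rows that `outer` selects: the i-th set flag of
   /// `outer` survives only if the i-th flag of `inner` is set.
   fn and_then(outer: &[bool], inner: &[bool]) -> Vec<bool> {
       let mut inner_flags = inner.iter().copied();
       outer
           .iter()
           .map(|&o| o && inner_flags.next().unwrap_or(false))
           .collect()
   }
   ```

   In each case the result is still a mask of flags, which is the property 
the list above describes as staying mask-backed.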
   
   Mixed inputs, and existing selector-backed inputs, still use the existing 
selector helpers. Existing callers keep the same behavior.
   
   The reader (`ReadPlanBuilder::build`) passes a mask-backed selection 
straight to `RowSelectionCursor::new_mask_from_buffer`, so it skips rebuilding 
the bitmap from selectors.
   
   # Are these changes tested?
   
   Yes. This PR extends the existing `RowSelection` unit tests with coverage 
for:
   
   - constructing from `BooleanBuffer`, including empty and all-unset masks
   - `From<BooleanBuffer>`
   - preserving mask backing across clone, `split_off`, `limit`, `offset`, 
`and_then`, and all-mask `FromIterator<RowSelection>`
   - falling back to selector backing for mixed-backed concatenation
   - equality between equivalent selector-backed and mask-backed selections
   - mask-backed `intersection` / `union`, including uneven-length inputs
   - fuzz-style equivalence between mask-backed selections and the existing 
`from_filters` selector path
   
   # Are there any user-facing changes?
   
   **Yes — one source-breaking change suitable for the next major release**: 
`RowSelection::iter()` now yields `RowSelector` (`Item = RowSelector`) instead 
of `&RowSelector`. This is needed because a mask-backed selection does not have 
a `Vec<RowSelector>` to borrow from.
   
   `RowSelector` is `Copy` (16 bytes), so most call sites are source-compatible:
   
   ```rust
   selection.iter().map(|s| s.row_count).sum::<usize>()    // unchanged
   selection.iter().filter(|s| !s.skip).count()            // unchanged
   selection.iter().any(|s| !s.skip)                       // unchanged
   ```
   
   Call sites that need updating:
   
   ```rust
   // before
   let v: Vec<RowSelector> = selection.iter().cloned().collect();
   let v: Vec<&RowSelector> = selection.iter().collect();
   
   // after
   let v: Vec<RowSelector> = selection.iter().collect();
   ```
   
   Within arrow-rs, the required updates are limited to parquet's own 
`RowSelection` code and tests. The other crates do not use 
`RowSelection::iter()`.
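   
   The reason `iter()` has to yield owned values can be seen in a small 
model of a two-variant backing. This is a sketch under assumptions: the 
`Backing` enum, `MaskRuns`, and the boxed iterator are hypothetical, 
`Vec<bool>` stands in for `BooleanBuffer`, and the PR itself streams mask 
runs via `BitSliceIterator`.

   ```rust
   /// Toy stand-in for parquet's `RowSelector`.
   #[derive(Debug, Clone, Copy, PartialEq, Eq)]
   struct RowSelector {
       row_count: usize,
       skip: bool,
   }

   /// Hypothetical two-variant backing for a selection.
   enum Backing {
       Selectors(Vec<RowSelector>),
       Mask(Vec<bool>),
   }

   impl Backing {
       /// Both arms must produce the same item type. The mask arm has no
       /// stored `RowSelector` to borrow, so the item is an owned value.
       fn iter(&self) -> Box<dyn Iterator<Item = RowSelector> + '_> {
           match self {
               Backing::Selectors(v) => Box::new(v.iter().copied()),
               Backing::Mask(m) => Box::new(MaskRuns { mask: m.as_slice(), pos: 0 }),
           }
       }
   }

   /// Builds one `RowSelector` per run of equal flags, on the fly.
   struct MaskRuns<'a> {
       mask: &'a [bool],
       pos: usize,
   }

   impl Iterator for MaskRuns<'_> {
       type Item = RowSelector;

       fn next(&mut self) -> Option<RowSelector> {
           let first = *self.mask.get(self.pos)?;
           let start = self.pos;
           while self.pos < self.mask.len() && self.mask[self.pos] == first {
               self.pos += 1;
           }
           Some(RowSelector { row_count: self.pos - start, skip: !first })
       }
   }
   ```

   Each `RowSelector` from the mask arm is created inside `next()`, so there 
is nothing with a long enough lifetime for a `&RowSelector` to point at.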
   

