haohuaijin opened a new pull request, #10141: URL: https://github.com/apache/arrow-rs/pull/10141
# Which issue does this PR close?

<!--
We generally require a GitHub issue to be filed for all bug fixes and enhancements and this helps us generate change logs for our releases. You can link an issue to this PR using the GitHub syntax.
-->

- Closes #10140

# Rationale for this change

`RowSelection` currently stores selections as `Vec<RowSelector>` (16 bytes per selector). This is compact for long runs, but expensive for scattered matches. With ~35% isolated single-row hits, it uses about 11.2 bytes per input row. A `BooleanBuffer` uses 1 bit per input row, about 90x less memory.

The reader can also choose the `Mask` strategy, which converts selectors back into a bitmap. When the caller already had a bitmap, this conversion round-trip is unnecessary.

# What changes are included in this PR?

`RowSelection` can now be backed by either `Vec<RowSelector>` or `BooleanBuffer`.

New public construction:

```rust
pub fn RowSelection::from_boolean_buffer(mask: BooleanBuffer) -> Self;
impl From<BooleanBuffer> for RowSelection;
```

Methods that can work directly on the bitmap now do so:

- `iter()` streams via `BitSliceIterator`
- `row_count` / `skipped_row_count` use `count_set_bits`
- `selects_any` uses `set_indices().next()`
- `trim` preserves mask backing via `BooleanBuffer::slice`
- `intersection` / `union` on `Mask`+`Mask` use `BitAnd` / `BitOr`
- `split_off` on a mask uses `BooleanBuffer::slice` (`O(1)`, both halves stay mask-backed)
- `limit` slices at the selected-row boundary via `find_nth_set_bit_position`, staying mask-backed
- `offset` finds the first selected row to keep via `find_nth_set_bit_position` and rebuilds only the mask buffer, avoiding selector materialization
- `and_then` applies the inner selection over the mask's set positions, returning a mask-backed result
- `FromIterator<RowSelection>` concatenates `BooleanBuffer`s when every input is mask-backed

Mixed inputs, and existing selector-backed inputs, still use the existing selector helpers.
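
The memory figures in the rationale can be reproduced with a small standalone sketch. It does not use the parquet crate: `Selector`, `selectors_from_mask`, `scattered_mask`, and the helper functions are hypothetical stand-ins introduced only for illustration, the 16-byte selector size is taken from the description above, and the synthetic mask (7 isolated hits in every 20 rows) is just one assumed way to get 35% isolated single-row hits.

```rust
/// Hypothetical stand-in for parquet's `RowSelector`: a run of rows that is
/// either skipped or selected. Not the real type.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct Selector {
    row_count: usize,
    skip: bool,
}

/// Per-selector size quoted in the PR description.
const SELECTOR_BYTES: usize = 16;

/// Run-length encode a mask (`true` = selected) into alternating runs.
fn selectors_from_mask(mask: &[bool]) -> Vec<Selector> {
    let mut out: Vec<Selector> = Vec::new();
    for &selected in mask {
        if let Some(last) = out.last_mut() {
            // Same kind of run as the previous row: extend it.
            if last.skip != selected {
                last.row_count += 1;
                continue;
            }
        }
        // First row, or the run kind changed: start a new run.
        out.push(Selector { row_count: 1, skip: !selected });
    }
    out
}

/// Bytes needed to store the mask as run-length selectors.
fn selector_bytes(mask: &[bool]) -> usize {
    selectors_from_mask(mask).len() * SELECTOR_BYTES
}

/// Bytes needed to store the mask as a bitmap (1 bit per row, rounded up).
fn bitmap_bytes(rows: usize) -> usize {
    (rows + 7) / 8
}

/// Synthetic mask: in every block of 20 rows, rows 0, 2, ..., 12 are
/// selected, giving 35% isolated single-row hits.
fn scattered_mask(rows: usize) -> Vec<bool> {
    (0..rows)
        .map(|i| {
            let p = i % 20;
            p < 14 && p % 2 == 0
        })
        .collect()
}

fn main() {
    let rows = 20_000;
    let mask = scattered_mask(rows);
    let sel = selector_bytes(&mask);
    let bits = bitmap_bytes(rows);
    println!("selectors: {:.1} bytes/row", sel as f64 / rows as f64);
    println!("bitmap:    {:.3} bytes/row", bits as f64 / rows as f64);
    println!("ratio:     {:.1}x", sel as f64 / bits as f64);
}
```

Each isolated hit costs one select run plus one skip run, so 35% hits yield about 0.7 selectors per row, or 0.7 × 16 = 11.2 bytes per row, against 0.125 bytes per row for the bitmap (a ratio of 89.6, which the description rounds to 90x).
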
Existing callers keep the same behavior.

The reader (`ReadPlanBuilder::build`) passes a mask-backed selection straight to `RowSelectionCursor::new_mask_from_buffer`, so it skips rebuilding the bitmap from selectors.

# Are these changes tested?

Yes. This PR extends the existing `RowSelection` unit tests with coverage for:

- constructing from `BooleanBuffer`, including empty and all-unset masks
- `From<BooleanBuffer>`
- preserving mask backing across clone, `split_off`, `limit`, `offset`, `and_then`, and all-mask `FromIterator<RowSelection>`
- falling back to selector backing for mixed-backed concatenation
- equality between equivalent selector-backed and mask-backed selections
- mask-backed `intersection` / `union`, including uneven-length inputs
- fuzz-style equivalence between mask-backed selections and the existing `from_filters` selector path

# Are there any user-facing changes?

**Yes, one source-breaking change suitable for the next major release**: `RowSelection::iter()` now yields `RowSelector` (`Item = RowSelector`) instead of `&RowSelector`. This is needed because a mask-backed selection does not have a `Vec<RowSelector>` to borrow from.

`RowSelector` is `Copy` (16 bytes), so most call sites are source-compatible:

```rust
selection.iter().map(|s| s.row_count).sum::<usize>() // unchanged
selection.iter().filter(|s| !s.skip).count()         // unchanged
selection.iter().any(|s| !s.skip)                    // unchanged
```

Call sites that need updating:

```rust
// before
let v: Vec<RowSelector> = selection.iter().cloned().collect();
let v: Vec<&RowSelector> = selection.iter().collect();

// after
let v: Vec<RowSelector> = selection.iter().collect();
```

Within arrow-rs, the required updates are limited to parquet's own `RowSelection` code and tests. The other crates do not use `RowSelection::iter()`.

--
This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.
