haohuaijin opened a new issue, #10140:
URL: https://github.com/apache/arrow-rs/issues/10140
### Is your feature request related to a problem or challenge?
Some callers already have the selected rows as a `BooleanBuffer`. Today they
still need to call `RowSelection::from_filters`, which converts that bitmap
into a `Vec<RowSelector>`.
The selector form can use far more memory than the bitmap. For example, when
35% of rows are isolated single-row hits, the selector form costs about 11.2
bytes per input row, while the bitmap costs 1 bit per input row. On 500M rows,
that is about 5.6 GB for selectors versus 62.5 MB for the bitmap.
The rough calculation is:
```text
RowSelector size = 16 bytes
selectors        = 500M rows * 35% hits * 2 selectors per isolated hit
                 = 350M selectors
selector memory  = 350M * 16 bytes = 5.6 GB
bitmap memory    = 500M bits / 8   = 62.5 MB
```
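The same arithmetic can be checked with a small std-only sketch. `Selector` below is a hypothetical stand-in that only mirrors the layout assumed above (a run length plus a skip flag); it is not the parquet crate's `RowSelector`. `rle_encode`, `isolated_hit_selector_bytes`, and `bitmap_bytes` are illustrative helper names, not existing APIs.

```rust
use std::mem::size_of;

/// Hypothetical stand-in for a run-length selector: a run length plus a skip flag.
/// On a 64-bit target this is 16 bytes (8 for `row_count`, 1 for `skip`, 7 padding).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct Selector {
    pub row_count: usize,
    pub skip: bool,
}

/// Run-length encode a boolean mask: one selector per run of equal values.
pub fn rle_encode(mask: &[bool]) -> Vec<Selector> {
    let mut out: Vec<Selector> = Vec::new();
    for &selected in mask {
        let skip = !selected;
        // Extend the current run if this row has the same state.
        if let Some(last) = out.last_mut() {
            if last.skip == skip {
                last.row_count += 1;
                continue;
            }
        }
        // Otherwise start a new run.
        out.push(Selector { row_count: 1, skip });
    }
    out
}

/// Estimated selector memory when every hit is isolated:
/// each hit contributes one select run and one skip run.
pub fn isolated_hit_selector_bytes(hits: u64) -> u64 {
    hits * 2 * size_of::<Selector>() as u64
}

/// Memory for a packed bitmap covering `rows` rows (1 bit per row, rounded up).
pub fn bitmap_bytes(rows: u64) -> u64 {
    (rows + 7) / 8
}
```

With 500M rows and 175M isolated hits (35%), `isolated_hit_selector_bytes(175_000_000)` gives the selector estimate and `bitmap_bytes(500_000_000)` gives the bitmap size, matching the figures above on a 64-bit target.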
After that conversion, the reader may choose the `Mask` strategy and turn the
selectors back into a bitmap. In that case, the caller's bitmap is converted to
selectors and then back to a bitmap again, which is a wasted round trip.
### Describe the solution you'd like
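A first-class, mask-backed representation on `RowSelection` that can be built
directly from a `BooleanBuffer`:
```rust
// Proposed API, not yet in arrow-rs. Either form would wrap the bitmap as-is.
// `clone()` is here only so both forms compile in one snippet.
let selection = RowSelection::from_boolean_buffer(buf.clone());
let selection: RowSelection = buf.into();
```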
The reader's `Mask` strategy can pass that buffer straight to the cursor.
Existing `from_filters` / `from_consecutive_ranges` users are unchanged.
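
One way to picture the proposal is a selection type with two backings. The following is a minimal std-only sketch under stated assumptions, not the arrow-rs implementation: `Selection`, `Selector`, and `into_mask` are hypothetical names, and a `Vec<bool>` stands in for a packed `BooleanBuffer`.

```rust
/// Hypothetical stand-in for a run-length selector: a run length plus a skip flag.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct Selector {
    pub row_count: usize,
    pub skip: bool,
}

/// Hypothetical selection type with two backings.
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum Selection {
    /// Run-length form, as produced by RLE-style constructors.
    Selectors(Vec<Selector>),
    /// Mask form: one flag per input row (a real bitmap would pack 8 rows per byte).
    Mask(Vec<bool>),
}

impl From<Vec<bool>> for Selection {
    /// Wrap the caller's mask directly; no run-length encoding happens here.
    fn from(mask: Vec<bool>) -> Self {
        Selection::Mask(mask)
    }
}

impl From<Vec<Selector>> for Selection {
    /// Existing run-length callers keep their representation.
    fn from(runs: Vec<Selector>) -> Self {
        Selection::Selectors(runs)
    }
}

impl Selection {
    /// Produce the mask that a mask-based read strategy would consume.
    pub fn into_mask(self) -> Vec<bool> {
        match self {
            // Mask-backed: hand the buffer over without any conversion.
            Selection::Mask(mask) => mask,
            // Selector-backed: expand each run into per-row flags.
            Selection::Selectors(runs) => {
                let total: usize = runs.iter().map(|r| r.row_count).sum();
                let mut mask = Vec::with_capacity(total);
                for run in runs {
                    mask.extend(std::iter::repeat(!run.skip).take(run.row_count));
                }
                mask
            }
        }
    }
}
```

In this sketch a mask-backed selection reaches the mask consumer without ever materializing selectors, while selector-backed selections behave as before.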
### Describe alternatives you've considered
Doing the conversion downstream doesn't help: the producer still has to call
`from_filters` and pay the RLE encoding cost.