haohuaijin opened a new issue, #10140:
URL: https://github.com/apache/arrow-rs/issues/10140

   ### Is your feature request related to a problem or challenge?
   
   Some callers already have the selected rows as a `BooleanBuffer`. Today they 
still need to call `RowSelection::from_filters`, which converts that bitmap 
into a `Vec<RowSelector>`.
   
   The selector form can use far more memory than the bitmap. For example, when 
35% of rows are isolated single-row hits, the selector form is about 11.2 bytes 
per input row, while the bitmap is 1 bit per input row. On 500M rows, that is 
about 5.6 GB for selectors versus 62.5 MB for the bitmap.
   
   The rough calculation is:
   
   ```text
   RowSelector = 16 bytes
   selectors   = 500M rows * 35% hits * 2 selectors per isolated hit
               = 350M selectors
   memory      = 350M * 16 bytes = 5.6 GB
   
   bitmap      = 500M bits / 8 = 62.5 MB
   ```
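The same arithmetic can be checked on a small mask. The following is a minimal std-only sketch, not the parquet implementation: `Selector`, `selectors_from_mask` and `footprint` are hypothetical names, `Selector` only mirrors the run-length-plus-skip-flag shape of `RowSelector`, and a `&[bool]` stands in for a `BooleanBuffer`. The 16-byte size assumes a 64-bit target.

```rust
use std::mem::size_of;

/// Hypothetical stand-in for `RowSelector`: a run length plus a skip flag.
/// Assumption: 16 bytes on a 64-bit target (8-byte usize + bool + padding).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct Selector {
    pub row_count: usize,
    pub skip: bool,
}

/// Run-length encode a mask into alternating select/skip runs.
pub fn selectors_from_mask(mask: &[bool]) -> Vec<Selector> {
    let mut out: Vec<Selector> = Vec::new();
    for &selected in mask {
        if let Some(last) = out.last_mut() {
            // `skip == !selected` means this row extends the current run.
            if last.skip != selected {
                last.row_count += 1;
                continue;
            }
        }
        out.push(Selector { row_count: 1, skip: !selected });
    }
    out
}

/// Returns (bytes used by the selector form, bytes used by a 1-bit-per-row bitmap).
pub fn footprint(mask: &[bool]) -> (usize, usize) {
    let selector_bytes = selectors_from_mask(mask).len() * size_of::<Selector>();
    let bitmap_bytes = (mask.len() + 7) / 8;
    (selector_bytes, bitmap_bytes)
}
```

For a 20-row mask with 7 isolated hits (35%), this yields 14 selectors, two per hit, against 3 bytes of bitmap.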
   
   The reader may then choose the `Mask` strategy and convert the selectors 
back into a bitmap. In that case, the caller's bitmap makes a full round trip: 
it is converted to selectors and then back to a bitmap.
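The round trip can be illustrated with a small std-only sketch. This is an assumption-laden model rather than the reader's code: `Selector`, `to_selectors` and `to_mask` are hypothetical names, and `Vec<bool>` stands in for a `BooleanBuffer`.

```rust
/// Hypothetical stand-in for `RowSelector`: a run length plus a skip flag.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct Selector {
    pub row_count: usize,
    pub skip: bool,
}

/// Step 1: what the caller pays for today (bitmap -> run-length selectors).
pub fn to_selectors(mask: &[bool]) -> Vec<Selector> {
    let mut out: Vec<Selector> = Vec::new();
    for &selected in mask {
        if let Some(last) = out.last_mut() {
            if last.skip != selected {
                last.row_count += 1;
                continue;
            }
        }
        out.push(Selector { row_count: 1, skip: !selected });
    }
    out
}

/// Step 2: what a mask-based read path then undoes (selectors -> bitmap).
pub fn to_mask(selectors: &[Selector]) -> Vec<bool> {
    let total: usize = selectors.iter().map(|s| s.row_count).sum();
    let mut mask = Vec::with_capacity(total);
    for s in selectors {
        mask.extend(std::iter::repeat(!s.skip).take(s.row_count));
    }
    mask
}
```

Because `to_mask(&to_selectors(&mask))` reproduces `mask`, the intermediate selector vector carries no extra information in this path; it only costs time and memory.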
   
   ### Describe the solution you'd like
   
   A first-class mask backing on `RowSelection`:
   
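   ```rust
   // Proposed API, either as an explicit constructor...
   let selection = RowSelection::from_boolean_buffer(buf);
   // ...or through a `From<BooleanBuffer>` conversion.
   let selection: RowSelection = buf.into();
   ```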
   
   The reader's `Mask` strategy can pass that buffer straight to the cursor. 
Existing `from_filters` / `from_consecutive_ranges` users are unchanged.
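One way to picture a mask backing is a selection type with two variants. The sketch below is purely illustrative and is not the proposed implementation: `Selection`, `Selector`, `selected_rows` and `into_mask` are hypothetical names, and `Vec<bool>` stands in for a `BooleanBuffer`.

```rust
/// Hypothetical stand-in for `RowSelector`: a run length plus a skip flag.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct Selector {
    pub row_count: usize,
    pub skip: bool,
}

/// Hypothetical selection with two backings: run-length selectors or a mask.
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum Selection {
    Selectors(Vec<Selector>),
    Mask(Vec<bool>),
}

impl From<Vec<bool>> for Selection {
    /// Wrap the caller's mask as-is; no run-length encoding happens here.
    fn from(mask: Vec<bool>) -> Self {
        Selection::Mask(mask)
    }
}

impl Selection {
    /// Number of selected rows, computed from whichever backing is present.
    pub fn selected_rows(&self) -> usize {
        match self {
            Selection::Selectors(runs) => runs
                .iter()
                .filter(|s| !s.skip)
                .map(|s| s.row_count)
                .sum(),
            Selection::Mask(mask) => mask.iter().filter(|&&b| b).count(),
        }
    }

    /// What a mask-based read path would consume.
    /// The `Mask` variant is handed over unchanged; only selectors are expanded.
    pub fn into_mask(self) -> Vec<bool> {
        match self {
            Selection::Mask(mask) => mask,
            Selection::Selectors(runs) => runs
                .iter()
                .flat_map(|s| std::iter::repeat(!s.skip).take(s.row_count))
                .collect(),
        }
    }
}
```

In this model, callers that build selectors keep the existing path, while callers that already hold a mask skip the encode and decode steps entirely.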
   
   
   ### Describe alternatives you've considered
   
   Doing the conversion downstream doesn't help: the producer still has to call 
`from_filters` and pay the RLE encoding cost.
   
   
   ### Additional context
   
   _No response_

