XiangpengHao commented on issue #7363:
URL: https://github.com/apache/arrow-rs/issues/7363#issuecomment-2770101880

   > This is likely the same observation that led 
[@XiangpengHao](https://github.com/XiangpengHao) to propose adding bitmask 
support in
   > 
   > * [](https://github.com/apache/arrow-rs/pull/6624)
   
   Exactly. The predicate `SearchPhase <> ''` is not selective, so the row 
selection ends up fragmented into lots of small selections:
   ```
   select 2 
   skip 3
   select 4
   ....
   ```
   Each 
[selector](https://github.com/apache/arrow-rs/blob/cf6e041cd11d98376c0f3632903d74f24f14529f/parquet/src/arrow/arrow_reader/selection.rs#L27)
 is 16 bytes, causing a lot of memory overhead.
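
To make the overhead concrete, here is a rough sketch that compares a run-length list of selectors with a one-bit-per-row mask. The `RowSelector` struct below is a local stand-in that only mirrors the shape of the linked type (a row count plus a skip flag), and `selector_list_bytes` and `bitmask_bytes` are hypothetical helpers, not part of the parquet crate.

```rust
use std::mem::size_of;

/// Local stand-in with the same shape as the linked selector type:
/// a row count plus a flag saying whether those rows are skipped.
#[allow(dead_code)]
struct RowSelector {
    row_count: usize,
    skip: bool,
}

/// Bytes needed to hold `num_selectors` run-length selectors.
fn selector_list_bytes(num_selectors: usize) -> usize {
    num_selectors * size_of::<RowSelector>()
}

/// Bytes needed to hold one bit per row, rounded up to whole bytes.
fn bitmask_bytes(num_rows: usize) -> usize {
    (num_rows + 7) / 8
}
```

For the `select 2 / skip 3 / select 4` pattern above, that is 3 selectors covering 9 rows: `selector_list_bytes(3)` is 48 bytes on a 64-bit target, while `bitmask_bytes(9)` is 2 bytes. The gap only matters when runs are short; long runs still favor the run-length form.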
   
   This also means that for each small select and skip, we will call the 
corresponding `read_records` and `skip_records`. These methods are currently 
not optimized for being called millions of times.
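
As a rough illustration of how the call count grows, the sketch below turns a per-row filter result into run-length runs and counts how many read and skip calls a reader would make if it issued one call per run. `Run`, `runs_from_mask`, and `count_calls` are hypothetical names for this example only, and the one-call-per-run model is an assumption based on the description above.

```rust
/// Hypothetical run: `skip == false` means "select `row_count` rows".
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct Run {
    row_count: usize,
    skip: bool,
}

/// Collapse a per-row keep/drop mask into consecutive runs.
fn runs_from_mask(mask: &[bool]) -> Vec<Run> {
    let mut runs: Vec<Run> = Vec::new();
    for &keep in mask {
        let skip = !keep;
        // Extend the current run when the row has the same kind.
        if let Some(last) = runs.last_mut() {
            if last.skip == skip {
                last.row_count += 1;
                continue;
            }
        }
        runs.push(Run { row_count: 1, skip });
    }
    runs
}

/// Number of (read, skip) calls, assuming one call per run.
fn count_calls(runs: &[Run]) -> (usize, usize) {
    let reads = runs.iter().filter(|r| !r.skip).count();
    (reads, runs.len() - reads)
}
```

A mask that keeps 2 rows, drops 3, and keeps 4 yields three runs, so two read calls and one skip call. A mask that alternates on every row yields one run per row, which is the worst case for per-call overhead.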
   
   The associated 
[`and_then`](https://github.com/apache/arrow-rs/blob/cf6e041cd11d98376c0f3632903d74f24f14529f/parquet/src/arrow/arrow_reader/selection.rs#L273)
 method is also expensive to evaluate.
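
For intuition about what this composition computes, here is a minimal sketch on plain boolean masks rather than run-length selectors. It assumes the semantics are "the second selection is applied to the rows that survived the first one"; `compose_masks` is a hypothetical helper and not the crate's implementation.

```rust
/// Compose two selections expressed as per-row masks.
/// `first` covers all rows; `second` has one entry per row kept by `first`.
/// The result covers all rows again.
fn compose_masks(first: &[bool], second: &[bool]) -> Vec<bool> {
    let mut second_iter = second.iter().copied();
    first
        .iter()
        // `&&` short-circuits, so `second` only advances on rows `first` kept.
        .map(|&kept| kept && second_iter.next().unwrap_or(false))
        .collect()
}
```

With `first = [false, true, true, false, true]` and `second = [true, false, true]`, the result is `[false, true, false, false, true]`. On masks this is a single linear pass; doing the same over lists of short runs requires splitting and merging runs as it goes.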
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
