hhhizzz commented on issue #10140: URL: https://github.com/apache/arrow-rs/issues/10140#issuecomment-4768760679
I think the benchmark results are directionally useful, but I want to separate two questions:

1. Is bitmap-backed `RowSelection` faster when the bitmap already exists?
2. Where does that bitmap-backed `RowSelection` come from in real execution paths?

For (1), the PR benchmark does show clear benefits in the intended case. For fragmented/random masks at common selectivities, bitmap-backed selection can be much faster. But the result is shape-dependent: very sparse or clustered selections can still favor the selector/RLE representation.

For (2), this is where I think the current practical limitation lies. In my DataFusion checks, I could not find a common SQL path that naturally produces `RowSelection::from_boolean_buffer`:

- row-filter predicates produce `BooleanArray`, but the first predicate selection still goes through `RowSelection::from_filters`;
- page/access-plan pruning generally produces selector/RLE-style `RowSelection`;
- TPC-DS / ClickBench do not seem to naturally construct bitmap-backed `RowSelection`.

I also tried an Arrow-side experiment where I explicitly constructed/preserved bitmap-backed row selections to force this path. Even then, the broad end-to-end result was limited: full 99-query TPC-DS SF10 runs were basically neutral, with a geomean difference of roughly less than 0.5%. Some targeted Arrow row-filter microbenchmarks showed small single-digit wins, for example around 6%-7% in a couple of sync row-filter cases, but I would not treat those as evidence that current DataFusion workloads will naturally benefit from this PR.

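
To make the shape dependence under (1) concrete, here is a minimal, std-only sketch. It is not the parquet crate's API: `Run`, `mask_to_runs`, and `bitmap_bytes` are hypothetical names introduced only for illustration. It counts how many run-length selectors a mask needs, assuming per-run overhead is what makes fragmented masks expensive for the selector/RLE form.

```rust
/// Hypothetical stand-in for a run-length selector: `len` consecutive rows
/// that are either all selected or all skipped. Not the parquet crate's type.
#[derive(Debug, Clone, PartialEq, Eq)]
struct Run {
    len: usize,
    selected: bool,
}

/// Convert a row-level boolean mask into run-length selectors.
fn mask_to_runs(mask: &[bool]) -> Vec<Run> {
    let mut runs: Vec<Run> = Vec::new();
    for &bit in mask {
        // Extend the current run if the bit matches it; otherwise start a new run.
        if let Some(last) = runs.last_mut() {
            if last.selected == bit {
                last.len += 1;
                continue;
            }
        }
        runs.push(Run { len: 1, selected: bit });
    }
    runs
}

/// Bytes needed for a packed bitmap: one bit per row, rounded up.
fn bitmap_bytes(rows: usize) -> usize {
    (rows + 7) / 8
}

fn main() {
    let rows: usize = 1024;
    // Fragmented: every other row selected (50% selectivity).
    let fragmented: Vec<bool> = (0..rows).map(|i| i % 2 == 0).collect();
    // Clustered: one contiguous block of 512 selected rows (same selectivity).
    let clustered: Vec<bool> = (0..rows).map(|i| (256..768).contains(&i)).collect();

    for (name, mask) in [("fragmented", &fragmented), ("clustered", &clustered)] {
        let runs = mask_to_runs(mask);
        println!(
            "{}: {} runs vs a {}-byte bitmap",
            name,
            runs.len(),
            bitmap_bytes(mask.len())
        );
    }
}
```

At the same 50% selectivity, the alternating mask needs one run per row while the clustered mask needs three runs, and the bitmap size is identical for both. This only models representation size, not the decode cost that the PR benchmark measures.
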
So my current understanding is:

- this PR is useful when an upstream caller already has a row-level bitmap, such as an external index / FTS / bitmap-index integration;
- for current DataFusion TPC-DS / ClickBench style workloads, the optimized path is very hard to trigger;
- from a performance perspective, **the more important missing piece may be the producer side: how to create or preserve bitmap-backed `RowSelection` in real scan paths.**

It may be worth documenting this scope, and possibly adding an integration-style benchmark that starts from an actual bitmap-producing access path. That would make it clearer that the PR optimizes preserving/consuming an existing bitmap, rather than making the common DataFusion SQL path faster by itself.
