hhhizzz commented on PR #9956: URL: https://github.com/apache/arrow-rs/pull/9956#issuecomment-4519603832
A bit of context on how this PR evolved. The initial motivation was #8565: predicate pushdown is not always cheaper than scanning when the produced `RowSelection` becomes highly fragmented. After my last PR(https://github.com/apache/arrow-rs/pull/8733), I think there would be more improvement. My first attempt was to make `Auto` prefer `Mask` more often for fragmented selections, because a dense bitmap can avoid some of the tiny select/skip run overhead. That turned out to be incomplete. Page pruning means the loaded pages may be sparse, and the previous mask path could assume rows were available even when their pages had not been loaded. So the work split into two parts: 1. make explicit `Mask` safe by tracking loaded row ranges; 2. keep `Auto` free to choose `Selectors`, `Mask`, or post-filter execution based on the shape of the actual read. I also tried several purely static rules based on selectivity, projection shape, and data type. Some helped, but the results were fragile: rules that improved one fragmented case could regress sparse output reads or cacheable predicate cases. In particular, string / variable-width cases were easy to overfit. That is why the final design moved toward a small adaptive cost model instead of a larger pile of static heuristics. The current implementation is roughly: - use row-selection shape analysis to decide between `Mask` and `Selectors`; - represent sparse loaded ranges explicitly so `Mask` does not fail on page-pruned data; - observe early row-group pushdown behavior; - switch later row groups to post-filter execution only when pushdown appears unlikely to pay for itself; - keep explicit `Mask` / `Selectors` policies as caller intent and only apply this adaptive behavior to `Auto`. I also added focused benchmark cases after seeing that the original benchmark suite did not clearly expose the cost-model-sensitive cases. The goal was to cover both the original fragmented-selection cliff and the cases where an overly aggressive rule could regress. So the PR is larger than a single heuristic change because the important part was separating the concepts: - what rows are selected; - which pages are actually loaded; - how that selection should be represented; - whether pushdown or post-filter execution is cheaper for `Auto`. That separation is what makes the `Mask` correctness fix and the adaptive `Auto` behavior fit together. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
