hhhizzz commented on PR #9956:
URL: https://github.com/apache/arrow-rs/pull/9956#issuecomment-4519603832

   A bit of context on how this PR evolved.
   
   The initial motivation was #8565: predicate pushdown is not always cheaper 
than scanning when the produced `RowSelection` becomes highly fragmented. After 
my last PR(https://github.com/apache/arrow-rs/pull/8733), I think there would 
be more improvement. My first attempt was to make `Auto` prefer `Mask` more 
often for fragmented selections, because a dense bitmap can avoid some of the 
tiny select/skip run overhead.
   
   That turned out to be incomplete. Page pruning means the loaded pages may be 
sparse, and the previous mask path could assume rows were available even when 
their pages had not been loaded. So the work split into two parts:
   
   1. make explicit `Mask` safe by tracking loaded row ranges;
   2. keep `Auto` free to choose `Selectors`, `Mask`, or post-filter execution 
based on the shape of the actual read.
   
   I also tried several purely static rules based on selectivity, projection 
shape, and data type. Some helped, but the results were fragile: rules that 
improved one fragmented case could regress sparse output reads or cacheable 
predicate cases. In particular, string / variable-width cases were easy to 
overfit. That is why the final design moved toward a small adaptive cost model 
instead of a larger pile of static heuristics.
   
   The current implementation is roughly:
   
   - use row-selection shape analysis to decide between `Mask` and `Selectors`;
   - represent sparse loaded ranges explicitly so `Mask` does not fail on 
page-pruned data;
   - observe early row-group pushdown behavior;
   - switch later row groups to post-filter execution only when pushdown 
appears unlikely to pay for itself;
   - keep explicit `Mask` / `Selectors` policies as caller intent and only 
apply this adaptive behavior to `Auto`.
   
   I also added focused benchmark cases after seeing that the original 
benchmark suite did not clearly expose the cost-model-sensitive cases. The goal 
was to cover both the original fragmented-selection cliff and the cases where 
an overly aggressive rule could regress.
   
   So the PR is larger than a single heuristic change because the important 
part was separating the concepts:
   
   - what rows are selected;
   - which pages are actually loaded;
   - how that selection should be represented;
   - whether pushdown or post-filter execution is cheaper for `Auto`.
   
   That separation is what makes the `Mask` correctness fix and the adaptive 
`Auto` behavior fit together.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to