adriangb commented on PR #22343:
URL: https://github.com/apache/datafusion/pull/22343#issuecomment-4598824655

   Regardless of what we end up going with I think a good first step is to have 
some benchmarks that actually flex the cases we are trying to optimize in 
clearer ways. To that end I opened 
https://github.com/apache/datafusion/pull/22704.
   
   Interestingly in doing so I remembered hearing that 
[Velox](https://github.com/facebookincubator/velox) does predicate reordering. 
I had Claude explore the codebase and summarize to me what they do:
   
   ## Adaptive predicate reordering
   
   A conjunctive filter (`a AND b AND c`) is evaluated with short-circuiting, so
   order matters: cheap, selective predicates up front eliminate most rows 
before
   the expensive ones run. The optimizer usually can't predict cost or 
selectivity
   well, so instead of guessing statically, **measure predicates as they run and
   reorder them on the fly.**
   
   **The metric.** Per predicate, track running counters: `numIn` (rows in),
   `numOut` (rows that passed), `timeClocks` (CPU cycles). Rank by:
   
   ```
   score = timeClocks / (numIn - numOut)      // cost per row eliminated; lower 
is better
   ```
   
   If it dropped nothing, fall back to raw `timeClocks`. This balances cost 
*and*
   selectivity in one number — better than "most selective first," which ignores
   cost.
   
   **Mechanics.**
   - Time with a cheap hardware cycle counter (RDTSC-style), not a system clock.
   - Reorder **between batches**, and sort an **index permutation**, not the
     predicates themselves — keeps it cheap and side-effect-free.
   - Skip the sort when the order is already sorted by score (the common case).
   - Stats accumulate across batches, so the order converges on what the real 
data
     rewards.
   
   **Cold start.** No stats on the first batch — fall back to heuristics: 
filters
   before non-filters, cheap predicate kinds (int compare) before expensive ones
   (regex), then a stable tiebreak.
   
   **Caveats.** Order becomes nondeterministic, so handle throwing predicates
   carefully (defer errors until a row survives all predicates); assumes
   side-effect-free evaluation; keep it behind a config flag (default on).
   
   Applies at two layers with the same machinery: boolean expression evaluation
   (`AND`/`OR`) and pushed-down scan/column filters.
   
   ## My takeaway
   
   They are doing both the heuristic version (this PR), 
https://github.com/apache/datafusion/pull/22698 and the scan reordering for 
late materialization (https://github.com/apache/datafusion/pull/22698).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to