adriangb commented on PR #22343: URL: https://github.com/apache/datafusion/pull/22343#issuecomment-4598824655
Regardless of what we end up going with I think a good first step is to have some benchmarks that actually flex the cases we are trying to optimize in clearer ways. To that end I opened https://github.com/apache/datafusion/pull/22704. Interestingly in doing so I remembered hearing that [Velox](https://github.com/facebookincubator/velox) does predicate reordering. I had Claude explore the codebase and summarize to me what they do: ## Adaptive predicate reordering A conjunctive filter (`a AND b AND c`) is evaluated with short-circuiting, so order matters: cheap, selective predicates up front eliminate most rows before the expensive ones run. The optimizer usually can't predict cost or selectivity well, so instead of guessing statically, **measure predicates as they run and reorder them on the fly.** **The metric.** Per predicate, track running counters: `numIn` (rows in), `numOut` (rows that passed), `timeClocks` (CPU cycles). Rank by: ``` score = timeClocks / (numIn - numOut) // cost per row eliminated; lower is better ``` If it dropped nothing, fall back to raw `timeClocks`. This balances cost *and* selectivity in one number — better than "most selective first," which ignores cost. **Mechanics.** - Time with a cheap hardware cycle counter (RDTSC-style), not a system clock. - Reorder **between batches**, and sort an **index permutation**, not the predicates themselves — keeps it cheap and side-effect-free. - Skip the sort when the order is already sorted by score (the common case). - Stats accumulate across batches, so the order converges on what the real data rewards. **Cold start.** No stats on the first batch — fall back to heuristics: filters before non-filters, cheap predicate kinds (int compare) before expensive ones (regex), then a stable tiebreak. **Caveats.** Order becomes nondeterministic, so handle throwing predicates carefully (defer errors until a row survives all predicates); assumes side-effect-free evaluation; keep it behind a config flag (default on). Applies at two layers with the same machinery: boolean expression evaluation (`AND`/`OR`) and pushed-down scan/column filters. ## My takeaway They are doing both the heuristic version (this PR), https://github.com/apache/datafusion/pull/22698 and the scan reordering for late materialization (https://github.com/apache/datafusion/pull/22698). -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
