adriangb opened a new pull request, #22919:
URL: https://github.com/apache/datafusion/pull/22919

   ## Which issue does this PR close?
   
   - Related to #11262 (predicate evaluation ordering). Extends the `predicate_eval`
     suite added in #22704. No single issue closed.
   
   ## Rationale for this change
   
   The `correlation` subgroup's existing cases (q70–q72) use two predicates of equal
   cost and equal selectivity. For two conjuncts, the evaluation cost of an order is
   `cost(first) + selectivity(first) × cost(second)`, which is symmetric here, so
   the two orders cost the same and correlation only affects the result cardinality.
   These cases measure the *overhead* of an ordering system, but give it no
   opportunity: nothing in the suite rewards (or even detects) correlation-aware
   ordering.
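
   As a minimal sketch of the two-conjunct cost formula above, the snippet below
   evaluates both orders for two predicates of equal cost and equal selectivity.
   The function name `order_cost` and the numeric values are illustrative
   assumptions, not taken from the benchmark suite.

```python
def order_cost(cost_first, sel_first, cost_second):
    """Expected per-row cost of evaluating two conjuncts in a fixed order.

    The second predicate only runs on the fraction of rows that pass the first.
    """
    return cost_first + sel_first * cost_second


# Assumed, illustrative values: equal cost and equal selectivity.
cost_a = cost_b = 1.0
sel_a = sel_b = 0.3

a_then_b = order_cost(cost_a, sel_a, cost_b)
b_then_a = order_cost(cost_b, sel_b, cost_a)

# Both orders cost the same, so there is nothing for an ordering system to gain.
print(a_then_b, b_then_a)
```

   With equal inputs the two calls are identical term by term, which is why such
   cases can only expose the overhead of an ordering system.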
   
   This adds a case with real, measurable headroom that **only** joint statistics can
   find. A cheap integer predicate (`c0 = 1`, ~30% selectivity) is a perfect proxy for
   three string regexes on `s1`; a fourth regex on `s2` has the same ~30% selectivity
   and similar cost but is independent. Marginally, the four regexes are
   indistinguishable in any position. Conditionally (behind the proxy), the three
   `s1` regexes keep every survivor while the `s2` regex still discards ~70%.
   
   The query is written in the natural-but-pessimal order (the redundant regexes
   grouped with their proxy, the informative one last). On an M-series laptop the
   written order runs ~1.9x slower than the hand-optimal order `[c0, s2-regex,
   s1-regexes...]` (16.4 ms vs 8.6 ms per iteration), so:
   
   - an ordering system using *marginal* per-predicate statistics (or an
     independence assumption) is blind to the difference: every ranking of the four
     regexes looks equivalent;
   - a system measuring the predicates' *joint* behaviour can reliably collect the
     ~1.9x speedup.
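
   The contrast between the two bullets can be sketched with a toy cost model.
   Everything here is an assumption for illustration: the relative costs
   (`C_INT`, `C_RE`), the predicate labels, and the exact 30% pass rates. It is
   not the benchmark's code and is not meant to reproduce the measured 1.9x,
   only to show that a joint model separates the two orders while an
   independence model cannot.

```python
from itertools import product

C_INT, C_RE = 0.05, 1.0  # assumed relative costs: integer compare vs. regex
P = 0.3                  # assumed pass rate of every predicate, taken marginally

# Joint distribution over (c0 passes, s2 regex passes). The two are independent;
# the three s1 regexes pass exactly when c0 passes (perfect proxy).
ROWS = [((c0, s2), (P if c0 else 1 - P) * (P if s2 else 1 - P))
        for c0, s2 in product([True, False], repeat=2)]

# name -> (cost, pass function over the row state); names are hypothetical labels
PREDICATES = {
    "c0":   (C_INT, lambda c0, s2: c0),
    "s1_a": (C_RE,  lambda c0, s2: c0),
    "s1_b": (C_RE,  lambda c0, s2: c0),
    "s1_c": (C_RE,  lambda c0, s2: c0),
    "s2":   (C_RE,  lambda c0, s2: s2),
}


def joint_cost(order):
    """Expected per-row cost with short-circuit evaluation under the joint model."""
    total = 0.0
    for state, prob in ROWS:
        for name in order:
            cost, passes = PREDICATES[name]
            total += prob * cost
            if not passes(*state):
                break  # row rejected; later predicates are skipped
    return total


def independent_cost(order):
    """Estimated cost if every predicate is assumed independent with selectivity P."""
    total, surviving = 0.0, 1.0
    for name in order:
        total += surviving * PREDICATES[name][0]
        surviving *= P
    return total


WRITTEN = ["c0", "s1_a", "s1_b", "s1_c", "s2"]
OPTIMAL = ["c0", "s2", "s1_a", "s1_b", "s1_c"]

print("joint:      ", joint_cost(WRITTEN), joint_cost(OPTIMAL))
print("independent:", independent_cost(WRITTEN), independent_cost(OPTIMAL))
```

   In this model the independence estimate is the same for both orders, because
   all four regexes share one cost and one marginal selectivity, while the joint
   estimate charges the written order for three regexes that reject nothing
   behind the proxy.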
   
   ## What changes are included in this PR?
   
   - `load/corrproxy.sql`: the correlated-proxy dataset (deterministic, generated
     from `generate_series` like the existing datasets; `PRED_ROWS`/`PRED_FILL`
     knobs as elsewhere).
   - `queries/correlation/q73.sql`, `benchmarks/correlation/q73.benchmark`: the new
     case, following the suite's existing conventions.
   
   Run with: `BENCH_NAME=predicate_eval BENCH_SUBGROUP=correlation cargo bench --bench sql`
   
   ## Are these changes tested?
   
   The suite's shared template asserts the query returns rows; the case runs green
   locally alongside q70–q72.
   
   ## Are there any user-facing changes?
   
   No — benchmark-only.
   
   🤖 Generated with [Claude Code](https://claude.com/claude-code)
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

