adriangb opened a new pull request, #22919:
URL: https://github.com/apache/datafusion/pull/22919
## Which issue does this PR close?
- Related to #11262 (predicate evaluation ordering). Extends the `predicate_eval`
  suite added in #22704. No single issue closed.
## Rationale for this change
The `correlation` subgroup's existing cases (q70–q72) use two predicates of equal
cost and equal selectivity. For two conjuncts the evaluation cost of an order is
`cost(first) + selectivity(first) × cost(second)`, which is symmetric here, so
the two orders cost the same and correlation only affects the result cardinality.
These cases measure the *overhead* of an ordering system but give it no
opportunity to win: nothing in the suite rewards (or even detects)
correlation-aware ordering.
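As a minimal sketch of that cost formula, generalized to any number of short-circuiting conjuncts under an independence assumption: the function name `expected_cost` and the `(cost, selectivity)` tuples are illustrative, not part of the suite or of DataFusion.

```python
from itertools import permutations

def expected_cost(order):
    """Expected per-row cost of evaluating conjuncts in `order` with
    short-circuiting. Each conjunct is a (cost, selectivity) pair and
    selectivities are assumed independent (an assumption of this sketch)."""
    total, reach = 0.0, 1.0
    for cost, selectivity in order:
        total += reach * cost   # only rows that survived so far pay this cost
        reach *= selectivity    # fraction of rows that continue to the next conjunct
    return total

# Two predicates of equal cost and equal selectivity, as in q70-q72:
# both orders have the same expected cost, so ordering cannot help.
a = (1.0, 0.3)
b = (1.0, 0.3)
symmetric_costs = {expected_cost(p) for p in permutations([a, b])}

# Hypothetical contrast: once costs differ, the order matters.
cheap, expensive = (1.0, 0.3), (10.0, 0.3)
cheap_first = expected_cost([cheap, expensive])
expensive_first = expected_cost([expensive, cheap])
```

With equal cost and selectivity, `symmetric_costs` contains a single value, which is why those cases can only expose the overhead of an ordering system.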
This adds a case with real, measurable headroom that **only** joint statistics
can find. A cheap integer predicate (`c0 = 1`, ~30% selectivity) is a perfect
proxy for three string regexes on `s1`; a fourth regex on `s2` has the same ~30%
selectivity and similar cost but is independent. Marginally, the four regexes are
indistinguishable in any position. Conditionally, behind the proxy, the three
`s1` regexes keep every survivor while the `s2` regex still discards ~70%.
The query is written in the natural-but-pessimal order (the redundant regexes
grouped with their proxy, the informative one last). On an M-series laptop the
written order runs ~1.9x slower than the hand-optimal order
`[c0, s2-regex, s1-regexes...]` (16.4 ms vs 8.6 ms per iteration), so:
- an ordering system using *marginal* per-predicate statistics (or an
  independence assumption) is blind to the difference, because every ranking of
  the four regexes looks equivalent;
- a system measuring the predicates' *joint* behaviour can reliably recover that
  ~1.9x.
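The contrast between the two bullets can be illustrated with a toy model. Everything here is hypothetical: the 100-row synthetic table, the relative costs (`INT_COST`, `REGEX_COST`), and the predicate names stand in for the real dataset and regexes, which are not reproduced.

```python
INT_COST, REGEX_COST = 1.0, 10.0  # hypothetical relative costs, not measured

# Synthetic rows (a, b): a drives c0 and the s1 regexes, b drives the s2 regex.
# Using i % 10 and (i // 10) % 10 makes a and b exactly independent over 100 rows.
rows = [(i % 10, (i // 10) % 10) for i in range(100)]

preds = {
    "c0":    (INT_COST,   lambda a, b: a < 3),  # ~30%, cheap proxy
    "s1_r1": (REGEX_COST, lambda a, b: a < 3),  # fully implied by c0
    "s1_r2": (REGEX_COST, lambda a, b: a < 3),
    "s1_r3": (REGEX_COST, lambda a, b: a < 3),
    "s2_r":  (REGEX_COST, lambda a, b: b < 3),  # ~30%, independent of c0
}

def measured_cost(order):
    """Total cost actually paid with short-circuit evaluation (joint behaviour)."""
    total = 0.0
    for row in rows:
        for name in order:
            cost, fn = preds[name]
            total += cost
            if not fn(*row):
                break
    return total

def independent_cost(order):
    """Cost predicted from marginal selectivities under an independence assumption."""
    marginal = {n: sum(fn(*r) for r in rows) / len(rows) for n, (_, fn) in preds.items()}
    total, reach = 0.0, float(len(rows))
    for name in order:
        total += reach * preds[name][0]
        reach *= marginal[name]
    return total

written = ["c0", "s1_r1", "s1_r2", "s1_r3", "s2_r"]
optimal = ["c0", "s2_r", "s1_r1", "s1_r2", "s1_r3"]

written_measured, optimal_measured = measured_cost(written), measured_cost(optimal)
written_predicted, optimal_predicted = independent_cost(written), independent_cost(optimal)
```

In this toy model the independence-based prediction is identical for both orders, while the measured cost is 1300 units for the written order and 670 for the reordered one. The exact ratio depends on the assumed cost constants and is not the benchmark's measurement.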
## What changes are included in this PR?
- `load/corrproxy.sql`: the correlated-proxy dataset (deterministic, generated
  from `generate_series` like the existing datasets; `PRED_ROWS`/`PRED_FILL`
  knobs as elsewhere).
- `queries/correlation/q73.sql`, `benchmarks/correlation/q73.benchmark`: the new
  case, following the suite's existing conventions.
Run with: `BENCH_NAME=predicate_eval BENCH_SUBGROUP=correlation cargo bench --bench sql`
## Are these changes tested?
The suite's shared template asserts the query returns rows; the case runs green
locally alongside q70–q72.
## Are there any user-facing changes?
No — benchmark-only.
🤖 Generated with [Claude Code](https://claude.com/claude-code)
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]