kumarUjjawal opened a new issue, #21583: URL: https://github.com/apache/datafusion/issues/21583
### Is your feature request related to a problem or challenge? Current join cardinality estimation for multi-column equi-joins is too conservative. When a join has several equality keys, the estimate mostly follows only the single strongest key. That means extra join predicates do not reduce the estimated output rows enough. This can overestimate join size and lead to weaker join planning decisions. ### Describe the solution you'd like Use NDV-based exponential decay for multi-column join selectivity. For an equi-join, compute one NDV factor per join key pair as: `max(NDV(left_key), NDV(right_key))` Then sort these factors from largest to smallest and combine them with decay: `ndv0 * ndv1^(1/2) * ndv2^(1/4) * ndv3^(1/8) * ...` Use that decayed value as the denominator for inner join cardinality estimation instead of using only the single largest NDV. In practice, this would mean: - collect the per-key NDV factors for all join keys - sort them descending - combine them with powers `1`, `1/2`, `1/4`, ... - estimate rows as `left_num_rows * right_num_rows / decayed_ndv` - keep the current disjoint-range short-circuit - keep the current fallback behavior when NDV stats are missing This should make multi-key join estimates tighter without being as aggressive as multiplying all key NDVs directly. It would also be good to add tests for: - single-key joins keeping the current behavior - multi-key joins producing smaller estimates than the current largest-NDV-only rule - missing NDV stats falling back cleanly - disjoint join keys still producing zero where applicable - join-key order not changing the estimate ### Describe alternatives you've considered Keep the current largest-NDV-only rule. This is simple, but it ignores useful information from the other join keys. Multiply all join-key NDVs directly. This is likely too aggressive and can under-estimate badly when keys are correlated. Use histograms or stronger correlation models. This could be more accurate, but it is a larger change and not needed for a first improvement. ### Additional context Part of #20766. Generated with Codex -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
