[I] feat: use exponential decay for multi-column join selectivity estimation [datafusion]

via GitHub Mon, 13 Apr 2026 01:12:55 -0700


kumarUjjawal opened a new issue, #21583:
URL: https://github.com/apache/datafusion/issues/21583


   ### Is your feature request related to a problem or challenge?
   
   Current join cardinality estimation for multi-column equi-joins is too 
conservative.
   
   When a join has several equality keys, the estimate mostly follows only the 
single strongest key. That means extra join predicates do not reduce the 
estimated output rows enough. This can overestimate join size and lead to 
weaker join planning decisions.
   
   ### Describe the solution you'd like
   
   Use NDV-based exponential decay for multi-column join selectivity.
   
   For an equi-join, compute one NDV factor per join key pair as:
   
   `max(NDV(left_key), NDV(right_key))`
   
   Then sort these factors from largest to smallest and combine them with decay:
   
   `ndv0 * ndv1^(1/2) * ndv2^(1/4) * ndv3^(1/8) * ...`
   
   Use that decayed value as the denominator for inner join cardinality 
estimation instead of using only the single largest NDV.
   
   In practice, this would mean:
   
   - collect the per-key NDV factors for all join keys
   - sort them descending
   - combine them with powers `1`, `1/2`, `1/4`, ...
   - estimate rows as `left_num_rows * right_num_rows / decayed_ndv`
   - keep the current disjoint-range short-circuit
   - keep the current fallback behavior when NDV stats are missing
   
   This should make multi-key join estimates tighter without being as 
aggressive as multiplying all key NDVs directly.
   
   It would also be good to add tests for:
   
   - single-key joins keeping the current behavior
   - multi-key joins producing smaller estimates than the current 
largest-NDV-only rule
   - missing NDV stats falling back cleanly
   - disjoint join keys still producing zero where applicable
   - join-key order not changing the estimate
   
   ### Describe alternatives you've considered
   
   Keep the current largest-NDV-only rule.
   This is simple, but it ignores useful information from the other join keys.
   
   Multiply all join-key NDVs directly.
   This is likely too aggressive and can under-estimate badly when keys are 
correlated.
   
   Use histograms or stronger correlation models.
   This could be more accurate, but it is a larger change and not needed for a 
first improvement.
   
   ### Additional context
   
   Part of #20766.
   
   Generated with Codex
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[I] feat: use exponential decay for multi-column join selectivity estimation [datafusion]

Reply via email to