Dandandan opened a new pull request, #21500:
URL: https://github.com/apache/datafusion/pull/21500

   ## Which issue does this PR close?
   
   - Closes #.
   
   ## Rationale for this change
   
   When distinct count statistics are absent (common when join keys involve 
CAST expressions or when column stats aren't collected), the join cardinality 
estimator severely underestimates FK join output. For example, `warehouse(5 
rows) ⋈ catalog_sales(1.4M rows)` was estimated as **5 rows** instead of ~1.4M.
   
   This caused the optimizer to keep the 1.4M-row fact table as the hash join 
build side (CollectLeft), leading to massive `concat_batches` allocations 
(128MB+) and 100x+ slowdowns on queries like TPC-DS Q99 (10.4s instead of 59ms).
   
   ## What changes are included in this PR?
   
   The inner join cardinality estimator uses `(L × R) / selectivity` where 
selectivity is derived from distinct value counts. When actual `distinct_count` 
stats are available, `max(left_distinct, right_distinct)` is used (Spark's 
algorithm). When falling back to `num_rows` estimates (no distinct stats), this 
PR changes from `max` to `min`:
   
   - **Before**: `(5 × 1.4M) / max(5, 1.4M) = 5` — underestimates, optimizer 
puts fact table as build side
   - **After**: `(5 × 1.4M) / min(5, 1.4M) = 1.4M` — correct FK estimate, 
optimizer swaps to put dimension table as build side
   
   Also handles the edge case where selectivity is 0 (returns 0 rows instead of 
`None`).
   
   ### Benchmark results (TPC-DS SF1, local, prefer_hash_join=true)
   
   | Query | Before | After | Speedup |
   |-------|--------|-------|---------|
   | Q99 | 10,384ms | 59ms | **157x** |
   | Q6 | ~870ms | 145ms | **6x** |
   | Q62 | ~970ms | ~200ms | **~5x** |
   
   ## Are these changes tested?
   
   Updated existing cardinality estimation unit tests to reflect the new 
behavior.
   
   ## Are there any user-facing changes?
   
   Join plans may change for queries where distinct count statistics are absent 
(the common case). The change produces higher cardinality estimates 
(overestimate rather than underestimate), which leads the optimizer to prefer 
smaller tables as the hash join build side. This should improve performance for 
star-schema / FK join patterns.
   
   🤖 Generated with [Claude Code](https://claude.com/claude-code)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to