asolimando commented on code in PR #20904:
URL: https://github.com/apache/datafusion/pull/20904#discussion_r2925081745
##########
datafusion/physical-plan/src/joins/utils.rs:
##########
@@ -697,6 +710,78 @@ fn estimate_disjoint_inputs(
None
}
+/// Estimates the number of outer rows that have at least one matching
+/// key on the inner side (i.e. semi join cardinality) using NDV
+/// (Number of Distinct Values) statistics.
+///
+/// Assuming the smaller domain is contained in the larger, the number
+/// of overlapping distinct values is `min(outer_ndv, inner_ndv)`.
+/// Under the uniformity assumption (each distinct value contributes
+/// equally to row counts), the surviving fraction of outer rows is:
+///
+/// ```text
+/// overlap_fraction = min(outer_ndv, inner_ndv) / outer_ndv
+/// ```
+///
+/// For multi-column join keys, each column pair contributes an
+/// independent selectivity factor and the overall selectivity is the
+/// product of these factors:
+///
+/// ```text
+/// selectivity = product_i(min(outer_ndv_i, inner_ndv_i) / outer_ndv_i)
+/// semi_cardinality = outer_rows * selectivity
+/// ```
+///
+/// Anti join cardinality is derived as the complement:
+/// `outer_rows - semi_cardinality`.
+///
+/// Boundary cases:
+/// * `inner_ndv >= outer_ndv` → selectivity = 1.0 (all outer rows match)
+/// * `inner_ndv = 0` → selectivity = 0.0
+/// * Missing NDV statistics → returns `None` (fallback to `outer_rows`)
+fn estimate_semi_join_cardinality(
+ outer_num_rows: &Precision<usize>,
+ inner_num_rows: &Precision<usize>,
+ outer_col_stats: &[ColumnStatistics],
+ inner_col_stats: &[ColumnStatistics],
+) -> Option<usize> {
+ let outer_rows = *outer_num_rows.get_value()?;
+ if outer_rows == 0 {
+ return Some(0);
+ }
+
+ let mut selectivity = 1.0_f64;
+ let mut has_ndv = false;
Review Comment:
Nit: this is more about having a selectivity estimate than NDV (judging on
lines 774 which set it, and 778 that consumes it), how would
`has_selectivity_estimate` sound?
##########
datafusion/physical-plan/src/joins/utils.rs:
##########
@@ -2634,6 +2719,49 @@ mod tests {
(10, Inexact(30), Absent, Absent, Absent),
Some(50),
),
+ // NDV-based semi join: outer_ndv=20, inner_ndv=10
Review Comment:
Good test coverage, but I'd also add test cases for:
- Multi-column join keys (to exercise the multiplicative selectivity path,
which is new code)
- Mixed stats availability (one column has NDV, another doesn't)
##########
datafusion/physical-plan/src/joins/utils.rs:
##########
@@ -697,6 +710,78 @@ fn estimate_disjoint_inputs(
None
}
+/// Estimates the number of outer rows that have at least one matching
Review Comment:
The math looks sound to me, and coherent with that of
https://github.com/apache/datafusion/pull/20846.
I was wondering if you did check other notable systems using CBO like Trino
or Spark.
If so, consider adding a note, this will help reviewers trust the change, as
already battle-tested elsewhere.
##########
datafusion/physical-plan/src/joins/utils.rs:
##########
@@ -697,6 +710,78 @@ fn estimate_disjoint_inputs(
None
}
+/// Estimates the number of outer rows that have at least one matching
+/// key on the inner side (i.e. semi join cardinality) using NDV
+/// (Number of Distinct Values) statistics.
+///
+/// Assuming the smaller domain is contained in the larger, the number
+/// of overlapping distinct values is `min(outer_ndv, inner_ndv)`.
+/// Under the uniformity assumption (each distinct value contributes
+/// equally to row counts), the surviving fraction of outer rows is:
+///
+/// ```text
+/// overlap_fraction = min(outer_ndv, inner_ndv) / outer_ndv
+/// ```
+///
+/// For multi-column join keys, each column pair contributes an
+/// independent selectivity factor and the overall selectivity is the
+/// product of these factors:
+///
+/// ```text
+/// selectivity = product_i(min(outer_ndv_i, inner_ndv_i) / outer_ndv_i)
+/// semi_cardinality = outer_rows * selectivity
+/// ```
+///
+/// Anti join cardinality is derived as the complement:
+/// `outer_rows - semi_cardinality`.
+///
+/// Boundary cases:
+/// * `inner_ndv >= outer_ndv` → selectivity = 1.0 (all outer rows match)
+/// * `inner_ndv = 0` → selectivity = 0.0
+/// * Missing NDV statistics → returns `None` (fallback to `outer_rows`)
+fn estimate_semi_join_cardinality(
+ outer_num_rows: &Precision<usize>,
+ inner_num_rows: &Precision<usize>,
+ outer_col_stats: &[ColumnStatistics],
+ inner_col_stats: &[ColumnStatistics],
+) -> Option<usize> {
+ let outer_rows = *outer_num_rows.get_value()?;
+ if outer_rows == 0 {
+ return Some(0);
+ }
+
+ let mut selectivity = 1.0_f64;
+ let mut has_ndv = false;
+
+ for (outer_stat, inner_stat) in
outer_col_stats.iter().zip(inner_col_stats.iter()) {
+ let outer_has_stats = outer_stat.distinct_count.get_value().is_some()
+ || (outer_stat.min_value.get_value().is_some()
+ && outer_stat.max_value.get_value().is_some());
+ let inner_has_stats = inner_stat.distinct_count.get_value().is_some()
+ || (inner_stat.min_value.get_value().is_some()
+ && inner_stat.max_value.get_value().is_some());
+ if !outer_has_stats && !inner_has_stats {
Review Comment:
I'd rather be even more conservative, and turn the AND into an OR: with
missing stats (both NDV and min/max), the number of rows is used as fallback,
mixing it NDV would make the estimation probably too inaccurate to be useful,
so my suggestion is as-follows:
```suggestion
if !outer_has_stats || !inner_has_stats {
```
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]