buraksenn commented on code in PR #20846:
URL: https://github.com/apache/datafusion/pull/20846#discussion_r2917464823


##########
datafusion/physical-plan/src/union.rs:
##########
@@ -863,6 +863,96 @@ fn col_stats_union(
     left
 }
 
+fn union_distinct_count(
+    left: &ColumnStatistics,
+    right: &ColumnStatistics,
+) -> Precision<usize> {
+    let (ndv_left, ndv_right) = match (
+        left.distinct_count.get_value(),
+        right.distinct_count.get_value(),
+    ) {
+        (Some(&l), Some(&r)) => (l, r),
+        _ => return Precision::Absent,
+    };
+
+    // Even with exact inputs, the union NDV depends on how
+    // many distinct values are shared between the left and right.
+    // We can only estimate this via range overlap. Thus both paths
+    // below return `Inexact`.
+    if let Some(ndv) = estimate_ndv_with_overlap(left, right, ndv_left, ndv_right) {
+        return Precision::Inexact(ndv);
+    }
+
+    Precision::Inexact(ndv_left + ndv_right)
+}
+
+/// Estimates the distinct count for a union using range overlap, following
+/// the approach used by Trino:
+///
+/// overlap_a = fraction of A's range that overlaps with B
+/// overlap_b = fraction of B's range that overlaps with A
+/// NDV = max(overlap_a * NDV_a, overlap_b * NDV_b)   \[intersection\]
+///     + (1 - overlap_a) * NDV_a                      [only in A]
+///     + (1 - overlap_b) * NDV_b                      [only in B]

Review Comment:
   Thanks @asolimando, with that comment and also some LLM input I've added 
comments. I hope it is OK now.
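
   For readers following the thread, the formula quoted in the doc comment can be sketched in isolation. This is only an illustration of that formula, not the PR's `estimate_ndv_with_overlap`, whose body is not shown here. The names `Range`, `overlap_fraction`, and `estimate_union_ndv` are hypothetical, and the sketch assumes plain `f64` ranges with `min <= max` rather than DataFusion's `ColumnStatistics`.

```rust
/// Hypothetical closed numeric interval [min, max]; assumes min <= max.
#[derive(Clone, Copy)]
struct Range {
    min: f64,
    max: f64,
}

/// Fraction of `a`'s range that also lies inside `b`'s range, in [0, 1].
fn overlap_fraction(a: Range, b: Range) -> f64 {
    let lo = a.min.max(b.min);
    let hi = a.max.min(b.max);
    if hi < lo {
        return 0.0; // the ranges are disjoint
    }
    let width = a.max - a.min;
    if width == 0.0 {
        return 1.0; // single-point range that falls inside `b`
    }
    (hi - lo) / width
}

/// Union NDV estimate following the formula in the doc comment:
///   max(overlap_a * NDV_a, overlap_b * NDV_b)   [intersection]
///   + (1 - overlap_a) * NDV_a                   [only in A]
///   + (1 - overlap_b) * NDV_b                   [only in B]
fn estimate_union_ndv(a: Range, ndv_a: usize, b: Range, ndv_b: usize) -> usize {
    let overlap_a = overlap_fraction(a, b);
    let overlap_b = overlap_fraction(b, a);
    let (na, nb) = (ndv_a as f64, ndv_b as f64);

    let intersection = (overlap_a * na).max(overlap_b * nb);
    let only_a = (1.0 - overlap_a) * na;
    let only_b = (1.0 - overlap_b) * nb;

    // Rounding to the nearest integer is a choice made for this sketch.
    (intersection + only_a + only_b).round() as usize
}
```

   As a worked case, take A = [0, 100] with 50 distinct values and B = [50, 150] with 50 distinct values. Both overlap fractions are 0.5, so the estimate is 25 + 25 + 25 = 75. Disjoint ranges reduce to `ndv_a + ndv_b`, which matches the fallback in `union_distinct_count`, and identical ranges reduce to the larger of the two counts.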



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

