asolimando commented on code in PR #19957:
URL: https://github.com/apache/datafusion/pull/19957#discussion_r2917174928


##########
datafusion/common/src/stats.rs:
##########
@@ -660,7 +637,14 @@ impl Statistics {
             col_stats.max_value = col_stats.max_value.max(&item_col_stats.max_value);
             col_stats.min_value = col_stats.min_value.min(&item_col_stats.min_value);
             col_stats.sum_value = col_stats.sum_value.add(&item_col_stats.sum_value);
-            col_stats.distinct_count = Precision::Absent;
+            // Use max as a conservative lower bound for distinct count
+            // (can't accurately merge NDV since duplicates may exist across partitions)

Review Comment:
   Since you approved https://github.com/apache/datafusion/pull/20846 already, 
and the code in `union.rs::col_stats_union` is exactly what we want here too, I 
plan to wait for that PR to get merged, then turn `estimate_ndv_with_overlap` 
into a utility function in `datafusion-common/src/stats.rs` for reuse (all the 
types it needs already live there).
   
   The formula is binary, but since `try_merge` folds over many row groups, iterative pairwise application works naturally. There are no extra heap allocations: `distance()` returns a `usize` on the stack, and the rest is `f64` arithmetic on top of the min/max comparisons `try_merge` already performs.
   
   Note: in #20846 the fallback when min/max are absent is `sum` (sensible for Union). For row-group merging I'll keep `max` as the fallback, since row groups from the same file are more likely to contain overlapping values. I will make sure to generalize the current code to handle both cases.
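   For context, here is a minimal sketch of the kind of overlap-based NDV merge described above, with a flag for the `max` vs `sum` fallback. The function name, signature, and formula are illustrative assumptions for this comment, not the actual code from #20846:

```rust
/// Hypothetical sketch: merge two NDV estimates using min/max range overlap.
/// The overlap fraction of the smaller range estimates how many distinct
/// values the two inputs share; the result is clamped between the
/// conservative lower bound (max) and the upper bound (sum).
fn estimate_ndv_with_overlap(
    range_a: Option<(i64, i64)>,
    ndv_a: u64,
    range_b: Option<(i64, i64)>,
    ndv_b: u64,
    use_sum_fallback: bool,
) -> u64 {
    match (range_a, range_b) {
        (Some((min_a, max_a)), Some((min_b, max_b))) => {
            // Width of the overlapping value range (the `distance()` analogue);
            // plain integer arithmetic on the stack, no heap allocation.
            let overlap = (max_a.min(max_b) - min_a.max(min_b)).max(0) as f64;
            let smaller = (max_a - min_a).min(max_b - min_b).max(1) as f64;
            // Expected number of distinct values shared between the inputs.
            let shared = (overlap / smaller) * ndv_a.min(ndv_b) as f64;
            let merged = ((ndv_a + ndv_b) as f64 - shared).round() as u64;
            merged.clamp(ndv_a.max(ndv_b), ndv_a + ndv_b)
        }
        // Without ranges the overlap can't be estimated: `sum` suits Union
        // inputs, `max` suits row groups of one file (values likely overlap).
        _ if use_sum_fallback => ndv_a + ndv_b,
        _ => ndv_a.max(ndv_b),
    }
}
```

   With identical ranges the estimate collapses toward the larger NDV; with disjoint ranges it approaches the sum, which matches the intuition behind both fallbacks.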



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
