asolimando commented on code in PR #19957:
URL: https://github.com/apache/datafusion/pull/19957#discussion_r2913413380


##########
datafusion/common/src/stats.rs:
##########
@@ -660,7 +637,14 @@ impl Statistics {
             col_stats.max_value = col_stats.max_value.max(&item_col_stats.max_value);
             col_stats.min_value = col_stats.min_value.min(&item_col_stats.min_value);
             col_stats.sum_value = col_stats.sum_value.add(&item_col_stats.sum_value);
-            col_stats.distinct_count = Precision::Absent;
+            // Use max as a conservative lower bound for distinct count
+            // (can't accurately merge NDV since duplicates may exist across partitions)

Review Comment:
   Actually, @jonathanc-n proposes in
   https://github.com/apache/datafusion/pull/20846#discussion_r2913027707
   to use Trino's approach for updating NDV when min and max are available,
   which is quite elegant (quoting from his message):
   
   ```text
   // for merging A + B when min/max are available
   
   overlap_a = (overlap range) / (A's range)  // fraction of A's range that overlaps with B
   overlap_b = (overlap range) / (B's range)  // fraction of B's range that overlaps with A
   
   new_ndv = max(overlap_a * NDV_a, overlap_b * NDV_b)  // NDV in the overlapping range
           + (1 - overlap_a) * NDV_a                     // NDV unique to A's range
           + (1 - overlap_b) * NDV_b                     // NDV unique to B's range
   ```
   
   The result lies in `[max(ndvs), sum(ndvs)]`: it equals `max(ndvs)` under full
   overlap (identical ranges) and `sum(ndvs)` under no overlap (disjoint ranges).
   This assumes distinct values are uniformly distributed over each `[min, max]`
   range, which is the classic assumption for scalar-based statistics propagation.
   
   When min/max are not available, we can fall back to `max`, as currently 
implemented.
   
   I can update `try_merge` accordingly, if you agree, @xudong963 @jonathanc-n.
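
For illustration only, here is a minimal sketch of the merge rule quoted above, including the fallback to `max`. It is not the actual `try_merge` change: `merge_ndv` and `overlap_fraction` are hypothetical names, min/max are simplified to `f64` pairs rather than DataFusion's statistics types, and the handling of zero-width ranges (a single-valued column) is my own assumption, since the quoted formula does not cover it.

```rust
/// Hypothetical helper: fraction of one side's range covered by the overlap.
/// Assumption: a zero-width range counts as fully overlapping if it lies
/// inside the other range, and as not overlapping otherwise.
fn overlap_fraction(overlap: f64, width: f64, intersects: bool) -> f64 {
    if width > 0.0 {
        overlap / width
    } else if intersects {
        1.0
    } else {
        0.0
    }
}

/// Hypothetical sketch (not DataFusion API): estimate the NDV of A + B from
/// each side's NDV and optional numeric (min, max) range.
fn merge_ndv(
    ndv_a: f64,
    range_a: Option<(f64, f64)>,
    ndv_b: f64,
    range_b: Option<(f64, f64)>,
) -> f64 {
    // Without min/max on both sides, fall back to max(NDV_a, NDV_b).
    let ((min_a, max_a), (min_b, max_b)) = match (range_a, range_b) {
        (Some(a), Some(b)) => (a, b),
        _ => return ndv_a.max(ndv_b),
    };

    let width_a = max_a - min_a;
    let width_b = max_b - min_b;
    // Inverted ranges (min > max) are treated as unusable (assumption).
    if width_a < 0.0 || width_b < 0.0 {
        return ndv_a.max(ndv_b);
    }

    // Width of the intersection of the two ranges, clamped to 0 when disjoint.
    let lo = min_a.max(min_b);
    let hi = max_a.min(max_b);
    let overlap = (hi - lo).max(0.0);
    let intersects = hi >= lo;

    let overlap_a = overlap_fraction(overlap, width_a, intersects);
    let overlap_b = overlap_fraction(overlap, width_b, intersects);

    // NDV in the overlapping range, plus the NDV unique to each side.
    (overlap_a * ndv_a).max(overlap_b * ndv_b)
        + (1.0 - overlap_a) * ndv_a
        + (1.0 - overlap_b) * ndv_b
}
```

With `NDV_a = 100` over `[0, 10]` and `NDV_b = 40` over `[5, 15]`, both overlap fractions are 0.5, so the estimate is `max(50, 20) + 50 + 20 = 120`, which sits between `max(ndvs) = 100` and `sum(ndvs) = 140`.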



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

