asolimando commented on code in PR #19957:
URL: https://github.com/apache/datafusion/pull/19957#discussion_r2913413380
##########
datafusion/common/src/stats.rs:
##########
@@ -660,7 +637,14 @@ impl Statistics {
col_stats.max_value =
col_stats.max_value.max(&item_col_stats.max_value);
col_stats.min_value =
col_stats.min_value.min(&item_col_stats.min_value);
col_stats.sum_value =
col_stats.sum_value.add(&item_col_stats.sum_value);
- col_stats.distinct_count = Precision::Absent;
+ // Use max as a conservative lower bound for distinct count
+ // (can't accurately merge NDV since duplicates may exist across partitions)
Review Comment:
Actually, in
https://github.com/apache/datafusion/pull/20846#discussion_r2913027707
@jonathanc-n proposes using Trino's approach to updating NDV when min and max
are available, which is quite elegant (quoting from his message):
```text
// for merging A + B when min/max are available
overlap_a = (overlap range) / (A's range) // fraction of A's range that overlaps with B
overlap_b = (overlap range) / (B's range) // fraction of B's range that overlaps with A
new_ndv = max(overlap_a * NDV_a, overlap_b * NDV_b) // NDV in the overlapping range
        + (1 - overlap_a) * NDV_a // NDV unique to A's range
        + (1 - overlap_b) * NDV_b // NDV unique to B's range
```
The result always lies in `[max(ndvs), sum(ndvs)]`: it equals `max(ndvs)` at
full overlap and `sum(ndvs)` at no overlap. This assumes that distinct values
are uniformly distributed over each `[min, max]` range, which is the classic
assumption for scalar-based statistics propagation.
When min/max are not available, we can fall back to `max`, as currently
implemented.
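To make the proposal concrete, here is a minimal standalone sketch of the formula with the `max` fallback. It is not the `try_merge` change itself: `merge_ndv` is a hypothetical helper, it works on plain `f64` values rather than `Precision<ScalarValue>`, and the handling of disjoint and single-point ranges is my own assumption.

```rust
/// Hypothetical helper (not part of DataFusion's API): estimates the NDV of
/// the union of inputs A and B from their NDVs and optional numeric
/// `(min, max)` ranges.
fn merge_ndv(
    ndv_a: f64,
    range_a: Option<(f64, f64)>,
    ndv_b: f64,
    range_b: Option<(f64, f64)>,
) -> f64 {
    // Without min/max on both sides, fall back to max, as currently implemented.
    let (Some((min_a, max_a)), Some((min_b, max_b))) = (range_a, range_b) else {
        return ndv_a.max(ndv_b);
    };

    // Bounds of the intersection of the two ranges.
    let lo = min_a.max(min_b);
    let hi = max_a.min(max_b);
    if lo > hi {
        // Assumption: disjoint ranges share no values, so the NDVs add up.
        return ndv_a + ndv_b;
    }
    let overlap = hi - lo;

    // Fraction of a range covered by the overlap. Assumption: a single-point
    // range (width 0) that intersects the other range counts as fully covered.
    let fraction = |width: f64| if width > 0.0 { overlap / width } else { 1.0 };
    let overlap_a = fraction(max_a - min_a);
    let overlap_b = fraction(max_b - min_b);

    (overlap_a * ndv_a).max(overlap_b * ndv_b) // NDV in the overlapping range
        + (1.0 - overlap_a) * ndv_a            // NDV unique to A's range
        + (1.0 - overlap_b) * ndv_b            // NDV unique to B's range
}
```

For example, with `NDV_a = 100` over `[0, 100]` and `NDV_b = 50` over `[50, 150]`, both overlap fractions are 0.5, so the estimate is `max(50, 25) + 50 + 25 = 125`, which sits between `max(ndvs) = 100` and `sum(ndvs) = 150`.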
I can update `try_merge` accordingly, if you agree, @xudong963 @jonathanc-n.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]