gene-bordegaray commented on code in PR #19957:
URL: https://github.com/apache/datafusion/pull/19957#discussion_r2923487862
##########
datafusion/datasource-parquet/src/metadata.rs:
##########
@@ -541,6 +541,36 @@ fn summarize_min_max_null_counts(
)
.map(|(idx, _)| idx);
+ // Extract distinct counts from row group column statistics
+ accumulators.distinct_counts_array[logical_schema_index] =
+ if let Some(parquet_idx) = parquet_index {
+ let distinct_counts: Vec<u64> = row_groups_metadata
+ .iter()
+ .filter_map(|rg| {
+ rg.columns()
+ .get(parquet_idx)
+ .and_then(|col| col.statistics())
+ .and_then(|stats| stats.distinct_count_opt())
+ })
+ .collect();
+
+ if distinct_counts.is_empty() {
+ Precision::Absent
+ } else if distinct_counts.len() == 1 {
Review Comment:
I think the partial-coverage case still needs tightening.
If we merge multiple row groups and only some of them expose NDV returning
NDV as Exact for the whole file doesn't seem accurrate to me. Could we return
make it Inexact?
The test `test_distinct_count_partial_ndv_in_row_groups` seems wrong
asserting `Exact(15)` for a file where another row group has unknown NDV.
##########
datafusion/common/src/stats.rs:
##########
@@ -660,7 +637,14 @@ impl Statistics {
col_stats.max_value =
col_stats.max_value.max(&item_col_stats.max_value);
col_stats.min_value =
col_stats.min_value.min(&item_col_stats.min_value);
col_stats.sum_value =
col_stats.sum_value.add(&item_col_stats.sum_value);
- col_stats.distinct_count = Precision::Absent;
+ // Use max as a conservative lower bound for distinct count
+ // (can't accurately merge NDV since duplicates may exist across
partitions)
Review Comment:
If we keep max() as the merge rule, I think we should document that merged
NDV is treated as a lower-bound-ish estimate in public facing comments.
##########
datafusion/physical-expr/src/projection.rs:
##########
@@ -660,9 +660,25 @@ impl ProjectionExprs {
}
}
} else {
- // TODO stats: estimate more statistics from expressions
- // (expressions should compute their statistics themselves)
- ColumnStatistics::new_unknown()
+ // TODO: expressions should compute their own statistics
Review Comment:
Following up on this thread for my review:
I’d prefer to drop the new NDV propagation in projections from this PR. The
heuristic “expression references a single column” is too broad to be correct
for NDV. The thread already suggests handling this in a separate expression
statistics design. For this PR I think derived expressions should stay unknown,
then address the rest in a follow-up
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]