gabotechs commented on code in PR #19957:
URL: https://github.com/apache/datafusion/pull/19957#discussion_r2726962898
##########
datafusion/common/src/stats.rs:
##########
@@ -632,7 +632,14 @@ impl Statistics {
col_stats.max_value =
col_stats.max_value.max(&item_col_stats.max_value);
col_stats.min_value =
col_stats.min_value.min(&item_col_stats.min_value);
col_stats.sum_value =
col_stats.sum_value.add(&item_col_stats.sum_value);
- col_stats.distinct_count = Precision::Absent;
+ // Use max as a conservative lower bound for distinct count
+ // (can't accurately merge NDV since duplicates may exist across
partitions)
+ col_stats.distinct_count = col_stats
+ .distinct_count
+ .get_value()
+ .max(item_col_stats.distinct_count.get_value())
+ .map(|&v| Precision::Inexact(v))
+ .unwrap_or(Precision::Absent);
Review Comment:
👍 thinks sounds pretty reasonable to me
##########
datafusion/physical-expr/src/projection.rs:
##########
@@ -660,9 +660,25 @@ impl ProjectionExprs {
}
}
} else {
- // TODO stats: estimate more statistics from expressions
- // (expressions should compute their statistics themselves)
- ColumnStatistics::new_unknown()
+ // TODO: expressions should compute their own statistics
Review Comment:
I think this TODO comment is right on point: if we wanted to compute
statistics for all expressions here, it would become unmanageable pretty soon.
Even if the estimation about expressions referencing single columns here is
an improvement, it would be awesome if we could try to be one step ahead and
think how proper stats propagation across expressions would look like. That
way, this PR, instead of special-handling one particular case, it could be
setting the foundation on top of which more people can start contributing
smarter stats calculation.
I see that some prior work has been done in order to improve stats
propagation in expressions. Some context here:
- https://github.com/apache/datafusion/pull/14699
- https://github.com/apache/datafusion/issues/14896
By looking at the API for handling statistics in expressions shipped in
https://github.com/apache/datafusion/pull/14699... I would not know how to use
it in order to properly calculate NDV values. I get the impression that 1) we
do not want to introduce yet another API for propagating stats across
expressions and 2) the current "new" API shipped in
https://github.com/apache/datafusion/pull/14699 is not suitable for propagating
NDV values.
##########
datafusion/datasource-parquet/src/metadata.rs:
##########
@@ -411,53 +415,6 @@ fn create_max_min_accs(
(max_values, min_values)
}
-fn get_col_stats(
Review Comment:
Correct me if I'm wrong but the `get_col_stats()` changes look more like a
noop refactor right? I think it looks good 👍 but just double checking that
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]