gene-bordegaray commented on code in PR #21081:
URL: https://github.com/apache/datafusion/pull/21081#discussion_r2988366494
##########
datafusion/common/src/stats.rs:
##########
@@ -551,6 +551,10 @@ impl Statistics {
}
Precision::Absent => Precision::Absent,
};
+ // NDV can never exceed the number of rows
+ if let Some(&rows) = self.num_rows.get_value() {
+ cs.distinct_count =
cs.distinct_count.min(&Precision::Inexact(rows));
Review Comment:
This seems like it would be fine when using this rule for a `LIMIT` since
this is a hard cap.
But `with_fetch()` also seems to handle `skip` which results in estimated
rows. I don't know if treating the hard cap provided form `fetch` and the
estimate from `skip` is the eeet way of doing this since we could easily
overestimate the NDV.
##########
datafusion/physical-plan/src/filter.rs:
##########
@@ -842,12 +844,19 @@ fn collect_new_statistics(
let is_exact = !lower.is_null() && !upper.is_null() && lower
== upper;
let min_value = interval_bound_to_precision(lower, is_exact);
let max_value = interval_bound_to_precision(upper, is_exact);
+ // NDV can never exceed the number of rows after filtering
+ let capped_distinct_count = match filtered_num_rows {
+ Some(rows) => {
+
distinct_count.to_inexact().min(&Precision::Inexact(rows))
+ }
+ None => distinct_count.to_inexact(),
+ };
Review Comment:
Same thing I am wondering here. Is is safe to cap the NDV to an inexact
value?
##########
datafusion/common/src/stats.rs:
##########
@@ -2075,6 +2080,46 @@ mod tests {
assert_eq!(result.total_byte_size, Precision::Inexact(800));
}
+ #[test]
Review Comment:
Adding a test here with `skip` being set would be nice to see what expecte
behavior is.
Maybe ths should be not changing or downgrading its precision? Lmk your
thoughts.
##########
datafusion/physical-plan/src/joins/utils.rs:
##########
@@ -707,7 +707,13 @@ fn max_distinct_count(
stats: &ColumnStatistics,
) -> Precision<usize> {
match &stats.distinct_count {
- &dc @ (Precision::Exact(_) | Precision::Inexact(_)) => dc,
+ &dc @ (Precision::Exact(_) | Precision::Inexact(_)) => {
+ // NDV can never exceed the number of rows
+ match num_rows {
+ Precision::Absent => dc,
+ _ => dc.min(num_rows).to_inexact(),
+ }
+ }
Review Comment:
Ditto to other comments
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]