alamb commented on issue #8078: URL: https://github.com/apache/datafusion/issues/8078#issuecomment-2545903445
> (1) and (2) are not mutually exclusive. Sometimes we have info on both, sometimes only one but not the other, sometimes we don't know anything about either. The type definition as stated currently doesn't accommodate this, but this is an easy fix. I think this is a **great** insight. Maybe the problem is that we are trying to overload `Statistics` to keep both types of information (statistics and bounds) 🤔 The [`aggregate_stastics`](https://github.com/apache/datafusion/blob/main/datafusion/physical-optimizer/src/aggregate_statistics.rs) pass in the optimizer uses `Statistics` for the bounds use case (aka makes transformations that rely on values in Statistics for correctness): - We hit a bug recently about this: https://github.com/apache/datafusion/pull/12471 from @waruto210 - Bounds are needed for @suremarc 's work in https://github.com/apache/datafusion/issues/10316 What is the easy fix you envision? One thing I could think of would be to add a structure similar to `Statistics` but explicitly for tracking the known bounds of a column ```rust /// Known bounds for a set of columns /// /// This information is guaranteed to be correct and is used for /// optimization and correctness transformations. See [`Statistics`] for /// struct Bounds { /// Interval, if known, for each output column inner: Vec<Option<Interval>>, } ``` However, [`aggregate_stastics`](https://github.com/apache/datafusion/blob/main/datafusion/physical-optimizer/src/aggregate_statistics.rs) does use column count (not just value ranges) so we would have to work that in somehow > Column statistics vs Expression evaluation This also makes sense (that statistics propagation is a higher level concept that naturally builds on `evaluate_bounds`) One way maybe we could support the usecase of code that called the existing evaluate_bounds function with the new API could be to make it easy to a `Statistics `object ranges on an expression. For example, given existing code like ```rust let input_intervals: Vec<&Interval> = ....; let output_interval = expr.evaluate_bounds(&input_itervals)?; ... ``` Might look something like ```rust let input_intervals: Vec<&Interval> = ....; // wrap input intervals with Statistics let temp_statistics = Statistics::new_from_bounds(&input_intervals); // compute output column statistics let output_column_statistics = expr.column_statistics(&temp_statistics)?; // use the output value if it was known let output_interval. = match output_column_statstics.value() { Precision::Absent | Precision::PointEstimation => None, Precision::Interval(interval) => interval }; -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
