alamb commented on issue #8078:
URL: https://github.com/apache/datafusion/issues/8078#issuecomment-2545903445

   > (1) and (2) are not mutually exclusive. Sometimes we have info on both, 
sometimes only one but not the other, sometimes we don't know anything about 
either. The type definition as stated currently doesn't accommodate this, but 
this is an easy fix.
   
   I think this is a **great** insight. Maybe the problem is that we are trying 
to overload `Statistics` to keep both types of information (statistics and 
bounds) 🤔 
   
   The 
[`aggregate_stastics`](https://github.com/apache/datafusion/blob/main/datafusion/physical-optimizer/src/aggregate_statistics.rs)
 pass in the optimizer uses `Statistics` for the bounds use case (aka makes 
transformations that rely on values in Statistics for correctness):
   - We hit a bug recently about this: 
https://github.com/apache/datafusion/pull/12471 from @waruto210 
   - Bounds are needed for @suremarc 's work in 
https://github.com/apache/datafusion/issues/10316
   
   What is the easy fix you envision? One thing I could think of would be to 
add a structure similar to `Statistics` but explicitly for tracking the known 
bounds of a column
   
   ```rust
   /// Known bounds for a set of columns
   /// 
   /// This information is guaranteed to be correct and is used for
   /// optimization and correctness transformations. See [`Statistics`] for 
   /// 
   struct Bounds {
     /// Interval, if known, for each output column
     inner: Vec<Option<Interval>>,
   }
   ```
   
   However, 
[`aggregate_stastics`](https://github.com/apache/datafusion/blob/main/datafusion/physical-optimizer/src/aggregate_statistics.rs)
 does use column count (not just value ranges) so we would have to work that in 
somehow
   
   > Column statistics vs Expression evaluation
   
   This also makes sense (that statistics propagation is a higher level concept 
that naturally builds on `evaluate_bounds`)
   
   One way maybe we could support the usecase of code that called the existing 
evaluate_bounds function with the new API could be to make it easy to a 
`Statistics `object  ranges on an expression. 
   
   For example, given existing code like
   
   ```rust
   let input_intervals: Vec<&Interval> = ....;
   let output_interval = expr.evaluate_bounds(&input_itervals)?;
   ...
   ```
   
   Might look something like
   ```rust
   let input_intervals: Vec<&Interval> = ....;
   // wrap input intervals with Statistics
   let temp_statistics = Statistics::new_from_bounds(&input_intervals);
   // compute output column statistics
   let output_column_statistics = expr.column_statistics(&temp_statistics)?;
   // use the output value if it was known
   let output_interval. = match output_column_statstics.value() {
     Precision::Absent | Precision::PointEstimation => None,
     Precision::Interval(interval) => interval
   };


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to