alamb commented on issue #8078:
URL: https://github.com/apache/datafusion/issues/8078#issuecomment-2543895426

   There was a bunch of discussion on 
https://github.com/apache/datafusion/pull/13293 that I would like to try and 
summarize:
   
   Trying to combine the proposals from @crepererum 
   https://github.com/apache/datafusion/pull/13293#issuecomment-2470333151  and 
@berkaysynnada on 
https://github.com/apache/datafusion/pull/13293#issuecomment-2467866670 and my 
own hopes that we make it cheaper to manage `Statistics`:
   
   # Changes to `Statistics`
   
[`Statistics`](https://docs.rs/datafusion/latest/datafusion/common/struct.Statistics.html)
   
   ```statistics
   pub struct Statistics {
       pub num_rows: Precision<usize>,
       pub total_byte_size: Precision<usize>,
       pub column_statistics: Vec<ColumnStatistics>,
   }
   ```
   
   I propose changing this to:
   ```statistics
   pub struct Statistics {
       pub num_rows: Precision<usize>,
       pub total_byte_size: Precision<usize>,
       pub column_statistics: Vec<Arc<ColumnStatistics>>,
   }
   ```
   
   Which would allow for quickly copying ColumnStatistics for unchanged columns.
   
   An alternate may be to just use `Arc<Statistics>` everywhere (or maybe both 
🤔 )
   
   # Changes to `ColumnStatistics`
   
   At the time of writing, 
[`ColumnStatistics`](https://docs.rs/datafusion/latest/datafusion/physical_plan/struct.ColumnStatistics.html)
 looks like 
   ```rust
   pub struct ColumnStatistics {
       pub null_count: Precision<usize>,
       pub max_value: Precision<ScalarValue>,
       pub min_value: Precision<ScalarValue>,
       pub distinct_count: Precision<usize>,
   }
   ```
   
   I believe the proposal involves changing 
[`Precision`](https://docs.rs/datafusion/latest/datafusion/common/stats/enum.Precision.html)
 so it can capture the min/max data, which would result in a CoumnStatistics 
structure like this:
   ```rust
   pub struct ColumnStatistics {
       pub null_count: Precision<usize>,
       // The value (or range) that this column takes on
       pub value: Precision<ScalarValue>,
       pub distinct_count: Precision<usize>,
   }
   ```
   
   # Changes to `Precision`
   
   Some constraints:
   1. 
[`Precision`](https://docs.rs/datafusion/latest/datafusion/common/stats/enum.Precision.html)
 today is a template which allows for both `Precision<usize>` and 
`Precision<ScalarValue>`
   2. 
[`Interval`](https://docs.rs/datafusion/latest/datafusion/logical_expr/interval_arithmetic/struct.Interval.html)
 already provides the ability to do range analysis and is based on `ScalarValue`
   
   ## Options 1: Use `ScalarValue` exclusively
   
   ```rust
   /// ...
   enum Precision {
     /// Nothing is known about the value
     Absent,
     /// Estimated, but very close to the given point.
     ///
     /// This can be treated as an over-simplified normal distribution.
     PointEstimation(ScalarValue),
     /// Value is for sure in the given Interval. If min=max the value is known 
to be a single value
     Interval(Interval),
   }
   ```
     
   Pros: Can reuse `Interval` arithmetic and will allow us to hopefully unify 
Interval and statitics calculations
   Cons: Will be less efficient for statistics such as `count` which we know 
must be `usize`
   
   ## Options 2: Precision remains parameterized on `T`
   
   ```rust
   /// ...
   ///
   /// # Note
   /// Unknown ranges are modeled as `Precision::Range{lower: None, upper: 
None}`,
   enum Precision {
     /// Estimated, but very close to the given point.
     ///
     /// This can be treated as an over-simplified normal distribution.
     PointEstimation(T),
     
     /// Value is for sure in the given open/half-open/closed range.
     ///
     /// If `lower`/`upper` are given, they are INCLUSIVE ends.
     Range {
       lower: Option<T>,
       upper: Option<T>,
     },
   }
   ```
   Cons: More natural analysis for statistics such as `count` which we know 
must be `usize`
   Pros: Harder to reuse `Interval` arithmetic and will allow us to hopefully 
unify Interval and statistics calculations
   
   ## Options 3: Change interval to be parameterized on `T`
   
   ```rust
   /// ...
   enum Precision<T> {
     /// Nothing is known about the value
     Absent,
     /// Estimated, but "close" to the given point.
     ///
     /// This can be treated as an over-simplified normal distribution and used 
for 
     /// Cardinality estimation, for example
     PointEstimation(T),
     /// Value is for sure in the given Interval. If min=max the value is known 
to be a single value
     Interval(Interval<T>),
   }
   ```
   Cons: The most work/ changes required
   Pros: Unifies Intervals and Statistics
   
   
   # Unify Statistics and `evaluate_bounds`
   
   Today  
[`PhysicalExpr::evaluate_bounds`](https://docs.rs/datafusion/latest/datafusion/physical_expr/trait.PhysicalExpr.html#method.evaluate_bounds)
 computes the output  bounds of an expression output using 
[`Interval`](https://docs.rs/datafusion/latest/datafusion/logical_expr/interval_arithmetic/struct.Interval.html)s
 for the inputs. 
   
   ```rust
   trait PhysicalExpr { 
   ...
   fn evaluate_bounds(
       &self,
       _children: &[&Interval],
   ) -> Result<Interval, DataFusionError>
   ...
   ```
   
   @gatesn proposes adding an API that does something similar in:
   -  https://github.com/apache/datafusion/pull/13736
   
   His API computes the output *ColumnStatistics* given the input `Statistics`, 
which I believe is a more general concept than just `evalute_bounds` (as it 
encompasses counts as well as the value)
   
   ```rust
   trait PhysicalExpr { 
   ...
       fn column_statistics(&self, _statistics: &Statistics) -> 
Result<ColumnStatistics> {
   ...
   ```
   
   If we allowed `Precision` to represent `Interval`s (Option 3 above)I think 
we could deprecate `evaluate_bounds` and simply use `column_statistics` as it 
would be a strict superset .
   
   
   # Incremental steps to get us there:
   
   The changes contemplated above are substantial and potentially quite 
disruptive to downstream APIs. 
   
   I propose a phased approach:
   1. Step 1 is to make the fields of `Statistics` / `ColumnStatistics`/ 
`Precision` not `pub` and only allow access through accessors (this will permit 
us to 1) change internal structures with more limited impact and 2) formalize 
the API by which statistics are manipulated. 
   3. Change `Interval` to be `Interval<T>` which will allow `Interval<usize>` 
types
   4. Change `Precision` to follow Option 3 above
   5. Change `evalute_bounds` to `column_statistics`
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to