alamb commented on issue #8078: URL: https://github.com/apache/datafusion/issues/8078#issuecomment-2543895426
There was a bunch of discussion on https://github.com/apache/datafusion/pull/13293 that I would like to try and summarize: Trying to combine the proposals from @crepererum https://github.com/apache/datafusion/pull/13293#issuecomment-2470333151 and @berkaysynnada on https://github.com/apache/datafusion/pull/13293#issuecomment-2467866670 and my own hopes that we make it cheaper to manage `Statistics`: # Changes to `Statistics` [`Statistics`](https://docs.rs/datafusion/latest/datafusion/common/struct.Statistics.html) ```statistics pub struct Statistics { pub num_rows: Precision<usize>, pub total_byte_size: Precision<usize>, pub column_statistics: Vec<ColumnStatistics>, } ``` I propose changing this to: ```statistics pub struct Statistics { pub num_rows: Precision<usize>, pub total_byte_size: Precision<usize>, pub column_statistics: Vec<Arc<ColumnStatistics>>, } ``` Which would allow for quickly copying ColumnStatistics for unchanged columns. An alternate may be to just use `Arc<Statistics>` everywhere (or maybe both 🤔 ) # Changes to `ColumnStatistics` At the time of writing, [`ColumnStatistics`](https://docs.rs/datafusion/latest/datafusion/physical_plan/struct.ColumnStatistics.html) looks like ```rust pub struct ColumnStatistics { pub null_count: Precision<usize>, pub max_value: Precision<ScalarValue>, pub min_value: Precision<ScalarValue>, pub distinct_count: Precision<usize>, } ``` I believe the proposal involves changing [`Precision`](https://docs.rs/datafusion/latest/datafusion/common/stats/enum.Precision.html) so it can capture the min/max data, which would result in a CoumnStatistics structure like this: ```rust pub struct ColumnStatistics { pub null_count: Precision<usize>, // The value (or range) that this column takes on pub value: Precision<ScalarValue>, pub distinct_count: Precision<usize>, } ``` # Changes to `Precision` Some constraints: 1. [`Precision`](https://docs.rs/datafusion/latest/datafusion/common/stats/enum.Precision.html) today is a template which allows for both `Precision<usize>` and `Precision<ScalarValue>` 2. [`Interval`](https://docs.rs/datafusion/latest/datafusion/logical_expr/interval_arithmetic/struct.Interval.html) already provides the ability to do range analysis and is based on `ScalarValue` ## Options 1: Use `ScalarValue` exclusively ```rust /// ... enum Precision { /// Nothing is known about the value Absent, /// Estimated, but very close to the given point. /// /// This can be treated as an over-simplified normal distribution. PointEstimation(ScalarValue), /// Value is for sure in the given Interval. If min=max the value is known to be a single value Interval(Interval), } ``` Pros: Can reuse `Interval` arithmetic and will allow us to hopefully unify Interval and statitics calculations Cons: Will be less efficient for statistics such as `count` which we know must be `usize` ## Options 2: Precision remains parameterized on `T` ```rust /// ... /// /// # Note /// Unknown ranges are modeled as `Precision::Range{lower: None, upper: None}`, enum Precision { /// Estimated, but very close to the given point. /// /// This can be treated as an over-simplified normal distribution. PointEstimation(T), /// Value is for sure in the given open/half-open/closed range. /// /// If `lower`/`upper` are given, they are INCLUSIVE ends. Range { lower: Option<T>, upper: Option<T>, }, } ``` Cons: More natural analysis for statistics such as `count` which we know must be `usize` Pros: Harder to reuse `Interval` arithmetic and will allow us to hopefully unify Interval and statistics calculations ## Options 3: Change interval to be parameterized on `T` ```rust /// ... enum Precision<T> { /// Nothing is known about the value Absent, /// Estimated, but "close" to the given point. /// /// This can be treated as an over-simplified normal distribution and used for /// Cardinality estimation, for example PointEstimation(T), /// Value is for sure in the given Interval. If min=max the value is known to be a single value Interval(Interval<T>), } ``` Cons: The most work/ changes required Pros: Unifies Intervals and Statistics # Unify Statistics and `evaluate_bounds` Today [`PhysicalExpr::evaluate_bounds`](https://docs.rs/datafusion/latest/datafusion/physical_expr/trait.PhysicalExpr.html#method.evaluate_bounds) computes the output bounds of an expression output using [`Interval`](https://docs.rs/datafusion/latest/datafusion/logical_expr/interval_arithmetic/struct.Interval.html)s for the inputs. ```rust trait PhysicalExpr { ... fn evaluate_bounds( &self, _children: &[&Interval], ) -> Result<Interval, DataFusionError> ... ``` @gatesn proposes adding an API that does something similar in: - https://github.com/apache/datafusion/pull/13736 His API computes the output *ColumnStatistics* given the input `Statistics`, which I believe is a more general concept than just `evalute_bounds` (as it encompasses counts as well as the value) ```rust trait PhysicalExpr { ... fn column_statistics(&self, _statistics: &Statistics) -> Result<ColumnStatistics> { ... ``` If we allowed `Precision` to represent `Interval`s (Option 3 above)I think we could deprecate `evaluate_bounds` and simply use `column_statistics` as it would be a strict superset . # Incremental steps to get us there: The changes contemplated above are substantial and potentially quite disruptive to downstream APIs. I propose a phased approach: 1. Step 1 is to make the fields of `Statistics` / `ColumnStatistics`/ `Precision` not `pub` and only allow access through accessors (this will permit us to 1) change internal structures with more limited impact and 2) formalize the API by which statistics are manipulated. 3. Change `Interval` to be `Interval<T>` which will allow `Interval<usize>` types 4. Change `Precision` to follow Option 3 above 5. Change `evalute_bounds` to `column_statistics` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
