alamb commented on issue #8078: URL: https://github.com/apache/arrow-datafusion/issues/8078#issuecomment-1810844263
After some thought, I think the key information we are trying to capture in statistics are: 1. Known min/max values 2. Estimated values that are imprecise (the result of applying some sort of estimate), but are better than just "somewhere between min/max" I don't think the proposal (at least as I understand it) in https://github.com/apache/arrow-datafusion/issues/8078#issuecomment-1804546752, can capture the known min/max values However, perhaps something like this could: ```rust pub enum Precision<T: Debug + Clone + PartialEq + Eq + PartialOrd> { /// The exact value is known Exact(T), /// The exact value is not known, but the real value is **known** to be within /// the specified range: `lower <= value <= upper` and furthermore is estimated /// to be `estimate`. However, the real value may fall anywhere in the range /// TODO: we could use `Interval` here instead, which could represent more complex cases (like /// open/closed bounds rather than using T::MIN/T::MAX) Bounded { lower: T, estimate: T upper: T}, /// Nothing is known about the value #[default] Absent, } ``` Maybe estimate should be `Option<T>` rather than just `T` 🤔 # Example1: Filter `x <= 10` In the example of a filter where you know the column `x` has bounds `[1, 100]` and the row_count is `N`. After applying a filter of `x <= 10`, assuming an even distribution you get a selectivity of `0.1 `, so we can conclude the following about the output `row_count`: 1. It is still between 0 and N (the max value hasn't been changed by the filter) 2. We estimate that the value is near `0.1 * N` given our cardinality estimation, but this is just an estimate As I understand @berkaysynnada 's proposal https://github.com/apache/arrow-datafusion/issues/8078#issuecomment-1804546752, it would represent the output as either as: ```rust Precision::PointEstimate(Inexact(0.1*N)) ``` or ```rust Precision::Range( PointEstimate::Exact(0), // min PointEstimate::Inexact(0.1 * N), // max ) ``` But neither captures the fact that we know for sure that row count lies between 0 and `N` (the max wasn't changed by the filter). Using a `Bounded` variant could capture this information: ```rust Precision::Bounded { lower: 0, estimate: 0.1 * N, upper: 100 } ``` Here is the setup in pictures for those who like that ``` Output Statistics for row_count value: estimated to be 0.1*N min: Exactly 0 max: Exactly 100 (can not possibly be higher) ▲ │ ┌────────────┴────────────┐ │ │ │ FilterExec │ │ x <= 10 │ │ │ └─────────────────────────┘ ▲ │ │ Input Statistics for row count: value: Exactly N min: Exactly 0 max: Exactly 100 ``` # `x <= y` example The other example raised above is a predicate like `x <= y`. I assume the challenge is now how do we represent the column ranges with this information. Let's say we know that `x` was between `[50, 100]` and `y` was between [0, 200] and there were N input rows Assuming even and uncorrelated distribution, we would get a selectivity estimate of 0.75 ``` │ │ │ │ │──────────────────────────────────────────────────────┤ y │ │ min: 0 │ │ max: 200 0 │ │ 200 │ │ ├──────────────────────┤ x │ │ min: 50 │ │ max: 100 0 200 ** Only range that can be true ** ** ** ************************************** ``` We could again represent the output statistic with the proper range as well as the cardinality estimate like this, which both captures the fact `y` can't be less than `50` as well as the `0.75` selectivity): ``` x: Bounded { lower: 50, estimate: 0.75 * N, upper: 100 }, y: Bounded { lower: 50, estimate: 0.75 * N, upper: 200 } ``` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
