Re: [I] Introduce a way to represent constrained statistics / bounds on values in Statistics [arrow-datafusion]

via GitHub Tue, 14 Nov 2023 10:05:48 -0800


alamb commented on issue #8078:
URL: 
https://github.com/apache/arrow-datafusion/issues/8078#issuecomment-1810844263


   After some thought, I think the key information we are trying to capture in 
statistics are:
   1. Known min/max values
   2. Estimated values that are imprecise (the result of applying some sort of 
estimate), but are better than just "somewhere between min/max"
   
   I don't think the proposal (at least as I understand it) in 
https://github.com/apache/arrow-datafusion/issues/8078#issuecomment-1804546752, 
can capture the known min/max values
   
   However, perhaps something like this could:
   
   ```rust
   pub enum Precision<T: Debug + Clone + PartialEq + Eq + PartialOrd> {
       /// The exact value is known
       Exact(T),
       /// The exact value is not known, but the real value is **known** to be 
within
       /// the specified range: `lower <= value <= upper` and furthermore is 
estimated
       /// to be `estimate`. However, the real value may fall anywhere in the 
range
       /// TODO: we could use `Interval` here instead, which could represent 
more complex cases (like
       /// open/closed bounds rather than using T::MIN/T::MAX)
       Bounded { lower: T, estimate: T upper: T},
       /// Nothing is known about the value
       #[default]
       Absent,
   }
   ```
   
   Maybe estimate should be `Option<T>` rather than just `T` 🤔 
   
   # Example1: Filter `x <= 10`
   In the example of a filter where you know the column `x` has bounds `[1, 
100]` and the row_count is `N`. 
   
   After applying a filter of `x <= 10`, assuming an even distribution you get 
a selectivity of `0.1 `, so we can conclude the following about the output 
`row_count`:
   1. It is still between 0 and N (the max value hasn't been changed by the 
filter)
   2. We estimate that the value is near `0.1 * N` given our cardinality 
estimation, but this is just an estimate
   
   As I understand @berkaysynnada 's proposal 
https://github.com/apache/arrow-datafusion/issues/8078#issuecomment-1804546752, 
it would represent the output as either as:
   
   
   ```rust
   Precision::PointEstimate(Inexact(0.1*N))
   ```
   
   or
   
   ```rust
   Precision::Range(
     PointEstimate::Exact(0), // min
     PointEstimate::Inexact(0.1 * N), // max
   )
   ```
   
   But neither captures the fact that we know for sure that row count lies 
between 0 and `N` (the max wasn't changed by the filter). Using a `Bounded` 
variant could capture this information:
   
   ```rust
   Precision::Bounded {
     lower: 0,
     estimate: 0.1 * N,
     upper: 100
   }
   ```
   
   Here is the setup in pictures for those who like that
   
   ```
        Output Statistics  for row_count
         value: estimated to be 0.1*N           
         min: Exactly 0           
         max: Exactly 100 (can not possibly be higher)         
                                  
                ▲                 
                │                 
   ┌────────────┴────────────┐    
   │                         │    
   │       FilterExec        │    
   │         x <= 10         │    
   │                         │    
   └─────────────────────────┘    
                ▲                 
                │                 
                │                 
                                  
       Input Statistics for row count:     
        value: Exactly N      
        min: Exactly 0            
        max: Exactly 100          
                                  
   ```
   
   
   # `x <= y` example
   
   The other example raised above is a predicate like `x <= y`. I assume the 
challenge is now how do we represent the column ranges with this information. 
Let's say we know that `x` was between `[50, 100]` and `y` was between [0, 200] 
and there were N input rows
   
   Assuming even and uncorrelated distribution, we would get a selectivity 
estimate of 0.75
   
   ```
     │                                                      │                   
   
     │                                                      │                   
   
     │──────────────────────────────────────────────────────┤       y           
   
     │                                                      │       min: 0      
   
     │                                                      │       max: 200    
   
                                                                                
   
     0             │                      │                200                  
   
                   │                      │                                     
   
                   ├──────────────────────┤                         x           
   
                   │                      │                         min: 50     
   
                   │                      │                         max: 100    
   
                                                                                
   
                   0                     200                                    
   
                                                                                
   
                                                                                
   
                  **     Only range that can be true      **                    
   
                   **                                    **                     
   
                    **************************************                      
   
                                                                                
   
   ```
   
   We could again represent the output statistic with the proper range as well 
as the cardinality estimate like this, which both captures the fact `y` can't 
be less than `50` as well as the `0.75` selectivity):
   
   ```
   x: Bounded { lower: 50, estimate: 0.75 * N, upper: 100 }, 
   y: Bounded { lower: 50, estimate: 0.75 * N, upper: 200 }
   ```


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Re: [I] Introduce a way to represent constrained statistics / bounds on values in Statistics [arrow-datafusion]

Reply via email to