Re: [I] Introduce a way to represent constrained statistics / bounds on values in Statistics [arrow-datafusion]

via GitHub Tue, 14 Nov 2023 07:00:59 -0800


alamb commented on issue #8078:
URL: 
https://github.com/apache/arrow-datafusion/issues/8078#issuecomment-1810386490


   I have been playing with and studying this code. While the suggestion from 
@ozankabak and @berkaysynnada in 
https://github.com/apache/arrow-datafusion/issues/8078#issuecomment-1804546752 
is very general and can represent many types of uncertainty in statistics, I 
haven't yet found cases where that full generality is important.
   
   For example, I can't find (nor think of) an important case where the lower 
bound would be known with certainty and the upper bound would be uncertain 
(as opposed to `TYPE::MAX`).
   
   Another example would be a use case where it matters to distinguish between 
ranges like
   
   ```
   min: PointEstimate::Absent, max: PointEstimate::Precise(value)
   min: PointEstimate::Precise(TYPE::MIN), max: PointEstimate::Precise(value)
   ```
   
   Thus I am going to prototype what adding a `Bounded` variant to `Precision` 
looks like.
   
   I also plan to encapsulate more of the checks into `Precision` so that if 
we choose to go with a more general formulation we won't have to change as 
much of the rest of the code.
   
   ```
   use std::fmt::Debug;

   // `#[default]` below only compiles when `Default` is derived.
   #[derive(Debug, Clone, PartialEq, Eq, Default)]
   pub enum Precision<T: Debug + Clone + PartialEq + Eq + PartialOrd> {
       /// The exact value is known
       Exact(T),
       /// The exact value is not known, but the real value is known to be
       /// within the specified range: `lower <= value <= upper`.
       /// TODO: we could use `Interval` here instead, which could represent
       /// more complex cases (like open/closed bounds)
       Bounded { lower: T, upper: T },
       /// The value is not known exactly, but is likely close to this value.
       /// NOTHING can be assumed about the value for correctness in this case.
       Inexact(T),
       /// Nothing is known about the value
       #[default]
       Absent,
   }
   
   ```
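
As a rough illustration of what "encapsulating the checks" could look like, here is a minimal sketch that repeats the enum above and adds a few accessors. The method names `lower_bound`, `upper_bound`, and `is_at_most` are hypothetical and chosen only for this example; they are not taken from the DataFusion code base.

```rust
use std::fmt::Debug;

#[derive(Debug, Clone, PartialEq, Eq, Default)]
pub enum Precision<T: Debug + Clone + PartialEq + Eq + PartialOrd> {
    Exact(T),
    Bounded { lower: T, upper: T },
    Inexact(T),
    #[default]
    Absent,
}

impl<T: Debug + Clone + PartialEq + Eq + PartialOrd> Precision<T> {
    /// Hypothetical helper: a lower bound that is guaranteed to hold, if any.
    /// `Inexact` and `Absent` give no guarantee, so they return `None`.
    pub fn lower_bound(&self) -> Option<&T> {
        match self {
            Precision::Exact(v) => Some(v),
            Precision::Bounded { lower, .. } => Some(lower),
            Precision::Inexact(_) | Precision::Absent => None,
        }
    }

    /// Hypothetical helper: an upper bound that is guaranteed to hold, if any.
    pub fn upper_bound(&self) -> Option<&T> {
        match self {
            Precision::Exact(v) => Some(v),
            Precision::Bounded { upper, .. } => Some(upper),
            Precision::Inexact(_) | Precision::Absent => None,
        }
    }

    /// Hypothetical helper: true only when the value is provably `<= limit`.
    pub fn is_at_most(&self, limit: &T) -> bool {
        self.upper_bound().map_or(false, |upper| upper <= limit)
    }
}
```

With accessors like these, a caller that only needs a guaranteed bound never has to match on the variants directly, so replacing `Bounded` with something more general (such as an `Interval`) later would mostly be confined to the `impl` block.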
   
   I'll report back here with how it goes shortly.
   


