crepererum commented on issue #5613:
URL: https://github.com/apache/arrow-datafusion/issues/5613#issuecomment-1471571066

   > I wonder how an "overestimate" would apply to num_rows. Unless we knew the 
distribution exactly, in order to preserve an overestimate in num_rows, 
wouldn't we have to assume no rows were filtered ?
   
   I guess if the ranges (min/max) are overestimated / too wide, then the 
number of rows is likely an overestimate as well (upper bound).
   
   Thinking about this more: it gets really confusing with 
min/max/row_count/n_bytes, because an "overestimate" for "min" is a lower bound 
while for "max" it's an upper bound. #997 already suggests reworking this 
attribute to be field-specific. I would propose extending the interface even 
further:
   
   ```rust
   struct Boundary<T: PartialOrd> {
       pub val: T,
       pub is_lower_bound: bool,
       pub is_upper_bound: bool,
   }
   
   impl<T: PartialOrd> Boundary<T> {
       pub fn is_exact(&self) -> bool {
           self.is_lower_bound && self.is_upper_bound
       }
   }
   
   pub struct Statistics {
       /// The number of table rows
       pub num_rows: Option<Boundary<usize>>,
       /// total bytes of the table rows
       pub total_byte_size: Option<Boundary<usize>>,
       /// Statistics on a column level
       pub column_statistics: Option<Vec<ColumnStatistics>>,
   }
   
   pub struct ColumnStatistics {
       /// Number of null values on column
       pub null_count: Option<Boundary<usize>>,
       /// Maximum value of column
       pub max_value: Option<Boundary<ScalarValue>>,
       /// Minimum value of column
       pub min_value: Option<Boundary<ScalarValue>>,
       /// Number of distinct values
       pub distinct_count: Option<Boundary<usize>>,
   }
   
   impl ColumnStatistics {
       pub fn min_max_exact(&self) -> bool {
           // `as_ref()` avoids moving the `Option` out of `&self`
           self.min_value.as_ref().map(|b| b.is_exact()).unwrap_or_default()
           && self.max_value.as_ref().map(|b| b.is_exact()).unwrap_or_default()
       }
   
       /// Does the range described by min-max contain ALL values?
       ///
       /// Note that the range might be too large. Some filters may not
       /// have been considered when this range was determined.
       pub fn min_max_contains_all(&self) -> bool {
           self.min_value.as_ref().map(|b| b.is_lower_bound).unwrap_or_default()
           && self.max_value.as_ref().map(|b| b.is_upper_bound).unwrap_or_default()
       }
   
       /// Does the range described by min-max contain actual data?
       ///
       /// Note that there might be values outside of this range, esp. when the
       /// statistics were constructed using sampling.
       pub fn min_max_guaranteed_to_contain_value(&self) -> bool {
           self.min_value.as_ref().map(|b| b.is_upper_bound).unwrap_or_default()
           && self.max_value.as_ref().map(|b| b.is_lower_bound).unwrap_or_default()
       }
   }
   ```
   
   Note that the exact interface and names are TBD; this is just a rough idea. 
There might also be similar interfaces in the pruning predicates and analysis 
passes, so maybe the `Boundary` struct can be reused there.
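
   As a rough sketch of how this could behave for `num_rows` under a filter: a 
filter can only remove rows, so an exact input count degrades to an upper 
bound. The constructors `exact` / `upper_bound` and the helper 
`num_rows_after_filter` are made-up names for illustration only, not existing 
DataFusion API, and the derives are added just so the example runs on its own.

   ```rust
   /// Copy of the proposed struct, with derives added for this example.
   #[derive(Debug, Clone, Copy, PartialEq)]
   struct Boundary<T: PartialOrd> {
       val: T,
       is_lower_bound: bool,
       is_upper_bound: bool,
   }

   impl<T: PartialOrd> Boundary<T> {
       /// Hypothetical constructor: the value is known exactly.
       fn exact(val: T) -> Self {
           Self { val, is_lower_bound: true, is_upper_bound: true }
       }

       /// Hypothetical constructor: the true value is `<= val`.
       fn upper_bound(val: T) -> Self {
           Self { val, is_lower_bound: false, is_upper_bound: true }
       }

       fn is_exact(&self) -> bool {
           self.is_lower_bound && self.is_upper_bound
       }
   }

   /// Hypothetical helper: row count after a filter of unknown selectivity.
   /// The filter can only remove rows, so an upper bound stays an upper
   /// bound, but the value is no longer a lower bound (hence not exact).
   fn num_rows_after_filter(input: Boundary<usize>) -> Boundary<usize> {
       Boundary { is_lower_bound: false, ..input }
   }

   fn main() {
       let scan = Boundary::exact(1_000usize);
       let filtered = num_rows_after_filter(scan);

       assert!(scan.is_exact());
       assert!(!filtered.is_exact());
       assert_eq!(filtered, Boundary::upper_bound(1_000usize));
   }
   ```

   With this, "overestimate" for `num_rows` just means `is_upper_bound` is set, 
and no assumption about the value distribution is needed.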


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
