suremarc commented on issue #8078:
URL: https://github.com/apache/datafusion/issues/8078#issuecomment-2545969283

   > I think this is a great insight. Maybe the problem is that we are trying 
to overload Statistics to keep both types of information (statistics and 
bounds) 🤔
   
   Having attempted to implement a new `Precision` API looking something like 
this:
   
   ```rust
   struct Precision<T> {
       lower: Option<T>,
       upper: Option<T>,
       point_estimate: Option<T>,
   }
   ```
   
   I did indeed notice that just about all of the code dealing with cardinality 
estimates (`total_byte_size`, `distinct_count`, `null_count`, `num_rows`, etc.) 
only cared about `point_estimate`. Meanwhile, for column min/max statistics, we 
only really care about the upper & lower bounds. 
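
To make that split concrete, here is a minimal sketch of how the two usage patterns might look on top of the struct above. The constructors (`estimate`, `bounded`, `exact`) and `is_exact` are hypothetical helpers invented for illustration and are not part of DataFusion.

```rust
/// Sketch of the proposed layout. The helper methods below are
/// hypothetical and only illustrate the two usage patterns.
#[derive(Debug, Clone, PartialEq)]
pub struct Precision<T> {
    pub lower: Option<T>,
    pub upper: Option<T>,
    pub point_estimate: Option<T>,
}

impl<T: Clone + PartialEq> Precision<T> {
    /// Cardinality-style statistic: a best guess with no bounds.
    pub fn estimate(value: T) -> Self {
        Self { lower: None, upper: None, point_estimate: Some(value) }
    }

    /// Min/max-style statistic: conservative bounds, no point estimate.
    pub fn bounded(lower: T, upper: T) -> Self {
        Self { lower: Some(lower), upper: Some(upper), point_estimate: None }
    }

    /// A value known exactly: both bounds and the estimate coincide.
    pub fn exact(value: T) -> Self {
        Self {
            lower: Some(value.clone()),
            upper: Some(value.clone()),
            point_estimate: Some(value),
        }
    }

    /// True only when both bounds are present and equal.
    pub fn is_exact(&self) -> bool {
        matches!((&self.lower, &self.upper), (Some(l), Some(u)) if l == u)
    }
}
```

Under this sketch, `num_rows` would typically be built with `Precision::estimate(..)`, while a column's value range would be built with `Precision::bounded(..)`.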
   
   Notwithstanding, I had already taken a stab at implementing this layout for 
`ColumnStatistics` (which is what @alamb proposed):
   ```rust
   pub struct ColumnStatistics {
       pub null_count: Precision<usize>,
       // The value (or range) that this column takes on
       pub value: Precision<ScalarValue>,
       pub distinct_count: Precision<usize>,
   }
   ```
   
   As far as I can gather from looking at the codebase, current use cases for 
column statistics center around cardinality estimates (`null_count` and 
`distinct_count`) and lower/upper bounds (which is what `value` provides). 
However, this API is technically a regression, as we lose the ability to 
express when the lower/upper bounds are "exact". My use case for #13296 only 
needs conservative min/max bounds, so I would be happy to make this change. 
That said, having exact min/maxes would make certain optimizations possible, 
such as evaluating `MIN(value)` or `MAX(value)` purely from column statistics 
without reading any data. 
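
The difference between the two use cases can be sketched as follows. This is an illustration only: `Bound`, `MinMax`, `min_from_stats`, and `may_contain` are hypothetical names, not DataFusion APIs. Pruning only needs conservative bounds, whereas answering `MIN(value)` from statistics alone is only sound when the bound is known to be exact.

```rust
/// Hypothetical bound that remembers whether it is exact.
#[derive(Debug, Clone, PartialEq)]
pub enum Bound<T> {
    /// The true min/max of the data.
    Exact(T),
    /// Guaranteed to enclose the data, but possibly loose.
    Conservative(T),
    /// Nothing is known.
    Unknown,
}

/// Hypothetical min/max statistics for one column.
pub struct MinMax<T> {
    pub min: Bound<T>,
    pub max: Bound<T>,
}

/// Answers `MIN(value)` without reading data, but only if the
/// lower bound is exact; otherwise the data must be scanned.
pub fn min_from_stats<T: Clone>(stats: &MinMax<T>) -> Option<T> {
    match &stats.min {
        Bound::Exact(v) => Some(v.clone()),
        _ => None,
    }
}

/// Pruning check: returns false only when `probe` is provably
/// outside the range. Conservative bounds are sufficient here.
pub fn may_contain<T: PartialOrd>(stats: &MinMax<T>, probe: &T) -> bool {
    let below = match &stats.min {
        Bound::Exact(lo) | Bound::Conservative(lo) => probe < lo,
        Bound::Unknown => false,
    };
    let above = match &stats.max {
        Bound::Exact(hi) | Bound::Conservative(hi) => probe > hi,
        Bound::Unknown => false,
    };
    !(below || above)
}
```

With a single `Precision<ScalarValue>` holding only `lower` and `upper`, the `Exact` versus `Conservative` distinction in this sketch has nowhere to live, which is the regression described above.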
   
   The other option is to keep `ColumnStatistics` the same (keep both 
`min_value` and `max_value`), and include a `Precision` for both `min_value` 
and `max_value`. But this will increase the size of `Statistics` significantly. 
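
A rough way to see the size concern is to compare the two layouts with `std::mem::size_of`. This is a sketch under stated assumptions: `FakeScalar` is a stand-in with an arbitrary 64-byte payload rather than the real `ScalarValue`, and `SingleValue`, `MinAndMax`, and `layout_sizes` are hypothetical names. Actual numbers in DataFusion would differ.

```rust
use std::mem::size_of;

/// Stand-in for `ScalarValue`; the 64-byte payload is an arbitrary assumption.
#[allow(dead_code)]
pub struct FakeScalar([u8; 64]);

/// Same shape as the proposed `Precision<T>`.
#[allow(dead_code)]
pub struct Precision<T> {
    lower: Option<T>,
    upper: Option<T>,
    point_estimate: Option<T>,
}

/// Layout A: one `Precision` covering the whole value range.
#[allow(dead_code)]
pub struct SingleValue {
    value: Precision<FakeScalar>,
}

/// Layout B: a separate `Precision` for each of min and max.
#[allow(dead_code)]
pub struct MinAndMax {
    min_value: Precision<FakeScalar>,
    max_value: Precision<FakeScalar>,
}

/// Returns the in-memory sizes of (layout A, layout B) in bytes.
pub fn layout_sizes() -> (usize, usize) {
    (size_of::<SingleValue>(), size_of::<MinAndMax>())
}
```

Layout B stores six optional scalars per column instead of three, so its footprint grows accordingly for every column in `Statistics`.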


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

