alamb opened a new issue, #8078:
URL: https://github.com/apache/arrow-datafusion/issues/8078

   ### Is your feature request related to a problem or challenge?
   
   This has come up a few times, most recently in discussions with 
@berkaysynnada  on 
https://github.com/apache/arrow-rs/issues/5037#issuecomment-1796384939
   
   Use case 1: for large binary/string columns, formats like Parquet 
allow storing a truncated value that does not actually appear in the data. 
Because these values are stored in the min/max metadata, truncating them 
keeps the metadata small.
   
   For example, for a string column that has very long values, it requires much 
less space to store a short value slightly _lower_ than the actual minimum as 
the "minimum" statistics value, and one that is slightly _higher_ than the 
actual maximum as the "maximum" statistics value.
   
   For example:
   
   | actual min in data | actual max in data | "min" value in statistics | "max" value in statistics |
   |--------|--------|--------|--------|
   | `aaa......z` | `qqq......q` | `a` | `r` |
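   
   A minimal sketch of how such truncated bounds could be computed, working on
   raw bytes. The helper names `truncated_lower_bound` and
   `truncated_upper_bound` are hypothetical and are not part of Parquet or
   DataFusion; this only illustrates why the stored values are valid bounds.
   
   ```rust
   /// Hypothetical helper: a lower bound of at most `len` bytes.
   /// Any prefix of `value` compares less than or equal to `value`.
   fn truncated_lower_bound(value: &[u8], len: usize) -> Vec<u8> {
       value[..value.len().min(len)].to_vec()
   }

   /// Hypothetical helper: an upper bound of at most `len` bytes.
   /// Returns `None` when no such bound exists (every kept byte is 0xFF).
   fn truncated_upper_bound(value: &[u8], len: usize) -> Option<Vec<u8>> {
       if value.len() <= len {
           return Some(value.to_vec()); // nothing to truncate
       }
       let mut bound = value[..len].to_vec();
       // Drop trailing 0xFF bytes, then increment the last remaining byte.
       while let Some(last) = bound.pop() {
           if last < u8::MAX {
               bound.push(last + 1);
               return Some(bound);
           }
       }
       None
   }

   /// Reproduces the table row above with a one-byte limit.
   fn truncation_example() -> (Vec<u8>, Option<Vec<u8>>) {
       let actual_min = b"aaaaaaaaaz";
       let actual_max = b"qqqqqqqqqq";
       (
           truncated_lower_bound(actual_min, 1),
           truncated_upper_bound(actual_max, 1),
       )
   }
   ```
   
   With a one-byte limit this yields `a` and `r`: the stored "min" sorts at or
   below the real minimum and the stored "max" sorts at or above the real
   maximum, but neither needs to appear in the data.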
   
   
   There is a similar use case when applying a Filter, as described by @korowa 
on 
https://github.com/apache/arrow-datafusion/issues/5646#issuecomment-1796178380, 
and we have a similar one in IOx, where the operator may remove values but 
won't decrease the minimum value or increase the maximum value in any column.
   
   Currently 
[`Precision`](https://github.com/apache/arrow-datafusion/blob/e95e3f89c97ae27149c1dd8093f91a5574210fe6/datafusion/common/src/stats.rs#L29-L36)
 only represents `Exact` and `Inexact`; there is no way to represent "inexact, 
but bounded above/below".
   
   ### Describe the solution you'd like
   
   Per @berkaysynnada, I propose changing `Precision::Inexact` to a new 
variant `Precision::Between`, which would store an 
[`Interval`](https://docs.rs/datafusion/latest/datafusion/physical_expr/intervals/struct.Interval.html)
 of known min/maxes of the value.
    
   ```rust
   enum Precision {
     ...
     /// The value is known to be in the specified interval
     Between(Interval)
   }
   ```
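   
   To make the idea concrete, here is a rough, self-contained sketch of how a
   `Between` variant might be consumed. The `Interval<T>` struct below is a
   simplified closed-range stand-in, not DataFusion's real `Interval`; the
   `Absent` variant and the helpers `lower_bound`, `upper_bound` and
   `can_skip_gt` are assumptions made for illustration only.
   
   ```rust
   /// Simplified stand-in for an interval: the closed range [lower, upper].
   #[derive(Debug, Clone, PartialEq)]
   struct Interval<T> {
       lower: T,
       upper: T,
   }

   #[derive(Debug, Clone, PartialEq)]
   enum Precision<T> {
       /// The value is known exactly
       Exact(T),
       /// The value is known to be in the specified interval
       Between(Interval<T>),
       /// Nothing is known about the value
       Absent,
   }

   impl<T> Precision<T> {
       /// A value guaranteed to be <= the true value, if one is known.
       fn lower_bound(&self) -> Option<&T> {
           match self {
               Precision::Exact(v) => Some(v),
               Precision::Between(i) => Some(&i.lower),
               Precision::Absent => None,
           }
       }

       /// A value guaranteed to be >= the true value, if one is known.
       fn upper_bound(&self) -> Option<&T> {
           match self {
               Precision::Exact(v) => Some(v),
               Precision::Between(i) => Some(&i.upper),
               Precision::Absent => None,
           }
       }
   }

   /// Hypothetical pruning check for the predicate `column > literal`, given
   /// the column's "max" statistic: if even the upper bound of the max is not
   /// greater than the literal, no row can match.
   fn can_skip_gt<T: PartialOrd>(max: &Precision<T>, literal: &T) -> bool {
       match max.upper_bound() {
           Some(upper) => upper <= literal,
           None => false,
       }
   }
   ```
   
   With the string example above, a "max" statistic of `Between { lower: "q",
   upper: "r" }` would let a scan skip `column > "s"` even though the exact
   maximum is unknown, which a plain inexact estimate cannot safely do.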
   
   This is quite a general formulation: it can describe "how" inexact the 
values are, and it is very expressive (an `Interval` can represent 
open/closed bounds, etc.).
   
   ### Describe alternatives you've considered
   
   There is also a possibility of introducing a simpler, but more limited 
version of these statistics, like:
   
   ```rust
   enum Precision {
     // The value is known to be within the range (it is at most this large
     // for Max, or at least this large for Min),
     // but the actual values may be lower/higher.
     Bounded(ScalarValue)
   }
   ```
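   
   As a sketch of how this simpler variant could serve the Filter use case
   described earlier: a filter only removes rows, so an exact min/max from its
   input is still a valid bound on its output. The generic parameter `T` stands
   in for `ScalarValue`, and the `Absent` variant and the `after_filter` method
   are hypothetical names chosen for this illustration.
   
   ```rust
   #[derive(Debug, Clone, PartialEq)]
   enum Precision<T> {
       /// The value is known exactly
       Exact(T),
       /// The value is only an estimate, with no guarantee
       Inexact(T),
       /// The value is a guaranteed bound (upper for Max, lower for Min)
       Bounded(T),
       /// Nothing is known about the value
       Absent,
   }

   impl<T> Precision<T> {
       /// Statistics for the output of a filter, given those of its input.
       /// An exact min/max is demoted to a bound; estimates, existing bounds
       /// and absent statistics pass through unchanged.
       fn after_filter(self) -> Self {
           match self {
               Precision::Exact(v) => Precision::Bounded(v),
               other => other,
           }
       }
   }
   ```
   
   The trade-off relative to `Between` is that a single `Bounded` value only
   constrains one side, so the meaning depends on whether it is attached to a
   min or a max statistic.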
   
   ### Additional context
   
   _No response_

