alamb opened a new issue, #8078: URL: https://github.com/apache/arrow-datafusion/issues/8078
### Is your feature request related to a problem or challenge?

This has come up a few times, most recently in discussions with @berkaysynnada on https://github.com/apache/arrow-rs/issues/5037#issuecomment-1796384939

Use case 1: for large binary/string columns, formats like parquet allow storing a truncated value that does not actually appear in the data. Because these values are stored in the min/max metadata, truncating them keeps the size of the metadata down.

For a string column with very long values, it takes much less space to store a short value slightly _lower_ than the actual minimum as the "minimum" statistics value, and one slightly _higher_ than the actual maximum as the "maximum" statistics value.

For example:

| actual min in data | actual max in data | "min" value in statistics | "max" value in statistics |
|--------|--------|--------|--------|
| `aaa......z` | `qqq......q` | `a` | `r` |

There is a similar use case when applying a Filter, as described by @korowa on https://github.com/apache/arrow-datafusion/issues/5646#issuecomment-1796178380, and we have a similar one in IOx, where the operator may remove values but won't decrease the minimum value or increase the maximum value in any column.

Currently [`Precision`](https://github.com/apache/arrow-datafusion/blob/e95e3f89c97ae27149c1dd8093f91a5574210fe6/datafusion/common/src/stats.rs#L29-L36) only represents `Exact` and `Inexact`; there is no way to represent "inexact, but bounded above/below".

### Describe the solution you'd like

Per @berkaysynnada, I propose changing `Precision::Inexact` to a new variant `Precision::Between`, which would store an [`Interval`](https://docs.rs/datafusion/latest/datafusion/physical_expr/intervals/struct.Interval.html) of the known min/max of the value.

```rust
enum Precision {
    ...
    /// The value is known to be in the specified interval
    Between(Interval)
}
```

This is quite a general formulation, and it can describe "how" inexact the values are. It has the benefit of being very expressive (Intervals can represent open/closed bounds, etc.).

### Describe alternatives you've considered

Another possibility is to introduce a simpler, but more limited, version of these statistics, like:

```rust
enum Precision {
    // The value is known to be within the range (it is at most this large for Max, or at least this large for Min)
    // but the actual values may be lower/higher.
    Bounded(ScalarValue)
}
```

### Additional context

_No response_
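To make the `Bounded` alternative above more concrete, here is a minimal sketch of how a bound-aware statistic could be used for pruning. Everything in it is an assumption for illustration: the generic `Precision<T>` is a simplified stand-in rather than DataFusion's actual type, `Absent` stands for "nothing known", and `as_bound` and `can_prune_eq` are hypothetical helpers that do not exist in DataFusion.

```rust
// Hypothetical, simplified stand-in for DataFusion's `Precision`; not the real type.
#[derive(Debug, Clone, PartialEq)]
enum Precision<T> {
    /// The value is known exactly.
    Exact(T),
    /// The value is a guaranteed bound: a stored "min" is <= the true minimum,
    /// and a stored "max" is >= the true maximum.
    Bounded(T),
    /// The value is only an estimate, with no guarantee in either direction.
    Inexact(T),
    /// Nothing is known about the value.
    Absent,
}

impl<T> Precision<T> {
    /// Returns the value only when it is safe to treat it as a bound.
    fn as_bound(&self) -> Option<&T> {
        match self {
            Precision::Exact(v) | Precision::Bounded(v) => Some(v),
            Precision::Inexact(_) | Precision::Absent => None,
        }
    }
}

/// Returns true if the predicate `col = target` provably matches no rows,
/// given the min/max statistics of `col`.
fn can_prune_eq(min: &Precision<String>, max: &Precision<String>, target: &str) -> bool {
    // `target` lies strictly below a guaranteed lower bound.
    let below_min = min.as_bound().map_or(false, |m| target < m.as_str());
    // `target` lies strictly above a guaranteed upper bound.
    let above_max = max.as_bound().map_or(false, |m| target > m.as_str());
    below_min || above_max
}
```

With the truncated statistics from the table (`min = Bounded("a")`, `max = Bounded("r")`), `can_prune_eq(&min, &max, "zebra")` returns true, while a lookup for `"hello"` cannot be pruned. If the max were only `Inexact("r")`, the same `"zebra"` lookup could not be pruned, which is the information that is lost today.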
