Re: [I] Introduce a way to represent constrained statistics / bounds on values in Statistics [arrow-datafusion]


alamb commented on issue #8078:
URL: 
https://github.com/apache/arrow-datafusion/issues/8078#issuecomment-1800080462


   I think there is a mismatch between the current use of `Precision::Inexact` 
and its documentation.
   
   Specifically, `Precision::Inexact` appears to be treated as a "conservative 
estimate" in several places (which is actually what we need in IOx), so perhaps 
another option would be to rename `Precision::Inexact` to 
`Precision::Conservative` and document that it is a conservative estimate of 
the actual value 🤔 
   
   For example, the comments on `Precision::Inexact` say:
   
   
https://github.com/apache/arrow-datafusion/blob/c3430d71179d68536008cd7272f4f57b7f50d4a2/datafusion/statistics/src/statistics.rs#L32-L33
   
   However, it is used to skip processing files when the (inexact) row count is 
above the fetch/limit:
   
   
https://github.com/apache/arrow-datafusion/blob/87aeef5aa4d8f38a8328f8e51e530e6c9cd9afa9/datafusion/core/src/datasource/statistics.rs#L69-L73
   
   I am pretty sure this is only valid if the estimate of the number of rows is 
conservative (aka the real value is at least as large as the statistics), but 
the code uses the value even for `Precision::Inexact`:
   
   
   
https://github.com/apache/arrow-datafusion/blob/c3430d71179d68536008cd7272f4f57b7f50d4a2/datafusion/statistics/src/statistics.rs#L42-L46
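
   To make the concern concrete, here is a minimal Rust sketch of limit-based 
file pruning. The `Precision` enum and the `files_kept` helper below are 
simplified, hypothetical stand-ins written for illustration; they are not 
DataFusion's actual types or code.

```rust
/// Simplified stand-in for a row-count statistic (assumption: not the real type).
#[derive(Debug, Clone, Copy, PartialEq)]
enum Precision {
    Exact(usize),
    Inexact(usize),
    Absent,
}

impl Precision {
    /// Returns the stored value regardless of whether it is exact or inexact.
    fn get_value(&self) -> Option<usize> {
        match self {
            Precision::Exact(v) | Precision::Inexact(v) => Some(*v),
            Precision::Absent => None,
        }
    }
}

/// Hypothetical helper: number of leading files kept when scanning stops as
/// soon as the summed *reported* row counts exceed `limit`.
fn files_kept(reported: &[Precision], limit: usize) -> usize {
    let mut total = 0usize;
    for (i, p) in reported.iter().enumerate() {
        match p.get_value() {
            Some(v) => total += v,
            // Unknown row count: pruning is not safe, keep every file.
            None => return reported.len(),
        }
        if total > limit {
            return i + 1;
        }
    }
    reported.len()
}
```

   With reported counts `[Inexact(100), Inexact(100)]` and a limit of 50, 
`files_kept` returns 1. If the first file really holds only 10 rows, the 
second file is skipped even though it is needed to satisfy the limit, so the 
pruning is only safe when the reported count never exceeds the real one.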
   
   Another example is `ColumnStatistics::is_singleton`, which I think is also 
only correct if the statistics are conservative (aka the actual min is no lower 
than the reported min and the actual max is no larger than the reported max)
   
   
https://github.com/apache/arrow-datafusion/blob/c3430d71179d68536008cd7272f4f57b7f50d4a2/datafusion/statistics/src/statistics.rs#L281-L286
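
   The same issue can be sketched for a singleton check. The `ColumnStats` 
struct, the generic `Precision<T>` enum, and `is_singleton_exact` below are 
hypothetical simplifications for illustration (using `i64` values); the real 
implementation linked above may differ in detail.

```rust
/// Simplified stand-in for a statistic with a precision tag (assumption).
#[derive(Debug, Clone, Copy, PartialEq)]
enum Precision<T> {
    Exact(T),
    Inexact(T),
    Absent,
}

impl<T: Copy> Precision<T> {
    /// Returns the stored value, ignoring whether it is exact or inexact.
    fn get_value(&self) -> Option<T> {
        match self {
            Precision::Exact(v) | Precision::Inexact(v) => Some(*v),
            Precision::Absent => None,
        }
    }
}

/// Hypothetical, simplified column statistics.
struct ColumnStats {
    min_value: Precision<i64>,
    max_value: Precision<i64>,
}

impl ColumnStats {
    /// Precision-agnostic check: true whenever the reported min equals the
    /// reported max, even if both are only estimates.
    fn is_singleton(&self) -> bool {
        match (self.min_value.get_value(), self.max_value.get_value()) {
            (Some(min), Some(max)) => min == max,
            _ => false,
        }
    }

    /// Hypothetical stricter variant: only trusts exact bounds.
    fn is_singleton_exact(&self) -> bool {
        matches!(
            (self.min_value, self.max_value),
            (Precision::Exact(min), Precision::Exact(max)) if min == max
        )
    }
}
```

   With `min_value` and `max_value` both `Inexact(5)`, `is_singleton` returns 
`true` even if the column actually contains, say, 3 and 7, while 
`is_singleton_exact` returns `false`. The precision-agnostic check is only 
sound if the inexact bounds are guaranteed to enclose the real values.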
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
