alamb commented on issue #8078: URL: https://github.com/apache/arrow-datafusion/issues/8078#issuecomment-1800080462
I think there is a mismatch between the current use of `Precision::Inexact` and its documentation.

Specifically, `Precision::Inexact` appears to be treated as a "conservative estimate" in several places (which is actually what we need in IOx), so perhaps another option would be to rename `Precision::Inexact` to `Precision::Conservative` and document that it is a conservative estimate of the actual value 🤔

For example, the comments on `Precision::Inexact` say

https://github.com/apache/arrow-datafusion/blob/c3430d71179d68536008cd7272f4f57b7f50d4a2/datafusion/statistics/src/statistics.rs#L32-L33

However, it is used to skip processing files when the (inexact) row count is above the fetch/limit:

https://github.com/apache/arrow-datafusion/blob/87aeef5aa4d8f38a8328f8e51e530e6c9cd9afa9/datafusion/core/src/datasource/statistics.rs#L69-L73

I am pretty sure this is only valid if the estimate of the number of rows is conservative (aka the real value is at least as large as the statistics), but the code uses the value even for `Precision::Inexact`:

https://github.com/apache/arrow-datafusion/blob/c3430d71179d68536008cd7272f4f57b7f50d4a2/datafusion/statistics/src/statistics.rs#L42-L46

Another example is `ColumnStatistics::is_singleton`, which I think is also only correct if the statistics are conservative (aka the actual min is no lower than the reported min and the actual max is no larger than the reported max):

https://github.com/apache/arrow-datafusion/blob/c3430d71179d68536008cd7272f4f57b7f50d4a2/datafusion/statistics/src/statistics.rs#L281-L286
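
To make the limit case concrete, here is a minimal sketch of why skipping files is only safe when the row count is a lower bound. It does not use the DataFusion crate: `RowCount` is a local stand-in for `Precision<usize>`, the `LowerBound` variant and the `files_needed` helper are hypothetical names invented for this illustration, and the real pruning logic is whatever the links above show.

```rust
/// Local stand-in for a row-count statistic. `LowerBound` is a hypothetical
/// variant, playing the role of the proposed "conservative" precision.
#[derive(Debug, Clone, Copy, PartialEq)]
enum RowCount {
    /// The file contains exactly this many rows.
    Exact(usize),
    /// Hypothetical: the file contains at least this many rows.
    LowerBound(usize),
    /// A plain estimate; the real count may be smaller or larger.
    Inexact(usize),
    /// Nothing is known.
    Absent,
}

impl RowCount {
    /// Number of rows the file is guaranteed to contain, if there is a guarantee.
    fn guaranteed_min(self) -> Option<usize> {
        match self {
            RowCount::Exact(n) | RowCount::LowerBound(n) => Some(n),
            RowCount::Inexact(_) | RowCount::Absent => None,
        }
    }
}

/// Hypothetical helper: how many leading files must be kept so that a
/// `LIMIT limit` query is certain to see enough rows. Files whose count
/// carries no guarantee are kept but contribute nothing to the running total.
fn files_needed(stats: &[RowCount], limit: usize) -> usize {
    let mut guaranteed = 0usize;
    for (i, s) in stats.iter().enumerate() {
        if guaranteed >= limit {
            // Everything from file `i` onward can be skipped safely.
            return i;
        }
        if let Some(n) = s.guaranteed_min() {
            guaranteed = guaranteed.saturating_add(n);
        }
    }
    stats.len()
}
```

With three files of `Exact(10)` or `LowerBound(10)` and a limit of 15, the third file can be skipped. With three files of `Inexact(10)` none can, because an estimate that overshoots the real row count would cause the query to return too few rows.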
