crepererum opened a new issue, #5613: URL: https://github.com/apache/arrow-datafusion/issues/5613
**Describe the bug**

It is unclear what `Statistics::is_exact` = `false` means. The docs are here:

https://github.com/apache/arrow-datafusion/blob/a578150e63e344fbaa7d13eda58544482dea4729/datafusion/common/src/stats.rs#L34-L37

These state for this case:

> may contain an **inexact estimate** and may not be the actual value

What does "inexact" mean? Some potential definitions (we only consider `Some(...)` fields here!):

- **underestimate:** There are values within the data source that are NOT included within the statistics, i.e. the statistics do NOT cover the whole range. This could happen when you sample statistics from a larger data source.
- **overestimate:** All values from the data stream are covered by the statistics, but the range might be too large. This can happen when a source doesn't fold predicates into the statistics (which in general is pretty hard to do).
- **both:** The statistics are only a rough guide.

I think there is a pretty important difference between "overestimate" and "both", because the former allows you to prune execution branches or entire operations (e.g. sorts in some cases) while the latter can only be used to re-order operations (e.g. joins) or select a concrete operation from a pool (e.g. type of join).

Side note: Due to predicate pushdown it will be pretty unlikely that there will be exact statistics for any realistic data sources.

**Expected behavior**

Clarify behavior.

**Additional context**

Cross-ref #997.
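To make the difference between "overestimate" and "both" concrete, here is a minimal sketch of why only the former permits pruning. Everything in it (`Bound`, `Range`, `can_prune`) is a hypothetical illustration and not part of DataFusion's `Statistics` API; it assumes min/max statistics over a single integer column and a predicate that only matches a closed interval.

```rust
/// Hypothetical classification of how a `Some(...)` min/max statistic relates
/// to the real data. None of these names exist in DataFusion.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Bound {
    /// Statistics match the data exactly.
    Exact,
    /// Every real value lies inside [min, max]; the range may be too wide.
    Overestimate,
    /// Some real values may lie outside [min, max] (e.g. sampled statistics).
    Underestimate,
    /// Rough guide only; no containment guarantee in either direction.
    Both,
}

/// Closed integer interval [min, max].
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct Range {
    pub min: i64,
    pub max: i64,
}

impl Range {
    /// True when the two closed intervals share no value.
    pub fn is_disjoint(&self, other: &Range) -> bool {
        self.max < other.min || other.max < self.min
    }
}

/// Whether a scan can be skipped for a predicate that only matches `wanted`.
/// Pruning is sound only if the statistics are guaranteed to cover all data,
/// which holds for `Exact` and `Overestimate` but not for the other two cases.
pub fn can_prune(stats: Range, bound: Bound, wanted: Range) -> bool {
    let covers_all_data = matches!(bound, Bound::Exact | Bound::Overestimate);
    covers_all_data && stats.is_disjoint(&wanted)
}
```

With statistics `[0, 50]` and a predicate that needs values in `[80, 100]`, `can_prune` returns `true` for `Bound::Overestimate` but `false` for `Bound::Underestimate` or `Bound::Both`, since in those cases the source may still contain a value such as 90 that the statistics missed. Under "both", the statistics could still feed cost-based decisions such as join ordering, which this sketch does not model.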
