crepererum opened a new issue, #5613:
URL: https://github.com/apache/arrow-datafusion/issues/5613

   **Describe the bug**
   It is unclear what `Statistics::is_exact` = `false` means. The docs are here:
   
   
https://github.com/apache/arrow-datafusion/blob/a578150e63e344fbaa7d13eda58544482dea4729/datafusion/common/src/stats.rs#L34-L37
   
   For this case, they state:
   
   > may contain an **inexact estimate** and may not be the actual value
   
   What does "inexact" mean? Some potential definitions (we only consider 
`Some(...)` fields here!):
   
   - **underestimate:** The data source contains values that are NOT covered 
by the statistics, i.e. the statistics do NOT span the whole range. This can 
happen when statistics are sampled from a larger data source.
   - **overestimate:** All values from the data source are covered by the 
statistics, but the reported range may be wider than the actual one. This can 
happen when a source doesn't fold predicates into its statistics (which in 
general is pretty hard to do).
   - **both:** The statistics are only a rough guide and may be off in either 
direction.
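
   The first two cases can be illustrated with a minimal Rust sketch. `Range`, `range_of`, `covers`, and `sampled_stats` are hypothetical names made up for this illustration and are not part of DataFusion's `Statistics` API:

```rust
/// Hypothetical closed value range [min, max]; not a DataFusion type.
#[derive(Debug, Clone, Copy, PartialEq)]
struct Range {
    min: i64,
    max: i64,
}

/// Exact min/max of a slice, or `None` if the slice is empty.
fn range_of(values: &[i64]) -> Option<Range> {
    let min = *values.iter().min()?;
    let max = *values.iter().max()?;
    Some(Range { min, max })
}

/// True if `stats` includes every value in `actual`,
/// i.e. `stats` is exact or an overestimate.
fn covers(stats: Range, actual: Range) -> bool {
    stats.min <= actual.min && stats.max >= actual.max
}

/// Min/max computed from every `step`-th value only (`step` must be > 0).
/// Extremes that fall between samples are missed, so the result can be
/// an underestimate of the true range.
fn sampled_stats(values: &[i64], step: usize) -> Option<Range> {
    let sample: Vec<i64> = values.iter().copied().step_by(step).collect();
    range_of(&sample)
}
```

   For the data `[5, 1, 9, 3, 7]`, sampling every second value yields the range 5 to 9 and misses the true minimum 1 (underestimate). Conversely, reporting the unfiltered range 1 to 9 for a stream that was filtered with `x > 4` still covers every value, but is wider than the actual range 5 to 9 (overestimate).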
   
   I think there is a pretty important difference between "overestimate" and 
"both": because an overestimate still covers every value, it allows you to 
prune execution branches or entire operations (e.g. sorts in some cases), while 
"both" can only be used to re-order operations (e.g. joins) or to select a 
concrete operation from a pool (e.g. type of join).
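
   To make the difference concrete, here is a hedged sketch of how an optimizer could gate pruning on the kind of guarantee. `Guarantee`, `ColumnStats`, `can_prune_gt`, and `build_side_is_left` are hypothetical names for illustration only; DataFusion currently exposes just the single `is_exact` flag:

```rust
/// Hypothetical guarantee attached to a min/max statistic.
/// Not part of DataFusion; it mirrors the definitions listed above.
#[derive(Debug, Clone, Copy, PartialEq)]
enum Guarantee {
    Exact,
    Overestimate,
    Underestimate,
    RoughGuide, // the "both" case
}

/// Hypothetical per-column statistics.
#[derive(Debug, Clone, Copy)]
struct ColumnStats {
    min: i64,
    max: i64,
    guarantee: Guarantee,
}

impl ColumnStats {
    /// Whether [min, max] is guaranteed to contain every actual value.
    fn is_conservative(&self) -> bool {
        matches!(self.guarantee, Guarantee::Exact | Guarantee::Overestimate)
    }

    /// Whether a branch filtering on `col > threshold` can be dropped.
    /// This is only safe if no value outside [min, max] can exist.
    fn can_prune_gt(&self, threshold: i64) -> bool {
        self.is_conservative() && self.max <= threshold
    }
}

/// Choosing the smaller input as the build side of a hash join.
/// Any estimate is usable here: a wrong guess costs performance,
/// not correctness.
fn build_side_is_left(left_rows: usize, right_rows: usize) -> bool {
    left_rows <= right_rows
}
```

   Under these assumptions, `can_prune_gt` returns `true` only for exact or overestimated statistics, whereas `build_side_is_left` never needs to inspect the guarantee at all.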
   
   Side note: Due to predicate pushdown, exact statistics are pretty unlikely 
for any realistic data source.
   
   **Expected behavior**
   Clarify behavior.
   
   **Additional context**
   Cross-ref #997.
   

