findepi commented on PR #13293:
URL: https://github.com/apache/datafusion/pull/13293#issuecomment-2465085079

   With a single estimated value, the conceptual idea is that the random
variable we're estimating has a distribution "condensed" around that value. One
can imagine this being a normal distribution with the value as the mean.
   
   Obviously, this is an over-simplification. Not everything is normally
distributed; non-negative numbers, for example, are not. But this mental model
is easy to work with.
   
   While exact ranges are super useful (for example, for predicate derivation
and pruning), inexact ranges as a statistics model pose the problem of how to
interpret the value, e.g. when judging which side of a join is smaller, or
when estimating a filter on top of some other computation. It's tempting to
capture uncertainty as ever-widening ranges and, in the end, to interpret the
range as its middle value.
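
   To make the "ever-widening ranges" idea concrete, here is a minimal Python
sketch. `RowRange`, `apply_filter` and `midpoint` are hypothetical names made
up for illustration; they are not DataFusion APIs, and the numbers are invented.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RowRange:
    """Hypothetical inexact row-count estimate: the true value is assumed
    to lie somewhere in [low, high]."""
    low: float
    high: float

    def apply_filter(self, sel_low: float, sel_high: float) -> "RowRange":
        # The filter's selectivity is itself uncertain, so each bound is
        # scaled by the matching selectivity bound and the range widens.
        return RowRange(self.low * sel_low, self.high * sel_high)

    def midpoint(self) -> float:
        # Collapse the range to a single number for comparisons.
        return (self.low + self.high) / 2

    def relative_width(self) -> float:
        # Width of the range relative to its midpoint (0.0 for an empty input).
        mid = self.midpoint()
        return (self.high - self.low) / mid if mid else 0.0


scan = RowRange(800, 1200)                # relative width 0.4
filtered = scan.apply_filter(0.25, 0.5)   # RowRange(200.0, 600.0), relative width 1.0
other_side = RowRange(380, 400)           # a tight estimate for the other join input

# Comparing midpoints picks `other_side` (390 < 400) as the smaller input,
# even though `filtered` could be as small as 200 rows.
smaller = min(filtered, other_side, key=RowRange.midpoint)
```

   In this sketch the relative width grows from 0.4 to 1.0 after a single
filter, and the final midpoint comparison discards that information again.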
   
   This is definitely a more complex mental model, and it will surely result in
more complex code. Will it also result in better estimates? Maybe.
   
   There are also other alternatives to consider:
   
   - histograms. Some optimizers (e.g. MySQL 8) use them to capture ranges of
values together with their cardinality, which makes it possible to derive a new
histogram after applying a range filter.
   - run-time adaptivity. There is only so much the optimizer can do a priori,
before seeing the data. At some point of maturity, making the optimizer smarter
no longer makes queries return faster. However, handling skew detected at run
time, or being able to replan, are other, very powerful techniques.
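
   As an illustration of the histogram alternative, the following Python sketch
derives a new histogram after a range filter. `Bucket` and `filter_range` are
hypothetical names, the bucket layout is invented, and the uniform-within-bucket
assumption is mine; this is not a description of how MySQL or any other
optimizer implements it.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Bucket:
    low: float    # inclusive lower bound of the value range
    high: float   # exclusive upper bound of the value range
    rows: float   # estimated number of rows with a value in [low, high)


def filter_range(hist, lo, hi):
    """Derive the histogram of rows that satisfy lo <= value < hi."""
    out = []
    for b in hist:
        new_low, new_high = max(b.low, lo), min(b.high, hi)
        if new_low >= new_high:
            continue  # bucket lies entirely outside the filter range
        # Assumption: values are spread uniformly within a bucket, so the
        # surviving rows are proportional to the surviving part of the range.
        fraction = (new_high - new_low) / (b.high - b.low)
        out.append(Bucket(new_low, new_high, b.rows * fraction))
    return out


hist = [Bucket(0, 10, 100), Bucket(10, 20, 300), Bucket(20, 30, 50)]
filtered = filter_range(hist, 5, 20)
# filtered == [Bucket(5, 10, 50.0), Bucket(10, 20, 300.0)]
estimated_rows = sum(b.rows for b in filtered)  # 350.0
```

   The result is again a histogram, so a further filter or a join estimate on
top of it can reuse the same representation.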
   

-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

