findepi commented on PR #13293: URL: https://github.com/apache/datafusion/pull/13293#issuecomment-2465085079
With a single estimated value, the conceptual idea is that the random variable we're estimating has a distribution "condensed" around that value. One can imagine this being a normal distribution with the value as its mean. Obviously this is an over-simplification. Not everything is normally distributed; non-negative numbers, for example, are not. But this mental model is easy to work with.

While exact ranges are super useful (for example for predicate derivation and pruning), inexact ranges as a statistics model pose a problem: how to interpret the value when, e.g., judging which side of a join is smaller, or when computing a filter on top of some other computation. It's tempting to capture uncertainty as ever-widening ranges and to finally interpret the range as its middle value. This is definitely a more complex mental model, and it will surely result in the code being more complex. Will it also result in better estimates? Maybe.

There are also other alternatives to consider:

- histograms. Some optimizers (e.g. MySQL 8) use them to capture ranges of values together with their cardinality, which makes it possible to derive new histograms after applying a range filter.
- run-time adaptivity. There is only so much the optimizer can do a priori, before seeing the data. At some point of maturity, making the optimizer smarter doesn't result in queries returning faster. However, managing skew detected at run time, or being able to replan, are other, very powerful techniques.
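
To make the histogram alternative concrete, here is a minimal sketch of deriving a new histogram after a range filter. Everything in it is hypothetical: `Bucket`, `Histogram`, and `filter_range` are illustrative names, not DataFusion or MySQL types, and the sketch assumes values are spread uniformly within each bucket.

```rust
/// One bucket: values in the half-open range [lo, hi) with an estimated row count.
/// Hypothetical type for illustration only; not part of DataFusion.
#[derive(Debug, Clone, PartialEq)]
struct Bucket {
    lo: f64,
    hi: f64,
    rows: f64,
}

/// Hypothetical histogram statistic: a list of non-overlapping buckets.
#[derive(Debug, Clone, PartialEq)]
struct Histogram {
    buckets: Vec<Bucket>,
}

impl Histogram {
    /// Estimated total cardinality: the sum of all bucket row counts.
    fn total_rows(&self) -> f64 {
        self.buckets.iter().map(|b| b.rows).sum()
    }

    /// Derive the histogram that results from the filter `lo <= x < hi`.
    /// Assumption: values are uniformly distributed within each bucket,
    /// so a partially covered bucket keeps a proportional share of its rows.
    fn filter_range(&self, lo: f64, hi: f64) -> Histogram {
        let buckets = self
            .buckets
            .iter()
            .filter_map(|b| {
                // Clip the bucket to the filter range.
                let new_lo = b.lo.max(lo);
                let new_hi = b.hi.min(hi);
                if new_lo >= new_hi {
                    // No overlap (this also skips empty, zero-width buckets).
                    return None;
                }
                let fraction = (new_hi - new_lo) / (b.hi - b.lo);
                Some(Bucket {
                    lo: new_lo,
                    hi: new_hi,
                    rows: b.rows * fraction,
                })
            })
            .collect();
        Histogram { buckets }
    }
}
```

For example, with buckets [0, 10) holding 100 rows and [10, 20) holding 900 rows, `filter_range(0.0, 10.0)` keeps only the first bucket and estimates 100 rows. A single [0, 20) range with 1000 rows and the same uniformity assumption would estimate 500 rows for that filter, which is the kind of skew a histogram can capture and a plain range cannot.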