xudong963 commented on PR #14699: URL: https://github.com/apache/datafusion/pull/14699#issuecomment-2676878054
> Specifically, if we let `X_i` denote the value of `i`th row of column `X`, the maximum value for the column would be `M = max(X_1, ..., X_N)` with `N` being the number of rows. Given probabilistic information on the possible values of an arbitrary `X_i`, we can also make a probabilistic guess on what `M` can be.

This makes a lot of sense, thanks for the clear explanation. Now I understand how the distribution works and how the current statistics model differs from the original Min/max/nv.

> There is no reason why we can't use statistical tests to "recognize" distributions and use recognized distributions instead of directly falling back to unknown distributions in such cases.

Yes, `Sample` matters a great deal when statistics are lacking. In my experience building a whole optimizer, the annoying problem is that statistics are often inaccurate because the data grows frequently, or missing because the data is unstructured. In that context, `Sample` will show its muscle.

> In the worst case, all the calculus will work through unknown distributions and we will not be in a worse position than where we were before (sans bugs)

Makes sense. With statisticsv2, the worst case is falling back to the original behavior.

> It may be an interesting idea to write something up once we finalize all the details.

Thanks, looking forward to it!
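To make the quoted point about `M = max(X_1, ..., X_N)` concrete, here is a minimal sketch of my own, not code from the PR. It assumes the `X_i` are independent and uniformly distributed on `[lo, hi]`; both assumptions and all function names are hypothetical and chosen only for illustration. Under independence, `P(M <= m) = P(X_i <= m)^N`, so a distribution over row values gives a distribution over the column maximum.

```python
import random


def max_cdf_uniform(m: float, lo: float, hi: float, n: int) -> float:
    """P(max(X_1..X_n) <= m), assuming i.i.d. X_i ~ Uniform(lo, hi)."""
    if m <= lo:
        return 0.0
    if m >= hi:
        return 1.0
    # Independence: all n values must fall at or below m.
    return ((m - lo) / (hi - lo)) ** n


def expected_max_uniform(lo: float, hi: float, n: int) -> float:
    """Closed-form E[max] under the same i.i.d. uniform assumption."""
    return lo + (hi - lo) * n / (n + 1)


def simulate_max_mean(lo: float, hi: float, n: int, trials: int, seed: int = 0) -> float:
    """Monte Carlo estimate of E[max], used to sanity-check the closed form."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        total += max(rng.uniform(lo, hi) for _ in range(n))
    return total / trials


if __name__ == "__main__":
    lo, hi, n = 0.0, 100.0, 9
    print("P(M <= 50):", max_cdf_uniform(50.0, lo, hi, n))
    print("E[M] closed form:", expected_max_uniform(lo, hi, n))
    print("E[M] simulated:", simulate_max_mean(lo, hi, n, trials=20_000))
```

With more rows, the guess for `M` concentrates near the upper bound. That is the kind of information a plain min/max pair cannot express.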