ozankabak commented on PR #14699: URL: https://github.com/apache/datafusion/pull/14699#issuecomment-2673333676
Thank you for the review @xudong963. Here are my thoughts on your questions:

> 1. As the summary says, `StatisticsV2` will replace the usage of `Precision`, so the min/max/ndv of `ColumnStatistics` will be removed and `ColumnStatistics` will look like this: `pub struct ColumnStatistics { stat: StatisticsV2 }`, right? What information do we expect from the user's `TableProvider` to build the `Statistics`? (For the old statistics, it is better to know accurate min/max/ndv to do cardinality estimation.) My understanding is that the user needs to know their data distribution, e.g. if their data is uniformly distributed they need to provide the interval, and if it is skewed they need to provide the information needed for the exponential distribution. Or can DataFusion do sampling to decide?

Indeed, column and table statistics will be built on top of `StatisticsV2`, which simply encapsulates our information about a random variable. Treating a value drawn from a column as a random variable, the maximum/minimum/average etc. of a column also becomes a random variable (with a distinct statistical distribution). Specifically, if we let `X_i` denote the value of the `i`th row of column `X`, the maximum value of the column is `M = max(X_1, ..., X_N)`, with `N` being the number of rows. Given probabilistic information on the possible values of an arbitrary `X_i`, we can also make a probabilistic guess about what `M` can be. So, as @berkaysynnada mentions, we expect to have one `StatisticsV2` object for each piece of information like maximum/minimum etc. This will enable us to express *certain and uncertain* information about things like the maximum of a column in a single unified framework.

Coming back to the information flow from data sources: If the user supplies distributional information, it will be used by the leaf nodes as we evaluate/propagate statistics in expression graphs.
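To make the random-variable view concrete, here is a small self-contained Rust sketch. The `Distribution` enum and `max_cdf` helper are hypothetical stand-ins, not the actual `StatisticsV2` API; the point is only the math: if the rows are i.i.d., then `P(M <= x) = P(X <= x)^N`, so a probabilistic description of one row value `X_i` yields a probabilistic description of the column maximum `M`.

```rust
/// Hypothetical stand-in for `StatisticsV2`: our knowledge about a
/// scalar random variable. Not the actual DataFusion type.
enum Distribution {
    /// Value is known exactly (a "certain" statistic).
    Exact(f64),
    /// Uniformly distributed over [low, high].
    Uniform { low: f64, high: f64 },
    /// Unknown distribution; we only know an enclosing interval.
    Unknown { low: f64, high: f64 },
}

impl Distribution {
    /// CDF: P(X <= x). For `Unknown` we only know the interval, so this
    /// sketch conservatively treats it as uniform over that interval.
    fn cdf(&self, x: f64) -> f64 {
        match self {
            Distribution::Exact(v) => {
                if x >= *v { 1.0 } else { 0.0 }
            }
            Distribution::Uniform { low, high }
            | Distribution::Unknown { low, high } => {
                ((x - low) / (high - low)).clamp(0.0, 1.0)
            }
        }
    }
}

/// P(M <= x) where M = max(X_1, ..., X_n) and the X_i are i.i.d.
/// with the given per-row distribution: all n rows must be <= x.
fn max_cdf(row: &Distribution, n: u32, x: f64) -> f64 {
    row.cdf(x).powi(n as i32)
}

fn main() {
    // Rows drawn uniformly from [0, 100].
    let row = Distribution::Uniform { low: 0.0, high: 100.0 };
    // With 1 row, P(max <= 90) = 0.9; with 100 rows it is 0.9^100,
    // i.e. the column maximum is almost surely above 90.
    println!("P(M <= 90), N = 1:   {:.4}", max_cdf(&row, 1, 90.0));
    println!("P(M <= 90), N = 100: {:.4}", max_cdf(&row, 100, 90.0));
}
```

Note how the same `max_cdf` works for both certain (`Exact`) and uncertain (`Uniform`/`Unknown`) inputs, which is the unification the framework is after.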
Otherwise, we will fall back on the unknown distribution for leaf nodes, whose defining summary statistics can be generated automatically. In this context, your suggestion about sampling makes a lot of sense. There is no reason why we can't use statistical tests to "recognize" distributions and use the recognized distributions instead of falling back directly to unknown distributions in such cases. Actually, thinking about it, that would make a fantastic follow-up project once we have the basics in place 🙂

> 2. Have we done any work on the accuracy of the new statistical information during cardinality estimation?

Do you mean things like distinct counts? I think we will be able to see how well we estimate such things probabilistically once we finalize this and rework column/table stats with the new framework. In the worst case, all the calculus will work through unknown distributions and we will not be in a worse position than we were before (sans bugs). In cases where we can avoid loss of statistical information, we will end up with better estimations.

> 3. Are there certain papers that describe this statistical information framework in more detail?

I don't know of any that describe exactly what we are doing here, but the approach is somewhat similar to how belief propagation in probabilistic graphical models works (though not the same). It may be an interesting idea to write something up once we finalize all the details.

-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org