ozankabak commented on PR #14699:
URL: https://github.com/apache/datafusion/pull/14699#issuecomment-2673333676
Thank you for the review @xudong963. Here are my thoughts on your questions:
> 1. As the summary says, `StatisticsV2` will replace the usage of `Precision`, so the min/max/ndv of `ColumnStatistics` will be removed, and `ColumnStatistics` will look like this: `pub struct ColumnStatistics { stat: StatisticsV2 }`, right? What information do we expect from the user's `TableProvider` to build the `Statistics`? (For the old statistics, it's better to know accurate min/max/ndv to do cardinality estimation.)
> My understanding is that the user needs to know their data distribution, e.g. if their data distribution is uniform, they need to provide the interval; if skewed, they need to provide the information needed for the exponential distribution. Or can DataFusion do sampling to decide?
Indeed, column and table statistics will be built on top of `StatisticsV2`,
which simply encapsulates our information about a random variable. If we treat
a value drawn from a column as a random variable, then the maximum/minimum/average
etc. of that column also become random variables (each with its own statistical
distribution). Specifically, if we let `X_i` denote the value of the `i`-th row of
column `X`, the maximum value for the column is `M = max(X_1, ..., X_N)`,
with `N` being the number of rows. Given probabilistic information on the
possible values of an arbitrary `X_i`, we can also make a probabilistic guess
about what `M` can be.
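To make this concrete: for i.i.d. rows with a common CDF `F`, we have `P(M <= m) = F(m)^N`, so per-row distributional knowledge directly yields a distribution for the column maximum. Here is a toy Rust illustration of that fact (nothing here is DataFusion code):

```rust
// Toy illustration (not DataFusion code): the CDF of the column maximum
// M = max(X_1, ..., X_N) for i.i.d. rows with per-row CDF F is F(m)^N.
fn max_cdf(per_row_cdf: impl Fn(f64) -> f64, n: i32, m: f64) -> f64 {
    per_row_cdf(m).powi(n)
}

fn main() {
    // Assume X_i ~ Uniform(0, 100) and N = 1000 rows.
    let uniform_cdf = |x: f64| (x / 100.0).clamp(0.0, 1.0);
    // P(M <= 99) = 0.99^1000 ≈ 4.3e-5, i.e. the max almost surely exceeds 99.
    println!("{:.6}", max_cdf(uniform_cdf, 1000, 99.0));
}
```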
So, as @berkaysynnada mentions, we expect to have one `StatisticsV2`
object for each piece of information like the maximum/minimum etc. This will enable
us to express *certain and uncertain* information about things like the maximum
of a column in a single unified framework.
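In sketch form, this could look like the following. The names, variants, and fields here are illustrative placeholders, not the exact definitions in the PR:

```rust
#![allow(dead_code)]

/// Hypothetical sketch of a distribution-based statistic; the real
/// `StatisticsV2` in the PR may carry different variants and fields.
enum StatisticsV2 {
    /// Values spread uniformly over an interval; a degenerate (point)
    /// interval expresses an exactly known, *certain* value.
    Uniform { lower: f64, upper: f64 },
    /// Skewed data modeled by an exponential distribution.
    Exponential { rate: f64, offset: f64 },
    /// Fallback when the shape is unknown; carries only summary statistics.
    Unknown {
        mean: Option<f64>,
        variance: Option<f64>,
        range: Option<(f64, f64)>,
    },
}

/// Column statistics rebuilt on top of `StatisticsV2`: one object per
/// property (again, purely illustrative).
struct ColumnStatistics {
    min: StatisticsV2,
    max: StatisticsV2,
    distinct_count: StatisticsV2,
}

fn main() {
    // An exactly known maximum is just a point (degenerate) distribution:
    let _known_max = StatisticsV2::Uniform { lower: 42.0, upper: 42.0 };
}
```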
Coming back to the information flow from data sources: if the user supplies
distributional information, it will be used by the leaf nodes as we
evaluate/propagate statistics in expression graphs. Otherwise, we will fall
back on the unknown distribution for leaf nodes, whose defining summary
statistics can be generated automatically.
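Reusing the hypothetical `StatisticsV2` sketch above, the leaf-node fallback might look like this (the function name and signature are made up for illustration):

```rust
/// Hypothetical helper: choose the distribution a leaf node starts
/// statistics propagation with. Reuses the `StatisticsV2` sketch above.
fn leaf_distribution(
    supplied: Option<StatisticsV2>,
    mean: Option<f64>,
    variance: Option<f64>,
) -> StatisticsV2 {
    match supplied {
        // The source provided real distributional information; use it as-is.
        Some(dist) => dist,
        // Otherwise fall back on an unknown distribution whose summary
        // statistics are filled in from whatever the source can report.
        None => StatisticsV2::Unknown { mean, variance, range: None },
    }
}
```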
In this context, your suggestion about sampling makes a lot of sense. There
is no reason why we can't use statistical tests to "recognize" distributions
and, in such cases, use the recognized distributions instead of falling back
directly to unknown distributions. Actually, thinking about it, that would make a
fantastic follow-up project once we have the basics in place 🙂
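As a rough idea of what such recognition could look like, here is a toy moment-based heuristic, not a rigorous statistical test, again building on the hypothetical enum above: sample skewness near zero suggests a symmetric shape we can model as uniform over the observed range, while strong right skew suggests an exponential fit.

```rust
/// Toy heuristic, not a rigorous statistical test: guess a distribution
/// from a sample via its skewness (reuses the `StatisticsV2` sketch above).
fn recognize(sample: &[f64]) -> StatisticsV2 {
    let n = sample.len() as f64;
    let mean = sample.iter().sum::<f64>() / n;
    let var = sample.iter().map(|x| (x - mean).powi(2)).sum::<f64>() / n;
    let skew = sample.iter().map(|x| (x - mean).powi(3)).sum::<f64>() / (n * var.powf(1.5));

    if skew.abs() < 0.2 {
        // Roughly symmetric: model as uniform over the observed range.
        let lo = sample.iter().cloned().fold(f64::INFINITY, f64::min);
        let hi = sample.iter().cloned().fold(f64::NEG_INFINITY, f64::max);
        StatisticsV2::Uniform { lower: lo, upper: hi }
    } else if skew > 1.5 {
        // Strongly right-skewed: exponential fit by method of moments
        // (an exponential distribution has skewness 2).
        StatisticsV2::Exponential { rate: 1.0 / mean, offset: 0.0 }
    } else {
        // No confident match: keep only the summary statistics.
        StatisticsV2::Unknown { mean: Some(mean), variance: Some(var), range: None }
    }
}
```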
> 2. Have we done any work on the accuracy of the new statistical
information during cardinality estimation?
Do you mean things like distinct counts? I think we will be able to see how
well we estimate such things probabilistically once we finalize this and rework
column/table statistics with the new framework. In the worst case, all the calculus
will work through unknown distributions, and we will be no worse off
than before (bugs aside). In cases where we can avoid losing
statistical information, we will end up with better estimates.
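To illustrate what "calculus through unknown distributions" means: even when the shapes are unknown, summary statistics still propagate through operations. Here is a minimal sketch for the sum of two independent quantities (the independence assumption and the function name are mine, building on the enum above):

```rust
/// Minimal sketch: propagate summary statistics through `X + Y` without
/// knowing the shapes, assuming independence (reuses the sketch above).
fn add_unknowns(x: &StatisticsV2, y: &StatisticsV2) -> StatisticsV2 {
    match (x, y) {
        (
            StatisticsV2::Unknown { mean: mx, variance: vx, .. },
            StatisticsV2::Unknown { mean: my, variance: vy, .. },
        ) => StatisticsV2::Unknown {
            // mean(X + Y) = mean(X) + mean(Y)
            mean: (*mx).zip(*my).map(|(a, b)| a + b),
            // var(X + Y) = var(X) + var(Y) for independent X and Y
            variance: (*vx).zip(*vy).map(|(a, b)| a + b),
            range: None,
        },
        // Richer variant combinations would preserve even more information.
        _ => StatisticsV2::Unknown { mean: None, variance: None, range: None },
    }
}
```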
> 3. Are there certain papers that describe this statistical information
framework in more detail?
I'm not sure. I don't know of any paper that describes exactly what we are
doing here, but the approach is somewhat similar to how belief propagation in
probabilistic graphical models works (though not the same). It may be an
interesting idea to write something up once we finalize all the details.