ozankabak commented on PR #14699:
URL: https://github.com/apache/datafusion/pull/14699#issuecomment-2673333676

   Thank you for the review @xudong963. Here are my thoughts on your questions:
   
   > 1. As the summary says, `StatisticsV2` will replace the usage of 
`Precision`, so the min/max/ndv of `ColumnStatistics` will be removed, and the 
`ColumnStatistics` will look like this: `pub struct ColumnStatistics{stat: 
StatisticsV2}`, right? What information do we expect from the user's 
`TableProvider` to build the `Statistics`?(For the old statistic, it's better 
to know accurate min/max/ndv to do cardinality estimation).
   >    My understanding is that the user needs to know their data 
distribution, e.g. if their data distribution is uniform they need to provide 
the interval, if skewed they need to provide the information needed for the 
Exponential distribution. Or can DataFusion do sampling to decide?
   
   Indeed, column and table statistics will be built on top of `StatisticsV2`, 
which simply encapsulates our information about a random variable. Treating a 
value drawn from a column as a random variable, the maximum/minimum/average 
etc. of a column also becomes a random variable (with a distinct statistical 
distribution). Specifically, if we let `X_i` denote the value of the `i`th row of 
column `X`, the maximum value for the column would be `M = max(X_1, ..., X_N)` 
with `N` being the number of rows. Given probabilistic information on the 
possible values of an arbitrary `X_i`, we can also make a probabilistic guess 
on what `M` can be.
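
   To make this concrete, here is a minimal sketch under one extra assumption made purely for illustration: the `X_i` are independent and uniformly distributed on `[a, b]`. In that case `P(M <= m) = ((m - a) / (b - a))^N` and `E[M] = a + (b - a) * N / (N + 1)`. The function names are hypothetical and not part of DataFusion.

```rust
/// CDF of M = max(X_1, ..., X_N) when the X_i are i.i.d. uniform on [a, b].
/// (Illustrative assumption; real columns need not be i.i.d. or uniform.)
fn max_cdf_uniform(m: f64, a: f64, b: f64, n: u32) -> f64 {
    // CDF of a single X_i, clamped to [0, 1].
    let f = ((m - a) / (b - a)).clamp(0.0, 1.0);
    // Independence gives P(M <= m) = P(X_1 <= m) * ... * P(X_N <= m).
    f.powi(n as i32)
}

/// Expected value of the maximum under the same assumptions.
fn expected_max_uniform(a: f64, b: f64, n: u32) -> f64 {
    let n = n as f64;
    a + (b - a) * n / (n + 1.0)
}

fn main() {
    let (a, b, n) = (0.0, 100.0, 9);
    println!("P(M <= 50) = {}", max_cdf_uniform(50.0, a, b, n));
    println!("E[M]       = {}", expected_max_uniform(a, b, n));
}
```

   Even with only per-row distributional information, we end up with a full distribution for `M` rather than a single point estimate.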
   
   So, as @berkaysynnada mentions, we expect to have one `StatisticsV2` 
object for each piece of information like maximum/minimum etc. This will enable 
us to express *certain and uncertain* information about things like the maximum 
of a column in a single unified framework.
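
   As a rough sketch of what "certain and uncertain in one framework" could look like: the `Dist` enum and `ColumnSummary` struct below are hypothetical stand-ins invented for this example, not the actual `StatisticsV2` API or the final layout of column statistics. An exactly known value is modeled as a zero-width interval, and a merely bounded value as an unknown distribution over a range.

```rust
/// Hypothetical stand-in for a `StatisticsV2`-like object; NOT the actual API.
#[derive(Debug, Clone, PartialEq)]
enum Dist {
    /// Uniform over the closed interval [lo, hi]; lo == hi means an exact value.
    Uniform { lo: f64, hi: f64 },
    /// Only a range is known; nothing about the shape of the distribution.
    Unknown { lo: f64, hi: f64 },
}

impl Dist {
    /// Certain information: a degenerate (zero-width) interval.
    fn exact(v: f64) -> Self {
        Dist::Uniform { lo: v, hi: v }
    }

    fn is_certain(&self) -> bool {
        matches!(self, Dist::Uniform { lo, hi } if lo == hi)
    }
}

/// One distribution object per summary statistic (illustrative layout only).
struct ColumnSummary {
    min: Dist,
    max: Dist,
}

fn main() {
    let col = ColumnSummary {
        min: Dist::exact(0.0),                      // known exactly
        max: Dist::Unknown { lo: 90.0, hi: 100.0 }, // only bounded
    };
    println!(
        "min certain: {}, max certain: {}",
        col.min.is_certain(),
        col.max.is_certain()
    );
}
```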
   
   Coming back to the information flow from data sources: If the user supplies 
distributional information, it will be used by the leaf nodes as we 
evaluate/propagate statistics in expression graphs. Otherwise, we will fall 
back on the unknown distribution for leaf nodes, whose defining summary 
statistics can be automatically generated.
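
   The fallback logic can be sketched roughly as follows. Everything here is hypothetical: `LeafDist` and `leaf_distribution` are names made up for this example, and deriving the fallback range from the bounds of the column's data type is just one plausible way the summary statistics could be generated automatically.

```rust
/// Hypothetical distribution type for illustration; NOT DataFusion's API.
#[derive(Debug, PartialEq)]
enum LeafDist {
    /// Supplied by the data source, e.g. an exponential with a given rate.
    Exponential { rate: f64 },
    /// Fallback: only summary statistics, some of which may be missing.
    Unknown { mean: Option<f64>, range: (f64, f64) },
}

/// Pick the distribution for a leaf (column) node: use what the source
/// provides, otherwise fall back to an unknown distribution whose range
/// is taken from the bounds of the column's data type (an assumption).
fn leaf_distribution(user_supplied: Option<LeafDist>, type_bounds: (f64, f64)) -> LeafDist {
    user_supplied.unwrap_or(LeafDist::Unknown {
        mean: None,
        range: type_bounds,
    })
}

fn main() {
    let i8_bounds = (i8::MIN as f64, i8::MAX as f64);
    // The source knows its data is exponentially distributed.
    let known = leaf_distribution(Some(LeafDist::Exponential { rate: 2.0 }), i8_bounds);
    // The source supplies nothing, so we fall back to the unknown distribution.
    let fallback = leaf_distribution(None, i8_bounds);
    println!("{:?}\n{:?}", known, fallback);
}
```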
   
   In this context, your suggestion about sampling makes a lot of sense. There 
is no reason why we can't use statistical tests to "recognize" distributions 
and use recognized distributions instead of directly falling back to unknown 
distributions in such cases. Actually, thinking about it, that would make a 
fantastic follow-up project once we have the basics in place 🙂 
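
   As a sketch of what such "recognition" might look like, the example below runs a one-sample Kolmogorov-Smirnov test of a sample against a uniform distribution on a known range, using the usual asymptotic 5% critical value of `1.36 / sqrt(n)`. This is only one possible test, not something the PR implements; the function names are hypothetical, the sample is assumed to be non-empty with `lo < hi`, and if the range were estimated from the same sample, the critical value would no longer be exact.

```rust
/// One-sample Kolmogorov-Smirnov statistic of `sample` against a uniform
/// distribution on [lo, hi]: the largest gap between the empirical CDF and
/// the uniform CDF.
fn ks_statistic_uniform(sample: &[f64], lo: f64, hi: f64) -> f64 {
    let mut xs = sample.to_vec();
    xs.sort_by(|a, b| a.total_cmp(b));
    let n = xs.len() as f64;
    let mut d: f64 = 0.0;
    for (i, x) in xs.iter().enumerate() {
        let cdf = ((x - lo) / (hi - lo)).clamp(0.0, 1.0);
        // The empirical CDF jumps from i/n to (i+1)/n at x; check both sides.
        let below = cdf - i as f64 / n;
        let above = (i as f64 + 1.0) / n - cdf;
        d = d.max(below).max(above);
    }
    d
}

/// Treat the sample as uniform if the KS statistic is below the usual
/// asymptotic 5% critical value, 1.36 / sqrt(n).
fn looks_uniform(sample: &[f64], lo: f64, hi: f64) -> bool {
    let n = sample.len() as f64;
    ks_statistic_uniform(sample, lo, hi) < 1.36 / n.sqrt()
}

fn main() {
    // Deterministic stand-ins for sampled column values in [0, 1].
    let even: Vec<f64> = (0..100).map(|i| (i as f64 + 0.5) / 100.0).collect();
    let skewed: Vec<f64> = even.iter().map(|x| x.powi(4)).collect();
    println!("evenly spread sample looks uniform: {}", looks_uniform(&even, 0.0, 1.0));
    println!("skewed sample looks uniform: {}", looks_uniform(&skewed, 0.0, 1.0));
}
```

   A sample that passes could be mapped to a uniform distribution over the range, while one that fails would keep the unknown-distribution fallback (or be tested against other candidate families).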
   
   > 2. Have we done any work on the accuracy of the new statistical 
information during cardinality estimation?
   
   Do you mean things like distinct counts? I think we will be able to see how 
well we estimate such things probabilistically once we finalize this and rework 
column/table stats with the new framework. In the worst case, all the calculus 
will work through unknown distributions and we will not be in a worse position 
than where we were before (sans bugs). In cases where we can avoid loss of 
statistical information, we will end up with better estimations.
   
   > 3. Are there certain papers that describe this statistical information 
framework in more detail?
   
   I'm not sure. I don't know of any that describes exactly the same thing as 
what we are doing here, but the approach is somewhat similar to how belief 
propagation in probabilistic graphical models works (but not the same). It may 
be an interesting idea to write something up once we finalize all the details.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

