[
https://issues.apache.org/jira/browse/CALCITE-4132?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17161263#comment-17161263
]
Liya Fan commented on CALCITE-4132:
-----------------------------------
[~zabetak] Thanks a lot for your kind reply.
As you have indicated, both methods are estimations of some random variables.
Our new method is based on an unbiased estimator, which means that when we
perform the experiments repeatedly, our results should approach the "real"
value. (Analogously, when we estimate the standard deviation of a population
with a sample, we divide by n - 1 instead of n, because it is an unbiased
estimator.)
You are right, we found this problem in our downstream project. The query is
not a special one, just a simple user OLAP query. Our SQL engine has a quick
path for small tables, and because of the error introduced by the approximate
method (which was magnified along the plan tree), the optimizer selected the
quick path by mistake. Later when it found the quick path was not feasible, it
had to rerun, and this led to significant performance overhead.
Our investigations show that, the approximate method works well for large
values, whereas for small values, the error introduced by the approximate
method can be notable and counter-intuitive.
For example, suppose we have 4 distinct values (4-value enum), and we randomly
select one row from the table. According to common sense, we expect to have 1
distinct value, but the approximate method gives 0.88 distinct value, which is
counter-intuitive. This is a 10+% error, and when it is magnified by other
metadata operations, the resultant error can be large, which can lead to
incorrect behaviors of the optimizer.
> Estimate the NDV accurately
> ---------------------------
>
> Key: CALCITE-4132
> URL: https://issues.apache.org/jira/browse/CALCITE-4132
> Project: Calcite
> Issue Type: Improvement
> Components: core
> Reporter: Liya Fan
> Assignee: Liya Fan
> Priority: Major
> Labels: pull-request-available
> Time Spent: 10m
> Remaining Estimate: 0h
>
> Currently, we estimate the NDV of many operators based on the
> RelMdUtil#numDistinctVals method. This method estimates the expected number
> of distinct values selected n times (with replacement) from a collection with
> N distinct values. The estimation is based on the approximation when N
> approaches infinity.
> However, when N is not a large number, the difference between the approximate
> and exact values can be notabe. In addtion, the error can be magnified by
> different combinations of N and n, which can lead the optimizer to make wrong
> decisions.
> Therefore, we give the exact estimation based on the unbiased estimator (The
> proof is given in the comment).
--
This message was sent by Atlassian Jira
(v8.3.4#803005)