[ 
https://issues.apache.org/jira/browse/CALCITE-4132?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17161263#comment-17161263
 ] 

Liya Fan commented on CALCITE-4132:
-----------------------------------

[~zabetak] Thanks a lot for your kind reply.

As you have indicated, both methods are estimations of some random variables. 
Our new method is based on an unbiased estimator, which means that when we 
perform the experiments repeatedly, our results should approach the "real" 
value. (Analogously, when we estimate the standard deviation of a population 
with a sample, we divide by n - 1 instead of n, because it is an unbiased 
estimator.)

You are right, we found this problem in our downstream project. The query is 
not a special one, just a simple user OLAP query. Our SQL engine has a quick 
path for small tables, and because of the error introduced by the approximate 
method (which was magnified along the plan tree), the optimizer selected the 
quick path by mistake. Later when it found the quick path was not feasible, it 
had to rerun, and this led to significant performance overhead.

Our investigations show that, the approximate method works well for large 
values, whereas for small values, the error introduced by the approximate 
method can be notable and counter-intuitive.

For example, suppose we have 4 distinct values (4-value enum), and we randomly 
select one row from the table. According to common sense, we expect to have 1 
distinct value, but the approximate method gives 0.88 distinct value, which is 
counter-intuitive. This is a 10+% error, and when it is magnified by other 
metadata operations, the resultant error can be large, which can lead to 
incorrect behaviors of the optimizer. 

> Estimate the NDV accurately
> ---------------------------
>
>                 Key: CALCITE-4132
>                 URL: https://issues.apache.org/jira/browse/CALCITE-4132
>             Project: Calcite
>          Issue Type: Improvement
>          Components: core
>            Reporter: Liya Fan
>            Assignee: Liya Fan
>            Priority: Major
>              Labels: pull-request-available
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> Currently, we estimate the NDV of many operators based on the 
> RelMdUtil#numDistinctVals method. This method estimates the expected number 
> of distinct values selected n times (with replacement) from a collection with 
> N distinct values. The estimation is based on the approximation when N 
> approaches infinity.
> However, when N is not a large number, the difference between the approximate 
> and exact values can be notabe. In addtion, the error can be magnified by 
> different combinations of N and n, which can lead the optimizer to make wrong 
> decisions. 
> Therefore, we give the exact estimation based on the unbiased estimator (The 
> proof is given in the comment). 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

Reply via email to