[ 
https://issues.apache.org/jira/browse/CALCITE-4132?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17161583#comment-17161583
 ] 

Julian Hyde commented on CALCITE-4132:
--------------------------------------

This new method isn't 'accurate', it's just 'more accurate'; you should update 
the title.

I believe it also assumes that values are uniformly distributed (i.e. there are 
the same number of occurrences of each value).

NDV is an acronym. I think if you spell it out it will be more descriptive to 
people who are not statisticians.

As [~zabetak] says, it would help to update the description to give some 
specific examples where the current formula is off. And include some concrete 
examples in the code comment.

The old and new algorithms are with-replacement. Wouldn't without-replacement 
be better? If there are 5 rows with 5 distinct values and I pick 5, I will get 
5 distinct values. (Different than if you pick 5 rows from a pool of 1,000,000 
rows that have 5 distinct values.) I suppose you don't know that there are 5 
rows, just that there are 5 distinct values. So the algorithm might be better 
if you told it the number of rows.

> Estimate the NDV accurately
> ---------------------------
>
>                 Key: CALCITE-4132
>                 URL: https://issues.apache.org/jira/browse/CALCITE-4132
>             Project: Calcite
>          Issue Type: Improvement
>          Components: core
>            Reporter: Liya Fan
>            Assignee: Liya Fan
>            Priority: Major
>              Labels: pull-request-available
>          Time Spent: 20m
>  Remaining Estimate: 0h
>
> Currently, we estimate the NDV of many operators based on the 
> RelMdUtil#numDistinctVals method. This method estimates the expected number 
> of distinct values selected n times (with replacement) from a collection with 
> N distinct values. The estimation is based on the approximation when N 
> approaches infinity.
> However, when N is not a large number, the difference between the approximate 
> and exact values can be notabe. In addtion, the error can be magnified by 
> different combinations of N and n, which can lead the optimizer to make wrong 
> decisions. 
> Therefore, we give the exact estimation based on the unbiased estimator (The 
> proof is given in the comment). 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

Reply via email to