[ 
https://issues.apache.org/jira/browse/CALCITE-4351?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17652476#comment-17652476
 ] 

Liya Fan commented on CALCITE-4351:
-----------------------------------

[~julianhyde] Thanks for your feedback.
If I understand correctly, the value of numDistinctVals(dSize, 2.0) should 
approach 2.0 as the domain size increases?
This is because the first value selected is always distinct, and as the domain 
size increases, the chance that the second selection duplicates the first 
(1/dSize) becomes smaller. 
So I added the following test cases:
{code}
// JUnit's signature is assertEquals(expected, actual, delta)
assertEquals(1.99, numDistinctVals(100.0, 2.0), delta);
assertEquals(1.999, numDistinctVals(1000.0, 2.0), delta);
assertEquals(1.9999, numDistinctVals(10000.0, 2.0), delta);
{code}
Please note that the cases above exercise only our original code path, because 
the domain size and the number of selections are relatively small. 
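
As a sanity check on those expected values: assuming the textbook formula for the expected number of distinct values in k uniform draws with replacement from n values, n * (1 - (1 - 1/n)^k), the case k = 2 simplifies to 2 - 1/n. The sketch below is only an illustration under that assumption; the class and method names are hypothetical and it is not the RelMdUtil implementation.

```java
// Hypothetical helper class, for illustration only (not Calcite code).
class DistinctValsSketch {
  /**
   * Expected number of distinct values when drawing k times, uniformly and
   * with replacement, from a domain of n values: n * (1 - (1 - 1/n)^k).
   */
  static double expectedDistinct(double n, double k) {
    return n * (1.0 - Math.pow(1.0 - 1.0 / n, k));
  }

  public static void main(String[] args) {
    for (double n : new double[] {100.0, 1000.0, 10000.0}) {
      // For k = 2 the formula reduces to the closed form 2 - 1/n.
      System.out.println("n=" + n
          + " formula=" + expectedDistinct(n, 2.0)
          + " closedForm=" + (2.0 - 1.0 / n));
    }
  }
}
```

For these small domains the direct formula is numerically safe, which is consistent with the cases staying on the original code path.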
To test our new code paths, I also added some test cases in 
RelMdUtilTest#testNumDistinctValsWithLargeDomain:
{code}
// expected value first, per JUnit's assertEquals(expected, actual, delta)
assertEquals(2.0, numDistinctVals(dSize, 2.0), delta);
{code}
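
To illustrate why large domains need separate handling, here is a rough sketch of the precision problem. It assumes the exact formula is evaluated as dSize * (1 - exp(numSel * ln(1 - 1/dSize))). The "stable" variant uses Math.log1p and Math.expm1, which is a substitute technique chosen for this illustration and not necessarily what the patch does; all names below are hypothetical.

```java
// Hypothetical sketch, for illustration only (not Calcite code).
class LargeDomainSketch {
  /** Direct evaluation: 1 - 1/dSize rounds to exactly 1.0 once dSize is large enough. */
  static double naive(double dSize, double numSel) {
    double expo = numSel * Math.log(1.0 - 1.0 / dSize);
    return (1.0 - Math.exp(expo)) * dSize;
  }

  /** Same formula, using log1p/expm1 to keep precision when 1/dSize is tiny. */
  static double stable(double dSize, double numSel) {
    double expo = numSel * Math.log1p(-1.0 / dSize);
    return -Math.expm1(expo) * dSize;
  }

  public static void main(String[] args) {
    double dSize = 1e18;
    System.out.println("naive(1e18, 2.0)   = " + naive(dSize, 2.0));
    System.out.println("stable(1e18, 2.0)  = " + stable(dSize, 2.0));
    System.out.println("naive(1e18, 1e10)  = " + naive(dSize, 1e10));
    System.out.println("stable(1e18, 1e10) = " + stable(dSize, 1e10));
  }
}
```

With dSize = 1e18 the naive variant collapses to 0 because 1.0 - 1e-18 is not representable as a double distinct from 1.0, while the stable variant stays close to the number of selections.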

> RelMdUtil#numDistinctVals always returns 0 for large inputs
> -----------------------------------------------------------
>
>                 Key: CALCITE-4351
>                 URL: https://issues.apache.org/jira/browse/CALCITE-4351
>             Project: Calcite
>          Issue Type: Bug
>          Components: core
>    Affects Versions: 1.26.0
>            Reporter: Caizhi Weng
>            Priority: Major
>              Labels: pull-request-available
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> The previous implementation of {{RelMdUtil#numDistinctVals}} used the 
> approximation {{ln(1 + x) ~= x}} when {{x}} is small.
> However, CALCITE-4132 removed this approximation to make the result more 
> accurate. This causes the function to calculate an incorrect result for large 
> inputs (for example, when {{domainSize = 1e18}} and {{numSelected = 1e10}} 
> the result is 0) due to precision problems.
> What I would suggest is to treat small and large inputs in different ways. 
> For small inputs we use the new, more precise function and for large inputs 
> we use the old, approximated function.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
