[ 
https://issues.apache.org/jira/browse/CALCITE-4351?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17652343#comment-17652343
 ] 

Julian Hyde commented on CALCITE-4351:
--------------------------------------

[~fan_li_ya], I didn't follow the math, but I trust you.

In addition to the tests you've added, can you add some tests that show the 
estimate getting closer to 1.0 as the domain size increases? It will give us 
confidence that the function is smooth and continuous. E.g.
{code}
assertThat(numDistinctVals(100, 2.0), isWithin(2.0, delta));
assertThat(numDistinctVals(1000, 2.0), isWithin(1.4, delta));
assertThat(numDistinctVals(10000, 2.0), isWithin(1.1, delta));
{code}



> RelMdUtil#numDistinctVals always returns 0 for large inputs
> -----------------------------------------------------------
>
>                 Key: CALCITE-4351
>                 URL: https://issues.apache.org/jira/browse/CALCITE-4351
>             Project: Calcite
>          Issue Type: Bug
>          Components: core
>    Affects Versions: 1.26.0
>            Reporter: Caizhi Weng
>            Priority: Major
>              Labels: pull-request-available
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> The previous implementation of {{RelMdUtil#numDistinctVals}} used the 
> approximation {{ln(1 + x) ~= x}} when {{x}} is small.
> However, CALCITE-4132 removed this approximation to make the result more 
> accurate. This causes the function to calculate an incorrect result for large 
> inputs (for example, when {{domainSize = 1e18}} and {{numSelected = 1e10}} 
> the result is 0) due to precision problems.
> What I would suggest is to treat small and large inputs in different ways. 
> For small inputs we use the new, more precise function and for large inputs 
> we use the old, approximated function.
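
The precision problem described above can be reproduced with a small standalone sketch. It is not Calcite's code: the class name, the method names, and the cutoff constant are hypothetical, and it assumes the estimate is the standard expected number of distinct values when drawing n items with replacement from a domain of size D, that is D * (1 - (1 - 1/D)^n).

```java
public class NumDistinctValsSketch {
  // Hypothetical cutoff between the two forms; an arbitrary placeholder,
  // not a value taken from Calcite or from this issue.
  static final double LARGE_DOMAIN = 1e8;

  // "Precise" form: D * (1 - (1 - 1/D)^n), evaluated through exp and log.
  // For very large D, 1.0 - 1.0 / D rounds to exactly 1.0 in double
  // arithmetic, so log(...) is 0 and the whole expression collapses to 0.
  public static double precise(double domainSize, double numSelected) {
    return domainSize
        * (1.0 - Math.exp(numSelected * Math.log(1.0 - 1.0 / domainSize)));
  }

  // Approximate form: uses ln(1 + x) ~= x, so (1 - 1/D)^n ~= exp(-n / D).
  public static double approximate(double domainSize, double numSelected) {
    return domainSize * (1.0 - Math.exp(-numSelected / domainSize));
  }

  // The split suggested in the issue: precise form for small domains,
  // approximate form for large ones.
  public static double numDistinctVals(double domainSize, double numSelected) {
    return domainSize > LARGE_DOMAIN
        ? approximate(domainSize, numSelected)
        : precise(domainSize, numSelected);
  }

  public static void main(String[] args) {
    // Prints 0.0: the precise form loses all information at this scale.
    System.out.println(precise(1e18, 1e10));
    // Stays close to numSelected, as expected when D is much larger than n.
    System.out.println(approximate(1e18, 1e10));
    // Small inputs go through the precise form.
    System.out.println(numDistinctVals(100, 2));
  }
}
```

A single-formula alternative would be to evaluate the same expression with {{Math.log1p}} and {{Math.expm1}}, which are designed to stay accurate near zero; whether that is preferable to the two-branch split is a design choice the issue leaves open.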



