Paul Rogers created IMPALA-7944:
-----------------------------------

             Summary: count(*) correctly has NDV=1 via being labeled as constant
                 Key: IMPALA-7944
                 URL: https://issues.apache.org/jira/browse/IMPALA-7944
             Project: IMPALA
          Issue Type: Bug
          Components: Frontend
    Affects Versions: Impala 3.0
            Reporter: Paul Rogers
            Assignee: Paul Rogers


The {{count\(*)}} function has an NDV of 1: the function always returns a 
single value. This is important because it tells us that the query:

{code:sql}
SELECT COUNT(*) FROM foo
{code}

returns just one row. All good.

In the analyzer, we set a value of NDV=1 via an incorrect process: by labeling 
{{count\(*)}} as constant:

* For historical reasons, NDV calculations occur before a node is analyzed.
* We use the default NDV calc: if the node is constant, set NDV = 1, else 
compute it.
* Since the function node for {{count\(*)}} is not analyzed, we determine 
constant-ness from an inspection.
* All checks for non-constantness fail, leaving the final check: a function is 
constant if either a) it has no arguments, or b) all its arguments are constant.
* Since {{count\(*)}} has no expression arguments, and is not marked as 
non-deterministic, we infer it must be constant.
* Therefore, its NDV is set to 1.
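
The steps above can be sketched roughly as follows. This is a simplified model, not Impala's actual code: the {{Node}} class, the {{Kind}} enum, and the {{looksConstant}} and {{estimateNdv}} methods are hypothetical names chosen for illustration, and the {{computedNdv}} parameter stands in for whatever the real "else compute it" path would produce.

```java
import java.util.Arrays;
import java.util.List;

/** Simplified model of the pre-analysis NDV path; all names here are hypothetical. */
final class PreAnalysisNdvSketch {
    enum Kind { LITERAL, SLOT_REF, FUNCTION }

    /** Stand-in for an expression node that has not been analyzed yet. */
    static final class Node {
        final Kind kind;
        final boolean nonDeterministic;
        final List<Node> args; // expression arguments only; the '*' in count(*) is not one

        Node(Kind kind, boolean nonDeterministic, Node... args) {
            this.kind = kind;
            this.nonDeterministic = nonDeterministic;
            this.args = Arrays.asList(args);
        }
    }

    /** Const-ness determined by inspection, because the node is not analyzed. */
    static boolean looksConstant(Node n) {
        if (n.kind == Kind.LITERAL) return true;
        if (n.kind == Kind.SLOT_REF) return false;
        if (n.nonDeterministic) return false;
        // Final check: constant if there are no arguments or all arguments are constant.
        for (Node arg : n.args) {
            if (!looksConstant(arg)) return false;
        }
        return true;
    }

    /** Default NDV rule: a constant has NDV = 1; otherwise use the computed estimate. */
    static long estimateNdv(Node n, long computedNdv) {
        return looksConstant(n) ? 1L : computedNdv;
    }

    public static void main(String[] args) {
        Node countStar = new Node(Kind.FUNCTION, false);
        Node countCol = new Node(Kind.FUNCTION, false, new Node(Kind.SLOT_REF, false));
        System.out.println(estimateNdv(countStar, 1000L)); // prints 1
        System.out.println(estimateNdv(countCol, 1000L));  // prints 1000
    }
}
```

In this model {{count\(*)}} gets NDV=1 only because it has no expression arguments, not because anything recognizes it as an aggregate that returns a single value.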

This is, of course, highly unstable for multiple reasons:

* NDV calculations are done before the node is analyzed. This means that NDV 
calculations for a {{SlotRef}} would fail because the ref has not yet been 
resolved to a column. ({{SlotRef}} has special code to work around this 
fact.)
* The "treat zero-argument functions as constants and so use NDV=1" rule works 
for {{count\(*)}}, but not for {{count(c)}}, nor for {{sum(c)}}, both of which 
should have NDV=1.
* {{count\(*)}} is not really a constant; its NDV=1 setting should not rely 
on (benignly) assuming it is.
* The const-ness seen by the NDV check is temporary; once the node is analyzed, 
it is correctly marked as non-const. So, the calcs rely on one path saying the 
function is const, and another path saying it is not.
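
The last point, the const-ness answer changing once the node is analyzed, can be illustrated with a small sketch. Everything here is an assumption made for illustration: the {{FnCall}} class, its {{analyze()}} and {{isConstant()}} methods, and the way analysis "discovers" that a function is an aggregate are hypothetical and far simpler than the real analyzer.

```java
/** Hypothetical model of a function call whose const-ness changes once analyzed. */
final class ConstnessFlipSketch {
    static final class FnCall {
        final String name;
        final int numExprArgs;
        private boolean analyzed = false;
        private boolean aggregate = false;

        FnCall(String name, int numExprArgs) {
            this.name = name;
            this.numExprArgs = numExprArgs;
        }

        /** Assumed stand-in for analysis: resolving the function reveals it is an aggregate. */
        void analyze() {
            aggregate = name.equals("count") || name.equals("sum");
            analyzed = true;
        }

        boolean isConstant() {
            if (!analyzed) {
                // Inspection only (simplified): zero expression arguments looks constant.
                return numExprArgs == 0;
            }
            // After analysis, an aggregate is never treated as constant.
            return !aggregate && numExprArgs == 0;
        }
    }

    public static void main(String[] args) {
        FnCall countStar = new FnCall("count", 0);
        boolean before = countStar.isConstant(); // the answer the NDV calculation sees
        countStar.analyze();
        boolean after = countStar.isConstant();  // the answer every later caller sees
        System.out.println(before + " -> " + after); // prints "true -> false"
    }
}
```

In this sketch the NDV=1 result depends entirely on {{isConstant()}} being asked before {{analyze()}} runs, which is the ordering dependency described above.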

This should be cleaned up to provide a more reliable, understandable way of 
achieving the goal of NDV=1.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
