[ 
https://issues.apache.org/jira/browse/IMPALA-7944?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Paul Rogers updated IMPALA-7944:
--------------------------------
    Description: 
The {{count(*)}} function has an NDV of 1: the function always returns a single 
value. This is important because it tells us that the query:
{code:sql}
SELECT COUNT(*) FROM foo
{code}
returns just one row. All good.

In the analyzer, we arrive at NDV=1 via an incorrect process, by labeling 
{{count(*)}} as constant:
 * For historical reasons, NDV calculations occur before a node is analyzed.
 * We use the default NDV calc: if the node is constant, set NDV = 1, else 
compute it.
 * Since the function node for {{count(*)}} is not analyzed, we determine 
constant-ness by inspecting the unanalyzed node.
 * All checks for non-constantness fail, leaving the final check: a function is 
constant if either a) it has no arguments, or b) all its arguments are constant.
 * Since {{count(*)}} has no expression arguments, and is not marked as 
non-deterministic, we infer it must be constant.
 * Therefore, its NDV is set to 1.
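
The chain of inferences above can be illustrated with a small toy model. This is only a sketch under stated assumptions: the class and method names ({{PreAnalysisNdvSketch}}, {{looksConstant}}, {{defaultNdv}}) and the {{UNKNOWN_NDV}} placeholder are hypothetical and do not mirror Impala's actual classes. It shows only how a "no arguments, or all arguments constant" fallback yields NDV=1 for {{count(*)}} but not for {{count(c)}}.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Toy model only: every name here is hypothetical, not Impala's actual code.
final class PreAnalysisNdvSketch {
  // Placeholder for "compute it some other way"; the real calculation is out of scope.
  static final long UNKNOWN_NDV = -1;

  /** A raw argument: either a constant (literal) or something else, e.g. a column. */
  static final class Arg {
    final String text;
    final boolean constant;
    Arg(String text, boolean constant) {
      this.text = text;
      this.constant = constant;
    }
  }

  /** A function call as it looks before analysis: a name plus raw arguments. */
  static final class FnCall {
    final String name;
    final List<Arg> args;
    final boolean nonDeterministic;
    FnCall(String name, List<Arg> args, boolean nonDeterministic) {
      this.name = name;
      this.args = args;
      this.nonDeterministic = nonDeterministic;
    }
  }

  /** Fallback check: constant if there are no args, or if all args are constant. */
  static boolean looksConstant(FnCall fn) {
    if (fn.nonDeterministic) return false;
    // Nothing here knows whether 'fn' is an aggregate: that requires analysis.
    for (Arg arg : fn.args) {
      if (!arg.constant) return false;
    }
    return true; // Also reached when the argument list is empty, as for count(*).
  }

  /** Default NDV rule: 1 for anything that looks constant, otherwise unknown here. */
  static long defaultNdv(FnCall fn) {
    return looksConstant(fn) ? 1 : UNKNOWN_NDV;
  }

  public static void main(String[] unused) {
    FnCall countStar = new FnCall("count", Collections.<Arg>emptyList(), false);
    FnCall countCol = new FnCall("count", Arrays.asList(new Arg("c", false)), false);
    System.out.println("count(*): " + defaultNdv(countStar)); // prints "count(*): 1"
    System.out.println("count(c): " + defaultNdv(countCol));  // prints "count(c): -1"
  }
}
```

In this model, {{count(*)}} gets NDV=1 only because its argument list is empty, not because anything recognizes it as an aggregate.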

This is, of course, highly unstable for multiple reasons:
 * NDV calculations are done before the node is analyzed. This means that NDV 
calculations for a {{SlotRef}} would fail because the ref has not yet been 
resolved to a column. (The {{SlotRef}} has special code to work around this 
fact.)
 * The "treat zero-argument functions as constants and so use NDV=1" rule works 
for {{count(*)}}, but not for {{count(c)}}, nor for {{sum(c)}}, both of which 
should have NDV=1.
 * {{count(*)}} is not really a constant; its NDV=1 setting should not rely 
on (benignly) assuming it is.
 * The const-ness reported to the NDV check is temporary; once the node is 
analyzed, it is correctly marked as non-const. So, the calcs rely on one path 
saying the function is const, and another path saying it is not.

This should be cleaned up to provide a more reliable, understandable way of 
achieving the goal of NDV=1.
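
One possible direction, shown purely as an illustration and not as the fix chosen for this issue, is to make the NDV rule ask the question it actually cares about: "is this an aggregate?" rather than "does this look constant?". The sketch below assumes the NDV calculation runs after analysis, that the aggregate flag is known at that point, and that there is no grouping (as in the query above). All names ({{ExplicitAggNdvSketch}}, {{ndv}}, {{UNKNOWN_NDV}}) are hypothetical.

```java
// Hypothetical sketch: names are illustrative and do not mirror Impala's code.
final class ExplicitAggNdvSketch {
  // Placeholder for "compute it some other way"; the real calculation is out of scope.
  static final long UNKNOWN_NDV = -1;

  /** A function call after analysis, when its aggregate status is known. */
  static final class FnCall {
    final String name;
    final boolean aggregate;
    final boolean constant;
    FnCall(String name, boolean aggregate, boolean allArgsConstant) {
      this.name = name;
      this.aggregate = aggregate;
      // Aggregate functions are never constant, whatever their arguments look like.
      this.constant = !aggregate && allArgsConstant;
    }
  }

  /** NDV rule that never needs to pretend an aggregate is constant. */
  static long ndv(FnCall fn) {
    if (fn.aggregate) return 1; // Assumes no grouping: one output value.
    if (fn.constant) return 1;  // True constants still get NDV=1.
    return UNKNOWN_NDV;
  }

  public static void main(String[] unused) {
    FnCall countStar = new FnCall("count", true, true); // count(*): no args
    FnCall sumCol = new FnCall("sum", true, false);     // sum(c): column arg
    System.out.println(countStar.constant + " " + ndv(countStar)); // prints "false 1"
    System.out.println(sumCol.constant + " " + ndv(sumCol));       // prints "false 1"
  }
}
```

With a rule of this shape, {{count(*)}}, {{count(c)}} and {{sum(c)}} would all reach NDV=1 along the same path, and const-ness would give the same answer before and after the NDV calculation.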

As it turns out, this seems to have been a known issue in the code:

{code:java}
    // TODO: we can't correctly determine const-ness before analyzing 'fn_'. We should
    // rework logic so that we do not call this function on unanalyzed exprs.
    // Aggregate functions are never constant.
{code}


> count(*) correctly has NDV=1 via being labeled as constant
> ----------------------------------------------------------
>
>                 Key: IMPALA-7944
>                 URL: https://issues.apache.org/jira/browse/IMPALA-7944
>             Project: IMPALA
>          Issue Type: Bug
>          Components: Frontend
>    Affects Versions: Impala 3.0
>            Reporter: Paul Rogers
>            Assignee: Paul Rogers
>            Priority: Minor
>



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
