[jira] [Commented] (CALCITE-6561) Incorrect value calculated for AS MEASURE aggregate when grouping by a column with NULLs

Adam Markowitz (Jira) Fri, 06 Sep 2024 15:14:08 -0700


    [ 
https://issues.apache.org/jira/browse/CALCITE-6561?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17879982#comment-17879982
 ]


Adam Markowitz commented on CALCITE-6561:
-----------------------------------------

Here are relevant bits from the plan lifecycle:

 
{code:java}
2024-09-06 15:02:35,040 [main] DEBUG - Plan after converting SqlNode to RelNode
LogicalProject(MGR=[$0], SUM_SAL=[$1])
  LogicalAggregate(group=[{0}], agg#0=[AGG_M2V($1)])
    LogicalProject(MGR=[$3], SUM_SAL=[V2M(SUM($5))])
      LogicalTableScan(table=[[scott, EMP]]) 

2024-09-06 15:02:35,118 [main] DEBUG - Plan after extracting correlated 
computations:
LogicalProject(MGR=[$0], $f1=[$2])
  LogicalProject(MGR=[$0], $f1=[$1], $f0=[$2])
    LogicalCorrelate(correlation=[$cor0], joinType=[left], 
requiredColumns=[{0}])
      LogicalAggregate(group=[{0}], agg#0=[AGG_M2M($1)])
        LogicalProject(MGR=[$3], SUM_SAL=[V2M(SUM($5))])
          LogicalTableScan(table=[[scott, EMP]])
      LogicalAggregate(group=[{}], agg#0=[SUM($5)])
        LogicalFilter(condition=[=($3, $cor0.MGR)])
          LogicalTableScan(table=[[scott, EMP]])

2024-09-06 15:02:35,137 [main] DEBUG - Plan after trimming unused fields
LogicalProject(MGR=[$0], $f1=[$2])
  LogicalJoin(condition=[=($0, $1)], joinType=[left])
    LogicalAggregate(group=[{0}])
      LogicalProject(MGR=[$3])
        LogicalTableScan(table=[[scott, EMP]])
    LogicalAggregate(group=[{0}], agg#0=[SUM($1)])
      LogicalProject(MGR=[$1], SAL=[$2])
        LogicalFilter(condition=[IS NOT NULL($1)])
          LogicalProject(EMPNO=[$0], MGR=[$3], SAL=[$5])
            LogicalTableScan(table=[[scott, EMP]])

2024-09-06 15:02:35,219 [main] DEBUG - Plan after physical tweaks:
EnumerableCalc(expr#0..2=[{inputs}], MGR=[$t0], $f1=[$t2]): rowcount = 1.4546, 
cumulative cost = {49.00390466008148 rows, 289.273 cpu, 0.0 io}, id = 304
  EnumerableHashJoin(condition=[=($0, $1)], joinType=[left]): rowcount = 
1.4546, cumulative cost = {47.54930466008148 rows, 282.0 cpu, 0.0 io}, id = 298
    EnumerableAggregate(group=[{3}]): rowcount = 1.4, cumulative cost = {15.4 
rows, 15.0 cpu, 0.0 io}, id = 291
      EnumerableTableScan(table=[[scott, EMP]]): rowcount = 14.0, cumulative 
cost = {14.0 rows, 15.0 cpu, 0.0 io}, id = 186
    EnumerableAggregate(group=[{3}], agg#0=[SUM($5)]): rowcount = 1.26, 
cumulative cost = {28.033250060081482 rows, 267.0 cpu, 0.0 io}, id = 296
      EnumerableCalc(expr#0..7=[{inputs}], expr#8=[IS NOT NULL($t3)], 
proj#0..7=[{exprs}], $condition=[$t8]): rowcount = 12.6, cumulative cost = 
{26.6 rows, 267.0 cpu, 0.0 io}, id = 302
        EnumerableTableScan(table=[[scott, EMP]]): rowcount = 14.0, cumulative 
cost = {14.0 rows, 15.0 cpu, 0.0 io}, id = 186
{code}
 

We can see in this logical plan the following:

 
{code:java}
2024-09-06 15:02:35,137 [main] DEBUG - Plan after trimming unused fields
LogicalProject(MGR=[$0], $f1=[$2])
  LogicalJoin(condition=[=($0, $1)], joinType=[left])
    LogicalAggregate(group=[{0}])
      LogicalProject(MGR=[$3])
        LogicalTableScan(table=[[scott, EMP]])
    LogicalAggregate(group=[{0}], agg#0=[SUM($1)])
      LogicalProject(MGR=[$1], SAL=[$2])
        LogicalFilter(condition=[IS NOT NULL($1)])
          LogicalProject(EMPNO=[$0], MGR=[$3], SAL=[$5])
            LogicalTableScan(table=[[scott, EMP]]){code}
Note the *condition* in {*}LogicalJoin(condition=[=($0, $1)], 
joinType=[left]){*}.  This is an equality condition, however the test case here 
is comparing NULL from the left with NULL from the right and NULL == NULL is 
{_}not true{_}.  This should be a IS NOT DISTINCT FROM instead.

However, even after tweaking the code to produce IS NOT DISTINCT FROM in the 
logical plan, the physical plan still uses equality:

 
{code:java}
LogicalProject(MGR=[$0], $f1=[$2])
  LogicalJoin(condition=[IS NOT DISTINCT FROM($0, $1)], joinType=[left])
    LogicalAggregate(group=[{0}])
      LogicalProject(MGR=[$3])
        LogicalTableScan(table=[[scott, EMP]])
    LogicalAggregate(group=[{0}], agg#0=[SUM($1)])
      LogicalProject(MGR=[$1], SAL=[$2])
        LogicalFilter(condition=[IS NOT NULL($1)])
          LogicalProject(EMPNO=[$0], MGR=[$3], SAL=[$5])
            LogicalTableScan(table=[[scott, EMP]])
...
2024-09-06 15:10:31,911 [main] DEBUG - Plan after physical tweaks:
EnumerableCalc(expr#0..2=[{inputs}], MGR=[$t0], $f1=[$t2]): rowcount = 1.4546, 
cumulative cost = {49.00390466008148 rows, 289.273 cpu, 0.0 io}, id = 304
  EnumerableHashJoin(condition=[=($0, $1)], joinType=[left]): rowcount = 
1.4546, cumulative cost = {47.54930466008148 rows, 282.0 cpu, 0.0 io}, id = 298
    EnumerableAggregate(group=[{3}]): rowcount = 1.4, cumulative cost = {15.4 
rows, 15.0 cpu, 0.0 io}, id = 291
      EnumerableTableScan(table=[[scott, EMP]]): rowcount = 14.0, cumulative 
cost = {14.0 rows, 15.0 cpu, 0.0 io}, id = 186
    EnumerableAggregate(group=[{3}], agg#0=[SUM($5)]): rowcount = 1.26, 
cumulative cost = {28.033250060081482 rows, 267.0 cpu, 0.0 io}, id = 296
      EnumerableCalc(expr#0..7=[{inputs}], expr#8=[IS NOT NULL($t3)], 
proj#0..7=[{exprs}], $condition=[$t8]): rowcount = 12.6, cumulative cost = 
{26.6 rows, 267.0 cpu, 0.0 io}, id = 302
        EnumerableTableScan(table=[[scott, EMP]]): rowcount = 14.0, cumulative 
cost = {14.0 rows, 15.0 cpu, 0.0 io}, id = 186
 {code}
It's likely that this may be related to 
[CALCITE-6452|https://issues.apache.org/jira/browse/CALCITE-6452], so I'll be 
redirecting my efforts to that issue for now.

 

 

 

> Incorrect value calculated for AS MEASURE aggregate when grouping by a column 
> with NULLs
> ----------------------------------------------------------------------------------------
>
>                 Key: CALCITE-6561
>                 URL: https://issues.apache.org/jira/browse/CALCITE-6561
>             Project: Calcite
>          Issue Type: Bug
>          Components: core
>    Affects Versions: 1.37.0
>            Reporter: Adam Markowitz
>            Priority: Minor
>              Labels: pull-request-available
>
> Test case using an `AS MEASURE` aggregate with expected data:
>  
> {code:java}
> !set outputformat mysql
> !use scott
> create view empm as
> select *, avg(sal) as measure avg_sal
> from emp;
> (0 rows modified)
> !update
> # GROUP BY a dimension with NULLs
> SELECT mgr, avg_sal FROM empm
> GROUP BY mgr;
> +------+---------+
> | MGR  | AVG_SAL |
> +------+---------+
> | 7566 | 3000.00 |
> | 7698 | 1310.00 |
> | 7782 | 1300.00 |
> | 7788 | 1100.00 |
> | 7839 | 2758.33 |
> | 7902 |  800.00 |
> |      | 5000.00 |
> +------+---------+
> (7 rows)
> !ok
> {code}
> The NULL value does not get the aggregate value causing a failure:
> {code:java}
> < |      | 5000.00 |
> ---
> > |      |         | {code}
> However, swapping out the `AS MEASURE` column reference with an inline 
> aggregate results in the correct data:
> {code:java}
> SELECT mgr, avg(sal) as avg_sal FROM empm
> GROUP BY mgr;
> +------+---------+
> | MGR  | AVG_SAL |
> +------+---------+
> | 7566 | 3000.00 |
> | 7698 | 1310.00 |
> | 7782 | 1300.00 |
> | 7788 | 1100.00 |
> | 7839 | 2758.33 |
> | 7902 |  800.00 |
> |      | 5000.00 |
> +------+---------+
> (7 rows)
>  {code}
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

[jira] [Commented] (CALCITE-6561) Incorrect value calculated for AS MEASURE aggregate when grouping by a column with NULLs

Reply via email to