[
https://issues.apache.org/jira/browse/CALCITE-6561?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17879982#comment-17879982
]
Adam Markowitz commented on CALCITE-6561:
-----------------------------------------
Here are relevant bits from the plan lifecycle:
{code:java}
2024-09-06 15:02:35,040 [main] DEBUG - Plan after converting SqlNode to RelNode
LogicalProject(MGR=[$0], SUM_SAL=[$1])
LogicalAggregate(group=[{0}], agg#0=[AGG_M2V($1)])
LogicalProject(MGR=[$3], SUM_SAL=[V2M(SUM($5))])
LogicalTableScan(table=[[scott, EMP]])
2024-09-06 15:02:35,118 [main] DEBUG - Plan after extracting correlated
computations:
LogicalProject(MGR=[$0], $f1=[$2])
LogicalProject(MGR=[$0], $f1=[$1], $f0=[$2])
LogicalCorrelate(correlation=[$cor0], joinType=[left],
requiredColumns=[{0}])
LogicalAggregate(group=[{0}], agg#0=[AGG_M2M($1)])
LogicalProject(MGR=[$3], SUM_SAL=[V2M(SUM($5))])
LogicalTableScan(table=[[scott, EMP]])
LogicalAggregate(group=[{}], agg#0=[SUM($5)])
LogicalFilter(condition=[=($3, $cor0.MGR)])
LogicalTableScan(table=[[scott, EMP]])
2024-09-06 15:02:35,137 [main] DEBUG - Plan after trimming unused fields
LogicalProject(MGR=[$0], $f1=[$2])
LogicalJoin(condition=[=($0, $1)], joinType=[left])
LogicalAggregate(group=[{0}])
LogicalProject(MGR=[$3])
LogicalTableScan(table=[[scott, EMP]])
LogicalAggregate(group=[{0}], agg#0=[SUM($1)])
LogicalProject(MGR=[$1], SAL=[$2])
LogicalFilter(condition=[IS NOT NULL($1)])
LogicalProject(EMPNO=[$0], MGR=[$3], SAL=[$5])
LogicalTableScan(table=[[scott, EMP]])
2024-09-06 15:02:35,219 [main] DEBUG - Plan after physical tweaks:
EnumerableCalc(expr#0..2=[{inputs}], MGR=[$t0], $f1=[$t2]): rowcount = 1.4546,
cumulative cost = {49.00390466008148 rows, 289.273 cpu, 0.0 io}, id = 304
EnumerableHashJoin(condition=[=($0, $1)], joinType=[left]): rowcount =
1.4546, cumulative cost = {47.54930466008148 rows, 282.0 cpu, 0.0 io}, id = 298
EnumerableAggregate(group=[{3}]): rowcount = 1.4, cumulative cost = {15.4
rows, 15.0 cpu, 0.0 io}, id = 291
EnumerableTableScan(table=[[scott, EMP]]): rowcount = 14.0, cumulative
cost = {14.0 rows, 15.0 cpu, 0.0 io}, id = 186
EnumerableAggregate(group=[{3}], agg#0=[SUM($5)]): rowcount = 1.26,
cumulative cost = {28.033250060081482 rows, 267.0 cpu, 0.0 io}, id = 296
EnumerableCalc(expr#0..7=[{inputs}], expr#8=[IS NOT NULL($t3)],
proj#0..7=[{exprs}], $condition=[$t8]): rowcount = 12.6, cumulative cost =
{26.6 rows, 267.0 cpu, 0.0 io}, id = 302
EnumerableTableScan(table=[[scott, EMP]]): rowcount = 14.0, cumulative
cost = {14.0 rows, 15.0 cpu, 0.0 io}, id = 186
{code}
We can see in this logical plan the following:
{code:java}
2024-09-06 15:02:35,137 [main] DEBUG - Plan after trimming unused fields
LogicalProject(MGR=[$0], $f1=[$2])
LogicalJoin(condition=[=($0, $1)], joinType=[left])
LogicalAggregate(group=[{0}])
LogicalProject(MGR=[$3])
LogicalTableScan(table=[[scott, EMP]])
LogicalAggregate(group=[{0}], agg#0=[SUM($1)])
LogicalProject(MGR=[$1], SAL=[$2])
LogicalFilter(condition=[IS NOT NULL($1)])
LogicalProject(EMPNO=[$0], MGR=[$3], SAL=[$5])
LogicalTableScan(table=[[scott, EMP]]){code}
Note the *condition* in {*}LogicalJoin(condition=[=($0, $1)],
joinType=[left]){*}. This is an equality condition, however the test case here
is comparing NULL from the left with NULL from the right and NULL == NULL is
{_}not true{_}. This should be a IS NOT DISTINCT FROM instead.
However, even after tweaking the code to produce IS NOT DISTINCT FROM in the
logical plan, the physical plan still uses equality:
{code:java}
LogicalProject(MGR=[$0], $f1=[$2])
LogicalJoin(condition=[IS NOT DISTINCT FROM($0, $1)], joinType=[left])
LogicalAggregate(group=[{0}])
LogicalProject(MGR=[$3])
LogicalTableScan(table=[[scott, EMP]])
LogicalAggregate(group=[{0}], agg#0=[SUM($1)])
LogicalProject(MGR=[$1], SAL=[$2])
LogicalFilter(condition=[IS NOT NULL($1)])
LogicalProject(EMPNO=[$0], MGR=[$3], SAL=[$5])
LogicalTableScan(table=[[scott, EMP]])
...
2024-09-06 15:10:31,911 [main] DEBUG - Plan after physical tweaks:
EnumerableCalc(expr#0..2=[{inputs}], MGR=[$t0], $f1=[$t2]): rowcount = 1.4546,
cumulative cost = {49.00390466008148 rows, 289.273 cpu, 0.0 io}, id = 304
EnumerableHashJoin(condition=[=($0, $1)], joinType=[left]): rowcount =
1.4546, cumulative cost = {47.54930466008148 rows, 282.0 cpu, 0.0 io}, id = 298
EnumerableAggregate(group=[{3}]): rowcount = 1.4, cumulative cost = {15.4
rows, 15.0 cpu, 0.0 io}, id = 291
EnumerableTableScan(table=[[scott, EMP]]): rowcount = 14.0, cumulative
cost = {14.0 rows, 15.0 cpu, 0.0 io}, id = 186
EnumerableAggregate(group=[{3}], agg#0=[SUM($5)]): rowcount = 1.26,
cumulative cost = {28.033250060081482 rows, 267.0 cpu, 0.0 io}, id = 296
EnumerableCalc(expr#0..7=[{inputs}], expr#8=[IS NOT NULL($t3)],
proj#0..7=[{exprs}], $condition=[$t8]): rowcount = 12.6, cumulative cost =
{26.6 rows, 267.0 cpu, 0.0 io}, id = 302
EnumerableTableScan(table=[[scott, EMP]]): rowcount = 14.0, cumulative
cost = {14.0 rows, 15.0 cpu, 0.0 io}, id = 186
{code}
It's likely that this may be related to
[CALCITE-6452|https://issues.apache.org/jira/browse/CALCITE-6452], so I'll be
redirecting my efforts to that issue for now.
> Incorrect value calculated for AS MEASURE aggregate when grouping by a column
> with NULLs
> ----------------------------------------------------------------------------------------
>
> Key: CALCITE-6561
> URL: https://issues.apache.org/jira/browse/CALCITE-6561
> Project: Calcite
> Issue Type: Bug
> Components: core
> Affects Versions: 1.37.0
> Reporter: Adam Markowitz
> Priority: Minor
> Labels: pull-request-available
>
> Test case using an `AS MEASURE` aggregate with expected data:
>
> {code:java}
> !set outputformat mysql
> !use scott
> create view empm as
> select *, avg(sal) as measure avg_sal
> from emp;
> (0 rows modified)
> !update
> # GROUP BY a dimension with NULLs
> SELECT mgr, avg_sal FROM empm
> GROUP BY mgr;
> +------+---------+
> | MGR | AVG_SAL |
> +------+---------+
> | 7566 | 3000.00 |
> | 7698 | 1310.00 |
> | 7782 | 1300.00 |
> | 7788 | 1100.00 |
> | 7839 | 2758.33 |
> | 7902 | 800.00 |
> | | 5000.00 |
> +------+---------+
> (7 rows)
> !ok
> {code}
> The NULL value does not get the aggregate value causing a failure:
> {code:java}
> < | | 5000.00 |
> ---
> > | | | {code}
> However, swapping out the `AS MEASURE` column reference with an inline
> aggregate results in the correct data:
> {code:java}
> SELECT mgr, avg(sal) as avg_sal FROM empm
> GROUP BY mgr;
> +------+---------+
> | MGR | AVG_SAL |
> +------+---------+
> | 7566 | 3000.00 |
> | 7698 | 1310.00 |
> | 7782 | 1300.00 |
> | 7788 | 1100.00 |
> | 7839 | 2758.33 |
> | 7902 | 800.00 |
> | | 5000.00 |
> +------+---------+
> (7 rows)
> {code}
>
--
This message was sent by Atlassian Jira
(v8.20.10#820010)