[jira] [Commented] (CALCITE-6452) Scalar sub-query that uses IS NOT DISTINCT FROM returns incorrect result

Adam Markowitz (Jira) Fri, 13 Sep 2024 10:24:05 -0700


    [ 
https://issues.apache.org/jira/browse/CALCITE-6452?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17881640#comment-17881640
 ]


Adam Markowitz commented on CALCITE-6452:
-----------------------------------------

Just had a discussion with [~julianhyde] about this.  We settled on changing 
the unconditional _relBuilder.equals()_ to a _relBuilder.isNotDistinctFrom()_ 
in 
[RelDecorrelator.java|https://github.com/apache/calcite/blob/91fe118d4f9592215ada50287a7039a27e0f2723/core/src/main/java/org/apache/calcite/sql2rel/RelDecorrelator.java#L1232]
 as using equals does not deal with NULLs properly.  This change results in the 
following logical plan which appears correct even though there may be some 
cruft (eg: in the below example the NULL check is planned for 2x in the 
condition).
{code:java}
select e.ename,
  (select count(*) from emp as f
     where f.comm is not distinct from e.comm) as c
from emp as e; {code}
{code:java}
LogicalProject(ENAME=[$1], C=[CASE(IS NULL($6), 0:BIGINT, $6)])
  LogicalJoin(condition=[AND(OR(AND(IS NULL($2), IS NULL($4)), =($2, $4)), 
=($3, $5))], joinType=[left])
    LogicalProject(EMPNO=[$0], ENAME=[$1], COMM=[$6], $f8=[IS NULL($6)])
      LogicalTableScan(table=[[scott, EMP]])
    LogicalAggregate(group=[{0, 1}], EXPR$0=[COUNT()])
      LogicalProject(COMM0=[$2], $f8=[$3])
        LogicalJoin(condition=[OR(AND(IS NULL($1), $3), =($1, $2))], 
joinType=[inner])
          LogicalProject(EMPNO=[$0], COMM=[$6])
            LogicalTableScan(table=[[scott, EMP]])
          LogicalAggregate(group=[{0, 1}])
            LogicalProject(COMM=[$6], $f8=[IS NULL($6)])
              LogicalTableScan(table=[[scott, EMP]]) {code}
walk through Example 1:
{code:java}
$2 = NULL
$4 = NULL
$3 = NULL IS NULL => TRUE
$5 = NULL IS NULL => TRUE 

AND(OR(AND(IS NULL($2), IS NULL($4)), =($2, $4)), =($3, $5))
=> AND(OR(AND(TRUE, TRUE), =(NULL, NULL), =(TRUE, TRUE)
=> AND(OR(AND(TRUE, TRUE), FALSE), TRUE)
=> AND(OR(TRUE, FALSE), TRUE)
=> AND(TRUE, TRUE)
=> TRUE (correct){code}
walk through Example 2:
{code:java}
$2 = 300
$4 = NULL
$3 = 300 IS NULL => FALSE
$5 = NULL IS NULL => TRUE

AND(OR(AND(IS NULL($2), IS NULL($4)), =($2, $4)), =($3, $5))
=> AND(OR(AND(FALSE, TRUE), =(300, NULL), =(FALSE, TRUE)
=> AND(OR(AND(FALSE, TRUE), FALSE), FALSE)
=> AND(OR(FALSE, FALSE), FALSE)
=> AND(FALSE, FALSE)
=> FALSE (correct){code}
walk through Example 3:
{code:java}
$2 = 300
$4 = 500
$3 = 300 IS NULL => FALSE
$5 = 500 IS NULL => FALSE

AND(OR(AND(IS NULL($2), IS NULL($4)), =($2, $4)), =($3, $5))
=> AND(OR(AND(FALSE, FALSE), =(300, 500), =(FALSE, FALSE)
=> AND(OR(AND(FALSE, FALSE), FALSE), TRUE)
=> AND(OR(FALSE, FALSE), TRUE)
=> AND(FALSE, TRUE)
=> FALSE (correct){code}
walk through Example 4:
{code:java}
$2 = 300
$4 = 300
$3 = 300 IS NULL => FALSE
$5 = 300 IS NULL => FALSE

AND(OR(AND(IS NULL($2), IS NULL($4)), =($2, $4)), =($3, $5))
=> AND(OR(AND(FALSE, FALSE), =(300, 300), =(FALSE, FALSE)
=> AND(OR(AND(FALSE, FALSE), TRUE), TRUE)
=> AND(OR(FALSE, TRUE), TRUE)
=> AND(TRUE, TRUE)
=> TRUE (correct){code}
 

So, the logical plan at this point appears correct however the physical plan is 
now re-simplifying the expression back to equals:
{code:java}
EnumerableCalc(expr#0..6=[{inputs}], expr#7=[IS NULL($t6)], expr#8=[0:BIGINT], 
expr#9=[CASE($t7, $t8, $t6)], ENAME=[$t1], C=[$t9]): rowcount = 14.0, 
cumulative cost = {193.57399999999998 rows, 1747.5104732091813 cpu, 0.0 io}, id 
= 235
  EnumerableMergeJoin(condition=[AND(=($2, $4), =($3, $5))], joinType=[left]): 
rowcount = 14.0, cumulative cost = {179.57399999999998 rows, 1579.5104732091813 
cpu, 0.0 io}, id = 231
    EnumerableSort(sort0=[$2], sort1=[$3], dir0=[ASC], dir1=[ASC]): rowcount = 
14.0, cumulative cost = {42.0 rows, 1231.5104732091813 cpu, 0.0 io}, id = 215
      EnumerableCalc(expr#0..7=[{inputs}], expr#8=[IS NULL($t6)], 
proj#0..1=[{exprs}], COMM=[$t6], $f8=[$t8]): rowcount = 14.0, cumulative cost = 
{28.0 rows, 197.0 cpu, 0.0 io}, id = 237
        EnumerableTableScan(table=[[scott, EMP]]): rowcount = 14.0, cumulative 
cost = {14.0 rows, 15.0 cpu, 0.0 io}, id = 142
    EnumerableSort(sort0=[$0], sort1=[$1], dir0=[ASC], dir1=[ASC]): rowcount = 
1.0, cumulative cost = {108.57399999999998 rows, 348.0 cpu, 0.0 io}, id = 229
      EnumerableAggregate(group=[{2, 3}], EXPR$0=[COUNT()]): rowcount = 1.0, 
cumulative cost = {107.57399999999998 rows, 324.0 cpu, 0.0 io}, id = 227
        EnumerableNestedLoopJoin(condition=[OR(AND(IS NULL($1), $3), =($1, 
$2))], joinType=[inner]): rowcount = 4.8999999999999995, cumulative cost = 
{106.44899999999998 rows, 324.0 cpu, 0.0 io}, id = 225
          EnumerableCalc(expr#0..7=[{inputs}], EMPNO=[$t0], COMM=[$t6]): 
rowcount = 14.0, cumulative cost = {28.0 rows, 155.0 cpu, 0.0 io}, id = 239
            EnumerableTableScan(table=[[scott, EMP]]): rowcount = 14.0, 
cumulative cost = {14.0 rows, 15.0 cpu, 0.0 io}, id = 142
          EnumerableAggregate(group=[{0, 1}]): rowcount = 1.4, cumulative cost 
= {29.4 rows, 169.0 cpu, 0.0 io}, id = 223
            EnumerableCalc(expr#0..7=[{inputs}], expr#8=[IS NULL($t6)], 
COMM=[$t6], $f8=[$t8]): rowcount = 14.0, cumulative cost = {28.0 rows, 169.0 
cpu, 0.0 io}, id = 241
              EnumerableTableScan(table=[[scott, EMP]]): rowcount = 14.0, 
cumulative cost = {14.0 rows, 15.0 cpu, 0.0 io}, id = 142
 {code}
key bit is here:
{code:java}
  EnumerableMergeJoin(condition=[AND(=($2, $4), =($3, $5))], joinType=[left]) 
{code}
which, I think, should be:
{code:java}
  EnumerableMergeJoin(condition=[AND(OR(AND(IS NULL($2), IS NULL($4)), =($2, 
$4)), =($3, $5))], joinType=[left]) {code}

> Scalar sub-query that uses IS NOT DISTINCT FROM returns incorrect result
> ------------------------------------------------------------------------
>
>                 Key: CALCITE-6452
>                 URL: https://issues.apache.org/jira/browse/CALCITE-6452
>             Project: Calcite
>          Issue Type: Improvement
>            Reporter: Julian Hyde
>            Assignee: Adam Markowitz
>            Priority: Major
>
> Scalar sub-query that uses an {{IS NOT DISTINCT FROM}} condition returns 
> incorrect result. For example,
> {code:java}
> select e.ename,
>   (select count(*)
>     from emp as f
>     where f.comm is not distinct from e.comm) as c
> from emp as e{code}
> returns
> {noformat}
> +--------+----+
> | ENAME  | C  |
> +--------+----+
> | ADAMS  |  0 |
> | ALLEN  |  1 |
> | BLAKE  |  0 |
> | CLARK  |  0 |
> | FORD   |  0 |
> | JAMES  |  0 |
> | JONES  |  0 |
> | KING   |  0 |
> | MARTIN |  1 |
> | MILLER |  0 |
> | SCOTT  |  0 |
> | SMITH  |  0 |
> | TURNER |  1 |
> | WARD   |  1 |
> +--------+----+
> (14 rows)
> {noformat}
> but should return
> {noformat}
> +--------+----+
> | ENAME  | C  |
> +--------+----+
> | ADAMS  |  0 |
> | ALLEN  | 10 |
> | BLAKE  |  0 |
> | CLARK  |  0 |
> | FORD   |  0 |
> | JAMES  |  0 |
> | JONES  |  0 |
> | KING   |  0 |
> | MARTIN | 10 |
> | MILLER |  0 |
> | SCOTT  |  0 |
> | SMITH  |  0 |
> | TURNER | 10 |
> | WARD   | 10 |
> +--------+----+
> (14 rows)
> {noformat}
> Also, and perhaps related, if we add a {{WHERE}} to the above query, like 
> this:
> {code:java}
> select e.ename,
>   (select count(*)
>     from emp as f
>     where f.comm is not distinct from e.comm) as c
> from emp as e
> where e.deptno = 10{code}
> Calcite throws:
> {noformat}
> Caused by: java.lang.ArrayIndexOutOfBoundsException: Index 6 out of bounds 
> for length 5
> >     at 
> > com.google.common.collect.RegularImmutableList.get(RegularImmutableList.java:77)
> >     at org.apache.calcite.rex.RexBuilder.makeInputRef(RexBuilder.java:1037)
> >     at 
> > org.apache.calcite.sql2rel.RelDecorrelator$AdjustProjectForCountAggregateRule.lambda$onMatch2$4(RelDecorrelator.java:2679)
> {noformat}
> I ran into this error while trying to rewrite queries that had measures and 
> used {{ROLLUP}}. (Since {{ROLLUP}} will generate group keys whose value is 
> NULL, we cannot use regular {{=}} when doing a self-join.)



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

[jira] [Commented] (CALCITE-6452) Scalar sub-query that uses IS NOT DISTINCT FROM returns incorrect result

Reply via email to