[jira] [Comment Edited] (CALCITE-6452) Scalar sub-query that uses IS NOT DISTINCT FROM returns incorrect result

Adam Markowitz (Jira) Fri, 13 Sep 2024 09:37:06 -0700


    [ 
https://issues.apache.org/jira/browse/CALCITE-6452?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17881617#comment-17881617
 ]


Adam Markowitz edited comment on CALCITE-6452 at 9/13/24 4:36 PM:
------------------------------------------------------------------

Digging deeper, I'm hopeful the following will be more clear:
 * The decorrelator reviews each correlated variable in the Frame for the right 
relation
 * each of these correlates are then used to _unconditionally_ create a 
_relBuilder.equals()_ condition on the new inputs for both the left and right 
relations to be used as a JOIN condition
 * each of these conditions are then _RexUtil.composeConjunction() -_ composed 
as an AND

So, you effectively always end up with expressions of the form `AND(=($2, $4), 
=($3, $5))`.

However, this is incorrect for at least the following 2 cases.
 # The correlated subquery has a filter that includes a disjunction (i.e. OR) 
between 2 expressions
 # The correlated subquery has a filter that includes a conjunction (i.e. AND) 
between 2 operands

Examples of failures to decorrelate properly during logical planning:

Example 1 => using = instead of AND:

 
{code:java}
select e.ename,
  (select count(*)
    from emp as f 
      where f.comm IS NULL AND e.comm IS NULL
  ) as c 
    from emp as e; {code}
produces the following logical plan:
{code:java}
LogicalProject(ENAME=[$1], C=[CASE(IS NULL($4), 0:BIGINT, $4)])
  LogicalJoin(condition=[=($2, $3)], joinType=[left])
    LogicalProject(EMPNO=[$0], ENAME=[$1], $f8=[IS NULL($6)])
      LogicalTableScan(table=[[scott, EMP]])
    LogicalAggregate(group=[{0}], EXPR$0=[COUNT()])
      LogicalProject($f8=[$2])
        LogicalJoin(condition=[true], joinType=[inner])
          LogicalFilter(condition=[IS NULL($1)])
            LogicalProject(EMPNO=[$0], COMM=[$6])
              LogicalTableScan(table=[[scott, EMP]])
          LogicalFilter(condition=[$0])
            LogicalAggregate(group=[{0}])
              LogicalProject($f8=[IS NULL($6)])
                LogicalTableScan(table=[[scott, EMP]]) {code}
The top level join generated during decorrelation has `condition=[=($2, $3)]` 
instead of `condition=[AND($2, $3)]`.  Not only does the generated janino code 
not compile, but if it did the caller will get incorrect results as the 
condition should only be true for TRUE AND TRUE however the current generated 
condition is true for both TRUE = TRUE as well as FALSE = FALSE.

Example 2 => using conjunction instead of disjunction:
{code:java}
select e.ename,
  (select count(*) from emp as f
     where f.comm = e.comm OR (e.sal > 500.00)) as c
from emp as e; {code}
produces the following logical plan:
{code:java}
LogicalProject(ENAME=[$1], C=[CASE(IS NULL($6), 0:BIGINT, $6)])
  LogicalJoin(condition=[AND(=($2, $4), =($3, $5))], joinType=[left])
    LogicalProject(EMPNO=[$0], ENAME=[$1], COMM=[$6], $f9=[>($5, 
500.00:DECIMAL(5, 2))])
      LogicalTableScan(table=[[scott, EMP]])
    LogicalAggregate(group=[{0, 1}], EXPR$0=[COUNT()])
      LogicalProject(COMM0=[$2], $f9=[$3])
        LogicalJoin(condition=[OR(=($1, $2), $3)], joinType=[inner])
          LogicalProject(EMPNO=[$0], COMM=[$6])
            LogicalTableScan(table=[[scott, EMP]])
          LogicalAggregate(group=[{0, 1}])
            LogicalProject(COMM=[$6], $f9=[>($5, 500.00:DECIMAL(5, 2))])
              LogicalTableScan(table=[[scott, EMP]]) {code}
The top level join generated during decorrelation has `condition=[AND(=($2, 
$4), =($3, $5))]` instead of `condition=[OR(=($2, $4), =($3, $5))]` or 
leveraging a more complex structure of joins.  The caller will get incorrect 
results as the condition should only be true for EITHER =($2, $4) _or_ =($3, 
$5) however the currently generated condition is true only when both 
expressions are true.

The original test case here triggers both of these issues as
{code:java}
select e.ename,
  (select count(*) from emp as f
     where f.comm is not distinct from e.comm) as c
from emp as e; {code}
is the same as
{code:java}
select e.ename,
  (select count(*) from emp as f
     where f.comm = e.comm OR -- <-- disjunction failure in decorrelation
     f.comm IS NULL AND e.comm IS NULL -- <-- AND vs = failure in decorrelation
  ) as c
from emp as e;
 {code}
 


was (Author: JIRAUSER305909):
Digging deeper, I'n hopeful the following will be more clear:
 * The decorrelator reviews each correlated variable in the Frame for the right 
relation
 * each of these correlates are then used to _unconditionally_ create a 
_relBuilder.equals()_ condition on the new inputs for both the left and right 
relations to be used as a JOIN condition
 * each of these conditions are then _RexUtil.composeConjunction() -_ composed 
as an AND

So, you effectively always end up with expressions of the form `AND(=($2, $4), 
=($3, $5))`.

However, this is incorrect for at least the following 2 cases.
 # The correlated subquery has a filter that includes a disjunction (i.e. OR) 
between 2 expressions
 # The correlated subquery has a filter that includes a conjunction (i.e. AND) 
between 2 operands

Examples of failures to decorrelate properly during logical planning:

Example 1 => using = instead of AND:

 
{code:java}
select e.ename,
  (select count(*)
    from emp as f 
      where f.comm IS NULL AND e.comm IS NULL
  ) as c 
    from emp as e; {code}
produces the following logical plan:
{code:java}
LogicalProject(ENAME=[$1], C=[CASE(IS NULL($4), 0:BIGINT, $4)])
  LogicalJoin(condition=[=($2, $3)], joinType=[left])
    LogicalProject(EMPNO=[$0], ENAME=[$1], $f8=[IS NULL($6)])
      LogicalTableScan(table=[[scott, EMP]])
    LogicalAggregate(group=[{0}], EXPR$0=[COUNT()])
      LogicalProject($f8=[$2])
        LogicalJoin(condition=[true], joinType=[inner])
          LogicalFilter(condition=[IS NULL($1)])
            LogicalProject(EMPNO=[$0], COMM=[$6])
              LogicalTableScan(table=[[scott, EMP]])
          LogicalFilter(condition=[$0])
            LogicalAggregate(group=[{0}])
              LogicalProject($f8=[IS NULL($6)])
                LogicalTableScan(table=[[scott, EMP]]) {code}
The top level join generated during decorrelation has `condition=[=($2, $3)]` 
instead of `condition=[AND($2, $3)]`.  Not only does the generated janino code 
not compile, but if it did the caller will get incorrect results as the 
condition should only be true for TRUE AND TRUE however the current generated 
condition is true for both TRUE = TRUE as well as FALSE = FALSE.

Example 2 => using conjunction instead of disjunction:
{code:java}
select e.ename,
  (select count(*) from emp as f
     where f.comm = e.comm OR (e.sal > 500.00)) as c
from emp as e; {code}
produces the following logical plan:
{code:java}
LogicalProject(ENAME=[$1], C=[CASE(IS NULL($6), 0:BIGINT, $6)])
  LogicalJoin(condition=[AND(=($2, $4), =($3, $5))], joinType=[left])
    LogicalProject(EMPNO=[$0], ENAME=[$1], COMM=[$6], $f9=[>($5, 
500.00:DECIMAL(5, 2))])
      LogicalTableScan(table=[[scott, EMP]])
    LogicalAggregate(group=[{0, 1}], EXPR$0=[COUNT()])
      LogicalProject(COMM0=[$2], $f9=[$3])
        LogicalJoin(condition=[OR(=($1, $2), $3)], joinType=[inner])
          LogicalProject(EMPNO=[$0], COMM=[$6])
            LogicalTableScan(table=[[scott, EMP]])
          LogicalAggregate(group=[{0, 1}])
            LogicalProject(COMM=[$6], $f9=[>($5, 500.00:DECIMAL(5, 2))])
              LogicalTableScan(table=[[scott, EMP]]) {code}
The top level join generated during decorrelation has `condition=[AND(=($2, 
$4), =($3, $5))]` instead of `condition=[OR(=($2, $4), =($3, $5))]` or 
leveraging a more complex structure of joins.  The caller will get incorrect 
results as the condition should only be true for EITHER =($2, $4) _or_ =($3, 
$5) however the currently generated condition is true only when both 
expressions are true.

The original test case here triggers both of these issues as
{code:java}
select e.ename,
  (select count(*) from emp as f
     where f.comm is not distinct from e.comm) as c
from emp as e; {code}
is the same as
{code:java}
select e.ename,
  (select count(*) from emp as f
     where f.comm = e.comm OR -- <-- disjunction failure in decorrelation
     f.comm IS NULL AND e.comm IS NULL -- <-- AND vs = failure in decorrelation
  ) as c
from emp as e;
 {code}
 

> Scalar sub-query that uses IS NOT DISTINCT FROM returns incorrect result
> ------------------------------------------------------------------------
>
>                 Key: CALCITE-6452
>                 URL: https://issues.apache.org/jira/browse/CALCITE-6452
>             Project: Calcite
>          Issue Type: Improvement
>            Reporter: Julian Hyde
>            Assignee: Adam Markowitz
>            Priority: Major
>
> Scalar sub-query that uses an {{IS NOT DISTINCT FROM}} condition returns 
> incorrect result. For example,
> {code:java}
> select e.ename,
>   (select count(*)
>     from emp as f
>     where f.comm is not distinct from e.comm) as c
> from emp as e{code}
> returns
> {noformat}
> +--------+----+
> | ENAME  | C  |
> +--------+----+
> | ADAMS  |  0 |
> | ALLEN  |  1 |
> | BLAKE  |  0 |
> | CLARK  |  0 |
> | FORD   |  0 |
> | JAMES  |  0 |
> | JONES  |  0 |
> | KING   |  0 |
> | MARTIN |  1 |
> | MILLER |  0 |
> | SCOTT  |  0 |
> | SMITH  |  0 |
> | TURNER |  1 |
> | WARD   |  1 |
> +--------+----+
> (14 rows)
> {noformat}
> but should return
> {noformat}
> +--------+----+
> | ENAME  | C  |
> +--------+----+
> | ADAMS  |  0 |
> | ALLEN  | 10 |
> | BLAKE  |  0 |
> | CLARK  |  0 |
> | FORD   |  0 |
> | JAMES  |  0 |
> | JONES  |  0 |
> | KING   |  0 |
> | MARTIN | 10 |
> | MILLER |  0 |
> | SCOTT  |  0 |
> | SMITH  |  0 |
> | TURNER | 10 |
> | WARD   | 10 |
> +--------+----+
> (14 rows)
> {noformat}
> Also, and perhaps related, if we add a {{WHERE}} to the above query, like 
> this:
> {code:java}
> select e.ename,
>   (select count(*)
>     from emp as f
>     where f.comm is not distinct from e.comm) as c
> from emp as e
> where e.deptno = 10{code}
> Calcite throws:
> {noformat}
> Caused by: java.lang.ArrayIndexOutOfBoundsException: Index 6 out of bounds 
> for length 5
> >     at 
> > com.google.common.collect.RegularImmutableList.get(RegularImmutableList.java:77)
> >     at org.apache.calcite.rex.RexBuilder.makeInputRef(RexBuilder.java:1037)
> >     at 
> > org.apache.calcite.sql2rel.RelDecorrelator$AdjustProjectForCountAggregateRule.lambda$onMatch2$4(RelDecorrelator.java:2679)
> {noformat}
> I ran into this error while trying to rewrite queries that had measures and 
> used {{ROLLUP}}. (Since {{ROLLUP}} will generate group keys whose value is 
> NULL, we cannot use regular {{=}} when doing a self-join.)



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

[jira] [Comment Edited] (CALCITE-6452) Scalar sub-query that uses IS NOT DISTINCT FROM returns incorrect result

Reply via email to