[
https://issues.apache.org/jira/browse/CALCITE-6452?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17881617#comment-17881617
]
Adam Markowitz edited comment on CALCITE-6452 at 9/13/24 4:36 PM:
------------------------------------------------------------------
Digging deeper, I'm hopeful the following will be more clear:
* The decorrelator reviews each correlated variable in the Frame for the right
relation
* each of these correlates are then used to _unconditionally_ create a
_relBuilder.equals()_ condition on the new inputs for both the left and right
relations to be used as a JOIN condition
* each of these conditions are then _RexUtil.composeConjunction() -_ composed
as an AND
So, you effectively always end up with expressions of the form `AND(=($2, $4),
=($3, $5))`.
However, this is incorrect for at least the following 2 cases.
# The correlated subquery has a filter that includes a disjunction (i.e. OR)
between 2 expressions
# The correlated subquery has a filter that includes a conjunction (i.e. AND)
between 2 operands
Examples of failures to decorrelate properly during logical planning:
Example 1 => using = instead of AND:
{code:java}
select e.ename,
(select count(*)
from emp as f
where f.comm IS NULL AND e.comm IS NULL
) as c
from emp as e; {code}
produces the following logical plan:
{code:java}
LogicalProject(ENAME=[$1], C=[CASE(IS NULL($4), 0:BIGINT, $4)])
LogicalJoin(condition=[=($2, $3)], joinType=[left])
LogicalProject(EMPNO=[$0], ENAME=[$1], $f8=[IS NULL($6)])
LogicalTableScan(table=[[scott, EMP]])
LogicalAggregate(group=[{0}], EXPR$0=[COUNT()])
LogicalProject($f8=[$2])
LogicalJoin(condition=[true], joinType=[inner])
LogicalFilter(condition=[IS NULL($1)])
LogicalProject(EMPNO=[$0], COMM=[$6])
LogicalTableScan(table=[[scott, EMP]])
LogicalFilter(condition=[$0])
LogicalAggregate(group=[{0}])
LogicalProject($f8=[IS NULL($6)])
LogicalTableScan(table=[[scott, EMP]]) {code}
The top level join generated during decorrelation has `condition=[=($2, $3)]`
instead of `condition=[AND($2, $3)]`. Not only does the generated janino code
not compile, but if it did the caller will get incorrect results as the
condition should only be true for TRUE AND TRUE however the current generated
condition is true for both TRUE = TRUE as well as FALSE = FALSE.
Example 2 => using conjunction instead of disjunction:
{code:java}
select e.ename,
(select count(*) from emp as f
where f.comm = e.comm OR (e.sal > 500.00)) as c
from emp as e; {code}
produces the following logical plan:
{code:java}
LogicalProject(ENAME=[$1], C=[CASE(IS NULL($6), 0:BIGINT, $6)])
LogicalJoin(condition=[AND(=($2, $4), =($3, $5))], joinType=[left])
LogicalProject(EMPNO=[$0], ENAME=[$1], COMM=[$6], $f9=[>($5,
500.00:DECIMAL(5, 2))])
LogicalTableScan(table=[[scott, EMP]])
LogicalAggregate(group=[{0, 1}], EXPR$0=[COUNT()])
LogicalProject(COMM0=[$2], $f9=[$3])
LogicalJoin(condition=[OR(=($1, $2), $3)], joinType=[inner])
LogicalProject(EMPNO=[$0], COMM=[$6])
LogicalTableScan(table=[[scott, EMP]])
LogicalAggregate(group=[{0, 1}])
LogicalProject(COMM=[$6], $f9=[>($5, 500.00:DECIMAL(5, 2))])
LogicalTableScan(table=[[scott, EMP]]) {code}
The top level join generated during decorrelation has `condition=[AND(=($2,
$4), =($3, $5))]` instead of `condition=[OR(=($2, $4), =($3, $5))]` or
leveraging a more complex structure of joins. The caller will get incorrect
results as the condition should only be true for EITHER =($2, $4) _or_ =($3,
$5) however the currently generated condition is true only when both
expressions are true.
The original test case here triggers both of these issues as
{code:java}
select e.ename,
(select count(*) from emp as f
where f.comm is not distinct from e.comm) as c
from emp as e; {code}
is the same as
{code:java}
select e.ename,
(select count(*) from emp as f
where f.comm = e.comm OR -- <-- disjunction failure in decorrelation
f.comm IS NULL AND e.comm IS NULL -- <-- AND vs = failure in decorrelation
) as c
from emp as e;
{code}
was (Author: JIRAUSER305909):
Digging deeper, I'n hopeful the following will be more clear:
* The decorrelator reviews each correlated variable in the Frame for the right
relation
* each of these correlates are then used to _unconditionally_ create a
_relBuilder.equals()_ condition on the new inputs for both the left and right
relations to be used as a JOIN condition
* each of these conditions are then _RexUtil.composeConjunction() -_ composed
as an AND
So, you effectively always end up with expressions of the form `AND(=($2, $4),
=($3, $5))`.
However, this is incorrect for at least the following 2 cases.
# The correlated subquery has a filter that includes a disjunction (i.e. OR)
between 2 expressions
# The correlated subquery has a filter that includes a conjunction (i.e. AND)
between 2 operands
Examples of failures to decorrelate properly during logical planning:
Example 1 => using = instead of AND:
{code:java}
select e.ename,
(select count(*)
from emp as f
where f.comm IS NULL AND e.comm IS NULL
) as c
from emp as e; {code}
produces the following logical plan:
{code:java}
LogicalProject(ENAME=[$1], C=[CASE(IS NULL($4), 0:BIGINT, $4)])
LogicalJoin(condition=[=($2, $3)], joinType=[left])
LogicalProject(EMPNO=[$0], ENAME=[$1], $f8=[IS NULL($6)])
LogicalTableScan(table=[[scott, EMP]])
LogicalAggregate(group=[{0}], EXPR$0=[COUNT()])
LogicalProject($f8=[$2])
LogicalJoin(condition=[true], joinType=[inner])
LogicalFilter(condition=[IS NULL($1)])
LogicalProject(EMPNO=[$0], COMM=[$6])
LogicalTableScan(table=[[scott, EMP]])
LogicalFilter(condition=[$0])
LogicalAggregate(group=[{0}])
LogicalProject($f8=[IS NULL($6)])
LogicalTableScan(table=[[scott, EMP]]) {code}
The top level join generated during decorrelation has `condition=[=($2, $3)]`
instead of `condition=[AND($2, $3)]`. Not only does the generated janino code
not compile, but if it did the caller will get incorrect results as the
condition should only be true for TRUE AND TRUE however the current generated
condition is true for both TRUE = TRUE as well as FALSE = FALSE.
Example 2 => using conjunction instead of disjunction:
{code:java}
select e.ename,
(select count(*) from emp as f
where f.comm = e.comm OR (e.sal > 500.00)) as c
from emp as e; {code}
produces the following logical plan:
{code:java}
LogicalProject(ENAME=[$1], C=[CASE(IS NULL($6), 0:BIGINT, $6)])
LogicalJoin(condition=[AND(=($2, $4), =($3, $5))], joinType=[left])
LogicalProject(EMPNO=[$0], ENAME=[$1], COMM=[$6], $f9=[>($5,
500.00:DECIMAL(5, 2))])
LogicalTableScan(table=[[scott, EMP]])
LogicalAggregate(group=[{0, 1}], EXPR$0=[COUNT()])
LogicalProject(COMM0=[$2], $f9=[$3])
LogicalJoin(condition=[OR(=($1, $2), $3)], joinType=[inner])
LogicalProject(EMPNO=[$0], COMM=[$6])
LogicalTableScan(table=[[scott, EMP]])
LogicalAggregate(group=[{0, 1}])
LogicalProject(COMM=[$6], $f9=[>($5, 500.00:DECIMAL(5, 2))])
LogicalTableScan(table=[[scott, EMP]]) {code}
The top level join generated during decorrelation has `condition=[AND(=($2,
$4), =($3, $5))]` instead of `condition=[OR(=($2, $4), =($3, $5))]` or
leveraging a more complex structure of joins. The caller will get incorrect
results as the condition should only be true for EITHER =($2, $4) _or_ =($3,
$5) however the currently generated condition is true only when both
expressions are true.
The original test case here triggers both of these issues as
{code:java}
select e.ename,
(select count(*) from emp as f
where f.comm is not distinct from e.comm) as c
from emp as e; {code}
is the same as
{code:java}
select e.ename,
(select count(*) from emp as f
where f.comm = e.comm OR -- <-- disjunction failure in decorrelation
f.comm IS NULL AND e.comm IS NULL -- <-- AND vs = failure in decorrelation
) as c
from emp as e;
{code}
> Scalar sub-query that uses IS NOT DISTINCT FROM returns incorrect result
> ------------------------------------------------------------------------
>
> Key: CALCITE-6452
> URL: https://issues.apache.org/jira/browse/CALCITE-6452
> Project: Calcite
> Issue Type: Improvement
> Reporter: Julian Hyde
> Assignee: Adam Markowitz
> Priority: Major
>
> Scalar sub-query that uses an {{IS NOT DISTINCT FROM}} condition returns
> incorrect result. For example,
> {code:java}
> select e.ename,
> (select count(*)
> from emp as f
> where f.comm is not distinct from e.comm) as c
> from emp as e{code}
> returns
> {noformat}
> +--------+----+
> | ENAME | C |
> +--------+----+
> | ADAMS | 0 |
> | ALLEN | 1 |
> | BLAKE | 0 |
> | CLARK | 0 |
> | FORD | 0 |
> | JAMES | 0 |
> | JONES | 0 |
> | KING | 0 |
> | MARTIN | 1 |
> | MILLER | 0 |
> | SCOTT | 0 |
> | SMITH | 0 |
> | TURNER | 1 |
> | WARD | 1 |
> +--------+----+
> (14 rows)
> {noformat}
> but should return
> {noformat}
> +--------+----+
> | ENAME | C |
> +--------+----+
> | ADAMS | 0 |
> | ALLEN | 10 |
> | BLAKE | 0 |
> | CLARK | 0 |
> | FORD | 0 |
> | JAMES | 0 |
> | JONES | 0 |
> | KING | 0 |
> | MARTIN | 10 |
> | MILLER | 0 |
> | SCOTT | 0 |
> | SMITH | 0 |
> | TURNER | 10 |
> | WARD | 10 |
> +--------+----+
> (14 rows)
> {noformat}
> Also, and perhaps related, if we add a {{WHERE}} to the above query, like
> this:
> {code:java}
> select e.ename,
> (select count(*)
> from emp as f
> where f.comm is not distinct from e.comm) as c
> from emp as e
> where e.deptno = 10{code}
> Calcite throws:
> {noformat}
> Caused by: java.lang.ArrayIndexOutOfBoundsException: Index 6 out of bounds
> for length 5
> > at
> > com.google.common.collect.RegularImmutableList.get(RegularImmutableList.java:77)
> > at org.apache.calcite.rex.RexBuilder.makeInputRef(RexBuilder.java:1037)
> > at
> > org.apache.calcite.sql2rel.RelDecorrelator$AdjustProjectForCountAggregateRule.lambda$onMatch2$4(RelDecorrelator.java:2679)
> {noformat}
> I ran into this error while trying to rewrite queries that had measures and
> used {{ROLLUP}}. (Since {{ROLLUP}} will generate group keys whose value is
> NULL, we cannot use regular {{=}} when doing a self-join.)
--
This message was sent by Atlassian Jira
(v8.20.10#820010)