[
https://issues.apache.org/jira/browse/CALCITE-7340?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18047227#comment-18047227
]
Silun Dong edited comment on CALCITE-7340 at 12/23/25 10:32 AM:
----------------------------------------------------------------
Thanks for pointing out CALCITE-5390. When a node contains multiple correlated
subqueries, this can produce two Correlate with the same CorrelationId — I
missed that. If I understand correctly, we have a consensus: there should not
be multiple Correlate nodes that use the same CorrelationId. If this is what
you and [~mbudiu] mean, I fully agree!
I've listed the core issues about CorrelationId below — please correct me if
I've missed anything:
# When a node has multiple correlated subqueries, the SubQueryRemoveRule might
generate multiple Correlate nodes with the same CorrelationId, leading to
incorrect decorrelation. For example:
{code:java}
select empno,
( select count(empno) from emp where b.deptno = emp.deptno and b.ename =
emp.ename ) cnt,
( select sum(sal) from emp where b.deptno = emp.deptno and b.ename =
emp.ename) sm
from emp_b b
// initial plan
LogicalProject(variablesSet=[[$cor0]], EMPNO=[$0], CNT=[$SCALAR_QUERY({
LogicalAggregate(group=[{}], EXPR$0=[COUNT()])
LogicalProject(EMPNO=[$0])
LogicalFilter(condition=[AND(=($cor0.DEPTNO, $7), =($cor0.ENAME, $1))])
LogicalTableScan(table=[[CATALOG, SALES, EMP]])
})], SM=[$SCALAR_QUERY({
LogicalAggregate(group=[{}], EXPR$0=[SUM($0)])
LogicalProject(SAL=[$5])
LogicalFilter(condition=[AND(=($cor0.DEPTNO, $7), =($cor0.ENAME, $1))])
LogicalTableScan(table=[[CATALOG, SALES, EMP]])
})])
LogicalTableScan(table=[[CATALOG, SALES, EMP_B]])
// after removing subqueries
LogicalProject(EMPNO=[$0], CNT=[$10], SM=[$11])
LogicalCorrelate(correlation=[$cor0], joinType=[left], requiredColumns=[{1,
7}])
LogicalCorrelate(correlation=[$cor0], joinType=[left], requiredColumns=[{1,
7}])
LogicalTableScan(table=[[CATALOG, SALES, EMP_B]])
LogicalAggregate(group=[{}], EXPR$0=[COUNT()])
LogicalProject(EMPNO=[$0])
LogicalFilter(condition=[AND(=($cor0.DEPTNO, $7), =($cor0.ENAME,
$1))])
LogicalTableScan(table=[[CATALOG, SALES, EMP]])
LogicalAggregate(group=[{}], EXPR$0=[SUM($0)])
LogicalProject(SAL=[$5])
LogicalFilter(condition=[AND(=($cor0.DEPTNO, $7), =($cor0.ENAME, $1))])
LogicalTableScan(table=[[CATALOG, SALES, EMP]]){code}
# When a correlated subquery appears in a join ON condition, we don't know
where the free variables in the subquery come from. Unlike Project/Filter, we
cannot use the variablesSet to determine whether the free variables in the
subquery belong to the current scope.
For the first issue, our goal after removing subqueries would be to produce a
plan like:
{code:java}
LogicalProject(EMPNO=[$0], CNT=[$10], SM=[$11])
LogicalCorrelate(correlation=[$cor1], joinType=[left], requiredColumns=[{1,
7}])
LogicalCorrelate(correlation=[$cor0], joinType=[left], requiredColumns=[{1,
7}])
LogicalTableScan(table=[[CATALOG, SALES, EMP_B]])
LogicalAggregate(group=[{}], EXPR$0=[COUNT()])
LogicalProject(EMPNO=[$0])
LogicalFilter(condition=[AND(=($cor0.DEPTNO, $7), =($cor0.ENAME,
$1))])
LogicalTableScan(table=[[CATALOG, SALES, EMP]])
LogicalAggregate(group=[{}], EXPR$0=[SUM($0)])
LogicalProject(SAL=[$5])
LogicalFilter(condition=[AND(=($cor1.DEPTNO, $7), =($cor1.ENAME, $1))])
LogicalTableScan(table=[[CATALOG, SALES, EMP]]){code}
As you mentioned, we can use a visitor to normalize the plan from the root.
Alternatively, could we consider making the change inside SubQueryRemoveRule?
In the initial plan, the CorrelationIds in the variablesSet of Project/Filter
are unique; it is the SubQueryRemoveRule that ends up reusing those
CorrelationIds when generating Correlate nodes.
For the second issue, a straightforward idea is to add a field to LogicalJoin
with the same type ({{{}ImmutableSet<CorrelationId>{}}}) as the variablesSet.
This field would represent the variable belongs to the current scope (i.e., the
concatenation of the lhs and rhs rows) referenced by the correlated subqueries
in the ON condition. This way, SubQueryRemoveRule can use this field—just as it
does for Project/Filter—to determine whether the free variables in the
subqueries come from the current scope.
was (Author: JIRAUSER308615):
Thanks for pointing out CALCITE-5390. When a node contains multiple correlated
subqueries, this can produce two Correlate with the same CorrelationId — I
missed that. If I understand correctly, we have a consensus: there should not
be multiple Correlate nodes that use the same CorrelationId. If this is what
you and [~mbudiu] mean, I fully agree!
I've listed the core issues about CorrelationId below — please correct me if
I've missed anything:
# When a node has multiple correlated subqueries, the SubQueryRemoveRule might
generate multiple Correlate nodes with the same CorrelationId, leading to
incorrect decorrelation. For example:
{code:java}
select empno,
( select count(empno) from emp where b.deptno = emp.deptno and b.ename =
emp.ename ) cnt,
( select sum(sal) from emp where b.deptno = emp.deptno and b.ename =
emp.ename) sm
from emp_b b
// initial plan
LogicalProject(variablesSet=[[$cor0]], EMPNO=[$0], CNT=[$SCALAR_QUERY({
LogicalAggregate(group=[{}], EXPR$0=[COUNT()])
LogicalProject(EMPNO=[$0])
LogicalFilter(condition=[AND(=($cor0.DEPTNO, $7), =($cor0.ENAME, $1))])
LogicalTableScan(table=[[CATALOG, SALES, EMP]])
})], SM=[$SCALAR_QUERY({
LogicalAggregate(group=[{}], EXPR$0=[SUM($0)])
LogicalProject(SAL=[$5])
LogicalFilter(condition=[AND(=($cor0.DEPTNO, $7), =($cor0.ENAME, $1))])
LogicalTableScan(table=[[CATALOG, SALES, EMP]])
})])
LogicalTableScan(table=[[CATALOG, SALES, EMP_B]])
// after removing subqueries
LogicalProject(EMPNO=[$0], CNT=[$10], SM=[$11])
LogicalCorrelate(correlation=[$cor0], joinType=[left], requiredColumns=[{1,
7}])
LogicalCorrelate(correlation=[$cor0], joinType=[left], requiredColumns=[{1,
7}])
LogicalTableScan(table=[[CATALOG, SALES, EMP_B]])
LogicalAggregate(group=[{}], EXPR$0=[COUNT()])
LogicalProject(EMPNO=[$0])
LogicalFilter(condition=[AND(=($cor0.DEPTNO, $7), =($cor0.ENAME,
$1))])
LogicalTableScan(table=[[CATALOG, SALES, EMP]])
LogicalAggregate(group=[{}], EXPR$0=[SUM($0)])
LogicalProject(SAL=[$5])
LogicalFilter(condition=[AND(=($cor0.DEPTNO, $7), =($cor0.ENAME, $1))])
LogicalTableScan(table=[[CATALOG, SALES, EMP]]){code}
# When a correlated subquery appears in a join ON condition, we don't know
where the free variables in the subquery come from. Unlike Project/Filter, we
cannot use the variablesSet to determine whether the free variables in the
subquery belong to the current scope.
For the first issue, our goal after removing subqueries would be to produce a
plan like:
{code:java}
LogicalProject(EMPNO=[$0], CNT=[$10], SM=[$11])
LogicalCorrelate(correlation=[$cor1], joinType=[left], requiredColumns=[{1,
7}])
LogicalCorrelate(correlation=[$cor0], joinType=[left], requiredColumns=[{1,
7}])
LogicalTableScan(table=[[CATALOG, SALES, EMP_B]])
LogicalAggregate(group=[{}], EXPR$0=[COUNT()])
LogicalProject(EMPNO=[$0])
LogicalFilter(condition=[AND(=($cor0.DEPTNO, $7), =($cor0.ENAME,
$1))])
LogicalTableScan(table=[[CATALOG, SALES, EMP]])
LogicalAggregate(group=[{}], EXPR$0=[SUM($0)])
LogicalProject(SAL=[$5])
LogicalFilter(condition=[AND(=($cor1.DEPTNO, $7), =($cor1.ENAME, $1))])
LogicalTableScan(table=[[CATALOG, SALES, EMP]]){code}
As you mentioned, we can use a visitor to normalize the plan from the root.
Alternatively, could we consider making the change inside SubQueryRemoveRule?
In the initial plan, the CorrelationIds in the variablesSet of Project/Filter
are unique; it is the SubQueryRemoveRule that ends up reusing those
CorrelationIds when generating Correlate nodes.
For the second issue, a straightforward idea is to add a field to LogicalJoin
with the same type ({{{}ImmutableSet<CorrelationId>{}}}) as the variablesSet.
This field would represent the variable belongs to the current scope (i.e., the
concatenation of the lhs and rhs rows) referenced by the correlated subqueries
in the ON condition. This way, SubQueryRemoveRule can use this field—just as it
does for Project/Filter—to determine whether the free variables in the
subqueries come from the current scope.
> The rules governing the use of CorrelationId values in plans are not fully
> specified
> ------------------------------------------------------------------------------------
>
> Key: CALCITE-7340
> URL: https://issues.apache.org/jira/browse/CALCITE-7340
> Project: Calcite
> Issue Type: Bug
> Components: core
> Affects Versions: 1.41.0
> Reporter: Mihai Budiu
> Priority: Minor
>
> This issue is really about the Calcite internal representation of Rel nodes.
> There have been several recent discussions about manipulating plans that
> contain CorrelationId values, and the conclusion seems to be that the rules
> governing the use of such variables is not clear.
> Ideally these rules should be spelled out in a specification, and there
> should be a tool to enforce them by validating plans. The JavaDoc for this
> tool may be the right place to write the specification. I don't expect that
> the specification will be long or complicated.
> RelBuilder may not be the right place to enforce such rules, because it
> usually does not have visibility over the entire plan, and some of these
> rules have to apply globally over entire plans.
> See CALCITE-5784, CALCITE-7045 and the discussion in github over CALCITE-7336
> for examples.
>
--
This message was sent by Atlassian Jira
(v8.20.10#820010)