[jira] [Comment Edited] (CALCITE-7340) The rules governing the use of CorrelationId values in plans are not fully specified

Silun Dong (Jira) Tue, 23 Dec 2025 02:33:15 -0800


    [ 
https://issues.apache.org/jira/browse/CALCITE-7340?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18047227#comment-18047227
 ]


Silun Dong edited comment on CALCITE-7340 at 12/23/25 10:32 AM:
----------------------------------------------------------------

Thanks for pointing out CALCITE-5390. When a node contains multiple correlated 
subqueries, this can produce two Correlate with the same CorrelationId — I 
missed that. If I understand correctly, we have a consensus: there should not 
be multiple Correlate nodes that use the same CorrelationId. If this is what 
you and [~mbudiu] mean, I fully agree!

I've listed the core issues about CorrelationId below — please correct me if 
I've missed anything:
 # When a node has multiple correlated subqueries, the SubQueryRemoveRule might 
generate multiple Correlate nodes with the same CorrelationId, leading to 
incorrect decorrelation. For example:
{code:java}
select empno, 
  ( select count(empno) from emp where b.deptno = emp.deptno and b.ename = 
emp.ename ) cnt, 
  ( select sum(sal) from emp where b.deptno = emp.deptno and b.ename = 
emp.ename) sm 
from emp_b b


// initial plan
LogicalProject(variablesSet=[[$cor0]], EMPNO=[$0], CNT=[$SCALAR_QUERY({
LogicalAggregate(group=[{}], EXPR$0=[COUNT()])
  LogicalProject(EMPNO=[$0])
    LogicalFilter(condition=[AND(=($cor0.DEPTNO, $7), =($cor0.ENAME, $1))])
      LogicalTableScan(table=[[CATALOG, SALES, EMP]])
})], SM=[$SCALAR_QUERY({
LogicalAggregate(group=[{}], EXPR$0=[SUM($0)])
  LogicalProject(SAL=[$5])
    LogicalFilter(condition=[AND(=($cor0.DEPTNO, $7), =($cor0.ENAME, $1))])
      LogicalTableScan(table=[[CATALOG, SALES, EMP]])
})])
  LogicalTableScan(table=[[CATALOG, SALES, EMP_B]])


// after removing subqueries
LogicalProject(EMPNO=[$0], CNT=[$10], SM=[$11])
  LogicalCorrelate(correlation=[$cor0], joinType=[left], requiredColumns=[{1, 
7}])
    LogicalCorrelate(correlation=[$cor0], joinType=[left], requiredColumns=[{1, 
7}])
      LogicalTableScan(table=[[CATALOG, SALES, EMP_B]])
      LogicalAggregate(group=[{}], EXPR$0=[COUNT()])
        LogicalProject(EMPNO=[$0])
          LogicalFilter(condition=[AND(=($cor0.DEPTNO, $7), =($cor0.ENAME, 
$1))])
            LogicalTableScan(table=[[CATALOG, SALES, EMP]])
    LogicalAggregate(group=[{}], EXPR$0=[SUM($0)])
      LogicalProject(SAL=[$5])
        LogicalFilter(condition=[AND(=($cor0.DEPTNO, $7), =($cor0.ENAME, $1))])
          LogicalTableScan(table=[[CATALOG, SALES, EMP]]){code}

 # When a correlated subquery appears in a join ON condition, we don't know 
where the free variables in the subquery come from. Unlike Project/Filter, we 
cannot use the variablesSet to determine whether the free variables in the 
subquery belong to the current scope.

For the first issue, our goal after removing subqueries would be to produce a 
plan like:

 
{code:java}
LogicalProject(EMPNO=[$0], CNT=[$10], SM=[$11])
  LogicalCorrelate(correlation=[$cor1], joinType=[left], requiredColumns=[{1, 
7}])
    LogicalCorrelate(correlation=[$cor0], joinType=[left], requiredColumns=[{1, 
7}])
      LogicalTableScan(table=[[CATALOG, SALES, EMP_B]])
      LogicalAggregate(group=[{}], EXPR$0=[COUNT()])
        LogicalProject(EMPNO=[$0])
          LogicalFilter(condition=[AND(=($cor0.DEPTNO, $7), =($cor0.ENAME, 
$1))])
            LogicalTableScan(table=[[CATALOG, SALES, EMP]])
    LogicalAggregate(group=[{}], EXPR$0=[SUM($0)])
      LogicalProject(SAL=[$5])
        LogicalFilter(condition=[AND(=($cor1.DEPTNO, $7), =($cor1.ENAME, $1))])
          LogicalTableScan(table=[[CATALOG, SALES, EMP]]){code}
As you mentioned, we can use a visitor to normalize the plan from the root. 
Alternatively, could we consider making the change inside SubQueryRemoveRule? 
In the initial plan, the CorrelationIds in the variablesSet of Project/Filter 
are unique; it is the SubQueryRemoveRule that ends up reusing those 
CorrelationIds when generating Correlate nodes. 

 

For the second issue, a straightforward idea is to add a field to LogicalJoin 
with the same type ({{{}ImmutableSet<CorrelationId>{}}}) as the variablesSet. 
This field would represent the variable belongs to the current scope (i.e., the 
concatenation of the lhs and rhs rows) referenced by the correlated subqueries 
in the ON condition. This way, SubQueryRemoveRule can use this field—just as it 
does for Project/Filter—to determine whether the free variables in the 
subqueries come from the current scope.


was (Author: JIRAUSER308615):
Thanks for pointing out CALCITE-5390. When a node contains multiple correlated 
subqueries, this can produce two Correlate with the same CorrelationId — I 
missed that. If I understand correctly, we have a consensus: there should not 
be multiple Correlate nodes that use the same CorrelationId. If this is what 
you and [~mbudiu] mean, I fully agree!

I've listed the core issues about CorrelationId below — please correct me if 
I've missed anything:
 # When a node has multiple correlated subqueries, the SubQueryRemoveRule might 
generate multiple Correlate nodes with the same CorrelationId, leading to 
incorrect decorrelation. For example:
{code:java}
select empno, 
  ( select count(empno) from emp where b.deptno = emp.deptno and b.ename = 
emp.ename ) cnt, 
  ( select sum(sal) from emp where b.deptno = emp.deptno and b.ename = 
emp.ename) sm 
from emp_b b


// initial plan
LogicalProject(variablesSet=[[$cor0]], EMPNO=[$0], CNT=[$SCALAR_QUERY({
LogicalAggregate(group=[{}], EXPR$0=[COUNT()])
  LogicalProject(EMPNO=[$0])
    LogicalFilter(condition=[AND(=($cor0.DEPTNO, $7), =($cor0.ENAME, $1))])
      LogicalTableScan(table=[[CATALOG, SALES, EMP]])
})], SM=[$SCALAR_QUERY({
LogicalAggregate(group=[{}], EXPR$0=[SUM($0)])
  LogicalProject(SAL=[$5])
    LogicalFilter(condition=[AND(=($cor0.DEPTNO, $7), =($cor0.ENAME, $1))])
      LogicalTableScan(table=[[CATALOG, SALES, EMP]])
})])
  LogicalTableScan(table=[[CATALOG, SALES, EMP_B]])


// after removing subqueries
LogicalProject(EMPNO=[$0], CNT=[$10], SM=[$11])
  LogicalCorrelate(correlation=[$cor0], joinType=[left], requiredColumns=[{1, 
7}])
    LogicalCorrelate(correlation=[$cor0], joinType=[left], requiredColumns=[{1, 
7}])
      LogicalTableScan(table=[[CATALOG, SALES, EMP_B]])
      LogicalAggregate(group=[{}], EXPR$0=[COUNT()])
        LogicalProject(EMPNO=[$0])
          LogicalFilter(condition=[AND(=($cor0.DEPTNO, $7), =($cor0.ENAME, 
$1))])
            LogicalTableScan(table=[[CATALOG, SALES, EMP]])
    LogicalAggregate(group=[{}], EXPR$0=[SUM($0)])
      LogicalProject(SAL=[$5])
        LogicalFilter(condition=[AND(=($cor0.DEPTNO, $7), =($cor0.ENAME, $1))])
          LogicalTableScan(table=[[CATALOG, SALES, EMP]]){code}

 # When a correlated subquery appears in a join ON condition, we don't know 
where the free variables in the subquery come from. Unlike Project/Filter, we 
cannot use the variablesSet to determine whether the free variables in the 
subquery belong to the current scope.

For the first issue, our goal after removing subqueries would be to produce a 
plan like:

 
{code:java}
LogicalProject(EMPNO=[$0], CNT=[$10], SM=[$11])
  LogicalCorrelate(correlation=[$cor1], joinType=[left], requiredColumns=[{1, 
7}])
    LogicalCorrelate(correlation=[$cor0], joinType=[left], requiredColumns=[{1, 
7}])
      LogicalTableScan(table=[[CATALOG, SALES, EMP_B]])
      LogicalAggregate(group=[{}], EXPR$0=[COUNT()])
        LogicalProject(EMPNO=[$0])
          LogicalFilter(condition=[AND(=($cor0.DEPTNO, $7), =($cor0.ENAME, 
$1))])
            LogicalTableScan(table=[[CATALOG, SALES, EMP]])
    LogicalAggregate(group=[{}], EXPR$0=[SUM($0)])
      LogicalProject(SAL=[$5])
        LogicalFilter(condition=[AND(=($cor1.DEPTNO, $7), =($cor1.ENAME, $1))])
          LogicalTableScan(table=[[CATALOG, SALES, EMP]]){code}
As you mentioned, we can use a visitor to normalize the plan from the root. 
Alternatively, could we consider making the change inside SubQueryRemoveRule? 
In the initial plan, the CorrelationIds in the variablesSet of Project/Filter 
are unique; it is the SubQueryRemoveRule that ends up reusing those 
CorrelationIds when generating Correlate nodes. 

 

For the second issue, a straightforward idea is to add a field to LogicalJoin 
with the same type ({{{}ImmutableSet<CorrelationId>{}}}) as the variablesSet. 
This field would represent the variable belongs to the current scope (i.e., the 
concatenation of the lhs and rhs rows) referenced by the correlated subqueries 
in the ON condition. This way, SubQueryRemoveRule can use this field—just as it 
does for Project/Filter—to determine whether the free variables in the 
subqueries come from the current scope.

> The rules governing the use of CorrelationId values in plans are not fully 
> specified
> ------------------------------------------------------------------------------------
>
>                 Key: CALCITE-7340
>                 URL: https://issues.apache.org/jira/browse/CALCITE-7340
>             Project: Calcite
>          Issue Type: Bug
>          Components: core
>    Affects Versions: 1.41.0
>            Reporter: Mihai Budiu
>            Priority: Minor
>
> This issue is really about the Calcite internal representation of Rel nodes.
> There have been several recent discussions about manipulating plans that 
> contain CorrelationId values, and the conclusion seems to be that the rules 
> governing the use of such variables is not clear.
> Ideally these rules should be spelled out in a specification, and there 
> should be a tool to enforce them by validating plans. The JavaDoc for this 
> tool may be the right place to write the specification. I don't expect that 
> the specification will be long or complicated.
> RelBuilder may not be the right place to enforce such rules, because it 
> usually does not have visibility over the entire plan, and some of these 
> rules have to apply globally over entire plans. 
> See CALCITE-5784, CALCITE-7045 and the discussion in github over CALCITE-7336 
> for examples.
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

[jira] [Comment Edited] (CALCITE-7340) The rules governing the use of CorrelationId values in plans are not fully specified

Reply via email to