[ 
https://issues.apache.org/jira/browse/CALCITE-7340?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18047183#comment-18047183
 ] 

Silun Dong edited comment on CALCITE-7340 at 12/23/25 4:18 AM:
---------------------------------------------------------------

I'm not sure I fully understand your solution, so here is my own 
understanding:

It seems to me that CorrelationId and scope are closely related. The 
variablesSet property of Project and Filter usually contains either 0 or 1 
element (I haven’t encountered other cases so far). If there is one 
CorrelationId, this id represents a variable coming from the input that is 
referenced by expressions (for my understanding of the Join, please refer to 
the [discussion in 
github|https://github.com/apache/calcite/pull/4691#discussion_r2635417739] over 
CALCITE-7336). In other words, each scope corresponds to a unique 
CorrelationId. This idea is also confirmed by the logic of SqlToRelConverter: 
when generating Project/Filter, it collects the CorrelationIds used by the 
node’s expressions. If an id belongs to the current scope, it will be added to 
the variablesSet property; otherwise, the variablesSet is empty, meaning the 
free variable belongs to an outer scope.
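
As a rough illustration of the rule described above, here is a minimal sketch 
in plain Java. It is a toy model, not Calcite code: correlation ids are plain 
strings such as "$cor0", and the class name VariablesSetSketch and the method 
variablesSetFor are hypothetical names made up for this example.

```java
import java.util.LinkedHashSet;
import java.util.Set;

/** Toy model (not Calcite code): correlation ids are plain strings. */
final class VariablesSetSketch {
  private VariablesSetSketch() {}

  /**
   * Computes the variablesSet of a Project/Filter that is being created.
   *
   * @param usedIds ids referenced by the node's expressions
   * @param scopeId the single id that belongs to the current scope
   * @return a set holding scopeId if the expressions use it, else an empty set
   */
  static Set<String> variablesSetFor(Set<String> usedIds, String scopeId) {
    Set<String> variablesSet = new LinkedHashSet<>();
    if (usedIds.contains(scopeId)) {
      // The id is defined by this scope, so the node declares it.
      variablesSet.add(scopeId);
    }
    // Any other used id is a free variable owned by an outer scope,
    // so it is deliberately not recorded here.
    return variablesSet;
  }
}
```

Under this model, a node whose expressions use both "$cor0" and "$cor1" in a 
scope that owns "$cor1" declares only "$cor1"; "$cor0" is left to an outer 
scope.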

Regarding the use of CorrelationId, my main focus is on subquery removal and 
decorrelation. Taking Project and Filter as examples: when removing subqueries, 
we collect the CorrelationIds used in the node’s expressions and check whether 
they intersect with the variablesSet. If there is an intersection, the 
expression is correlated with the current scope and a Correlate is produced; 
otherwise, the expression is not correlated with the current scope and a Join 
is produced (the Correlate has already been generated in an outer scope). The 
decorrelation algorithm (at least the new one) is also based on this and can 
correctly handle complex nested correlations.
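
The intersection check described above can be sketched as follows. This is 
again a toy model in plain Java rather than the actual subquery-removal code: 
the class SubQueryRemovalSketch, the enum Rewrite and the method choose are 
hypothetical names, and correlation ids are modeled as strings.

```java
import java.util.Collections;
import java.util.Set;

/** Toy model (not Calcite code) of the Correlate-versus-Join decision. */
final class SubQueryRemovalSketch {
  private SubQueryRemovalSketch() {}

  /** The kind of operator the rewrite would produce in this model. */
  enum Rewrite { CORRELATE, JOIN }

  /**
   * Chooses the rewrite for a sub-query found in a Project/Filter.
   *
   * @param usedIds ids used by the node's expressions
   * @param variablesSet ids declared by the node (current scope)
   */
  static Rewrite choose(Set<String> usedIds, Set<String> variablesSet) {
    // No common id: the expression is not correlated with this scope,
    // so a plain Join is enough (an outer scope handles the Correlate).
    return Collections.disjoint(usedIds, variablesSet)
        ? Rewrite.JOIN
        : Rewrite.CORRELATE;
  }
}
```

In this model, choose returns CORRELATE when the used ids and the variablesSet 
share "$cor0", and JOIN when the variablesSet is empty.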

In my opinion, during the plan-rewrite phase, it is good for each scope to 
correspond to a unique CorrelationId (for the physical implementation phase, 
perhaps refer to EnumerableBatchNestedLoopJoinRule, which will not be 
discussed here). Maybe the only remaining issue is the handling of Join (refer 
to the [discussion in 
github|https://github.com/apache/calcite/pull/4691#discussion_r2635417739] over 
CALCITE-7336). If this concept is changed, subquery removal and decorrelation 
would likely be heavily affected.

Perhaps others have better insights; this is just for reference.


> The rules governing the use of CorrelationId values in plans are not fully 
> specified
> ------------------------------------------------------------------------------------
>
>                 Key: CALCITE-7340
>                 URL: https://issues.apache.org/jira/browse/CALCITE-7340
>             Project: Calcite
>          Issue Type: Bug
>          Components: core
>    Affects Versions: 1.41.0
>            Reporter: Mihai Budiu
>            Priority: Minor
>
> This issue is really about the Calcite internal representation of Rel nodes.
> There have been several recent discussions about manipulating plans that 
> contain CorrelationId values, and the conclusion seems to be that the rules 
> governing the use of such variables are not clear.
> Ideally these rules should be spelled out in a specification, and there 
> should be a tool to enforce them by validating plans. The JavaDoc for this 
> tool may be the right place to write the specification. I don't expect that 
> the specification will be long or complicated.
> RelBuilder may not be the right place to enforce such rules, because it 
> usually does not have visibility over the entire plan, and some of these 
> rules have to apply globally over entire plans. 
> See CALCITE-5784, CALCITE-7045 and the discussion in github over CALCITE-7336 
> for examples.
>  



