[jira] [Comment Edited] (CALCITE-7031) Implement the general decorrelation algorithm (Neumann & Kemper)

Silun Dong (Jira) Fri, 21 Nov 2025 19:34:05 -0800


    [ 
https://issues.apache.org/jira/browse/CALCITE-7031?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18036100#comment-18036100
 ]


Silun Dong edited comment on CALCITE-7031 at 11/22/25 3:33 AM:
---------------------------------------------------------------

For points 2 and 3, I believe the existing framework, once extended and fixed, 
will support them. Regarding mark join and literal_agg(), they are two 
rewriting strategies with the same final effect. From my observation, in a few 
cases, such as NOT IN subquery, the plan produced by the existing framework 
appears relatively more complex at first glance.
{code:java}
select empno from emp where empno not in (select empno from emp_b where 
emp.ename = emp_b.ename) 

// existing
LogicalProject(EMPNO=[$0])
  LogicalFilter(condition=[OR(=($9, 0), IS NOT TRUE(OR(IS NOT NULL($13), <($10, 
$9))))])
    LogicalJoin(condition=[AND(=($0, $12), =($1, $14))], joinType=[left])
      LogicalJoin(condition=[=($1, $11)], joinType=[left])
        LogicalTableScan(table=[[CATALOG, SALES, EMP]])
        LogicalProject(c=[CASE(IS NOT NULL($2), $2, 0)], ck=[CASE(IS NOT 
NULL($2), $2, 0)], ENAME=[$0])
          LogicalJoin(condition=[IS NOT DISTINCT FROM($0, $1)], joinType=[left])
            LogicalAggregate(group=[{1}])
              LogicalTableScan(table=[[CATALOG, SALES, EMP]])
            LogicalAggregate(group=[{0}], c=[COUNT()])
              LogicalProject(ENAME=[$1])
                LogicalTableScan(table=[[CATALOG, SALES, EMP_B]])
      LogicalProject(EMPNO=[$0], i=[true], ENAME=[$1])
        LogicalTableScan(table=[[CATALOG, SALES, EMP_B]])

// new
LogicalProject(EMPNO=[$0])
  LogicalFilter(condition=[NOT($9)])
    LogicalJoin(condition=[AND(=($0, $9), IS NOT DISTINCT FROM($1, $10))], 
joinType=[mark])
      LogicalTableScan(table=[[CATALOG, SALES, EMP]])
      LogicalProject(EMPNO=[$0], ENAME=[$1])
        LogicalTableScan(table=[[CATALOG, SALES, EMP_B]]){code}
In most typical scenarios, there may be no absolute superiority between the 
two. It's more about the choice of rewriting strategies (using aggregate or 
mark join). I believe that existing framework are time-tested and excellent. 
One possible direction is to consider introducing the new algorithm so that 
users have more options under different scenarios.
Not all mark join can be simplified. The rule for simplifying mark join is 
applied after decorrelation.


was (Author: JIRAUSER308615):
For points 2 and 3, I believe the existing framework, once extended and fixed, 
will support them. Regarding mark join and literal_agg(), they are two 
rewriting strategies with the same final effect. From my observation, in a few 
cases, such as NOT IN subquery, the plan produced by the existing framework 
appears relatively more complex at first glance.
{code:java}
select empno from emp where empno not in (select empno from emp_b where 
emp.ename = emp_b.ename) 

// existing
LogicalProject(EMPNO=[$0])
  LogicalFilter(condition=[OR(=($9, 0), IS NOT TRUE(OR(IS NOT NULL($13), <($10, 
$9))))])
    LogicalJoin(condition=[AND(=($0, $12), =($1, $14))], joinType=[left])
      LogicalJoin(condition=[=($1, $11)], joinType=[left])
        LogicalTableScan(table=[[CATALOG, SALES, EMP]])
        LogicalProject(c=[CASE(IS NOT NULL($2), $2, 0)], ck=[CASE(IS NOT 
NULL($2), $2, 0)], ENAME=[$0])
          LogicalJoin(condition=[IS NOT DISTINCT FROM($0, $1)], joinType=[left])
            LogicalAggregate(group=[{1}])
              LogicalTableScan(table=[[CATALOG, SALES, EMP]])
            LogicalAggregate(group=[{0}], c=[COUNT()])
              LogicalProject(ENAME=[$1])
                LogicalTableScan(table=[[CATALOG, SALES, EMP_B]])
      LogicalProject(EMPNO=[$0], i=[true], ENAME=[$1])
        LogicalTableScan(table=[[CATALOG, SALES, EMP_B]])

// new
LogicalProject(EMPNO=[$0])
  LogicalFilter(condition=[NOT($9)])
    LogicalJoin(condition=[AND(=($0, $9), IS NOT DISTINCT FROM($1, $10))], 
joinType=[mark])
      LogicalTableScan(table=[[CATALOG, SALES, EMP]])
      LogicalProject(EMPNO=[$0], ENAME0=[$10])
        LogicalJoin(condition=[=($10, $1)], joinType=[inner])
          LogicalTableScan(table=[[CATALOG, SALES, EMP_B]])
          LogicalAggregate(group=[{0}])
            LogicalProject(ENAME=[$1])
              LogicalTableScan(table=[[CATALOG, SALES, EMP]]){code}
In most typical scenarios, there may be no absolute superiority between the 
two. It's more about the choice of rewriting strategies (using aggregate or 
mark join). I believe that existing framework are time-tested and excellent. 
One possible direction is to consider introducing the new algorithm so that 
users have more options under different scenarios.
Not all mark join can be simplified. The rule for simplifying mark join is 
applied after decorrelation.

> Implement the general decorrelation algorithm (Neumann & Kemper)
> ----------------------------------------------------------------
>
>                 Key: CALCITE-7031
>                 URL: https://issues.apache.org/jira/browse/CALCITE-7031
>             Project: Calcite
>          Issue Type: Wish
>          Components: core
>    Affects Versions: 1.39.0
>            Reporter: Mihai Budiu
>            Priority: Minor
>              Labels: pull-request-available
>
> Today Calcite uses a heuristic decorrelator which looks for some specific 
> patterns and replaces them with equivalent ones. This can handle a restricted 
> set of queries.
> There is a general algorithm for doing this, at least for the standard 
> operators: 
> https://github.com/lonng/db-papers/blob/main/papers/nested-query/unnesting-arbitrary-queries.pdf
> The algorithm is a series of rewrites rules, each of which pushes the 
> decorrelation operators down in the plan towards leaves, until they can be 
> eliminated. 
> It would be great if Calcite had an implementation of this algorithm. This 
> could be contributed in pieces, where each rewrite rule is a separate PR. I 
> think these rules could be easily implemented in the existing planner rewrite 
> framework.
> There are three design problems that I can foresee:
> 1) The plans generated by this algorithm are in general DAGs and not trees. I 
> think trees would work, but they have the potential to be much larger than 
> the corresponding DAGs (perhaps exponentially larger in the size of the 
> original plan). I understand that there is a Calcite operator for 
> representing DAGs; I don't know if that's the goal of the Spool operator - it 
> seems to imply some kind of materialization of the result. The most important 
> decision is how to represent DAG plans such that the rewriting framework 
> continues to operate correctly.
> 2) A second issue is that the rewrite rules in the paper use a restricted 
> form of relation D which has some nice properties (e.g., it is a set). I am 
> not sure how such information can be represented in a Calcite plan, but I 
> suspect this can be done.
> 3) Third, while the algorithm in the paper handles many SQL-like operators, 
> the Calcite IR is even richer, supporting operators like Window, Cubes, 
> Unnest, recursive queries, etc. I don't know how the algorithm would extend 
> to plans containing such operators. But even if it doesn't handle all such 
> operators, a general-purpose decorrelator would be a significant improvement 
> over the existing one.
> This project would also give us the chance to close many issues related to 
> the current decorrelator, some of which have been unsolved over many years.
> Please comment here if you are interested in this project. I think the most 
> important problem to address is no 1 above, the handling of DAGs.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

[jira] [Comment Edited] (CALCITE-7031) Implement the general decorrelation algorithm (Neumann & Kemper)

Reply via email to