[jira] [Comment Edited] (CALCITE-5213) PROJECT_TO_SUBQUERY producing Incorrect tree from nested correlated subqueries in projections with correlations in filters.

Ian Bertolacci (Jira) Tue, 20 Aug 2024 09:18:29 -0700


    [ 
https://issues.apache.org/jira/browse/CALCITE-5213?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17875234#comment-17875234
 ]


Ian Bertolacci edited comment on CALCITE-5213 at 8/20/24 4:17 PM:
------------------------------------------------------------------

Here is some replication code in scala (all or our unit tests are in Scala, so 
its what I have on hand)
{code:scala}
// Constructing the tree
val rb = /* construct a relBuilder in whatever way you want */
rb
  .scan("T3")
  .variable{ middleVar: RexCorrelVariable =>
    rb.project(
      RexSubQuery.scalar(
        rb.scan("T2")
          .filter(rb.call(COps.EQUALS, rb.field("ID"), rb.dot(middleVar, 
"C302")))
          .variable{ innerVar: RexCorrelVariable =>
            rb.project(
              rb.call(
                COps.PLUS,
                rb.field("C202"),
                RexSubQuery.scalar(
                  rb.scan("T1")
                    .filter(rb.call(COps.EQUALS, rb.field("ID"), 
rb.dot(innerVar, "C201")))
                    .aggregate(rb.groupKey(), rb.aggregateCall(COps.SUM, 
rb.field("C101")))
                    .build()
                )
              )
            ).aggregate(rb.groupKey(), rb.aggregateCall(COps.SUM, rb.field(0)))
          }.build()
      )
    )
  }

val originalNode = rb.build()
println(originalNode.explain())

// The planner definition
val hepPlanner: HepPlanner = {
  import org.apache.calcite.rel.rules.CoreRules
  import scala.collection.JavaConverters._
  val program = 
HepProgram.builder().addRuleCollection(List[RelOptRule](CoreRules.PROJECT_SUB_QUERY_TO_CORRELATE).asJava).build()
  val planner = new HepPlanner(program, rb.getCluster.getPlanner.getContext)
  planner
}

// Invoke the planner
hepPlanner.setRoot(originalNode)
val newNode = hepPlanner.findBestExp()

println(newNode.explaion())
{code}

Explain from original
{code:}
8:LogicalProject($f0=[$SCALAR_QUERY(
│    7:LogicalAggregate(group=[{}], agg#0=[SUM($0)])
│    └── 6:LogicalProject($f0=[+($2, $SCALAR_QUERY(
│        │    5:LogicalAggregate(group=[{}], agg#0=[SUM($1)])
│        │    └── 4:LogicalFilter(condition=[=($0, $cor1.C201)])
│        │        └── 3:QueryTableScan(table=[T1], Schema=[ID:Dimension, 
C101:Decimal(0, 38)])
│        │))])
│        └── 2:LogicalFilter(condition=[=($0, $cor0.C302)])
│            └── 1:QueryTableScan(table=[T2], Schema=[ID:Dimension, 
C201:Dimension, C202:Decimal(0, 38), C203:Decimal(0, 38), C204:Decimal(2, 38)])
│)])
└── 0:QueryTableScan(table=[T3], Schema=[ID:Dimension, C301:Dimension, 
C302:Dimension])
{code}

Explain after planner:
{code}
23:LogicalProject($f0=[$3])
└── 21:LogicalJoin(condition=[true], joinType=[left], variablesSet=[[$cor1, 
$cor0]])
    ├── 0:QueryTableScan(table=[T3], Schema=[ID:Dimension, C301:Dimension, 
C302:Dimension])
    └── 19:LogicalAggregate(group=[{}], agg#0=[SUM($0)])
        └── 34:LogicalProject($f0=[+($2, $5)])
            └── 32:LogicalCorrelate(correlation=[$cor1], joinType=[left], 
requiredColumns=[{1}])
                ├── 15:LogicalFilter(condition=[=($0, $cor0.C302)])
                │   └── 1:QueryTableScan(table=[T2], Schema=[ID:Dimension, 
C201:Dimension, C202:Decimal(0, 38), C203:Decimal(0, 38), C204:Decimal(2, 38)])
                └── 30:LogicalAggregate(group=[{}], agg#0=[SUM($1)])
                    └── 28:LogicalFilter(condition=[=($0, $cor1.C201)])
                        └── 3:QueryTableScan(table=[T1], Schema=[ID:Dimension, 
C101:Decimal(0, 38)])
{code}


was (Author: ian.bertolacci):
Here is some replication code in scala (all or our unit tests are in Scala, so 
its what I have on hand)
{code:scala}
val rb = /* construct a relBuilder in whatever way you want */
rb
  .scan("T3")
  .variable{ middleVar: RexCorrelVariable =>
    rb.project(
      RexSubQuery.scalar(
        rb.scan("T2")
          .filter(rb.call(COps.EQUALS, rb.field("ID"), rb.dot(middleVar, 
"C302")))
          .variable{ innerVar: RexCorrelVariable =>
            rb.project(
              rb.call(
                COps.PLUS,
                rb.field("C202"),
                RexSubQuery.scalar(
                  rb.scan("T1")
                    .filter(rb.call(COps.EQUALS, rb.field("ID"), 
rb.dot(innerVar, "C201")))
                    .aggregate(rb.groupKey(), rb.aggregateCall(COps.SUM, 
rb.field("C101")))
                    .build()
                )
              )
            ).aggregate(rb.groupKey(), rb.aggregateCall(COps.SUM, rb.field(0)))
          }.build()
      )
    )
  }

val originalNode = rb.build()
println(originalNode.explain())

val hepPlanner: HepPlanner = {
  import org.apache.calcite.rel.rules.CoreRules
  import scala.collection.JavaConverters._
  val program = 
HepProgram.builder().addRuleCollection(List[RelOptRule](CoreRules.PROJECT_SUB_QUERY_TO_CORRELATE).asJava).build()
  val planner = new HepPlanner(program, rb.getCluster.getPlanner.getContext)
  planner
}

hepPlanner.setRoot(originalNode)
val newNode = hepPlanner.findBestExp()

println(newNode.explaion())
{code}

Explain from original
{code:}
8:LogicalProject($f0=[$SCALAR_QUERY(
│    7:LogicalAggregate(group=[{}], agg#0=[SUM($0)])
│    └── 6:LogicalProject($f0=[+($2, $SCALAR_QUERY(
│        │    5:LogicalAggregate(group=[{}], agg#0=[SUM($1)])
│        │    └── 4:LogicalFilter(condition=[=($0, $cor1.C201)])
│        │        └── 3:QueryTableScan(table=[T1], Schema=[ID:Dimension, 
C101:Decimal(0, 38)])
│        │))])
│        └── 2:LogicalFilter(condition=[=($0, $cor0.C302)])
│            └── 1:QueryTableScan(table=[T2], Schema=[ID:Dimension, 
C201:Dimension, C202:Decimal(0, 38), C203:Decimal(0, 38), C204:Decimal(2, 38)])
│)])
└── 0:QueryTableScan(table=[T3], Schema=[ID:Dimension, C301:Dimension, 
C302:Dimension])
{code}

Explain after planner:
{code}
23:LogicalProject($f0=[$3])
└── 21:LogicalJoin(condition=[true], joinType=[left], variablesSet=[[$cor1, 
$cor0]])
    ├── 0:QueryTableScan(table=[T3], Schema=[ID:Dimension, C301:Dimension, 
C302:Dimension])
    └── 19:LogicalAggregate(group=[{}], agg#0=[SUM($0)])
        └── 34:LogicalProject($f0=[+($2, $5)])
            └── 32:LogicalCorrelate(correlation=[$cor1], joinType=[left], 
requiredColumns=[{1}])
                ├── 15:LogicalFilter(condition=[=($0, $cor0.C302)])
                │   └── 1:QueryTableScan(table=[T2], Schema=[ID:Dimension, 
C201:Dimension, C202:Decimal(0, 38), C203:Decimal(0, 38), C204:Decimal(2, 38)])
                └── 30:LogicalAggregate(group=[{}], agg#0=[SUM($1)])
                    └── 28:LogicalFilter(condition=[=($0, $cor1.C201)])
                        └── 3:QueryTableScan(table=[T1], Schema=[ID:Dimension, 
C101:Decimal(0, 38)])
{code}

> PROJECT_TO_SUBQUERY producing Incorrect tree from nested correlated 
> subqueries in projections with correlations in filters.
> ---------------------------------------------------------------------------------------------------------------------------
>
>                 Key: CALCITE-5213
>                 URL: https://issues.apache.org/jira/browse/CALCITE-5213
>             Project: Calcite
>          Issue Type: Bug
>    Affects Versions: 1.30.0, 1.36.0
>            Reporter: Ian Bertolacci
>            Priority: Major
>
> CoreRules.PROJECT_SUB_QUERY_TO_CORRELATE produces (what I believe to be) 
> incorrect trees from nested correlated subqueries in projections.
> I'm hoping that I'm just doing something wrong and maybe someone will point 
> it out.
> For example:
> {code:sql}
> SELECT (SELECT Sum(C202
>                    + (SELECT Sum(C101)
>                       FROM   T1
>                       WHERE  T1.ID = T2.C201))
>         FROM   T2
>         WHERE  T2.ID = T3.C302)
> FROM   T3  {code}
> The initial RelNode tree produced from this SQL is:
> {code}
> 232:LogicalProject(EXPR$0=[
> |  $SCALAR_QUERY({
> |    LogicalAggregate(group=[{}], EXPR$0=[SUM($0)])
> |    └──LogicalProject($f0=[+($2, 
> |       |  $SCALAR_QUERY({
> |       |    LogicalAggregate(group=[{}], EXPR$0=[SUM($0)])
> |       |    └──LogicalProject(C101=[$1])
> |       |       └──LogicalFilter(condition=[=($0, $cor1.C201)])
> |       |          └──TableScan(table=[[QUERY, T1]], fields=[[ID, C101]])
> |       |  }))])
> |       └──LogicalFilter(condition=[=($0, $cor0.C302)])
> |          └──TableScan(table=[[QUERY, T2]], fields=[[ID, C201, C202, C203, 
> C204]])
> |  })])
> └──223:TableScan(table=[[QUERY, T3]], fields=[[ID, C301, C302]])
> {code}
> This looks ok so far, but it is important to notice the lack of variableSets 
> in the projection nodes (which would appear in the filter nodes having 
> correlated subqueries in their conditions).
> After applying the CoreRules.PROJECT_SUB_QUERY_TO_CORRELATE rule via a HEP 
> program the resulting tree is:
> {code}
> 270:LogicalProject(EXPR$0=[$3])
> |   // Unexpected Join instead of correlate binding $cor0
> └──268:LogicalJoin(condition=[true], joinType=[left])
>    ├──246:TableScan(table=[[QUERY, T3]], fields=[[ID, C301, C302]])
>    └──266:LogicalAggregate(group=[{}], EXPR$0=[SUM($0)])
>       └──283:LogicalProject($f0=[+($2, $5)])
>          |  // Correlate node correctly binding $cor1
>          └──281:LogicalCorrelate(correlation=[$cor1], joinType=[left], 
> requiredColumns=[{1}])
>             |  // $cor0 is not bound by any parent correlate node
>             ├──262:LogicalFilter(condition=[=($0, $cor0.C302)])
>             |  └──247:TableScan(table=[[QUERY, T2]], fields=[[ID, C201, C202, 
> C203, C204]])
>             └──279:LogicalAggregate(group=[{}], EXPR$0=[SUM($0)])
>                └──277:LogicalProject(C101=[$1])
>                   |  // $cor1 bound by #281
>                   └──275:LogicalFilter(condition=[=($0, $cor1.C201)])
>                      └──249:TableScan(table=[[QUERY, T1]], fields=[[ID, 
> C101]])
> {code}
>  
> Essentially, instead of a correlate node binding $cor0 there is a join (#268) 
> and there is nothing binding $cor0.
> I would have expected this:
> {code}
> 270:LogicalProject(EXPR$0=[$3])
> |  // Correlate binding $cor0 and requiring C302 from the LHS (#246)
> └──299:LogicalCorrelate(correlation=[$cor0], joinType=[left], 
> requiredColumns=[{2}])
>    ├──246:TableScan(table=[[QUERY, T3]], fields=[[ID, C301, C302]])
>    └──266:LogicalAggregate(group=[{}], EXPR$0=[SUM($0)])
>       └──283:LogicalProject($f0=[+($2, $5)])
>          |  // Correlate node correctly binding $cor1
>          └──281:LogicalCorrelate(correlation=[$cor1], joinType=[left], 
> requiredColumns=[{1}])
>             |  // $cor0 bound by #299
>             ├──262:LogicalFilter(condition=[=($0, $cor0.C302)])
>             |  └──247:TableScan(table=[[QUERY, T2]], fields=[[ID, C201, C202, 
> C203, C204]])
>             └──279:LogicalAggregate(group=[{}], EXPR$0=[SUM($0)])
>                └──277:LogicalProject(C101=[$1])
>                   |  // $cor1 bound by #281
>                   └──275:LogicalFilter(condition=[=($0, $cor1.C201)])
>                      └──249:TableScan(table=[[QUERY, T1]], fields=[[ID, 
> C101]])
> {code}
> Further, when adding CoreRules.JOIN_TO_CORRELATE in an attempt to convert the 
> erroneous join into a correlation we get:
> {code}
> 322:LogicalProject(EXPR$0=[$3])
> |  // Correlate incorrectly binding $cor2 (which does not appear anywhere)
> |  // when it should be binding $cor0, and requiring no columns from the left 
> side
> └──324:LogicalCorrelate(correlation=[$cor2], joinType=[left], 
> requiredColumns=[{}])
>    ├──298:TableScan(table=[[QUERY, T3]], fields=[[ID, C301, C302]])
>    └──318:LogicalAggregate(group=[{}], EXPR$0=[SUM($0)])
>       └──337:LogicalProject($f0=[+($2, $5)])
>          |  // Correlate node correctly binding $cor1
>          └──335:LogicalCorrelate(correlation=[$cor1], joinType=[left], 
> requiredColumns=[{1}])
>             |  // $cor0 is not bound by any parent correlate node
>             ├──314:LogicalFilter(condition=[=($0, $cor0.C302)])
>             |  └──299:TableScan(table=[[QUERY, T2]], fields=[[ID, C201, C202, 
> C203, C204]])
>             └──333:LogicalAggregate(group=[{}], EXPR$0=[SUM($0)])
>                └──331:LogicalProject(C101=[$1])
>                   |  // $cor1 bound by #335
>                   └──329:LogicalFilter(condition=[=($0, $cor1.C201)])
>                      └──301:TableScan(table=[[QUERY, T1]], fields=[[ID, 
> C101]])
> {code}
> which does replace the join with a correlate, but the correlate is incorrect; 
> binding an undefined correlation variable and requiring no columns.
> Even further, (and this might be a separate issue all together), applying 
> RelDecorrelator as an additional program in the sequence produces a very 
> incorrect tree (both with and without the JOIN_TO_CORRELATE rule):
> {code}
> 1041:LogicalProject(EXPR$0=[$5], ID6=[$4])
> └──1039:LogicalJoin(condition=[true], joinType=[left])
>    ├──1006:TableScan(table=[[QUERY, T3]], fields=[[ID, C301, C302, C303]])
>    └──1037:LogicalAggregate(group=[{0}], EXPR$0=[SUM($1)])
>       └──1035:LogicalProject(ID6=[$0], $f0=[+($2, $6)])
>          └──1033:LogicalJoin(condition=[=($1, $5)], joinType=[left])
>             ├──1024:LogicalFilter(condition=[=($0, $0)])
>             |  └──1007:TableScan(table=[[QUERY, T2]], 
>             |                    fields=[[ID, C201, C202, C203, C204]])
>             └──1031:LogicalAggregate(group=[{0}], EXPR$0=[SUM($1)])
>                └──1029:LogicalProject(ID=[$0], C101=[$1])
>                   └──1027:LogicalFilter(condition=[=($0, $0)])
>                      └──1009:TableScan(table=[[QUERY, T1]], fields=[[ID, 
> C101]])
> {code}
> This tree now has more projection expressions than the original query, which 
> is fully incorrect.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

[jira] [Comment Edited] (CALCITE-5213) PROJECT_TO_SUBQUERY producing Incorrect tree from nested correlated subqueries in projections with correlations in filters.

Reply via email to