[jira] [Comment Edited] (CALCITE-7093) DPhyp algorithm should accept cost model as parameter

Silun Dong (Jira) Tue, 22 Jul 2025 19:07:45 -0700


    [ 
https://issues.apache.org/jira/browse/CALCITE-7093?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18009120#comment-18009120
 ]


Silun Dong edited comment on CALCITE-7093 at 7/23/25 2:06 AM:
--------------------------------------------------------------

The algorithm is a dynamic programming process. For a connected subgraph, it 
has only one optimal result. When enumerating a larger connected subgraph that 
contains this subgraph, only the optimal result of this subgraph will be used. 
I think it can be said that the algorithm calculates the optimal result of each 
connected subgraph from small to large based on dynamic programming.

I tested a chain joins of 5 tables in DphypJoinReorderTest, and the initial 
plan was a left-deep tree:
{code:java}
LogicalJoin(condition=[=($6, $8)], joinType=[inner])
  LogicalJoin(condition=[=($4, $6)], joinType=[inner])
    LogicalJoin(condition=[=($2, $4)], joinType=[inner])
      LogicalJoin(condition=[=($0, $2)], joinType=[inner])
        LogicalValues(tuples=[[{ 1, 't1' }, { 2, null }, { 3, 't1' }]])
        LogicalValues(tuples=[[{ 2, 't2' }]])
      LogicalValues(tuples=[[{ 2, 't3' }, { 3, 't3' }, { 4, 't3' }]])
    LogicalValues(tuples=[[{ 2, 't4' }, { 5, 't4' }]])
  LogicalValues(tuples=[[{ 2, 't5' }, { 5, 't5' }]]){code}
After dphyp, there are 8 candidate plans for the \{t1, t2, t3, t4, t5}, 
including 4 bushy trees (because the cost is the same, bushy trees did not win):
{code:java}
// no.1
LogicalProject(id1=[$2], name1=[$3], id2=[$0], name2=[$1], id3=[$8], 
name3=[$9], id4=[$6], name4=[$7], id5=[$4], name5=[$5])
  LogicalJoin(condition=[=($0, $8)], joinType=[inner])
    LogicalJoin(condition=[=($2, $0)], joinType=[inner])
      LogicalValues(tuples=[[{ 2, 't2' }]])
      LogicalValues(tuples=[[{ 1, 't1' }, { 2, null }, { 3, 't1' }]])
    LogicalJoin(condition=[=($4, $2)], joinType=[inner])
      LogicalJoin(condition=[=($2, $0)], joinType=[inner])
        LogicalValues(tuples=[[{ 2, 't5' }, { 5, 't5' }]])
        LogicalValues(tuples=[[{ 2, 't4' }, { 5, 't4' }]])
      LogicalValues(tuples=[[{ 2, 't3' }, { 3, 't3' }, { 4, 't3' }]])

// no.2, reverse the top Join in no.1

// no.3
LogicalProject(id1=[$4], name1=[$5], id2=[$2], name2=[$3], id3=[$0], 
name3=[$1], id4=[$8], name4=[$9], id5=[$6], name5=[$7])
  LogicalJoin(condition=[=($0, $8)], joinType=[inner])
    LogicalJoin(condition=[=($4, $2)], joinType=[inner])
      LogicalJoin(condition=[=($2, $0)], joinType=[inner])
        LogicalValues(tuples=[[{ 2, 't3' }, { 3, 't3' }, { 4, 't3' }]])
        LogicalValues(tuples=[[{ 2, 't2' }]])
      LogicalValues(tuples=[[{ 1, 't1' }, { 2, null }, { 3, 't1' }]])
    LogicalJoin(condition=[=($2, $0)], joinType=[inner])
      LogicalValues(tuples=[[{ 2, 't5' }, { 5, 't5' }]])
      LogicalValues(tuples=[[{ 2, 't4' }, { 5, 't4' }]])

// no.4, reverse the top Join in no.3{code}
As mentioned above, these 8 candidate plans are not the true all legal join 
orders. For example, for \{t4, t5}, the best result is t5-t4, and only this 
result will be used in subsequent enumerations.


was (Author: JIRAUSER308615):
The algorithm is a dynamic programming process. For a connected subgraph, it 
has only one optimal result. When enumerating a larger connected subgraph that 
contains this subgraph, only the optimal result of this subgraph will be used. 
I think it can be said that the algorithm calculates the optimal result of each 
connected subgraph from small to large based on dynamic programming.

I tested a chain joins of 5 tables in DphypJoinReorderTest, and the initial 
plan was a left-deep tree:
{code:java}
LogicalJoin(condition=[=($6, $8)], joinType=[inner])
  LogicalJoin(condition=[=($4, $6)], joinType=[inner])
    LogicalJoin(condition=[=($2, $4)], joinType=[inner])
      LogicalJoin(condition=[=($0, $2)], joinType=[inner])
        LogicalValues(tuples=[[{ 1, 't1' }, { 2, null }, { 3, 't1' }]])
        LogicalValues(tuples=[[{ 2, 't2' }]])
      LogicalValues(tuples=[[{ 2, 't3' }, { 3, 't3' }, { 4, 't3' }]])
    LogicalValues(tuples=[[{ 2, 't4' }, { 5, 't4' }]])
  LogicalValues(tuples=[[{ 2, 't5' }, { 5, 't5' }]]){code}
After dphyp, there are 8 candidate plans for the \{t1, t2, t3, t4, t5}, 
including 4 bushy trees (because the cost is the same, bushy trees did not win):
{code:java}
// no.1
LogicalProject(id1=[$2], name1=[$3], id2=[$0], name2=[$1], id3=[$8], 
name3=[$9], id4=[$6], name4=[$7], id5=[$4], name5=[$5])
  LogicalJoin(condition=[=($0, $8)], joinType=[inner])
    LogicalJoin(condition=[=($2, $0)], joinType=[inner])
      LogicalValues(tuples=[[{ 2, 't2' }]])
      LogicalValues(tuples=[[{ 1, 't1' }, { 2, null }, { 3, 't1' }]])
    LogicalJoin(condition=[=($4, $2)], joinType=[inner])
      LogicalJoin(condition=[=($2, $0)], joinType=[inner])
        LogicalValues(tuples=[[{ 2, 't5' }, { 5, 't5' }]])
        LogicalValues(tuples=[[{ 2, 't4' }, { 5, 't4' }]])
      LogicalValues(tuples=[[{ 2, 't3' }, { 3, 't3' }, { 4, 't3' }]])

// no.2, reverse the top Join in no.1

// no.3
LogicalProject(id1=[$4], name1=[$5], id2=[$2], name2=[$3], id3=[$0], 
name3=[$1], id4=[$8], name4=[$9], id5=[$6], name5=[$7])
  LogicalJoin(condition=[=($0, $8)], joinType=[inner])
    LogicalJoin(condition=[=($4, $2)], joinType=[inner])
      LogicalJoin(condition=[=($2, $0)], joinType=[inner])
        LogicalValues(tuples=[[{ 2, 't3' }, { 3, 't3' }, { 4, 't3' }]])
        LogicalValues(tuples=[[{ 2, 't2' }]])
      LogicalValues(tuples=[[{ 1, 't1' }, { 2, null }, { 3, 't1' }]])
    LogicalJoin(condition=[=($2, $0)], joinType=[inner])
      LogicalValues(tuples=[[{ 2, 't5' }, { 5, 't5' }]])
      LogicalValues(tuples=[[{ 2, 't4' }, { 5, 't4' }]])

// no.4, reverse the top Join in no.3{code}
As mentioned above, these 8 candidate plans are not the true all legal join 
orders. For example, for \{t4, t5}, the best result is t5-t4, and only this 
result will be used in subsequent enumerations.

 

 

> DPhyp algorithm should accept cost model as parameter
> -----------------------------------------------------
>
>                 Key: CALCITE-7093
>                 URL: https://issues.apache.org/jira/browse/CALCITE-7093
>             Project: Calcite
>          Issue Type: Wish
>          Components: core
>    Affects Versions: 1.40.0
>            Reporter: Mihai Budiu
>            Priority: Minor
>
> This is about the hypergraph-based optimization introduced in [CALCITE-6846] 
> by [~dongsl].
> The algorithm makes calls to a cost model using getCumulativeCost() (function 
> chooseBetterPlan()); the cost model is obtained from the Rel nodes.
> It could be useful to allow add to the the rule configuration an API to 
> specify a cost model, similar to the MULTI_JOIN_OPTIMIZE rule 
> (LoptOptimizeJoinRule), whose Config has as "withCostFunction" api.
> For example, this would allow the algorithm to emulate bushy join 
> optimization using a cost model that optimizes for plan depth.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

[jira] [Comment Edited] (CALCITE-7093) DPhyp algorithm should accept cost model as parameter

Reply via email to