[
https://issues.apache.org/jira/browse/CALCITE-7093?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18009120#comment-18009120
]
Silun Dong edited comment on CALCITE-7093 at 7/23/25 2:06 AM:
--------------------------------------------------------------
The algorithm is a dynamic programming process. For a connected subgraph, it
has only one optimal result. When enumerating a larger connected subgraph that
contains this subgraph, only the optimal result of this subgraph will be used.
I think it can be said that the algorithm calculates the optimal result of each
connected subgraph from small to large based on dynamic programming.
I tested a chain joins of 5 tables in DphypJoinReorderTest, and the initial
plan was a left-deep tree:
{code:java}
LogicalJoin(condition=[=($6, $8)], joinType=[inner])
LogicalJoin(condition=[=($4, $6)], joinType=[inner])
LogicalJoin(condition=[=($2, $4)], joinType=[inner])
LogicalJoin(condition=[=($0, $2)], joinType=[inner])
LogicalValues(tuples=[[{ 1, 't1' }, { 2, null }, { 3, 't1' }]])
LogicalValues(tuples=[[{ 2, 't2' }]])
LogicalValues(tuples=[[{ 2, 't3' }, { 3, 't3' }, { 4, 't3' }]])
LogicalValues(tuples=[[{ 2, 't4' }, { 5, 't4' }]])
LogicalValues(tuples=[[{ 2, 't5' }, { 5, 't5' }]]){code}
After dphyp, there are 8 candidate plans for the \{t1, t2, t3, t4, t5},
including 4 bushy trees (because the cost is the same, bushy trees did not win):
{code:java}
// no.1
LogicalProject(id1=[$2], name1=[$3], id2=[$0], name2=[$1], id3=[$8],
name3=[$9], id4=[$6], name4=[$7], id5=[$4], name5=[$5])
LogicalJoin(condition=[=($0, $8)], joinType=[inner])
LogicalJoin(condition=[=($2, $0)], joinType=[inner])
LogicalValues(tuples=[[{ 2, 't2' }]])
LogicalValues(tuples=[[{ 1, 't1' }, { 2, null }, { 3, 't1' }]])
LogicalJoin(condition=[=($4, $2)], joinType=[inner])
LogicalJoin(condition=[=($2, $0)], joinType=[inner])
LogicalValues(tuples=[[{ 2, 't5' }, { 5, 't5' }]])
LogicalValues(tuples=[[{ 2, 't4' }, { 5, 't4' }]])
LogicalValues(tuples=[[{ 2, 't3' }, { 3, 't3' }, { 4, 't3' }]])
// no.2, reverse the top Join in no.1
// no.3
LogicalProject(id1=[$4], name1=[$5], id2=[$2], name2=[$3], id3=[$0],
name3=[$1], id4=[$8], name4=[$9], id5=[$6], name5=[$7])
LogicalJoin(condition=[=($0, $8)], joinType=[inner])
LogicalJoin(condition=[=($4, $2)], joinType=[inner])
LogicalJoin(condition=[=($2, $0)], joinType=[inner])
LogicalValues(tuples=[[{ 2, 't3' }, { 3, 't3' }, { 4, 't3' }]])
LogicalValues(tuples=[[{ 2, 't2' }]])
LogicalValues(tuples=[[{ 1, 't1' }, { 2, null }, { 3, 't1' }]])
LogicalJoin(condition=[=($2, $0)], joinType=[inner])
LogicalValues(tuples=[[{ 2, 't5' }, { 5, 't5' }]])
LogicalValues(tuples=[[{ 2, 't4' }, { 5, 't4' }]])
// no.4, reverse the top Join in no.3{code}
As mentioned above, these 8 candidate plans are not the true all legal join
orders. For example, for \{t4, t5}, the best result is t5-t4, and only this
result will be used in subsequent enumerations.
was (Author: JIRAUSER308615):
The algorithm is a dynamic programming process. For a connected subgraph, it
has only one optimal result. When enumerating a larger connected subgraph that
contains this subgraph, only the optimal result of this subgraph will be used.
I think it can be said that the algorithm calculates the optimal result of each
connected subgraph from small to large based on dynamic programming.
I tested a chain joins of 5 tables in DphypJoinReorderTest, and the initial
plan was a left-deep tree:
{code:java}
LogicalJoin(condition=[=($6, $8)], joinType=[inner])
LogicalJoin(condition=[=($4, $6)], joinType=[inner])
LogicalJoin(condition=[=($2, $4)], joinType=[inner])
LogicalJoin(condition=[=($0, $2)], joinType=[inner])
LogicalValues(tuples=[[{ 1, 't1' }, { 2, null }, { 3, 't1' }]])
LogicalValues(tuples=[[{ 2, 't2' }]])
LogicalValues(tuples=[[{ 2, 't3' }, { 3, 't3' }, { 4, 't3' }]])
LogicalValues(tuples=[[{ 2, 't4' }, { 5, 't4' }]])
LogicalValues(tuples=[[{ 2, 't5' }, { 5, 't5' }]]){code}
After dphyp, there are 8 candidate plans for the \{t1, t2, t3, t4, t5},
including 4 bushy trees (because the cost is the same, bushy trees did not win):
{code:java}
// no.1
LogicalProject(id1=[$2], name1=[$3], id2=[$0], name2=[$1], id3=[$8],
name3=[$9], id4=[$6], name4=[$7], id5=[$4], name5=[$5])
LogicalJoin(condition=[=($0, $8)], joinType=[inner])
LogicalJoin(condition=[=($2, $0)], joinType=[inner])
LogicalValues(tuples=[[{ 2, 't2' }]])
LogicalValues(tuples=[[{ 1, 't1' }, { 2, null }, { 3, 't1' }]])
LogicalJoin(condition=[=($4, $2)], joinType=[inner])
LogicalJoin(condition=[=($2, $0)], joinType=[inner])
LogicalValues(tuples=[[{ 2, 't5' }, { 5, 't5' }]])
LogicalValues(tuples=[[{ 2, 't4' }, { 5, 't4' }]])
LogicalValues(tuples=[[{ 2, 't3' }, { 3, 't3' }, { 4, 't3' }]])
// no.2, reverse the top Join in no.1
// no.3
LogicalProject(id1=[$4], name1=[$5], id2=[$2], name2=[$3], id3=[$0],
name3=[$1], id4=[$8], name4=[$9], id5=[$6], name5=[$7])
LogicalJoin(condition=[=($0, $8)], joinType=[inner])
LogicalJoin(condition=[=($4, $2)], joinType=[inner])
LogicalJoin(condition=[=($2, $0)], joinType=[inner])
LogicalValues(tuples=[[{ 2, 't3' }, { 3, 't3' }, { 4, 't3' }]])
LogicalValues(tuples=[[{ 2, 't2' }]])
LogicalValues(tuples=[[{ 1, 't1' }, { 2, null }, { 3, 't1' }]])
LogicalJoin(condition=[=($2, $0)], joinType=[inner])
LogicalValues(tuples=[[{ 2, 't5' }, { 5, 't5' }]])
LogicalValues(tuples=[[{ 2, 't4' }, { 5, 't4' }]])
// no.4, reverse the top Join in no.3{code}
As mentioned above, these 8 candidate plans are not the true all legal join
orders. For example, for \{t4, t5}, the best result is t5-t4, and only this
result will be used in subsequent enumerations.
> DPhyp algorithm should accept cost model as parameter
> -----------------------------------------------------
>
> Key: CALCITE-7093
> URL: https://issues.apache.org/jira/browse/CALCITE-7093
> Project: Calcite
> Issue Type: Wish
> Components: core
> Affects Versions: 1.40.0
> Reporter: Mihai Budiu
> Priority: Minor
>
> This is about the hypergraph-based optimization introduced in [CALCITE-6846]
> by [~dongsl].
> The algorithm makes calls to a cost model using getCumulativeCost() (function
> chooseBetterPlan()); the cost model is obtained from the Rel nodes.
> It could be useful to allow add to the the rule configuration an API to
> specify a cost model, similar to the MULTI_JOIN_OPTIMIZE rule
> (LoptOptimizeJoinRule), whose Config has as "withCostFunction" api.
> For example, this would allow the algorithm to emulate bushy join
> optimization using a cost model that optimizes for plan depth.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)