[ 
https://issues.apache.org/jira/browse/CALCITE-4542?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Konstantin Orlov updated CALCITE-4542:
--------------------------------------
    Description: 
When TopDownRuleDriver is enabled, suboptimal plan is chosen for query with 
join.

We have our own convention and implementation of all necessary relations. A 
distributed join is considered less expensive by our cost system than a 
single-distributed join, and the merge join is considered less expensive than 
nested loop if index over join condition is present. Nevertheless the merge 
join with a single distribution is chosen by optimizer.

The query is:
{code:java}
select e1."empid", e1."deptno" from "emps" e1 join "emps" e2 on e1."empid" = 
e2."empid"
{code}
Actual plan:
{code:java}
MyProject(subset=[rel#16:RelSubset#2.MY.single.[]], empid=[$0], deptno=[$1]): 
rowcount = 1500.0, cumulative cost = {1500.0 rows, 3000.0 cpu, 0.0 io}, id = 21
  MyMergeJoin(subset=[rel#20:RelSubset#1.MY.single.[]], condition=[=($0, $5)], 
joinType=[inner]): rowcount = 1500.0, cumulative cost = {150.0 rows, 0.0 cpu, 
0.0 io}, id = 50
    MyExchange(subset=[rel#25:RelSubset#0.MY.single.[]], 
distribution=[single]): rowcount = 100.0, cumulative cost = {9210.340371976183 
rows, 100.0 cpu, 0.0 io}, id = 30
      MyTableScan(subset=[rel#29:RelSubset#0.MY.any.[0]], table=[[hr, emps]]): 
rowcount = 100.0, cumulative cost = {100.0 rows, 101.0 cpu, 0.0 io}, id = 27
    MyExchange(subset=[rel#25:RelSubset#0.MY.single.[]], 
distribution=[single]): rowcount = 100.0, cumulative cost = {9210.340371976183 
rows, 100.0 cpu, 0.0 io}, id = 30
      MyTableScan(subset=[rel#29:RelSubset#0.MY.any.[0]], table=[[hr, emps]]): 
rowcount = 100.0, cumulative cost = {100.0 rows, 101.0 cpu, 0.0 io}, id = 27
{code}
Expected plan is
{code:java}
MyProject(subset=[rel#16:RelSubset#2.MY.single.[]], empid=[$0], deptno=[$1]): 
rowcount = 1500.0, cumulative cost = {1500.0 rows, 3000.0 cpu, 0.0 io}, id = 21
  MyExchange(subset=[rel#20:RelSubset#1.MY.single.[]], distribution=[single]): 
rowcount = 1500.0, cumulative cost = {18420.680743952365 rows, 100.0 cpu, 0.0 
io}, id = 24
    MyMergeJoin(subset=[rel#23:RelSubset#1.MY.any.[]], condition=[=($0, $5)], 
joinType=[inner]): rowcount = 1500.0, cumulative cost = {0.15 rows, 0.0 cpu, 
0.0 io}, id = 50
      MyTableScan(subset=[rel#26:RelSubset#0.MY.hash[0].[0]], table=[[hr, 
emps]]): rowcount = 100.0, cumulative cost = {100.0 rows, 101.0 cpu, 0.0 io}, 
id = 25
      MyTableScan(subset=[rel#26:RelSubset#0.MY.hash[0].[0]], table=[[hr, 
emps]]): rowcount = 100.0, cumulative cost = {100.0 rows, 101.0 cpu, 0.0 io}, 
id = 25
{code}
Planner [^dump.txt] doesn't contain the join with proper distribution.

Reproducer could be found 
[here|https://github.com/korlov42/calcite/tree/derive-not-being-called-repoducer].
 Please run {{org.apache.calcite.tools.PlannerTest#test}}

  was:
When TopDownRuleDriver is enabled, suboptimal plan is chosen for query with 
join.

We have our own convention and implementation of all necessary relations. A 
distributed join is considered less expensive by our cost system than a 
single-distributed join, and the merge join is considered less expensive than 
nested loop if index over join condition is present. Nevertheless the merge 
join with a single distribution is chosen by optimizer.

The query is:

{code:java}
select e1."empid", e1."deptno" from "emps" e1 join "emps" e2 on e1."empid" = 
e2."empid"
{code}

Actual plan:

{code:java}
MyProject(subset=[rel#16:RelSubset#2.MY.single.[]], empid=[$0], deptno=[$1]): 
rowcount = 1500.0, cumulative cost = {1500.0 rows, 3000.0 cpu, 0.0 io}, id = 21
  MyMergeJoin(subset=[rel#20:RelSubset#1.MY.single.[]], condition=[=($0, $5)], 
joinType=[inner]): rowcount = 1500.0, cumulative cost = {150.0 rows, 0.0 cpu, 
0.0 io}, id = 50
    MyExchange(subset=[rel#25:RelSubset#0.MY.single.[]], 
distribution=[single]): rowcount = 100.0, cumulative cost = {9210.340371976183 
rows, 100.0 cpu, 0.0 io}, id = 30
      MyTableScan(subset=[rel#29:RelSubset#0.MY.any.[0]], table=[[hr, emps]]): 
rowcount = 100.0, cumulative cost = {100.0 rows, 101.0 cpu, 0.0 io}, id = 27
    MyExchange(subset=[rel#25:RelSubset#0.MY.single.[]], 
distribution=[single]): rowcount = 100.0, cumulative cost = {9210.340371976183 
rows, 100.0 cpu, 0.0 io}, id = 30
      MyTableScan(subset=[rel#29:RelSubset#0.MY.any.[0]], table=[[hr, emps]]): 
rowcount = 100.0, cumulative cost = {100.0 rows, 101.0 cpu, 0.0 io}, id = 27
{code}

Expected plan is

{code:java}
MyProject(subset=[rel#16:RelSubset#2.MY.single.[]], empid=[$0], deptno=[$1]): 
rowcount = 1500.0, cumulative cost = {1500.0 rows, 3000.0 cpu, 0.0 io}, id = 21
  MyExchange(subset=[rel#20:RelSubset#1.MY.single.[]], distribution=[single]): 
rowcount = 1500.0, cumulative cost = {18420.680743952365 rows, 100.0 cpu, 0.0 
io}, id = 24
    MyMergeJoin(subset=[rel#23:RelSubset#1.MY.any.[]], condition=[=($0, $5)], 
joinType=[inner]): rowcount = 1500.0, cumulative cost = {0.15 rows, 0.0 cpu, 
0.0 io}, id = 50
      MyTableScan(subset=[rel#26:RelSubset#0.MY.hash[0].[0]], table=[[hr, 
emps]]): rowcount = 100.0, cumulative cost = {100.0 rows, 101.0 cpu, 0.0 io}, 
id = 25
      MyTableScan(subset=[rel#26:RelSubset#0.MY.hash[0].[0]], table=[[hr, 
emps]]): rowcount = 100.0, cumulative cost = {100.0 rows, 101.0 cpu, 0.0 io}, 
id = 25
{code}

Planner dump doesn't contain the join with proper distribution.

Reproducer could be found 
[here|https://github.com/korlov42/calcite/tree/derive-not-being-called-repoducer].
 Please run {{org.apache.calcite.tools.PlannerTest#test}}





> Suboptimal plan is chosen when TopDownRuleDriver is enabled 
> ------------------------------------------------------------
>
>                 Key: CALCITE-4542
>                 URL: https://issues.apache.org/jira/browse/CALCITE-4542
>             Project: Calcite
>          Issue Type: Bug
>    Affects Versions: 1.26.0
>            Reporter: Konstantin Orlov
>            Priority: Major
>         Attachments: dump.txt
>
>
> When TopDownRuleDriver is enabled, suboptimal plan is chosen for query with 
> join.
> We have our own convention and implementation of all necessary relations. A 
> distributed join is considered less expensive by our cost system than a 
> single-distributed join, and the merge join is considered less expensive than 
> nested loop if index over join condition is present. Nevertheless the merge 
> join with a single distribution is chosen by optimizer.
> The query is:
> {code:java}
> select e1."empid", e1."deptno" from "emps" e1 join "emps" e2 on e1."empid" = 
> e2."empid"
> {code}
> Actual plan:
> {code:java}
> MyProject(subset=[rel#16:RelSubset#2.MY.single.[]], empid=[$0], deptno=[$1]): 
> rowcount = 1500.0, cumulative cost = {1500.0 rows, 3000.0 cpu, 0.0 io}, id = 
> 21
>   MyMergeJoin(subset=[rel#20:RelSubset#1.MY.single.[]], condition=[=($0, 
> $5)], joinType=[inner]): rowcount = 1500.0, cumulative cost = {150.0 rows, 
> 0.0 cpu, 0.0 io}, id = 50
>     MyExchange(subset=[rel#25:RelSubset#0.MY.single.[]], 
> distribution=[single]): rowcount = 100.0, cumulative cost = 
> {9210.340371976183 rows, 100.0 cpu, 0.0 io}, id = 30
>       MyTableScan(subset=[rel#29:RelSubset#0.MY.any.[0]], table=[[hr, 
> emps]]): rowcount = 100.0, cumulative cost = {100.0 rows, 101.0 cpu, 0.0 io}, 
> id = 27
>     MyExchange(subset=[rel#25:RelSubset#0.MY.single.[]], 
> distribution=[single]): rowcount = 100.0, cumulative cost = 
> {9210.340371976183 rows, 100.0 cpu, 0.0 io}, id = 30
>       MyTableScan(subset=[rel#29:RelSubset#0.MY.any.[0]], table=[[hr, 
> emps]]): rowcount = 100.0, cumulative cost = {100.0 rows, 101.0 cpu, 0.0 io}, 
> id = 27
> {code}
> Expected plan is
> {code:java}
> MyProject(subset=[rel#16:RelSubset#2.MY.single.[]], empid=[$0], deptno=[$1]): 
> rowcount = 1500.0, cumulative cost = {1500.0 rows, 3000.0 cpu, 0.0 io}, id = 
> 21
>   MyExchange(subset=[rel#20:RelSubset#1.MY.single.[]], 
> distribution=[single]): rowcount = 1500.0, cumulative cost = 
> {18420.680743952365 rows, 100.0 cpu, 0.0 io}, id = 24
>     MyMergeJoin(subset=[rel#23:RelSubset#1.MY.any.[]], condition=[=($0, $5)], 
> joinType=[inner]): rowcount = 1500.0, cumulative cost = {0.15 rows, 0.0 cpu, 
> 0.0 io}, id = 50
>       MyTableScan(subset=[rel#26:RelSubset#0.MY.hash[0].[0]], table=[[hr, 
> emps]]): rowcount = 100.0, cumulative cost = {100.0 rows, 101.0 cpu, 0.0 io}, 
> id = 25
>       MyTableScan(subset=[rel#26:RelSubset#0.MY.hash[0].[0]], table=[[hr, 
> emps]]): rowcount = 100.0, cumulative cost = {100.0 rows, 101.0 cpu, 0.0 io}, 
> id = 25
> {code}
> Planner [^dump.txt] doesn't contain the join with proper distribution.
> Reproducer could be found 
> [here|https://github.com/korlov42/calcite/tree/derive-not-being-called-repoducer].
>  Please run {{org.apache.calcite.tools.PlannerTest#test}}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

Reply via email to