[ 
https://issues.apache.org/jira/browse/CALCITE-4682?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ZiLin Chen updated CALCITE-4682:
--------------------------------
    Description: 
The total-cost assumption means that each operator would run to completion, so 
the actual running cost of a plan equals its cumulative cost. However, the cost 
of a Limit operator may break this assumption.

Is there any way Calcite can provide to solve this cost problem for the Limit operator?

A Limit operator can affect which algorithm the join below it should choose, 
based on start-up cost instead of total cost.

However, the Volcano planner only deals with total cost. That works well in 
OLAP systems, but OLTP systems need a cost model that accounts for start-up cost.

 

Given a SQL query like: select * from A join B on A.id = B.id limit 1.

Limit
 - Join
  - TableScan A (10,000,000 rows)
  - TableScan B (10,000,000 rows)

 

Consider two join algorithms HashJoin and IndexNestedLoopJoin.

It is more efficient to use IndexNestedLoopJoin instead of HashJoin, because we 
only need to fetch a few rows from A and do index lookups on B; once the number 
of join output rows reaches the limit's fetch size, execution stops. Using 
HashJoin, however, requires building a hash table over 10,000,000 rows.

 

Now coming to VolcanoPlanner: the best cost of this query is computed as the 
cost of the Limit operator plus the cumulative cost of the join. If we only 
consider cumulative cost, HashJoin would likely be chosen for a join of two 
large tables.
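The accumulation described above can be sketched as follows. This is a minimal, hypothetical model of Volcano-style cumulative costing (not Calcite's actual implementation; all cost numbers are invented for the example): each node's cumulative cost is its own cost plus the cumulative costs of its inputs, so a nearly-free Limit on top cannot discount the work its input is assumed to do to completion.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch of Volcano-style cumulative costing (not Calcite's code).
public class CumulativeCostDemo {
    // Cumulative cost = this node's own cost + sum of its inputs' cumulative costs.
    static double cumulativeCost(double selfCost, List<Double> inputCumulativeCosts) {
        return selfCost
            + inputCumulativeCosts.stream().mapToDouble(Double::doubleValue).sum();
    }

    public static void main(String[] args) {
        double scanA = 10_000_000, scanB = 10_000_000;  // table scan costs

        // Invented operator costs: HashJoin cheaper overall, IndexNestedLoopJoin
        // more expensive if it had to probe B for every row of A.
        double hashJoin  = cumulativeCost(12_000_000, Arrays.asList(scanA, scanB));
        double indexJoin = cumulativeCost(30_000_000, Arrays.asList(scanA));

        // Limit itself is nearly free, but it inherits the join's full cost.
        double planHash  = cumulativeCost(1, Arrays.asList(hashJoin));   // = 32_000_001
        double planIndex = cumulativeCost(1, Arrays.asList(indexJoin));  // = 40_000_001

        System.out.println("HashJoin plan cumulative cost:  " + planHash);
        System.out.println("IndexJoin plan cumulative cost: " + planIndex);
        // Comparing cumulative costs, the HashJoin plan "wins" even though for
        // LIMIT 1 the index join would produce the first row almost immediately.
    }
}
```

This is why a start-up-cost-aware model (or some limit-pushdown mechanism in costing) seems necessary for this class of query.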

 



> Cost of limit operator may break total cost assumption
> ------------------------------------------------------
>
>                 Key: CALCITE-4682
>                 URL: https://issues.apache.org/jira/browse/CALCITE-4682
>             Project: Calcite
>          Issue Type: Improvement
>          Components: core
>            Reporter: ZiLin Chen
>            Priority: Major
>



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
