mingmwang commented on issue #1972:
URL: 
https://github.com/apache/arrow-datafusion/issues/1972#issuecomment-1154749123

   Nice discussion!
   
   I see many people raised the points to build the volcano/cascades style 
planner for DataFusion, either on Egg or to build from scratch. To implement a 
cascades style planner, usually there are three parts:
   1: The optimizer framework, including the memo structure which consists of 
group of expressions, the search and prune tasks, duplicate detection etc.
   2: Rule sets consist of the transformation rules and implementation rules. 
   3: Cost model, stats estimation.
   
   The whole search and prune algorithm is performed by the specific tasks. It 
defines five different types of optimization tasks: Group Optimization; Group 
Exploration; Expression Optimization; Input Optimization; Rule Application. 
   
   <img width="327" alt="image" 
src="https://user-images.githubusercontent.com/10591925/173490995-c81c9a3e-df1b-4bd4-b065-6a8dc3d84b41.png";>
   
   Compared to a traditional heuristic planner, a cascades style optimizer is 
much more complex. But the most difficult and challenge work is in the Part 3, 
the stats estimation.  Even we can collect all the necessary stats like rows 
counts, cardinality and histogram for all the columns, make them correct and up 
to date, it is still hard to derive the correct stats from bottom leaf 
operators(table scan) to the top operators. There are many papers regarding 
stats estimation, lot of maths and magic.
   
   And in production, the cascade style optimizer is also hard to triage and 
debug, because of the complexity of the search space, the unreliable stats, or 
just because there is some bug in the rule.implementation,  it is hard to debug 
and figure out why the planner choose such a plan.
   
   Because of the unreliable stats,  it is hard to come up with a real optimal 
plan before execution. Some other engines take another approach and try to 
adaptively adjust the execution plan in runtime, SparkSQL's AQE is an example. 
Another example is Snowflake's adaptive aggregation operator placement: 
https://patents.justia.com/patent/20210089533
   It is not that general but the adaptive execution can solve the major 
physical plan selection problem.
   
   Today, some open source Query/DB engines had implemented the 
volcano/cascades optimizer like Apache Calcite,  Greenplum Orca. They take many 
years to make them solid and mature. For Apache calcite, most engines just 
leverage Calcite as a SQL parser and heuristic planner,  Hive try to leverage 
Calcite to implement an CBO optimizer, but I don't think it is a success story.
   
   My opinion for DataFusion's optimizer framework is that we should continue 
focus on the heuristic planner approach in current phase, implement an 
optimizer framework like SparkSQL's catalyst optimizer, make it relatively easy 
to add new rules.  In future, we can go with the adaptive execution approach.
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to