mingmwang commented on issue #1972: URL: https://github.com/apache/arrow-datafusion/issues/1972#issuecomment-1154749123
Nice discussion! I see many people raised the points to build the volcano/cascades style planner for DataFusion, either on Egg or to build from scratch. To implement a cascades style planner, usually there are three parts: 1: The optimizer framework, including the memo structure which consists of group of expressions, the search and prune tasks, duplicate detection etc. 2: Rule sets consist of the transformation rules and implementation rules. 3: Cost model, stats estimation. The whole search and prune algorithm is performed by the specific tasks. It defines five different types of optimization tasks: Group Optimization; Group Exploration; Expression Optimization; Input Optimization; Rule Application. <img width="327" alt="image" src="https://user-images.githubusercontent.com/10591925/173490995-c81c9a3e-df1b-4bd4-b065-6a8dc3d84b41.png"> Compared to a traditional heuristic planner, a cascades style optimizer is much more complex. But the most difficult and challenge work is in the Part 3, the stats estimation. Even we can collect all the necessary stats like rows counts, cardinality and histogram for all the columns, make them correct and up to date, it is still hard to derive the correct stats from bottom leaf operators(table scan) to the top operators. There are many papers regarding stats estimation, lot of maths and magic. And in production, the cascade style optimizer is also hard to triage and debug, because of the complexity of the search space, the unreliable stats, or just because there is some bug in the rule.implementation, it is hard to debug and figure out why the planner choose such a plan. Because of the unreliable stats, it is hard to come up with a real optimal plan before execution. Some other engines take another approach and try to adaptively adjust the execution plan in runtime, SparkSQL's AQE is an example. Another example is Snowflake's adaptive aggregation operator placement: https://patents.justia.com/patent/20210089533 It is not that general but the adaptive execution can solve the major physical plan selection problem. Today, some open source Query/DB engines had implemented the volcano/cascades optimizer like Apache Calcite, Greenplum Orca. They take many years to make them solid and mature. For Apache calcite, most engines just leverage Calcite as a SQL parser and heuristic planner, Hive try to leverage Calcite to implement an CBO optimizer, but I don't think it is a success story. My opinion for DataFusion's optimizer framework is that we should continue focus on the heuristic planner approach in current phase, implement an optimizer framework like SparkSQL's catalyst optimizer, make it relatively easy to add new rules. In future, we can go with the adaptive execution approach. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
