Hi All,

The current Spark CBO implements a cost-based multi-way join reordering algorithm based on the System-R paper [Access Path, SIGMOD '79] <http://x86.cs.duke.edu/courses/spring03/cps216/papers/selinger-etal-1979.pdf>. When building an m-way join, it uses a bottom-up approach: it puts all items (base joined nodes) at level 0, builds all two-way joins at level 1 from the plans at level 0 (single items), then all three-way joins, and so on. The algorithm considers all combinations, including left-deep trees, bushy trees, and right-deep trees, and it prunes Cartesian-product candidates.
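To make the level-by-level enumeration concrete, here is a minimal, self-contained Scala sketch of that kind of bottom-up search. The Plan/Relation/Join types, the toy cost function and the edge-based connectivity check are all made up for illustration; they are not Spark's actual classes or cost model.

object JoinReorderSketch {

  sealed trait Plan { def cost: Double; def items: Set[String] }

  case class Relation(name: String, cost: Double) extends Plan {
    val items: Set[String] = Set(name)
  }

  case class Join(left: Plan, right: Plan) extends Plan {
    val items: Set[String] = left.items ++ right.items
    // Toy cost model: children's costs plus a made-up combine cost.
    val cost: Double = left.cost + right.cost + left.items.size * right.items.size
  }

  // True when the two sides share at least one join condition, so we can
  // prune Cartesian-product candidates (conditions are modelled as edges).
  def connected(a: Plan, b: Plan, edges: Set[(String, String)]): Boolean =
    edges.exists { case (x, y) =>
      (a.items(x) && b.items(y)) || (a.items(y) && b.items(x))
    }

  // Bottom-up, level-by-level enumeration: level 0 holds single items,
  // level k keeps the cheapest plan found for each (k+1)-item subset.
  // Assumes the join graph is connected.
  def reorder(relations: Seq[Relation], edges: Set[(String, String)]): Plan = {
    val best = scala.collection.mutable.Map[Set[String], Plan]()
    relations.foreach(r => best(r.items) = r) // level 0: single items

    for (level <- 1 until relations.size) {
      val smaller = best.toSeq // snapshot of plans built at lower levels
      for {
        (itemsA, a) <- smaller
        (itemsB, b) <- smaller
        if itemsA.size + itemsB.size == level + 1 && (itemsA & itemsB).isEmpty
        if connected(a, b, edges) // skip Cartesian products
      } {
        val candidate = Join(a, b)
        if (best.get(candidate.items).forall(_.cost > candidate.cost))
          best(candidate.items) = candidate // keep only the cheapest plan
      }
    }
    best(relations.flatMap(_.items).toSet) // cheapest plan covering all items
  }
}

The sketch keeps only the cheapest plan found for each set of items and skips any candidate whose two sides share no join condition; those two details correspond to the Cartesian-product pruning mentioned above and to some of the limitations discussed below.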
However, we have found several *limitations* in the current CBO implementation:

1. The current CBO is a rule in the logical optimization phase. It outputs only one logical plan to the physical planning phase, but we cannot guarantee that the best plan at the logical level is still the best after physical optimization.

2. The current bottom-up approach keeps only one best plan at each level, but we cannot guarantee that the overall best plan can be built from the best plan of each level.

3. The current cost formula is cost = weight * cardinality + (1.0 - weight) * size, where the first term roughly corresponds to the CPU cost and the second to the I/O cost. The formula is oversimplified: it treats all join implementations the same way and does not take shuffle and sort costs into account, even though Shuffle Exchange is one of the heaviest physical operators in Spark SQL.

4. Equivalent join conditions are not supported. For example, (A join B join C on a=b and b=c) could be reordered to (A join C join B on a=c and c=b), which may be more efficient. In the current implementation, however, the condition "a=c" is never derived, so "A join C" is treated as a Cartesian product and excluded.

The bottom-up approach originated in the System-R optimizer (1979). It quickly became a standard, and many modern relational database optimizers are "System-R style", for example Oracle, PostgreSQL, MySQL, and DB2. Later, new optimizer styles were invented: Volcano (1993) and Cascades (1995). They are less famous than System-R but are still widely used in practice, for example in Microsoft SQL Server, Greenplum Orca, and Apache Calcite. They implement a top-down transformational search algorithm and provide an extensible optimization framework.

A top-down optimization framework can help address the above limitations, since it explores a more complete search space and combines the logical and physical phases for more accurate cost estimation. As for the efficiency of keeping all alternative plans, Cascades also provides pruning to cut down the search space.

What about implementing *a new Cascades-style CBO for Spark SQL*? It could be a new rule in the current "Planner" that reads a logical plan after the heuristic rules and outputs the best physical plan, with the least cost, after join reordering and physical implementation rules.

Xiaoju Wu
Phone: +86 17717640807