zhztheplayer opened a new issue, #5057: URL: https://github.com/apache/incubator-gluten/issues/5057
### Description # Enhancement: Gluten Advanced Cost-based Query Optimization (Advanced CBO) ## Background Many of developers may already noticed that Apache Spark's Catalyst query planner basically heuristically applies rules without doing a global search for best plan (except for some local cases like join-reorder). This have been bringing a general issue to Gluten: With writing heuristic rules Gluten just gain limited capability on deciding offloading which part of computation to native engine, and users may frequently find the generated plan is sub-optimal and could cause performance issues. By this enhancement proposal we would like to introduce some techniques derived from the well-known Volcano / Cascades query optimizer theory to Gluten's code base to deal with this. To distinguish with vanilla Spark's CBO (cost-based optimization), we'd name the new thing **Advanced CBO** (hereinafter also called ACBO). Note, naming it with "advanced" doesn't actually mean it's required to be literally more advanced than existing query optimization strategy in Apache Spark, but should have the potential to be more suitable for native accelerator like Gluten's adoption scenarios. ## Design The rough design of ACBO will follow these principles: ### Plan enumeration ACBO should basically have the capability to do plan enumeration. In the worst case, the search space of a N node plan in Gluten would be at least O(2^N) (assume each node has vanilla / native versions of implementation). So memorized-search should be adopted to minimize the overall search space. ### General purpose core module Make the general part of ACBO's optimizer general. It should be able to handle any generic plan representation, neither limited to Gluten, nor Spark. This is because what we are doing is not developing an optimizer for a new DBMS but to integrate an existing system into the new optimizer. If the the design is general enough, then we can: 1. Smoothly know whether a further issue is because of Gluten's code or caused by ACBO during development. 2. Keep the optimizer kernal away from backend layers. ACBO's search engine will have less compile-time dependencies with having Gluten / Spark's jars excluded. It eases maintenance for long-term. ACBO will be a new Maven module of gluten. Will use 'gluten-cbo'. ### Individually work with Spark's CBO ACBO could be turn on/off despite whether user turns Spark's CBO on or off. Though we can get better statistics from Spark's CBO and ACBO could leverage that when Spark CBO is also turned on. ### As a Spark columnar rule ACBO will only operate on Spark's physical plan. We can decide whether to enable it in logical plan optimization in future. But that would not be a goal for long while. A smaller code entrance for ACBO would lower the chance of adding maintenance burden, or facing problems caused by Gluten's legacy code issues. ## Goal The following are some goals of the initial version of ACBO: 1. Deliver a general purpose CBO framework with global plan enumeration and search enabled; 3. Selectively enable some rules in Gluten's rule list in ACBO; Provide user the option to turn the ACBO on/off. The option should be by default off. 4. If possible, replace the main single-op validation rule (it's now "RegularTransformRule") and C2R / R2C adding rule with ACBO version. We don't have to expect real performance speed-up in the 1st move, since with no matter any CBO implementation, performance should be tightly bounded to cost model. We don't yet decide to deliver a solid cost model at the 1st version of ACBO so it will be expected that performance will not be very much focused. On the other hand, one of the most important purpose at this period is to integrate ACBO with current optimizer framework "successfully", and make the integration work as stably as possible. Writing ACBO rules will be comparatively easier than writing heuristic rules: developer just tells the optimizer one plan can be replaced by another, and doesn't have to guarantee the replacement provides any profit or not. Thus, removing the RBO rules and migrating it to ACBO will always simplify the code. Thus one of ACBO's bonus is to make Gluten's code more maintainable. ## Non-goal It's worth noting again that we don't plan to replace catalyst's query optimization by ACBO. ACBO would work as one or several individual rules in Spark's query execution procedure. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
