[I] Enhancement: Gluten Advanced Cost-based Query Optimization (Advanced CBO) [incubator-gluten]

via GitHub Wed, 20 Mar 2024 17:10:13 -0700


zhztheplayer opened a new issue, #5057:
URL: https://github.com/apache/incubator-gluten/issues/5057


   ### Description
   
   # Enhancement: Gluten Advanced Cost-based Query Optimization (Advanced CBO)
   
   ## Background
   
   Many developers may have already noticed that Apache Spark's Catalyst query 
planner mostly applies rules heuristically, without a global search for the 
best plan (except in some local cases such as join reordering). This has been 
a general problem for Gluten: with hand-written heuristic rules, Gluten has 
only limited ability to decide which parts of a computation to offload to the 
native engine, so users may frequently find that the generated plan is 
sub-optimal and causes performance issues.
   
   With this enhancement proposal we would like to introduce techniques 
derived from the well-known Volcano / Cascades query optimizer theory into 
Gluten's code base to address this. To distinguish it from vanilla Spark's CBO 
(cost-based optimization), we'll name the new component **Advanced CBO** 
(hereinafter also called ACBO). Note that calling it "advanced" doesn't mean 
it has to be literally more advanced than the existing query optimization 
strategy in Apache Spark; rather, it should have the potential to be better 
suited to the adoption scenarios of a native accelerator like Gluten.
   
   ## Design
   
   The rough design of ACBO will follow these principles:
   
   ### Plan enumeration
   ACBO should have the basic capability to enumerate plans. In the worst 
case, the search space for an N-node plan in Gluten is at least O(2^N) 
(assuming each node has both a vanilla and a native implementation). So 
memoized search should be adopted to keep the explored search space small.
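
To make the search-space argument concrete, here is a minimal sketch of memoized search over a toy plan tree. It is not ACBO's actual algorithm: the `Node` class, the per-operator costs, and `TRANSITION_COST` (standing in for a row/columnar conversion between a parent and a child that use different implementations) are all hypothetical. Memoizing on (sub-plan, implementation) solves each subproblem once, so a tree is handled in O(N) steps rather than by enumerating all 2^N assignments.

```python
from dataclasses import dataclass
from functools import lru_cache
from itertools import product

IMPLS = ("vanilla", "native")
TRANSITION_COST = 3  # hypothetical cost of a conversion between parent and child


@dataclass(frozen=True)
class Node:
    """Toy plan node (hypothetical); frozen so it is hashable and can be memoized."""
    name: str
    vanilla_cost: int
    native_cost: int
    children: tuple = ()


def op_cost(node, impl):
    return node.native_cost if impl == "native" else node.vanilla_cost


@lru_cache(maxsize=None)
def best(node, impl):
    """Cheapest cost of the subtree rooted at `node` when `node` itself uses `impl`.

    lru_cache is the memo: each (sub-plan, impl) pair is computed only once.
    """
    total = op_cost(node, impl)
    for child in node.children:
        total += min(
            best(child, child_impl) + (0 if child_impl == impl else TRANSITION_COST)
            for child_impl in IMPLS
        )
    return total


def best_plan_cost(root):
    return min(best(root, impl) for impl in IMPLS)


def brute_force_cost(root):
    """Enumerates all 2^N assignments; only useful to cross-check small plans."""
    nodes, edges = [], []

    def collect(node, parent_idx):
        idx = len(nodes)
        nodes.append(node)
        if parent_idx is not None:
            edges.append((parent_idx, idx))
        for child in node.children:
            collect(child, idx)

    collect(root, None)
    totals = []
    for impls in product(IMPLS, repeat=len(nodes)):
        total = sum(op_cost(n, i) for n, i in zip(nodes, impls))
        total += sum(TRANSITION_COST for p, c in edges if impls[p] != impls[c])
        totals.append(total)
    return min(totals)


# Example plan: agg <- udf_project <- filter <- scan (all costs are made up).
scan = Node("scan", vanilla_cost=10, native_cost=4)
filt = Node("filter", vanilla_cost=5, native_cost=2, children=(scan,))
udf = Node("udf_project", vanilla_cost=3, native_cost=50, children=(filt,))
plan = Node("agg", vanilla_cost=8, native_cost=3, children=(udf,))
```

In this example the memoized search and the brute-force enumeration agree on the same minimum cost, but the former visits only two entries per node. A Cascades-style memo additionally groups logically equivalent sub-plans, which this sketch only approximates through structural equality of `Node` values.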
   
   ### General purpose core module
   Keep the core of ACBO's optimizer general-purpose. It should be able to 
handle any generic plan representation, limited to neither Gluten nor Spark. 
This is because we are not developing an optimizer for a new DBMS, but 
integrating an existing system with a new optimizer. If the design is general 
enough, then we can:
   
   1. Easily tell during development whether an issue is caused by Gluten's 
code or by ACBO. 
   2. Keep the optimizer kernel away from backend layers. With Gluten's / 
Spark's jars excluded, ACBO's search engine will have fewer compile-time 
dependencies, which eases long-term maintenance.
   
   ACBO will be a new Maven module of Gluten, named 'gluten-cbo'.
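
One way to picture a plan-agnostic core is a small adapter that tells the core how to read and rebuild a plan, so the core never imports the plan classes themselves. The sketch below is illustrative only: `PlanModel`, `transform_up`, and both toy plan encodings are hypothetical names, not part of Gluten or Spark.

```python
from dataclasses import dataclass
from typing import Callable, Generic, Sequence, TypeVar

T = TypeVar("T")


@dataclass(frozen=True)
class PlanModel(Generic[T]):
    """Hypothetical adapter: all the core knows about a plan type T."""
    children_of: Callable[[T], Sequence[T]]
    with_children: Callable[[T, Sequence[T]], T]


def transform_up(plan, model, fn):
    """Generic core routine: rewrite children first, then the node itself.

    It touches plans only through `model`, so it works for any representation.
    """
    new_children = [transform_up(c, model, fn) for c in model.children_of(plan)]
    return fn(model.with_children(plan, new_children))


# Embedding 1: plans as nested tuples ("Op", child, child, ...).
tuple_model = PlanModel(
    children_of=lambda p: p[1:],
    with_children=lambda p, cs: (p[0], *cs),
)

# Embedding 2: plans as dicts {"op": ..., "children": [...]}.
dict_model = PlanModel(
    children_of=lambda p: p["children"],
    with_children=lambda p, cs: {**p, "children": list(cs)},
)

OFFLOADABLE = {"Scan", "Filter"}  # made-up set of operators with a native version


def offload_tuple(p):
    return ("Native" + p[0], *p[1:]) if p[0] in OFFLOADABLE else p


def offload_dict(p):
    return {**p, "op": "Native" + p["op"]} if p["op"] in OFFLOADABLE else p


tuple_plan = ("Project", ("Filter", ("Scan",)))
dict_plan = {
    "op": "Project",
    "children": [{"op": "Filter", "children": [{"op": "Scan", "children": []}]}],
}

new_tuple_plan = transform_up(tuple_plan, tuple_model, offload_tuple)
new_dict_plan = transform_up(dict_plan, dict_model, offload_dict)
```

The same `transform_up` runs unchanged on both encodings, which is the property that would let the core be tested in isolation from backend code.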
   
   ### Work independently of Spark's CBO
   ACBO can be turned on or off regardless of whether the user turns Spark's 
CBO on or off. However, Spark's CBO provides better statistics, and ACBO could 
leverage them when Spark's CBO is also turned on.
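
The relationship between the two switches can be sketched as follows. Everything here is an assumption for illustration: the `Settings` fields, the `DEFAULT_ROW_COUNT` fallback, and the idea of passing an optional row count are hypothetical, and no real configuration key for ACBO is implied.

```python
from dataclasses import dataclass
from typing import Optional

DEFAULT_ROW_COUNT = 1_000_000  # hypothetical fallback when no statistics exist


@dataclass(frozen=True)
class Settings:
    """Two independent switches (hypothetical names)."""
    acbo_enabled: bool = False       # ACBO itself; off by default
    spark_cbo_enabled: bool = False  # vanilla Spark CBO


def should_run_acbo(settings: Settings) -> bool:
    # ACBO's switch does not depend on Spark CBO's switch.
    return settings.acbo_enabled


def estimated_rows(settings: Settings, spark_cbo_row_count: Optional[int]) -> int:
    """Use Spark CBO's statistic when it is available, otherwise fall back."""
    if settings.spark_cbo_enabled and spark_cbo_row_count is not None:
        return spark_cbo_row_count
    return DEFAULT_ROW_COUNT
```

With this shape, ACBO still runs when Spark's CBO is off; it simply works with coarser estimates.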
   
   ### As a Spark columnar rule
   ACBO will operate only on Spark's physical plan. We may decide later 
whether to enable it in logical plan optimization, but that will not be a goal 
for a long while. A smaller entry point for ACBO lowers the chance of adding 
maintenance burden or running into problems caused by issues in Gluten's 
legacy code.
   
   ## Goal
   
   The following are some goals of the initial version of ACBO:
   
   1. Deliver a general-purpose CBO framework with global plan enumeration and 
search enabled.
   2. Selectively enable some rules from Gluten's rule list in ACBO, and give 
users the option to turn ACBO on or off.
   The option should be off by default.
   3. If possible, replace the main single-op validation rule (currently 
"RegularTransformRule") and the rule that adds C2R / R2C (columnar-to-row / 
row-to-columnar) transitions with ACBO versions.
   
   We don't need to expect a real performance speed-up from this first step, 
since with any CBO implementation, performance is tightly bound to the cost 
model. We have not yet decided to deliver a solid cost model in the first 
version of ACBO, so performance is not expected to be a major focus. Instead, 
one of the most important purposes at this stage is to integrate ACBO with the 
current optimizer framework "successfully", and to make the integration work 
as stably as possible. Writing ACBO rules will be comparatively easier than 
writing heuristic rules: the developer just tells the optimizer that one plan 
can be replaced by another, and doesn't have to guarantee that the replacement 
is profitable. Thus, removing the RBO (rule-based optimization) rules and 
migrating them to ACBO will always simplify the code, and one of ACBO's 
bonuses is that it makes Gluten's code more maintainable.
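
The division of labor described above, where rules only propose equivalent plans and the cost model decides, can be sketched like this. The plan encoding (a tuple of operator names for a linear pipeline), the `offload_rule`, and every number in `OP_COST` are made up for illustration and say nothing about Gluten's real rules or costs.

```python
VANILLA_TO_NATIVE = {
    "Scan": "NativeScan",
    "Filter": "NativeFilter",
    "Project": "NativeProject",
}

# Hypothetical costs; NativeProject is deliberately worse than Project.
OP_COST = {
    "Scan": 10, "NativeScan": 4,
    "Filter": 5, "NativeFilter": 2,
    "Project": 3, "NativeProject": 9,
}
TRANSITION_COST = 3  # hypothetical cost of switching between vanilla and native


def offload_rule(plan):
    """Propose every plan that offloads one more operator.

    The rule makes no claim that an alternative is cheaper, only that it is
    equivalent.
    """
    for i, op in enumerate(plan):
        if op in VANILLA_TO_NATIVE:
            yield plan[:i] + (VANILLA_TO_NATIVE[op],) + plan[i + 1:]


def explore(plan, rules):
    """Collect all plans reachable by applying the rules repeatedly."""
    seen = {plan}
    work = [plan]
    while work:
        current = work.pop()
        for rule in rules:
            for alt in rule(current):
                if alt not in seen:  # each distinct plan is expanded once
                    seen.add(alt)
                    work.append(alt)
    return seen


def is_native(op):
    return op.startswith("Native")


def cost(plan):
    total = sum(OP_COST[op] for op in plan)
    total += sum(
        TRANSITION_COST
        for a, b in zip(plan, plan[1:])
        if is_native(a) != is_native(b)
    )
    return total


def optimize(plan, rules):
    """The cost model, not the rule, decides which alternative wins."""
    return min(explore(plan, rules), key=cost)


original = ("Scan", "Filter", "Project")
chosen = optimize(original, [offload_rule])
```

Here the rule also proposes offloading `Project`, which is unprofitable under these made-up costs; the optimizer simply does not pick that alternative, so the rule author never has to encode the profitability check.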
   
   ## Non-goal
   
   It's worth noting again that we don't plan to replace Catalyst's query 
optimization with ACBO. ACBO will work as one or several individual rules in 
Spark's query execution procedure.



