Re: [PR] [DOC] Add document introducting StageLevel Resource Profile Adjust [incubator-gluten]

via GitHub Wed, 05 Mar 2025 20:13:26 -0800


jackylee-ch commented on code in PR #8908:
URL: https://github.com/apache/incubator-gluten/pull/8908#discussion_r1982656108



##########
docs/get-started/VeloxStageResourceAdj.md:
##########
@@ -0,0 +1,79 @@
+---
+layout: page
+title: Stage-Level Resource Adjustment in Velox Backend
+nav_order: 3
+parent: Getting-Started
+---
+## Using Stage-Level Resource Adjustment to Avoid OOM (Experimental)
+---
+
+### **Overview**
+One major advantage of Apache Gluten is its ability to significantly reduce memory requirements per executor (potentially by up to half) when entire stages are offloaded to the native engine. This engine primarily relies on off-heap memory with minimal on-heap usage. However, when stages contain fallback operators that utilize the JVM engine, the on-heap memory size must be increased, leading to even higher memory demands per executor. This challenge has posed significant barriers during the adoption of Apache Gluten.
+
+To address this issue, Apache Gluten introduces a stage-level resource auto-adjustment framework. This feature dynamically optimizes task and executor resource profiles, such as heap and off-heap memory allocation, based on the specific characteristics of each stage, including the presence of fallback operators. Additionally, this framework is designed with future enhancements in mind, allowing for adjustments to accommodate other requirements, such as heavy shuffle workloads (to be supported in the future).
+
+---
+
+### **Prerequisites**
+1. **Enable Adaptive Query Execution (AQE)**:
+   ```properties  
+   spark.sql.adaptive.enabled=true  
+   ```  
+2. **Enable Executor Dynamic Allocation**:
+   ```properties  
+   spark.dynamicAllocation.enabled=true  
+   ```  
+3. **Resource Scheduler Compatibility**:  
+   Ensure the underlying cluster resource manager (e.g., YARN, Kubernetes) supports dynamic resource allocation.
+
+---
+
+### **Key Configurations**
+Add the following configurations to your Spark application:
+
+
+| Parameters                                                        | Description                                                                                                                                         | Default |
+|-------------------------------------------------------------------|-----------------------------------------------------------------------------------------------------------------------------------------------------|---------|
+| spark.gluten.auto.adjustStageResource.enabled                     | Experimental: If enabled, Gluten will try to set the stage resource according to the stage execution plan. NOTE: Only works when AQE is enabled at the same time. | false   |
+| spark.gluten.auto.adjustStageResources.heap.ratio                 | Experimental: Ratio by which executor heap memory is increased when a stage matches the adjust-stage-resource rule.                                 | 2.0d    |
+| spark.gluten.auto.adjustStageResources.fallenNode.ratio.threshold | Experimental: Increase executor heap memory when the ratio of fallen (fallback) nodes to total nodes in a stage exceeds this threshold.             | 0.5d    |
+#### **1. Enable Auto-Adjustment**
+```properties  
+spark.gluten.auto.adjustStageResource.enabled=true
+```
+### **How It Works**
+The framework analyzes each stage during query planning and adjusts resource profiles in the following scenarios:
+
+#### **Scenario 1: Fallback Operators Exist**
+If all operators in a stage fall back to vanilla Spark operators, or the ratio of fallback operators (e.g., unsupported UDAFs) exceeds the specified threshold, Gluten automatically increases the heap memory allocation to handle the extra load.
+---

Review Comment:
   can we move these `---`?
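
   For readers following the thread, the Scenario 1 rule quoted above can be sketched as a small decision function. This is only an illustration under stated assumptions, not Gluten's implementation: `StagePlanSummary` and `adjusted_heap_mb` are hypothetical names, the defaults mirror the `heap.ratio` (2.0) and `fallenNode.ratio.threshold` (0.5) values from the table, and treating the heap ratio as a plain multiplier on the base heap size is an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StagePlanSummary:
    """Hypothetical per-stage summary; not a Gluten or Spark type."""
    total_nodes: int      # number of operators in the stage plan
    fallback_nodes: int   # operators that fell back to vanilla Spark


def adjusted_heap_mb(stage: StagePlanSummary,
                     base_heap_mb: int,
                     heap_ratio: float = 2.0,
                     fallback_threshold: float = 0.5) -> int:
    """Return the executor heap size (MB) to request for this stage.

    Assumed rule, paraphrasing the quoted doc: raise the heap when the
    whole stage falls back, or when the share of fallback operators
    exceeds the threshold. Otherwise keep the base heap size.
    """
    if stage.total_nodes <= 0:
        # Nothing to analyze, so leave the profile unchanged.
        return base_heap_mb

    fallback_ratio = stage.fallback_nodes / stage.total_nodes
    whole_stage_fallback = stage.fallback_nodes == stage.total_nodes

    if whole_stage_fallback or fallback_ratio > fallback_threshold:
        # Assumption: the ratio is applied as a simple multiplier.
        return int(base_heap_mb * heap_ratio)
    return base_heap_mb
```

   Under these assumptions, a stage with 6 fallback operators out of 10 would request twice the base heap, while a stage with exactly 5 out of 10 would not, because the comparison is strict.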



##########
docs/Configuration.md:
##########
@@ -99,6 +99,9 @@ The following configurations are related to Velox settings.
 | spark.gluten.sql.columnar.backend.velox.orc.scan.enabled             | Enable velox orc scan. If disabled, vanilla spark orc scan will be used. | true |
 | spark.gluten.sql.complexType.scan.fallback.enabled                   | Force fallback for complex type scan, including struct, map, array. | true |
 | spark.gluten.velox.offHeapBroadcastBuildRelation.enabled             | Experimental: If enabled, broadcast build relation will use offheap memory. Otherwise, broadcast build relation will use onheap memory, default value is false | |
+| spark.gluten.auto.adjustStageResource.enabled                        | Experimental: If enabled, Gluten will try to set the stage resource according to the stage execution plan. NOTE: Only works when AQE is enabled at the same time. | false   |
+| spark.gluten.auto.adjustStageResources.heap.ratio                    | Experimental: Ratio by which executor heap memory is increased when a stage matches the adjust-stage-resource rule. | 2.0d    |****

Review Comment:
   `****` should be removed.


