jackylee-ch commented on code in PR #8908:
URL: https://github.com/apache/incubator-gluten/pull/8908#discussion_r1982656108
##########
docs/get-started/VeloxStageResourceAdj.md:
##########
@@ -0,0 +1,79 @@
+---
+layout: page
+title: Stage-Level Resource Adjustment in Velox Backend
+nav_order: 3
+parent: Getting-Started
+---
+## Using Stage-Level Resource Adjustment to Avoid OOM(Experimental)
+---
+
+### **Overview**
+One major advantage of Apache Gluten is its ability to significantly reduce
memory requirements per executor—potentially by up to half—when entire stages
are offloaded to the native engine. This engine primarily relies on off-heap
memory with minimal on-heap usage. However, when stages contain fallback
operators that utilize the JVM engine, the on-heap memory size must be
increased, leading to even higher memory demands per executor. This challenge
has posed significant barriers during the adoption of Apache Gluten.
+
+To address this issue, Apache Gluten introduces a stage-level resource
auto-adjustment framework. This feature dynamically optimizes task and executor
resource profiles, such as heap and off-heap memory allocation, based on the
specific characteristics of each stage, including the presence of fallback
operators. Additionally, this framework is designed with future enhancements in
mind, allowing for adjustments to accommodate other requirements, such as heavy
shuffle workloads(to be supported in the future).
+
+---
+
+### **Prerequisites**
+1. **Enable Adaptive Query Execution (AQE)**:
+ ```properties
+ spark.sql.adaptive.enabled=true
+ ```
+2. **Enable Executor Dynamic Allocation**:
+ ```properties
+ spark.dynamicAllocation.enabled=true
+ ```
+3. **Resource Scheduler Compatibility**:
+ Ensure the underlying cluster resource manager (e.g., YARN, Kubernetes)
supports dynamic resource allocation.
+
+---
+
+### **Key Configurations**
+Add the following configurations to your Spark application:
+
+
+| Parameters | Description | Default |
+|-------------------------------------------------------------------|---------------------------------------------------------------------------------------------------------------------------------------------------------------|---------|
+| spark.gluten.auto.adjustStageResource.enabled | Experimental: If enabled, gluten will try to set the stage resource according to stage execution plan. NOTE: Only works when aqe is enabled at the same time. | false |
+| spark.gluten.auto.adjustStageResources.heap.ratio | Experimental: Increase executor heap memory when match adjust stage resource rule. | 2.0d |
+| spark.gluten.auto.adjustStageResources.fallenNode.ratio.threshold | Experimental: Increase executor heap memory when stage contains fallen node count exceeds the total node count ratio. | 0.5d |
+#### **1. Enable Auto-Adjustment**
+```properties
+spark.gluten.auto.AdjustStageResource.enabled=true
+```
+### **How It Works**
+The framework analyzes each stage during query planning and adjusts resource
profiles in following scenarios:
+
+#### **Scenario 1: Fallback Operators Exist**
+If a stage all operator fallback to vanilla Spark operator or fallback
operators (e.g., unsupported UDAFs) ratio exceed specified threshold, Gluten
will automic increases heap memory allocation to handle the extra load.
+---
Review Comment:
can we move these `---`?
##########
docs/Configuration.md:
##########
@@ -99,6 +99,9 @@ The following configurations are related to Velox settings.
| spark.gluten.sql.columnar.backend.velox.orc.scan.enabled | Enable velox orc scan. If disabled, vanilla spark orc scan will be used. | true |
| spark.gluten.sql.complexType.scan.fallback.enabled | Force fallback for complex type scan, including struct, map, array. | true |
| spark.gluten.velox.offHeapBroadcastBuildRelation.enabled | Experimental: If enabled, broadcast build relation will use offheap memory. Otherwise, broadcast build relation will use onheap memory, default value is false | |
+| spark.gluten.auto.adjustStageResource.enabled | Experimental: If enabled, gluten will try to set the stage resource according to stage execution plan. NOTE: Only workes when aqe is enabled at the same time. | false |
+| spark.gluten.auto.adjustStageResources.heap.ratio | Experimental: Increase executor heap memory when match adjust stage resource rule. | 2.0d |****
Review Comment:
`****` should be removed.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.