[
https://issues.apache.org/jira/browse/SPARK-55353?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Huan Zheng updated SPARK-55353:
-------------------------------
Description:
## Problem Description
We encountered a severe driver OOM issue when running complex SQL queries on Spark 3.5.7, which represents a significant memory regression compared to Spark 3.2.2.
### Regression Details
- **Spark 3.2.2**: Same SQL query runs successfully with **2GB driver memory**
- **Spark 3.5.7**: Same SQL query fails with OOM even with **8GB driver memory**
- **Memory Regression**: **4x increase** in driver memory requirements
### Environment
- Spark Version: 3.5.7 (with patches: SPARK-50229, SPARK-45439)
- Query Characteristics (a hypothetical sketch of this query shape follows this list):
- 155 stages
- ~1000 tasks per stage
- Complex feature engineering with multiple UDFs
- 11 LEFT JOINs
- 100+ output columns
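For illustration, a minimal sketch of a query with this shape, assuming hypothetical table, column, and UDF names (the real query is not included in this report):
```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, udf}

object QueryShapeSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("query-shape-sketch").getOrCreate()

    val base = spark.range(0, 1000000L).withColumnRenamed("id", "user_id")
    val normalize = udf((v: Long) => v.toDouble / 100.0) // stand-in for the real feature UDFs

    // 11 LEFT JOINs against synthetic dimension tables.
    val joined = (1 to 11).foldLeft(base) { (df, i) =>
      val dim = spark.range(0, 1000000L)
        .select(col("id").as("user_id"), (col("id") % (i + 1)).as(s"dim_$i"))
      df.join(dim, Seq("user_id"), "left")
    }

    // 100+ derived output columns.
    val features = (1 to 100).foldLeft(joined) { (df, i) =>
      df.withColumn(s"feature_$i", normalize(col("user_id") + i))
    }
    features.write.mode("overwrite").parquet("/tmp/feature_sketch")
    spark.stop()
  }
}
```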
## Root Cause Analysis
### Heap Dump Analysis
Driver heap dump analysis revealed:
- **15.6 million** `AccumulatorMetadata` objects
- **~1.5GB** memory consumed by accumulator metadata alone
- Primary memory holder: `SQLAppStatusListener`
### Memory Calculation
**Spark 3.2.2 baseline:**
- 200 stages × 1000 tasks × ~25 accumulators/task = ~5M accumulators
- Memory usage: ~500MB (fits in 2GB driver)
**Spark 3.5.7 regression:**
- 155 stages × 1000 tasks × ~100 accumulators/task = ~15.5M accumulators
- Memory usage: ~1.5GB (causes OOM even with 8GB driver)
**Accumulator count per task increased by ~4x** from version 3.2 to 3.5 (see the arithmetic sketch below).
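A back-of-envelope sketch of this arithmetic; the ~100 bytes per tracked entry is an estimate derived from the heap dump figures above (≈1.5 GB / 15.5M entries), not a measured Spark constant:
```scala
// Rough driver-side accumulator-metadata estimate using the figures above.
// ASSUMPTION: ~100 bytes per tracked accumulator entry.
object AccumulatorMemoryEstimate {
  def estimate(stages: Long, tasksPerStage: Long, accumulatorsPerTask: Long,
               bytesPerEntry: Long = 100L): (Long, Double) = {
    val entries = stages * tasksPerStage * accumulatorsPerTask
    val gigabytes = entries * bytesPerEntry / math.pow(1024, 3)
    (entries, gigabytes)
  }

  def main(args: Array[String]): Unit = {
    println(estimate(200, 1000, 25))  // Spark 3.2.2: (5000000, ~0.47 GB)
    println(estimate(155, 1000, 100)) // Spark 3.5.7: (15500000, ~1.44 GB)
  }
}
```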
### Contributing JIRAs
The following JIRAs introduced new metrics between Spark 3.2 and 3.5,
significantly increasing the accumulator count:
1. **SPARK-36620** (Spark 3.3.0): Added push-based shuffle metrics
- New metrics: corruptMergedBlockChunks, mergedFetchFallbackCount, remoteMergeBlocksFetched, etc.
2. **SPARK-40711** (Spark 3.4.0): Added window spill metrics
- New metrics: spillSize, memoryBytesSpilled, diskBytesSpilled for window operations
3. **SPARK-43214** (Spark 3.5.0): Added driver-side metrics
- New metrics: driverAccumulatorValue for various operations
- Significantly increased accumulator count for complex queries
These metrics are collected by `SQLAppStatusListener` and stored in driver
memory, causing OOM for queries with many stages and tasks.
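To illustrate the multiplication effect, here is a minimal sketch (assuming spark-sql on the classpath; `SQLMetrics` is a Spark-internal class, used here only for illustration) of how each per-operator SQL metric becomes one named accumulator registered on the driver, whose per-task values `SQLAppStatusListener` retains while the execution is live:
```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.execution.metric.SQLMetrics

object MetricGrowthSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[1]").appName("metric-growth").getOrCreate()
    val sc = spark.sparkContext

    // A physical operator declares its metrics as a Map[String, SQLMetric]; each
    // entry is one named accumulator registered on the driver. The three names
    // below mirror the window spill metrics added by SPARK-40711.
    val windowSpillMetrics = Seq("spill size", "memory bytes spilled", "disk bytes spilled")
      .map(name => SQLMetrics.createSizeMetric(sc, name))

    // For a live execution, SQLAppStatusListener keeps one value per
    // (accumulator, task), so total bookkeeping grows roughly as
    // operators × metricsPerOperator × tasks.
    println(s"Registered ${windowSpillMetrics.size} additional driver-side accumulators for one operator")
    spark.stop()
  }
}
```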
## Solution
### New Configuration
Add a new static configuration to control `SQLAppStatusListener` loading:
**Config Name:** `spark.sql.ui.appStatusListener.enabled`
**Default Value:** `true` (backward compatible)
**Description:** Whether to enable SQLAppStatusListener for collecting SQL execution metrics. Disabling this can significantly reduce driver memory usage for complex queries with many stages and tasks, but will result in limited SQL execution information in the UI.
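A sketch of how the proposed flag would be used once available (this config does not yet exist in released Spark; the name comes from this proposal). Because it is a static configuration, it must be set before the application starts, e.g. in `spark-defaults.conf`, via `--conf spark.sql.ui.appStatusListener.enabled=false` on spark-submit, or on the session builder:
```scala
import org.apache.spark.sql.SparkSession

object DisableSqlListenerSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("complex-feature-engineering")
      // Proposed flag from this ticket; trades SQL-tab detail in the UI for
      // lower driver memory on queries with many stages and tasks.
      .config("spark.sql.ui.appStatusListener.enabled", "false")
      .getOrCreate()

    // ... run the large SQL job as usual ...
    spark.stop()
  }
}
```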
was:
## Problem Description
We encountered a severe driver OOM issue when running complex SQL queries on
Spark 3.5.7, which represents a significant memory regression compared to Spark
3.2.2.
### Regression Details
- **Spark 3.2.2**: Same SQL query runs successfully with **2GB driver memory**
- **Spark 3.5.7**: Same SQL query fails with OOM even with **8GB driver
memory**
- **Memory Regression**: **4x increase** in driver memory requirements
### Environment
- Spark Version: 3.5.7 (with patches: SPARK-50229, SPARK-45439)
- Query Characteristics:
- 155 stages
- ~1000 tasks per stage
- Complex feature engineering with multiple UDFs
- 11 LEFT JOINs
- 100+ output columns
## Root Cause Analysis
### Heap Dump Analysis
Driver heap dump analysis revealed:
- **15.6 million** `AccumulatorMetadata` objects
- **~1.5GB** memory consumed by accumulator metadata alone
- Primary memory holder: `SQLAppStatusListener`
### Memory Calculation
**Spark 3.2.2 baseline:**
- 200 stages × 1000 tasks × ~25 accumulators/task = ~5M accumulators
- Memory usage: ~500MB (fits in 2GB driver)
**Spark 3.5.7 regression:**
- 155 stages × 1000 tasks × ~100 accumulators/task = 15.5M accumulators
- Memory usage: ~1.5GB (causes OOM even with 8GB driver)
**Accumulator count per task increased by 4x** from version 3.2 to 3.5.
### Contributing JIRAs
The following JIRAs introduced new metrics between Spark 3.2 and 3.5,
significantly increasing the accumulator count:
1. **SPARK-36620** (Spark 3.3.0): Added push-based shuffle metrics
- New metrics: corruptMergedBlockChunks, mergedFetchFallbackCount,
remoteMergeBlocksFetched, etc.
2. **SPARK-40711** (Spark 3.4.0): Added window spill metrics
- New metrics: spillSize, memoryBytesSpilled, diskBytesSpilled for window
operations
3. **SPARK-43214** (Spark 3.5.0): Added driver-side metrics
- New metrics: driverAccumulatorValue for various operations
- Significantly increased accumulator count for complex queries
These metrics are collected by `SQLAppStatusListener` and stored in driver
memory, causing OOM for queries with many stages and tasks.
## Solution
### New Configuration
Add a new static configuration to control `SQLAppStatusListener` loading:
**Config Name:** `spark.sql.ui.appStatusListener.enabled`
**Default Value:** `true` (backward compatible)
**Description:** Whether to enable SQLAppStatusListener for collecting SQL
execution metrics. Disabling this can significantly reduce driver memory usage
for complex queries with many stages and tasks, but will result in limited SQL
execution information in the UI.
> Driver OOM with complex SQL queries: Add config to disable
> SQLAppStatusListener
> --------------------------------------------------------------------------------
>
> Key: SPARK-55353
> URL: https://issues.apache.org/jira/browse/SPARK-55353
> Project: Spark
> Issue Type: Improvement
> Components: SQL, UI
> Affects Versions: 3.5.7
> Reporter: Huan Zheng
> Priority: Minor
>
> We encountered a severe driver OOM issue when running complex SQL queries
> on Spark 3.5.7, which represents a significant memory regression compared to
> Spark 3.2.2.
> ### Regression Details
> - **Spark 3.2.2**: Same SQL query runs successfully with **2GB driver memory**
> - **Spark 3.5.7**: Same SQL query fails with OOM even with **8GB driver memory**
> - **Memory Regression**: **4x increase** in driver memory requirements
> ### Environment
> - Spark Version: 3.5.7 (with patches: SPARK-50229, SPARK-45439)
> - Query Characteristics:
> - 155 stages
> - ~1000 tasks per stage
> - Complex feature engineering with multiple UDFs
> - 11 LEFT JOINs
> - 100+ output columns
> ## Root Cause Analysis
> ### Heap Dump Analysis
> Driver heap dump analysis revealed:
> - **15.6 million** `AccumulatorMetadata` objects
> - **~1.5GB** memory consumed by accumulator metadata alone
> - Primary memory holder: `SQLAppStatusListener`
> ### Memory Calculation
> **Spark 3.2.2 baseline:**
> - 200 stages × 1000 tasks × ~25 accumulators/task = ~5M accumulators
> - Memory usage: ~500MB (fits in 2GB driver)
> **Spark 3.5.7 regression:**
> - 155 stages × 1000 tasks × ~100 accumulators/task = ~15.5M accumulators
> - Memory usage: ~1.5GB (causes OOM even with 8GB driver)
> **Accumulator count per task increased by ~4x** from version 3.2 to 3.5.
> ### Contributing JIRAs
> The following JIRAs introduced new metrics between Spark 3.2 and 3.5,
> significantly increasing the accumulator count:
> 1. **SPARK-36620** (Spark 3.3.0): Added push-based shuffle metrics
> - New metrics: corruptMergedBlockChunks, mergedFetchFallbackCount, remoteMergeBlocksFetched, etc.
> 2. **SPARK-40711** (Spark 3.4.0): Added window spill metrics
> - New metrics: spillSize, memoryBytesSpilled, diskBytesSpilled for window operations
> 3. **SPARK-43214** (Spark 3.5.0): Added driver-side metrics
> - New metrics: driverAccumulatorValue for various operations
> - Significantly increased accumulator count for complex queries
> These metrics are collected by `SQLAppStatusListener` and stored in driver
> memory, causing OOM for queries with many stages and tasks.
> ## Solution
> ### New Configuration
> Add a new static configuration to control `SQLAppStatusListener` loading:
> **Config Name:** `spark.sql.ui.appStatusListener.enabled`
> **Default Value:** `true` (backward compatible)
> **Description:** Whether to enable SQLAppStatusListener for collecting
> SQL execution metrics. Disabling this can significantly reduce driver memory
> usage for complex queries with many stages and tasks, but will result in
> limited SQL execution information in the UI.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]