[
https://issues.apache.org/jira/browse/SPARK-55353?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Huan Zheng updated SPARK-55353:
-------------------------------
Description:
PROBLEM:
Severe driver OOM regression from Spark 3.2.2 to 3.5.7 for complex SQL queries.
- Spark 3.2.2: Same query runs with 2GB driver memory
- Spark 3.5.7: Same query OOMs with 8GB driver memory
- Memory regression: 4x increase in driver memory requirements
Query characteristics:
- ~200 stages, ~1000 tasks per stage
- Complex feature engineering with multiple UDFs
- 12 LEFT JOINs with subqueries
- 100+ output columns
================================================================================
ROOT CAUSE:
2G Driver Heap dump analysis shows 15.6M AccumulatorMetadata objects consuming
~1.5GB memory,
held by SQLAppStatusListener.
Memory calculation:
- Spark 3.2.2: 200 stages × 1000 tasks × 25 accumulators/task = 5M objects
(~500MB)
- Spark 3.5.7: 155 stages × 1000 tasks × 100 accumulators/task = 15.5M objects
(~1.5GB)
Accumulator count per task increased 4x due to new metrics added between
versions:
- SPARK-36620 (3.3.0): Push-based shuffle metrics
- SPARK-40711 (3.4.0): Window spill metrics
- SPARK-43214 (3.5.0): Driver-side metrics
SQLAppStatusListener collects all these metrics in driver memory, causing OOM
for
queries with many stages and tasks.
================================================================================
SOLUTION:
Add new static configuration: spark.sql.ui.appStatusListener.enabled Controls
whether SQLAppStatusListener is loaded.
was:
We encountered a severe driver OOM issue when running complex SQL queries on
Spark 3.5.7, which represents a significant memory regression compared to Spark
3.2.2.
### Regression Details
- *{*}Spark 3.2.2{*}{*}: Same SQL query runs successfully with **2GB driver
memory{*}*
- *{*}Spark 3.5.7{*}{*}: Same SQL query fails with OOM even with **8GB driver
memory{*}*
- *{*}Memory Regression{*}{*}: **4x increase{*}* in driver memory requirements
### Environment
- Spark Version: 3.5.7 (with patches: SPARK-50229, SPARK-45439)
- Query Characteristics:
- 155 stages
- ~1000 tasks per stage
- Complex feature engineering with multiple UDFs
- 11 LEFT JOINs
- 100+ output columns
## Root Cause Analysis
### Heap Dump Analysis
Driver heap dump analysis revealed:
- *{*}15.6million{*}* `AccumulatorMetadata` objects
- *{*}~1.5GB{*}* memory consumed by accumulator metadata alone
- Primary memory holder: `SQLAppStatusListener`
### Memory Calculation
*{*}Spark 3.2.2 baseline:{*}*
- 200 stages × 1000 tasks × ~25 accumulators/task = ~5M accumulators
- Memory usage: ~500MB (fits in 2GB driver)
*{*}Spark 3.5.7 regression:{*}*
- 200 stages × 1000 tasks × ~100 accumulators/task = 15.5M accumulators
- Memory usage: ~1.5GB (causes OOM even with 8GB driver)
*{*}Accumulator count per task increased by 4x{*}* from version 3.2 to 3.5.
### Contributing JIRAs
The following JIRAs introduced new metrics between Spark 3.2 and 3.5,
significantly increasing the accumulator count:
1. *{*}SPARK-36620{*}* (Spark 3.3.0): Added push-based shuffle metrics
- New metrics: corruptMergedBlockChunks, mergedFetchFallbackCount,
remoteMergeBlocksFetched, etc.
2. *{*}SPARK-40711{*}* (Spark 3.4.0): Added window spill metrics
- New metrics: spillSize, memoryBytesSpilled, diskBytesSpilled for window
operations
3. *{*}SPARK-43214{*}* (Spark 3.5.0): Added driver-side metrics
- New metrics: driverAccumulatorValue for various operations
- Significantly increased accumulator count for complex queries
These metrics are collected by `SQLAppStatusListener` and stored in driver
memory, causing OOM for queries with many stages and tasks.
## Solution
### New Configuration
Add a new static configuration to control `SQLAppStatusListener` loading:
*{*}Config Name:{*}* `spark.sql.ui.appStatusListener.enabled`
*{*}Default Value:{*}* `true` (backward compatible)
*{*}Description:{*}* Whether to enable SQLAppStatusListener for collecting
SQL execution metrics. Disabling this can significantly reduce driver memory
usage for complex queries with many stages and tasks, but will result in
limited SQL execution information in the UI.
> Driver OOM with complex SQL queries: Add config to disable
> SQLAppStatusListener
> --------------------------------------------------------------------------------
>
> Key: SPARK-55353
> URL: https://issues.apache.org/jira/browse/SPARK-55353
> Project: Spark
> Issue Type: Improvement
> Components: SQL, UI
> Affects Versions: 3.5.7
> Reporter: Huan Zheng
> Priority: Minor
> Attachments: desensitized_sql_example.txt
>
>
> PROBLEM:
> Severe driver OOM regression from Spark 3.2.2 to 3.5.7 for complex SQL
> queries.
> - Spark 3.2.2: Same query runs with 2GB driver memory
> - Spark 3.5.7: Same query OOMs with 8GB driver memory
> - Memory regression: 4x increase in driver memory requirements
> Query characteristics:
> - ~200 stages, ~1000 tasks per stage
> - Complex feature engineering with multiple UDFs
> - 12 LEFT JOINs with subqueries
> - 100+ output columns
> ================================================================================
> ROOT CAUSE:
> 2G Driver Heap dump analysis shows 15.6M AccumulatorMetadata objects
> consuming ~1.5GB memory,
> held by SQLAppStatusListener.
> Memory calculation:
> - Spark 3.2.2: 200 stages × 1000 tasks × 25 accumulators/task = 5M objects
> (~500MB)
> - Spark 3.5.7: 155 stages × 1000 tasks × 100 accumulators/task = 15.5M
> objects (~1.5GB)
> Accumulator count per task increased 4x due to new metrics added between
> versions:
> - SPARK-36620 (3.3.0): Push-based shuffle metrics
> - SPARK-40711 (3.4.0): Window spill metrics
> - SPARK-43214 (3.5.0): Driver-side metrics
> SQLAppStatusListener collects all these metrics in driver memory, causing OOM
> for
> queries with many stages and tasks.
> ================================================================================
> SOLUTION:
> Add new static configuration: spark.sql.ui.appStatusListener.enabled Controls
> whether SQLAppStatusListener is loaded.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]