[ 
https://issues.apache.org/jira/browse/SPARK-55353?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Huan Zheng updated SPARK-55353:
-------------------------------
    Description: 
PROBLEM:

Severe driver OOM regression from Spark 3.2.2 to 3.5.7 for complex SQL queries.

- Spark 3.2.2: Same query runs with 2GB driver memory
- Spark 3.5.7: Same query OOMs with 8GB driver memory
- Memory regression: 4x increase in driver memory requirements

Query characteristics:
- ~200 stages, ~1000 tasks per stage
- Complex feature engineering with multiple UDFs
- 12 LEFT JOINs with subqueries
- 100+ output columns

================================================================================

ROOT CAUSE:

2G Driver Heap dump analysis shows 15.6M AccumulatorMetadata objects consuming 
~1.5GB memory,
held by SQLAppStatusListener.

Memory calculation:
- Spark 3.2.2: 200 stages × 1000 tasks × 25 accumulators/task = 5M objects 
(~500MB)
- Spark 3.5.7: 155 stages × 1000 tasks × 100 accumulators/task = 15.5M objects 
(~1.5GB)

Accumulator count per task increased 4x due to new metrics added between 
versions:
- SPARK-36620 (3.3.0): Push-based shuffle metrics
- SPARK-40711 (3.4.0): Window spill metrics
- SPARK-43214 (3.5.0): Driver-side metrics

SQLAppStatusListener collects all these metrics in driver memory, causing OOM 
for
queries with many stages and tasks.

================================================================================

SOLUTION:

Add new static configuration: spark.sql.ui.appStatusListener.enabled Controls 
whether SQLAppStatusListener is loaded.

  was:
  We encountered a severe driver OOM issue when running complex SQL queries on 
Spark 3.5.7, which represents a significant memory regression compared to Spark 
3.2.2.

  ### Regression Details

  - *{*}Spark 3.2.2{*}{*}: Same SQL query runs successfully with **2GB driver 
memory{*}*
  - *{*}Spark 3.5.7{*}{*}: Same SQL query fails with OOM even with **8GB driver 
memory{*}*
  - *{*}Memory Regression{*}{*}: **4x increase{*}* in driver memory requirements

  ### Environment

  - Spark Version: 3.5.7 (with patches: SPARK-50229, SPARK-45439)
  - Query Characteristics:
    - 155 stages
    - ~1000 tasks per stage
    - Complex feature engineering with multiple UDFs
    - 11 LEFT JOINs
    - 100+ output columns

  ## Root Cause Analysis

  ### Heap Dump Analysis

  Driver heap dump analysis revealed:
  - *{*}15.6million{*}* `AccumulatorMetadata` objects
  - *{*}~1.5GB{*}* memory consumed by accumulator metadata alone
  - Primary memory holder: `SQLAppStatusListener`

  ### Memory Calculation

  *{*}Spark 3.2.2 baseline:{*}*
  - 200 stages × 1000 tasks × ~25 accumulators/task = ~5M accumulators
  - Memory usage: ~500MB (fits in 2GB driver)

  *{*}Spark 3.5.7 regression:{*}*
  - 200 stages × 1000 tasks × ~100 accumulators/task = 15.5M accumulators
  - Memory usage: ~1.5GB (causes OOM even with 8GB driver)

  *{*}Accumulator count per task increased by 4x{*}* from version 3.2 to 3.5.

  ### Contributing JIRAs

  The following JIRAs introduced new metrics between Spark 3.2 and 3.5, 
significantly increasing the accumulator count:

  1. *{*}SPARK-36620{*}* (Spark 3.3.0): Added push-based shuffle metrics
     - New metrics: corruptMergedBlockChunks, mergedFetchFallbackCount, 
remoteMergeBlocksFetched, etc.

  2. *{*}SPARK-40711{*}* (Spark 3.4.0): Added window spill metrics
     - New metrics: spillSize, memoryBytesSpilled, diskBytesSpilled for window 
operations

  3. *{*}SPARK-43214{*}* (Spark 3.5.0): Added driver-side metrics
     - New metrics: driverAccumulatorValue for various operations
     - Significantly increased accumulator count for complex queries

  These metrics are collected by `SQLAppStatusListener` and stored in driver 
memory, causing OOM for queries with many stages and tasks.

  ## Solution

  ### New Configuration

  Add a new static configuration to control `SQLAppStatusListener` loading:

  *{*}Config Name:{*}* `spark.sql.ui.appStatusListener.enabled`

  *{*}Default Value:{*}* `true` (backward compatible)

  *{*}Description:{*}* Whether to enable SQLAppStatusListener for collecting 
SQL execution metrics. Disabling this can significantly reduce driver memory 
usage for complex queries with many stages and tasks, but will result in 
limited SQL execution information in the UI.


>  Driver OOM with complex SQL queries: Add config to disable 
> SQLAppStatusListener
> --------------------------------------------------------------------------------
>
>                 Key: SPARK-55353
>                 URL: https://issues.apache.org/jira/browse/SPARK-55353
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL, UI
>    Affects Versions: 3.5.7
>            Reporter: Huan Zheng
>            Priority: Minor
>         Attachments: desensitized_sql_example.txt
>
>
> PROBLEM:
> Severe driver OOM regression from Spark 3.2.2 to 3.5.7 for complex SQL 
> queries.
> - Spark 3.2.2: Same query runs with 2GB driver memory
> - Spark 3.5.7: Same query OOMs with 8GB driver memory
> - Memory regression: 4x increase in driver memory requirements
> Query characteristics:
> - ~200 stages, ~1000 tasks per stage
> - Complex feature engineering with multiple UDFs
> - 12 LEFT JOINs with subqueries
> - 100+ output columns
> ================================================================================
> ROOT CAUSE:
> 2G Driver Heap dump analysis shows 15.6M AccumulatorMetadata objects 
> consuming ~1.5GB memory,
> held by SQLAppStatusListener.
> Memory calculation:
> - Spark 3.2.2: 200 stages × 1000 tasks × 25 accumulators/task = 5M objects 
> (~500MB)
> - Spark 3.5.7: 155 stages × 1000 tasks × 100 accumulators/task = 15.5M 
> objects (~1.5GB)
> Accumulator count per task increased 4x due to new metrics added between 
> versions:
> - SPARK-36620 (3.3.0): Push-based shuffle metrics
> - SPARK-40711 (3.4.0): Window spill metrics
> - SPARK-43214 (3.5.0): Driver-side metrics
> SQLAppStatusListener collects all these metrics in driver memory, causing OOM 
> for
> queries with many stages and tasks.
> ================================================================================
> SOLUTION:
> Add new static configuration: spark.sql.ui.appStatusListener.enabled Controls 
> whether SQLAppStatusListener is loaded.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to