[ 
https://issues.apache.org/jira/browse/SPARK-55353?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Huan Zheng updated SPARK-55353:
-------------------------------
    Description: 
  We encountered a severe driver OOM issue when running complex SQL queries on 
Spark 3.5.7, which represents a significant memory regression compared to Spark 
3.2.2.

  ### Regression Details

  - *{*}Spark 3.2.2{*}{*}: Same SQL query runs successfully with **2GB driver 
memory{*}*
  - *{*}Spark 3.5.7{*}{*}: Same SQL query fails with OOM even with **8GB driver 
memory{*}*
  - *{*}Memory Regression{*}{*}: **4x increase{*}* in driver memory requirements

  ### Environment

  - Spark Version: 3.5.7 (with patches: SPARK-50229, SPARK-45439)
  - Query Characteristics:
    - 155 stages
    - ~1000 tasks per stage
    - Complex feature engineering with multiple UDFs
    - 11 LEFT JOINs
    - 100+ output columns

  ## Root Cause Analysis

  ### Heap Dump Analysis

  Driver heap dump analysis revealed:
  - *{*}15.6million{*}* `AccumulatorMetadata` objects
  - *{*}~1.5GB{*}* memory consumed by accumulator metadata alone
  - Primary memory holder: `SQLAppStatusListener`

  ### Memory Calculation

  *{*}Spark 3.2.2 baseline:{*}*
  - 200 stages × 1000 tasks × ~25 accumulators/task = ~5M accumulators
  - Memory usage: ~500MB (fits in 2GB driver)

  *{*}Spark 3.5.7 regression:{*}*
  - 200 stages × 1000 tasks × ~100 accumulators/task = 15.5M accumulators
  - Memory usage: ~1.5GB (causes OOM even with 8GB driver)

  *{*}Accumulator count per task increased by 4x{*}* from version 3.2 to 3.5.

  ### Contributing JIRAs

  The following JIRAs introduced new metrics between Spark 3.2 and 3.5, 
significantly increasing the accumulator count:

  1. *{*}SPARK-36620{*}* (Spark 3.3.0): Added push-based shuffle metrics
     - New metrics: corruptMergedBlockChunks, mergedFetchFallbackCount, 
remoteMergeBlocksFetched, etc.

  2. *{*}SPARK-40711{*}* (Spark 3.4.0): Added window spill metrics
     - New metrics: spillSize, memoryBytesSpilled, diskBytesSpilled for window 
operations

  3. *{*}SPARK-43214{*}* (Spark 3.5.0): Added driver-side metrics
     - New metrics: driverAccumulatorValue for various operations
     - Significantly increased accumulator count for complex queries

  These metrics are collected by `SQLAppStatusListener` and stored in driver 
memory, causing OOM for queries with many stages and tasks.

  ## Solution

  ### New Configuration

  Add a new static configuration to control `SQLAppStatusListener` loading:

  *{*}Config Name:{*}* `spark.sql.ui.appStatusListener.enabled`

  *{*}Default Value:{*}* `true` (backward compatible)

  *{*}Description:{*}* Whether to enable SQLAppStatusListener for collecting 
SQL execution metrics. Disabling this can significantly reduce driver memory 
usage for complex queries with many stages and tasks, but will result in 
limited SQL execution information in the UI.

  was:
## Problem Description

  We encountered a severe driver OOM issue when running complex SQL queries on 
Spark 3.5.7, which represents a significant memory regression compared to Spark 
3.2.2.

  ### Regression Details

  - **Spark 3.2.2**: Same SQL query runs successfully with **2GB driver memory**
  - **Spark 3.5.7**: Same SQL query fails with OOM even with **8GB driver 
memory**
  - **Memory Regression**: **4x increase** in driver memory requirements

  ### Environment

  - Spark Version: 3.5.7 (with patches: SPARK-50229, SPARK-45439)
  - Query Characteristics:
    - 155 stages
    - ~1000 tasks per stage
    - Complex feature engineering with multiple UDFs
    - 11 LEFT JOINs
    - 100+ output columns

  JIRA Issue Body - Part 2: Root Cause Analysis

  ## Root Cause Analysis

  ### Heap Dump Analysis

  Driver heap dump analysis revealed:
  - **15.6million** `AccumulatorMetadata` objects
  - **~1.5GB** memory consumed by accumulator metadata alone
  - Primary memory holder: `SQLAppStatusListener`

  ### Memory Calculation

  **Spark 3.2.2 baseline:**
  - 200 stages × 1000 tasks × ~25 accumulators/task = ~5M accumulators
  - Memory usage: ~500MB (fits in 2GB driver)

  **Spark 3.5.7 regression:**
  - 155 stages × 1000 tasks × ~100 accumulators/task = 15.5M accumulators
  - Memory usage: ~1.5GB (causes OOM even with 8GB driver)

  **Accumulator count per task increased by 4x** from version 3.2 to 3.5.

  JIRA Issue Body - Part 3: Contributing JIRAs

  ### Contributing JIRAs

  The following JIRAs introduced new metrics between Spark 3.2 and 3.5, 
significantly increasing the accumulator count:

  1. **SPARK-36620** (Spark 3.3.0): Added push-based shuffle metrics
     - New metrics: corruptMergedBlockChunks, mergedFetchFallbackCount, 
remoteMergeBlocksFetched, etc.

  2. **SPARK-40711** (Spark 3.4.0): Added window spill metrics
     - New metrics: spillSize, memoryBytesSpilled, diskBytesSpilled for window 
operations

  3. **SPARK-43214** (Spark 3.5.0): Added driver-side metrics
     - New metrics: driverAccumulatorValue for various operations
     - Significantly increased accumulator count for complex queries

  These metrics are collected by `SQLAppStatusListener` and stored in driver 
memory, causing OOM for queries with many stages and tasks.

  JIRA Issue Body - Part 4: Solution

  ## Solution

  ### New Configuration

  Add a new static configuration to control `SQLAppStatusListener` loading:

  **Config Name:** `spark.sql.ui.appStatusListener.enabled`

  **Default Value:** `true` (backward compatible)

  **Description:** Whether to enable SQLAppStatusListener for collecting SQL 
execution metrics. Disabling this can significantly reduce driver memory usage 
for complex queries with many stages and tasks, but will result in limited SQL 
execution information in the UI.


>  Driver OOM with complex SQL queries: Add config to disable 
> SQLAppStatusListener
> --------------------------------------------------------------------------------
>
>                 Key: SPARK-55353
>                 URL: https://issues.apache.org/jira/browse/SPARK-55353
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL, UI
>    Affects Versions: 3.5.7
>            Reporter: Huan Zheng
>            Priority: Minor
>
>   We encountered a severe driver OOM issue when running complex SQL queries 
> on Spark 3.5.7, which represents a significant memory regression compared to 
> Spark 3.2.2.
>   ### Regression Details
>   - *{*}Spark 3.2.2{*}{*}: Same SQL query runs successfully with **2GB driver 
> memory{*}*
>   - *{*}Spark 3.5.7{*}{*}: Same SQL query fails with OOM even with **8GB 
> driver memory{*}*
>   - *{*}Memory Regression{*}{*}: **4x increase{*}* in driver memory 
> requirements
>   ### Environment
>   - Spark Version: 3.5.7 (with patches: SPARK-50229, SPARK-45439)
>   - Query Characteristics:
>     - 155 stages
>     - ~1000 tasks per stage
>     - Complex feature engineering with multiple UDFs
>     - 11 LEFT JOINs
>     - 100+ output columns
>   ## Root Cause Analysis
>   ### Heap Dump Analysis
>   Driver heap dump analysis revealed:
>   - *{*}15.6million{*}* `AccumulatorMetadata` objects
>   - *{*}~1.5GB{*}* memory consumed by accumulator metadata alone
>   - Primary memory holder: `SQLAppStatusListener`
>   ### Memory Calculation
>   *{*}Spark 3.2.2 baseline:{*}*
>   - 200 stages × 1000 tasks × ~25 accumulators/task = ~5M accumulators
>   - Memory usage: ~500MB (fits in 2GB driver)
>   *{*}Spark 3.5.7 regression:{*}*
>   - 200 stages × 1000 tasks × ~100 accumulators/task = 15.5M accumulators
>   - Memory usage: ~1.5GB (causes OOM even with 8GB driver)
>   *{*}Accumulator count per task increased by 4x{*}* from version 3.2 to 3.5.
>   ### Contributing JIRAs
>   The following JIRAs introduced new metrics between Spark 3.2 and 3.5, 
> significantly increasing the accumulator count:
>   1. *{*}SPARK-36620{*}* (Spark 3.3.0): Added push-based shuffle metrics
>      - New metrics: corruptMergedBlockChunks, mergedFetchFallbackCount, 
> remoteMergeBlocksFetched, etc.
>   2. *{*}SPARK-40711{*}* (Spark 3.4.0): Added window spill metrics
>      - New metrics: spillSize, memoryBytesSpilled, diskBytesSpilled for 
> window operations
>   3. *{*}SPARK-43214{*}* (Spark 3.5.0): Added driver-side metrics
>      - New metrics: driverAccumulatorValue for various operations
>      - Significantly increased accumulator count for complex queries
>   These metrics are collected by `SQLAppStatusListener` and stored in driver 
> memory, causing OOM for queries with many stages and tasks.
>   ## Solution
>   ### New Configuration
>   Add a new static configuration to control `SQLAppStatusListener` loading:
>   *{*}Config Name:{*}* `spark.sql.ui.appStatusListener.enabled`
>   *{*}Default Value:{*}* `true` (backward compatible)
>   *{*}Description:{*}* Whether to enable SQLAppStatusListener for collecting 
> SQL execution metrics. Disabling this can significantly reduce driver memory 
> usage for complex queries with many stages and tasks, but will result in 
> limited SQL execution information in the UI.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to