Huan Zheng created SPARK-55353:
----------------------------------

             Summary:  Driver OOM with complex SQL queries: Add config to 
disable SQLAppStatusListener
                 Key: SPARK-55353
                 URL: https://issues.apache.org/jira/browse/SPARK-55353
             Project: Spark
          Issue Type: Improvement
          Components: SQL, UI
    Affects Versions: 3.5.7
            Reporter: Huan Zheng


## Problem Description

  We encountered a severe driver OOM issue when running complex SQL queries on 
Spark 3.5.7, which represents a significant memory regression compared to Spark 
3.2.2.

  ### Regression Details

  - **Spark 3.2.2**: Same SQL query runs successfully with **2GB driver memory**
  - **Spark 3.5.7**: Same SQL query fails with OOM even with **8GB driver 
memory**
  - **Memory Regression**: **4x increase** in driver memory requirements

  ### Environment

  - Spark Version: 3.5.7 (with patches: SPARK-50229, SPARK-45439)
  - Query Characteristics:
    - 155 stages
    - ~1000 tasks per stage
    - Complex feature engineering with multiple UDFs
    - 11 LEFT JOINs
    - 100+ output columns

  JIRA Issue Body - Part 2: Root Cause Analysis

  ## Root Cause Analysis

  ### Heap Dump Analysis

  Driver heap dump analysis revealed:
  - **15.6million** `AccumulatorMetadata` objects
  - **~1.5GB** memory consumed by accumulator metadata alone
  - Primary memory holder: `SQLAppStatusListener`

  ### Memory Calculation

  **Spark 3.2.2 baseline:**
  - 200 stages × 1000 tasks × ~25 accumulators/task = ~5M accumulators
  - Memory usage: ~500MB (fits in 2GB driver)

  **Spark 3.5.7 regression:**
  - 155 stages × 1000 tasks × ~100 accumulators/task = 15.5M accumulators
  - Memory usage: ~1.5GB (causes OOM even with 8GB driver)

  **Accumulator count per task increased by 4x** from version 3.2 to 3.5.

  JIRA Issue Body - Part 3: Contributing JIRAs

  ### Contributing JIRAs

  The following JIRAs introduced new metrics between Spark 3.2 and 3.5, 
significantly increasing the accumulator count:

  1. **SPARK-36620** (Spark 3.3.0): Added push-based shuffle metrics
     - New metrics: corruptMergedBlockChunks, mergedFetchFallbackCount, 
remoteMergeBlocksFetched, etc.

  2. **SPARK-40711** (Spark 3.4.0): Added window spill metrics
     - New metrics: spillSize, memoryBytesSpilled, diskBytesSpilled for window 
operations

  3. **SPARK-43214** (Spark 3.5.0): Added driver-side metrics
     - New metrics: driverAccumulatorValue for various operations
     - Significantly increased accumulator count for complex queries

  These metrics are collected by `SQLAppStatusListener` and stored in driver 
memory, causing OOM for queries with many stages and tasks.

  JIRA Issue Body - Part 4: Solution

  ## Solution

  ### New Configuration

  Add a new static configuration to control `SQLAppStatusListener` loading:

  **Config Name:** `spark.sql.ui.appStatusListener.enabled`

  **Default Value:** `true` (backward compatible)

  **Description:** Whether to enable SQLAppStatusListener for collecting SQL 
execution metrics. Disabling this can significantly reduce driver memory usage 
for complex queries with many stages and tasks, but will result in limited SQL 
execution information in the UI.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to