[ 
https://issues.apache.org/jira/browse/SPARK-55353?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Huan Zheng updated SPARK-55353:
-------------------------------
    Attachment: desensitized_sql_example.txt

>  Driver OOM with complex SQL queries: Add config to disable 
> SQLAppStatusListener
> --------------------------------------------------------------------------------
>
>                 Key: SPARK-55353
>                 URL: https://issues.apache.org/jira/browse/SPARK-55353
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL, UI
>    Affects Versions: 3.5.7
>            Reporter: Huan Zheng
>            Priority: Minor
>         Attachments: desensitized_sql_example.txt
>
>
>   We encountered a severe driver OOM issue when running complex SQL queries 
> on Spark 3.5.7, which represents a significant memory regression compared to 
> Spark 3.2.2.
>   ### Regression Details
>   - *{*}Spark 3.2.2{*}{*}: Same SQL query runs successfully with **2GB driver 
> memory{*}*
>   - *{*}Spark 3.5.7{*}{*}: Same SQL query fails with OOM even with **8GB 
> driver memory{*}*
>   - *{*}Memory Regression{*}{*}: **4x increase{*}* in driver memory 
> requirements
>   ### Environment
>   - Spark Version: 3.5.7 (with patches: SPARK-50229, SPARK-45439)
>   - Query Characteristics:
>     - 155 stages
>     - ~1000 tasks per stage
>     - Complex feature engineering with multiple UDFs
>     - 11 LEFT JOINs
>     - 100+ output columns
>   ## Root Cause Analysis
>   ### Heap Dump Analysis
>   Driver heap dump analysis revealed:
>   - *{*}15.6million{*}* `AccumulatorMetadata` objects
>   - *{*}~1.5GB{*}* memory consumed by accumulator metadata alone
>   - Primary memory holder: `SQLAppStatusListener`
>   ### Memory Calculation
>   *{*}Spark 3.2.2 baseline:{*}*
>   - 200 stages × 1000 tasks × ~25 accumulators/task = ~5M accumulators
>   - Memory usage: ~500MB (fits in 2GB driver)
>   *{*}Spark 3.5.7 regression:{*}*
>   - 200 stages × 1000 tasks × ~100 accumulators/task = 15.5M accumulators
>   - Memory usage: ~1.5GB (causes OOM even with 8GB driver)
>   *{*}Accumulator count per task increased by 4x{*}* from version 3.2 to 3.5.
>   ### Contributing JIRAs
>   The following JIRAs introduced new metrics between Spark 3.2 and 3.5, 
> significantly increasing the accumulator count:
>   1. *{*}SPARK-36620{*}* (Spark 3.3.0): Added push-based shuffle metrics
>      - New metrics: corruptMergedBlockChunks, mergedFetchFallbackCount, 
> remoteMergeBlocksFetched, etc.
>   2. *{*}SPARK-40711{*}* (Spark 3.4.0): Added window spill metrics
>      - New metrics: spillSize, memoryBytesSpilled, diskBytesSpilled for 
> window operations
>   3. *{*}SPARK-43214{*}* (Spark 3.5.0): Added driver-side metrics
>      - New metrics: driverAccumulatorValue for various operations
>      - Significantly increased accumulator count for complex queries
>   These metrics are collected by `SQLAppStatusListener` and stored in driver 
> memory, causing OOM for queries with many stages and tasks.
>   ## Solution
>   ### New Configuration
>   Add a new static configuration to control `SQLAppStatusListener` loading:
>   *{*}Config Name:{*}* `spark.sql.ui.appStatusListener.enabled`
>   *{*}Default Value:{*}* `true` (backward compatible)
>   *{*}Description:{*}* Whether to enable SQLAppStatusListener for collecting 
> SQL execution metrics. Disabling this can significantly reduce driver memory 
> usage for complex queries with many stages and tasks, but will result in 
> limited SQL execution information in the UI.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to