[
https://issues.apache.org/jira/browse/SPARK-55353?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Huan Zheng updated SPARK-55353:
-------------------------------
Attachment: desensitized_sql_example.txt
> Driver OOM with complex SQL queries: Add config to disable
> SQLAppStatusListener
> --------------------------------------------------------------------------------
>
> Key: SPARK-55353
> URL: https://issues.apache.org/jira/browse/SPARK-55353
> Project: Spark
> Issue Type: Improvement
> Components: SQL, UI
> Affects Versions: 3.5.7
> Reporter: Huan Zheng
> Priority: Minor
> Attachments: desensitized_sql_example.txt
>
>
> We encountered a severe driver OOM issue when running complex SQL queries
> on Spark 3.5.7, which represents a significant memory regression compared to
> Spark 3.2.2.
> ### Regression Details
> - **Spark 3.2.2**: The same SQL query runs successfully with **2GB driver
> memory**
> - **Spark 3.5.7**: The same SQL query fails with OOM even with **8GB
> driver memory**
> - **Memory Regression**: **more than a 4x increase** in driver memory
> requirements (8GB is still not enough)
> ### Environment
> - Spark Version: 3.5.7 (with patches: SPARK-50229, SPARK-45439)
> - Query Characteristics:
>   - 155 stages
>   - ~1000 tasks per stage
>   - Complex feature engineering with multiple UDFs
>   - 11 LEFT JOINs
>   - 100+ output columns
> ## Root Cause Analysis
> ### Heap Dump Analysis
> Driver heap dump analysis revealed:
> - **15.6 million** `AccumulatorMetadata` objects
> - **~1.5GB** of memory consumed by accumulator metadata alone
> - Primary memory holder: `SQLAppStatusListener`
> ### Memory Calculation
> **Spark 3.2.2 baseline:**
> - 200 stages × 1000 tasks × ~25 accumulators/task = ~5M accumulators
> - Memory usage: ~500MB (fits in 2GB driver)
> **Spark 3.5.7 regression:**
> - 155 stages × 1000 tasks × ~100 accumulators/task = ~15.5M accumulators
> - Memory usage: ~1.5GB (causes OOM even with 8GB driver)
> **Accumulator count per task increased by 4x** from version 3.2 to 3.5.
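As a rough check on the figures above, the sketch below recomputes the accumulator counts and converts them to an approximate memory footprint. The ~100 bytes per accumulator is an assumption inferred from the ticket's own numbers (~500MB for ~5M objects), not a measured JVM object size, and the function names are illustrative. The ~15.5M figure matches the 155 stages listed under Environment, so that stage count is used for the 3.5.7 case.

```python
# Back-of-the-envelope estimate of accumulator metadata held on the driver.
# BYTES_PER_ACCUMULATOR is an assumption inferred from the ticket's figures
# (~500 MB for ~5M objects); it is not a measured JVM object size.
BYTES_PER_ACCUMULATOR = 100


def accumulator_count(stages: int, tasks_per_stage: int, accumulators_per_task: int) -> int:
    """Total accumulators tracked for one query."""
    return stages * tasks_per_stage * accumulators_per_task


def estimated_megabytes(count: int, bytes_per_accumulator: int = BYTES_PER_ACCUMULATOR) -> float:
    """Approximate driver memory (decimal MB) for `count` accumulators."""
    return count * bytes_per_accumulator / 1_000_000


baseline = accumulator_count(200, 1000, 25)    # Spark 3.2.2 figures quoted in the ticket
regressed = accumulator_count(155, 1000, 100)  # Spark 3.5.7, 155 stages per Environment

print(f"3.2.2: {baseline:,} accumulators, ~{estimated_megabytes(baseline):.0f} MB")
print(f"3.5.7: {regressed:,} accumulators, ~{estimated_megabytes(regressed):.0f} MB")
```

Under these assumptions the estimate lands close to the ~1.5GB seen in the heap dump, which suggests the per-task accumulator count, rather than the per-object size, drives the growth.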
> ### Contributing JIRAs
> The following JIRAs introduced new metrics between Spark 3.2 and 3.5,
> significantly increasing the accumulator count:
> 1. **SPARK-36620** (Spark 3.3.0): Added push-based shuffle metrics
>    - New metrics: corruptMergedBlockChunks, mergedFetchFallbackCount,
>      remoteMergeBlocksFetched, etc.
> 2. **SPARK-40711** (Spark 3.4.0): Added window spill metrics
>    - New metrics: spillSize, memoryBytesSpilled, diskBytesSpilled for
>      window operations
> 3. **SPARK-43214** (Spark 3.5.0): Added driver-side metrics
>    - New metrics: driverAccumulatorValue for various operations
>    - Significantly increased accumulator count for complex queries
> These metrics are collected by `SQLAppStatusListener` and stored in driver
> memory, causing OOM for queries with many stages and tasks.
> ## Solution
> ### New Configuration
> Add a new static configuration to control `SQLAppStatusListener` loading:
> **Config Name:** `spark.sql.ui.appStatusListener.enabled`
> **Default Value:** `true` (backward compatible)
> **Description:** Whether to enable `SQLAppStatusListener` for collecting
> SQL execution metrics. Disabling this can significantly reduce driver memory
> usage for complex queries with many stages and tasks, but will result in
> limited SQL execution information in the UI.
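To illustrate how the proposed setting might be used, here is a minimal sketch that builds a `spark-submit` command line with the listener disabled. The key `spark.sql.ui.appStatusListener.enabled` is only proposed in this ticket and is not available in released Spark versions. Because the ticket describes it as a static configuration, the sketch assumes it would have to be supplied at launch time. `build_submit_args` and `job.py` are hypothetical names used for illustration; `--conf` and `--driver-memory` are existing `spark-submit` options.

```python
from typing import Dict, List

# Proposed in SPARK-55353; NOT present in released Spark versions.
PROPOSED_KEY = "spark.sql.ui.appStatusListener.enabled"


def build_submit_args(app: str, driver_memory: str, conf: Dict[str, str]) -> List[str]:
    """Build a spark-submit argument list (illustrative helper, not a Spark API)."""
    args = ["spark-submit", "--driver-memory", driver_memory]
    # Emit one --conf key=value pair per setting, in a stable order.
    for key, value in sorted(conf.items()):
        args += ["--conf", f"{key}={value}"]
    args.append(app)
    return args


# job.py is a placeholder application file.
args = build_submit_args("job.py", "2g", {PROPOSED_KEY: "false"})
print(" ".join(args))
```

Leaving the key unset would keep the default of `true`, so existing jobs would behave as before if the proposal were adopted.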