https://issues.apache.org/jira/browse/SPARK-50992?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18048733#comment-18048733

Angel Alvarez Pascua commented on SPARK-50992:
----------------------------------------------

In Spark 4.0, OutOfMemory errors in large query plans are prevented by 
introducing lazy initialization for {{TreeNode}} metadata such as {{tags}} and 
{{ineffectiveRules}}, allocating them only when accessed. Additionally, the 
memory-leaking {{Stream}} collections were replaced with {{LazyList}}, and 
{{IdentityHashMap}} is now used for child-node lookups, avoiding expensive 
recursive {{hashCode}} computations. Together, these changes significantly 
reduce the driver's memory footprint, allowing Spark to handle complex plans 
that crashed earlier versions.
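The mechanisms described above can be sketched in plain Scala (illustrative names only; this is not Spark's actual {{TreeNode}} API):

```scala
import java.util.IdentityHashMap

// Hypothetical minimal model of a plan node, illustrating the Spark 4.0
// changes: lazily allocated metadata and identity-based child lookups.
final class Node(val children: Seq[Node] = Nil) {
  // Lazy initialization: the tags map is only allocated on first access,
  // so nodes that never carry tags cost no extra memory.
  private var _tags: java.util.HashMap[String, Any] = null
  def tags: java.util.HashMap[String, Any] = {
    if (_tags == null) _tags = new java.util.HashMap[String, Any]()
    _tags
  }
  def hasTags: Boolean = _tags != null

  // IdentityHashMap keys on reference identity, so finding a child's index
  // never calls the (potentially deeply recursive) hashCode of a subtree.
  lazy val childIndex: IdentityHashMap[Node, Integer] = {
    val m = new IdentityHashMap[Node, Integer]()
    children.zipWithIndex.foreach { case (c, i) => m.put(c, i) }
    m
  }
}

val leafA = new Node()
val leafB = new Node()
val root  = new Node(Seq(leafA, leafB))

println(root.hasTags)               // false: no tag map allocated yet
root.tags.put("planId", 1)
println(root.hasTags)               // true: allocated on first access

println(root.childIndex.get(leafB)) // 1, via reference-identity lookup

// LazyList, unlike the deprecated Stream, does not retain its head once
// it is no longer referenced, avoiding Stream's memory-leak behavior.
val evens = LazyList.from(0).filter(_ % 2 == 0)
println(evens.take(3).toList)       // List(0, 2, 4)
```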

However, in Spark 3.5.7 I'm still getting the same OOM.

> OOMs and performance issues with AQE in large plans
> ---------------------------------------------------
>
>                 Key: SPARK-50992
>                 URL: https://issues.apache.org/jira/browse/SPARK-50992
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 3.5.3, 3.5.4, 4.0.0
>            Reporter: Ángel Álvarez Pascua
>            Priority: Major
>              Labels: pull-request-available
>         Attachments: Main.scala
>
>
> When AQE is enabled, Spark triggers update events to the internal listener 
> bus whenever a plan changes. These events include a plain-text description of 
> the plan, which is computationally expensive to generate for large plans.
> *Key Issues:*
> *1. High Cost of Plan String Calculation:*
>  * Generating the string description for large physical plans is a costly 
> operation.
>  * This impacts performance, particularly in complex workflows with frequent 
> plan updates (e.g. persisting DataFrames).
> *2. Out-of-Memory (OOM) Errors:*
>  * Events are stored in the listener bus as {{SQLExecutionUIData}} objects 
> and retained until a threshold is reached.
>  * This retention behavior can lead to memory exhaustion when processing 
> large plans, causing OOM errors.
>  
> *Current Workarounds Are Ineffective:*
>  * *Reducing Retained Executions* ({{spark.sql.ui.retainedExecutions}}): 
> Even when set to {{1}} or {{0}}, events are still created, requiring plan 
> string calculations.
>  * *Limiting Plan String Length* ({{spark.sql.maxPlanStringLength}}): 
> Reducing the maximum string length (e.g., to {{1,000,000}}) may mitigate 
> OOMs but does not eliminate the overhead of string generation.
>  * *Available Explain Modes:* All existing explain modes are verbose and 
> computationally expensive, failing to resolve these issues.
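> For reference, the workarounds above correspond to settings like the 
> following (values illustrative):
>
> ```properties
> # Retain almost no executions in the UI store; events are still built.
> spark.sql.ui.retainedExecutions=1
> # Cap the plan string length; reduces, but does not remove, generation cost.
> spark.sql.maxPlanStringLength=1000000
> ```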
>  
> *Proposed Solution:*
> Introduce a new explain mode, *{{off}}*, which suppresses the generation 
> of plan string descriptions.
>  * When this mode is enabled, Spark skips the calculation of plan 
> descriptions altogether.
>  * This resolves OOM errors and restores performance parity with non-AQE 
> execution.
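> Assuming the new mode plugs into the existing {{spark.sql.ui.explainMode}} 
> configuration (an assumption; the final config surface is up to the PR), 
> usage would look like:
>
> ```scala
> // Hypothetical: suppress UI plan-string generation entirely.
> spark.conf.set("spark.sql.ui.explainMode", "off")
> ```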
>  
> *Impact of Proposed Solution:*
>  * Eliminates OOMs in large plans with AQE enabled.
>  * Reduces the performance overhead associated with plan string generation.
>  * Ensures Spark scales better in environments with large, complex plans.
>  
> *Reproducibility:*
> A test reproducing the issue has been attached.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
