[jira] [Updated] (SPARK-50992) OOMs and performance issues with AQE in large plans

Jira Sat, 25 Jan 2025 23:45:01 -0800


     [ 
https://issues.apache.org/jira/browse/SPARK-50992?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Ángel Álvarez Pascua updated SPARK-50992:
-----------------------------------------
    Description: 
When AQE is enabled, Spark posts an update event to the internal listener bus 
whenever the plan changes. Each event includes a plain-text description of the 
plan, which is computationally expensive to generate for large plans.

*Key Issues:*

*1. High Cost of Plan String Calculation:*
 * Generating the string description for large physical plans is a costly 
operation.
 * This impacts performance, particularly in complex workflows with frequent 
plan updates (e.g. persisting DataFrames).

*2. Out-of-Memory (OOM) Errors:*
 * Events are stored in the listener bus as {{SQLExecutionUIData}} objects and 
retained until a threshold is reached.
 * This retention behavior can lead to memory exhaustion when processing large 
plans, causing OOM errors.
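
As a rough illustration of the two issues above, the following toy model (plain Python, no Spark involved) renders a plan tree to text on every update and keeps the resulting strings up to a threshold. PlanNode, describe, build_tree and ToyListener are hypothetical names invented for this sketch; they are not Spark classes, and the real listener and {{SQLExecutionUIData}} bookkeeping is more involved.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class PlanNode:
    """Toy stand-in for a physical plan node (hypothetical, not a Spark class)."""
    name: str
    children: List["PlanNode"] = field(default_factory=list)


def describe(node: PlanNode, depth: int = 0) -> str:
    """Render the whole tree as indented text; the work grows with plan size."""
    lines = ["  " * depth + node.name]
    for child in node.children:
        lines.append(describe(child, depth + 1))
    return "\n".join(lines)


def build_tree(depth: int) -> PlanNode:
    """Build a full binary tree of joins over scans with the given depth."""
    if depth == 0:
        return PlanNode("Scan")
    return PlanNode("Join", [build_tree(depth - 1), build_tree(depth - 1)])


class ToyListener:
    """Keeps one description per plan update, up to a retention threshold."""

    def __init__(self, retained: int) -> None:
        self.retained = retained
        self.events: List[str] = []

    def on_plan_update(self, plan: PlanNode) -> None:
        # The string is rendered on every update, before any eviction happens.
        self.events.append(describe(plan))
        if len(self.events) > self.retained:
            self.events.pop(0)

    def retained_chars(self) -> int:
        """Total characters currently held by retained descriptions."""
        return sum(len(event) for event in self.events)


plan = build_tree(depth=10)
listener = ToyListener(retained=100)
for _ in range(20):  # 20 simulated plan updates
    listener.on_plan_update(plan)
print(len(listener.events), listener.retained_chars())
```

In this model the description is built before the retention threshold is applied, so lowering the threshold reduces the memory held but not the rendering work.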

*Current Workarounds Are Ineffective:*
 * *Reducing Retained Executions* ({{spark.sql.ui.retainedExecutions}}): 
Even when set to {{1}} or {{0}}, events are still created, requiring plan 
string calculations.
 * *Limiting Plan String Length* ({{spark.sql.maxPlanStringLength}}): 
Reducing the maximum string length (e.g., to {{1,000,000}}) may mitigate 
OOMs but does not eliminate the overhead of string generation.
 * *Available Explain Modes:* All existing explain modes are verbose and 
computationally expensive, failing to resolve these issues.
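
For reference, the settings listed above could be passed to spark-submit roughly as follows. This is only a sketch: the values are illustrative, WORKAROUND_CONF and to_submit_args are hypothetical helpers, and it assumes the explain mode in question is the one selected through {{spark.sql.ui.explainMode}}. Note that the length limit has to be written without thousands separators.

```python
from typing import Dict, List

# Settings named in the workaround list; the values are examples, not recommendations.
WORKAROUND_CONF: Dict[str, str] = {
    "spark.sql.adaptive.enabled": "true",
    "spark.sql.ui.retainedExecutions": "1",
    "spark.sql.maxPlanStringLength": "1000000",  # no commas in the actual value
    "spark.sql.ui.explainMode": "simple",  # assumed to be the relevant mode setting
}


def to_submit_args(conf: Dict[str, str]) -> List[str]:
    """Turn a config mapping into a flat list of --conf key=value arguments."""
    args: List[str] = []
    for key, value in conf.items():
        args += ["--conf", f"{key}={value}"]
    return args


print(" ".join(to_submit_args(WORKAROUND_CONF)))
```

None of these settings prevents the plan string from being generated; they only change how much of it is kept or how it is formatted.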

*Proposed Solution:*
Introduce a new explain mode, *{{off}}*, which suppresses the generation of 
plan string descriptions.
 * When this mode is enabled, Spark skips the calculation of plan descriptions 
altogether.
 * This resolves OOM errors and restores performance parity with non-AQE 
execution.
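
A minimal sketch of the idea, assuming the mode check happens before any rendering work is done. plan_description and CountingRenderer are hypothetical names used only for illustration and do not reflect how Spark would implement the mode.

```python
from typing import Callable


def plan_description(mode: str, render: Callable[[], str]) -> str:
    """Return the text for an update event; skip rendering when mode is 'off'."""
    if mode.lower() == "off":
        return ""  # render() is never called, so no plan string is built
    return render()


class CountingRenderer:
    """Wraps a fixed plan text and counts how often it is rendered."""

    def __init__(self, text: str) -> None:
        self.text = text
        self.calls = 0

    def __call__(self) -> str:
        self.calls += 1
        return self.text


renderer = CountingRenderer("Join\n  Scan\n  Scan")
plan_description("formatted", renderer)  # renders once
plan_description("off", renderer)        # skips rendering
print(renderer.calls)  # prints 1
```

Passing the renderer as a callable rather than a precomputed string is what lets the {{off}} branch avoid the cost entirely.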

*Impact of Proposed Solution:*
 * Eliminates OOMs in large plans with AQE enabled.
 * Reduces the performance overhead associated with plan string generation.
 * Ensures Spark scales better in environments with large, complex plans.

*Reproducibility:*
A test that reproduces the issue has been attached ({{Main.scala}}).

  was:
When AQE is enabled, Spark triggers update events to the internal listener bus 
whenever a plan changes. These events include a plain-text description of the 
plan, which is computationally expensive to generate for large plans.

*Key Issues:*
 # *High Cost of Plan String Calculation:*

 * 
 ** Generating the string description for large physical plans is a costly 
operation.
 ** This impacts performance, particularly in complex workflows with frequent 
plan updates (e.g. persisting DataFrames).

 # *Out-of-Memory (OOM) Errors:*

 * 
 ** Events are stored in the listener bus as {{SQLExecutionUIData}} objects and 
retained until a threshold is reached.
 ** This retention behavior can lead to memory exhaustion when processing large 
plans, causing OOM errors.

 # *Current Workarounds Are Ineffective:*

 * 
 ** *Reducing Retained Executions* ({{{}spark.sql.ui.retainedExecutions{}}}): 
Even when set to {{1}} or {{{}0{}}}, events are still created, requiring plan 
string calculations.
 ** *Limiting Plan String Length* ({{{}spark.sql.maxPlanStringLength{}}}): 
Reducing the maximum string length (e.g., to {{{}1,000,000{}}}) may mitigate 
OOMs but does not eliminate the overhead of string generation.
 ** *Available Explain Modes:* All existing explain modes are verbose and 
computationally expensive, failing to resolve these issues.

*Proposed Solution:*
Introduce a new explain mode, {*}{{off}}{*}, which suppresses the generation of 
plan string descriptions.
 * When this mode is enabled, Spark skips the calculation of plan descriptions 
altogether.
 * This resolves OOM errors and restores performance parity with non-AQE 
execution.

*Impact of Proposed Solution:*
 * Eliminates OOMs in large plans with AQE enabled.
 * Reduces the performance overhead associated with plan string generation.
 * Ensures Spark scales better in environments with large, complex plans.

*Reproducibility:*
The following test replicates the issue has been attached.


> OOMs and performance issues with AQE in large plans
> ---------------------------------------------------
>
>                 Key: SPARK-50992
>                 URL: https://issues.apache.org/jira/browse/SPARK-50992
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 4.0.0, 3.5.3, 3.5.4
>            Reporter: Ángel Álvarez Pascua
>            Priority: Major
>         Attachments: Main.scala
>



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
