[ https://issues.apache.org/jira/browse/SPARK-50992?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Ángel Álvarez Pascua updated SPARK-50992:
-----------------------------------------
Description:
When Adaptive Query Execution (AQE) is enabled, Spark posts an update event to
the internal listener bus whenever a plan changes. Each event includes a
plain-text description of the plan, which is expensive to generate for large
plans.
*Key Issues:*
*1. High Cost of Plan String Calculation:*
* Generating the string description for large physical plans is a costly
operation.
* This impacts performance, particularly in complex workflows with frequent
plan updates (e.g. persisting DataFrames).
*2. Out-of-Memory (OOM) Errors:*
* Events are stored in the listener bus as {{SQLExecutionUIData}} objects and
retained until a threshold is reached.
* This retention behavior can lead to memory exhaustion when processing large
plans, causing OOM errors.
*Current Workarounds Are Ineffective:*
* *Reducing Retained Executions* ({{spark.sql.ui.retainedExecutions}}):
Even when set to {{1}} or {{0}}, events are still created, requiring plan
string calculations.
* *Limiting Plan String Length* ({{spark.sql.maxPlanStringLength}}):
Reducing the maximum string length (e.g., to {{1,000,000}}) may mitigate
OOMs but does not eliminate the overhead of string generation.
* *Available Explain Modes:* All existing explain modes are verbose and
computationally expensive, failing to resolve these issues.
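
The reasoning behind the first two workarounds can be illustrated with a toy
model. This is a minimal sketch in plain Python, not Spark code: {{Node}},
{{build_chain}}, {{describe}} and {{run}} are hypothetical names, and
{{retained}} and {{max_len}} only loosely stand in for the two settings above.
It assumes the description is built before retention is applied and that a
length cap stops appending text without stopping the tree walk.

```python
from collections import deque


class Node:
    """Toy plan node (hypothetical, not a Spark class)."""

    def __init__(self, name, children=()):
        self.name = name
        self.children = list(children)


def build_chain(depth):
    """Build a linear toy plan containing exactly `depth` nodes."""
    node = Node("Scan")
    for i in range(depth - 1):
        node = Node(f"Project#{i}", [node])
    return node


def describe(plan, max_len, stats):
    """Walk every node and build a description capped at max_len characters."""
    parts, size = [], 0
    stack = [(plan, 0)]
    while stack:
        node, depth = stack.pop()
        stats["visited"] += 1  # traversal cost is paid even once the cap is hit
        line = "  " * depth + node.name + "\n"
        if size < max_len:
            line = line[: max_len - size]
            parts.append(line)
            size += len(line)
        stack.extend((child, depth + 1) for child in reversed(node.children))
    return "".join(parts)


def run(updates, depth, retained, max_len):
    """Simulate `updates` plan changes; return (nodes visited, events kept, longest text)."""
    stats = {"visited": 0}
    store = deque(maxlen=retained)  # retention only trims after the event exists
    plan = build_chain(depth)
    for _ in range(updates):
        store.append(describe(plan, max_len, stats))
    longest = max((len(text) for text in store), default=0)
    return stats["visited"], len(store), longest


if __name__ == "__main__":
    print(run(updates=50, depth=200, retained=1, max_len=100))
    print(run(updates=50, depth=200, retained=0, max_len=100))
```

In this model, lowering {{retained}} or {{max_len}} shrinks what is kept in
memory, but the number of nodes visited stays the same, which is the overhead
the workarounds leave in place.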
*Proposed Solution:*
Introduce a new explain mode, *{{off}}*, which suppresses the generation of
plan string descriptions.
* When this mode is enabled, Spark skips the calculation of plan descriptions
altogether.
* This resolves OOM errors and restores performance parity with non-AQE
execution.
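
The intended behavior of the proposed mode can be sketched as follows. This is
a plain Python illustration under stated assumptions, not the actual change:
{{event_description}} and {{expensive_plan_text}} are hypothetical names, and
returning an empty string in {{off}} mode is an assumption about what
"suppressed" would look like.

```python
calls = {"n": 0}


def expensive_plan_text():
    """Stand-in for rendering a large plan; counts how often it is invoked."""
    calls["n"] += 1
    return "\n".join(f"Node {i}" for i in range(100_000))


def event_description(mode, render):
    """Return the text attached to an update event for the given explain mode."""
    if mode == "off":
        return ""  # assumed behavior: the plan is never rendered
    return render()  # every other mode pays the full rendering cost


if __name__ == "__main__":
    text = event_description("off", expensive_plan_text)
    print(len(text), calls["n"])
    text = event_description("formatted", expensive_plan_text)
    print(len(text) > 0, calls["n"])
```

The point of the sketch is that the renderer is passed in and only called when
needed, so in {{off}} mode neither the CPU cost nor the retained string exists.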
*Impact of Proposed Solution:*
* Eliminates OOMs in large plans with AQE enabled.
* Reduces the performance overhead associated with plan string generation.
* Ensures Spark scales better in environments with large, complex plans.
*Reproducibility:*
A test that reproduces the issue is attached ({{Main.scala}}).
> OOMs and performance issues with AQE in large plans
> ---------------------------------------------------
>
> Key: SPARK-50992
> URL: https://issues.apache.org/jira/browse/SPARK-50992
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 4.0.0, 3.5.3, 3.5.4
> Reporter: Ángel Álvarez Pascua
> Priority: Major
> Attachments: Main.scala
>