SauronShepherd opened a new pull request, #49724:
URL: https://github.com/apache/spark/pull/49724

   ### What changes were proposed in this pull request?  
   This PR introduces a new explain mode `off` that disables the generation of 
physical plan strings. It also changes how the internal attribute `cachedName` of 
`CachedRDDBuilder` objects is computed.  
   
   ### Why are the changes needed?  
   Whenever a plan changes (which happens frequently once AQE kicks in), the 
physical plan's explain output is regenerated as a plain string. For large plans 
this is highly expensive. Moreover, these strings are retained by the 
`ListenerBus` of the `SparkContext`, consuming heap memory and potentially 
leading to `OutOfMemoryError`s.  
   
   Due to its potential negative impact on Spark applications, this information 
should be available only on demand, for debugging purposes. This PR introduces a 
new explain mode `off`, which becomes the default, so that these strings are not 
generated unnecessarily. However, explicitly requesting a DataFrame's explain 
output (e.g. via `df.explain()`) still works even when this mode is active.
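   
   For users who still want plan descriptions in the UI, a sketch of how the 
configuration could be restored (the value `formatted` is assumed to remain one 
of the accepted explain modes; only `off` is new in this PR):

   ```properties
   # spark-defaults.conf (sketch): override the new default of "off"
   spark.sql.ui.explainMode  formatted
   ```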
   
   Additionally, when a `CachedRDDBuilder` object is created without a defined 
`tableName`, the full string representation of the plan is computed, only to be 
truncated to its first 1024 characters. This expensive operation has been 
replaced with a cheaper call to `simpleStringWithNodeId`, avoiding the 
unnecessary computation.  
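   
   The cost argument can be illustrated outside of Spark with a toy tree renderer 
(a minimal sketch; the node structure, function names, and the 1024-character 
budget mirror the PR's reasoning but are otherwise hypothetical, not Spark code):

   ```python
   # Sketch: rendering a whole tree only to keep a 1024-char prefix is wasteful;
   # a bounded renderer produces the same prefix while visiting far fewer nodes.
   LIMIT = 1024

   def full_render(node, depth=0, lines=None):
       """Render the entire tree to one string (the expensive path)."""
       if lines is None:
           lines = []
       lines.append("  " * depth + node["name"])
       for child in node.get("children", []):
           full_render(child, depth + 1, lines)
       return "\n".join(lines)

   def bounded_render(node, limit=LIMIT):
       """Stop as soon as the character budget is spent (the cheap path)."""
       out, total = [], 0
       stack = [(node, 0)]
       while stack and total < limit:
           n, depth = stack.pop()
           piece = ("\n" if out else "") + "  " * depth + n["name"]
           out.append(piece)
           total += len(piece)
           stack.extend((c, depth + 1) for c in reversed(n.get("children", [])))
       return "".join(out)[:limit]

   # A deep chain of nodes stands in for a large physical plan.
   plan = {"name": "node0"}
   cur = plan
   for i in range(1, 300):
       nxt = {"name": f"node{i}"}
       cur["children"] = [nxt]
       cur = nxt

   assert bounded_render(plan) == full_render(plan)[:LIMIT]
   ```

   The bounded version never materializes the tail of the plan string, which is 
the same effect the switch to `simpleStringWithNodeId` aims for.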
   
   **IMPORTANT NOTE:** This issue is causing an OutOfMemory (OOM) error in 
certain unit tests within GraphFrames, as reported in [Connected Components 
gives wrong results](https://github.com/graphframes/graphframes/issues/453). It 
may also be a contributing factor to the frequent overuse of checkpoints, not 
only in GraphFrames but also among many Spark users.
   
   ### Does this PR introduce _any_ user-facing change?  
   Yes. By default, plan descriptions will no longer be available in the Spark 
UI. Users who require this information must explicitly enable it by setting the 
`spark.sql.ui.explainMode` configuration.  
   
   ### How was this patch tested?  
   Unit tests from **sql/core** and **sql/catalyst**, along with the test 
attached to the 
[SPARK-50992](https://issues.apache.org/jira/browse/SPARK-50992) ticket.  
   
   ### Was this patch authored or co-authored using generative AI tooling?  
   No.  
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
