yabola opened a new pull request, #44949:
URL: https://github.com/apache/spark/pull/44949
### What changes were proposed in this pull request?
Add a new config `spark.broadcast.cleanAfterExecution.enabled` (default `false`).
When enabled, the broadcast data generated during a SQL execution is cleaned up
as soon as that execution ends (intended only for long-running Spark SQL
services).
Before this PR: broadcast data could only be cleaned up when a GC was
triggered, which can waste a lot of memory and can also make queries unstable
when a single GC pause takes too long.
After this PR: once a SQL execution completes, the broadcast data generated
during that execution is destroyed immediately.
Note: this config is only suitable for long-running Spark SQL services. If it
is enabled and the same DataFrame is collected twice, the second execution
will fail to find the broadcast data, because it was already cleaned up after
the first execution.
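As a sketch of how a deployment might opt in (the config key comes from this PR; setting it in `spark-defaults.conf` is one hypothetical way to enable it for a long-running service):

```properties
# Hypothetical spark-defaults.conf entry for a long-running Spark SQL service.
# When true, broadcast data created by a SQL execution is destroyed as soon as
# that execution ends, instead of waiting for a driver GC cycle to reclaim it.
spark.broadcast.cleanAfterExecution.enabled  true
```

Note the caveat above: with this enabled, an application that collects the same DataFrame twice would hit missing broadcast data on the second collect, so it should only be turned on for services where each SQL execution is independent.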
### Why are the changes needed?
Reduce the memory load on the driver and executors, which makes a long-running
Spark service more stable.
### Does this PR introduce _any_ user-facing change?
Yes. This PR adds the config `spark.broadcast.cleanAfterExecution.enabled`.
It defaults to `false`; if set to `true`, the broadcast data generated by a
SQL execution is destroyed when that execution completes.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]