yabola opened a new pull request, #44949:
URL: https://github.com/apache/spark/pull/44949
### What changes were proposed in this pull request?
Add a new config `spark.broadcast.cleanAfterExecution.enabled` (default `false`).
When enabled, the broadcast data generated during a SQL execution is cleaned up
as soon as that execution ends (intended only for long-running Spark SQL
services).
Before this PR: broadcast data could only be cleaned up when a GC was
triggered, which can waste a lot of memory and can also make queries unstable
when a single GC pause takes too long.
After this PR: once a SQL execution completes, the broadcast data generated
during that execution is destroyed immediately.
Note: this config is only suitable for long-running Spark SQL services. If it
is enabled and the same DataFrame is collected twice, the second execution
will fail to find the broadcast data, because it was already cleaned up after
the first execution.
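As a sketch of how a deployment might opt in (the config key comes from this PR; setting it in `spark-defaults.conf` is one hypothetical way to enable it for a long-running service):

```properties
# Hypothetical spark-defaults.conf entry for a long-running Spark SQL service.
# When true, broadcast data created by a SQL execution is destroyed as soon as
# that execution ends, instead of waiting for a driver GC cycle to reclaim it.
spark.broadcast.cleanAfterExecution.enabled  true
```

Note the caveat above: with this enabled, an application that collects the same DataFrame twice would hit missing broadcast data on the second collect, so it should only be turned on for services where each SQL execution is independent.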
### Why are the changes needed?
Reduce the memory load on the driver and executors, which makes a long-running
Spark service more stable.
### Does this PR introduce _any_ user-facing change?
Yes. This PR adds the config `spark.broadcast.cleanAfterExecution.enabled`.
It defaults to `false`; if set to `true`, the broadcast data generated by a
SQL execution is destroyed when that execution completes.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]