[
https://issues.apache.org/jira/browse/SPARK-47764?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17864823#comment-17864823
]
Alessandro Bellina commented on SPARK-47764:
--------------------------------------------
Adding https://issues.apache.org/jira/browse/SPARK-48861 as a related issue.
> Cleanup shuffle dependencies for Spark Connect SQL executions
> -------------------------------------------------------------
>
> Key: SPARK-47764
> URL: https://issues.apache.org/jira/browse/SPARK-47764
> Project: Spark
> Issue Type: Improvement
> Components: Spark Core, SQL
> Affects Versions: 4.0.0
> Reporter: Bo Zhang
> Assignee: Bo Zhang
> Priority: Major
> Labels: pull-request-available
> Fix For: 4.0.0
>
>
> Shuffle dependencies are created by shuffle map stages and consist of
> files on disk plus the corresponding references in Spark JVM heap memory.
> Currently Spark cleans up unused shuffle dependencies through JVM GCs,
> with periodic GCs triggered once every 30 minutes (see ContextCleaner).
> However, we have still found cases in which the shuffle data files grow
> too large, which makes shuffle data migration slow.
>
> We do have opportunities to clean up shuffle dependencies, especially for
> SQL queries created by Spark Connect, since we have better control over the
> DataFrame instances there. Even if DataFrame instances are reused on the
> client side, the instances are still recreated on the server side.
>
> We might also provide options to 1. clean up eagerly after each query
> execution, or 2. only mark the shuffle dependencies and skip migrating
> them at node decommission.
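The GC-driven mechanism the description refers to can be sketched outside of Spark: ContextCleaner tracks shuffle dependencies with weak references and removes the on-disk shuffle files only after the JVM garbage-collects the dependency objects (the periodic trigger interval is controlled by the real `spark.cleaner.periodicGC.interval` configuration, defaulting to 30 minutes). Below is a minimal, hypothetical Python analogue; `ShuffleDependency`, `ContextCleanerSketch`, and `cleaned` are illustrative names, not Spark APIs.

```python
import gc
import weakref

# Records shuffle ids whose "files" have been cleaned up (illustrative).
cleaned = []

class ShuffleDependency:
    """Stand-in for a shuffle dependency holding on-disk shuffle data."""
    def __init__(self, shuffle_id):
        self.shuffle_id = shuffle_id

class ContextCleanerSketch:
    """Mimics Spark's ContextCleaner pattern: register a weak reference
    per dependency; the cleanup callback fires only once the object is
    garbage-collected, so cleanup latency depends on GC timing."""
    def __init__(self):
        self._refs = []

    def register(self, dep):
        # The callback receives the dead weakref; capture the id eagerly.
        self._refs.append(weakref.ref(
            dep, lambda _ref, sid=dep.shuffle_id: cleaned.append(sid)))

cleaner = ContextCleanerSketch()
dep = ShuffleDependency(0)
cleaner.register(dep)
del dep       # drop the last strong reference, as when a query finishes
gc.collect()  # stands in for the periodic GC that Spark triggers
```

This is exactly why the issue proposes eager cleanup: with GC-based tracking, shuffle files linger until a collection happens to run, whereas the server side of Spark Connect knows when a query's DataFrame is done and could release its shuffles immediately.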
--
This message was sent by Atlassian Jira
(v8.20.10#820010)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]