[
https://issues.apache.org/jira/browse/SPARK-47764?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17864823#comment-17864823
]
Alessandro Bellina commented on SPARK-47764:
--------------------------------------------
Adding https://issues.apache.org/jira/browse/SPARK-48861 as a related issue.
> Cleanup shuffle dependencies for Spark Connect SQL executions
> -------------------------------------------------------------
>
> Key: SPARK-47764
> URL: https://issues.apache.org/jira/browse/SPARK-47764
> Project: Spark
> Issue Type: Improvement
> Components: Spark Core, SQL
> Affects Versions: 4.0.0
> Reporter: Bo Zhang
> Assignee: Bo Zhang
> Priority: Major
> Labels: pull-request-available
> Fix For: 4.0.0
>
>
> Shuffle dependencies are created by shuffle map stages and consist of
> files on disk plus the corresponding references in Spark JVM heap memory.
> Currently Spark cleans up unused shuffle dependencies through JVM GCs,
> with periodic GCs triggered once every 30 minutes (see ContextCleaner).
> However, we have still found cases in which the shuffle data files grow
> too large, which makes shuffle data migration slow.
>
> We do have opportunities to clean up shuffle dependencies, especially for
> SQL queries created by Spark Connect, since we have better control over the
> DataFrame instances there. Even if DataFrame instances are reused on the
> client side, the instances are still recreated on the server side.
>
> We might also provide options to 1. clean up eagerly after each query
> execution, or 2. only mark the shuffle dependencies and skip migrating
> them at node decommission.
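The GC-driven mechanism the description refers to can be sketched outside of Spark: ContextCleaner tracks shuffle dependencies with weak references and removes the on-disk shuffle files only after the JVM garbage-collects the dependency objects (the periodic trigger interval is controlled by the real `spark.cleaner.periodicGC.interval` configuration, defaulting to 30 minutes). Below is a minimal, hypothetical Python analogue; `ShuffleDependency`, `ContextCleanerSketch`, and `cleaned` are illustrative names, not Spark APIs.

```python
import gc
import weakref

# Records shuffle ids whose "files" have been cleaned up (illustrative).
cleaned = []

class ShuffleDependency:
    """Stand-in for a shuffle dependency holding on-disk shuffle data."""
    def __init__(self, shuffle_id):
        self.shuffle_id = shuffle_id

class ContextCleanerSketch:
    """Mimics Spark's ContextCleaner pattern: register a weak reference
    per dependency; the cleanup callback fires only once the object is
    garbage-collected, so cleanup latency depends on GC timing."""
    def __init__(self):
        self._refs = []

    def register(self, dep):
        # The callback receives the dead weakref; capture the id eagerly.
        self._refs.append(weakref.ref(
            dep, lambda _ref, sid=dep.shuffle_id: cleaned.append(sid)))

cleaner = ContextCleanerSketch()
dep = ShuffleDependency(0)
cleaner.register(dep)
del dep       # drop the last strong reference, as when a query finishes
gc.collect()  # stands in for the periodic GC that Spark triggers
```

This is exactly why the issue proposes eager cleanup: with GC-based tracking, shuffle files linger until a collection happens to run, whereas the server side of Spark Connect knows when a query's DataFrame is done and could release its shuffles immediately.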
--
This message was sent by Atlassian Jira
(v8.20.10#820010)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]