[ 
https://issues.apache.org/jira/browse/SPARK-21097?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16801960#comment-16801960
 ] 

Imran Rashid commented on SPARK-21097:
--------------------------------------

{quote}
I'm wondering if this is going to be subsumed by the Shuffle Service redesign 
proposal.
{quote}

The shuffle service redesign does not subsume this.  It only covers shuffle 
blocks currently.  While there are some commonalities in shuffle vs. cached 
blocks, there are also enough differences they need special handling.

You could *extend* the redesigned shuffle service to do something like this, 
but it wouldn't be exactly the same.  You'd be giving up locality on the 
executors if you move cached rdd blocks over, so it may be tricky to decide 
when to move blocks over.  It would definitely be additional work we'd do on 
top of the shuffle service (which I think might make the most sense as the way 
to do this in any case).

> Dynamic allocation will preserve cached data
> --------------------------------------------
>
>                 Key: SPARK-21097
>                 URL: https://issues.apache.org/jira/browse/SPARK-21097
>             Project: Spark
>          Issue Type: Improvement
>          Components: Block Manager, Scheduler, Spark Core
>    Affects Versions: 2.2.0, 2.3.0
>            Reporter: Brad
>            Priority: Major
>         Attachments: Preserving Cached Data with Dynamic Allocation.pdf
>
>
> We want to use dynamic allocation to distribute resources among many notebook 
> users on our spark clusters. One difficulty is that if a user has cached data 
> then we are either prevented from de-allocating any of their executors, or we 
> are forced to drop their cached data, which can lead to a bad user experience.
> We propose adding a feature to preserve cached data by copying it to other 
> executors before de-allocation. This behavior would be enabled by a simple 
> spark config. Now when an executor reaches its configured idle timeout, 
> instead of just killing it on the spot, we will stop sending it new tasks, 
> replicate all of its rdd blocks onto other executors, and then kill it. If 
> there is an issue while we replicate the data, like an error, it takes too 
> long, or there isn't enough space, then we will fall back to the original 
> behavior and drop the data and kill the executor.
> This feature should allow anyone with notebook users to use their cluster 
> resources more efficiently. Also since it will be completely opt-in it will 
> unlikely to cause problems for other use cases.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

Reply via email to