[ 
https://issues.apache.org/jira/browse/SPARK-25299?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matt Cheah updated SPARK-25299:
-------------------------------
    Description: 
In Spark, the shuffle primitive requires Spark executors to persist data to the 
local disk of the worker nodes. If executors crash, the external shuffle 
service can continue to serve the shuffle data that was written beyond the 
lifetime of the executor itself. In YARN, Mesos, and Standalone mode, the 
external shuffle service is deployed on every worker node. The shuffle service 
shares local disk with the executors that run on its node.

There are some shortcomings with the way shuffle is fundamentally implemented 
right now. Particularly:
 * If any external shuffle service process or node becomes unavailable, all 
applications that had an executor that ran on that node must recompute the 
shuffle blocks that were lost.
 * Similarly to the above, the external shuffle service must be kept running at 
all times, which may waste resources when no applications are using that 
shuffle service node.
 * Mounting local storage can prevent users from taking advantage of desirable 
isolation benefits from using containerized environments, like Kubernetes. We 
had an external shuffle service implementation in an early prototype of the 
Kubernetes backend, but it was rejected due to its strict requirement to be 
able to mount hostPath volumes or other persistent volume setups.

In the following [architecture discussion 
document|https://docs.google.com/document/d/1uCkzGGVG17oGC6BJ75TpzLAZNorvrAU3FRd2X-rVHSM/edit#heading=h.btqugnmt2h40]
 (note: _not_ an SPIP), we brainstorm various high level architectures for 
improving the external shuffle service in a way that addresses the above 
problems. The purpose of this umbrella JIRA is to promote additional discussion 
on how we can approach these problems, both at the architecture level and the 
implementation level. We anticipate filing sub-issues that break down the tasks 
that must be completed to achieve this goal.

Edit June 28 2019: Our SPIP is here: 
[https://docs.google.com/document/d/1d6egnL6WHOwWZe8MWv3m8n4PToNacdx7n_0iMSWwhCQ/edit]

  was:
In Spark, the shuffle primitive requires Spark executors to persist data to the 
local disk of the worker nodes. If executors crash, the external shuffle 
service can continue to serve the shuffle data that was written beyond the 
lifetime of the executor itself. In YARN, Mesos, and Standalone mode, the 
external shuffle service is deployed on every worker node. The shuffle service 
shares local disk with the executors that run on its node.

There are some shortcomings with the way shuffle is fundamentally implemented 
right now. Particularly:
 * If any external shuffle service process or node becomes unavailable, all 
applications that had an executor that ran on that node must recompute the 
shuffle blocks that were lost.
 * Similarly to the above, the external shuffle service must be kept running at 
all times, which may waste resources when no applications are using that 
shuffle service node.
 * Mounting local storage can prevent users from taking advantage of desirable 
isolation benefits from using containerized environments, like Kubernetes. We 
had an external shuffle service implementation in an early prototype of the 
Kubernetes backend, but it was rejected due to its strict requirement to be 
able to mount hostPath volumes or other persistent volume setups.

In the following [architecture discussion 
document|https://docs.google.com/document/d/1uCkzGGVG17oGC6BJ75TpzLAZNorvrAU3FRd2X-rVHSM/edit#heading=h.btqugnmt2h40]
 (note: _not_ an SPIP), we brainstorm various high level architectures for 
improving the external shuffle service in a way that addresses the above 
problems. The purpose of this umbrella JIRA is to promote additional discussion 
on how we can approach these problems, both at the architecture level and the 
implementation level. We anticipate filing sub-issues that break down the tasks 
that must be completed to achieve this goal.


> Use remote storage for persisting shuffle data
> ----------------------------------------------
>
>                 Key: SPARK-25299
>                 URL: https://issues.apache.org/jira/browse/SPARK-25299
>             Project: Spark
>          Issue Type: New Feature
>          Components: Shuffle
>    Affects Versions: 2.4.0
>            Reporter: Matt Cheah
>            Priority: Major
>              Labels: SPIP
>
> In Spark, the shuffle primitive requires Spark executors to persist data to 
> the local disk of the worker nodes. If executors crash, the external shuffle 
> service can continue to serve the shuffle data that was written beyond the 
> lifetime of the executor itself. In YARN, Mesos, and Standalone mode, the 
> external shuffle service is deployed on every worker node. The shuffle 
> service shares local disk with the executors that run on its node.
> There are some shortcomings with the way shuffle is fundamentally implemented 
> right now. Particularly:
>  * If any external shuffle service process or node becomes unavailable, all 
> applications that had an executor that ran on that node must recompute the 
> shuffle blocks that were lost.
>  * Similarly to the above, the external shuffle service must be kept running 
> at all times, which may waste resources when no applications are using that 
> shuffle service node.
>  * Mounting local storage can prevent users from taking advantage of 
> desirable isolation benefits from using containerized environments, like 
> Kubernetes. We had an external shuffle service implementation in an early 
> prototype of the Kubernetes backend, but it was rejected due to its strict 
> requirement to be able to mount hostPath volumes or other persistent volume 
> setups.
> In the following [architecture discussion 
> document|https://docs.google.com/document/d/1uCkzGGVG17oGC6BJ75TpzLAZNorvrAU3FRd2X-rVHSM/edit#heading=h.btqugnmt2h40]
>  (note: _not_ an SPIP), we brainstorm various high level architectures for 
> improving the external shuffle service in a way that addresses the above 
> problems. The purpose of this umbrella JIRA is to promote additional 
> discussion on how we can approach these problems, both at the architecture 
> level and the implementation level. We anticipate filing sub-issues that 
> break down the tasks that must be completed to achieve this goal.
> Edit June 28 2019: Our SPIP is here: 
> [https://docs.google.com/document/d/1d6egnL6WHOwWZe8MWv3m8n4PToNacdx7n_0iMSWwhCQ/edit]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to