GitHub user squito opened a pull request:
https://github.com/apache/spark/pull/7839
[SPARK-9439] [yarn] [wip] External shuffle service should be robust to NM
restarts
https://issues.apache.org/jira/browse/SPARK-9439
In general, Yarn apps should be robust to NodeManager restarts. However,
if you run spark with the external shuffle service on, after a NM restart all
shuffles fail, b/c the shuffle service has lost some state with info on each
executor. (Note the shuffle data is perfectly fine on disk across a NM
restart, the problem is we've lost the small bit of state that lets us *find*
those files.)
The solution proposed here is that the external shuffle service can write
out its state to a file every time an executor is added. When running with
yarn, that file is in the NM's local dir. Whenever the service is started, it
looks for that file, and if it exists, it reads the file and re-registers all
executors there.
Nothing is changed in non-yarn modes with this patch. The service is not
given a place to save the state to, so it operates the same as before. This
should make it easy to update other cluster managers as well, but just
supplying the right file, but I wasn't familiar enough w/ the other modes to
know where they should stick that file.
The current code needs a lot of polish & tests, but thought I'd put it up
now to get opinions on the overall approach. Also I'm pretty sure there are
some potential leaks, if the NM restarts right when an app ends, and then we
never remove executors or something like that, need to think about that more.
But, I have tested this on a small cluster, killed the NM on a running spark
app, and it works.
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/squito/spark
external_shuffle_service_NM_restart
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/7839.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #7839
----
commit b9d2ced92a9712edf7612f4cee1ac929e5fa3dab
Author: Imran Rashid <[email protected]>
Date: 2015-07-28T14:17:15Z
incomplete setup for external shuffle service tests
commit 36127d38750e75d682c8858b7128bd634732be71
Author: Imran Rashid <[email protected]>
Date: 2015-07-28T18:27:02Z
wip
commit bb3ba494be05a11ab8105ce4ce570f60d1d33e51
Author: Imran Rashid <[email protected]>
Date: 2015-07-28T18:27:12Z
minor cleanup
commit c69f46b98b6b9da16bfc7fc3a15deeade3e229ed
Author: Imran Rashid <[email protected]>
Date: 2015-07-30T20:08:49Z
maybe working version, needs tests & cleanup ...
commit 5e5a7c36c3f6ac2c0f0dd6738eb4bb66605f8428
Author: Imran Rashid <[email protected]>
Date: 2015-07-31T15:37:14Z
fix build
commit ad122ef6b07eba0326929057704ef827a7eb26d6
Author: Imran Rashid <[email protected]>
Date: 2015-07-31T15:40:00Z
more fixes
commit 0b588bd5e0dd4ee81f9b4f856f9a57f4a4c72151
Author: Imran Rashid <[email protected]>
Date: 2015-07-31T15:52:57Z
more fixes ...
----
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at [email protected] or file a JIRA ticket
with INFRA.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]