GitHub user sitalkedia opened a pull request:
https://github.com/apache/spark/pull/18150
Cleanup shuffle
## What changes were proposed in this pull request?
Currently, when we detect a fetch failure, we only remove the shuffle files
produced by the executor involved, even though the host itself might be down
and none of its shuffle files accessible. When multiple executors run on a
host, that host going down currently results in multiple fetch failures and
multiple retries of the stage, which is very inefficient. If we instead remove
all the shuffle files on that host on the first fetch failure, we can rerun all
the tasks whose output was on that host in a single stage retry.
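
The difference in retry counts can be illustrated with a small toy model. This is only a sketch and not Spark's scheduler code: `MapOutput`, `ToyRegistry`, `count_stage_retries`, and the host and executor names are all hypothetical, and the model assumes that each stage retry surfaces exactly one fetch failure and that lost outputs are recomputed on a healthy host.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MapOutput:
    """Hypothetical record of where one map task's shuffle output lives."""
    map_id: int
    executor_id: str
    host: str


class ToyRegistry:
    """Toy stand-in for the driver's bookkeeping of shuffle output locations."""

    def __init__(self, outputs):
        self.outputs = {o.map_id: o for o in outputs}

    def unregister(self, predicate):
        # Forget every output matching the predicate and return what was lost.
        lost = [o for o in self.outputs.values() if predicate(o)]
        for o in lost:
            del self.outputs[o.map_id]
        return lost

    def register(self, output):
        self.outputs[output.map_id] = output


def count_stage_retries(outputs, dead_host, remove_whole_host):
    """Count stage retries until no registered output points at dead_host.

    Assumption: each attempt hits one fetch failure, on the lowest-numbered
    unreachable map output, and the lost outputs are recomputed elsewhere.
    """
    registry = ToyRegistry(outputs)
    retries = 0
    while True:
        unreachable = [o for o in registry.outputs.values() if o.host == dead_host]
        if not unreachable:
            return retries
        failed = min(unreachable, key=lambda o: o.map_id)
        if remove_whole_host:
            # Proposed behavior: drop everything registered on the failed host.
            lost = registry.unregister(lambda o: o.host == failed.host)
        else:
            # Current behavior: drop only the failing executor's outputs.
            lost = registry.unregister(lambda o: o.executor_id == failed.executor_id)
        for o in lost:
            # Recompute the lost map outputs on a healthy (hypothetical) host.
            registry.register(MapOutput(o.map_id, "exec-healthy", "host-healthy"))
        retries += 1


outputs = [
    MapOutput(0, "exec-1", "host-a"),
    MapOutput(1, "exec-1", "host-a"),
    MapOutput(2, "exec-2", "host-a"),
    MapOutput(3, "exec-3", "host-a"),
    MapOutput(4, "exec-4", "host-b"),
]
print(count_stage_retries(outputs, "host-a", remove_whole_host=False))
print(count_stage_retries(outputs, "host-a", remove_whole_host=True))
```

With three executors on the dead host, this toy model needs three stage retries under per-executor removal and one under per-host removal. It ignores details that the commits in this pull request handle, such as the change being a no-op when the external shuffle service is disabled.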
## How was this patch tested?
Unit tests, plus a job run on the cluster to confirm that the multiple stage
retries are gone.
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/sitalkedia/spark cleanup_shuffle
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/18150.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #18150
----
commit 74ca88bc1d2b67cc12ea32a3cd344ec0259500a9
Author: Sital Kedia <[email protected]>
Date: 2017-02-25T00:35:00Z
[SPARK-19753][CORE] All shuffle files on a host should be removed in case
of fetch failure or slave lost
commit 6898c2bb0f8a65dcc488e53b248fbeaec64efdb8
Author: Sital Kedia <[email protected]>
Date: 2017-03-01T02:03:55Z
Do not un-register shuffle files in case of executor lost
commit 32a2315caa07a5a6be1bd92ec1e13500b74308cb
Author: Sital Kedia <[email protected]>
Date: 2017-03-01T02:13:07Z
no-op when external shuffle service is disabled
commit c7c3129dcc4ad2fc1a75bff5a941f6c4a8dfd0ef
Author: Sital Kedia <[email protected]>
Date: 2017-03-01T02:23:59Z
fix check style
commit f96ec68d6922fe2108c5869fedf2d8aca373c6eb
Author: Sital Kedia <[email protected]>
Date: 2017-03-16T22:24:43Z
Addressed review comments and fixed a bug
commit d4979e35137152db00c53ea0b9e82aaf41dad5b5
Author: Sital Kedia <[email protected]>
Date: 2017-03-16T23:00:17Z
Fix build
commit 8787db1679c5b468afa3d2ede64eee53908fa5de
Author: Sital Kedia <[email protected]>
Date: 2017-03-17T02:37:44Z
Fix test failures
commit 4ca9527a8cf78ba1c3e64c81ee6afc9e93b05fe6
Author: Imran Rashid <[email protected]>
Date: 2017-03-17T15:53:25Z
refactoring & comments
commit 9f64e2931eabd2fcc5909123e73c9c046caceb3b
Author: Sital Kedia <[email protected]>
Date: 2017-03-18T04:03:37Z
Review comments
commit be3b3dbd2d813a3d1d164d9b7f8127d09b752880
Author: Sital Kedia <[email protected]>
Date: 2017-03-24T22:39:05Z
Minor changes as per review comments
----