GitHub user sitalkedia opened a pull request:

    https://github.com/apache/spark/pull/18150

    Cleanup shuffle

    ## What changes were proposed in this pull request?
    
    Currently, when we detect a fetch failure, we only remove the shuffle files 
produced by the failing executor, even though the host itself might be down, in 
which case none of its shuffle files are accessible. When multiple executors 
run on a host, that host going down therefore results in multiple fetch 
failures and multiple retries of the stage, which is very inefficient. If we 
instead remove all the shuffle files on that host on the first fetch failure, 
we can rerun all the tasks whose output was on that host in a single stage retry.
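
The difference between the two strategies can be illustrated with a toy model. This is only a sketch under stated assumptions, not the code in this patch: `MapOutput`, `remove_outputs_on_executor`, `remove_outputs_on_host`, and `stage_retries_until_clean` are hypothetical names, and the model assumes each stage retry is triggered by a single fetch failure and that recomputed outputs land on healthy hosts.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MapOutput:
    """Location of one map task's shuffle output (hypothetical model)."""
    map_id: int
    host: str
    executor_id: str


def remove_outputs_on_executor(outputs, executor_id):
    """Current behaviour: drop only the outputs of the failing executor."""
    return [o for o in outputs if o.executor_id != executor_id]


def remove_outputs_on_host(outputs, host):
    """Proposed behaviour: drop every output stored on the failing host."""
    return [o for o in outputs if o.host != host]


def stage_retries_until_clean(outputs, dead_host, remove_by_host):
    """Count stage retries until no registered output points at dead_host."""
    retries = 0
    while True:
        stale = [o for o in outputs if o.host == dead_host]
        if not stale:
            return retries
        # Assumption: a reducer fails on the first stale output it fetches.
        failed = stale[0]
        if remove_by_host:
            outputs = remove_outputs_on_host(outputs, failed.host)
        else:
            outputs = remove_outputs_on_executor(outputs, failed.executor_id)
        retries += 1


# Three executors share host-a, which has gone down; host-b is healthy.
outputs = [
    MapOutput(0, "host-a", "exec-1"),
    MapOutput(1, "host-a", "exec-2"),
    MapOutput(2, "host-a", "exec-3"),
    MapOutput(3, "host-b", "exec-4"),
]

per_executor = stage_retries_until_clean(outputs, "host-a", remove_by_host=False)
per_host = stage_retries_until_clean(outputs, "host-a", remove_by_host=True)
print(per_executor, per_host)
```

In this model the per-executor strategy needs one retry per executor on the dead host (3 here), while the per-host strategy needs a single retry.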
    
    ## How was this patch tested?
    
    Unit tests. Also ran a job on the cluster and confirmed that the multiple 
stage retries no longer occur.
    


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/sitalkedia/spark cleanup_shuffle

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/18150.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #18150
    
----
commit 74ca88bc1d2b67cc12ea32a3cd344ec0259500a9
Author: Sital Kedia <[email protected]>
Date:   2017-02-25T00:35:00Z

    [SPARK-19753][CORE] All shuffle files on a host should be removed in case 
of fetch failure or slave lost

commit 6898c2bb0f8a65dcc488e53b248fbeaec64efdb8
Author: Sital Kedia <[email protected]>
Date:   2017-03-01T02:03:55Z

    Do not un-register shuffle files in case of executor lost

commit 32a2315caa07a5a6be1bd92ec1e13500b74308cb
Author: Sital Kedia <[email protected]>
Date:   2017-03-01T02:13:07Z

    no-op when external shuffle service is disabled

commit c7c3129dcc4ad2fc1a75bff5a941f6c4a8dfd0ef
Author: Sital Kedia <[email protected]>
Date:   2017-03-01T02:23:59Z

    fix check style

commit f96ec68d6922fe2108c5869fedf2d8aca373c6eb
Author: Sital Kedia <[email protected]>
Date:   2017-03-16T22:24:43Z

    Addressed review comments and fixed a bug

commit d4979e35137152db00c53ea0b9e82aaf41dad5b5
Author: Sital Kedia <[email protected]>
Date:   2017-03-16T23:00:17Z

    Fix build

commit 8787db1679c5b468afa3d2ede64eee53908fa5de
Author: Sital Kedia <[email protected]>
Date:   2017-03-17T02:37:44Z

    Fix test failures

commit 4ca9527a8cf78ba1c3e64c81ee6afc9e93b05fe6
Author: Imran Rashid <[email protected]>
Date:   2017-03-17T15:53:25Z

    refactoring & comments

commit 9f64e2931eabd2fcc5909123e73c9c046caceb3b
Author: Sital Kedia <[email protected]>
Date:   2017-03-18T04:03:37Z

    Review comments

commit be3b3dbd2d813a3d1d164d9b7f8127d09b752880
Author: Sital Kedia <[email protected]>
Date:   2017-03-24T22:39:05Z

    Minor changes as per review comments

----

