GitHub user sitalkedia opened a pull request:

    https://github.com/apache/spark/pull/17088

    [SPARK-19753][CORE] All shuffle files on a host should be removed in …

    ## What changes were proposed in this pull request?
    
    Currently, when a fetch failure is detected, we only remove the shuffle 
files produced by the failed executor, even though the host itself may be down, 
in which case none of the shuffle files on it are accessible. When multiple 
executors run on a host, a host going down currently results in multiple fetch 
failures and multiple retries of the stage, which is very inefficient. If we 
instead remove all the shuffle files on that host on the first fetch failure, 
we can rerun all the tasks for that host in a single stage retry.
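
    The idea above can be sketched as follows. This is a minimal illustration, 
not Spark's actual Scala code: the class and method names (`ShuffleOutputRegistry`, 
`remove_outputs_on_host`, etc.) are hypothetical stand-ins for the map-output 
bookkeeping the scheduler performs.

    ```python
    class ShuffleOutputRegistry:
        """Toy registry of shuffle map outputs: shuffle_id -> {map_id: (executor_id, host)}."""

        def __init__(self):
            self.outputs = {}

        def register(self, shuffle_id, map_id, executor_id, host):
            self.outputs.setdefault(shuffle_id, {})[map_id] = (executor_id, host)

        def remove_outputs_on_executor(self, executor_id):
            # Old behavior: only the failed executor's outputs are invalidated,
            # so each remaining executor on a dead host later triggers its own
            # fetch failure and its own stage retry.
            for statuses in self.outputs.values():
                for map_id in [m for m, (e, _) in statuses.items() if e == executor_id]:
                    del statuses[map_id]

        def remove_outputs_on_host(self, host):
            # Proposed behavior: one fetch failure invalidates every output on
            # the host, so a single stage retry regenerates all of them at once.
            for statuses in self.outputs.values():
                for map_id in [m for m, (_, h) in statuses.items() if h == host]:
                    del statuses[map_id]

        def available(self, shuffle_id):
            return set(self.outputs.get(shuffle_id, {}))
    ```

    For example, with two executors on `host-1` and one on `host-2`, removing 
only `exec-1` leaves `exec-2`'s stale outputs registered, while removing 
`host-1` clears both in one pass.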
    
    ## How was this patch tested?
    
    Unit tests; also ran a job on the cluster and verified that the multiple 
stage retries no longer occur.


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/sitalkedia/spark cleanup_shuffle

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/17088.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #17088
    
----
commit 74ca88bc1d2b67cc12ea32a3cd344ec0259500a9
Author: Sital Kedia <[email protected]>
Date:   2017-02-25T00:35:00Z

    [SPARK-19753][CORE] All shuffle files on a host should be removed in case 
of fetch failure or slave lost

----


