GitHub user jerryshao opened a pull request:

    https://github.com/apache/spark/pull/17113

    [SPARK-13669][Core]Improve the blacklist mechanism to handle external 
shuffle service unavailable situation

    ## What changes were proposed in this pull request?
    
    We are running into an issue when YARN work-preserving restart is enabled
    together with the external shuffle service.
    With work-preserving restart enabled, a NodeManager (NM) failure does not
    cause executors to exit, so they keep accepting and running tasks. However,
    once the NM has failed, the external shuffle service on that node is
    inaccessible, so reduce tasks always fail with “Fetch failure”, and the
    failed reduce stage causes its parent (map) stage to be rerun. The tricky
    part is that the Spark scheduler is not aware that the external shuffle
    service is unavailable, so it reschedules map tasks on executors whose NM
    has failed, and the reduce stage fails with “Fetch failure” again. After 4
    retries, the job fails. The same problem could apply to other cluster
    managers that use an external shuffle service.
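
    The retry loop described above can be illustrated with a toy model. This
    is a sketch under stated assumptions, not Spark code: `run_job`,
    `MAX_STAGE_ATTEMPTS`, and the executor/host names are hypothetical, and
    the limit of 4 is taken from the description above.

```python
# Toy model, not Spark code: every name below is hypothetical.
MAX_STAGE_ATTEMPTS = 4  # the retry limit cited in the description above


def run_job(executors, shuffle_service_up, max_attempts=MAX_STAGE_ATTEMPTS):
    """Return (succeeded, attempts_used) for one map + reduce job.

    executors: dict of executor_id -> host
    shuffle_service_up: dict of host -> bool (False once the NM has failed)
    """
    for attempt in range(1, max_attempts + 1):
        # The scheduler only knows which executors are alive. It has no view
        # of the shuffle service, so executors on the failed host stay eligible
        # and map output is written on every host again.
        map_output_hosts = set(executors.values())
        # Reduce tasks fetch map output through each host's shuffle service.
        if all(shuffle_service_up[host] for host in map_output_hosts):
            return True, attempt
        # Otherwise a fetch failure fails the reduce stage and the map stage
        # is rerun on the same set of executors.
    return False, max_attempts


executors = {"exec-1": "host-a", "exec-2": "host-b"}
result = run_job(executors, {"host-a": True, "host-b": False})
```

    In this model the job exhausts all attempts because nothing removes the
    executor on `host-b` from consideration between retries.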
    
    The main problem, then, is that we should avoid assigning tasks to these bad
    executors (those whose shuffle service is unavailable). Spark's current
    blacklist mechanism can blacklist executors/nodes based on task failures,
    but it does not handle this fetch failure scenario. This PR therefore
    proposes extending the application-level blacklist mechanism to handle
    fetch failures (especially those caused by an unavailable external shuffle
    service) by blacklisting the executors/nodes whose shuffle data cannot be
    fetched.
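
    A minimal sketch of the idea, to make the proposal concrete. The class and
    method names are hypothetical and are not the API added by this pull
    request. The rule that a fetch failure blacklists the whole node when an
    external shuffle service is in use, and only the executor otherwise, is an
    assumption made for illustration.

```python
# Illustrative sketch only: class and method names are hypothetical,
# not the API introduced by this pull request.
from dataclasses import dataclass, field


@dataclass
class FetchFailureBlacklist:
    # Assumption: with an external shuffle service, one failed service makes
    # all shuffle output on that host unreachable, so the node is blacklisted.
    external_shuffle_service: bool = True
    executors: set = field(default_factory=set)
    nodes: set = field(default_factory=set)

    def on_fetch_failure(self, executor_id, host):
        """Record the source of the shuffle data that could not be fetched."""
        if self.external_shuffle_service:
            self.nodes.add(host)
        else:
            self.executors.add(executor_id)

    def is_schedulable(self, executor_id, host):
        """True if neither the executor nor its host is blacklisted."""
        return host not in self.nodes and executor_id not in self.executors

    def filter_offers(self, offers):
        """Keep only (executor_id, host) offers that are not blacklisted."""
        return [offer for offer in offers if self.is_schedulable(*offer)]


offers = [("exec-1", "host-a"), ("exec-2", "host-b"), ("exec-3", "host-b")]
blacklist = FetchFailureBlacklist()
blacklist.on_fetch_failure("exec-2", "host-b")
usable = blacklist.filter_offers(offers)
```

    Here `usable` keeps only the offer on `host-a`, so a rerun map stage would
    no longer write its output behind the unreachable shuffle service.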
    
    ## How was this patch tested?
    
    Unit tests and verification on a small cluster.
    


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/jerryshao/apache-spark SPARK-13669

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/17113.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #17113
    
----
commit a90b23d894bfd568a054a5752c958c36aa58c79b
Author: jerryshao <[email protected]>
Date:   2017-03-01T05:54:21Z

    Improve the blacklist mechanism to handle external shuffle service 
unavailable situation
    
    Change-Id: I1c0776ec98866c5294ea4ed5d98793fdcebf44ae

----


