GitHub user attilapiros opened a pull request:

    https://github.com/apache/spark/pull/21068

    [SPARK-16630][YARN] Blacklist a node if executors won't launch on it

    ## What changes were proposed in this pull request?
    
    This change extends YARN resource allocation handling with blacklisting
    functionality.
    It handles cases where a node is unhealthy or misconfigured such that a
    container won't launch on it. Before this change, blacklisting focused only on
    task execution, but this change introduces YarnAllocatorBlacklistTracker, which
    tracks allocation failures per host (when enabled via
    "spark.yarn.allocation.blacklist.enabled").
    
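    A minimal sketch of what this per-host failure tracking could look like (the
    class and member names here are illustrative assumptions, not the PR's exact
    code):

    ```scala
    import scala.collection.mutable

    // Hypothetical sketch: count allocation failures per host and blacklist a
    // host once its failures reach a threshold, with a time-based expiry.
    class AllocationFailureTrackerSketch(
        maxFailuresPerHost: Int,
        blacklistTimeoutMs: Long,
        clock: () => Long = System.currentTimeMillis) {

      private val failuresPerHost = mutable.Map.empty[String, Int].withDefaultValue(0)
      // host -> timestamp (ms) at which its blacklisting expires
      private val expiryPerHost = mutable.Map.empty[String, Long]

      def handleAllocationFailure(host: String): Unit = {
        failuresPerHost(host) += 1
        if (failuresPerHost(host) >= maxFailuresPerHost) {
          expiryPerHost(host) = clock() + blacklistTimeoutMs
        }
      }

      def currentBlacklist(): Set[String] = {
        val now = clock()
        expiryPerHost.retain { case (_, expiry) => expiry > now } // drop expired hosts
        expiryPerHost.keySet.toSet
      }
    }
    ```
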
    To avoid blacklisting all the cluster nodes, a new limit is introduced for
    the maximum number of blacklisted nodes: "spark.yarn.blacklist.size.limit". If
    this limit is not set directly, a default value is computed from the cluster
    size (YARN's allocateResponse.getNumClusterNodes) and
    "spark.yarn.blacklist.size.default.weight", which is a factor in [0, 1].
    
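    Assuming the config names above, the effective limit could be derived roughly
    as follows (the helper name is hypothetical):

    ```scala
    // Sketch: fall back to clusterSize * weight when no explicit limit is set.
    def effectiveBlacklistSizeLimit(
        configuredLimit: Option[Int],    // spark.yarn.blacklist.size.limit
        numClusterNodes: Int,            // YARN's allocateResponse.getNumClusterNodes
        defaultWeight: Double): Int = {  // spark.yarn.blacklist.size.default.weight
      require(defaultWeight >= 0.0 && defaultWeight <= 1.0,
        "the default weight must be a factor in [0, 1]")
      configuredLimit.getOrElse((numClusterNodes * defaultWeight).toInt)
    }
    ```
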
    This limit applies to all blacklisting (task-level blacklisted hosts are
    included).
    If this limit is reached, only a subset of the nodes (at most as many as the
    limit), the most relevant ones, is communicated to YARN as blacklisted nodes.
    Of two nodes, the one whose blacklisting expires later is taken to be the more
    relevant.
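
    A minimal sketch of that selection, assuming expiry timestamps are tracked per
    host (the function name is hypothetical, not the PR's exact code):

    ```scala
    // Sketch: when the combined blacklist exceeds the limit, report to YARN only
    // the `limit` hosts whose blacklisting expires last.
    def hostsToReport(expiryByHost: Map[String, Long], limit: Int): Set[String] = {
      expiryByHost.toSeq
        .sortBy { case (_, expiry) => -expiry } // latest expiry first
        .take(limit)
        .map { case (host, _) => host }
        .toSet
    }
    ```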
    
    ## How was this patch tested?
    
    With unit tests, including a new suite: YarnAllocatorBlacklistTrackerSuite.
    
    And manually on a cluster, by deleting the Spark jars on one of the nodes.
     
    ### Before this PR
    
    Starting Spark as:
    ```
    spark2-shell --master yarn --deploy-mode client --num-executors 4 --conf spark.executor.memory=4g --conf "spark.yarn.max.executor.failures=6"
    ```
    
    Log is:
    ```
    18/04/12 06:49:36 INFO yarn.ApplicationMaster: Final app status: FAILED, exitCode: 11, (reason: Max number of executor failures (6) reached)
    18/04/12 06:49:39 INFO yarn.ApplicationMaster: Unregistering ApplicationMaster with FAILED (diag message: Max number of executor failures (6) reached)
    18/04/12 06:49:39 INFO impl.AMRMClientImpl: Waiting for application to be successfully unregistered.
    18/04/12 06:49:39 INFO yarn.ApplicationMaster: Deleting staging directory hdfs://apiros-1.gce.test.com:8020/user/systest/.sparkStaging/application_1523459048274_0016
    18/04/12 06:49:39 INFO util.ShutdownHookManager: Shutdown hook called
    ```
    
    
    ### After these changes
    
    Starting Spark as:
    ```
    spark2-shell --master yarn --deploy-mode client --num-executors 4 --conf spark.executor.memory=4g --conf "spark.yarn.max.executor.failures=6" --conf "spark.yarn.allocation.blacklist.enabled=true"
    ```
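
    Equivalently (assuming the config keys named in this description), the same
    settings could be applied programmatically:

    ```scala
    import org.apache.spark.SparkConf

    // Same settings as the shell invocation above, set on a SparkConf.
    val conf = new SparkConf()
      .set("spark.executor.memory", "4g")
      .set("spark.yarn.max.executor.failures", "6")
      .set("spark.yarn.allocation.blacklist.enabled", "true")
    ```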
    
    And the log is:
    ```
    18/04/13 05:37:43 INFO yarn.YarnAllocator: Will request 1 executor container(s), each with 1 core(s) and 4505 MB memory (including 409 MB of overhead)
    18/04/13 05:37:43 INFO yarn.YarnAllocator: Submitted 1 unlocalized container requests.
    18/04/13 05:37:43 INFO yarn.YarnAllocator: Launching container container_1523459048274_0025_01_000008 on host apiros-4.gce.test.com for executor with ID 6
    18/04/13 05:37:43 INFO yarn.YarnAllocator: Received 1 containers from YARN, launching executors on 1 of them.
    18/04/13 05:37:43 INFO yarn.YarnAllocator: Completed container container_1523459048274_0025_01_000007 on host: apiros-4.gce.test.com (state: COMPLETE, exit status: 1)
    18/04/13 05:37:43 INFO yarn.YarnAllocatorBlacklistTracker: blacklisting host as YARN allocation failed: apiros-4.gce.test.com
    18/04/13 05:37:43 INFO yarn.YarnAllocatorBlacklistTracker: adding nodes to YARN application master's blacklist: List(apiros-4.gce.test.com)
    18/04/13 05:37:43 WARN yarn.YarnAllocator: Container marked as failed: container_1523459048274_0025_01_000007 on host: apiros-4.gce.test.com. Exit status: 1. Diagnostics: Exception from container-launch.
    Container id: container_1523459048274_0025_01_000007
    Exit code: 1
    Stack trace: ExitCodeException exitCode=1:
            at org.apache.hadoop.util.Shell.runCommand(Shell.java:604)
            at org.apache.hadoop.util.Shell.run(Shell.java:507)
            at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:789)
            at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:213)
            at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:302)
            at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:82)
            at java.util.concurrent.FutureTask.run(FutureTask.java:266)
            at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
            at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
            at java.lang.Thread.run(Thread.java:748)
    ```

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/attilapiros/spark SPARK-16630

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/21068.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #21068
    
----
commit fd1923ef3a9b7ab5355e13ddf3d3f537ac00c704
Author: “attilapiros” <piros.attila.zsolt@...>
Date:   2018-04-11T13:33:26Z

    initial commit

----

