GitHub user attilapiros opened a pull request:
https://github.com/apache/spark/pull/21068
[SPARK-16630][YARN] Blacklist a node if executors won't launch on it
## What changes were proposed in this pull request?
This change extends YARN resource allocation handling with blacklisting
functionality.
It handles cases where a node is messed up or misconfigured such that a
container won't launch on it. Before this change, blacklisting only covered
task execution, but this change introduces YarnAllocatorBlacklistTracker, which
tracks allocation failures per host (when enabled via
"spark.yarn.allocation.blacklist.enabled").
To avoid blacklisting all the cluster nodes, a new limit is introduced
for the maximum number of blacklisted nodes: "spark.yarn.blacklist.size.limit". If
this limit is not set directly, a default value is computed from the
cluster size (YARN's allocateResponse.getNumClusterNodes) and
"spark.yarn.blacklist.size.default.weight", which is a factor in [0, 1].
This limit applies to all blacklisting (task-level blacklisted hosts
are included).
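For illustration, the effective limit might be derived along these lines
(an assumed helper under the semantics described above, not the PR's code):
```scala
// Returns the explicit limit if configured, else clusterSize * weight.
def effectiveBlacklistSizeLimit(
    configuredLimit: Option[Int], // spark.yarn.blacklist.size.limit, if set
    defaultWeight: Double,        // spark.yarn.blacklist.size.default.weight in [0, 1]
    numClusterNodes: Int          // YARN's allocateResponse.getNumClusterNodes
  ): Int =
  configuredLimit.getOrElse((numClusterNodes * defaultWeight).toInt)
```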
If this limit is reached, only a subset of the nodes (of the limit's size),
the most relevant ones, is communicated to YARN as blacklisted nodes.
Of two nodes, the one whose blacklisting expires later is considered more
relevant.
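A minimal sketch of that selection, assuming expiry timestamps per host
(names are hypothetical):
```scala
// Trim the combined blacklist to the size limit, preferring nodes
// whose blacklisting expires latest.
def mostRelevantNodes(expiryByHost: Map[String, Long], limit: Int): Set[String] =
  expiryByHost.toSeq
    .sortBy { case (_, expiry) => -expiry } // latest expiry first
    .take(limit)
    .map { case (host, _) => host }
    .toSet
```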
## How was this patch tested?
With unit tests, including a new suite: YarnAllocatorBlacklistTrackerSuite.
Also tested manually on a cluster by deleting the Spark jars on one of the nodes.
### Before this PR
Starting Spark as:
```
spark2-shell --master yarn --deploy-mode client --num-executors 4 --conf
spark.executor.memory=4g --conf "spark.yarn.max.executor.failures=6"
```
The log is:
```
18/04/12 06:49:36 INFO yarn.ApplicationMaster: Final app status: FAILED,
exitCode: 11, (reason: Max number of executor failures (6) reached)
18/04/12 06:49:39 INFO yarn.ApplicationMaster: Unregistering
ApplicationMaster with FAILED (diag message: Max number of executor failures
(6) reached)
18/04/12 06:49:39 INFO impl.AMRMClientImpl: Waiting for application to be
successfully unregistered.
18/04/12 06:49:39 INFO yarn.ApplicationMaster: Deleting staging directory
hdfs://apiros-1.gce.test.com:8020/user/systest/.sparkStaging/application_1523459048274_0016
18/04/12 06:49:39 INFO util.ShutdownHookManager: Shutdown hook called
```
### After these changes
Starting Spark as:
```
spark2-shell --master yarn --deploy-mode client --num-executors 4 --conf
spark.executor.memory=4g --conf "spark.yarn.max.executor.failures=6" --conf
"spark.yarn.allocation.blacklist.enabled=true"
```
And the log is:
```
18/04/13 05:37:43 INFO yarn.YarnAllocator: Will request 1 executor
container(s), each with 1 core(s) and 4505 MB memory (including 409 MB of
overhead)
18/04/13 05:37:43 INFO yarn.YarnAllocator: Submitted 1 unlocalized
container requests.
18/04/13 05:37:43 INFO yarn.YarnAllocator: Launching container
container_1523459048274_0025_01_000008 on host apiros-4.gce.test.com for
executor with ID 6
18/04/13 05:37:43 INFO yarn.YarnAllocator: Received 1 containers from YARN,
launching executors on 1 of them.
18/04/13 05:37:43 INFO yarn.YarnAllocator: Completed container
container_1523459048274_0025_01_000007 on host: apiros-4.gce.test.com (state:
COMPLETE, exit status: 1)
18/04/13 05:37:43 INFO yarn.YarnAllocatorBlacklistTracker: blacklisting
host as YARN allocation failed: apiros-4.gce.test.com
18/04/13 05:37:43 INFO yarn.YarnAllocatorBlacklistTracker: adding nodes to
YARN application master's blacklist: List(apiros-4.gce.test.com)
18/04/13 05:37:43 WARN yarn.YarnAllocator: Container marked as failed:
container_1523459048274_0025_01_000007 on host: apiros-4.gce.test.com. Exit
status: 1. Diagnostics: Exception from container-launch.
Container id: container_1523459048274_0025_01_000007
Exit code: 1
Stack trace: ExitCodeException exitCode=1:
at org.apache.hadoop.util.Shell.runCommand(Shell.java:604)
at org.apache.hadoop.util.Shell.run(Shell.java:507)
at
org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:789)
at
org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:213)
at
org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:302)
at
org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:82)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
```
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/attilapiros/spark SPARK-16630
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/21068.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #21068
----
commit fd1923ef3a9b7ab5355e13ddf3d3f537ac00c704
Author: attilapiros <piros.attila.zsolt@...>
Date: 2018-04-11T13:33:26Z
initial commit
----
---