GitHub user sadhen opened a pull request:
https://github.com/apache/spark/pull/21059
fix when numExecutorsTarget equals maxNumExecutors
## What changes were proposed in this pull request?
In dynamic allocation, there are cases where `numExecutorsTarget` has reached
`maxNumExecutors`, but for some reason (`client.requestTotalExecutors` did not
work as expected, or threw an exception due to an RPC failure) the request
never took effect. In that state, `addExecutors` always returns 0 without
calling `client.requestTotalExecutors` again. When there are too many pending
tasks, `maxNeeded < numExecutorsTarget` is false, so
`updateAndSyncNumExecutorsTarget` keeps running into `addExecutors` with
`numExecutorsTarget == maxNumExecutors`. Since `numExecutorsTarget` is hard to
decrease, the result is that we handle the heavy workload with only a few
executors and never dynamically increase their number.
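To make the failure mode concrete, here is a minimal sketch (simplified, hypothetical names loosely mirroring `ExecutorAllocationManager`; this is illustrative, not the actual Spark code): once the target equals the maximum, the early return means a request lost to an RPC failure is never retried.

```scala
// Illustrative model of the stall. Once numExecutorsTarget == maxNumExecutors,
// addExecutors returns 0 without re-issuing the request, so a request that was
// lost to an RPC failure is never re-delivered to the cluster manager.
object StallSketch {
  val maxNumExecutors = 600
  var numExecutorsTarget = 600   // already at the cap
  var requestDelivered = false   // the original RPC failed

  // Buggy behaviour: early return at the cap, no call to the client.
  def addExecutors(): Int = {
    if (numExecutorsTarget >= maxNumExecutors) {
      // logs: "Not adding executors because our current target total is already ..."
      return 0
    }
    requestTotalExecutors(numExecutorsTarget)
    0
  }

  def requestTotalExecutors(n: Int): Unit = { requestDelivered = true }

  def main(args: Array[String]): Unit = {
    // maxNeeded < numExecutorsTarget is false, so the sync loop keeps calling
    // addExecutors -- but the cluster manager never hears about the target.
    (1 to 3).foreach(_ => addExecutors())
    println(s"requestDelivered = $requestDelivered")
  }
}
```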
Online logs:
```
$ grep "Not adding executors because our current target total" spark-job-server.log.9 | tail
[2018-04-12 16:07:19,070] DEBUG .ExecutorAllocationManager [] [akka://JobServer/user/jobManager] - Not adding executors because our current target total is already 600 (limit 600)
[2018-04-12 16:07:20,071] DEBUG .ExecutorAllocationManager [] [akka://JobServer/user/jobManager] - Not adding executors because our current target total is already 600 (limit 600)
[2018-04-12 16:07:21,072] DEBUG .ExecutorAllocationManager [] [akka://JobServer/user/jobManager] - Not adding executors because our current target total is already 600 (limit 600)
[2018-04-12 16:07:22,073] DEBUG .ExecutorAllocationManager [] [akka://JobServer/user/jobManager] - Not adding executors because our current target total is already 600 (limit 600)
[2018-04-12 16:07:23,074] DEBUG .ExecutorAllocationManager [] [akka://JobServer/user/jobManager] - Not adding executors because our current target total is already 600 (limit 600)
[2018-04-12 16:07:24,075] DEBUG .ExecutorAllocationManager [] [akka://JobServer/user/jobManager] - Not adding executors because our current target total is already 600 (limit 600)
[2018-04-12 16:07:25,076] DEBUG .ExecutorAllocationManager [] [akka://JobServer/user/jobManager] - Not adding executors because our current target total is already 600 (limit 600)
[2018-04-12 16:07:26,077] DEBUG .ExecutorAllocationManager [] [akka://JobServer/user/jobManager] - Not adding executors because our current target total is already 600 (limit 600)
[2018-04-12 16:07:27,078] DEBUG .ExecutorAllocationManager [] [akka://JobServer/user/jobManager] - Not adding executors because our current target total is already 600 (limit 600)
[2018-04-12 16:07:28,079] DEBUG .ExecutorAllocationManager [] [akka://JobServer/user/jobManager] - Not adding executors because our current target total is already 600 (limit 600)
$ grep "Not adding executors because our current target total" spark-job-server.log.9 | head
[2018-04-12 13:52:18,067] DEBUG .ExecutorAllocationManager [] [akka://JobServer/user/jobManager] - Not adding executors because our current target total is already 600 (limit 600)
[2018-04-12 13:52:19,071] DEBUG .ExecutorAllocationManager [] [akka://JobServer/user/jobManager] - Not adding executors because our current target total is already 600 (limit 600)
[2018-04-12 13:52:20,072] DEBUG .ExecutorAllocationManager [] [akka://JobServer/user/jobManager] - Not adding executors because our current target total is already 600 (limit 600)
[2018-04-12 13:52:21,073] DEBUG .ExecutorAllocationManager [] [akka://JobServer/user/jobManager] - Not adding executors because our current target total is already 600 (limit 600)
[2018-04-12 13:52:22,074] DEBUG .ExecutorAllocationManager [] [akka://JobServer/user/jobManager] - Not adding executors because our current target total is already 600 (limit 600)
[2018-04-12 13:52:23,075] DEBUG .ExecutorAllocationManager [] [akka://JobServer/user/jobManager] - Not adding executors because our current target total is already 600 (limit 600)
[2018-04-12 13:52:24,076] DEBUG .ExecutorAllocationManager [] [akka://JobServer/user/jobManager] - Not adding executors because our current target total is already 600 (limit 600)
[2018-04-12 13:52:25,077] DEBUG .ExecutorAllocationManager [] [akka://JobServer/user/jobManager] - Not adding executors because our current target total is already 600 (limit 600)
[2018-04-12 13:52:26,078] DEBUG .ExecutorAllocationManager [] [akka://JobServer/user/jobManager] - Not adding executors because our current target total is already 600 (limit 600)
[2018-04-12 13:52:27,079] DEBUG .ExecutorAllocationManager [] [akka://JobServer/user/jobManager] - Not adding executors because our current target total is already 600 (limit 600)
$ grep "Not adding executors because our current target total" spark-job-server.log.9 | wc -l
8111
```
These logs show that we kept `numExecutorsTarget == maxNumExecutors == 600`
without ever requesting new executors, while at that time only 7 executors
were actually available to our users.
The semantics of `client.requestTotalExecutors` are to request executors up
to a given total; on YARN, it ultimately just sets `targetNumExecutors` in
`YarnAllocator`. Calling this method repeatedly is therefore not a problem.
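Given that idempotence, the intended behaviour can be sketched as follows (a simplified, hypothetical model, not the actual patch): re-issue the request even at the cap, so a previously lost request gets re-delivered.

```scala
// Sketch of the proposed behaviour: issue requestTotalExecutors even when the
// target already equals the maximum. Because the call just (re)sets the desired
// total (on YARN, targetNumExecutors in YarnAllocator), repeating it is safe
// and re-delivers a request that was previously lost to an RPC failure.
object FixSketch {
  val maxNumExecutors = 600
  var numExecutorsTarget = 600
  var requestDelivered = false   // the original RPC failed

  def addExecutors(): Int = {
    // Re-issue the (idempotent) request before deciding how many to add.
    requestTotalExecutors(numExecutorsTarget)
    if (numExecutorsTarget >= maxNumExecutors) 0
    else maxNumExecutors - numExecutorsTarget
  }

  def requestTotalExecutors(n: Int): Unit = { requestDelivered = true }
}
```

With this ordering, the DEBUG loop above still adds zero executors at the cap, but each pass re-synchronizes the cluster manager with the intended target.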
## How was this patch tested?
Existing tests.
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/sadhen/spark jira23974
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/21059.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #21059
----
commit a7990db6238ce0a21f64492eaf15ec1b9c278e13
Author: 忍冬 <rendong@...>
Date: 2018-04-13T02:37:37Z
fix when numExecutorsTarget equals maxNumExecutors
----
---