GitHub user sadhen opened a pull request:
https://github.com/apache/spark/pull/21059
fix when numExecutorsTarget equals maxNumExecutors
## What changes were proposed in this pull request?
In dynamic allocation, there are cases where `numExecutorsTarget` has reached
`maxNumExecutors`, but for some reason (`client.requestTotalExecutors` did not
work as expected, or threw an exception due to an RPC failure) the request
never took effect. In that state, `addExecutors` always returns 0 without
calling `client.requestTotalExecutors` again. When there are too many pending
tasks, `maxNeeded < numExecutorsTarget` is false, so
`updateAndSyncNumExecutorsTarget` keeps running into `addExecutors` with
`numExecutorsTarget == maxNumExecutors`. Since `numExecutorsTarget` is hard to
decrease, the result is that we handle the heavy workload with only a few
executors and never dynamically increase their number.
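To make the failure mode concrete, here is a minimal sketch (simplified, hypothetical names loosely mirroring `ExecutorAllocationManager`; this is illustrative, not the actual Spark code): once the target equals the maximum, the early return means a request lost to an RPC failure is never retried.

```scala
// Illustrative model of the stall. Once numExecutorsTarget == maxNumExecutors,
// addExecutors returns 0 without re-issuing the request, so a request that was
// lost to an RPC failure is never re-delivered to the cluster manager.
object StallSketch {
  val maxNumExecutors = 600
  var numExecutorsTarget = 600   // already at the cap
  var requestDelivered = false   // the original RPC failed

  // Buggy behaviour: early return at the cap, no call to the client.
  def addExecutors(): Int = {
    if (numExecutorsTarget >= maxNumExecutors) {
      // logs: "Not adding executors because our current target total is already ..."
      return 0
    }
    requestTotalExecutors(numExecutorsTarget)
    0
  }

  def requestTotalExecutors(n: Int): Unit = { requestDelivered = true }

  def main(args: Array[String]): Unit = {
    // maxNeeded < numExecutorsTarget is false, so the sync loop keeps calling
    // addExecutors -- but the cluster manager never hears about the target.
    (1 to 3).foreach(_ => addExecutors())
    println(s"requestDelivered = $requestDelivered")
  }
}
```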
Online logs:
```
$ grep "Not adding executors because our current target total" spark-job-server.log.9 | tail
[2018-04-12 16:07:19,070] DEBUG .ExecutorAllocationManager [] [akka://JobServer/user/jobManager] - Not adding executors because our current target total is already 600 (limit 600)
[2018-04-12 16:07:20,071] DEBUG .ExecutorAllocationManager [] [akka://JobServer/user/jobManager] - Not adding executors because our current target total is already 600 (limit 600)
[2018-04-12 16:07:21,072] DEBUG .ExecutorAllocationManager [] [akka://JobServer/user/jobManager] - Not adding executors because our current target total is already 600 (limit 600)
[2018-04-12 16:07:22,073] DEBUG .ExecutorAllocationManager [] [akka://JobServer/user/jobManager] - Not adding executors because our current target total is already 600 (limit 600)
[2018-04-12 16:07:23,074] DEBUG .ExecutorAllocationManager [] [akka://JobServer/user/jobManager] - Not adding executors because our current target total is already 600 (limit 600)
[2018-04-12 16:07:24,075] DEBUG .ExecutorAllocationManager [] [akka://JobServer/user/jobManager] - Not adding executors because our current target total is already 600 (limit 600)
[2018-04-12 16:07:25,076] DEBUG .ExecutorAllocationManager [] [akka://JobServer/user/jobManager] - Not adding executors because our current target total is already 600 (limit 600)
[2018-04-12 16:07:26,077] DEBUG .ExecutorAllocationManager [] [akka://JobServer/user/jobManager] - Not adding executors because our current target total is already 600 (limit 600)
[2018-04-12 16:07:27,078] DEBUG .ExecutorAllocationManager [] [akka://JobServer/user/jobManager] - Not adding executors because our current target total is already 600 (limit 600)
[2018-04-12 16:07:28,079] DEBUG .ExecutorAllocationManager [] [akka://JobServer/user/jobManager] - Not adding executors because our current target total is already 600 (limit 600)
$ grep "Not adding executors because our current target total" spark-job-server.log.9 | head
[2018-04-12 13:52:18,067] DEBUG .ExecutorAllocationManager [] [akka://JobServer/user/jobManager] - Not adding executors because our current target total is already 600 (limit 600)
[2018-04-12 13:52:19,071] DEBUG .ExecutorAllocationManager [] [akka://JobServer/user/jobManager] - Not adding executors because our current target total is already 600 (limit 600)
[2018-04-12 13:52:20,072] DEBUG .ExecutorAllocationManager [] [akka://JobServer/user/jobManager] - Not adding executors because our current target total is already 600 (limit 600)
[2018-04-12 13:52:21,073] DEBUG .ExecutorAllocationManager [] [akka://JobServer/user/jobManager] - Not adding executors because our current target total is already 600 (limit 600)
[2018-04-12 13:52:22,074] DEBUG .ExecutorAllocationManager [] [akka://JobServer/user/jobManager] - Not adding executors because our current target total is already 600 (limit 600)
[2018-04-12 13:52:23,075] DEBUG .ExecutorAllocationManager [] [akka://JobServer/user/jobManager] - Not adding executors because our current target total is already 600 (limit 600)
[2018-04-12 13:52:24,076] DEBUG .ExecutorAllocationManager [] [akka://JobServer/user/jobManager] - Not adding executors because our current target total is already 600 (limit 600)
[2018-04-12 13:52:25,077] DEBUG .ExecutorAllocationManager [] [akka://JobServer/user/jobManager] - Not adding executors because our current target total is already 600 (limit 600)
[2018-04-12 13:52:26,078] DEBUG .ExecutorAllocationManager [] [akka://JobServer/user/jobManager] - Not adding executors because our current target total is already 600 (limit 600)
[2018-04-12 13:52:27,079] DEBUG .ExecutorAllocationManager [] [akka://JobServer/user/jobManager] - Not adding executors because our current target total is already 600 (limit 600)
$ grep "Not adding executors because our current target total" spark-job-server.log.9 | wc -l
8111
```
These logs show that we kept `numExecutorsTarget == maxNumExecutors == 600`
without ever requesting new executors, while at that time only 7 executors
were actually available to our users.
The semantics of `client.requestTotalExecutors` are to request executors up
to a given total; on YARN, it ultimately just sets `targetNumExecutors` in
`YarnAllocator`. Calling this method repeatedly is therefore not a problem.
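Given that idempotence, the intended behaviour can be sketched as follows (a simplified, hypothetical model, not the actual patch): re-issue the request even at the cap, so a previously lost request gets re-delivered.

```scala
// Sketch of the proposed behaviour: issue requestTotalExecutors even when the
// target already equals the maximum. Because the call just (re)sets the desired
// total (on YARN, targetNumExecutors in YarnAllocator), repeating it is safe
// and re-delivers a request that was previously lost to an RPC failure.
object FixSketch {
  val maxNumExecutors = 600
  var numExecutorsTarget = 600
  var requestDelivered = false   // the original RPC failed

  def addExecutors(): Int = {
    // Re-issue the (idempotent) request before deciding how many to add.
    requestTotalExecutors(numExecutorsTarget)
    if (numExecutorsTarget >= maxNumExecutors) 0
    else maxNumExecutors - numExecutorsTarget
  }

  def requestTotalExecutors(n: Int): Unit = { requestDelivered = true }
}
```

With this ordering, the DEBUG loop above still adds zero executors at the cap, but each pass re-synchronizes the cluster manager with the intended target.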
## How was this patch tested?
Existing tests.
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/sadhen/spark jira23974
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/21059.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #21059
----
commit a7990db6238ce0a21f64492eaf15ec1b9c278e13
Author: 忍冬 <rendong@...>
Date: 2018-04-13T02:37:37Z
fix when numExecutorsTarget equals maxNumExecutors
----
---