GitHub user rdblue opened a pull request:

    https://github.com/apache/spark/pull/17813

    [SPARK-20540][CORE] Fix unstable executor requests.

    There are two problems fixed in this commit. First, the
    ExecutorAllocationManager sets a timeout to avoid requesting executors
    too often. However, the timeout is always advanced from its previous
    value rather than from the current time. If a call is delayed by lock
    contention for longer than the scheduler timeout, the deadline falls
    permanently behind and the manager requests more executors on every
    run. This appears to be the main cause of SPARK-20540.
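    The timing bug described above can be shown with a minimal sketch.
    The names below are illustrative, not Spark's actual fields: the
    point is only that advancing a deadline from its previous value,
    instead of from the current time, makes every call fire after one
    long stall.

```java
// Hypothetical sketch of the deadline-drift bug (illustrative names,
// not Spark's actual fields). Times are in milliseconds.
public class BacklogTimer {
    private static final long TIMEOUT_MS = 1000;
    private long deadline = 0;

    // Buggy: the deadline is advanced relative to its old value, so
    // after a long delay it stays behind `now` and every call fires.
    boolean shouldRequestBuggy(long now) {
        if (now >= deadline) {
            deadline += TIMEOUT_MS;
            return true;
        }
        return false;
    }

    // Fixed: rebase the deadline on the current time, so the caller
    // must wait a full timeout before the next request.
    boolean shouldRequestFixed(long now) {
        if (now >= deadline) {
            deadline = now + TIMEOUT_MS;
            return true;
        }
        return false;
    }
}
```

    With the buggy version, a single 5-second stall leaves the deadline
    seconds in the past, so back-to-back calls all return true; the
    fixed version waits a full timeout after each request.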
    
    The second problem is that the CoarseGrainedSchedulerBackend does not
    track the total number of requested executors. Instead, it derives
    the value from the current state of three variables: the number of
    known executors, the number of executors that have been killed, and
    the number of pending executors. Because the pending count is clamped
    at zero, the derived total can never fall below the number of known
    executors, even when fewer executors have been requested than are
    known. When executors are killed and not replaced, the request sent
    to YARN can therefore ask for too many executors, because the
    scheduler's state is slightly out of date. This is fixed by tracking
    the currently requested total explicitly.
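    The derivation problem can also be sketched minimally. Again the
    names are hypothetical, not Spark's actual fields: a total derived
    from a pending count clamped at zero can never drop below the number
    of known executors, while an explicitly tracked target can.

```java
// Hypothetical sketch of the derivation bug (illustrative names only).
public class ExecutorTarget {
    int known = 10;           // executors the scheduler knows about
    int pendingToRemove = 0;  // kill requests in flight
    int requestedTotal = 10;  // explicitly tracked target (the fix)

    // Buggy derivation: pending is clamped at zero, so the result can
    // never fall below `known`, even if fewer executors were requested.
    int derivedTotal(int target) {
        int pending = Math.max(target - known, 0);
        return known + pending - pendingToRemove;
    }

    // Fixed: record the requested total when it changes and report it
    // directly instead of re-deriving it from possibly stale state.
    int requestTotal(int target) {
        requestedTotal = target;
        return requestedTotal;
    }
}
```

    If the target drops to 5 while 10 executors are still known, the
    derived value stays at 10, overshooting the request sent to YARN;
    the explicitly tracked total reports 5.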
    
    ## How was this patch tested?
    
    Existing tests.


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/rdblue/spark SPARK-20540-fix-dynamic-allocation

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/17813.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #17813
    
----
commit 96a76868abe168d17b24011d9090cb495c8c1c3f
Author: Ryan Blue <b...@apache.org>
Date:   2017-04-24T18:42:57Z

    SPARK-20540: Fix unstable executor requests.
    
    There are two problems fixed in this commit. First, the
    ExecutorAllocationManager sets a timeout to avoid requesting executors
    too often. However, the timeout is always advanced from its previous
    value rather than from the current time. If a call is delayed by lock
    contention for longer than the scheduler timeout, the deadline falls
    permanently behind and the manager requests more executors on every
    run.
    
    The second problem is that the CoarseGrainedSchedulerBackend does not
    track the total number of requested executors. Instead, it derives
    the value from the current state of three variables: the number of
    known executors, the number of executors that have been killed, and
    the number of pending executors. Because the pending count is clamped
    at zero, the derived total can never fall below the number of known
    executors, even when fewer executors have been requested than are
    known. When executors are killed and not replaced, the request sent
    to YARN can therefore ask for too many executors, because the
    scheduler's state is slightly out of date.

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org
