GitHub user andrewor14 opened a pull request:

    https://github.com/apache/spark/pull/2746

    [WIP][SPARK-3795] Heuristics for dynamically scaling executors

    This is part of a bigger effort to provide elastic scaling of executors 
within a Spark application 
([SPARK-3174](https://issues.apache.org/jira/browse/SPARK-3174)). This PR does 
not provide any functionality by itself; it is a skeleton that does not yet 
ask the cluster manager for executors. That backend mechanism will be added 
later in [SPARK-3822](https://issues.apache.org/jira/browse/SPARK-3822).
    
    The design doc can be found at 
[SPARK-3174](https://issues.apache.org/jira/browse/SPARK-3174) (under 
"Heuristics for Scaling Executors"). This PR deviates from the design in two 
ways: (1) the remove policy is significantly simplified by cutting out the 
exponential increase logic that mirrors the add policy, and (2) the 
configuration of the add policy is split into `addExecutorThreshold` and 
`addExecutorInterval`. The main reason for the split is that a user may want 
to configure how quickly executors are added while keeping the trigger 
condition the same.
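
To illustrate the split between the two settings, here is a minimal sketch in Python (not the PR's Scala code). It assumes one reading of the description: `addExecutorThreshold` is how long tasks must stay backlogged before the first add request, and `addExecutorInterval` is the gap between subsequent requests. The function name `add_request_times`, its parameters, and the use of seconds are hypothetical.

```python
def add_request_times(backlog_start, add_executor_threshold,
                      add_executor_interval, rounds):
    """Return the times (in seconds) at which add requests would fire.

    Assumption: the first request fires once the backlog has lasted
    `add_executor_threshold` seconds; each later request fires
    `add_executor_interval` seconds after the previous one.
    """
    first = backlog_start + add_executor_threshold
    return [first + i * add_executor_interval for i in range(rounds)]


# Same trigger condition (60 s), two different add rates.
slow = add_request_times(0, 60, 30, 3)
fast = add_request_times(0, 60, 5, 3)
print(slow)  # [60, 90, 120]
print(fast)  # [60, 65, 70]
```

Under this reading, changing the interval alters only the spacing of later requests; the moment of the first request stays the same.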
    
    This is work-in-progress because it is missing two things: (1) logic for 
retrying a request to add/remove executors if the request is not fulfilled, and 
(2) unit tests.
    
    Comments and feedback are most welcome, but please keep in mind that this 
is still WIP.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/andrewor14/spark scaling-heuristics

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/2746.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #2746
    
----
commit b2e6dcc91897b9bd8718f1ec8d9f16b2f8075010
Author: Andrew Or <[email protected]>
Date:   2014-10-07T22:52:46Z

    Add skeleton interface for requesting / killing executors

commit 6f1fa66f0ce0b141a934ade3235d873e6e1d8a60
Author: Andrew Or <[email protected]>
Date:   2014-10-08T23:06:23Z

    First cut implementation of adding executors dynamically
    
    This provides a framework that keeps track of when to add and
    remove executors, and how many. The add part of this framework
    is fully implemented minus the backend, i.e. we do not yet
    actually ask the cluster manager for more executors, as that
    depends on the backend implementation.

commit 4077ae21a05dc1af9985bf2cf419d5ecff6fde99
Author: Andrew Or <[email protected]>
Date:   2014-10-08T23:41:01Z

    Minor code re-organization

commit 20ec6b9955113d7dedc915825b54bbf36dd820e0
Author: Andrew Or <[email protected]>
Date:   2014-10-09T02:54:48Z

    First cut implementation of removing executors dynamically
    
    This sets an idle timer on each executor every time it is launched
    or has finished a task. The remove executor timer is then started
    as soon as one of the idle executor timers is triggered. An idle
    executor timer is cancelled when the executor starts running
    a task again, and the remove executor timer is cancelled when
    there are no more idle executors.
    
    An important TODO is to deal with synchronization. I have
    witnessed multiple remove executor timers being started due
    to the lack of synchronization.
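
The per-executor idle timers described in this commit can be sketched as follows. This is a simplified, single-threaded Python illustration, not the PR's code: the class `IdleTracker`, its method names, and the explicit `now` argument (used instead of real timers so the example is deterministic) are all hypothetical. It covers only the idle timers, not the separate remove executor timer, and it ignores synchronization.

```python
class IdleTracker:
    """Tracks when each executor became idle (hypothetical helper)."""

    def __init__(self, idle_timeout):
        self.idle_timeout = idle_timeout
        self.idle_since = {}  # executor id -> time it became idle

    def on_executor_launched(self, executor_id, now):
        # A freshly launched executor starts out idle.
        self.idle_since[executor_id] = now

    def on_task_start(self, executor_id):
        # Running a task cancels the executor's idle timer.
        self.idle_since.pop(executor_id, None)

    def on_task_end(self, executor_id, now):
        # Assumption: the executor has no other running tasks.
        self.idle_since[executor_id] = now

    def expired(self, now):
        """Return executors whose idle timer has fired by `now`."""
        return sorted(e for e, t in self.idle_since.items()
                      if now - t >= self.idle_timeout)


tracker = IdleTracker(idle_timeout=10)
tracker.on_executor_launched("a", now=0)
tracker.on_executor_launched("b", now=0)
tracker.on_task_start("a")
print(tracker.expired(now=10))  # ['b']
tracker.on_task_end("a", now=12)
print(tracker.expired(now=22))  # ['a', 'b']
```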

commit ae5b64a1d2572321931e95cb0297a7e3636470c4
Author: Andrew Or <[email protected]>
Date:   2014-10-09T20:10:58Z

    Add synchronization
    
    Synchronization is applied somewhat profusely in both classes.
    This is acceptable because the task scheduler is single-threaded
    and normally does not schedule multiple tasks concurrently.
    Synchronization is still needed, however, when executors are added
    or removed, but these are relatively rare events, so it is OK to
    synchronize this way.

commit 1cc84446d18d1b1078e98260856ffc278239d614
Author: Andrew Or <[email protected]>
Date:   2014-10-09T20:22:20Z

    Minor wording change

commit 89019008088dd8cc34baf1c9e65ceaba66889f62
Author: Andrew Or <[email protected]>
Date:   2014-10-09T23:05:46Z

    Simplify remove policy + change the semantics of add policy
    
    With the latest changes, exponential increase only applies to the
    add policy. Then, the add threshold and the add interval are
    decoupled into separate configurations. The remove policy is made
    simpler: an executor is removed as soon as it is marked as idle.
    
    This also updates the comments to explain why we use exponential
    increase in the number of executors when adding executors.
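
As a rough illustration of exponential increase in the add policy, the Python sketch below doubles the batch size on each round. It is an assumption-laden example rather than the PR's implementation: the starting batch size of 1, the doubling factor, and the names `add_batches`, `current`, and `max_executors` are hypothetical.

```python
def add_batches(current, max_executors, rounds):
    """Return how many executors to request in each successive round.

    Assumption: the batch size starts at 1 and doubles each round, and
    the running total never exceeds `max_executors`.
    """
    batch = 1
    requested = []
    for _ in range(rounds):
        room = max_executors - current
        if room <= 0:
            break  # already at the upper bound
        to_add = min(batch, room)
        requested.append(to_add)
        current += to_add
        batch *= 2
    return requested


print(add_batches(current=2, max_executors=12, rounds=5))  # [1, 2, 4, 3]
```

Starting small keeps the first request cheap if only a few extra executors turn out to be needed, while doubling lets a heavily backlogged application reach the upper bound in a few rounds.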

commit 6c48ab003af8b809ccc05b58e12148c2e89fcfb4
Author: Andrew Or <[email protected]>
Date:   2014-10-09T23:13:40Z

    Update synchronization comment

commit 67c03c7ca4781d62da41b995e59014c374931e99
Author: Andrew Or <[email protected]>
Date:   2014-10-10T02:26:52Z

    Correct semantics of adding executors + update comments
    
    Previously, we just kept asking for more executors whether or not
    we had been granted the ones we asked for in the past. This is
    far from cautious and could lead to a flood of add executor
    requests if, for instance, the cluster manager cannot accommodate
    them all. This commit adds guards against attempting to request
    more executors before the pending ones have all registered.
    
    An important TODO at this point is to retry these requests if
    they're not immediately granted. We can do this after a timeout.
    The tricky thing there is that if the retry is too eager then we
    may end up exceeding the upper bound on the number of executors.
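
The guard described in this commit can be sketched as below. This is a hypothetical Python illustration, not the PR's code: the class `AddGuard` and its fields are invented for the example, and the retry-after-timeout TODO is deliberately left out.

```python
class AddGuard:
    """Refuses new add requests while earlier ones are still pending."""

    def __init__(self, max_executors):
        self.max_executors = max_executors  # upper bound on executors
        self.registered = 0                 # executors that have registered
        self.pending = 0                    # requested but not yet registered

    def request(self, n):
        """Return how many executors are actually requested."""
        if self.pending > 0:
            return 0  # an earlier request has not been fulfilled yet
        room = self.max_executors - self.registered
        to_request = max(0, min(n, room))
        self.pending = to_request
        return to_request

    def on_executor_registered(self):
        self.registered += 1
        if self.pending > 0:
            self.pending -= 1


guard = AddGuard(max_executors=4)
print(guard.request(2))  # 2
print(guard.request(4))  # 0, the first two are still pending
guard.on_executor_registered()
guard.on_executor_registered()
print(guard.request(4))  # 2, capped by the upper bound
```

A retry added on top of this sketch would need to count pending executors against the upper bound; otherwise an eager retry could exceed it, which is the difficulty this commit message points out.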

----


