GitHub user andrewor14 opened a pull request:
https://github.com/apache/spark/pull/2746
[WIP][SPARK-3795] Heuristics for dynamically scaling executors
This is part of a bigger effort to provide elastic scaling of executors
within a Spark application
([SPARK-3174](https://issues.apache.org/jira/browse/SPARK-3174)). This PR does
not provide any functionality by itself; it is a skeleton that is missing a
mechanism to be added later in
[SPARK-3822](https://issues.apache.org/jira/browse/SPARK-3822).
The design doc can be found at
[SPARK-3174](https://issues.apache.org/jira/browse/SPARK-3174) (under
"Heuristics for Scaling Executors". This deviates from the design in two ways:
(1) the remove policy is significantly simplified by cutting out the
exponential increase logic that mirrors the add policy, and (2) the
configuration of the add policy is split up into `addExecutorThreshold` and
`addExecutorInterval`. The main reason for this is that the user may want to
configure how quickly executors are added, but want to keep the trigger
condition the same.
This is work-in-progress because it is missing two things: (1) logic for
retrying a request to add/remove executors if the request is not fulfilled, and
(2) unit tests.
Comments and feedback are most welcome, but please keep in mind that this
is still WIP.
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/andrewor14/spark scaling-heuristics
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/2746.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #2746
----
commit b2e6dcc91897b9bd8718f1ec8d9f16b2f8075010
Author: Andrew Or <[email protected]>
Date: 2014-10-07T22:52:46Z
Add skeleton interface for requesting / killing executors
commit 6f1fa66f0ce0b141a934ade3235d873e6e1d8a60
Author: Andrew Or <[email protected]>
Date: 2014-10-08T23:06:23Z
First cut implementation of adding executors dynamically
This provides a framework that keeps track of when to add and
remove executors, and how many. The add part of this framework
is fully implemented minus the backend, i.e. we do not yet
actually ask the cluster manager for more executors, as that
depends on the backend implementation.
commit 4077ae21a05dc1af9985bf2cf419d5ecff6fde99
Author: Andrew Or <[email protected]>
Date: 2014-10-08T23:41:01Z
Minor code re-organization
commit 20ec6b9955113d7dedc915825b54bbf36dd820e0
Author: Andrew Or <[email protected]>
Date: 2014-10-09T02:54:48Z
First cut implementation of removing executors dynamically
This sets a timer on each executor every time it is launched or
has finished a task. The remove executor timer is then started
as soon as one of the idle executors is triggered. An idle
executor timer is cancelled when the executor starts running
a task again, and the remove executor timer is cancelled when
there are no more idle executors.
An important TODO is to deal with synchronization. I have
witnessed multiple remove executor timers being started due
to the lack of synchronization.
commit ae5b64a1d2572321931e95cb0297a7e3636470c4
Author: Andrew Or <[email protected]>
Date: 2014-10-09T20:10:58Z
Add synchronization
Synchronization is applied somewhat profusely in both classes.
This is because the task scheduler is single-threaded, and normally
does not schedule multiple tasks concurrently. Synchronization is
still needed, however, when executors are added or removed, but
these are relatively rare events so it is OK to synchronize this way.
commit 1cc84446d18d1b1078e98260856ffc278239d614
Author: Andrew Or <[email protected]>
Date: 2014-10-09T20:22:20Z
Minor wording change
commit 89019008088dd8cc34baf1c9e65ceaba66889f62
Author: Andrew Or <[email protected]>
Date: 2014-10-09T23:05:46Z
Simplify remove policy + change the semantics of add policy
With the latest changes, exponential increase only applies to the
add policy. Then, the add threshold and the add interval are
decoupled into separate configurations. The remove policy is made
simpler: an executor is removed as soon as it is marked as idle.
This also updates the comments to explain why we use exponential
increase in the number of executors when adding executors.
commit 6c48ab003af8b809ccc05b58e12148c2e89fcfb4
Author: Andrew Or <[email protected]>
Date: 2014-10-09T23:13:40Z
Update synchronization comment
commit 67c03c7ca4781d62da41b995e59014c374931e99
Author: Andrew Or <[email protected]>
Date: 2014-10-10T02:26:52Z
Correct semantics of adding executors + update comments
Previously, we just kept asking for more executors whether or not
we are granted the ones we have asked for in the past. This is
extremely not cautious and could lead to an overflow of add executor
requests, if the cluster manager cannot accommodate all the requests,
for instance. This commit adds the correct guards against attempting
to request more executors before the pending ones have all registered.
An important TODO at this point is to retry these requests if
they're not immediately granted. We can do this after a timeout.
The tricky thing there is that if the retry is too eager then we
may end up exceeding the upper bound on the number of executors.
----
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at [email protected] or file a JIRA ticket
with INFRA.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]