GitHub user StephanEwen opened a pull request:
https://github.com/apache/flink/pull/3295
[FLINK-5747] [distributed coordination] Eager scheduling allocates slots
and deploys tasks in bulk
## Problem Addressed
Currently, eager scheduling immediately triggers the scheduling for all
vertices and their subtasks in topological order.
This has two problems:
- This works only, as long as resource acquisition is "synchronous". With
dynamic resource acquisition in FLIP-6, the resources are returned as Futures
which may complete out of order. This results in out-of-order (not in
topological order) scheduling of tasks which does not work for streaming.
- Deploying some tasks that depend on other tasks before it is clear that
the other tasks have resources as well leads to situations where many
deploy/recovery cycles happen before enough resources are available to get the
job running fully.
## Implemented Change
- The `Execution` has separate methods to allocate a resource and to
deploy the task to that resource
- The **eager** scheduling mode allocates all resources in one chunk and
then deploys once all resources are available.
As a utility, this implements the `FutureUtils.combineAll` method that
combines the Futures of the individual resources to a combined Future.
## Tests
The main tests are in `ExecutionGraphSchedulingTest`. The used utilities
are tested in `FutureUtilsTest` and in `ExecutionGraphUtilsTest`.
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/StephanEwen/incubator-flink slot_scheduling
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/flink/pull/3295.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #3295
----
commit 1f18cbb0d6d119fa5e5c4803201c28887b90cef5
Author: Stephan Ewen <[email protected]>
Date: 2017-02-03T19:26:23Z
[FLINK-5747] [distributed coordination] Eager scheduling allocates slots
and deploys tasks in bulk
That way, strictly topological deployment can be guaranteed.
Also, many quick deploy/not-enough-resources/fail/recover cycles can be
avoided in the cases where resources need some time to appear.
----
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at [email protected] or file a JIRA ticket
with INFRA.
---