GitHub user tillrohrmann opened a pull request:
https://github.com/apache/flink/pull/1213
[FLINK-2790] [yarn] [ha] Adds high availability support for Yarn
Adds high availability support for Yarn by exploiting Yarn's ability to
restart a failed application master. The behaviour depends on the Hadoop
version: each version range listed below adds to the behaviour of the
preceding range.
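The cumulative, version-dependent behaviour can be sketched as follows. This is only an illustration, not code from this PR: the class name, method names and feature labels are hypothetical, and the sketch assumes a purely numeric version string such as `2.4.1`. The thresholds are the ones given in this description.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper (not part of this PR): maps a Hadoop version to the HA settings that apply. */
final class YarnHaFeatureSketch {

    /** Compares two dotted numeric versions, e.g. "2.4.0" and "2.10.1"; missing parts count as 0. */
    static int compareVersions(String a, String b) {
        String[] x = a.split("\\.");
        String[] y = b.split("\\.");
        for (int i = 0; i < Math.max(x.length, y.length); i++) {
            int xi = i < x.length ? Integer.parseInt(x[i]) : 0;
            int yi = i < y.length ? Integer.parseInt(y[i]) : 0;
            if (xi != yi) {
                return Integer.compare(xi, yi);
            }
        }
        return 0;
    }

    /** Returns the (hypothetically named) features enabled for the given Hadoop version. */
    static List<String> enabledFeatures(String hadoopVersion) {
        List<String> features = new ArrayList<>();
        // Each threshold adds one feature on top of the previous ones.
        if (compareVersions(hadoopVersion, "2.3.0") >= 0) {
            features.add("max-app-attempts");
        }
        if (compareVersions(hadoopVersion, "2.4.0") >= 0) {
            features.add("keep-containers-across-attempts");
        }
        if (compareVersions(hadoopVersion, "2.6.0") >= 0) {
            features.add("attempt-failures-validity-interval");
        }
        return features;
    }

    public static void main(String[] args) {
        for (String version : new String[] {"2.3.0", "2.4.1", "2.6.0"}) {
            System.out.println(version + " -> " + enabledFeatures(version));
        }
    }
}
```

Versions are compared part by part as integers rather than as strings, so that a version such as `2.10.1` sorts after `2.6.0`.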
### 2.3.0 <= version < 2.4.0
* Sets the number of application attempts to the configuration value
`yarn.application-attempts`. This means that the application can be restarted
`yarn.application-attempts` times before Yarn fails the application. In case of
an application master failure, all TaskManager containers are killed as well.
### 2.4.0 <= version < 2.6.0
* Additionally, keeps containers across application attempts. This avoids
killing the TaskManager containers when the application master fails. The
advantage is a faster start-up, because the user does not have to wait for the
container resources to be allocated again.
### 2.6.0 <= version
* Additionally, sets the attempts failure validity interval to the Akka
timeout value. With this interval, an application is only killed after the
system has seen the maximum number of application attempts within one
interval. This prevents a long-running job from depleting its application
attempts.
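The effect of the attempts failure validity interval can be illustrated with a small simulation. This is a sketch under stated assumptions, not Yarn's implementation: the class and method names are hypothetical, timestamps are assumed to be non-decreasing, and the boundary handling (a failure exactly one interval old still counts) is an arbitrary choice.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** Hypothetical model (not Yarn code): counts only failures inside a sliding validity interval. */
final class AttemptFailureWindowSketch {
    private final int maxAttempts;
    private final long validityIntervalMs;
    // Timestamps of the failures that still count, oldest first.
    private final Deque<Long> recentFailures = new ArrayDeque<>();

    AttemptFailureWindowSketch(int maxAttempts, long validityIntervalMs) {
        this.maxAttempts = maxAttempts;
        this.validityIntervalMs = validityIntervalMs;
    }

    /**
     * Records a failed attempt at the given time (assumed non-decreasing) and
     * returns true if the application would now be failed for good.
     */
    boolean recordFailure(long timestampMs) {
        // Forget failures that are older than the validity interval.
        while (!recentFailures.isEmpty()
                && timestampMs - recentFailures.peekFirst() > validityIntervalMs) {
            recentFailures.pollFirst();
        }
        recentFailures.addLast(timestampMs);
        return recentFailures.size() >= maxAttempts;
    }

    public static void main(String[] args) {
        // Two attempts allowed, failures expire after one second.
        AttemptFailureWindowSketch windowed = new AttemptFailureWindowSketch(2, 1_000L);
        System.out.println(windowed.recordFailure(0L));
        System.out.println(windowed.recordFailure(5_000L));
        System.out.println(windowed.recordFailure(5_500L));
    }
}
```

In this model, two failures that are far apart never exhaust the attempts, whereas two failures within the same interval do. Passing a very large interval approximates the behaviour without the interval, where every failure counts for the lifetime of the application.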
This PR also refactors the different Yarn components to allow the start-up
of testing actors within Yarn. Furthermore, the `JobManager` start-up logic is
slightly extended to allow code reuse in the `ApplicationMasterBase`.
The HA functionality is tested via the `YARNHighAvailabilityITCase`, in which
an application master is killed multiple times. Each time, the test checks that
the single TaskManager successfully reconnects to the newly started
`YarnJobManager`. In case of version `2.3.0`, the `TaskManager` is restarted.
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/tillrohrmann/flink yarnHA
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/flink/pull/1213.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #1213
----
commit 1a18172ae69eb576638704f8e143a921aa8630d5
Author: Till Rohrmann <[email protected]>
Date: 2015-09-01T14:35:48Z
[FLINK-2790] [yarn] [ha] Adds high availability support for Yarn
commit 5359676556d16610303d4f36fcbe5320ef4e6643
Author: Till Rohrmann <[email protected]>
Date: 2015-09-23T15:42:57Z
Refactors JobManager's start actors method to be reusable
commit d6a47cd8ad265c5122d1a67c09773cbc5a491261
Author: Till Rohrmann <[email protected]>
Date: 2015-09-24T12:55:18Z
Yarn refactoring to introduce yarn testing functionality
commit f9578f136dd41cd9829d712f7c62a59c9ea8e194
Author: Till Rohrmann <[email protected]>
Date: 2015-09-28T16:21:30Z
Added support for testing yarn cluster. Extracted JobManager's and
TaskManager's testing messages into stackable traits.
commit dbfa16438ad9d7d61e8d1a582c8cd1de9352078e
Author: Till Rohrmann <[email protected]>
Date: 2015-09-29T15:05:01Z
Implemented YarnHighAvailabilityITCase using Akka messages for
synchronization.
----