[
https://issues.apache.org/jira/browse/FLINK-4141?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15359090#comment-15359090
]
ASF GitHub Bot commented on FLINK-4141:
---------------------------------------
GitHub user mxm opened a pull request:
https://github.com/apache/flink/pull/2190
[FLINK-4141] remove leaderUpdated() method from ResourceManager
This removes the leaderUpdated() method from the framework. Further, it
lets the RM client thread communicate directly with the
ResourceManager actor. This is safe because the two are always spawned
together. If the ResourceManager actor fails, messages from the RM
client thread are dropped. If the RM client thread fails, the
JobManager is informed.
The leaderUpdated() method was used to signal to the ResourceManager
framework that a new leader had been elected. However, the method was
not called on every leader change, only when a new leader was
elected. As a result, all messages from the async Yarn RM client
thread (YarnResourceManagerCallbackHandler) were dropped during the
window in which the old leader had failed and no new leader had yet
been elected, because the Yarn RM client thread used leader-tagged
messages to communicate with the main Flink ResourceManager actor.
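The failure window described above can be illustrated with a small,
self-contained model. This is only a sketch under stated assumptions,
not the code in this pull request: the class LeaderTaggedSketch, the
Receiver type, and the methods receiveTagged() and receiveDirect() are
hypothetical names, and the real Flink actors are not used. It shows a
receiver that drops leader-tagged messages while no leader is known, and
a direct path that bypasses the leader check.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.UUID;

/** Simplified, hypothetical model of leader-tagged vs. direct delivery. */
public class LeaderTaggedSketch {

    /** Stands in for the ResourceManager actor (assumption, not Flink code). */
    public static class Receiver {
        // null models the window in which the old leader is gone
        // and no new leader has been elected yet
        private UUID currentLeaderSessionId;
        public final List<String> delivered = new ArrayList<>();

        public void setLeaderSessionId(UUID id) {
            currentLeaderSessionId = id;
        }

        /** Leader-tagged path: dropped unless the tag matches the current leader. */
        public boolean receiveTagged(UUID tag, String payload) {
            if (currentLeaderSessionId == null || !currentLeaderSessionId.equals(tag)) {
                return false; // message is silently dropped
            }
            delivered.add(payload);
            return true;
        }

        /** Direct path: no leader check, so nothing is lost in the window. */
        public boolean receiveDirect(String payload) {
            delivered.add(payload);
            return true;
        }
    }

    public static void main(String[] args) {
        Receiver rm = new Receiver();
        UUID oldLeader = UUID.randomUUID();

        rm.setLeaderSessionId(oldLeader);
        System.out.println("tagged, leader known:  " + rm.receiveTagged(oldLeader, "container-1"));

        rm.setLeaderSessionId(null); // old leader failed, none elected yet
        System.out.println("tagged, no leader:     " + rm.receiveTagged(oldLeader, "container-2"));
        System.out.println("direct, no leader:     " + rm.receiveDirect("container-3"));
        System.out.println("delivered: " + rm.delivered);
    }
}
```

In this model the direct path corresponds to the change proposed here:
since the RM client thread and the ResourceManager actor are always
spawned together, the leader tag adds no protection and only creates the
window in which messages are lost.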
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/mxm/flink FLINK-4141
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/flink/pull/2190.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #2190
----
commit c758121b9e5e2d7de8318bd529aa5da88ed424c6
Author: Maximilian Michels <[email protected]>
Date: 2016-07-01T14:27:18Z
[FLINK-4141] remove leaderUpdated() method from ResourceManager
----
> TaskManager failures not always recover when killed during an
> ApplicationMaster failure in HA mode on Yarn
> ----------------------------------------------------------------------------------------------------------
>
> Key: FLINK-4141
> URL: https://issues.apache.org/jira/browse/FLINK-4141
> Project: Flink
> Issue Type: Bug
> Affects Versions: 1.0.3
> Reporter: Stefan Richter
>
> High availability on Yarn often fails to recover in the following test
> scenario:
> 1. Kill application master process.
> 2. Then, while application master is recovering, randomly kill several task
> managers (with some delay).
> After the application master has recovered, not all of the killed task
> managers are brought back, and no further attempts are made to restart them.