[ https://issues.apache.org/jira/browse/FLINK-4944?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15629399#comment-15629399 ]
ASF GitHub Bot commented on FLINK-4944:
---------------------------------------
GitHub user tillrohrmann opened a pull request:
https://github.com/apache/flink/pull/2742
[FLINK-4944] Replace Akka's death watch with own heartbeat on the TM side
This PR introduces the HeartbeatActor, which the TaskManager uses to
monitor the JobManager. The HeartbeatActor periodically sends Heartbeat
messages to the JobManager, which responds with a HeartbeatResponse. If no
HeartbeatResponse is received within the acceptable heartbeat pause, the
HeartbeatActor sends a HeartbeatTimeout message to its owner.
The HeartbeatActor can extend the acceptable heartbeat pause if it detects
that it has itself been stalled, for example by garbage collection.
The HeartbeatActor is started as a child actor of the TaskManager.
Add ClusterOptions
Add comments
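The timeout rule described above can be sketched roughly as follows. This is
not the code from the pull request: the class HeartbeatMonitorSketch, its
method names, and the way a stall is detected (a tick that arrives later than
one heartbeat interval after the previous tick) are all assumptions made for
illustration. Time is passed in explicitly so the logic can be followed
without actors or timers.

```java
// Hypothetical sketch only: class, field, and method names are illustrative
// and are not taken from the Flink code base or from this pull request.
final class HeartbeatMonitorSketch {
    private final long intervalMillis;         // expected time between ticks
    private final long acceptablePauseMillis;  // allowed silence from the peer
    private long lastTickMillis;               // when the monitor last ran
    private long deadlineMillis;               // response must arrive by here

    HeartbeatMonitorSketch(long intervalMillis, long acceptablePauseMillis, long nowMillis) {
        this.intervalMillis = intervalMillis;
        this.acceptablePauseMillis = acceptablePauseMillis;
        this.lastTickMillis = nowMillis;
        this.deadlineMillis = nowMillis + acceptablePauseMillis;
    }

    // A response from the monitored peer pushes the deadline forward.
    void onHeartbeatResponse(long nowMillis) {
        deadlineMillis = nowMillis + acceptablePauseMillis;
    }

    // Called once per heartbeat interval, after sending a heartbeat.
    // Returns true if a timeout should be reported to the owner.
    boolean onTick(long nowMillis) {
        // Assumption: if this tick is late, the monitor itself was stalled
        // (for example by garbage collection), so the peer gets that much
        // extra time before it is considered dead.
        long stallMillis = (nowMillis - lastTickMillis) - intervalMillis;
        lastTickMillis = nowMillis;
        if (stallMillis > 0) {
            deadlineMillis += stallMillis;
        }
        return nowMillis > deadlineMillis;
    }
}
```

In this sketch, a caller that sees onTick return true would play the role of
the HeartbeatActor sending HeartbeatTimeout to its owner.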
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/tillrohrmann/flink removeDeathWatch
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/flink/pull/2742.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #2742
----
commit 4437ef25a3f7a084b3f1a577411a7863410bfde3
Author: Till Rohrmann <[email protected]>
Date: 2016-11-01T20:14:40Z
[FLINK-4944] Replace Akka's death watch with own heartbeat on the TM side
----
> Replace Akka's death watch with own heartbeat on the TM side
> ------------------------------------------------------------
>
> Key: FLINK-4944
> URL: https://issues.apache.org/jira/browse/FLINK-4944
> Project: Flink
> Issue Type: Improvement
> Components: TaskManager
> Affects Versions: 1.2.0
> Reporter: Till Rohrmann
> Assignee: Till Rohrmann
> Fix For: 1.2.0
>
>
> In order to properly implement FLINK-3347, the {{TaskManager}} must no longer
> use Akka's death watch mechanism to detect {{JobManager}} failures. The
> reason is that a hard {{JobManager}} failure will lead to quarantining the
> {{JobManager's}} {{ActorSystem}} by the {{TaskManagers}}. This in combination
> with FLINK-3347 would lead to a shutdown of all {{TaskManagers}}.
> Instead we should use our own heartbeat signal to detect dead
> {{JobManagers}}. In case of a heartbeat timeout, the {{TaskManager}} won't
> shut down but simply cancel and clear everything.