[ https://issues.apache.org/jira/browse/FLINK-4944?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15629399#comment-15629399 ]

ASF GitHub Bot commented on FLINK-4944:
---------------------------------------

GitHub user tillrohrmann opened a pull request:

    https://github.com/apache/flink/pull/2742

    [FLINK-4944] Replace Akka's death watch with own heartbeat on the TM side

    This PR introduces the HeartbeatActor, which the TaskManager uses to
    monitor the JobManager. The HeartbeatActor periodically sends Heartbeat
    messages to the JobManager, which responds with a HeartbeatResponse. If
    no HeartbeatResponse is received within the acceptable heartbeat pause,
    the HeartbeatActor sends a HeartbeatTimeout message to the owner of the
    HeartbeatActor.
    
    The HeartbeatActor can extend the acceptable heartbeat pause if it
    detects that it has been stalled, for example by garbage collection.
    
    The HeartbeatActor is started as a child actor of the TaskManager.
    
    Add ClusterOptions
    
    Add comments
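
The timeout rule described above can be illustrated with a minimal,
self-contained sketch. This is not the HeartbeatActor from the pull request:
the class HeartbeatMonitor, its methods, and the stall heuristic (a tick that
arrives later than one interval is treated as a local stall) are hypothetical
names and assumptions chosen for illustration. Akka is left out, and
timestamps are passed in explicitly so the behaviour is deterministic.

```java
// Hypothetical sketch of a heartbeat timeout check with stall compensation.
// It does not reproduce the HeartbeatActor from the pull request.
final class HeartbeatMonitor {
    private final long intervalMillis;        // expected time between two ticks
    private final long acceptablePauseMillis; // allowed silence before a timeout
    private long lastTickMillis;              // when tick() last ran
    private long deadlineMillis;              // a response must arrive before this

    HeartbeatMonitor(long intervalMillis, long acceptablePauseMillis, long nowMillis) {
        this.intervalMillis = intervalMillis;
        this.acceptablePauseMillis = acceptablePauseMillis;
        this.lastTickMillis = nowMillis;
        this.deadlineMillis = nowMillis + acceptablePauseMillis;
    }

    // Called when a HeartbeatResponse arrives; pushes the deadline forward.
    void onHeartbeatResponse(long nowMillis) {
        deadlineMillis = nowMillis + acceptablePauseMillis;
    }

    // Called once per interval. Returns true if a HeartbeatTimeout should be
    // reported to the owner.
    boolean tick(long nowMillis) {
        long stallMillis = (nowMillis - lastTickMillis) - intervalMillis;
        lastTickMillis = nowMillis;
        if (stallMillis > 0) {
            // Assumption: a late tick means this side was stalled (for example
            // by garbage collection), so the deadline is extended by the stall.
            deadlineMillis += stallMillis;
        }
        return nowMillis > deadlineMillis;
    }
}
```

With an interval of 1000 ms and an acceptable pause of 3000 ms, a monitor
that last heard a response at 1200 ms reports a timeout on the tick at
5000 ms. A monitor whose own tick is delayed from 1000 ms to 6000 ms does not
report a timeout on that delayed tick, because the 4000 ms stall is added to
the deadline.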

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/tillrohrmann/flink removeDeathWatch

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/flink/pull/2742.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #2742
    
----
commit 4437ef25a3f7a084b3f1a577411a7863410bfde3
Author: Till Rohrmann <[email protected]>
Date:   2016-11-01T20:14:40Z

    [FLINK-4944] Replace Akka's death watch with own heartbeat on the TM side
    

----


> Replace Akka's death watch with own heartbeat on the TM side
> ------------------------------------------------------------
>
>                 Key: FLINK-4944
>                 URL: https://issues.apache.org/jira/browse/FLINK-4944
>             Project: Flink
>          Issue Type: Improvement
>          Components: TaskManager
>    Affects Versions: 1.2.0
>            Reporter: Till Rohrmann
>            Assignee: Till Rohrmann
>             Fix For: 1.2.0
>
>
> In order to properly implement FLINK-3347, the {{TaskManager}} must no longer
> use Akka's death watch mechanism to detect {{JobManager}} failures. The
> reason is that a hard {{JobManager}} failure causes the {{TaskManagers}} to
> quarantine the {{JobManager}}'s {{ActorSystem}}. In combination with
> FLINK-3347, this would lead to a shutdown of all {{TaskManagers}}.
> Instead, we should use our own heartbeat signal to detect dead
> {{JobManagers}}. In case of a heartbeat timeout, the {{TaskManager}} will
> not shut down but will simply cancel and clear everything.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
