[ 
https://issues.apache.org/jira/browse/FLINK-3347?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15874711#comment-15874711
 ] 

ASF GitHub Bot commented on FLINK-3347:
---------------------------------------

GitHub user tillrohrmann opened a pull request:

    https://github.com/apache/flink/pull/3363

    [backport] [FLINK-3347] [akka] Add QuarantineMonitor which shuts a quarantined actor system and JVM down

    This is a backport of #2696 onto the `release-1.2` branch.
    
    The QuarantineMonitor subscribes to the actor system's event bus and listens for
    AssociationErrorEvents. These events are generated when the actor system has
    quarantined another actor system or has been quarantined by another actor system.
    When a quarantined state is detected, the actor system is shut down, killing all
    actors, and then the JVM is terminated.
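
The mechanism described above can be illustrated with a minimal, self-contained Java sketch. It is not the code from this pull request and does not use Akka: `EventBus`, `AssociationError`, and the `QuarantineMonitor` class below are hypothetical stand-ins, and recognizing quarantine by looking for the word "quarantined" in the error cause is an assumption made only for illustration. The shutdown step is passed in as a `Runnable` so that the sketch does not actually terminate the JVM.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.function.Consumer;

public class QuarantineSketch {

    /** Hypothetical stand-in for an association error published on the event bus. */
    static final class AssociationError {
        final String remoteAddress;
        final String cause;

        AssociationError(String remoteAddress, String cause) {
            this.remoteAddress = remoteAddress;
            this.cause = cause;
        }
    }

    /** Minimal synchronous event bus (a stand-in, not Akka's event stream). */
    static final class EventBus {
        private final List<Consumer<AssociationError>> subscribers = new ArrayList<>();

        void subscribe(Consumer<AssociationError> subscriber) {
            subscribers.add(subscriber);
        }

        void publish(AssociationError event) {
            for (Consumer<AssociationError> subscriber : subscribers) {
                subscriber.accept(event);
            }
        }
    }

    /** Runs the shutdown action once, on the first event that indicates quarantine. */
    static final class QuarantineMonitor implements Consumer<AssociationError> {
        private final Runnable shutdownAction;
        private boolean triggered = false;

        QuarantineMonitor(Runnable shutdownAction) {
            this.shutdownAction = shutdownAction;
        }

        @Override
        public void accept(AssociationError event) {
            // Assumption: quarantine is recognized from the text of the error cause.
            boolean quarantine = event.cause != null
                    && event.cause.toLowerCase(Locale.ROOT).contains("quarantined");
            if (quarantine && !triggered) {
                triggered = true;
                // In a real process this would stop the actor system and then exit the JVM.
                shutdownAction.run();
            }
        }
    }

    public static void main(String[] args) {
        EventBus bus = new EventBus();
        bus.subscribe(new QuarantineMonitor(
                () -> System.out.println("quarantine detected: stop actors, then exit the JVM")));

        // An unrelated association error is ignored.
        bus.publish(new AssociationError("remote-system-a", "connection refused"));
        // An error whose cause mentions quarantine triggers the shutdown action.
        bus.publish(new AssociationError("remote-system-a", "this system has been quarantined"));
    }
}
```

Injecting the shutdown action keeps the detection logic testable; a production monitor would instead shut the actor system down and terminate the process, as the description states.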


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/tillrohrmann/flink quarantineMonitorBackport

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/flink/pull/3363.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #3363
    
----
commit 0cee579bde9f04a07f36b9c01be3e8089c34b0a4
Author: Till Rohrmann <[email protected]>
Date:   2016-10-26T22:24:12Z

    [FLINK-3347] [akka] Add QuarantineMonitor which shuts a quarantined actor system and JVM down
    
    The QuarantineMonitor subscribes to the actor system's event bus and listens for
    AssociationErrorEvents. These events are generated when the actor system has
    quarantined another actor system or has been quarantined by another actor system.
    When a quarantined state is detected, the actor system is shut down, killing all
    actors, and then the JVM is terminated.

commit c52bcfb24ba51120b51d1b62cec44f9a88690e19
Author: Till Rohrmann <[email protected]>
Date:   2017-02-20T15:37:28Z

    Disable QuarantineMonitor per default; Reintroduce config option for activation

----


> TaskManager (or its ActorSystem) need to restart in case they notice quarantine
> -------------------------------------------------------------------------------
>
>                 Key: FLINK-3347
>                 URL: https://issues.apache.org/jira/browse/FLINK-3347
>             Project: Flink
>          Issue Type: Improvement
>          Components: Distributed Coordination
>    Affects Versions: 0.10.1
>            Reporter: Stephan Ewen
>            Assignee: Till Rohrmann
>            Priority: Critical
>             Fix For: 1.0.0, 1.2.0, 1.1.4
>
>
> There are cases where Akka quarantines remote actor systems. In that case, no further
> communication is possible with that actor system unless one of the two actor systems
> is restarted.
> The result is that a TaskManager is up and available but cannot register at the
> JobManager (Akka refuses the connection because of the quarantined state), making the
> TaskManager a useless process.
> I suggest letting the TaskManager restart itself once it notices that either it has
> quarantined the JobManager or the JobManager has quarantined it.
> This can be recognized by listening to certain events in the actor system event
> stream:
> http://stackoverflow.com/questions/32471088/akka-cluster-detecting-quarantined-state



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
