[ 
https://issues.apache.org/jira/browse/NIFI-2406?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15395887#comment-15395887
 ] 

ASF GitHub Bot commented on NIFI-2406:
--------------------------------------

GitHub user markap14 opened a pull request:

    https://github.com/apache/nifi/pull/729

    NIFI-2406: Ensure that heartbeat monitor continues to run while instance …

    …is running. This way if a node sends heartbeat to this node as elected
    coordinator changes, we notify the node accordingly. Handle Exceptions more
    gracefully in leader election code.
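
The behavior described above can be illustrated with a minimal sketch. It is
not the code from this pull request: the class HeartbeatMonitorSketch, its
methods, the Reply type, and the node addresses are all hypothetical names
chosen for illustration. The assumption is that the monitor stays active on
every node, so a heartbeat that reaches a node that is not (or is no longer)
the elected coordinator still gets a reply naming the current coordinator.

```java
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical sketch: none of these class or method names come from NiFi.
final class HeartbeatMonitorSketch {

    /** Possible replies to a node that sent a heartbeat. */
    enum ReplyType { ACCEPTED, REDIRECT, COORDINATOR_UNKNOWN }

    static final class Reply {
        final ReplyType type;
        final String coordinatorAddress; // non-null only for REDIRECT

        Reply(ReplyType type, String coordinatorAddress) {
            this.type = type;
            this.coordinatorAddress = coordinatorAddress;
        }
    }

    private final String localAddress;

    // Set by the (assumed) leader-election callback; null until an election completes.
    private final AtomicReference<String> electedCoordinator = new AtomicReference<>();

    HeartbeatMonitorSketch(String localAddress) {
        this.localAddress = localAddress;
    }

    /** Called whenever the election reports a new coordinator address. */
    void onCoordinatorElected(String address) {
        electedCoordinator.set(address);
    }

    /**
     * Handles a heartbeat no matter which node is currently the coordinator.
     * A real monitor would also record the sender's heartbeat; this sketch
     * only decides what to tell the sender.
     */
    Reply onHeartbeat(String senderAddress) {
        String coordinator = electedCoordinator.get();
        if (coordinator == null) {
            return new Reply(ReplyType.COORDINATOR_UNKNOWN, null);
        }
        if (coordinator.equals(localAddress)) {
            return new Reply(ReplyType.ACCEPTED, null);
        }
        // This node is not the coordinator: point the sender at the node that is.
        return new Reply(ReplyType.REDIRECT, coordinator);
    }

    public static void main(String[] args) {
        HeartbeatMonitorSketch monitor = new HeartbeatMonitorSketch("node1:6001");
        monitor.onCoordinatorElected("node1:6001");
        System.out.println(monitor.onHeartbeat("node2:6001").type);

        // The elected coordinator changes; later heartbeats are redirected.
        monitor.onCoordinatorElected("node3:6001");
        Reply reply = monitor.onHeartbeat("node2:6001");
        System.out.println(reply.type + " -> " + reply.coordinatorAddress);
    }
}
```

The point of the sketch is that onHeartbeat never depends on the monitor
having been started or stopped by an election result, so a sender is always
told where to go instead of being left heartbeating without a connection.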

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/markap14/nifi NIFI-2406

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/nifi/pull/729.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #729
    
----
commit e9f5a2182362f32e837d4eabd47e6d3604a37e55
Author: Mark Payne <[email protected]>
Date:   2016-07-27T15:55:02Z

    NIFI-2406: Ensure that heartbeat monitor continues to run while instance is
    running. This way if a node sends heartbeat to this node as elected
    coordinator changes, we notify the node accordingly. Handle Exceptions more
    gracefully in leader election code.

----


> Rare start-up problems resulting in all nodes disconnected
> ----------------------------------------------------------
>
>                 Key: NIFI-2406
>                 URL: https://issues.apache.org/jira/browse/NIFI-2406
>             Project: Apache NiFi
>          Issue Type: Bug
>            Reporter: Joseph Percivall
>            Assignee: Mark Payne
>         Attachments: logs.tar.gz
>
>
> While testing PR 678[1], I hit a case where all the nodes were in a
> disconnected state: each node was heartbeating but not connected.
> Also in the logs there were ~1000 lines of:
> 2016-07-26 11:38:07,841 INFO [Leader Election Notification Thread-1] o.a.n.c.l.e.CuratorLeaderElectionManager org.apache.nifi.controller.leader.election.CuratorLeaderElectionManager$ElectionListener@24fae8c6 This node has been elected Leader for Role 'Cluster Coordinator'
> This message is only logged here[2], in a callback invoked by ZK. There
> were also many log messages of:
> 2016-07-26 11:54:07,910 WARN [Clustering Tasks Thread-1] o.a.n.c.c.node.NodeClusterCoordinator Failed to determine which node is elected active Cluster Coordinator: ZooKeeper reports the address as localhost:6001, but there is no node with this address
> I believe this is a problem with ZK/NiFi that existed before this PR and is
> not directly related to the PR being reviewed. I will attach a tar of the
> three nodes' logs.
> [1] https://github.com/apache/nifi/pull/678
> [2] https://github.com/apache/nifi/blame/master/nifi-nar-bundles/nifi-framework-bundle/nifi-framework/nifi-framework-core/src/main/java/org/apache/nifi/controller/leader/election/CuratorLeaderElectionManager.java#L220-L220



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
