[ 
https://issues.apache.org/jira/browse/NIFI-2406?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15395894#comment-15395894
 ] 

Mark Payne commented on NIFI-2406:
----------------------------------

[~JPercivall] - thanks for reporting this! I have created a PR that I believe 
addresses the underlying issue(s). There were two changes made:

First, if the heartbeat monitor is running and we call start(), that would 
throw an Exception. I changed this so that it no longer throws an Exception but 
instead just logs a warning and returns. Because this threw an Exception, the 
Exception bubbled up to Curator client code, which then swallowed the Exception 
and called the callback again. So the Exception was thrown again and the cycle 
repeated over and over. This explains all of those log messages about being 
elected coordinator. So I also ensured that we catch any Exception, log it, and 
handle the situation appropriately.
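
As a rough illustration of the first change, here is a minimal sketch. The 
class and method names (SketchHeartbeatMonitor, SketchElectionListener, 
onElected, warningCount) are hypothetical stand-ins, not the actual NiFi or 
Curator code. It only shows the two ideas described above: start() is a no-op 
with a warning when the monitor is already running, and the election callback 
catches and logs any Exception so that nothing escapes to the calling 
framework.

```java
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical stand-in for the heartbeat monitor; not NiFi's actual class.
class SketchHeartbeatMonitor {
    private final AtomicBoolean running = new AtomicBoolean(false);
    private final AtomicInteger warnings = new AtomicInteger();

    // Starting an already-running monitor logs a warning and returns
    // instead of throwing an Exception.
    void start() {
        if (!running.compareAndSet(false, true)) {
            warnings.incrementAndGet();
            System.err.println("WARN: heartbeat monitor already running; ignoring start()");
            return;
        }
        // A real monitor would schedule its periodic task here.
    }

    void stop() {
        running.set(false);
    }

    boolean isRunning() {
        return running.get();
    }

    int warningCount() {
        return warnings.get();
    }
}

// Hypothetical election callback. It never lets an Exception propagate to
// the caller, so a framework that retries on failure is not sent into a loop.
class SketchElectionListener {
    private final SketchHeartbeatMonitor monitor;
    private final AtomicInteger failures = new AtomicInteger();

    SketchElectionListener(SketchHeartbeatMonitor monitor) {
        this.monitor = monitor;
    }

    void onElected() {
        try {
            monitor.start();
        } catch (Exception e) {
            // Log and count the failure rather than rethrowing.
            failures.incrementAndGet();
            System.err.println("ERROR: failed to handle election: " + e);
        }
    }

    int failureCount() {
        return failures.get();
    }
}
```

With this shape, calling onElected() twice in a row leaves the monitor running 
and produces one warning rather than an Exception.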

Secondly, I changed the heartbeat monitor so that it is running the whole time 
that the NiFi instance is running. There was really no reason that we needed to 
stop & start it when connected to & disconnected from the cluster. I believe 
that because of the zookeeper connection/disconnection/connection/suspended 
messages that we see in the logs, Node 1, which was actually elected 
coordinator, was not processing the heartbeat messages. As a result, none of 
the nodes ever became 'CONNECTED'. Moreover, since we saw connection issues at 
startup, the heartbeat monitor should have received the heartbeat and notified 
the node to reconnect. But since the heartbeat monitor wasn't running, it did 
not request that the node reconnect, so the nodes all stayed in a 
'DISCONNECTED' state.
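
The second change can be sketched along the same lines. Everything here is an 
assumption made for illustration (the class name AlwaysOnHeartbeatMonitor, the 
State enum, the in-memory queue, and the list of reconnect requests); the real 
monitor talks to the cluster coordinator rather than to in-memory collections. 
The point is only that the monitor is started once for the lifetime of the 
instance, so a heartbeat from a node that is considered disconnected always 
results in a reconnect request.

```java
import java.util.List;
import java.util.Map;
import java.util.Queue;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch of a monitor that runs for the lifetime of the process.
class AlwaysOnHeartbeatMonitor {
    enum State { CONNECTED, DISCONNECTED }

    private final Map<String, State> nodeStates = new ConcurrentHashMap<>();
    private final Queue<String> pendingHeartbeats = new ConcurrentLinkedQueue<>();
    private final List<String> reconnectRequests = new CopyOnWriteArrayList<>();
    private final ScheduledExecutorService executor =
            Executors.newSingleThreadScheduledExecutor();

    // Called once at instance startup; not tied to cluster connection state.
    void start(long periodMillis) {
        executor.scheduleWithFixedDelay(this::processHeartbeats,
                periodMillis, periodMillis, TimeUnit.MILLISECONDS);
    }

    // Called once at instance shutdown.
    void shutdown() {
        executor.shutdownNow();
    }

    void receiveHeartbeat(String nodeId) {
        pendingHeartbeats.add(nodeId);
    }

    void markConnected(String nodeId) {
        nodeStates.put(nodeId, State.CONNECTED);
    }

    // Drains queued heartbeats. A heartbeat from a node that is not known to
    // be connected is recorded as a request for that node to reconnect.
    synchronized void processHeartbeats() {
        String nodeId;
        while ((nodeId = pendingHeartbeats.poll()) != null) {
            State state = nodeStates.getOrDefault(nodeId, State.DISCONNECTED);
            if (state == State.DISCONNECTED) {
                reconnectRequests.add(nodeId);
            }
        }
    }

    List<String> reconnectRequests() {
        return reconnectRequests;
    }
}
```

Because start() and shutdown() are tied to the instance lifecycle rather than 
to connection events, a ZooKeeper connection flap does not leave heartbeats 
unprocessed in this sketch.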

> Rare start-up problems resulting in all nodes disconnected
> ----------------------------------------------------------
>
>                 Key: NIFI-2406
>                 URL: https://issues.apache.org/jira/browse/NIFI-2406
>             Project: Apache NiFi
>          Issue Type: Bug
>            Reporter: Joseph Percivall
>            Assignee: Mark Payne
>         Attachments: logs.tar.gz
>
>
> While testing PR 678[1], I came across a time where all the nodes were in a 
> disconnected state and each were in a weird state of heartbeating but not 
> connected.
> Also in the logs there were ~1000 lines of:
> 2016-07-26 11:38:07,841 INFO [Leader Election Notification Thread-1] 
> o.a.n.c.l.e.CuratorLeaderElectionManager 
> org.apache.nifi.controller.leader.election.CuratorLeaderElectionManager$ElectionListener@24fae8c6
>  This node has been elected Leader for Role 'Cluster Coordinator'
> This message only gets called here[2] which is a call back for ZK. Also there 
> were many log messages of:
> 2016-07-26 11:54:07,910 WARN [Clustering Tasks Thread-1] 
> o.a.n.c.c.node.NodeClusterCoordinator Failed to determine which node is 
> elected active Cluster Coordinator: ZooKeeper reports the address as 
> localhost:6001, but there is no node with this address
> I believe this is a problem with ZK/NiFi that existed before this PR and not 
> directly related to the PR being reviewed. I will attach a tar of the 3 
> node's logs.
> [1] https://github.com/apache/nifi/pull/678
> [2] 
> https://github.com/apache/nifi/blame/master/nifi-nar-bundles/nifi-framework-bundle/nifi-framework/nifi-framework-core/src/main/java/org/apache/nifi/controller/leader/election/CuratorLeaderElectionManager.java#L220-L220



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
