[ 
https://issues.apache.org/jira/browse/NIFI-2406?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15414029#comment-15414029
 ] 

Mark Payne commented on NIFI-2406:
----------------------------------

[~joewitt] - I just pushed an update to the PR (as a separate commit so you can 
see what changed). I was able to reproduce this easily with either a kill -9 or 
a graceful restart, but whether it occurs depends on whether the node becomes 
the Cluster Coordinator when it starts up. The issue was that FlowController 
started the heartbeat monitor, which in turn announced the node as the Cluster 
Coordinator even when it wasn't. This overwrote the true Cluster Coordinator's 
address, so the nodes were confused. I fixed it so that the node publishes its 
address only when it has actually been elected Cluster Coordinator.
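
A minimal sketch of the guard described above, for illustration only. The class 
and method names (CoordinatorAddressRegistry, CoordinatorAnnouncer, onElected, 
onHeartbeatMonitorStarted) are hypothetical and are not NiFi's actual classes; 
the registry is an in-memory stand-in for the address stored in ZooKeeper.

```java
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical stand-in for the shared location (ZooKeeper in NiFi's case)
// that holds the address of the active Cluster Coordinator.
class CoordinatorAddressRegistry {
    private final AtomicReference<String> address = new AtomicReference<>();

    void publish(String nodeAddress) {
        address.set(nodeAddress);
    }

    String current() {
        return address.get();
    }
}

// Hypothetical node-side component, assumed for this sketch only.
class CoordinatorAnnouncer {
    private final String nodeAddress;
    private final CoordinatorAddressRegistry registry;
    private volatile boolean electedCoordinator = false;

    CoordinatorAnnouncer(String nodeAddress, CoordinatorAddressRegistry registry) {
        this.nodeAddress = nodeAddress;
        this.registry = registry;
    }

    // Invoked from the leader-election callback when this node wins the role.
    void onElected() {
        electedCoordinator = true;
        registry.publish(nodeAddress);
    }

    // Invoked from the leader-election callback when this node loses the role.
    void onRelinquished() {
        electedCoordinator = false;
    }

    // Invoked when the heartbeat monitor starts. The buggy behavior corresponds
    // to publishing unconditionally here; the guard publishes only if elected.
    void onHeartbeatMonitorStarted() {
        if (electedCoordinator) {
            registry.publish(nodeAddress);
        }
    }
}
```

With this guard, a node that starts its heartbeat monitor without having been 
elected leaves the published coordinator address untouched.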

> Rare start-up problems resulting in all nodes disconnected
> ----------------------------------------------------------
>
>                 Key: NIFI-2406
>                 URL: https://issues.apache.org/jira/browse/NIFI-2406
>             Project: Apache NiFi
>          Issue Type: Bug
>            Reporter: Joseph Percivall
>            Assignee: Mark Payne
>             Fix For: 1.0.0
>
>         Attachments: logs.tar.gz
>
>
> While testing PR 678[1], I came across a case where all the nodes were in a 
> disconnected state, and each was in a weird state of heartbeating but not 
> connected.
> Also in the logs there were ~1000 lines of:
> 2016-07-26 11:38:07,841 INFO [Leader Election Notification Thread-1] 
> o.a.n.c.l.e.CuratorLeaderElectionManager 
> org.apache.nifi.controller.leader.election.CuratorLeaderElectionManager$ElectionListener@24fae8c6
>  This node has been elected Leader for Role 'Cluster Coordinator'
> This message is only logged here[2], in a callback for ZK. There 
> were also many log messages of:
> 2016-07-26 11:54:07,910 WARN [Clustering Tasks Thread-1] 
> o.a.n.c.c.node.NodeClusterCoordinator Failed to determine which node is 
> elected active Cluster Coordinator: ZooKeeper reports the address as 
> localhost:6001, but there is no node with this address
> I believe this is a problem with ZK/NiFi that existed before this PR and is 
> not directly related to the PR being reviewed. I will attach a tar of the 3 
> nodes' logs.
> [1] https://github.com/apache/nifi/pull/678
> [2] 
> https://github.com/apache/nifi/blame/master/nifi-nar-bundles/nifi-framework-bundle/nifi-framework/nifi-framework-core/src/main/java/org/apache/nifi/controller/leader/election/CuratorLeaderElectionManager.java#L220-L220


