[jira] [Commented] (NIFI-2406) Rare start-up problems resulting in all nodes disconnected

ASF GitHub Bot (JIRA) Thu, 04 Aug 2016 10:54:42 -0700

    [ 
https://issues.apache.org/jira/browse/NIFI-2406?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15408206#comment-15408206
 ]


ASF GitHub Bot commented on NIFI-2406:
--------------------------------------

Github user YolandaMDavis commented on the issue:

    https://github.com/apache/nifi/pull/729
  
    @markap14 thanks! I have retested, disconnecting nodes as well as
restarting nodes in the cluster.  I did see the error "This node was elected
Leader for Role 'Cluster Coordinator' but failed to take leadership. Will
relinquish leadership role." logged on one node, with an error thrown because
the node was not part of the cluster.  That was followed by the logged warning
"Failed to determine which node is elected active Cluster Coordinator:
ZooKeeper reports the address as 127.0.0.1:11001, but there is no node with
this address".  This occurred during initial startup and appeared to resolve
once the Cluster Coordinator (node 2, on port 11001) came online.
    
    I also shut down node 2 (which was the Cluster Coordinator) and noted the
repeated "Failed to send heartbeat due to:
org.apache.nifi.cluster.protocol.ProtocolException: Failed to send message to
Cluster Coordinator due to: java.net.ConnectException: Connection refused"
warning in node 1's log.  This was expected based on the notes in Jira.  Node 1
was the primary node at the time.  When node 2 came back online, the heartbeat
warning no longer appeared in node 1's log; however, I could not access node
1's UI (it was displaying the error shown in the screenshot below).
    
    I've included just the nifi-app logs from both servers, since they seemed
to have the most relevant information. I would focus on log entries with
timestamps beginning at 13:35 (08-04-2016). Please let me know if you need any
additional information or logs.
    
    
    <img width="1440" alt="node1-error-cc-restart" 
src="https://cloud.githubusercontent.com/assets/1371858/17412342/3f28890a-5a4a-11e6-8ff8-d3acd5a5b0b8.png">
    
    [logs.zip](https://github.com/apache/nifi/files/402267/logs.zip)
    
    
    
    



> Rare start-up problems resulting in all nodes disconnected
> ----------------------------------------------------------
>
>                 Key: NIFI-2406
>                 URL: https://issues.apache.org/jira/browse/NIFI-2406
>             Project: Apache NiFi
>          Issue Type: Bug
>            Reporter: Joseph Percivall
>            Assignee: Mark Payne
>         Attachments: logs.tar.gz
>
>
> While testing PR 678[1], I came across a time where all the nodes were in a 
> disconnected state, and each was in a weird state of heartbeating but not 
> connected.
> Also in the logs there were ~1000 lines of:
> 2016-07-26 11:38:07,841 INFO [Leader Election Notification Thread-1] 
> o.a.n.c.l.e.CuratorLeaderElectionManager 
> org.apache.nifi.controller.leader.election.CuratorLeaderElectionManager$ElectionListener@24fae8c6
>  This node has been elected Leader for Role 'Cluster Coordinator'
> This message is only logged here[2], in a callback for ZK. Also there 
> were many log messages of:
> 2016-07-26 11:54:07,910 WARN [Clustering Tasks Thread-1] 
> o.a.n.c.c.node.NodeClusterCoordinator Failed to determine which node is 
> elected active Cluster Coordinator: ZooKeeper reports the address as 
> localhost:6001, but there is no node with this address
> I believe this is a problem with ZK/NiFi that existed before this PR and is 
> not directly related to the PR being reviewed. I will attach a tar of the 3 
> nodes' logs.
> [1] https://github.com/apache/nifi/pull/678
> [2] 
> https://github.com/apache/nifi/blame/master/nifi-nar-bundles/nifi-framework-bundle/nifi-framework/nifi-framework-core/src/main/java/org/apache/nifi/controller/leader/election/CuratorLeaderElectionManager.java#L220-L220



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
