[jira] [Commented] (NIFI-2406) Rare start-up problems resulting in all nodes disconnected

Joseph Witt (JIRA) Tue, 09 Aug 2016 10:19:37 -0700

    [ 
https://issues.apache.org/jira/browse/NIFI-2406?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15413870#comment-15413870
 ]


Joseph Witt commented on NIFI-2406:
-----------------------------------

[~markap14] I set up a simple three-node cluster, all on localhost, with no 
security configured.  Nodes were on ports 8081, 8082, and 8083.  I set up a 
basic site-to-site connection to loop data back to itself.  Site-to-site, 
interestingly enough, is set to connect to 8081, if that helps.

I shut down 8081 nicely.  Navigated to 8082 and it worked well and immediately 
(this was a good improvement over last night).  I brought 8081 back into the 
flow and everything came back as it was... looking good.

I then killed 8081 not nicely (kill -9 style).  Navigated to 8082 and it gave 
me the unexpected error occurred - check app logs.  The logs have things like:

{quote}
2016-08-09 13:15:00,934 WARN [Clustering Tasks Thread-1] org.apache.nifi.cluster.heartbeat Failed to send heartbeat due to: org.apache.nifi.cluster.protocol.ProtocolException: Failed to send message to Cluster Coordinator due to: java.net.ConnectException: Connection refused
org.apache.nifi.cluster.protocol.ProtocolException: Failed to send message to Cluster Coordinator due to: java.net.ConnectException: Connection refused
        at org.apache.nifi.cluster.protocol.AbstractNodeProtocolSender.sendProtocolMessage(AbstractNodeProtocolSender.java:117) ~[na:na]
        at org.apache.nifi.cluster.protocol.AbstractNodeProtocolSender.heartbeat(AbstractNodeProtocolSender.java:88) ~[na:na]
        at org.apache.nifi.controller.cluster.ClusterProtocolHeartbeater.send(ClusterProtocolHeartbeater.java:96) ~[na:na]
        at org.apache.nifi.controller.FlowController$HeartbeatSendTask.run(FlowController.java:3864) ~[na:na]
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) [na:1.8.0_66]
        at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308) [na:1.8.0_66]
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180) [na:1.8.0_66]
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294) [na:1.8.0_66]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [na:1.8.0_66]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [na:1.8.0_66]
        at java.lang.Thread.run(Thread.java:745) [na:1.8.0_66]
Caused by: java.net.ConnectException: Connection refused
        at java.net.PlainSocketImpl.socketConnect(Native Method) ~[na:1.8.0_66]
        at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:350) ~[na:1.8.0_66]
        at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:206) ~[na:1.8.0_66]
        at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:188) ~[na:1.8.0_66]
        at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392) ~[na:1.8.0_66]
        at java.net.Socket.connect(Socket.java:589) ~[na:1.8.0_66]
        at java.net.Socket.connect(Socket.java:538) ~[na:1.8.0_66]
        at java.net.Socket.<init>(Socket.java:434) ~[na:1.8.0_66]
        at java.net.Socket.<init>(Socket.java:211) ~[na:1.8.0_66]
        at org.apache.nifi.io.socket.SocketUtils.createSocket(SocketUtils.java:59) ~[na:na]
        at org.apache.nifi.cluster.protocol.AbstractNodeProtocolSender.sendProtocolMessage(AbstractNodeProtocolSender.java:115) ~[na:na]
        ... 10 common frames omitted
{quote}

I then started 8081 back up and things came back online as normal.

So, why is the behavior different between the node being shut down gracefully 
versus it being killed?  Should it be the same?  What are the expected 
mechanics?  I was thinking that after 40 seconds it would assign a new 
coordinator, but it didn't.
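
For illustration only, the graceful-versus-killed difference asked about above 
can be pictured as a lease that is either released explicitly or left to 
expire.  This is a toy sketch, not NiFi or ZooKeeper code: the class, the 
function, and the node labels are hypothetical, and it assumes the 40-second 
figure mentioned above acts as an expiry timeout.

```python
from dataclasses import dataclass

# Assumption: the 40-second figure from the comment, treated as a lease timeout.
SESSION_TIMEOUT_S = 40.0


@dataclass
class CoordinatorLease:
    """Hypothetical stand-in for a coordinator record tied to a live session."""
    owner: str
    last_heartbeat_s: float
    closed: bool = False

    def heartbeat(self, now_s: float) -> None:
        # A live owner keeps refreshing the lease.
        self.last_heartbeat_s = now_s

    def close(self) -> None:
        # Graceful shutdown: the owner releases the lease explicitly.
        self.closed = True

    def is_released(self, now_s: float) -> bool:
        # After a hard kill nothing calls close(), so only the timeout frees it.
        expired = (now_s - self.last_heartbeat_s) >= SESSION_TIMEOUT_S
        return self.closed or expired


def current_coordinator(lease: CoordinatorLease, candidate: str, now_s: float) -> str:
    """Return the lease owner, or the candidate once the lease is released."""
    return candidate if lease.is_released(now_s) else lease.owner


if __name__ == "__main__":
    graceful = CoordinatorLease("node-8081", last_heartbeat_s=0.0)
    graceful.close()
    print(current_coordinator(graceful, "node-8082", now_s=1.0))

    killed = CoordinatorLease("node-8081", last_heartbeat_s=0.0)
    print(current_coordinator(killed, "node-8082", now_s=1.0))
    print(current_coordinator(killed, "node-8082", now_s=40.0))
```

In this model a graceful stop hands the role over at once, while a killed 
owner keeps it until the timeout elapses.  Whether the cluster actually 
behaves this way is exactly the open question above.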

> Rare start-up problems resulting in all nodes disconnected
> ----------------------------------------------------------
>
>                 Key: NIFI-2406
>                 URL: https://issues.apache.org/jira/browse/NIFI-2406
>             Project: Apache NiFi
>          Issue Type: Bug
>            Reporter: Joseph Percivall
>            Assignee: Mark Payne
>             Fix For: 1.0.0
>
>         Attachments: logs.tar.gz
>
>
> While testing PR 678[1], I came across a time where all the nodes were in a 
> disconnected state and each were in a weird state of heartbeating but not 
> connected.
> Also in the logs there were ~1000 lines of:
> 2016-07-26 11:38:07,841 INFO [Leader Election Notification Thread-1] 
> o.a.n.c.l.e.CuratorLeaderElectionManager 
> org.apache.nifi.controller.leader.election.CuratorLeaderElectionManager$ElectionListener@24fae8c6
>  This node has been elected Leader for Role 'Cluster Coordinator'
> This message only gets called here[2] which is a call back for ZK. Also there 
> were many log messages of:
> 2016-07-26 11:54:07,910 WARN [Clustering Tasks Thread-1] 
> o.a.n.c.c.node.NodeClusterCoordinator Failed to determine which node is 
> elected active Cluster Coordinator: ZooKeeper reports the address as 
> localhost:6001, but there is no node with this address
> I believe this is a problem with ZK/NiFi that existed before this PR and not 
> directly related to the PR being reviewed. I will attach a tar of the 3 
> node's logs.
> [1] https://github.com/apache/nifi/pull/678
> [2] 
> https://github.com/apache/nifi/blame/master/nifi-nar-bundles/nifi-framework-bundle/nifi-framework/nifi-framework-core/src/main/java/org/apache/nifi/controller/leader/election/CuratorLeaderElectionManager.java#L220-L220



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
