[
https://issues.apache.org/jira/browse/NIFI-2406?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15413870#comment-15413870
]
Joseph Witt commented on NIFI-2406:
-----------------------------------
[~markap14] I set up a simple three-node cluster, all on localhost, with no
security configured. The nodes were on ports 8081, 8082, and 8083. I set up a
basic site-to-site connection to loop data back to the cluster itself.
Interestingly, site-to-site is set to connect to 8081, if that helps.
I shut down 8081 gracefully, navigated to 8082, and it worked well and
immediately (a good improvement over last night). I brought 8081 back into the
flow and everything came back as it was... looking good.
I then killed 8081 abruptly (kill -9 style). I navigated to 8082 and it gave
me the "unexpected error occurred, check app logs" message. The logs contain
entries like:
{quote}
2016-08-09 13:15:00,934 WARN [Clustering Tasks Thread-1] org.apache.nifi.cluster.heartbeat Failed to send heartbeat due to: org.apache.nifi.cluster.protocol.ProtocolException: Failed to send message to Cluster Coordinator due to: java.net.ConnectException: Connection refused
org.apache.nifi.cluster.protocol.ProtocolException: Failed to send message to Cluster Coordinator due to: java.net.ConnectException: Connection refused
    at org.apache.nifi.cluster.protocol.AbstractNodeProtocolSender.sendProtocolMessage(AbstractNodeProtocolSender.java:117) ~[na:na]
    at org.apache.nifi.cluster.protocol.AbstractNodeProtocolSender.heartbeat(AbstractNodeProtocolSender.java:88) ~[na:na]
    at org.apache.nifi.controller.cluster.ClusterProtocolHeartbeater.send(ClusterProtocolHeartbeater.java:96) ~[na:na]
    at org.apache.nifi.controller.FlowController$HeartbeatSendTask.run(FlowController.java:3864) ~[na:na]
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) [na:1.8.0_66]
    at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308) [na:1.8.0_66]
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180) [na:1.8.0_66]
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294) [na:1.8.0_66]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [na:1.8.0_66]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [na:1.8.0_66]
    at java.lang.Thread.run(Thread.java:745) [na:1.8.0_66]
Caused by: java.net.ConnectException: Connection refused
    at java.net.PlainSocketImpl.socketConnect(Native Method) ~[na:1.8.0_66]
    at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:350) ~[na:1.8.0_66]
    at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:206) ~[na:1.8.0_66]
    at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:188) ~[na:1.8.0_66]
    at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392) ~[na:1.8.0_66]
    at java.net.Socket.connect(Socket.java:589) ~[na:1.8.0_66]
    at java.net.Socket.connect(Socket.java:538) ~[na:1.8.0_66]
    at java.net.Socket.<init>(Socket.java:434) ~[na:1.8.0_66]
    at java.net.Socket.<init>(Socket.java:211) ~[na:1.8.0_66]
    at org.apache.nifi.io.socket.SocketUtils.createSocket(SocketUtils.java:59) ~[na:na]
    at org.apache.nifi.cluster.protocol.AbstractNodeProtocolSender.sendProtocolMessage(AbstractNodeProtocolSender.java:115) ~[na:na]
    ... 10 common frames omitted
{quote}
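For anyone reading the trace: the root cause at the bottom is a plain
java.net.Socket constructor failing because nothing is listening on the
target port, which is consistent with the heartbeat still being sent to the
killed node. A minimal standalone sketch of that failure mode follows. It is
not NiFi code; the class name RefusedConnectionDemo and both helper methods
are made up for illustration, and it assumes the local port it probes stays
unused between the probe and the connect attempt.

```java
import java.io.IOException;
import java.net.ConnectException;
import java.net.ServerSocket;
import java.net.Socket;

// Hypothetical demo class, not part of NiFi.
public class RefusedConnectionDemo {

    // Binds to port 0 so the OS picks a free port, then closes the listener.
    // Afterwards nothing should be listening on the returned port.
    static int findUnusedPort() throws IOException {
        try (ServerSocket probe = new ServerSocket(0)) {
            return probe.getLocalPort();
        }
    }

    // Returns true if the connect attempt fails with ConnectException,
    // the same exception type that appears as the root cause in the log.
    static boolean isConnectionRefused(String host, int port) throws IOException {
        try (Socket socket = new Socket(host, port)) {
            return false; // something accepted the connection
        } catch (ConnectException e) {
            return true;  // no listener on that port
        }
    }

    public static void main(String[] args) throws IOException {
        int port = findUnusedPort();
        System.out.println("port " + port + " refused: "
                + isConnectionRefused("127.0.0.1", port));
    }
}
```

This only shows why the sender sees "Connection refused"; it says nothing
about when or whether a new coordinator should be elected.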
I then started 8081 back up and things came back online as normal.
So, why does the behavior differ between a node being shut down gracefully
and a node being killed? Should it be the same? What are the expected
mechanics? I was thinking that after 40 seconds it would assign a new
coordinator, but it didn't.
> Rare start-up problems resulting in all nodes disconnected
> ----------------------------------------------------------
>
> Key: NIFI-2406
> URL: https://issues.apache.org/jira/browse/NIFI-2406
> Project: Apache NiFi
> Issue Type: Bug
> Reporter: Joseph Percivall
> Assignee: Mark Payne
> Fix For: 1.0.0
>
> Attachments: logs.tar.gz
>
>
> While testing PR 678[1], I came across a time when all the nodes were in a
> disconnected state, and each was in a weird state of heartbeating but not
> connected.
> Also in the logs there were ~1000 lines of:
> 2016-07-26 11:38:07,841 INFO [Leader Election Notification Thread-1]
> o.a.n.c.l.e.CuratorLeaderElectionManager
> org.apache.nifi.controller.leader.election.CuratorLeaderElectionManager$ElectionListener@24fae8c6
> This node has been elected Leader for Role 'Cluster Coordinator'
> This message is only logged here[2], which is a callback for ZK. There were
> also many log messages of:
> 2016-07-26 11:54:07,910 WARN [Clustering Tasks Thread-1]
> o.a.n.c.c.node.NodeClusterCoordinator Failed to determine which node is
> elected active Cluster Coordinator: ZooKeeper reports the address as
> localhost:6001, but there is no node with this address
> I believe this is a problem with ZK/NiFi that existed before this PR and is
> not directly related to the PR being reviewed. I will attach a tar of the
> three nodes' logs.
> [1] https://github.com/apache/nifi/pull/678
> [2] https://github.com/apache/nifi/blame/master/nifi-nar-bundles/nifi-framework-bundle/nifi-framework/nifi-framework-core/src/main/java/org/apache/nifi/controller/leader/election/CuratorLeaderElectionManager.java#L220-L220