Joe,

Assuming that you're using an embedded ZooKeeper server, it is not surprising that you saw a lot of ERROR-level messages about dropped ZK connections. Since only 1 of your 3 NiFi nodes was up, you had only 1 of 3 ZK servers, so there was no quorum, and the node was continually trying to connect to servers that were not available. Once the other nodes are started, you should be okay.
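For reference, the majority rule at work here can be sketched as follows (the ensemble size of 3 matches this cluster; everything else is just arithmetic):

```shell
# ZooKeeper needs a strict majority of its ensemble to form quorum:
# floor(n/2) + 1 servers. With embedded ZK on a 3-node NiFi cluster,
# 2 of the 3 servers must be running; a single node alone can never
# form quorum, which is why the lone node kept erroring.
n=3
echo "quorum needed: $(( n / 2 + 1 )) of $n"   # → quorum needed: 2 of 3
```

Note that this is also why 3-node (rather than 2-node) ensembles are the usual minimum: losing one server still leaves a majority.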
The log messages that you are seeing there with the odd ports are, I believe, the ephemeral ports that the client is using for the outgoing connection. These should not need to be opened up on your VM (assuming that you're not blocking outbound ports). The last message there indicates that a session was established with a negotiated timeout of 4000 milliseconds, so I don't believe there's any problem with ports being blocked.

However, once the nodes have all started up, they shouldn't have problems connecting to each other. Can you grep your logs for "changed from"? NiFi logs at INFO level every time the connection status of a node in the cluster changes. This may shed some light on why the nodes were not connecting to the cluster.

Thanks
-Mark

On Nov 18, 2016, at 12:30 PM, Jeff <[email protected]> wrote:

Joe, I'm glad you were able to get the nodes to reconnect, but I'm interested to know how it got into a state where it couldn't start up previously. If you can reproduce the scenario and provide the full logs and your NiFi configuration, we can investigate what caused it to get into that state.

On Fri, Nov 18, 2016 at 12:17 PM Joe Gresock <[email protected]> wrote:

I waited the 5 minutes of the election process, and then several minutes beyond that. Incidentally, when I cleared the state (except zookeeper/my_id) from all the nodes, deleted the flow.xml.gz from all but one of the nodes, and then restarted the whole cluster, it came back.

On Fri, Nov 18, 2016 at 5:11 PM, Jeff <[email protected]> wrote:

Hello Joe,

Just out of curiosity, how long did you let NiFi run while waiting for the nodes to connect?

On Fri, Nov 18, 2016 at 10:53 AM Joe Gresock <[email protected]> wrote:

Despite starting up, the nodes now cannot connect to each other, so they're all listed as Disconnected in the UI.
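Mark's grep suggestion earlier in the thread can be run like this. The log path and the sample line are assumptions for illustration: a default install writes to logs/nifi-app.log, and the real connection-status line from the cluster coordinator is longer than the approximation shown here.

```shell
# Write an illustrative sample line (format approximated, not a verbatim
# NiFi log line), then grep for the connection-status transitions that
# NiFi records at INFO level.
mkdir -p /tmp/nifi-logs
cat > /tmp/nifi-logs/nifi-app.log <<'EOF'
2016-11-18 15:55:01,000 INFO [...] o.a.n.c.c.node.NodeClusterCoordinator Status of node ip-172-31-33-34.ec2.internal:8443 changed from CONNECTING to DISCONNECTED
EOF
grep -h "changed from" /tmp/nifi-logs/nifi-app*.log
# prints the sample line above
```

On a real install, point the grep at the logs directory under your NiFi home, including rolled-over files (e.g. nifi-app*.log), so transitions from before the last restart are included.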
I see this in the logs:

2016-11-18 15:50:19,080 INFO [NIOServerCxn.Factory: 0.0.0.0/0.0.0.0:2181] o.a.zookeeper.server.ZooKeeperServer Client attempting to establish new session at /172.31.33.34:47224
2016-11-18 15:50:19,081 INFO [CommitProcessor:2] o.a.zookeeper.server.ZooKeeperServer Established session 0x258781845940bf9 with negotiated timeout 4000 for client /172.31.33.34:47224
2016-11-18 15:50:19,185 INFO [NIOServerCxn.Factory: 0.0.0.0/0.0.0.0:2181] o.a.zookeeper.server.ZooKeeperServer Client attempting to establish new session at /172.31.33.34:47228
2016-11-18 15:50:19,186 INFO [NIOServerCxn.Factory: 0.0.0.0/0.0.0.0:2181] o.a.zookeeper.server.ZooKeeperServer Client attempting to establish new session at /172.31.33.34:47230
2016-11-18 15:50:19,187 INFO [CommitProcessor:2] o.a.zookeeper.server.ZooKeeperServer Established session 0x258781845940bfa with negotiated timeout 4000 for client /172.31.33.34:47228
2016-11-18 15:50:19,187 INFO [CommitProcessor:2] o.a.zookeeper.server.ZooKeeperServer Established session 0x258781845940bfb with negotiated timeout 4000 for client /172.31.33.34:47230
2016-11-18 15:50:19,292 INFO [NIOServerCxn.Factory: 0.0.0.0/0.0.0.0:2181] o.a.zookeeper.server.ZooKeeperServer Client attempting to establish new session at /172.31.33.34:47234
2016-11-18 15:50:19,293 INFO [CommitProcessor:2] o.a.zookeeper.server.ZooKeeperServer Established session 0x258781845940bfc with negotiated timeout 4000 for client /172.31.33.34:47234

However, I definitely did not open any ports similar to 47234 on my NiFi VMs. Is there a certain set of ports that needs to be open between the servers? My understanding was that only 2888, 3888, and 2181 were necessary for ZooKeeper.

On Fri, Nov 18, 2016 at 3:41 PM, Joe Gresock <[email protected]> wrote:

It appears that if you try to start up just one node in a cluster with multiple ZooKeeper hosts specified in zookeeper.properties, this error is spammed at an incredible rate in your logs.
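On the ports question above: between ZooKeeper servers, only the two ports in each server.N entry must be open (quorum traffic and leader election), plus the clientPort for the NiFi/Curator clients. The 47xxx addresses in the log are client-side ephemeral source ports chosen by the OS for outbound connections and need no inbound firewall rule. A sketch with hypothetical hostnames and the common default ports:

```shell
# Hypothetical embedded-ZK config, in zookeeper.properties style.
cat > /tmp/zookeeper.properties <<'EOF'
clientPort=2181
server.1=node1:2888:3888
server.2=node2:2888:3888
server.3=node3:2888:3888
EOF
# Extract the two inter-server ports from the server.N entries
# (quorum traffic on 2888, leader election on 3888):
grep -oE '[0-9]+:[0-9]+$' /tmp/zookeeper.properties | tr ':' '\n' | sort -u
# prints 2888 and 3888, each on its own line
```

So the set Joe lists (2888, 3888, and the 2181 client port) matches what needs to be reachable between servers; nothing in the ephemeral range has to be opened.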
When I started up all 3 nodes at once, they didn't receive the error.

On Fri, Nov 18, 2016 at 3:18 PM, Joe Gresock <[email protected]> wrote:

I'm upgrading a test 0.x NiFi cluster to 1.x using the latest in master as of today. I was able to successfully start the 3-node cluster once, but then I restarted it and got the following error spammed in nifi-app.log. I'm not sure where to start debugging this, and I'm puzzled why it would work once and then start giving me errors on the second restart. Has anyone run into this error?

2016-11-18 15:07:18,178 INFO [main] org.eclipse.jetty.server.Server Started @83426ms
2016-11-18 15:07:18,883 INFO [main] org.apache.nifi.web.server.JettyServer Loading Flow...
2016-11-18 15:07:18,889 INFO [main] org.apache.nifi.io.socket.SocketListener Now listening for connections from nodes on port 9001
2016-11-18 15:07:19,117 INFO [main] o.a.nifi.controller.StandardFlowService Connecting Node: ip-172-31-33-34.ec2.internal:8443
2016-11-18 15:07:25,781 WARN [main] o.a.nifi.controller.StandardFlowService There is currently no Cluster Coordinator. This often happens upon restart of NiFi when running an embedded ZooKeeper. Will register this node to become the active Cluster Coordinator and will attempt to connect to cluster again
2016-11-18 15:07:25,782 INFO [main] o.a.n.c.l.e.CuratorLeaderElectionManager CuratorLeaderElectionManager[stopped=false] Attempted to register Leader Election for role 'Cluster Coordinator' but this role is already registered
2016-11-18 15:07:34,685 WARN [main] o.a.nifi.controller.StandardFlowService There is currently no Cluster Coordinator. This often happens upon restart of NiFi when running an embedded ZooKeeper.
Will register this node to become the active Cluster Coordinator and will attempt to connect to cluster again
2016-11-18 15:07:34,685 INFO [main] o.a.n.c.l.e.CuratorLeaderElectionManager CuratorLeaderElectionManager[stopped=false] Attempted to register Leader Election for role 'Cluster Coordinator' but this role is already registered
2016-11-18 15:07:34,696 INFO [Curator-Framework-0] o.a.c.f.state.ConnectionStateManager State change: SUSPENDED
2016-11-18 15:07:34,698 INFO [Curator-ConnectionStateManager-0] o.a.n.c.l.e.CuratorLeaderElectionManager org.apache.nifi.controller.leader.election.CuratorLeaderElectionManager$ElectionListener@671a652a Connection State changed to SUSPENDED
2016-11-18 15:07:34,699 ERROR [Curator-Framework-0] o.a.c.f.imps.CuratorFrameworkImpl Background operation retry gave up
org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss
    at org.apache.zookeeper.KeeperException.create(KeeperException.java:99) ~[zookeeper-3.4.6.jar:3.4.6-1569965]
    at org.apache.curator.framework.imps.CuratorFrameworkImpl.checkBackgroundRetry(CuratorFrameworkImpl.java:728) [curator-framework-2.11.0.jar:na]
    at org.apache.curator.framework.imps.CuratorFrameworkImpl.performBackgroundOperation(CuratorFrameworkImpl.java:857) [curator-framework-2.11.0.jar:na]
    at org.apache.curator.framework.imps.CuratorFrameworkImpl.backgroundOperationsLoop(CuratorFrameworkImpl.java:809) [curator-framework-2.11.0.jar:na]
    at org.apache.curator.framework.imps.CuratorFrameworkImpl.access$300(CuratorFrameworkImpl.java:64) [curator-framework-2.11.0.jar:na]
    at org.apache.curator.framework.imps.CuratorFrameworkImpl$4.call(CuratorFrameworkImpl.java:267) [curator-framework-2.11.0.jar:na]
    at java.util.concurrent.FutureTask.run(FutureTask.java:266) [na:1.8.0_111]
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180) [na:1.8.0_111]
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293) [na:1.8.0_111]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [na:1.8.0_111]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [na:1.8.0_111]
    at java.lang.Thread.run(Thread.java:745) [na:1.8.0_111]

--
I know what it is to be in need, and I know what it is to have plenty. I have learned the secret of being content in any and every situation, whether well fed or hungry, whether living in plenty or in want. I can do all this through him who gives me strength.
-Philippians 4:12-13
