Update: now all 5 nodes, regardless of whether they run a ZK server, are indicating SUSPENDED -> RECONNECTED.
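For reference, relaxing the cluster and ZooKeeper timeouts that Joe suggests below is done in nifi.properties on each node. A minimal sketch, assuming NiFi 1.2.0; the relaxed values are purely illustrative:

  # NiFi 1.2.0 defaults: heartbeat interval 5 sec, node connection/read
  # timeouts 5 secs, ZooKeeper connect/session timeouts 3 secs
  nifi.cluster.protocol.heartbeat.interval=10 sec
  nifi.cluster.node.connection.timeout=30 secs
  nifi.cluster.node.read.timeout=30 secs
  nifi.zookeeper.connect.timeout=10 secs
  nifi.zookeeper.session.timeout=10 secs

The ZooKeeper session timeout is the most relevant to the SUSPENDED -> RECONNECTED cycling: a longer session gives the Curator client more room to ride out pauses before the connection is considered lost.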
On Thu, May 25, 2017 at 10:23 AM, Mark Bean <[email protected]> wrote:
> I reduced the number of embedded ZooKeeper servers on the 5-node NiFi
> cluster from 5 to 3. This has improved the situation. I no longer see any
> of the three nodes which are also ZK servers disconnecting from and
> reconnecting to the cluster as before. However, the two nodes which are
> not running ZK continue to disconnect and reconnect. The following is
> taken from one of the non-ZK nodes. It's curious that some messages are
> issued twice from the same thread, but reference a different object.
>
> nifi-app.log:
> 2017-05-25 13:39:45,627 INFO [Clustering Tasks Thread-1]
> o.a.n.c.c.ClusterProtocolHeartbeater
> Heartbeat created at 2017-05-25 13:39:45,504 and sent to FQDN:PORT at
> 2017-05-25 13:39:45,627; send took 122 millis
> 2017-05-25 13:39:50,862 INFO [Clustering Tasks Thread-1]
> o.a.n.c.c.ClusterProtocolHeartbeater
> Heartbeat created at 2017-05-25 13:39:50,732 and sent to FQDN:PORT at
> 2017-05-25 13:39:50,862; send took 122 millis
> 2017-05-25 13:39:56,089 INFO [Clustering Tasks Thread-1]
> o.a.n.c.c.ClusterProtocolHeartbeater
> Heartbeat created at 2017-05-25 13:39:55,966 and sent to FQDN:PORT at
> 2017-05-25 13:39:56,089; send took 129 millis
> 2017-05-25 13:40:01,628 INFO [main-EventThread]
> o.a.c.f.state.ConnectionStateManager
> State change: SUSPENDED
> 2017-05-25 13:40:01,629 INFO [Curator-ConnectionStateManager-0]
> o.a.n.c.l.e.CuratorLeaderElectionManager
> org.apache.nifi.controller.leader.election.CuratorLeaderElectionManager$ElectionListener@68f8b6a2
> Connection State changed to SUSPENDED
> 2017-05-25 13:40:01,629 INFO [Curator-ConnectionStateManager-0]
> o.a.n.c.l.e.CuratorLeaderElectionManager
> org.apache.nifi.controller.leader.election.CuratorLeaderElectionManager$ElectionListener@663f55cd
> Connection State changed to SUSPENDED
> 2017-05-25 13:40:02,412 INFO [main-EventThread]
> o.a.c.f.state.ConnectionStateManager
> State change: RECONNECTED
> 2017-05-25 13:40:02,413 INFO [Curator-ConnectionStateManager-0]
> o.a.n.c.l.e.CuratorLeaderElectionManager
> org.apache.nifi.controller.leader.election.CuratorLeaderElectionManager$ElectionListener@68f8b6a2
> Connection State changed to RECONNECTED
> 2017-05-25 13:40:02,413 INFO [Curator-ConnectionStateManager-0]
> o.a.n.c.l.e.CuratorLeaderElectionManager
> org.apache.nifi.controller.leader.election.CuratorLeaderElectionManager$ElectionListener@663f55cd
> Connection State changed to RECONNECTED
> 2017-05-25 13:40:02,550 INFO [Clustering Tasks Thread-1]
> o.a.n.c.c.ClusterProtocolHeartbeater
> Heartbeat created at 2017-05-25 13:40:01,632 and sent to FQDN:PORT at
> 2017-05-25 13:40:02,550; send took 917 millis
> 2017-05-25 13:40:07,787 INFO [Clustering Tasks Thread-1]
> o.a.n.c.c.ClusterProtocolHeartbeater
> Heartbeat created at 2017-05-25 13:40:07,657 and sent to FQDN:PORT at
> 2017-05-25 13:40:07,787; send took 129 millis
>
> I will work on setting up an external ZK next, but would still like some
> insight into what is being observed with the embedded ZK.
>
> Thanks,
> Mark
>
> On Wed, May 24, 2017 at 3:57 PM, Mark Bean <[email protected]> wrote:
>
>> Yes, we are using the embedded ZK. We will try instantiating an external
>> ZK and see if that resolves the problem.
>>
>> The load on the system is extremely small. Currently (as nodes are
>> disconnecting/reconnecting) all input ports to the flow are turned off.
>> The only data in the flow is from a single GenerateFlowFile processor
>> generating 5 bytes every 30 secs.
>>
>> Also, it is a 5-node cluster with embedded ZK on each node. First, I
>> will try reducing ZK to only 3 nodes. Then, I will try a 3-node
>> external ZK.
>>
>> Thanks,
>> Mark
>>
>> On Wed, May 24, 2017 at 11:49 AM, Joe Witt <[email protected]> wrote:
>>
>>> Are you using the embedded ZooKeeper? If yes, we recommend using an
>>> external ZooKeeper.
>>>
>>> What type of load are the systems under when this occurs (CPU,
>>> network, memory, disk I/O)? Under high load, the default timeouts for
>>> clustering are too aggressive. You can relax these for higher-load
>>> clusters and should see good behavior. Even if the system overall is
>>> not under all that high a load, garbage collection pauses that are
>>> lengthy and/or frequent can cause the same high-load effect as far as
>>> the JVM is concerned.
>>>
>>> Thanks,
>>> Joe
>>>
>>> On Wed, May 24, 2017 at 9:11 AM, Mark Bean <[email protected]>
>>> wrote:
>>> > We have a cluster which is showing signs of instability. The Primary
>>> > Node and Coordinator are reassigned to different nodes every several
>>> > minutes. I believe this is due to a lack of heartbeat or other
>>> > coordination. The following error occurs periodically in
>>> > nifi-app.log:
>>> >
>>> > ERROR [CommitProcessor:1] o.apache.zookeeper.server.NIOServerCnxn
>>> > Unexpected Exception:
>>> > java.nio.channels.CancelledKeyException: null
>>> >   at sun.nio.ch.SelectionKeyImpl.ensureValid(SelectionKeyImpl.java:73)
>>> >   at sun.nio.ch.SelectionKeyImpl.interestOps(SelectionKeyImpl.java:77)
>>> >   at org.apache.zookeeper.server.NIOServerCnxn.sendBuffer(NIOServerCnxn.java:151)
>>> >   at org.apache.zookeeper.server.NIOServerCnxn.sendResponse(NIOServerCnxn.java:1081)
>>> >   at org.apache.zookeeper.server.FinalRequestProcessor.processRequest(FinalRequestProcessor.java:404)
>>> >   at org.apache.zookeeper.server.quorum.CommitProcessor.run(CommitProcessor.java:74)
>>> >
>>> > Apache NiFi 1.2.0
>>> >
>>> > Thoughts?
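For reference, the 3-server embedded ZK arrangement described above maps to configuration roughly as follows; this is a sketch with placeholder hostnames, not the exact settings in use. In conf/zookeeper.properties on the three ZK nodes:

  server.1=nifi1.example.com:2888:3888
  server.2=nifi2.example.com:2888:3888
  server.3=nifi3.example.com:2888:3888

with a matching myid file (containing 1, 2, or 3) in each server's ZK state directory, and in nifi.properties on all five nodes:

  # set to true only on the three nodes that run an embedded ZK server
  nifi.state.management.embedded.zookeeper.start=true
  nifi.zookeeper.connect.string=nifi1.example.com:2181,nifi2.example.com:2181,nifi3.example.com:2181

Switching to an external ensemble is then mostly a matter of setting nifi.state.management.embedded.zookeeper.start=false on all nodes and pointing nifi.zookeeper.connect.string at the external hosts.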
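On Joe's garbage-collection point, one way to confirm or rule out GC pauses is to enable GC logging in conf/bootstrap.conf. A sketch, assuming Java 8; the java.arg numbers are arbitrary and just need to be unique among the existing entries:

  # append after the existing java.arg.* lines
  java.arg.20=-verbose:gc
  java.arg.21=-XX:+PrintGCDetails
  java.arg.22=-XX:+PrintGCDateStamps
  java.arg.23=-Xloggc:./logs/gc.log

Pauses in gc.log approaching the 3-second default ZooKeeper session timeout would line up with the periodic SUSPENDED transitions.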
