Looked at a secured cluster and the send times are routinely at 100ms, similar to yours. I think what I was flagging as potentially interesting is not interesting at all.
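If it helps to compare numbers across clusters, here is a minimal sketch that summarizes the heartbeat send times from a nifi-app.log; it only assumes the "send took N millis" wording shown in the quoted logs below:

    import re

    # Collect every "send took N millis" value logged by the heartbeater.
    times = []
    with open("nifi-app.log") as log:
        for line in log:
            match = re.search(r"send took (\d+) millis", line)
            if match:
                times.append(int(match.group(1)))

    if times:
        print("count=%d min=%d max=%d mean=%.1f millis"
              % (len(times), min(times), max(times), sum(times) / len(times)))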
On Thu, May 25, 2017 at 11:34 AM, Joe Witt <[email protected]> wrote:
> Ok. Well, as a point of comparison I'm looking at heartbeat logs from
> another cluster and the times are consistently 1-3 millis for the
> send. Yours above show 100+ms typical with one north of 900ms. Not
> sure how relevant that is, but something I noticed.
>
> On Thu, May 25, 2017 at 11:29 AM, Mark Bean <[email protected]> wrote:
>> ping shows an acceptably fast response time between servers,
>> approximately 0.100-0.150 ms
>>
>> On Thu, May 25, 2017 at 11:13 AM, Joe Witt <[email protected]> wrote:
>>
>>> Have you evaluated latency across the machines in your cluster? I ask
>>> because 122ms is pretty long and 917ms is very long. Are these nodes
>>> across a WAN link?
>>>
>>> On Thu, May 25, 2017 at 11:08 AM, Mark Bean <[email protected]> wrote:
>>> > Update: now all 5 nodes, regardless of ZK server, are indicating
>>> > SUSPENDED -> RECONNECTED.
>>> >
>>> > On Thu, May 25, 2017 at 10:23 AM, Mark Bean <[email protected]> wrote:
>>> >
>>> >> I reduced the number of embedded ZooKeeper servers on the 5-node NiFi
>>> >> cluster from 5 to 3. This has improved the situation. I no longer see
>>> >> any of the three nodes which are also ZK servers disconnecting from and
>>> >> reconnecting to the cluster as before. However, the two nodes which are
>>> >> not running ZK continue to disconnect and reconnect. The following is
>>> >> taken from one of the non-ZK nodes. It's curious that some messages are
>>> >> issued twice from the same thread, but reference different objects.
>>> >>
>>> >> nifi-app.log:
>>> >> 2017-05-25 13:40:01,628 INFO [main-EventThread] o.a.c.f.state.ConnectionStateManager State change: SUSPENDED
>>> >> 2017-05-25 13:39:45,627 INFO [Clustering Tasks Thread-1] o.a.n.c.c.ClusterProtocolHeartbeater Heartbeat created at 2017-05-25 13:39:45,504 and sent to FQDN:PORT at 2017-05-25 13:39:45,627; send took 122 millis
>>> >> 2017-05-25 13:39:50,862 INFO [Clustering Tasks Thread-1] o.a.n.c.c.ClusterProtocolHeartbeater Heartbeat created at 2017-05-25 13:39:50,732 and sent to FQDN:PORT at 2017-05-25 13:39:50,862; send took 122 millis
>>> >> 2017-05-25 13:39:56,089 INFO [Clustering Tasks Thread-1] o.a.n.c.c.ClusterProtocolHeartbeater Heartbeat created at 2017-05-25 13:39:55,966 and sent to FQDN:PORT at 2017-05-25 13:39:56,089; send took 129 millis
>>> >> 2017-05-25 13:40:01,629 INFO [Curator-ConnectionStateManager-0] o.a.n.c.l.e.CuratorLeaderElectionManager org.apache.nifi.controller.leader.election.CuratorLeaderElectionManager$ElectionListener@68f8b6a2 Connection State changed to SUSPENDED
>>> >> 2017-05-25 13:40:01,629 INFO [Curator-ConnectionStateManager-0] o.a.n.c.l.e.CuratorLeaderElectionManager org.apache.nifi.controller.leader.election.CuratorLeaderElectionManager$ElectionListener@663f55cd Connection State changed to SUSPENDED
>>> >> 2017-05-25 13:40:02,412 INFO [main-EventThread] o.a.c.f.state.ConnectionStateManager State change: RECONNECTED
>>> >> 2017-05-25 13:40:02,413 INFO [Curator-ConnectionStateManager-0] o.a.n.c.l.e.CuratorLeaderElectionManager org.apache.nifi.controller.leader.election.CuratorLeaderElectionManager$ElectionListener@68f8b6a2 Connection State changed to RECONNECTED
>>> >> 2017-05-25 13:40:02,413 INFO [Curator-ConnectionStateManager-0] o.a.n.c.l.e.CuratorLeaderElectionManager org.apache.nifi.controller.leader.election.CuratorLeaderElectionManager$ElectionListener@663f55cd Connection State changed to RECONNECTED
>>> >> 2017-05-25 13:40:02,550 INFO [Clustering Tasks Thread-1] o.a.n.c.c.ClusterProtocolHeartbeater Heartbeat created at 2017-05-25 13:40:01,632 and sent to FQDN:PORT at 2017-05-25 13:40:02,550; send took 917 millis
>>> >> 2017-05-25 13:40:07,787 INFO [Clustering Tasks Thread-1] o.a.n.c.c.ClusterProtocolHeartbeater Heartbeat created at 2017-05-25 13:40:07,657 and sent to FQDN:PORT at 2017-05-25 13:40:07,787; send took 129 millis
>>> >>
>>> >> I will work on setting up an external ZK next, but would still like some
>>> >> insight into what is being observed with the embedded ZK.
>>> >>
>>> >> Thanks,
>>> >> Mark
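For the move off the embedded ZK, the switch on the NiFi side is typically just a couple of nifi.properties entries; a minimal sketch, assuming a three-node external quorum on placeholder hostnames:

    # Do not start the embedded ZooKeeper server with NiFi
    nifi.state.management.embedded.zookeeper.start=false

    # Point the ZK client at the external quorum instead (hostnames are placeholders)
    nifi.zookeeper.connect.string=zk1.example.com:2181,zk2.example.com:2181,zk3.example.com:2181

The same connect string also needs to go into the Connect String property of the ZooKeeperStateProvider in conf/state-management.xml.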
>>> >>
>>> >> On Wed, May 24, 2017 at 3:57 PM, Mark Bean <[email protected]> wrote:
>>> >>
>>> >>> Yes, we are using the embedded ZK. We will try instantiating an
>>> >>> external ZK and see if that resolves the problem.
>>> >>>
>>> >>> The load on the system is extremely small. Currently (as nodes are
>>> >>> disconnecting/reconnecting) all input ports to the flow are turned off.
>>> >>> The only data in the flow is from a single GenerateFlowFile generating
>>> >>> 5B every 30 secs.
>>> >>>
>>> >>> Also, it is a 5-node cluster with embedded ZK on each node. First, I
>>> >>> will try reducing ZK to only 3 nodes. Then, I will try a 3-node
>>> >>> external ZK.
>>> >>>
>>> >>> Thanks,
>>> >>> Mark
>>> >>>
>>> >>> On Wed, May 24, 2017 at 11:49 AM, Joe Witt <[email protected]> wrote:
>>> >>>
>>> >>>> Are you using the embedded ZooKeeper? If yes, we recommend using an
>>> >>>> external ZooKeeper.
>>> >>>>
>>> >>>> What type of load are the systems under when this occurs (cpu,
>>> >>>> network, memory, disk io)? Under high load the default timeouts for
>>> >>>> clustering are too aggressive. You can relax these for higher-load
>>> >>>> clusters and should see good behavior. Even if the system overall is
>>> >>>> not under all that high a load, if you're seeing garbage collection
>>> >>>> pauses that are lengthy and/or frequent, it can cause the same
>>> >>>> high-load effect as far as the JVM is concerned.
>>> >>>>
>>> >>>> Thanks
>>> >>>> Joe
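On the "relax the timeouts" point, these are the nifi.properties knobs usually involved; the values below are illustrative starting points rather than recommendations:

    # How often each node heartbeats to the Cluster Coordinator
    nifi.cluster.protocol.heartbeat.interval=5 sec

    # Socket timeouts for node-to-node cluster protocol requests
    nifi.cluster.node.connection.timeout=30 sec
    nifi.cluster.node.read.timeout=30 sec

    # ZooKeeper client timeouts; the 3-sec defaults are easy to trip
    # under load or during a long GC pause
    nifi.zookeeper.connect.timeout=10 secs
    nifi.zookeeper.session.timeout=10 secs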
>>> >>>>
>>> >>>> On Wed, May 24, 2017 at 9:11 AM, Mark Bean <[email protected]> wrote:
>>> >>>>
>>> >>>> > We have a cluster which is showing signs of instability. The
>>> >>>> > Primary Node and Coordinator are reassigned to different nodes
>>> >>>> > every several minutes. I believe this is due to lack of heartbeat
>>> >>>> > or other coordination. The following error occurs periodically in
>>> >>>> > the nifi-app.log:
>>> >>>> >
>>> >>>> > ERROR [CommitProcessor:1] o.apache.zookeeper.server.NIOServerCnxn Unexpected Exception:
>>> >>>> > java.nio.channels.CancelledKeyException: null
>>> >>>> >     at sun.nio.ch.SelectionKeyImpl.ensureValid(SelectionKeyImpl.java:73)
>>> >>>> >     at sun.nio.ch.SelectionKeyImpl.interestOps(SelectionKeyImpl.java:77)
>>> >>>> >     at org.apache.zookeeper.server.NIOServerCnxn.sendBuffer(NIOServerCnxn.java:151)
>>> >>>> >     at org.apache.zookeeper.server.NIOServerCnxn.sendResponse(NIOServerCnxn.java:1081)
>>> >>>> >     at org.apache.zookeeper.server.FinalRequestProcessor.processRequest(FinalRequestProcessor.java:404)
>>> >>>> >     at org.apache.zookeeper.server.quorum.CommitProcessor.run(CommitProcessor.java:74)
>>> >>>> >
>>> >>>> > Apache NiFi 1.2.0
>>> >>>> >
>>> >>>> > Thoughts?
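And for the planned 3-node external ensemble, a minimal zoo.cfg sketch (hostnames, ports, and dataDir are placeholders; each server also needs a matching myid file in dataDir):

    # Base time unit in ms; ZK session timeouts are negotiated as multiples of this
    tickTime=2000
    # Ticks a follower may take to initially connect and sync with the leader
    initLimit=10
    # Ticks a follower may lag behind the leader before being dropped
    syncLimit=5
    dataDir=/var/lib/zookeeper
    clientPort=2181
    # One entry per quorum member: host:quorumPort:leaderElectionPort
    server.1=zk1.example.com:2888:3888
    server.2=zk2.example.com:2888:3888
    server.3=zk3.example.com:2888:3888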
