Looked at a secured cluster and the send times are routinely at 100ms, similar to yours. I think what I was flagging as potentially interesting is not interesting at all.
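If it helps to compare numbers across clusters, here is a minimal sketch that summarizes the heartbeat send times from a nifi-app.log; it only assumes the "send took N millis" wording shown in the quoted logs below:

    import re

    # Collect every "send took N millis" value logged by the heartbeater.
    times = []
    with open("nifi-app.log") as log:
        for line in log:
            match = re.search(r"send took (\d+) millis", line)
            if match:
                times.append(int(match.group(1)))

    if times:
        print("count=%d min=%d max=%d mean=%.1f millis"
              % (len(times), min(times), max(times), sum(times) / len(times)))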
On Thu, May 25, 2017 at 11:34 AM, Joe Witt <[email protected]> wrote:
> Ok. Well, as a point of comparison I'm looking at heartbeat logs from
> another cluster and the times are consistently 1-3 millis for the
> send. Yours above show 100+ms typical with one north of 900ms. Not
> sure how relevant that is, but something I noticed.
>
> On Thu, May 25, 2017 at 11:29 AM, Mark Bean <[email protected]> wrote:
>> ping shows an acceptably fast response time between servers,
>> approximately 0.100-0.150 ms
>>
>> On Thu, May 25, 2017 at 11:13 AM, Joe Witt <[email protected]> wrote:
>>
>>> Have you evaluated latency across the machines in your cluster? I ask
>>> because 122ms is pretty long and 917ms is very long. Are these nodes
>>> across a WAN link?
>>>
>>> On Thu, May 25, 2017 at 11:08 AM, Mark Bean <[email protected]> wrote:
>>> > Update: now all 5 nodes, regardless of ZK server, are indicating
>>> > SUSPENDED -> RECONNECTED.
>>> >
>>> > On Thu, May 25, 2017 at 10:23 AM, Mark Bean <[email protected]> wrote:
>>> >
>>> >> I reduced the number of embedded ZooKeeper servers on the 5-node NiFi
>>> >> cluster from 5 to 3. This has improved the situation. I no longer see
>>> >> any of the three nodes which are also ZK servers disconnecting from and
>>> >> reconnecting to the cluster as before. However, the two nodes which are
>>> >> not running ZK continue to disconnect and reconnect. The following is
>>> >> taken from one of the non-ZK nodes. It's curious that some messages are
>>> >> issued twice from the same thread, but reference different objects.
>>> >>
>>> >> nifi-app.log:
>>> >> 2017-05-25 13:40:01,628 INFO [main-EventThread] o.a.c.f.state.ConnectionStateManager State change: SUSPENDED
>>> >> 2017-05-25 13:39:45,627 INFO [Clustering Tasks Thread-1] o.a.n.c.c.ClusterProtocolHeartbeater Heartbeat created at 2017-05-25 13:39:45,504 and sent to FQDN:PORT at 2017-05-25 13:39:45,627; send took 122 millis
>>> >> 2017-05-25 13:39:50,862 INFO [Clustering Tasks Thread-1] o.a.n.c.c.ClusterProtocolHeartbeater Heartbeat created at 2017-05-25 13:39:50,732 and sent to FQDN:PORT at 2017-05-25 13:39:50,862; send took 122 millis
>>> >> 2017-05-25 13:39:56,089 INFO [Clustering Tasks Thread-1] o.a.n.c.c.ClusterProtocolHeartbeater Heartbeat created at 2017-05-25 13:39:55,966 and sent to FQDN:PORT at 2017-05-25 13:39:56,089; send took 129 millis
>>> >> 2017-05-25 13:40:01,629 INFO [Curator-ConnectionStateManager-0] o.a.n.c.l.e.CuratorLeaderElectionManager org.apache.nifi.controller.leader.election.CuratorLeaderElectionManager$ElectionListener@68f8b6a2 Connection State changed to SUSPENDED
>>> >> 2017-05-25 13:40:01,629 INFO [Curator-ConnectionStateManager-0] o.a.n.c.l.e.CuratorLeaderElectionManager org.apache.nifi.controller.leader.election.CuratorLeaderElectionManager$ElectionListener@663f55cd Connection State changed to SUSPENDED
>>> >> 2017-05-25 13:40:02,412 INFO [main-EventThread] o.a.c.f.state.ConnectionStateManager State change: RECONNECTED
>>> >> 2017-05-25 13:40:02,413 INFO [Curator-ConnectionStateManager-0] o.a.n.c.l.e.CuratorLeaderElectionManager org.apache.nifi.controller.leader.election.CuratorLeaderElectionManager$ElectionListener@68f8b6a2 Connection State changed to RECONNECTED
>>> >> 2017-05-25 13:40:02,413 INFO [Curator-ConnectionStateManager-0] o.a.n.c.l.e.CuratorLeaderElectionManager org.apache.nifi.controller.leader.election.CuratorLeaderElectionManager$ElectionListener@663f55cd Connection State changed to RECONNECTED
>>> >> 2017-05-25 13:40:02,550 INFO [Clustering Tasks Thread-1] o.a.n.c.c.ClusterProtocolHeartbeater Heartbeat created at 2017-05-25 13:40:01,632 and sent to FQDN:PORT at 2017-05-25 13:40:02,550; send took 917 millis
>>> >> 2017-05-25 13:40:07,787 INFO [Clustering Tasks Thread-1] o.a.n.c.c.ClusterProtocolHeartbeater Heartbeat created at 2017-05-25 13:40:07,657 and sent to FQDN:PORT at 2017-05-25 13:40:07,787; send took 129 millis
>>> >>
>>> >> I will work on setting up an external ZK next, but would still like some
>>> >> insight into what is being observed with the embedded ZK.
>>> >>
>>> >> Thanks,
>>> >> Mark
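For the move off the embedded ZK, the switch on the NiFi side is typically just a couple of nifi.properties entries; a minimal sketch, assuming a three-node external quorum on placeholder hostnames:

    # Do not start the embedded ZooKeeper server with NiFi
    nifi.state.management.embedded.zookeeper.start=false

    # Point the ZK client at the external quorum instead (hostnames are placeholders)
    nifi.zookeeper.connect.string=zk1.example.com:2181,zk2.example.com:2181,zk3.example.com:2181

The same connect string also needs to go into the Connect String property of the ZooKeeperStateProvider in conf/state-management.xml.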
>>> >>
>>> >> On Wed, May 24, 2017 at 3:57 PM, Mark Bean <[email protected]> wrote:
>>> >>
>>> >>> Yes, we are using the embedded ZK. We will try instantiating an
>>> >>> external ZK and see if that resolves the problem.
>>> >>>
>>> >>> The load on the system is extremely small. Currently (as nodes are
>>> >>> disconnecting/reconnecting) all input ports to the flow are turned off.
>>> >>> The only data in the flow is from a single GenerateFlowFile generating
>>> >>> 5B every 30 secs.
>>> >>>
>>> >>> Also, it is a 5-node cluster with embedded ZK on each node. First, I
>>> >>> will try reducing ZK to only 3 nodes. Then, I will try a 3-node
>>> >>> external ZK.
>>> >>>
>>> >>> Thanks,
>>> >>> Mark
>>> >>>
>>> >>> On Wed, May 24, 2017 at 11:49 AM, Joe Witt <[email protected]> wrote:
>>> >>>
>>> >>>> Are you using the embedded ZooKeeper? If yes, we recommend using an
>>> >>>> external ZooKeeper.
>>> >>>>
>>> >>>> What type of load are the systems under when this occurs (cpu,
>>> >>>> network, memory, disk io)? Under high load the default timeouts for
>>> >>>> clustering are too aggressive. You can relax these for higher-load
>>> >>>> clusters and should see good behavior. Even if the system overall is
>>> >>>> not under all that high a load, if you're seeing garbage collection
>>> >>>> pauses that are lengthy and/or frequent, it can cause the same
>>> >>>> high-load effect as far as the JVM is concerned.
>>> >>>>
>>> >>>> Thanks
>>> >>>> Joe
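On the "relax the timeouts" point, these are the nifi.properties knobs usually involved; the values below are illustrative starting points rather than recommendations:

    # How often each node heartbeats to the Cluster Coordinator
    nifi.cluster.protocol.heartbeat.interval=5 sec

    # Socket timeouts for node-to-node cluster protocol requests
    nifi.cluster.node.connection.timeout=30 sec
    nifi.cluster.node.read.timeout=30 sec

    # ZooKeeper client timeouts; the 3-sec defaults are easy to trip
    # under load or during a long GC pause
    nifi.zookeeper.connect.timeout=10 secs
    nifi.zookeeper.session.timeout=10 secs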
>>> >>>>
>>> >>>> On Wed, May 24, 2017 at 9:11 AM, Mark Bean <[email protected]> wrote:
>>> >>>>
>>> >>>> > We have a cluster which is showing signs of instability. The
>>> >>>> > Primary Node and Coordinator are reassigned to different nodes
>>> >>>> > every several minutes. I believe this is due to lack of heartbeat
>>> >>>> > or other coordination. The following error occurs periodically in
>>> >>>> > the nifi-app.log:
>>> >>>> >
>>> >>>> > ERROR [CommitProcessor:1] o.apache.zookeeper.server.NIOServerCnxn Unexpected Exception:
>>> >>>> > java.nio.channels.CancelledKeyException: null
>>> >>>> >     at sun.nio.ch.SelectionKeyImpl.ensureValid(SelectionKeyImpl.java:73)
>>> >>>> >     at sun.nio.ch.SelectionKeyImpl.interestOps(SelectionKeyImpl.java:77)
>>> >>>> >     at org.apache.zookeeper.server.NIOServerCnxn.sendBuffer(NIOServerCnxn.java:151)
>>> >>>> >     at org.apache.zookeeper.server.NIOServerCnxn.sendResponse(NIOServerCnxn.java:1081)
>>> >>>> >     at org.apache.zookeeper.server.FinalRequestProcessor.processRequest(FinalRequestProcessor.java:404)
>>> >>>> >     at org.apache.zookeeper.server.quorum.CommitProcessor.run(CommitProcessor.java:74)
>>> >>>> >
>>> >>>> > Apache NiFi 1.2.0
>>> >>>> >
>>> >>>> > Thoughts?
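And for the planned 3-node external ensemble, a minimal zoo.cfg sketch (hostnames, ports, and dataDir are placeholders; each server also needs a matching myid file in dataDir):

    # Base time unit in ms; ZK session timeouts are negotiated as multiples of this
    tickTime=2000
    # Ticks a follower may take to initially connect and sync with the leader
    initLimit=10
    # Ticks a follower may lag behind the leader before being dropped
    syncLimit=5
    dataDir=/var/lib/zookeeper
    clientPort=2181
    # One entry per quorum member: host:quorumPort:leaderElectionPort
    server.1=zk1.example.com:2888:3888
    server.2=zk2.example.com:2888:3888
    server.3=zk3.example.com:2888:3888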
