Update: now all 5 nodes, regardless of whether they run a ZK server, are indicating SUSPENDED -> RECONNECTED.
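For reference, relaxing the cluster and ZooKeeper timeouts that Joe suggests below is done in nifi.properties on each node. A minimal sketch, assuming NiFi 1.2.0; the relaxed values are purely illustrative:

  # NiFi 1.2.0 defaults: heartbeat interval 5 sec, node connection/read
  # timeouts 5 secs, ZooKeeper connect/session timeouts 3 secs
  nifi.cluster.protocol.heartbeat.interval=10 sec
  nifi.cluster.node.connection.timeout=30 secs
  nifi.cluster.node.read.timeout=30 secs
  nifi.zookeeper.connect.timeout=10 secs
  nifi.zookeeper.session.timeout=10 secs

The ZooKeeper session timeout is the most relevant to the SUSPENDED -> RECONNECTED cycling: a longer session gives the Curator client more room to ride out pauses before the connection is considered lost.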
On Thu, May 25, 2017 at 10:23 AM, Mark Bean <[email protected]> wrote:
> I reduced the number of embedded ZooKeeper servers on the 5-node NiFi
> cluster from 5 to 3. This has improved the situation. I no longer see any
> of the three nodes which are also ZK servers disconnecting from and
> reconnecting to the cluster as before. However, the two nodes which are
> not running ZK continue to disconnect and reconnect. The following is
> taken from one of the non-ZK nodes. It's curious that some messages are
> issued twice from the same thread, but reference a different object.
>
> nifi-app.log:
> 2017-05-25 13:39:45,627 INFO [Clustering Tasks Thread-1]
> o.a.n.c.c.ClusterProtocolHeartbeater
> Heartbeat created at 2017-05-25 13:39:45,504 and sent to FQDN:PORT at
> 2017-05-25 13:39:45,627; send took 122 millis
> 2017-05-25 13:39:50,862 INFO [Clustering Tasks Thread-1]
> o.a.n.c.c.ClusterProtocolHeartbeater
> Heartbeat created at 2017-05-25 13:39:50,732 and sent to FQDN:PORT at
> 2017-05-25 13:39:50,862; send took 122 millis
> 2017-05-25 13:39:56,089 INFO [Clustering Tasks Thread-1]
> o.a.n.c.c.ClusterProtocolHeartbeater
> Heartbeat created at 2017-05-25 13:39:55,966 and sent to FQDN:PORT at
> 2017-05-25 13:39:56,089; send took 129 millis
> 2017-05-25 13:40:01,628 INFO [main-EventThread]
> o.a.c.f.state.ConnectionStateManager
> State change: SUSPENDED
> 2017-05-25 13:40:01,629 INFO [Curator-ConnectionStateManager-0]
> o.a.n.c.l.e.CuratorLeaderElectionManager
> org.apache.nifi.controller.leader.election.CuratorLeaderElectionManager$ElectionListener@68f8b6a2
> Connection State changed to SUSPENDED
> 2017-05-25 13:40:01,629 INFO [Curator-ConnectionStateManager-0]
> o.a.n.c.l.e.CuratorLeaderElectionManager
> org.apache.nifi.controller.leader.election.CuratorLeaderElectionManager$ElectionListener@663f55cd
> Connection State changed to SUSPENDED
> 2017-05-25 13:40:02,412 INFO [main-EventThread]
> o.a.c.f.state.ConnectionStateManager
> State change: RECONNECTED
> 2017-05-25 13:40:02,413 INFO [Curator-ConnectionStateManager-0]
> o.a.n.c.l.e.CuratorLeaderElectionManager
> org.apache.nifi.controller.leader.election.CuratorLeaderElectionManager$ElectionListener@68f8b6a2
> Connection State changed to RECONNECTED
> 2017-05-25 13:40:02,413 INFO [Curator-ConnectionStateManager-0]
> o.a.n.c.l.e.CuratorLeaderElectionManager
> org.apache.nifi.controller.leader.election.CuratorLeaderElectionManager$ElectionListener@663f55cd
> Connection State changed to RECONNECTED
> 2017-05-25 13:40:02,550 INFO [Clustering Tasks Thread-1]
> o.a.n.c.c.ClusterProtocolHeartbeater
> Heartbeat created at 2017-05-25 13:40:01,632 and sent to FQDN:PORT at
> 2017-05-25 13:40:02,550; send took 917 millis
> 2017-05-25 13:40:07,787 INFO [Clustering Tasks Thread-1]
> o.a.n.c.c.ClusterProtocolHeartbeater
> Heartbeat created at 2017-05-25 13:40:07,657 and sent to FQDN:PORT at
> 2017-05-25 13:40:07,787; send took 129 millis
>
> I will work on setting up an external ZK next, but would still like some
> insight into what is being observed with the embedded ZK.
>
> Thanks,
> Mark
>
> On Wed, May 24, 2017 at 3:57 PM, Mark Bean <[email protected]> wrote:
>
>> Yes, we are using the embedded ZK. We will try instantiating an external
>> ZK and see if that resolves the problem.
>>
>> The load on the system is extremely small. Currently (as nodes are
>> disconnecting/reconnecting) all input ports to the flow are turned off.
>> The only data in the flow is from a single GenerateFlowFile processor
>> generating 5 bytes every 30 secs.
>>
>> Also, it is a 5-node cluster with embedded ZK on each node. First, I
>> will try reducing ZK to only 3 nodes. Then, I will try a 3-node
>> external ZK.
>>
>> Thanks,
>> Mark
>>
>> On Wed, May 24, 2017 at 11:49 AM, Joe Witt <[email protected]> wrote:
>>
>>> Are you using the embedded ZooKeeper? If yes, we recommend using an
>>> external ZooKeeper.
>>>
>>> What type of load are the systems under when this occurs (CPU,
>>> network, memory, disk I/O)? Under high load, the default timeouts for
>>> clustering are too aggressive. You can relax these for higher-load
>>> clusters and should see good behavior. Even if the system overall is
>>> not under all that high a load, garbage collection pauses that are
>>> lengthy and/or frequent can cause the same high-load effect as far as
>>> the JVM is concerned.
>>>
>>> Thanks,
>>> Joe
>>>
>>> On Wed, May 24, 2017 at 9:11 AM, Mark Bean <[email protected]>
>>> wrote:
>>> > We have a cluster which is showing signs of instability. The Primary
>>> > Node and Coordinator are reassigned to different nodes every several
>>> > minutes. I believe this is due to a lack of heartbeat or other
>>> > coordination. The following error occurs periodically in
>>> > nifi-app.log:
>>> >
>>> > ERROR [CommitProcessor:1] o.apache.zookeeper.server.NIOServerCnxn
>>> > Unexpected Exception:
>>> > java.nio.channels.CancelledKeyException: null
>>> >   at sun.nio.ch.SelectionKeyImpl.ensureValid(SelectionKeyImpl.java:73)
>>> >   at sun.nio.ch.SelectionKeyImpl.interestOps(SelectionKeyImpl.java:77)
>>> >   at org.apache.zookeeper.server.NIOServerCnxn.sendBuffer(NIOServerCnxn.java:151)
>>> >   at org.apache.zookeeper.server.NIOServerCnxn.sendResponse(NIOServerCnxn.java:1081)
>>> >   at org.apache.zookeeper.server.FinalRequestProcessor.processRequest(FinalRequestProcessor.java:404)
>>> >   at org.apache.zookeeper.server.quorum.CommitProcessor.run(CommitProcessor.java:74)
>>> >
>>> > Apache NiFi 1.2.0
>>> >
>>> > Thoughts?
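For reference, the 3-server embedded ZK arrangement described above maps to configuration roughly as follows; this is a sketch with placeholder hostnames, not the exact settings in use. In conf/zookeeper.properties on the three ZK nodes:

  server.1=nifi1.example.com:2888:3888
  server.2=nifi2.example.com:2888:3888
  server.3=nifi3.example.com:2888:3888

with a matching myid file (containing 1, 2, or 3) in each server's ZK state directory, and in nifi.properties on all five nodes:

  # set to true only on the three nodes that run an embedded ZK server
  nifi.state.management.embedded.zookeeper.start=true
  nifi.zookeeper.connect.string=nifi1.example.com:2181,nifi2.example.com:2181,nifi3.example.com:2181

Switching to an external ensemble is then mostly a matter of setting nifi.state.management.embedded.zookeeper.start=false on all nodes and pointing nifi.zookeeper.connect.string at the external hosts.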
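On Joe's garbage-collection point, one way to confirm or rule out GC pauses is to enable GC logging in conf/bootstrap.conf. A sketch, assuming Java 8; the java.arg numbers are arbitrary and just need to be unique among the existing entries:

  # append after the existing java.arg.* lines
  java.arg.20=-verbose:gc
  java.arg.21=-XX:+PrintGCDetails
  java.arg.22=-XX:+PrintGCDateStamps
  java.arg.23=-Xloggc:./logs/gc.log

Pauses in gc.log approaching the 3-second default ZooKeeper session timeout would line up with the periodic SUSPENDED transitions.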
