I did indeed have a port problem; thank you for leading me to that. I'm using the default ZooKeeper client port of 2181, so I updated the ZooKeeper connect string (in both state-management.xml and nifi.properties) to: FQDN1:2181,FQDN2:2181,FQDN3:2181
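For reference, the two entries now read as follows on every node (FQDN1-3 stand in for the actual host names):

nifi.properties:
nifi.zookeeper.connect.string=FQDN1:2181,FQDN2:2181,FQDN3:2181

state-management.xml, in the cluster-provider element:
<property name="Connect String">FQDN1:2181,FQDN2:2181,FQDN3:2181</property>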
I am continuing to use :11001:11000 in place of the recommended :2888:3888 in the zookeeper.properties file, because of which ports are available on these hosts:

server.1=<FQDN1>:11001:11000
server.2=<FQDN2>:11001:11000
server.3=<FQDN3>:11001:11000

I also had a cut/paste error. I actually had:
nifi.cluster.flow.election.max.candidates=2 (not 201, as originally stated)
The rationale was that once 2 of 3 Nodes connected, the flow could be accepted. In either case, based on your comments, I set this to 3 since there are 3 Nodes:
nifi.cluster.flow.election.max.candidates=3

Also, I reduced the wait time to 2 mins (from the default 5 mins) hoping to either connect or fail more quickly:
nifi.cluster.flow.election.max.wait.time=2 mins

I cleaned everything out of ./state except for the ./state/zookeeper/myid file, and I removed all flow.xml.gz files.

Now, I am seeing the same "Background retry gave up" errors continuously being reported in the nifi-app.log on one Node. The other two Nodes remain hung with the last nifi-app.log entry being:

INFO [main] o.a.nifi.properties.NiFiPropertiesLoader Loaded 122 properties from <path>/nifi.properties

As noted earlier, on the Node generating the errors, immediately before they begin, I see:

2017-02-28 14:22:46,489 INFO [main] o.a.n.c.l.e.CuratorLeaderElectionManager CuratorLeaderElectionManager[stopped=false] Attempted to register Leader Election for role 'Cluster Coordinator' but this role is already registered
2017-02-28 14:22:53,506 INFO [Curator-Framework-0] o.a.c.f.state.ConnectionStateManager State change: SUSPENDED
2017-02-28 14:22:53,510 INFO [Curator-ConnectionStateManager-0] o.a.n.c.l.e.CuratorLeaderElectionManager org.apache.nifi.controller.leader.election.CuratorLeaderElectionManager$ElectionListener@1e175829 Connection State changed to SUSPENDED

Is it expected that the connection state is SUSPENDED? What reasons might cause the two Nodes to apparently hang, with no errors or warnings, before connecting to the cluster? From the log files, I can't even tell whether the Nodes are trying to connect.

On Tue, Feb 28, 2017 at 2:36 PM, Jeff <[email protected]> wrote:
> Mark,
> There are some copy/paste errors in my last response as well. Sorry!
> server.1=<FQDN1>:2888:3888
> server.2=<FQDN2>:2888:3888
> server.3=<FQDN3>:2888:3888
>
> On Tue, Feb 28, 2017 at 2:31 PM Jeff <[email protected]> wrote:
> >
> > Mark,
> >
> > In my original response, I said that in zookeeper.properties, the server.N properties should be set to the host:port of your ZK server, and that was pretty ambiguous. It should not be set to the same port as clientPort.
> >
> > As Bryan mentioned, with the default clientPort set to 2181, typically the server.N properties are set to hostname:2888:3888. In your case, you might want to try something like the following, as long as these ports are not currently in use:
> > server.1=<FQDN1>:2888:3888
> > server.2=<FQDN1>:2888:3888
> > server.3=<FQDN1>:2888:3888
> >
> > Also, your settings for leader elections:
> > nifi.cluster.flow.election.max.wait.time=5 mins
> > nifi.cluster.flow.election.max.candidates=201
> >
> > This will wait for 201 election candidates to connect, or 5 minutes. You might want to set the max candidates to 3, since you have 3 nodes in your cluster.
> >
> > The contents of ./state/zookeeper look correct, you should be okay there.
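(Side note on how to tell whether the nodes are even reaching ZooKeeper: each embedded server can be probed directly with ZooKeeper's four-letter "stat" command. This is only a quick sketch and assumes nc is available and that clientPort is left at its default of 2181:

echo stat | nc FQDN1 2181
echo stat | nc FQDN2 2181
echo stat | nc FQDN3 2181

A response containing a "Mode: leader" or "Mode: follower" line means that server has joined a quorum; a refused connection or empty reply suggests the embedded server on that node is not yet listening on its client port.)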
> >
> >
> > On Tue, Feb 28, 2017 at 2:19 PM Bryan Bende <[email protected]> wrote:
> >
> > Mark,
> >
> > I am not totally sure, but there could be an issue with the ports in some of the connect strings.
> >
> > In zookeeper.properties there is an entry for clientPort which defaults to 2181. The value of this property is what should be referenced in nifi.zookeeper.connect.string and the state-management.xml Connect String, so if you left it alone then:
> >
> > FQDN1:2181,FQDN2:2181,FQDN3:2181
> >
> > In the server entries in zookeeper.properties, I believe they should be referencing different ports. For example, when using the default clientPort=2181 the server entries are typically like:
> >
> > server.1=localhost:2888:3888
> >
> > From the ZooKeeper docs the definition for these two ports is:
> >
> > "There are two port numbers nnnnn. The first followers use to connect to the leader, and the second is for leader election. The leader election port is only necessary if electionAlg is 1, 2, or 3 (default). If electionAlg is 0, then the second port is not necessary. If you want to test multiple servers on a single machine, then different ports can be used for each server."
> >
> > In your configs it looks like the clientPort and the first port in the server string are both 11001, so I think making those different should do the trick.
> >
> > -Bryan
> >
> >
> > On Tue, Feb 28, 2017 at 1:58 PM, Mark Bean <[email protected]> wrote:
> > > Relevant properties from nifi.properties:
> > > nifi.state.management.provider.cluster=zk-provider
> > > nifi.state.management.embedded.zookeeper.start=true
> > > nifi.state.management.embedded.zookeeper.properties=./conf/zookeeper.properties
> > > nifi.cluster.protocol.heartbeat.interval=5 sec
> > > nifi.cluster.protocol.is.secure=true
> > > ## Security properties verified; they work for https in non-cluster configuration
> > >
> > > nifi.cluster.is.node=true
> > > nifi.cluster.node.address=FQDN1
> > > nifi.cluster.node.protocol.port=9445
> > > nifi.cluster.node.protocol.threads=10
> > > nifi.cluster.node.event.history.size=25
> > > nifi.cluster.node.connection.timeout=5 sec
> > > nifi.cluster.node.read.timeout=5 sec
> > > nifi.cluster.firewall.file=
> > > nifi.cluster.flow.election.max.wait.time=5 mins
> > > nifi.cluster.flow.election.max.candidates=201
> > >
> > > nifi.zookeeper.connect.string=FQDN1:11001,FQDN2:11001,FQDN3:11001
> > > nifi.zookeeper.connect.timeout=3 secs
> > > nifi.zookeeper.session.timeout=3 secs
> > > nifi.zookeeper.root.node=/nifi/test-cluster
> > >
> > > zookeeper.properties all default except added these lines:
> > > server.1=<FQDN1>:11001:11000
> > > server.2=<FQDN2>:11001:11000
> > > server.3=<FQDN3>:11001:11000
> > >
> > > state-management.xml all default except the following in <cluster-provider>:
> > > <property name="Connect String">FQDN1:11001,FQDN2:11001,FQDN3:11001</property>
> > > <property name="Root Node">/nifi/test-cluster</property>
> > >
> > > Also, the ./state/zookeeper/myid consists of only "1", "2", or "3" depending on the server within the cluster. Is this correct?
> > >
> > >
> > > On Tue, Feb 28, 2017 at 1:24 PM, Jeff <[email protected]> wrote:
> > >
> > >> Hello Mark,
> > >>
> > >> Sorry to hear that you're having issues with getting your cluster up and running. Could you provide the content of your nifi.properties file?
> > >> Also, please check the Admin guide for ZK setup [1], particularly the Flow Election and Basic Cluster Setup sections.
> > >>
> > >> By default, nifi.properties uses a 5-minute election duration to elect the primary node. However, it does not have a default number of candidates for the election, so typically it will take 5 minutes for that election process when you have a 3-node cluster. You could try setting nifi.cluster.flow.election.max.candidates to 3, and restart the cluster, but based on the errors you're seeing, I think there may be some other issues.
> > >>
> > >> Some key properties to check:
> > >>
> > >> nifi.properties:
> > >> nifi.state.management.embedded.zookeeper.start (true for embedded ZK, false or blank if you're using an external ZK)
> > >> nifi.zookeeper.connect.string (set to the connect string for your ZK quorum, regardless of embedded or external ZK, e.g. host1:2181,host2:2181,host3:2181)
> > >>
> > >> zookeeper.properties:
> > >> server.1 (server.1 through server.N, should be set to the hostname:port of each ZK server in your cluster, regardless of embedded or external ZK)
> > >>
> > >> state-management.xml, under cluster-provider element:
> > >> <property name="Connect String"></property> (set to the connect string to access your ZK quorum, used by processors to store cluster-based state)
> > >>
> > >> [1]
> > >> https://nifi.apache.org/docs/nifi-docs/html/administration-guide.html#clustering
> > >>
> > >> On Tue, Feb 28, 2017 at 12:56 PM Mark Bean <[email protected]> wrote:
> > >>
> > >> > I am attempting to setup a new Cluster with 3 Nodes initially. Each node is reporting zookeeper/curator errors, and the Cluster is not able to connect the Nodes. The error is reported many times per second and is continuous on all Nodes:
> > >> >
> > >> > 2017-02-28 14:22:53,515 ERROR [Curator-Framework-0] o.a.c.f.imps.CuratorFrameworkImpl Background operation retry gave up
> > >> > org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss
> > >> > at org.apache.zookeeper.KeeperException.create(KeeperException.java:99) ~[zookeeper-3.4.6.jar:3.4.6-1569965]
> > >> > at org.apache.curator.framework.imps.CuratorFrameworkImpl.checkBackgroundRetry(CuratorFrameworkImpl.java:728) [curator-framework-2.11.0.jar:na]
> > >> > at org.apache.curator.framework.imps.CuratorFrameworkImpl.performBackgroundOperation(CuratorFrameworkImpl.java:857) [curator-framework-2.11.0.jar:na]
> > >> > at org.apache.curator.framework.imps.CuratorFrameworkImpl.backgroundOperationsLoop(CuratorFrameworkImpl.java:809) [curator-framework-2.11.0.jar:na]
> > >> > at org.apache.curator.framework.imps.CuratorFrameworkImpl.access$300(CuratorFrameworkImpl.java:64) [curator-framework-2.11.0.jar:na]
> > >> > at org.apache.curator.framework.imps.CuratorFrameworkImpl$4.call(CuratorFrameworkImpl.java:267) [curator-framework-2.11.0.jar:na]
> > >> > at java.util.concurrent.FutureTask.run(FutureTask.java:266) [na:1.8.0_121]
> > >> > at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180) [na:1.8.0_121]
> > >> > at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293) [na:1.8.0_121]
> > >> > at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [na:1.8.0_121]
> > >> > at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [na:1.8.0_121]
> > >> > at java.lang.Thread.run(Thread.java:745) [na:1.8.0_121]
> > >> > 2017-02-28 14:22:53,516 ERROR [Curator-Framework-0] o.a.c.f.imps.CuratorFrameworkImpl Background retry gave up
> > >> > org.apache.curator.CuratorConnectionLossException: KeeperErrorCode = ConnectionLoss
> > >> > at org.apache.curator.framework.imps.CuratorFrameworkImpl.performBackgroundOperation(CuratorFrameworkImpl.java:838) [curator-framework-2.11.0.jar:na]
> > >> > at org.apache.curator.framework.imps.CuratorFrameworkImpl.backgroundOperationsLoop(CuratorFrameworkImpl.java:809) [curator-framework-2.11.0.jar:na]
> > >> > at org.apache.curator.framework.imps.CuratorFrameworkImpl.access$300(CuratorFrameworkImpl.java:64) [curator-framework-2.11.0.jar:na]
> > >> > at org.apache.curator.framework.imps.CuratorFrameworkImpl$4.call(CuratorFrameworkImpl.java:267) [curator-framework-2.11.0.jar:na]
> > >> > at java.util.concurrent.FutureTask.run(FutureTask.java:266) [na:1.8.0_121]
> > >> > at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180) [na:1.8.0_121]
> > >> > at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293) [na:1.8.0_121]
> > >> > at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [na:1.8.0_121]
> > >> > at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [na:1.8.0_121]
> > >> > at java.lang.Thread.run(Thread.java:745) [na:1.8.0_121]
> > >> >
> > >> > While the above message was repeating in the log on one of the Nodes, another Node's log was "stuck" for a period of time with the last message being:
> > >> >
> > >> > INFO [main] o.a.nifi.properties.NiFiPropertiesLoader Loaded 122 properties from <path>/nifi.properties
> > >> >
> > >> > The next message to appear after nearly 6 minutes is:
> > >> >
> > >> > INFO [main] o.a.nifi.util.FileBasedVariableRegistry Loaded 91 properties from system properties and environment variables.
> > >> >
> > >> > The 6 minute delay seems curious.
> > >> >
> > >> > Then, the Node appears to start the zookeeper server but hits this error:
> > >> >
> > >> > ERROR [LearnerHandler-/10.6.218.9:22816] o.a.z.server.quorum.LearnerHandler Unexpected exception causing shutdown while sock still open
> > >> > java.io.EOFException: null
> > >> > at java.io.DataInputStream.readInt(DataInputStream.java:392) ~[na:1.8.0_121]
> > >> > at org.apache.jute.BinaryInputArchive.readString(BinaryInputArchive.java:79) ~[zookeeper-3.4.6.jar:3.4.6-1569965]
> > >> > at org.apache.zookeeper.data.Id.deserialize(Id.java:55) ~[zookeeper-3.4.6.jar:3.4.6-1569965]
> > >> > at org.apache.jute.BinaryInputArchive.readRecord(BinaryInputArchive.java:103) ~[zookeeper-3.4.6.jar:3.4.6-1569965]
> > >> > at org.apache.zookeeper.server.quorum.QuorumPacket.deserialize(QuorumPacket.java:92) ~[zookeeper-3.4.6.jar:3.4.6-1569965]
> > >> > at org.apache.jute.BinaryInputArchive.readRecord(BinaryInputArchive.java:103) ~[zookeeper-3.4.6.jar:3.4.6-1569965]
> > >> > at org.apache.zookeeper.server.quorum.LearnerHandler.run(LearnerHandler.java:309) ~[zookeeper-3.4.6.jar:3.4.6-1569965]
> > >> >
> > >> > This is soon followed by the repeating errors shown above ("Background operation retry gave up").
> > >> >
> > >> > It is as if the quorum vote does not succeed within a given timeframe and then it stops trying. Note: on one attempt to start the Cluster successfully, I removed all but one flow.xml.gz, and cleared all information in the ./state directory (except the ./state/zookeeper/myid file).
> > >> >
> > >> > Thanks for assistance in understanding what zookeeper is doing (or not doing) when starting up a new Cluster.
> > >> >
> > >> > -Mark
> > >> >
> > >>
> >
>
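P.S. To pull the pieces together, here is how ZooKeeper is laid out on each node after today's changes (this just restates the settings described above; only the myid value differs per node, and the 11001/11000 server ports are kept because 2888/3888 are not available here):

zookeeper.properties (clientPort left at the default 2181):
server.1=<FQDN1>:11001:11000
server.2=<FQDN2>:11001:11000
server.3=<FQDN3>:11001:11000

nifi.properties and state-management.xml connect strings, as noted at the top, now point at the client port:
FQDN1:2181,FQDN2:2181,FQDN3:2181

./state/zookeeper/myid: contains only "1", "2", or "3" depending on the node.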
