Mark,

There are some copy/paste errors in my last response as well. Sorry!

server.1=<FQDN1>:2888:3888
server.2=<FQDN2>:2888:3888
server.3=<FQDN3>:2888:3888
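
For reference, a full conf/zookeeper.properties sketch for the three embedded servers might look like the following. Only the server.N lines come from this thread; clientPort, dataDir, tickTime, initLimit, and syncLimit are shown with typical defaults and should be checked against the file NiFi ships with:

    clientPort=2181
    dataDir=./state/zookeeper
    tickTime=2000
    initLimit=10
    syncLimit=5
    server.1=<FQDN1>:2888:3888
    server.2=<FQDN2>:2888:3888
    server.3=<FQDN3>:2888:3888

The same server.N block would go in the file on all three nodes.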

On Tue, Feb 28, 2017 at 2:31 PM Jeff <[email protected]> wrote:

Mark,

In my original response, I said that in zookeeper.properties the server.N properties should be set to the host:port of your ZK server, and that was pretty ambiguous. It should not be set to the same port as clientPort.

As Bryan mentioned, with the default clientPort set to 2181, the server.N properties are typically set to hostname:2888:3888. In your case, you might want to try something like the following, as long as these ports are not currently in use:

server.1=<FQDN1>:2888:3888
server.2=<FQDN1>:2888:3888
server.3=<FQDN1>:2888:3888

Also, your settings for leader election:

nifi.cluster.flow.election.max.wait.time=5 mins
nifi.cluster.flow.election.max.candidates=201

This will wait for 201 election candidates to connect, or 5 minutes. You might want to set the max candidates to 3, since you have 3 nodes in your cluster.

The contents of ./state/zookeeper look correct; you should be okay there.


On Tue, Feb 28, 2017 at 2:19 PM Bryan Bende <[email protected]> wrote:

Mark,

I am not totally sure, but there could be an issue with the ports in some of the connect strings.

In zookeeper.properties there is an entry for clientPort, which defaults to 2181. The value of this property is what should be referenced in nifi.zookeeper.connect.string and in the state-management.xml Connect String, so if you left it alone then:

FQDN1:2181,FQDN2:2181,FQDN3:2181

In the server entries in zookeeper.properties, I believe they should be referencing different ports. For example, when using the default clientPort=2181 the server entries are typically like:

server.1=localhost:2888:3888

From the ZooKeeper docs, the definition for these two ports is:

"There are two port numbers nnnnn. The first followers use to connect to the leader, and the second is for leader election. The leader election port is only necessary if electionAlg is 1, 2, or 3 (default). If electionAlg is 0, then the second port is not necessary. If you want to test multiple servers on a single machine, then different ports can be used for each server."

In your configs it looks like the clientPort and the first port in the server string are both 11001, so I think making those different should do the trick.

-Bryan
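
Mapping that suggestion onto the ports used in this thread, one consistent layout is sketched below. The 11002/11003 follower and election ports are illustrative picks, not values from the thread; any unused ports would do:

    # conf/zookeeper.properties (same on every node)
    clientPort=11001
    server.1=<FQDN1>:11002:11003
    server.2=<FQDN2>:11002:11003
    server.3=<FQDN3>:11002:11003

    # nifi.properties and the state-management.xml Connect String keep using the client port
    nifi.zookeeper.connect.string=FQDN1:11001,FQDN2:11001,FQDN3:11001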

On Tue, Feb 28, 2017 at 1:58 PM, Mark Bean <[email protected]> wrote:

Relevant properties from nifi.properties:

nifi.state.management.provider.cluster=zk-provider
nifi.state.management.embedded.zookeeper.start=true
nifi.state.management.embedded.zookeeper.properties=./conf/zookeeper.properties
nifi.cluster.protocol.heartbeat.interval=5 sec
nifi.cluster.protocol.is.secure=true
## Security properties verified; they work for https in non-cluster configuration

nifi.cluster.is.node=true
nifi.cluster.node.address=FQDN1
nifi.cluster.node.protocol.port=9445
nifi.cluster.node.protocol.threads=10
nifi.cluster.node.event.history.size=25
nifi.cluster.node.connection.timeout=5 sec
nifi.cluster.node.read.timeout=5 sec
nifi.cluster.firewall.file=
nifi.cluster.flow.election.max.wait.time=5 mins
nifi.cluster.flow.election.max.candidates=201

nifi.zookeeper.connect.string=FQDN1:11001,FQDN2:11001,FQDN3:11001
nifi.zookeeper.connect.timeout=3 secs
nifi.zookeeper.session.timeout=3 secs
nifi.zookeeper.root.node=/nifi/test-cluster

zookeeper.properties is all default except for these added lines:

server.1=<FQDN1>:11001:11000
server.2=<FQDN2>:11001:11000
server.3=<FQDN3>:11001:11000

state-management.xml is all default except for the following in <cluster-provider>:

<property name="Connect String">FQDN1:11001,FQDN2:11001,FQDN3:11001</property>
<property name="Root Node">/nifi/test-cluster</property>

Also, the ./state/zookeeper/myid consists of only "1", "2", or "3" depending on the server within the cluster. Is this correct?
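
For reference, the usual convention is that each node's myid file contains only the id from that node's own server.N entry, for example (a sketch; run separately on each host, with paths relative to the NiFi install directory used above):

    # on the host named in server.1
    echo 1 > ./state/zookeeper/myid
    # on the host named in server.2
    echo 2 > ./state/zookeeper/myid
    # on the host named in server.3
    echo 3 > ./state/zookeeper/myid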

On Tue, Feb 28, 2017 at 1:24 PM, Jeff <[email protected]> wrote:

Hello Mark,

Sorry to hear that you're having issues with getting your cluster up and running. Could you provide the content of your nifi.properties file? Also, please check the Admin Guide for ZK setup [1], particularly the Flow Election and Basic Cluster Setup sections.

By default, nifi.properties uses a 5-minute election duration to elect the primary node. However, it does not have a default number of candidates for the election, so typically it will take 5 minutes for that election process when you have a 3-node cluster. You could try setting nifi.cluster.flow.election.max.candidates to 3 and restarting the cluster, but based on the errors you're seeing, I think there may be some other issues.

Some key properties to check:

nifi.properties:
nifi.state.management.embedded.zookeeper.start (true for embedded ZK, false or blank if you're using an external ZK)
nifi.zookeeper.connect.string (set to the connect string for your ZK quorum, regardless of embedded or external ZK, e.g. host1:2181,host2:2181,host3:2181)

zookeeper.properties:
server.1 through server.N (should be set to the hostname:port of each ZK server in your cluster, regardless of embedded or external ZK)

state-management.xml, under the cluster-provider element:
<property name="Connect String"></property> (set to the connect string to access your ZK quorum, used by processors to store cluster-based state)

[1] https://nifi.apache.org/docs/nifi-docs/html/administration-guide.html#clustering
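
To illustrate the state-management.xml item above, a filled-in cluster-provider block might look like the sketch below. The id, class, Session Timeout, and Access Control lines should match the stock file that ships with NiFi 1.x, but verify them against your own copy rather than copying verbatim:

    <cluster-provider>
        <id>zk-provider</id>
        <class>org.apache.nifi.controller.state.providers.zookeeper.ZooKeeperStateProvider</class>
        <property name="Connect String">host1:2181,host2:2181,host3:2181</property>
        <property name="Root Node">/nifi</property>
        <property name="Session Timeout">10 seconds</property>
        <property name="Access Control">Open</property>
    </cluster-provider>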

On Tue, Feb 28, 2017 at 12:56 PM Mark Bean <[email protected]> wrote:

I am attempting to set up a new cluster with 3 nodes initially. Each node is reporting zookeeper/curator errors, and the cluster is not able to connect the nodes. The error is reported many times per second and is continuous on all nodes:

2017-02-28 14:22:53,515 ERROR [Curator-Framework-0] o.a.c.f.imps.CuratorFrameworkImpl Background operation retry gave up
org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss
    at org.apache.zookeeper.KeeperException.create(KeeperException.java:99) ~[zookeeper-3.4.6.jar:3.4.6-1569965]
    at org.apache.curator.framework.imps.CuratorFrameworkImpl.checkBackgroundRetry(CuratorFrameworkImpl.java:728) [curator-framework-2.11.0.jar:na]
    at org.apache.curator.framework.imps.CuratorFrameworkImpl.performBackgroundOperation(CuratorFrameworkImpl.java:857) [curator-framework-2.11.0.jar:na]
    at org.apache.curator.framework.imps.CuratorFrameworkImpl.backgroundOperationsLoop(CuratorFrameworkImpl.java:809) [curator-framework-2.11.0.jar:na]
    at org.apache.curator.framework.imps.CuratorFrameworkImpl.access$300(CuratorFrameworkImpl.java:64) [curator-framework-2.11.0.jar:na]
    at org.apache.curator.framework.imps.CuratorFrameworkImpl$4.call(CuratorFrameworkImpl.java:267) [curator-framework-2.11.0.jar:na]
    at java.util.concurrent.FutureTask.run(FutureTask.java:266) [na:1.8.0_121]
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180) [na:1.8.0_121]
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293) [na:1.8.0_121]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [na:1.8.0_121]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [na:1.8.0_121]
    at java.lang.Thread.run(Thread.java:745) [na:1.8.0_121]
2017-02-28 14:22:53,516 ERROR [Curator-Framework-0] o.a.c.f.imps.CuratorFrameworkImpl Background retry gave up
org.apache.curator.CuratorConnectionLossException: KeeperErrorCode = ConnectionLoss
    at org.apache.curator.framework.imps.CuratorFrameworkImpl.performBackgroundOperation(CuratorFrameworkImpl.java:838) [curator-framework-2.11.0.jar:na]
    at org.apache.curator.framework.imps.CuratorFrameworkImpl.backgroundOperationsLoop(CuratorFrameworkImpl.java:809) [curator-framework-2.11.0.jar:na]
    at org.apache.curator.framework.imps.CuratorFrameworkImpl.access$300(CuratorFrameworkImpl.java:64) [curator-framework-2.11.0.jar:na]
    at org.apache.curator.framework.imps.CuratorFrameworkImpl$4.call(CuratorFrameworkImpl.java:267) [curator-framework-2.11.0.jar:na]
    at java.util.concurrent.FutureTask.run(FutureTask.java:266) [na:1.8.0_121]
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180) [na:1.8.0_121]
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293) [na:1.8.0_121]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [na:1.8.0_121]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [na:1.8.0_121]
    at java.lang.Thread.run(Thread.java:745) [na:1.8.0_121]

While the above message was repeating in the log on one of the nodes, another node's log was "stuck" for a period of time with the last message being:

INFO [main] o.a.nifi.properties.NiFiPropertiesLoader Loaded 122 properties from <path>/nifi.properties

The next message to appear, after nearly 6 minutes, is:

INFO [main] o.a.nifi.util.FileBasedVariableRegistry Loaded 91 properties from system properties and environment variables.

The 6-minute delay seems curious.

Then the node appears to start the ZooKeeper server but hits this error:

ERROR [LearnerHandler-/10.6.218.9:22816] o.a.z.server.quorum.LearnerHandler Unexpected exception causing shutdown while sock still open
java.io.EOFException: null
    at java.io.DataInputStream.readInt(DataInputStream.java:392) ~[na:1.8.0_121]
    at org.apache.jute.BinaryInputArchive.readString(BinaryInputArchive.java:79) ~[zookeeper-3.4.6.jar:3.4.6-1569965]
    at org.apache.zookeeper.data.Id.deserialize(Id.java:55) ~[zookeeper-3.4.6.jar:3.4.6-1569965]
    at org.apache.jute.BinaryInputArchive.readRecord(BinaryInputArchive.java:103) ~[zookeeper-3.4.6.jar:3.4.6-1569965]
    at org.apache.zookeeper.server.quorum.QuorumPacket.deserialize(QuorumPacket.java:92) ~[zookeeper-3.4.6.jar:3.4.6-1569965]
    at org.apache.jute.BinaryInputArchive.readRecord(BinaryInputArchive.java:103) ~[zookeeper-3.4.6.jar:3.4.6-1569965]
    at org.apache.zookeeper.server.quorum.LearnerHandler.run(LearnerHandler.java:309) ~[zookeeper-3.4.6.jar:3.4.6-1569965]

This is soon followed by the repeating errors shown above ("Background operation retry gave up").

It is as if the quorum vote does not succeed within a given timeframe and then it stops trying. Note: on one attempt to start the cluster successfully, I removed all but one flow.xml.gz and cleared all information in the ./state directory (except the ./state/zookeeper/myid file).

Thanks for assistance in understanding what ZooKeeper is doing (or not doing) when starting up a new cluster.

-Mark
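
One quick check relevant to the ConnectionLoss errors above: ZooKeeper answers four-letter-word commands on its client port, so from each node you can confirm whether every embedded server is actually serving. This assumes nc is available on the hosts; repeat for FQDN2 and FQDN3:

    # a healthy server replies "imok"
    echo ruok | nc FQDN1 11001

    # "stat" also reports whether the server is a leader, follower, or standalone
    echo stat | nc FQDN1 11001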
