Relevant properties from nifi.properties:

nifi.state.management.provider.cluster=zk-provider
nifi.state.management.embedded.zookeeper.start=true
nifi.state.management.embedded.zookeeper.properties=./conf/zookeeper.properties
nifi.cluster.protocol.heartbeat.interval=5 sec
nifi.cluster.protocol.is.secure=true  ## Security properties verified; they work for https in non-cluster configuration
nifi.cluster.is.node=true
nifi.cluster.node.address=FQDN1
nifi.cluster.node.protocol.port=9445
nifi.cluster.node.protocol.threads=10
nifi.cluster.node.event.history.size=25
nifi.cluster.node.connection.timeout=5 sec
nifi.cluster.node.read.timeout=5 sec
nifi.cluster.firewall.file=
nifi.cluster.flow.election.max.wait.time=5 mins
nifi.cluster.flow.election.max.candidates=201
nifi.zookeeper.connect.string=FQDN1:11001,FQDN2:11001,FQDN3:11001
nifi.zookeeper.connect.timeout=3 secs
nifi.zookeeper.session.timeout=3 secs
nifi.zookeeper.root.node=/nifi/test-cluster

zookeeper.properties is all default except for these added lines:

server.1=<FQDN1>:11001:11000
server.2=<FQDN2>:11001:11000
server.3=<FQDN3>:11001:11000

state-management.xml is all default except for the following in <cluster-provider>:

<property name="Connect String">FQDN1:11001,FQDN2:11001,FQDN3:11001</property>
<property name="Root Node">/nifi/test-cluster</property>

Also, ./state/zookeeper/myid consists of only "1", "2", or "3" depending on the server within the cluster.

Is this correct?
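For reference, a conventional three-node embedded-ZooKeeper layout keeps three distinct ports per server: the client port (clientPort), plus the quorum and leader-election ports in each server.N entry. The NiFi connect string points at the client port, not at either of the server.N ports. A sketch with the stock ports (the FQDNs are placeholders):

    # zookeeper.properties (identical on all three nodes)
    clientPort=2181
    server.1=FQDN1:2888:3888
    server.2=FQDN2:2888:3888
    server.3=FQDN3:2888:3888

    # nifi.properties
    nifi.zookeeper.connect.string=FQDN1:2181,FQDN2:2181,FQDN3:2181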
On Tue, Feb 28, 2017 at 1:24 PM, Jeff <[email protected]> wrote:

> Hello Mark,
>
> Sorry to hear that you're having issues with getting your cluster up and
> running. Could you provide the content of your nifi.properties file?
> Also, please check the Admin Guide for ZK setup [1], particularly the
> Flow Election and Basic Cluster Setup sections.
>
> By default, nifi.properties uses a 5-minute election duration to elect
> the primary node. However, it does not have a default number of
> candidates for the election, so the election will typically take the
> full 5 minutes in a 3-node cluster. You could try setting
> nifi.cluster.flow.election.max.candidates to 3 and restarting the
> cluster, but based on the errors you're seeing, I think there may be
> some other issues.
>
> Some key properties to check:
>
> nifi.properties:
> nifi.state.management.embedded.zookeeper.start (true for embedded ZK,
> false or blank if you're using an external ZK)
> nifi.zookeeper.connect.string (set to the connect string for your ZK
> quorum, regardless of embedded or external ZK, e.g.
> host1:2181,host2:2181,host3:2181)
>
> zookeeper.properties:
> server.1 (server.1 through server.N, should be set to the hostname:port
> of each ZK server in your cluster, regardless of embedded or external ZK)
>
> state-management.xml, under the cluster-provider element:
> <property name="Connect String"></property> (set to the connect string to
> access your ZK quorum; used by processors to store cluster-based state)
>
> [1] https://nifi.apache.org/docs/nifi-docs/html/administration-guide.html#clustering
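As a concrete sketch of the state-management.xml piece of that checklist, the cluster-provider block might look like the following (hostnames and client port are placeholders, the other values are the stock defaults; the <id> must match nifi.state.management.provider.cluster in nifi.properties):

    <cluster-provider>
        <id>zk-provider</id>
        <class>org.apache.nifi.controller.state.providers.zookeeper.ZooKeeperStateProvider</class>
        <property name="Connect String">host1:2181,host2:2181,host3:2181</property>
        <property name="Root Node">/nifi</property>
        <property name="Session Timeout">10 seconds</property>
        <property name="Access Control">Open</property>
    </cluster-provider>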
>
> On Tue, Feb 28, 2017 at 12:56 PM Mark Bean <[email protected]> wrote:
>
> > I am attempting to set up a new cluster with 3 nodes initially. Each
> > node is reporting zookeeper/curator errors, and the cluster is not able
> > to connect the nodes. The error is reported many times per second and
> > is continuous on all nodes:
> >
> > 2017-02-28 14:22:53,515 ERROR [Curator-Framework-0]
> > o.a.c.f.imps.CuratorFrameworkImpl Background operation retry gave up
> > org.apache.zookeeper.KeeperException$ConnectionLossException:
> > KeeperErrorCode = ConnectionLoss
> > at org.apache.zookeeper.KeeperException.create(KeeperException.java:99) ~[zookeeper-3.4.6.jar:3.4.6-1569965]
> > at org.apache.curator.framework.imps.CuratorFrameworkImpl.checkBackgroundRetry(CuratorFrameworkImpl.java:728) [curator-framework-2.11.0.jar:na]
> > at org.apache.curator.framework.imps.CuratorFrameworkImpl.performBackgroundOperation(CuratorFrameworkImpl.java:857) [curator-framework-2.11.0.jar:na]
> > at org.apache.curator.framework.imps.CuratorFrameworkImpl.backgroundOperationsLoop(CuratorFrameworkImpl.java:809) [curator-framework-2.11.0.jar:na]
> > at org.apache.curator.framework.imps.CuratorFrameworkImpl.access$300(CuratorFrameworkImpl.java:64) [curator-framework-2.11.0.jar:na]
> > at org.apache.curator.framework.imps.CuratorFrameworkImpl$4.call(CuratorFrameworkImpl.java:267) [curator-framework-2.11.0.jar:na]
> > at java.util.concurrent.FutureTask.run(FutureTask.java:266) [na:1.8.0_121]
> > at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180) [na:1.8.0_121]
> > at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293) [na:1.8.0_121]
> > at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [na:1.8.0_121]
> > at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [na:1.8.0_121]
> > at java.lang.Thread.run(Thread.java:745) [na:1.8.0_121]
> >
> > 2017-02-28 14:22:53,516 ERROR [Curator-Framework-0]
> > o.a.c.f.imps.CuratorFrameworkImpl Background retry gave up
> > org.apache.curator.CuratorConnectionLossException: KeeperErrorCode =
> > ConnectionLoss
> > at org.apache.curator.framework.imps.CuratorFrameworkImpl.performBackgroundOperation(CuratorFrameworkImpl.java:838) [curator-framework-2.11.0.jar:na]
> > at org.apache.curator.framework.imps.CuratorFrameworkImpl.backgroundOperationsLoop(CuratorFrameworkImpl.java:809) [curator-framework-2.11.0.jar:na]
> > at org.apache.curator.framework.imps.CuratorFrameworkImpl.access$300(CuratorFrameworkImpl.java:64) [curator-framework-2.11.0.jar:na]
> > at org.apache.curator.framework.imps.CuratorFrameworkImpl$4.call(CuratorFrameworkImpl.java:267) [curator-framework-2.11.0.jar:na]
> > at java.util.concurrent.FutureTask.run(FutureTask.java:266) [na:1.8.0_121]
> > at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180) [na:1.8.0_121]
> > at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293) [na:1.8.0_121]
> > at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [na:1.8.0_121]
> > at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [na:1.8.0_121]
> > at java.lang.Thread.run(Thread.java:745) [na:1.8.0_121]
> >
> > While the above message was repeating in the log on one of the nodes,
> > another node's log was "stuck" for a period of time, with the last
> > message being:
> >
> > INFO [main] o.a.nifi.properties.NiFiPropertiesLoader Loaded 122 properties from <path>/nifi.properties
> >
> > The next message to appear, after nearly 6 minutes, is:
> >
> > INFO [main] o.a.nifi.util.FileBasedVariableRegistry Loaded 91 properties from system properties and environment variables.
> >
> > The 6-minute delay seems curious.
> >
> > Then, the node appears to start the ZooKeeper server but hits this error:
> >
> > ERROR [LearnerHandler-/10.6.218.9:22816] o.a.z.server.quorum.LearnerHandler
> > Unexpected exception causing shutdown while sock still open
> > java.io.EOFException: null
> > at java.io.DataInputStream.readInt(DataInputStream.java:392) ~[na:1.8.0_121]
> > at org.apache.jute.BinaryInputArchive.readString(BinaryInputArchive.java:79) ~[zookeeper-3.4.6.jar:3.4.6-1569965]
> > at org.apache.zookeeper.data.Id.deserialize(Id.java:55) ~[zookeeper-3.4.6.jar:3.4.6-1569965]
> > at org.apache.jute.BinaryInputArchive.readRecord(BinaryInputArchive.java:103) ~[zookeeper-3.4.6.jar:3.4.6-1569965]
> > at org.apache.zookeeper.server.quorum.QuorumPacket.deserialize(QuorumPacket.java:92) ~[zookeeper-3.4.6.jar:3.4.6-1569965]
> > at org.apache.jute.BinaryInputArchive.readRecord(BinaryInputArchive.java:103) ~[zookeeper-3.4.6.jar:3.4.6-1569965]
> > at org.apache.zookeeper.server.quorum.LearnerHandler.run(LearnerHandler.java:309) ~[zookeeper-3.4.6.jar:3.4.6-1569965]
> >
> > This is soon followed by the repeating errors shown above ("Background
> > operation retry gave up").
> >
> > It is as if the quorum vote does not succeed within a given timeframe
> > and then it stops trying. Note: on one attempt to start the cluster
> > successfully, I removed all but one flow.xml.gz and cleared all
> > information in the ./state directory (except the ./state/zookeeper/myid
> > file).
> >
> > Thanks for assistance in understanding what ZooKeeper is doing (or not
> > doing) when starting up a new cluster.
> >
> > -Mark
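One possible reading of the timing, given Jeff's explanation above: the roughly 6-minute gap lines up with the 5-minute flow election window plus normal startup. When the node count is known up front, the election can complete as soon as every node has voted rather than waiting out the full window, e.g. (using Jeff's suggested value for a 3-node cluster):

    nifi.cluster.flow.election.max.wait.time=5 mins
    nifi.cluster.flow.election.max.candidates=3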
