Mark,

I am not totally sure, but there could be an issue with the ports in
some of the connect strings.

In zookeeper.properties there is an entry for clientPort, which defaults
to 2181. The value of this property is what should be referenced in
nifi.zookeeper.connect.string and in the state-management.xml Connect
String, so if you left it at the default then:

FQDN1:2181,FQDN2:2181,FQDN3:2181
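
For example, assuming the default clientPort (and using the FQDNs as
placeholders for your actual hostnames), the pieces would line up roughly
like this:

# zookeeper.properties (same on every node)
clientPort=2181

# nifi.properties
nifi.zookeeper.connect.string=FQDN1:2181,FQDN2:2181,FQDN3:2181

# state-management.xml, inside <cluster-provider>
<property name="Connect String">FQDN1:2181,FQDN2:2181,FQDN3:2181</property>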

I believe the server entries in zookeeper.properties should reference
different ports. For example, when using the default clientPort=2181,
the server entries typically look like:

server.1=localhost:2888:3888

From the ZooKeeper docs, the definition for these two ports is:

"There are two port numbers nnnnn. The first followers use to connect
to the leader, and the second is for leader election. The leader
election port is only necessary if electionAlg is 1, 2, or 3
(default). If electionAlg is 0, then the second port is not necessary.
If you want to test multiple servers on a single machine, then
different ports can be used for each server."
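
So for a three-node quorum with each server on its own machine, the
entries would typically look something like this (2888 and 3888 are just
the conventional choices, not required values; the first is the
follower-to-leader port and the second is the election port):

server.1=FQDN1:2888:3888
server.2=FQDN2:2888:3888
server.3=FQDN3:2888:3888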

In your configs it looks like the clientPort and the first port in the
server string are both 11001, so I think making those different should
do the trick.
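
Concretely, one way to do that (keeping 11001 as the client port, and
picking 11002/11003 arbitrarily for the quorum ports; any two free ports
would do) would be something like:

# zookeeper.properties on each node
clientPort=11001
server.1=FQDN1:11002:11003
server.2=FQDN2:11002:11003
server.3=FQDN3:11002:11003

with the connect strings in nifi.properties and state-management.xml left
as FQDN1:11001,FQDN2:11001,FQDN3:11001.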

-Bryan


On Tue, Feb 28, 2017 at 1:58 PM, Mark Bean <[email protected]> wrote:
> Relevant properties from nifi.properties:
> nifi.state.management.provider.cluster=zk-provider
> nifi.state.management.embedded.zookeeper.start=true
> nifi.state.management.embedded.zookeeper.properties=./conf/zookeeper.properties
> nifi.cluster.protocol.heartbeat.interval=5 sec
> nifi.cluster.protocol.is.secure=true
> ## Security properties verified; they work for https in non-cluster configuration
>
> nifi.cluster.is.node=true
> nifi.cluster.node.address=FQDN1
> nifi.cluster.node.protocol.port=9445
> nifi.cluster.node.protocol.threads=10
> nifi.cluster.node.event.history.size=25
> nifi.cluster.node.connection.timeout=5 sec
> nifi.cluster.node.read.timeout=5 sec
> nifi.cluster.firewall.file=
> nifi.cluster.flow.election.max.wait.time=5 mins
> nifi.cluster.flow.election.max.candidates=201
>
> nifi.zookeeper.connect.string=FQDN1:11001,FQDN2:11001,FQDN3:11001
> nifi.zookeeper.connect.timeout=3 secs
> nifi.zookeeper.session.timeout=3 secs
> nifi.zookeeper.root.node=/nifi/test-cluster
>
> zookeeper.properties all default except added these lines:
> server.1=<FQDN1>:11001:11000
> server.2=<FQDN2>:11001:11000
> server.3=<FQDN3>:11001:11000
>
> state-management.xml all default except the following in <cluster-provider>:
> <property name="Connect
> String">FQDN1:11001,FQDN2:11001,FQDN3:11001</property>
> <property name="Root Node">/nifi/test-cluster</property>
>
> Also, the ./state/zookeeper/myid consists of only "1", "2", or "3"
> depending on the server within the cluster. Is this correct?
>
>
> On Tue, Feb 28, 2017 at 1:24 PM, Jeff <[email protected]> wrote:
>
>> Hello Mark,
>>
>> Sorry to hear that you're having issues with getting your cluster up and
>> running.  Could you provide the content of your nifi.properties file?
>> Also, please check the Admin guide for ZK setup [1], particularly the Flow
>> Election and Basic Cluster Setup sections.
>>
>> By default, nifi.properties uses a 5-minute election duration to elect the
>> primary node.  However, it does not have a default number of candidates for
>> the election, so typically it will take 5 minutes for that election process
>> when you have a 3-node cluster.  You could try
>> setting nifi.cluster.flow.election.max.candidates to 3, and restart the
>> cluster, but based on the errors you're seeing, I think there may be some
>> other issues.
>>
>> Some key properties to check:
>>
>> nifi.properties:
>> nifi.state.management.embedded.zookeeper.start (true for embedded ZK, false
>> or blank if you're using an external ZK)
>> nifi.zookeeper.connect.string (set to the connect string for your ZK quorum,
>> regardless of embedded or external ZK, e.g. host1:2181,host2:2181,host3:2181)
>>
>> zookeeper.properties:
>> server.1 (server.1 through server.N, should be set to the hostname:port of
>> each ZK server in your cluster, regardless of embedded or external ZK)
>>
>> state-management.xml, under cluster-provider element:
>> <property name="Connect String"></property> (set to the connect string to
>> access your ZK quorum, used by processors to store cluster-based state)
>>
>> [1]
>> https://nifi.apache.org/docs/nifi-docs/html/administration-guide.html#clustering
>>
>> On Tue, Feb 28, 2017 at 12:56 PM Mark Bean <[email protected]> wrote:
>>
>> > I am attempting to set up a new Cluster with 3 Nodes initially. Each node
>> > is reporting zookeeper/curator errors, and the Cluster is not able to
>> > connect the Nodes. The error is reported many times per second and is
>> > continuous on all Nodes:
>> >
>> > 2017-02-28 14:22:53,515 ERROR [Curator-Framework-0] o.a.c.f.imps.CuratorFrameworkImpl Background operation retry gave up
>> > org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss
>> >         at org.apache.zookeeper.KeeperException.create(KeeperException.java:99) ~[zookeeper-3.4.6.jar:3.4.6-1569965]
>> >         at org.apache.curator.framework.imps.CuratorFrameworkImpl.checkBackgroundRetry(CuratorFrameworkImpl.java:728) [curator-framework-2.11.0.jar:na]
>> >         at org.apache.curator.framework.imps.CuratorFrameworkImpl.performBackgroundOperation(CuratorFrameworkImpl.java:857) [curator-framework-2.11.0.jar:na]
>> >         at org.apache.curator.framework.imps.CuratorFrameworkImpl.backgroundOperationsLoop(CuratorFrameworkImpl.java:809) [curator-framework-2.11.0.jar:na]
>> >         at org.apache.curator.framework.imps.CuratorFrameworkImpl.access$300(CuratorFrameworkImpl.java:64) [curator-framework-2.11.0.jar:na]
>> >         at org.apache.curator.framework.imps.CuratorFrameworkImpl$4.call(CuratorFrameworkImpl.java:267) [curator-framework-2.11.0.jar:na]
>> >         at java.util.concurrent.FutureTask.run(FutureTask.java:266) [na:1.8.0_121]
>> >         at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180) [na:1.8.0_121]
>> >         at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293) [na:1.8.0_121]
>> >         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [na:1.8.0_121]
>> >         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [na:1.8.0_121]
>> >         at java.lang.Thread.run(Thread.java:745) [na:1.8.0_121]
>> > 2017-02-28 14:22:53,516 ERROR [Curator-Framework-0] o.a.c.f.imps.CuratorFrameworkImpl Background retry gave up
>> > org.apache.curator.CuratorConnectionLossException: KeeperErrorCode = ConnectionLoss
>> >         at org.apache.curator.framework.imps.CuratorFrameworkImpl.performBackgroundOperation(CuratorFrameworkImpl.java:838) [curator-framework-2.11.0.jar:na]
>> >         at org.apache.curator.framework.imps.CuratorFrameworkImpl.backgroundOperationsLoop(CuratorFrameworkImpl.java:809) [curator-framework-2.11.0.jar:na]
>> >         at org.apache.curator.framework.imps.CuratorFrameworkImpl.access$300(CuratorFrameworkImpl.java:64) [curator-framework-2.11.0.jar:na]
>> >         at org.apache.curator.framework.imps.CuratorFrameworkImpl$4.call(CuratorFrameworkImpl.java:267) [curator-framework-2.11.0.jar:na]
>> >         at java.util.concurrent.FutureTask.run(FutureTask.java:266) [na:1.8.0_121]
>> >         at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180) [na:1.8.0_121]
>> >         at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293) [na:1.8.0_121]
>> >         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [na:1.8.0_121]
>> >         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [na:1.8.0_121]
>> >         at java.lang.Thread.run(Thread.java:745) [na:1.8.0_121]
>> >
>> > While the above message was repeating in the log on one of the Nodes,
>> > another Node's log was "stuck" for a period of time with the last message
>> > being:
>> >
>> > INFO [main] o.a.nifi.properties.NiFiPropertiesLoader Loaded 122 properties
>> > from <path>/nifi.properties
>> >
>> > The next message to appear after nearly 6 minutes is:
>> >
>> > INFO [main] o.a.nifi.util.FileBasedVariableRegistry Loaded 91 properties
>> > from system properties and environment variables.
>> >
>> > The 6 minute delay seems curious.
>> >
>> > Then, the Node appears to start the zookeeper server but hits this error:
>> >
>> > ERROR [LearnerHandler-/10.6.218.9:22816] o.a.z.server.quorum.LearnerHandler Unexpected exception causing shutdown while sock still open
>> > java.io.EOFException: null
>> >         at java.io.DataInputStream.readInt(DataInputStream.java:392) ~[na:1.8.0_121]
>> >         at org.apache.jute.BinaryInputArchive.readString(BinaryInputArchive.java:79) ~[zookeeper-3.4.6.jar:3.4.6-1569965]
>> >         at org.apache.zookeeper.data.Id.deserialize(Id.java:55) ~[zookeeper-3.4.6.jar:3.4.6-1569965]
>> >         at org.apache.jute.BinaryInputArchive.readRecord(BinaryInputArchive.java:103) ~[zookeeper-3.4.6.jar:3.4.6-1569965]
>> >         at org.apache.zookeeper.server.quorum.QuorumPacket.deserialize(QuorumPacket.java:92) ~[zookeeper-3.4.6.jar:3.4.6-1569965]
>> >         at org.apache.jute.BinaryInputArchive.readRecord(BinaryInputArchive.java:103) ~[zookeeper-3.4.6.jar:3.4.6-1569965]
>> >         at org.apache.zookeeper.server.quorum.LearnerHandler.run(LearnerHandler.java:309) ~[zookeeper-3.4.6.jar:3.4.6-1569965]
>> >
>> > This is soon followed by the repeating errors shown above ("Background
>> > operation retry gave up").
>> >
>> > It is as if the quorum vote does not succeed within a given timeframe and
>> > then it stops trying. Note: on one attempt to start the Cluster
>> > successfully, I removed all but one flow.xml.gz, and cleared all
>> > information in the ./state directory (except the ./state/zookeeper/myid
>> > file).
>> >
>> > Thanks for assistance in understanding what zookeeper is doing (or not
>> > doing) when starting up a new Cluster.
>> >
>> > -Mark
>> >
>>
