Mark,

There are some copy/paste errors in my last response as well. Sorry!

server.1=<FQDN1>:2888:3888
server.2=<FQDN2>:2888:3888
server.3=<FQDN3>:2888:3888
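
For reference, a full conf/zookeeper.properties sketch for the three embedded servers might look like the following. Only the server.N lines come from this thread; clientPort, dataDir, tickTime, initLimit, and syncLimit are shown with typical defaults and should be checked against the file NiFi ships with:

    clientPort=2181
    dataDir=./state/zookeeper
    tickTime=2000
    initLimit=10
    syncLimit=5
    server.1=<FQDN1>:2888:3888
    server.2=<FQDN2>:2888:3888
    server.3=<FQDN3>:2888:3888

The same server.N block would go in the file on all three nodes.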

On Tue, Feb 28, 2017 at 2:31 PM Jeff <[email protected]> wrote:

Mark,

In my original response, I said that in zookeeper.properties the server.N properties should be set to the host:port of your ZK server, and that was pretty ambiguous. It should not be set to the same port as clientPort.

As Bryan mentioned, with the default clientPort set to 2181, the server.N properties are typically set to hostname:2888:3888. In your case, you might want to try something like the following, as long as these ports are not currently in use:

server.1=<FQDN1>:2888:3888
server.2=<FQDN1>:2888:3888
server.3=<FQDN1>:2888:3888

Also, your settings for leader election:

nifi.cluster.flow.election.max.wait.time=5 mins
nifi.cluster.flow.election.max.candidates=201

This will wait for 201 election candidates to connect, or 5 minutes. You might want to set the max candidates to 3, since you have 3 nodes in your cluster.

The contents of ./state/zookeeper look correct; you should be okay there.


On Tue, Feb 28, 2017 at 2:19 PM Bryan Bende <[email protected]> wrote:

Mark,

I am not totally sure, but there could be an issue with the ports in some of the connect strings.

In zookeeper.properties there is an entry for clientPort, which defaults to 2181. The value of this property is what should be referenced in nifi.zookeeper.connect.string and in the state-management.xml Connect String, so if you left it alone then:

FQDN1:2181,FQDN2:2181,FQDN3:2181

In the server entries in zookeeper.properties, I believe they should be referencing different ports. For example, when using the default clientPort=2181 the server entries are typically like:

server.1=localhost:2888:3888

From the ZooKeeper docs, the definition for these two ports is:

"There are two port numbers nnnnn. The first followers use to connect to the leader, and the second is for leader election. The leader election port is only necessary if electionAlg is 1, 2, or 3 (default). If electionAlg is 0, then the second port is not necessary. If you want to test multiple servers on a single machine, then different ports can be used for each server."

In your configs it looks like the clientPort and the first port in the server string are both 11001, so I think making those different should do the trick.

-Bryan
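
Mapping that suggestion onto the ports used in this thread, one consistent layout is sketched below. The 11002/11003 follower and election ports are illustrative picks, not values from the thread; any unused ports would do:

    # conf/zookeeper.properties (same on every node)
    clientPort=11001
    server.1=<FQDN1>:11002:11003
    server.2=<FQDN2>:11002:11003
    server.3=<FQDN3>:11002:11003

    # nifi.properties and the state-management.xml Connect String keep using the client port
    nifi.zookeeper.connect.string=FQDN1:11001,FQDN2:11001,FQDN3:11001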

On Tue, Feb 28, 2017 at 1:58 PM, Mark Bean <[email protected]> wrote:

Relevant properties from nifi.properties:

nifi.state.management.provider.cluster=zk-provider
nifi.state.management.embedded.zookeeper.start=true
nifi.state.management.embedded.zookeeper.properties=./conf/zookeeper.properties
nifi.cluster.protocol.heartbeat.interval=5 sec
nifi.cluster.protocol.is.secure=true
## Security properties verified; they work for https in non-cluster configuration

nifi.cluster.is.node=true
nifi.cluster.node.address=FQDN1
nifi.cluster.node.protocol.port=9445
nifi.cluster.node.protocol.threads=10
nifi.cluster.node.event.history.size=25
nifi.cluster.node.connection.timeout=5 sec
nifi.cluster.node.read.timeout=5 sec
nifi.cluster.firewall.file=
nifi.cluster.flow.election.max.wait.time=5 mins
nifi.cluster.flow.election.max.candidates=201

nifi.zookeeper.connect.string=FQDN1:11001,FQDN2:11001,FQDN3:11001
nifi.zookeeper.connect.timeout=3 secs
nifi.zookeeper.session.timeout=3 secs
nifi.zookeeper.root.node=/nifi/test-cluster

zookeeper.properties is all default except for these added lines:

server.1=<FQDN1>:11001:11000
server.2=<FQDN2>:11001:11000
server.3=<FQDN3>:11001:11000

state-management.xml is all default except for the following in <cluster-provider>:

<property name="Connect String">FQDN1:11001,FQDN2:11001,FQDN3:11001</property>
<property name="Root Node">/nifi/test-cluster</property>

Also, the ./state/zookeeper/myid consists of only "1", "2", or "3" depending on the server within the cluster. Is this correct?
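
For reference, the usual convention is that each node's myid file contains only the id from that node's own server.N entry, for example (a sketch; run separately on each host, with paths relative to the NiFi install directory used above):

    # on the host named in server.1
    echo 1 > ./state/zookeeper/myid
    # on the host named in server.2
    echo 2 > ./state/zookeeper/myid
    # on the host named in server.3
    echo 3 > ./state/zookeeper/myid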

On Tue, Feb 28, 2017 at 1:24 PM, Jeff <[email protected]> wrote:

Hello Mark,

Sorry to hear that you're having issues with getting your cluster up and running. Could you provide the content of your nifi.properties file? Also, please check the Admin Guide for ZK setup [1], particularly the Flow Election and Basic Cluster Setup sections.

By default, nifi.properties uses a 5-minute election duration to elect the primary node. However, it does not have a default number of candidates for the election, so typically it will take 5 minutes for that election process when you have a 3-node cluster. You could try setting nifi.cluster.flow.election.max.candidates to 3 and restarting the cluster, but based on the errors you're seeing, I think there may be some other issues.

Some key properties to check:

nifi.properties:
nifi.state.management.embedded.zookeeper.start (true for embedded ZK, false or blank if you're using an external ZK)
nifi.zookeeper.connect.string (set to the connect string for your ZK quorum, regardless of embedded or external ZK, e.g. host1:2181,host2:2181,host3:2181)

zookeeper.properties:
server.1 through server.N (should be set to the hostname:port of each ZK server in your cluster, regardless of embedded or external ZK)

state-management.xml, under the cluster-provider element:
<property name="Connect String"></property> (set to the connect string to access your ZK quorum, used by processors to store cluster-based state)

[1] https://nifi.apache.org/docs/nifi-docs/html/administration-guide.html#clustering
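
To illustrate the state-management.xml item above, a filled-in cluster-provider block might look like the sketch below. The id, class, Session Timeout, and Access Control lines should match the stock file that ships with NiFi 1.x, but verify them against your own copy rather than copying verbatim:

    <cluster-provider>
        <id>zk-provider</id>
        <class>org.apache.nifi.controller.state.providers.zookeeper.ZooKeeperStateProvider</class>
        <property name="Connect String">host1:2181,host2:2181,host3:2181</property>
        <property name="Root Node">/nifi</property>
        <property name="Session Timeout">10 seconds</property>
        <property name="Access Control">Open</property>
    </cluster-provider>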

On Tue, Feb 28, 2017 at 12:56 PM Mark Bean <[email protected]> wrote:

I am attempting to set up a new cluster with 3 nodes initially. Each node is reporting zookeeper/curator errors, and the cluster is not able to connect the nodes. The error is reported many times per second and is continuous on all nodes:

2017-02-28 14:22:53,515 ERROR [Curator-Framework-0] o.a.c.f.imps.CuratorFrameworkImpl Background operation retry gave up
org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss
    at org.apache.zookeeper.KeeperException.create(KeeperException.java:99) ~[zookeeper-3.4.6.jar:3.4.6-1569965]
    at org.apache.curator.framework.imps.CuratorFrameworkImpl.checkBackgroundRetry(CuratorFrameworkImpl.java:728) [curator-framework-2.11.0.jar:na]
    at org.apache.curator.framework.imps.CuratorFrameworkImpl.performBackgroundOperation(CuratorFrameworkImpl.java:857) [curator-framework-2.11.0.jar:na]
    at org.apache.curator.framework.imps.CuratorFrameworkImpl.backgroundOperationsLoop(CuratorFrameworkImpl.java:809) [curator-framework-2.11.0.jar:na]
    at org.apache.curator.framework.imps.CuratorFrameworkImpl.access$300(CuratorFrameworkImpl.java:64) [curator-framework-2.11.0.jar:na]
    at org.apache.curator.framework.imps.CuratorFrameworkImpl$4.call(CuratorFrameworkImpl.java:267) [curator-framework-2.11.0.jar:na]
    at java.util.concurrent.FutureTask.run(FutureTask.java:266) [na:1.8.0_121]
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180) [na:1.8.0_121]
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293) [na:1.8.0_121]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [na:1.8.0_121]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [na:1.8.0_121]
    at java.lang.Thread.run(Thread.java:745) [na:1.8.0_121]
2017-02-28 14:22:53,516 ERROR [Curator-Framework-0] o.a.c.f.imps.CuratorFrameworkImpl Background retry gave up
org.apache.curator.CuratorConnectionLossException: KeeperErrorCode = ConnectionLoss
    at org.apache.curator.framework.imps.CuratorFrameworkImpl.performBackgroundOperation(CuratorFrameworkImpl.java:838) [curator-framework-2.11.0.jar:na]
    at org.apache.curator.framework.imps.CuratorFrameworkImpl.backgroundOperationsLoop(CuratorFrameworkImpl.java:809) [curator-framework-2.11.0.jar:na]
    at org.apache.curator.framework.imps.CuratorFrameworkImpl.access$300(CuratorFrameworkImpl.java:64) [curator-framework-2.11.0.jar:na]
    at org.apache.curator.framework.imps.CuratorFrameworkImpl$4.call(CuratorFrameworkImpl.java:267) [curator-framework-2.11.0.jar:na]
    at java.util.concurrent.FutureTask.run(FutureTask.java:266) [na:1.8.0_121]
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180) [na:1.8.0_121]
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293) [na:1.8.0_121]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [na:1.8.0_121]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [na:1.8.0_121]
    at java.lang.Thread.run(Thread.java:745) [na:1.8.0_121]

While the above message was repeating in the log on one of the nodes, another node's log was "stuck" for a period of time with the last message being:

INFO [main] o.a.nifi.properties.NiFiPropertiesLoader Loaded 122 properties from <path>/nifi.properties

The next message to appear, after nearly 6 minutes, is:

INFO [main] o.a.nifi.util.FileBasedVariableRegistry Loaded 91 properties from system properties and environment variables.

The 6-minute delay seems curious.

Then the node appears to start the ZooKeeper server but hits this error:

ERROR [LearnerHandler-/10.6.218.9:22816] o.a.z.server.quorum.LearnerHandler Unexpected exception causing shutdown while sock still open
java.io.EOFException: null
    at java.io.DataInputStream.readInt(DataInputStream.java:392) ~[na:1.8.0_121]
    at org.apache.jute.BinaryInputArchive.readString(BinaryInputArchive.java:79) ~[zookeeper-3.4.6.jar:3.4.6-1569965]
    at org.apache.zookeeper.data.Id.deserialize(Id.java:55) ~[zookeeper-3.4.6.jar:3.4.6-1569965]
    at org.apache.jute.BinaryInputArchive.readRecord(BinaryInputArchive.java:103) ~[zookeeper-3.4.6.jar:3.4.6-1569965]
    at org.apache.zookeeper.server.quorum.QuorumPacket.deserialize(QuorumPacket.java:92) ~[zookeeper-3.4.6.jar:3.4.6-1569965]
    at org.apache.jute.BinaryInputArchive.readRecord(BinaryInputArchive.java:103) ~[zookeeper-3.4.6.jar:3.4.6-1569965]
    at org.apache.zookeeper.server.quorum.LearnerHandler.run(LearnerHandler.java:309) ~[zookeeper-3.4.6.jar:3.4.6-1569965]

This is soon followed by the repeating errors shown above ("Background operation retry gave up").

It is as if the quorum vote does not succeed within a given timeframe and then it stops trying. Note: on one attempt to start the cluster successfully, I removed all but one flow.xml.gz and cleared all information in the ./state directory (except the ./state/zookeeper/myid file).

Thanks for assistance in understanding what ZooKeeper is doing (or not doing) when starting up a new cluster.

-Mark
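
One quick check relevant to the ConnectionLoss errors above: ZooKeeper answers four-letter-word commands on its client port, so from each node you can confirm whether every embedded server is actually serving. This assumes nc is available on the hosts; repeat for FQDN2 and FQDN3:

    # a healthy server replies "imok"
    echo ruok | nc FQDN1 11001

    # "stat" also reports whether the server is a leader, follower, or standalone
    echo stat | nc FQDN1 11001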
