Relevant properties from nifi.properties:

nifi.state.management.provider.cluster=zk-provider
nifi.state.management.embedded.zookeeper.start=true
nifi.state.management.embedded.zookeeper.properties=./conf/zookeeper.properties
nifi.cluster.protocol.heartbeat.interval=5 sec
nifi.cluster.protocol.is.secure=true  ## Security properties verified; they work for https in non-cluster configuration
nifi.cluster.is.node=true
nifi.cluster.node.address=FQDN1
nifi.cluster.node.protocol.port=9445
nifi.cluster.node.protocol.threads=10
nifi.cluster.node.event.history.size=25
nifi.cluster.node.connection.timeout=5 sec
nifi.cluster.node.read.timeout=5 sec
nifi.cluster.firewall.file=
nifi.cluster.flow.election.max.wait.time=5 mins
nifi.cluster.flow.election.max.candidates=201
nifi.zookeeper.connect.string=FQDN1:11001,FQDN2:11001,FQDN3:11001
nifi.zookeeper.connect.timeout=3 secs
nifi.zookeeper.session.timeout=3 secs
nifi.zookeeper.root.node=/nifi/test-cluster

zookeeper.properties is all default except for these added lines:

server.1=<FQDN1>:11001:11000
server.2=<FQDN2>:11001:11000
server.3=<FQDN3>:11001:11000

state-management.xml is all default except for the following in <cluster-provider>:

<property name="Connect String">FQDN1:11001,FQDN2:11001,FQDN3:11001</property>
<property name="Root Node">/nifi/test-cluster</property>

Also, ./state/zookeeper/myid consists of only "1", "2", or "3" depending on the server within the cluster.

Is this correct?
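For reference, a conventional three-node embedded-ZooKeeper layout keeps three distinct ports per server: the client port (clientPort), plus the quorum and leader-election ports in each server.N entry. The NiFi connect string points at the client port, not at either of the server.N ports. A sketch with the stock ports (the FQDNs are placeholders):

    # zookeeper.properties (identical on all three nodes)
    clientPort=2181
    server.1=FQDN1:2888:3888
    server.2=FQDN2:2888:3888
    server.3=FQDN3:2888:3888

    # nifi.properties
    nifi.zookeeper.connect.string=FQDN1:2181,FQDN2:2181,FQDN3:2181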
On Tue, Feb 28, 2017 at 1:24 PM, Jeff <[email protected]> wrote:

> Hello Mark,
>
> Sorry to hear that you're having issues with getting your cluster up and
> running. Could you provide the content of your nifi.properties file?
> Also, please check the Admin Guide for ZK setup [1], particularly the
> Flow Election and Basic Cluster Setup sections.
>
> By default, nifi.properties uses a 5-minute election duration to elect
> the primary node. However, it does not have a default number of
> candidates for the election, so the election will typically take the
> full 5 minutes in a 3-node cluster. You could try setting
> nifi.cluster.flow.election.max.candidates to 3 and restarting the
> cluster, but based on the errors you're seeing, I think there may be
> some other issues.
>
> Some key properties to check:
>
> nifi.properties:
> nifi.state.management.embedded.zookeeper.start (true for embedded ZK,
> false or blank if you're using an external ZK)
> nifi.zookeeper.connect.string (set to the connect string for your ZK
> quorum, regardless of embedded or external ZK, e.g.
> host1:2181,host2:2181,host3:2181)
>
> zookeeper.properties:
> server.1 (server.1 through server.N, should be set to the hostname:port
> of each ZK server in your cluster, regardless of embedded or external ZK)
>
> state-management.xml, under the cluster-provider element:
> <property name="Connect String"></property> (set to the connect string to
> access your ZK quorum; used by processors to store cluster-based state)
>
> [1] https://nifi.apache.org/docs/nifi-docs/html/administration-guide.html#clustering
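As a concrete sketch of the state-management.xml piece of that checklist, the cluster-provider block might look like the following (hostnames and client port are placeholders, the other values are the stock defaults; the <id> must match nifi.state.management.provider.cluster in nifi.properties):

    <cluster-provider>
        <id>zk-provider</id>
        <class>org.apache.nifi.controller.state.providers.zookeeper.ZooKeeperStateProvider</class>
        <property name="Connect String">host1:2181,host2:2181,host3:2181</property>
        <property name="Root Node">/nifi</property>
        <property name="Session Timeout">10 seconds</property>
        <property name="Access Control">Open</property>
    </cluster-provider>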
>
> On Tue, Feb 28, 2017 at 12:56 PM Mark Bean <[email protected]> wrote:
>
> > I am attempting to set up a new cluster with 3 nodes initially. Each
> > node is reporting zookeeper/curator errors, and the cluster is not able
> > to connect the nodes. The error is reported many times per second and
> > is continuous on all nodes:
> >
> > 2017-02-28 14:22:53,515 ERROR [Curator-Framework-0]
> > o.a.c.f.imps.CuratorFrameworkImpl Background operation retry gave up
> > org.apache.zookeeper.KeeperException$ConnectionLossException:
> > KeeperErrorCode = ConnectionLoss
> > at org.apache.zookeeper.KeeperException.create(KeeperException.java:99) ~[zookeeper-3.4.6.jar:3.4.6-1569965]
> > at org.apache.curator.framework.imps.CuratorFrameworkImpl.checkBackgroundRetry(CuratorFrameworkImpl.java:728) [curator-framework-2.11.0.jar:na]
> > at org.apache.curator.framework.imps.CuratorFrameworkImpl.performBackgroundOperation(CuratorFrameworkImpl.java:857) [curator-framework-2.11.0.jar:na]
> > at org.apache.curator.framework.imps.CuratorFrameworkImpl.backgroundOperationsLoop(CuratorFrameworkImpl.java:809) [curator-framework-2.11.0.jar:na]
> > at org.apache.curator.framework.imps.CuratorFrameworkImpl.access$300(CuratorFrameworkImpl.java:64) [curator-framework-2.11.0.jar:na]
> > at org.apache.curator.framework.imps.CuratorFrameworkImpl$4.call(CuratorFrameworkImpl.java:267) [curator-framework-2.11.0.jar:na]
> > at java.util.concurrent.FutureTask.run(FutureTask.java:266) [na:1.8.0_121]
> > at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180) [na:1.8.0_121]
> > at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293) [na:1.8.0_121]
> > at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [na:1.8.0_121]
> > at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [na:1.8.0_121]
> > at java.lang.Thread.run(Thread.java:745) [na:1.8.0_121]
> >
> > 2017-02-28 14:22:53,516 ERROR [Curator-Framework-0]
> > o.a.c.f.imps.CuratorFrameworkImpl Background retry gave up
> > org.apache.curator.CuratorConnectionLossException: KeeperErrorCode =
> > ConnectionLoss
> > at org.apache.curator.framework.imps.CuratorFrameworkImpl.performBackgroundOperation(CuratorFrameworkImpl.java:838) [curator-framework-2.11.0.jar:na]
> > at org.apache.curator.framework.imps.CuratorFrameworkImpl.backgroundOperationsLoop(CuratorFrameworkImpl.java:809) [curator-framework-2.11.0.jar:na]
> > at org.apache.curator.framework.imps.CuratorFrameworkImpl.access$300(CuratorFrameworkImpl.java:64) [curator-framework-2.11.0.jar:na]
> > at org.apache.curator.framework.imps.CuratorFrameworkImpl$4.call(CuratorFrameworkImpl.java:267) [curator-framework-2.11.0.jar:na]
> > at java.util.concurrent.FutureTask.run(FutureTask.java:266) [na:1.8.0_121]
> > at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180) [na:1.8.0_121]
> > at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293) [na:1.8.0_121]
> > at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [na:1.8.0_121]
> > at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [na:1.8.0_121]
> > at java.lang.Thread.run(Thread.java:745) [na:1.8.0_121]
> >
> > While the above message was repeating in the log on one of the nodes,
> > another node's log was "stuck" for a period of time, with the last
> > message being:
> >
> > INFO [main] o.a.nifi.properties.NiFiPropertiesLoader Loaded 122 properties from <path>/nifi.properties
> >
> > The next message to appear, after nearly 6 minutes, is:
> >
> > INFO [main] o.a.nifi.util.FileBasedVariableRegistry Loaded 91 properties from system properties and environment variables.
> >
> > The 6-minute delay seems curious.
> >
> > Then, the node appears to start the ZooKeeper server but hits this error:
> >
> > ERROR [LearnerHandler-/10.6.218.9:22816] o.a.z.server.quorum.LearnerHandler
> > Unexpected exception causing shutdown while sock still open
> > java.io.EOFException: null
> > at java.io.DataInputStream.readInt(DataInputStream.java:392) ~[na:1.8.0_121]
> > at org.apache.jute.BinaryInputArchive.readString(BinaryInputArchive.java:79) ~[zookeeper-3.4.6.jar:3.4.6-1569965]
> > at org.apache.zookeeper.data.Id.deserialize(Id.java:55) ~[zookeeper-3.4.6.jar:3.4.6-1569965]
> > at org.apache.jute.BinaryInputArchive.readRecord(BinaryInputArchive.java:103) ~[zookeeper-3.4.6.jar:3.4.6-1569965]
> > at org.apache.zookeeper.server.quorum.QuorumPacket.deserialize(QuorumPacket.java:92) ~[zookeeper-3.4.6.jar:3.4.6-1569965]
> > at org.apache.jute.BinaryInputArchive.readRecord(BinaryInputArchive.java:103) ~[zookeeper-3.4.6.jar:3.4.6-1569965]
> > at org.apache.zookeeper.server.quorum.LearnerHandler.run(LearnerHandler.java:309) ~[zookeeper-3.4.6.jar:3.4.6-1569965]
> >
> > This is soon followed by the repeating errors shown above ("Background
> > operation retry gave up").
> >
> > It is as if the quorum vote does not succeed within a given timeframe
> > and then it stops trying. Note: on one attempt to start the cluster
> > successfully, I removed all but one flow.xml.gz and cleared all
> > information in the ./state directory (except the ./state/zookeeper/myid
> > file).
> >
> > Thanks for assistance in understanding what ZooKeeper is doing (or not
> > doing) when starting up a new cluster.
> >
> > -Mark
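One possible reading of the timing, given Jeff's explanation above: the roughly 6-minute gap lines up with the 5-minute flow election window plus normal startup. When the node count is known up front, the election can complete as soon as every node has voted rather than waiting out the full window, e.g. (using Jeff's suggested value for a 3-node cluster):

    nifi.cluster.flow.election.max.wait.time=5 mins
    nifi.cluster.flow.election.max.candidates=3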
