I am attempting to set up a new NiFi Cluster, initially with 3 Nodes. Each
Node is reporting ZooKeeper/Curator errors, and the Nodes are unable to
connect to form the Cluster. The error is logged many times per second,
continuously, on all Nodes:
2017-02-28 14:22:53,515 ERROR [Curator-Framework-0] o.a.c.f.imps.CuratorFrameworkImpl Background operation retry gave up
org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss
    at org.apache.zookeeper.KeeperException.create(KeeperException.java:99) ~[zookeeper-3.4.6.jar:3.4.6-1569965]
    at org.apache.curator.framework.imps.CuratorFrameworkImpl.checkBackgroundRetry(CuratorFrameworkImpl.java:728) [curator-framework-2.11.0.jar:na]
    at org.apache.curator.framework.imps.CuratorFrameworkImpl.performBackgroundOperation(CuratorFrameworkImpl.java:857) [curator-framework-2.11.0.jar:na]
    at org.apache.curator.framework.imps.CuratorFrameworkImpl.backgroundOperationsLoop(CuratorFrameworkImpl.java:809) [curator-framework-2.11.0.jar:na]
    at org.apache.curator.framework.imps.CuratorFrameworkImpl.access$300(CuratorFrameworkImpl.java:64) [curator-framework-2.11.0.jar:na]
    at org.apache.curator.framework.imps.CuratorFrameworkImpl$4.call(CuratorFrameworkImpl.java:267) [curator-framework-2.11.0.jar:na]
    at java.util.concurrent.FutureTask.run(FutureTask.java:266) [na:1.8.0_121]
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180) [na:1.8.0_121]
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293) [na:1.8.0_121]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [na:1.8.0_121]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [na:1.8.0_121]
    at java.lang.Thread.run(Thread.java:745) [na:1.8.0_121]
2017-02-28 14:22:53,516 ERROR [Curator-Framework-0] o.a.c.f.imps.CuratorFrameworkImpl Background retry gave up
org.apache.curator.CuratorConnectionLossException: KeeperErrorCode = ConnectionLoss
    at org.apache.curator.framework.imps.CuratorFrameworkImpl.performBackgroundOperation(CuratorFrameworkImpl.java:838) [curator-framework-2.11.0.jar:na]
    at org.apache.curator.framework.imps.CuratorFrameworkImpl.backgroundOperationsLoop(CuratorFrameworkImpl.java:809) [curator-framework-2.11.0.jar:na]
    at org.apache.curator.framework.imps.CuratorFrameworkImpl.access$300(CuratorFrameworkImpl.java:64) [curator-framework-2.11.0.jar:na]
    at org.apache.curator.framework.imps.CuratorFrameworkImpl$4.call(CuratorFrameworkImpl.java:267) [curator-framework-2.11.0.jar:na]
    at java.util.concurrent.FutureTask.run(FutureTask.java:266) [na:1.8.0_121]
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180) [na:1.8.0_121]
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293) [na:1.8.0_121]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [na:1.8.0_121]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [na:1.8.0_121]
    at java.lang.Thread.run(Thread.java:745) [na:1.8.0_121]
While the above message was repeating in the log on one of the Nodes,
another Node's log was "stuck" for a period of time with the last message
being:
INFO [main] o.a.nifi.properties.NiFiPropertiesLoader Loaded 122 properties from <path>/nifi.properties
The next message, which appears nearly 6 minutes later, is:
INFO [main] o.a.nifi.util.FileBasedVariableRegistry Loaded 91 properties from system properties and environment variables.
The 6-minute delay seems curious.
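To see whether stalls like this one recur elsewhere in a Node's log, the gaps between consecutive timestamped lines can be listed. The following is only a rough sketch: it assumes every log entry begins with a timestamp in the same format as the ERROR lines quoted above, and the function name and the 60-second threshold are illustrative choices, not anything NiFi provides.

```python
import re
from datetime import datetime

# Matches the timestamp prefix seen in the excerpts, e.g. "2017-02-28 14:22:53,515".
TIMESTAMP = re.compile(r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3})")


def find_gaps(lines, threshold_seconds=60.0):
    """Return (gap_seconds, previous_line, next_line) for each stall at or over the threshold."""
    gaps = []
    prev_time = None
    prev_line = None
    for line in lines:
        match = TIMESTAMP.match(line)
        if not match:
            continue  # stack-trace frames and wrapped lines carry no timestamp
        current = datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S,%f")
        if prev_time is not None:
            delta = (current - prev_time).total_seconds()
            if delta >= threshold_seconds:
                gaps.append((delta, prev_line, line.rstrip("\n")))
        prev_time = current
        prev_line = line.rstrip("\n")
    return gaps
```

Passing an open log file (any iterable of lines works) returns each long pause together with the entries on either side of it, which makes it easier to compare where each Node stalls during startup.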
Then the Node appears to start the ZooKeeper server, but hits this error:
ERROR [LearnerHandler-/10.6.218.9:22816] o.a.z.server.quorum.LearnerHandler Unexpected exception causing shutdown while sock still open
java.io.EOFException: null
    at java.io.DataInputStream.readInt(DataInputStream.java:392) ~[na:1.8.0_121]
    at org.apache.jute.BinaryInputArchive.readString(BinaryInputArchive.java:79) ~[zookeeper-3.4.6.jar:3.4.6-1569965]
    at org.apache.zookeeper.data.Id.deserialize(Id.java:55) ~[zookeeper-3.4.6.jar:3.4.6-1569965]
    at org.apache.jute.BinaryInputArchive.readRecord(BinaryInputArchive.java:103) ~[zookeeper-3.4.6.jar:3.4.6-1569965]
    at org.apache.zookeeper.server.quorum.QuorumPacket.deserialize(QuorumPacket.java:92) ~[zookeeper-3.4.6.jar:3.4.6-1569965]
    at org.apache.jute.BinaryInputArchive.readRecord(BinaryInputArchive.java:103) ~[zookeeper-3.4.6.jar:3.4.6-1569965]
    at org.apache.zookeeper.server.quorum.LearnerHandler.run(LearnerHandler.java:309) ~[zookeeper-3.4.6.jar:3.4.6-1569965]
This is soon followed by the repeating errors shown above ("Background
operation retry gave up").
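Since both errors point at connections that are lost or closed early, one basic thing that could be ruled out is plain TCP reachability between the Nodes. The sketch below is a minimal check under stated assumptions: the ports 2181, 2888 and 3888 are only ZooKeeper's conventional client, quorum and election ports and may not match what zookeeper.properties and nifi.properties actually specify here, and the function names are made up for illustration.

```python
import socket


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name-resolution failures.
        return False


def check_ensemble(hosts, ports=(2181, 2888, 3888)):
    """Map each (host, port) pair to whether it accepted a TCP connection."""
    return {(host, port): can_connect(host, port) for host in hosts for port in ports}
```

Running `check_ensemble` from each Node against the other two would show whether a firewall or a wrong bind address is in the way. As far as I understand ZooKeeper, only the current leader listens on the quorum port, so a refused connection on that port on the other servers is not by itself a sign of trouble.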
It is as if the quorum vote does not succeed within a given timeframe and
then ZooKeeper stops trying. Note: in one attempt to get the Cluster to
start successfully, I removed all but one flow.xml.gz and cleared
everything in the ./state directory (except the ./state/zookeeper/myid file).
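Because the myid file is the one piece of state that was kept, it may be worth confirming that each Node's myid agrees with the `server.N=host:port:port` entries in its zookeeper.properties. The following is a hedged sketch of such a check: the helper names and the host names in the usage note are hypothetical, and it assumes the usual `server.N` syntax with plain host names or IPv4 addresses.

```python
import re

# Matches uncommented entries such as "server.1=nifi-node1:2888:3888".
SERVER_LINE = re.compile(r"^\s*server\.(\d+)\s*=\s*([^:\s]+):(\d+):(\d+)")


def parse_servers(properties_text):
    """Return {server_id: (host, quorum_port, election_port)} from zookeeper.properties text."""
    servers = {}
    for line in properties_text.splitlines():
        match = SERVER_LINE.match(line)
        if match:
            sid, host, quorum, election = match.groups()
            servers[int(sid)] = (host, int(quorum), int(election))
    return servers


def quorum_size(ensemble_size):
    """Smallest number of servers that forms a strict majority."""
    return ensemble_size // 2 + 1


def check_myid(myid_text, servers, local_host):
    """Return a list of problems for this node; an empty list means none were detected."""
    try:
        myid = int(myid_text.strip())
    except ValueError:
        return [f"myid is not an integer: {myid_text!r}"]
    if myid not in servers:
        return [f"myid {myid} has no matching server.{myid} entry"]
    if servers[myid][0] != local_host:
        return [f"server.{myid} points at {servers[myid][0]}, not {local_host}"]
    return []
```

For example, `check_myid("2", parse_servers(text), "nifi-node2")` returns an empty list when `text` contains `server.2=nifi-node2:2888:3888`. With three servers, `quorum_size(3)` is 2, so at least two of them have to be up and able to reach each other before the ensemble can elect a leader and serve clients.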
Thanks for any assistance in understanding what ZooKeeper is doing (or not
doing) when starting up a new Cluster.
-Mark