Tifenn LE GOFF created ZOOKEEPER-4371:
-----------------------------------------
Summary: False ID conflict when ZK try to connect to cluster
Key: ZOOKEEPER-4371
URL: https://issues.apache.org/jira/browse/ZOOKEEPER-4371
Project: ZooKeeper
Issue Type: Bug
Affects Versions: 3.6.2
Environment: We are on Zookeeper 3.6.2 on Docker with official image.
It's can be reproduce with something like
{code:java}
docker run --add-host=zk_fqdn:zk_ip --ulimit nofile=64000:64000 -p
ip_zk:2181:2181 -p ip_zk:2888:2888 -p ip_zk:3888:3888 -p ip_zk:7000:7000 -p
ip_zk:8080:8080 -v /data/zookeeper/data:/data -v /data/zookeeper/log:/datalog
--hostname "zk_fqdn" --env-file "/data/zookeeper/conf/zk.env"--name zookeeper
zookeeper:3.6.2
{code}
with
{code:java}
ZOO_MY_ID=zk_id
ZOO_INIT_LIMIT=10
ZOO_SYNC_LIMIT=5
ZOO_MAX_CLIENT_CNXNS=0
ZOO_4LW_COMMANDS_WHITELIST=stat,mntr,conf,ruok
ZOO_STANDALONE_ENABLED=False
ZOO_CFG_EXTRA=metricsProvider.className=org.apache.zookeeper.metrics.prometheus.PrometheusMetricsProvider
metricsProvider.httpPort=7000 metricsProvider.exportJvmInfo=true
ZOO_SERVERS=server.1=zk1_fqdn:2888:3888;2181 server.2=zk2_fqdn:2888:3888;2181
server.3=zk3_fqdn:2888:3888;2181
{code}
Reporter: Tifenn LE GOFF
Some ZK cannot join cluster after moment
{code:java}
echo stat|nc $HOSTNAME 2181
This ZooKeeper instance is not currently serving requests
{code}
We have 3 ZK, zk1 with id1, zk2 with id2 and zk3 with id3.
ZK2 and ZK3 are already running. When ZK1 connect to ZK, we have
{code:java}
2021-09-07 13:33:09,585 [myid:1] - INFO
[QuorumPeer[myid=1](plain=0.0.0.0:2181)(secure=disabled):FastLeaderElection@979]
- Notification time out: 60000
2021-09-07 13:33:09,586 [myid:1] - INFO
[QuorumConnectionThread-[myid=1]-55:QuorumCnxManager@513] - Have smaller server
identifier, so dropping the connection: (myId:1 --> sid:2)
2021-09-07 13:33:09,586 [myid:1] - INFO
[QuorumConnectionThread-[myid=1]-56:QuorumCnxManager@513] - Have smaller server
identifier, so dropping the connection: (myId:1 --> sid:3)
2021-09-07 13:33:30,269 [myid:1] - WARN [NIOWorkerThread-1:NIOServerCnxn@373]
- Close of session 0x0
java.io.IOException: ZooKeeperServer not running
at
org.apache.zookeeper.server.NIOServerCnxn.readLength(NIOServerCnxn.java:544)
at
org.apache.zookeeper.server.NIOServerCnxn.doIO(NIOServerCnxn.java:332)
at
org.apache.zookeeper.server.NIOServerCnxnFactory$IOWorkRequest.doWork(NIOServerCnxnFactory.java:522)
at
org.apache.zookeeper.server.WorkerService$ScheduledWorkRequest.run(WorkerService.java:154)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown
Source)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown
Source)
at java.base/java.lang.Thread.run(Unknown Source)
2021-09-07 13:33:30,941 [myid:1] - WARN [NIOWorkerThread-2:NIOServerCnxn@373]
- Close of session 0x0
java.io.IOException: ZooKeeperServer not running
at
org.apache.zookeeper.server.NIOServerCnxn.readLength(NIOServerCnxn.java:544)
at
org.apache.zookeeper.server.NIOServerCnxn.doIO(NIOServerCnxn.java:332)
at
org.apache.zookeeper.server.NIOServerCnxnFactory$IOWorkRequest.doWork(NIOServerCnxnFactory.java:522)
at
org.apache.zookeeper.server.WorkerService$ScheduledWorkRequest.run(WorkerService.java:154)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown
Source)
at java.base/java.util.concurrent.ThreadPoolExecutor$W{code}
and on zk2 (actual leader), we have
{code:java}
2021-09-07 13:33:09,587 [myid:2] - INFO
[ListenerHandler-fqdn-zk2/172.17.0.2:3888:QuorumCnxManager$Listener$ListenerHandler@1070]
- Received connection request from /ip-zk1:53102
2021-09-07 13:33:09,588 [myid:2] - WARN
[SendWorker:1:QuorumCnxManager$SendWorker@1281] - Interrupted while waiting for
message on queue
java.lang.InterruptedException
at
java.base/java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.reportInterruptAfterWait(Unknown
Source)
at
java.base/java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(Unknown
Source)
at
org.apache.zookeeper.util.CircularBlockingQueue.poll(CircularBlockingQueue.java:105)
at
org.apache.zookeeper.server.quorum.QuorumCnxManager.pollSendQueue(QuorumCnxManager.java:1446)
at
org.apache.zookeeper.server.quorum.QuorumCnxManager.access$900(QuorumCnxManager.java:98)
at
org.apache.zookeeper.server.quorum.QuorumCnxManager$SendWorker.run(QuorumCnxManager.java:1270)
2021-09-07 13:33:09,588 [myid:2] - WARN
[SendWorker:1:QuorumCnxManager$SendWorker@1293] - Send worker leaving thread id
1 my id = 2
2021-09-07 13:33:09,588 [myid:2] - WARN
[RecvWorker:1:QuorumCnxManager$RecvWorker@1395] - Connection broken for id 1,
my id = 2
java.net.SocketException: Socket closed
at java.base/java.net.SocketInputStream.socketRead0(Native Method)
at java.base/java.net.SocketInputStream.socketRead(Unknown Source)
at java.base/java.net.SocketInputStream.read(Unknown Source)
at java.base/java.net.SocketInputStream.read(Unknown Source)
at java.base/java.io.BufferedInputStream.fill(Unknown Source)
at java.base/java.io.BufferedInputStream.read(Unknown Source)
at java.base/java.io.DataInputStream.readInt(Unknown Source)
at
org.apache.zookeeper.server.quorum.QuorumCnxManager$RecvWorker.run(QuorumCnxManager.java:1383)
2021-09-07 13:33:09,589 [myid:2] - WARN
[RecvWorker:1:QuorumCnxManager$RecvWorker@1401] - Interrupting SendWorker
thread from RecvWorker. sid: 1. myId: 2
2021-09-07 13:33:09,589 [myid:2] - INFO
[ListenerHandler-fqdn-zk2/172.17.0.2:3888:QuorumCnxManager$Listener$ListenerHandler@1070]
- Received connection request from /172.17.0.2:35380
2021-09-07 13:33:09,590 [myid:2] - WARN
[ListenerHandler-fqdn-zk2/172.17.0.2:3888:QuorumCnxManager@662] - We got a
connection request from a server with our own ID. This should be either a
configuration error, or a bug.
{code}
If we restart leader, it works. This issue happen very often since we have
migrate our ZK services on docker instances.
--
This message was sent by Atlassian Jira
(v8.3.4#803005)