Tifenn LE GOFF created ZOOKEEPER-4371:
-----------------------------------------

             Summary: False ID conflict when ZK tries to connect to the cluster
                 Key: ZOOKEEPER-4371
                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-4371
             Project: ZooKeeper
          Issue Type: Bug
    Affects Versions: 3.6.2
         Environment: We are running ZooKeeper 3.6.2 on Docker with the official image. 
It can be reproduced with something like

 
{code:java}
docker run --add-host=zk_fqdn:zk_ip --ulimit nofile=64000:64000 \
  -p ip_zk:2181:2181 -p ip_zk:2888:2888 -p ip_zk:3888:3888 \
  -p ip_zk:7000:7000 -p ip_zk:8080:8080 \
  -v /data/zookeeper/data:/data -v /data/zookeeper/log:/datalog \
  --hostname "zk_fqdn" --env-file "/data/zookeeper/conf/zk.env" --name zookeeper \
  zookeeper:3.6.2
{code}
with the following environment file (zk.env):
{code:java}
ZOO_MY_ID=zk_id
ZOO_INIT_LIMIT=10
ZOO_SYNC_LIMIT=5
ZOO_MAX_CLIENT_CNXNS=0
ZOO_4LW_COMMANDS_WHITELIST=stat,mntr,conf,ruok
ZOO_STANDALONE_ENABLED=False
ZOO_CFG_EXTRA=metricsProvider.className=org.apache.zookeeper.metrics.prometheus.PrometheusMetricsProvider metricsProvider.httpPort=7000 metricsProvider.exportJvmInfo=true
ZOO_SERVERS=server.1=zk1_fqdn:2888:3888;2181 server.2=zk2_fqdn:2888:3888;2181 server.3=zk3_fqdn:2888:3888;2181

{code}
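To check what the official image actually generated from these variables, the rendered configuration and the server id can be dumped from each container (a sketch, assuming the stock docker-entrypoint.sh of the official image, which writes the configuration to /conf/zoo.cfg and the id to /data/myid):
{code:java}
# Configuration generated from the ZOO_* variables (server.N list, initLimit, metricsProvider, ...)
docker exec zookeeper cat /conf/zoo.cfg

# Each node must carry a distinct id matching its server.N entry in ZOO_SERVERS
docker exec zookeeper cat /data/myid
{code}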
 
            Reporter: Tifenn LE GOFF


Some ZooKeeper nodes cannot join the cluster after a while:
{code:java}
echo stat|nc $HOSTNAME 2181
This ZooKeeper instance is not currently serving requests
{code}
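The other four-letter-word commands whitelisted in the environment above can be used the same way to compare what each node reports (same nc-based check as the stat call):
{code:java}
# ruok answers "imok" as long as the process accepts connections, even while it is not serving
echo ruok | nc $HOSTNAME 2181

# mntr exposes zk_server_state (leader/follower) once the node has joined the quorum
echo mntr | nc $HOSTNAME 2181

# conf dumps the runtime configuration, including the membership and the local server id
echo conf | nc $HOSTNAME 2181
{code}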
We have 3 ZooKeeper nodes: zk1 with id 1, zk2 with id 2, and zk3 with id 3.

ZK2 and ZK3 are already running. When ZK1 connects to the cluster, we see on ZK1:
{code:java}
2021-09-07 13:33:09,585 [myid:1] - INFO  [QuorumPeer[myid=1](plain=0.0.0.0:2181)(secure=disabled):FastLeaderElection@979] - Notification time out: 60000
2021-09-07 13:33:09,586 [myid:1] - INFO  [QuorumConnectionThread-[myid=1]-55:QuorumCnxManager@513] - Have smaller server identifier, so dropping the connection: (myId:1 --> sid:2)
2021-09-07 13:33:09,586 [myid:1] - INFO  [QuorumConnectionThread-[myid=1]-56:QuorumCnxManager@513] - Have smaller server identifier, so dropping the connection: (myId:1 --> sid:3)
2021-09-07 13:33:30,269 [myid:1] - WARN  [NIOWorkerThread-1:NIOServerCnxn@373] - Close of session 0x0
java.io.IOException: ZooKeeperServer not running
        at org.apache.zookeeper.server.NIOServerCnxn.readLength(NIOServerCnxn.java:544)
        at org.apache.zookeeper.server.NIOServerCnxn.doIO(NIOServerCnxn.java:332)
        at org.apache.zookeeper.server.NIOServerCnxnFactory$IOWorkRequest.doWork(NIOServerCnxnFactory.java:522)
        at org.apache.zookeeper.server.WorkerService$ScheduledWorkRequest.run(WorkerService.java:154)
        at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
        at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
        at java.base/java.lang.Thread.run(Unknown Source)
2021-09-07 13:33:30,941 [myid:1] - WARN  [NIOWorkerThread-2:NIOServerCnxn@373] - Close of session 0x0
java.io.IOException: ZooKeeperServer not running
        at org.apache.zookeeper.server.NIOServerCnxn.readLength(NIOServerCnxn.java:544)
        at org.apache.zookeeper.server.NIOServerCnxn.doIO(NIOServerCnxn.java:332)
        at org.apache.zookeeper.server.NIOServerCnxnFactory$IOWorkRequest.doWork(NIOServerCnxnFactory.java:522)
        at org.apache.zookeeper.server.WorkerService$ScheduledWorkRequest.run(WorkerService.java:154)
        at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
        at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
        at java.base/java.lang.Thread.run(Unknown Source)
{code}
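As far as we understand, the "Have smaller server identifier, so dropping the connection" messages are the normal tie-break in the election protocol: the node with the lower id drops its outgoing connection and expects the peer with the higher id to connect back on the election port. So ZK1 ends up waiting for inbound connections on 3888 that apparently never arrive. A quick reachability check of ZK1's quorum and election ports from the other hosts (using the same nc as above; zk1_fqdn as in ZOO_SERVERS):
{code:java}
# Run from the zk2 and zk3 hosts: can they reach back to ZK1?
nc -vz zk1_fqdn 2888   # quorum port
nc -vz zk1_fqdn 3888   # leader election port
{code}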
On ZK2 (the current leader), we see:
{code:java}
2021-09-07 13:33:09,587 [myid:2] - INFO  [ListenerHandler-fqdn-zk2/172.17.0.2:3888:QuorumCnxManager$Listener$ListenerHandler@1070] - Received connection request from /ip-zk1:53102
2021-09-07 13:33:09,588 [myid:2] - WARN  [SendWorker:1:QuorumCnxManager$SendWorker@1281] - Interrupted while waiting for message on queue
java.lang.InterruptedException
        at java.base/java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.reportInterruptAfterWait(Unknown Source)
        at java.base/java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(Unknown Source)
        at org.apache.zookeeper.util.CircularBlockingQueue.poll(CircularBlockingQueue.java:105)
        at org.apache.zookeeper.server.quorum.QuorumCnxManager.pollSendQueue(QuorumCnxManager.java:1446)
        at org.apache.zookeeper.server.quorum.QuorumCnxManager.access$900(QuorumCnxManager.java:98)
        at org.apache.zookeeper.server.quorum.QuorumCnxManager$SendWorker.run(QuorumCnxManager.java:1270)
2021-09-07 13:33:09,588 [myid:2] - WARN  [SendWorker:1:QuorumCnxManager$SendWorker@1293] - Send worker leaving thread id 1 my id = 2
2021-09-07 13:33:09,588 [myid:2] - WARN  [RecvWorker:1:QuorumCnxManager$RecvWorker@1395] - Connection broken for id 1, my id = 2
java.net.SocketException: Socket closed
        at java.base/java.net.SocketInputStream.socketRead0(Native Method)
        at java.base/java.net.SocketInputStream.socketRead(Unknown Source)
        at java.base/java.net.SocketInputStream.read(Unknown Source)
        at java.base/java.net.SocketInputStream.read(Unknown Source)
        at java.base/java.io.BufferedInputStream.fill(Unknown Source)
        at java.base/java.io.BufferedInputStream.read(Unknown Source)
        at java.base/java.io.DataInputStream.readInt(Unknown Source)
        at org.apache.zookeeper.server.quorum.QuorumCnxManager$RecvWorker.run(QuorumCnxManager.java:1383)
2021-09-07 13:33:09,589 [myid:2] - WARN  [RecvWorker:1:QuorumCnxManager$RecvWorker@1401] - Interrupting SendWorker thread from RecvWorker. sid: 1. myId: 2
2021-09-07 13:33:09,589 [myid:2] - INFO  [ListenerHandler-fqdn-zk2/172.17.0.2:3888:QuorumCnxManager$Listener$ListenerHandler@1070] - Received connection request from /172.17.0.2:35380
2021-09-07 13:33:09,590 [myid:2] - WARN  [ListenerHandler-fqdn-zk2/172.17.0.2:3888:QuorumCnxManager@662] - We got a connection request from a server with our own ID. This should be either a configuration error, or a bug.
{code}
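The second connection request is the suspicious one: it comes from /172.17.0.2:35380, which is the same address ZK2's own listener is bound to (see the ListenerHandler thread name), and it is immediately followed by the own-ID warning. That warning is emitted when the server id announced in the incoming election handshake equals the local id, so ZK2 appears to be receiving a connection that loops back to itself through the Docker bridge/NAT rather than one coming from ZK1. It may be worth comparing how the peer FQDNs resolve inside the containers with the containers' own addresses (a sketch; getent should be available in the Debian-based official image):
{code:java}
# Inside the zk2 container: how do the peer names resolve?
docker exec zookeeper getent hosts zk1_fqdn zk2_fqdn zk3_fqdn

# The container's own address on the default bridge, for comparison
docker inspect -f '{{.NetworkSettings.IPAddress}}' zookeeper
{code}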
If we restart the leader, the node joins the cluster again. This issue has happened very often since we migrated our ZooKeeper services to Docker instances.
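For completeness, the workaround is simply restarting the leader's container (the container name comes from the docker run command in the environment section):
{code:java}
# On the host that currently runs the leader (ZK2 in the example above)
docker restart zookeeper
{code}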


