[jira] [Commented] (ZOOKEEPER-3828) zookeeper clients gets connection timeout when the leader node is restarted

Jeff Walters (Jira) Thu, 09 Jul 2020 13:24:18 -0700


    [ 
https://issues.apache.org/jira/browse/ZOOKEEPER-3828?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17154907#comment-17154907
 ]


Jeff Walters commented on ZOOKEEPER-3828:
-----------------------------------------

I have seen this issue as well.  I have several environments, all running 
3.6.1 (some physical, some on VirtualBox on Windows 10; none are Dockerized).

My configuration is the same on all:

server.1=<server1 IP>:2888:3888
server.2=<server2 IP>:2888:3888
server.3=<server3 IP>:2888:3888

When the cluster is first started, zkCli.sh connects on all nodes.  If I stop 
only one ZK instance (server.2 or server.3) while the other two remain 
operational and connected, then restart the stopped instance, the issue does 
not present.  If I stop ZK on server.1 (while server.2 and server.3 are 
connected and operational) and then restart server.1, zkCli.sh will not connect.

By issuing a restart command (zkServer.sh restart) on all 3 nodes (in order: 
1, 2, 3), I was able to restore the cluster to an operational state; however, 
there is a moment of downtime.
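
To tell whether the failure follows the leader role or the node with myid=1, 
it helps to record each node's role before and after every restart.  Below is 
a minimal sketch, assuming the "srvr" four-letter command is permitted by 
4lw.commands.whitelist and that each server listens on client port 2181.  The 
function names are hypothetical and not part of ZooKeeper.

```python
import socket


def parse_mode(srvr_output):
    """Return the value of the 'Mode:' line in srvr output, or None if absent."""
    for line in srvr_output.splitlines():
        if line.startswith("Mode:"):
            return line.split(":", 1)[1].strip()
    return None


def query_mode(host, port=2181, timeout=5.0):
    """Send 'srvr' to a server's client port and return the mode it reports."""
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(b"srvr")
        chunks = []
        # The server closes the connection once it has written its reply.
        while True:
            data = sock.recv(4096)
            if not data:
                break
            chunks.append(data)
    return parse_mode(b"".join(chunks).decode("utf-8", errors="replace"))
```

Calling query_mode("<server1 IP>") for each node should report leader or 
follower for a healthy quorum member; a node that accepts the socket but never 
answers raises a timeout error, which matches the symptom described here.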

As a test, I configured a new cluster (VirtualBox environment on Windows 10) as 
follows:

server.2=<server1 IP>:2888:3888
server.3=<server2 IP>:2888:3888
server.4=<server3 IP>:2888:3888

(The myid entry on each node corresponds to the server.x number above.)
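
Because the renumbering only works if every myid file agrees with its 
server.x line, a quick consistency check can rule out a mismatch.  This is a 
minimal sketch under stated assumptions: it handles only the plain 
host:port:port form shown above (no bracketed IPv6 addresses), and the helper 
names are hypothetical.

```python
import re

# Matches lines such as "server.2=10.0.0.5:2888:3888"; any trailing
# role or client-port suffix is ignored.
SERVER_LINE = re.compile(r"^server\.(\d+)=([^:]+):(\d+):(\d+)")


def parse_servers(cfg_text):
    """Return {server id: (host, quorum port, election port)} from zoo.cfg text."""
    servers = {}
    for line in cfg_text.splitlines():
        match = SERVER_LINE.match(line.strip())
        if match:
            sid, host, quorum, election = match.groups()
            servers[int(sid)] = (host, int(quorum), int(election))
    return servers


def myid_is_consistent(cfg_text, myid_text):
    """True if the myid file content names one of the configured server ids."""
    return int(myid_text.strip()) in parse_servers(cfg_text)
```

With the second configuration above, a myid file containing 2, 3 or 4 passes 
the check, while a leftover myid of 1 fails it.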

The issue did not reappear.  I was able to stop/start (zkServer.sh 
stop/zkServer.sh start) and restart (zkServer.sh restart) all nodes, and 
zkCli.sh (zkCli.sh -server localhost:2181) reconnected fine.

Based on my experience, the issue is isolated to the node that has myid=1 and 
affects no other node.  Not assigning any node the number "1" kept the issue 
from presenting.

> zookeeper clients gets connection timeout when the leader node is restarted
> ---------------------------------------------------------------------------
>
>                 Key: ZOOKEEPER-3828
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-3828
>             Project: ZooKeeper
>          Issue Type: Bug
>          Components: java client
>    Affects Versions: 3.6.1, 3.5.8
>            Reporter: Aishwarya Soni
>            Priority: Minor
>         Attachments: debug_logs.zip, node1.txt, node2.txt, node3.txt, 
> node4.txt, node5.txt
>
>
> I have configured a 5-node ZooKeeper cluster using version 3.6.1 in a Docker 
> containerized environment. As part of some destructive testing, I restarted 
> the ZooKeeper leader. Re-election happened and all 5 nodes (containers) are 
> back in a good state with a new leader. But when I log in to one of the 
> containers, start the CLI (./zkCli.sh) and run the command *ls /*, I see the 
> error below:
>
> [zk: localhost:2181(CONNECTING) 1]
> [zk: localhost:2181(CONNECTING) 1] ls /
> 2020-05-14 23:48:26,556 [myid:localhost:2181] - WARN  
> [main-SendThread(localhost:2181):ClientCnxn$SendThread@1229] - Client session 
> timed out, have not heard from server in 30001ms for session id 0x0
> 2020-05-14 23:48:26,556 [myid:localhost:2181] - WARN  
> [main-SendThread(localhost:2181):ClientCnxn$SendThread@1272] - Session 0x0 
> for sever localhost/127.0.0.1:2181, Closing socket connection. Attempting 
> reconnect except it is a SessionExpiredException.
> org.apache.zookeeper.ClientCnxn$SessionTimeoutException: 
> Client session timed out, have not heard from server in 30001ms for session 
> id 0x0
>     at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1230)
> KeeperErrorCode = ConnectionLoss for /
> [zk: localhost:2181(CONNECTING) 2] 2020-05-14 23:48:28,089 
> [myid:localhost:2181] - INFO  
> [main-SendThread(localhost:2181):ClientCnxn$SendThread@1154] - Opening socket 
> connection to server localhost/127.0.0.1:2181.
> 2020-05-14 23:48:28,089 [myid:localhost:2181] - INFO  
> [main-SendThread(localhost:2181):ClientCnxn$SendThread@1156] - SASL config 
> status: Will not attempt to authenticate using SASL (unknown error)
> 2020-05-14 23:48:28,090 [myid:localhost:2181] - INFO  
> [main-SendThread(localhost:2181):ClientCnxn$SendThread@986] - Socket 
> connection established, initiating session, client: /127.0.0.1:60384, server: 
> localhost/127.0.0.1:2181
> 2020-05-14 23:48:58,119 [myid:localhost:2181] - WARN  
> [main-SendThread(localhost:2181):ClientCnxn$SendThread@1229] - Client session 
> timed out, have not heard from server in 30030ms for session id 0x0
> 2020-05-14 23:48:58,120 [myid:localhost:2181] - WARN  
> [main-SendThread(localhost:2181):ClientCnxn$SendThread@1272] - Session 0x0 
> for sever localhost/127.0.0.1:2181, Closing socket connection. Attempting 
> reconnect except it is a SessionExpiredException.
> org.apache.zookeeper.ClientCnxn$SessionTimeoutException: 
> Client session timed out, have not heard from server in 30030ms for session 
> id 0x0
>     at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1230)
> 2020-05-14 23:49:00,003 [myid:localhost:2181] - INFO  
> [main-SendThread(localhost:2181):ClientCnxn$SendThread@1154] - Opening socket 
> connection to server localhost/127.0.0.1:2181.
> 2020-05-14 23:49:00,004 [myid:localhost:2181] - INFO  
> [main-SendThread(localhost:2181):ClientCnxn$SendThread@1156] - SASL config 
> status: Will not attempt to authenticate using SASL (unknown error)
> 2020-05-14 23:49:00,004 [myid:localhost:2181] - INFO  
> [main-SendThread(localhost:2181):ClientCnxn$SendThread@986] - Socket 
> connection established, initiating session, client: /127.0.0.1:32936, server: 
> localhost/127.0.0.1:2181
> 2020-05-14 23:49:30,032 [myid:localhost:2181] - WARN  
> [main-SendThread(localhost:2181):ClientCnxn$SendThread@1229] - Client session 
> timed out, have not heard from server in 30029ms for session id 0x0
> 2020-05-14 23:49:30,033 [myid:localhost:2181] - WARN  
> [main-SendThread(localhost:2181):ClientCnxn$SendThread@1272] - Session 0x0 
> for sever localhost/127.0.0.1:2181, Closing socket connection. Attempting 
> reconnect except it is a SessionExpiredException.
> org.apache.zookeeper.ClientCnxn$SessionTimeoutException: 
> Client session timed out, have not heard from server in 30029ms for session 
> id 0x0
>     at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1230)
> 2020-05-14 23:49:31,230 [myid:localhost:2181] - INFO  
> [main-SendThread(localhost:2181):ClientCnxn$SendThread@1154] - Opening socket 
> connection to server localhost/127.0.0.1:2181.
> 2020-05-14 23:49:31,230 [myid:localhost:2181] - INFO  
> [main-SendThread(localhost:2181):ClientCnxn$SendThread@1156] - SASL config 
> status: Will not attempt to authenticate using SASL (unknown error)
> 2020-05-14 23:49:31,230 [myid:localhost:2181] - INFO  
> [main-SendThread(localhost:2181):ClientCnxn$SendThread@986] - Socket 
> connection established, initiating session, client: /127.0.0.1:33766, server: 
> localhost/127.0.0.1:2181
>
> Does anyone know what could possibly be wrong? For reference: 
> https://issues.apache.org/jira/browse/ZOOKEEPER-2164
> This behavior is observed on all the nodes when the leader is restarted. All 
> is good when a follower is restarted.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
