Viktor Feklin created ZOOKEEPER-4613:
----------------------------------------

             Summary: Zookeper client suddenly stops sending ping requests in 
idle state
                 Key: ZOOKEEPER-4613
                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-4613
             Project: ZooKeeper
          Issue Type: Bug
          Components: java client
    Affects Versions: 3.5.7
            Reporter: Viktor Feklin
         Attachments: clint-lost-ping.log

Expected behavior: from documentation: "The session is kept alive by requests 
sent by the client. If the session is idle for a period of time that would 
timeout the session, the client will send a PING request to keep the session 
alive."
Actual behavior: periodically client not sending any ping after last 
successfuly request.

 

Client operates as expected (handling server notifications and sending pings 
during idle periods). But after handle another notification it suddenly stops 
sending any ping (no matter how long idle interval last). Client sleeps until 
external events is happend (server notification, or program call).
After wakeup client starts to check session timeouts, detects long pause and 
triggers connection to state SUSPENDED. Client "closing socket connection and 
attempting reconnect". If period of inactivity was smaller than the session 
timeout (session is alive) connection triggers to state RECONNECTED, otherwise 
new seession created and connection goes to state LOST and then to state 
RECONNECTED. 

 

In the attached log:
1) We have multiple clients working in one JVM.
2) There is a number of zookeeper servers in a cluster.
3) Zookeeper client with seession 0x3008266486f2172 connected to 
10.36.193.111:2181 with negotiated timeout = 30_000
4) Client handles last notification at:

 
{code:java}
15:19:48,874 (mt-[15H15M11S3617]-[module-3]-SendThread(10.36.193.111:2181)) 
DEBUG org.apache.zookeeper.ClientCnxn$SendThread:923 - Reading reply 
sessionid:0x3008266486f2172, packet:: 
clientPath:/tc/default/supervisors/wf_1057 
serverPath:/tc/default/supervisors/wf_1057 finished:false header:: 6131,4  
replyHeader:: 6131,42964550082,0  request:: '/tc/default/supervisors/wf_1057,T  
response:: 
#***,s{42964550082,42964550082,1663244388854,1663244388854,0,0,0,144258549042716267,897,0,42964550082}{code}
 

5) Client shows no activity from this moment until watcher notification from 
server (client wakeups and detects session timeout):

 
{code:java}
15:20:08,884 INFO  [stdout] 
(mt-[15H15M11S3617]-[module-3]-SendThread(10.36.193.111:2181))  15:20:08 WARN  
org.apache.zookeeper.ClientCnxn$SendThread:1190 - Client session timed out, 
have not heard from server in 20010ms for sessionid 0x3008266486f2172
15:20:08,885 INFO  [stdout] 
(mt-[15H15M11S3617]-[module-3]-SendThread(10.36.193.111:2181))  15:20:08 INFO  
org.apache.zookeeper.ClientCnxn$SendThread:1238 - Client session timed out, 
have not heard from server in 20010ms for sessionid 0x3008266486f2172, closing 
socket connection and attempting reconnect{code}
 

6) Ping period = 10 seconds (actualy there were no pings in 20 seconds).
7) Annother clients (running in the same jvm and connected to the same zk 
server) operate normaly while broken client do nothing.

 
{code:java}
15:19:58,885 INFO  [stdout] 
(mt-[15H15M11S3617]-[module-5]-SendThread(10.36.193.111:2181))  15:19:58 DEBUG 
org.apache.zookeeper.ClientCnxn$SendThread:821 - Got ping response for 
sessionid: 0x3008266486f216b after 0ms
15:19:58,975 INFO  [stdout] 
(mt-[15H15M11S3617]-[module-4]-SendThread(10.36.193.111:2181))  15:19:58 DEBUG 
org.apache.zookeeper.ClientCnxn$SendThread:821 - Got ping response for 
sessionid: 0x3008266486f2171 after 1ms{code}
 

So we can claim: 
 * there is no long GC pauses on client (we also has GC log for period of 
incident without any issues)
 * there is no problem with zk server (no reboots etc, cause it successfuly 
handles another clints requests)    

We continuously see this behavior after zk version upgrade:
zk client 3.4.8 ->  3.5.7 (curator 2.12.0 -> 4.3.0)
zk server upgrade to 3.5.8 (runing with default configuration)

We have state recovery after session exiration - so connection loss trigger 
nodes recreation. Another clients react to add/remove events (doing some 
computations). Having thousands of clients repeadly reconnecting - we get 
anomaly work load even in idle state...

 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

Reply via email to