[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-1382?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15574315#comment-15574315
 ] 

guoping.gp edited comment on ZOOKEEPER-1382 at 10/14/16 5:47 AM:
-----------------------------------------------------------------

the same issue is still can be found on 3.4.6, 
https://issues.apache.org/jira/browse/ZOOKEEPER-2615.

when the client app reconnect to a server, the setWatchs and close still stay 
in the race condition as following possible:

Daemon Thread [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:6181] (Suspended 
(breakpoint at line 966 in NIOServerCnxn)) 
NIOServerCnxn.close() line: 966 
NIOServerCnxn.doIO(SelectionKey) line: 360      
NIOServerCnxnFactory.run() line: 208    
Thread.run() line: 745

and then

Thread [CommitProcessor:1] (Suspended (breakpoint at line 303 in 
FinalRequestProcessor))        
FinalRequestProcessor.processRequest(Request) line: 303 
CommitProcessor.run() line: 74

the issue cause our production leak about 1 million watchs


was (Author: guoping.gp):
the same issue is still can be found on 3.4.6, 
https://issues.apache.org/jira/browse/ZOOKEEPER-2615.

when the client app reconnect to a server, the setWatchs and close still stay 
in the race condition as following possible:

Daemon Thread [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:6181] (Suspended 
(breakpoint at line 966 in NIOServerCnxn)) 
NIOServerCnxn.close() line: 966 
NIOServerCnxn.doIO(SelectionKey) line: 360      
NIOServerCnxnFactory.run() line: 208    
Thread.run() line: 745

and then

Thread [CommitProcessor:1] (Suspended (breakpoint at line 303 in 
FinalRequestProcessor))        
FinalRequestProcessor.processRequest(Request) line: 303 
CommitProcessor.run() line: 74

> Zookeeper server holds onto dead/expired session ids in the watch data 
> structures
> ---------------------------------------------------------------------------------
>
>                 Key: ZOOKEEPER-1382
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1382
>             Project: ZooKeeper
>          Issue Type: Bug
>          Components: server
>    Affects Versions: 3.4.5
>            Reporter: Neha Narkhede
>            Assignee: Germán Blanco
>            Priority: Critical
>             Fix For: 3.4.6, 3.5.0
>
>         Attachments: ZOOKEEPER-1382-branch-3.4.patch, 
> ZOOKEEPER-1382-branch-3.4.patch, ZOOKEEPER-1382-branch-3.4.patch, 
> ZOOKEEPER-1382-branch-3.4.patch, ZOOKEEPER-1382.patch, ZOOKEEPER-1382.patch, 
> ZOOKEEPER-1382.patch, ZOOKEEPER-1382.patch, ZOOKEEPER-1382.patch, 
> ZOOKEEPER-1382_3.3.4.patch
>
>
> I've observed that zookeeper server holds onto expired session ids in the 
> watcher data structures. The result is the wchp command reports session ids 
> that cannot be found through cons/dump and those expired session ids sit 
> there maybe until the server is restarted. Here are snippets from the client 
> and the server logs that lead to this state, for one particular session id 
> 0x134485fd7bcb26f -
> There are 4 servers in the zookeeper cluster - 223, 224, 225 (leader), 226 
> and I'm using ZkClient to connect to the cluster
> From the application log -
> application.log.2012-01-26-325.gz:2012/01/26 04:56:36.177 INFO [ClientCnxn] 
> [main-SendThread(223.prod:12913)] [application Session establishment complete 
> on server 223.prod/172.17.135.38:12913, sessionid = 0x134485fd7bcb26f, 
> negotiated timeout = 6000
> application.log.2012-01-27.gz:2012/01/27 09:52:37.714 INFO [ClientCnxn] 
> [main-SendThread(223.prod:12913)] [application] Client session timed out, 
> have not heard from server in 9827ms for sessionid 0x134485fd7bcb26f, closing 
> socket connection and attempting reconnect
> application.log.2012-01-27.gz:2012/01/27 09:52:38.191 INFO [ClientCnxn] 
> [main-SendThread(226.prod:12913)] [application] Unable to reconnect to 
> ZooKeeper service, session 0x134485fd7bcb26f has expired, closing socket 
> connection
> On the leader zk, 225 -
> zookeeper.log.2012-01-27-leader-225.gz:2012-01-27 09:52:34,010 - INFO  
> [SessionTracker:ZooKeeperServer@314] - Expiring session 0x134485fd7bcb26f, 
> timeout of 6000ms exceeded
> zookeeper.log.2012-01-27-leader-225.gz:2012-01-27 09:52:34,010 - INFO  
> [ProcessThread:-1:PrepRequestProcessor@391] - Processed session termination 
> for sessionid: 0x134485fd7bcb26f
> On the server, the client was initially connected to, 223 -
> zookeeper.log.2012-01-26-223.gz:2012-01-26 04:56:36,173 - INFO  
> [CommitProcessor:1:NIOServerCnxn@1580] - Established session 
> 0x134485fd7bcb26f with negotiated timeout 6000 for client /172.17.136.82:45020
> zookeeper.log.2012-01-27-223.gz:2012-01-27 09:52:34,018 - INFO  
> [CommitProcessor:1:NIOServerCnxn@1435] - Closed socket connection for client 
> /172.17.136.82:45020 which had sessionid 0x134485fd7bcb26f
> Here are the log snippets from 226, which is the server, the client 
> reconnected to, before getting session expired event -
> 2012-01-27 09:52:38,190 - INFO  
> [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:12913:NIOServerCnxn@770] - Client 
> attempting to renew session 0x134485fd7bcb26f at /172.17.136.82:49367
> 2012-01-27 09:52:38,191 - INFO  
> [QuorumPeer:/0.0.0.0:12913:NIOServerCnxn@1573] - Invalid session 
> 0x134485fd7bcb26f for client /172.17.136.82:49367, probably expired
> 2012-01-27 09:52:38,191 - INFO  
> [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:12913:NIOServerCnxn@1435] - Closed 
> socket connection for client /172.17.136.82:49367 which had sessionid 
> 0x134485fd7bcb26f
> wchp output from 226, taken on 01/30 -
> nnarkhed-ld:zk-cons-wchp-2012013000 nnarkhed$ grep 0x134485fd7bcb26f 
> *226.*wchp* | wc -l
> 3
> wchp output from 223, taken on 01/30 -
> nnarkhed-ld:zk-cons-wchp-2012013000 nnarkhed$ grep 0x134485fd7bcb26f 
> *223.*wchp* | wc -l
> 0
> cons output from 223 and 226, taken on 01/30 -
> nnarkhed-ld:zk-cons-wchp-2012013000 nnarkhed$ grep 0x134485fd7bcb26f 
> *226.*cons* | wc -l
> 0
> nnarkhed-ld:zk-cons-wchp-2012013000 nnarkhed$ grep 0x134485fd7bcb26f 
> *223.*cons* | wc -l
> 0
> So, what seems to have happened is that the client was able to re-register 
> the watches on the new server (226), after it got disconnected from 223, 
> inspite of having an expired session id. 
> In NIOServerCnxn, I saw that after suspecting that a session is expired, a 
> server removes the cnxn and its watches from its internal data structures. 
> But before that it allows more requests to be processed even if the session 
> is expired -
>             // Now that the session is ready we can start receiving packets
>             synchronized (this.factory) {
>                 sk.selector().wakeup();
>                 enableRecv();
>             }
>         } catch (Exception e) {
>             LOG.warn("Exception while establishing session, closing", e);
>             close();
>         }
> I wonder if the client somehow sneaked in the set watches, right after the 
> server removed the connection through removeCnxn() API ?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Reply via email to