[jira] [Commented] (ZOOKEEPER-1382) Zookeeper server holds onto dead/expired session ids in the watch data structures

2013-02-11 Thread Michael Morello (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-1382?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13575670#comment-13575670
 ] 

Michael Morello commented on ZOOKEEPER-1382:


Since we monitor sessions and watches we note that we encounter this problem 
from time to time.
This is not a big issue but it causes a memory leak.
I'll try to see if I can work on it but it does seem hard to reproduce in a 
unit test.

 Zookeeper server holds onto dead/expired session ids in the watch data 
 structures
 -

 Key: ZOOKEEPER-1382
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1382
 Project: ZooKeeper
  Issue Type: Bug
  Components: server
Affects Versions: 3.3.4
Reporter: Neha Narkhede
Assignee: Neha Narkhede
Priority: Critical
 Fix For: 3.4.6

 Attachments: ZOOKEEPER-1382_3.3.4.patch


 I've observed that zookeeper server holds onto expired session ids in the 
 watcher data structures. The result is the wchp command reports session ids 
 that cannot be found through cons/dump and those expired session ids sit 
 there maybe until the server is restarted. Here are snippets from the client 
 and the server logs that lead to this state, for one particular session id 
 0x134485fd7bcb26f -
 There are 4 servers in the zookeeper cluster - 223, 224, 225 (leader), 226 
 and I'm using ZkClient to connect to the cluster
 From the application log -
 application.log.2012-01-26-325.gz:2012/01/26 04:56:36.177 INFO [ClientCnxn] 
 [main-SendThread(223.prod:12913)] [application Session establishment complete 
 on server 223.prod/172.17.135.38:12913, sessionid = 0x134485fd7bcb26f, 
 negotiated timeout = 6000
 application.log.2012-01-27.gz:2012/01/27 09:52:37.714 INFO [ClientCnxn] 
 [main-SendThread(223.prod:12913)] [application] Client session timed out, 
 have not heard from server in 9827ms for sessionid 0x134485fd7bcb26f, closing 
 socket connection and attempting reconnect
 application.log.2012-01-27.gz:2012/01/27 09:52:38.191 INFO [ClientCnxn] 
 [main-SendThread(226.prod:12913)] [application] Unable to reconnect to 
 ZooKeeper service, session 0x134485fd7bcb26f has expired, closing socket 
 connection
 On the leader zk, 225 -
 zookeeper.log.2012-01-27-leader-225.gz:2012-01-27 09:52:34,010 - INFO  
 [SessionTracker:ZooKeeperServer@314] - Expiring session 0x134485fd7bcb26f, 
 timeout of 6000ms exceeded
 zookeeper.log.2012-01-27-leader-225.gz:2012-01-27 09:52:34,010 - INFO  
 [ProcessThread:-1:PrepRequestProcessor@391] - Processed session termination 
 for sessionid: 0x134485fd7bcb26f
 On the server, the client was initially connected to, 223 -
 zookeeper.log.2012-01-26-223.gz:2012-01-26 04:56:36,173 - INFO  
 [CommitProcessor:1:NIOServerCnxn@1580] - Established session 
 0x134485fd7bcb26f with negotiated timeout 6000 for client /172.17.136.82:45020
 zookeeper.log.2012-01-27-223.gz:2012-01-27 09:52:34,018 - INFO  
 [CommitProcessor:1:NIOServerCnxn@1435] - Closed socket connection for client 
 /172.17.136.82:45020 which had sessionid 0x134485fd7bcb26f
 Here are the log snippets from 226, which is the server, the client 
 reconnected to, before getting session expired event -
 2012-01-27 09:52:38,190 - INFO  
 [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:12913:NIOServerCnxn@770] - Client 
 attempting to renew session 0x134485fd7bcb26f at /172.17.136.82:49367
 2012-01-27 09:52:38,191 - INFO  
 [QuorumPeer:/0.0.0.0:12913:NIOServerCnxn@1573] - Invalid session 
 0x134485fd7bcb26f for client /172.17.136.82:49367, probably expired
 2012-01-27 09:52:38,191 - INFO  
 [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:12913:NIOServerCnxn@1435] - Closed 
 socket connection for client /172.17.136.82:49367 which had sessionid 
 0x134485fd7bcb26f
 wchp output from 226, taken on 01/30 -
 nnarkhed-ld:zk-cons-wchp-2012013000 nnarkhed$ grep 0x134485fd7bcb26f 
 *226.*wchp* | wc -l
 3
 wchp output from 223, taken on 01/30 -
 nnarkhed-ld:zk-cons-wchp-2012013000 nnarkhed$ grep 0x134485fd7bcb26f 
 *223.*wchp* | wc -l
 0
 cons output from 223 and 226, taken on 01/30 -
 nnarkhed-ld:zk-cons-wchp-2012013000 nnarkhed$ grep 0x134485fd7bcb26f 
 *226.*cons* | wc -l
 0
 nnarkhed-ld:zk-cons-wchp-2012013000 nnarkhed$ grep 0x134485fd7bcb26f 
 *223.*cons* | wc -l
 0
 So, what seems to have happened is that the client was able to re-register 
 the watches on the new server (226), after it got disconnected from 223, 
 inspite of having an expired session id. 
 In NIOServerCnxn, I saw that after suspecting that a session is expired, a 
 server removes the cnxn and its watches from its internal data structures. 
 But before that it allows more requests to be processed even if the session 
 is expired -
 // Now that the session is ready we can start receiving packets
 

[jira] [Updated] (ZOOKEEPER-1382) Zookeeper server holds onto dead/expired session ids in the watch data structures

2013-03-03 Thread Michael Morello (JIRA)

 [ 
https://issues.apache.org/jira/browse/ZOOKEEPER-1382?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Morello updated ZOOKEEPER-1382:
---

Attachment: ZOOKEEPER-1382.patch

I managed to create a JUnit test case that demonstrate the watches leak, 
unfortunately i had to promote a few methods to the public scope.

As Neha pointed out the unit test shows that if a client send a connection 
request with an expired session the NIOServerCnxn class still processes 
upcoming messages like OpCode.setWatches.

Actually the fix simply consists to call cnxn.enableRecv() only if the session 
has been validated by the leader.

 Zookeeper server holds onto dead/expired session ids in the watch data 
 structures
 -

 Key: ZOOKEEPER-1382
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1382
 Project: ZooKeeper
  Issue Type: Bug
  Components: server
Affects Versions: 3.4.5
Reporter: Neha Narkhede
Assignee: Neha Narkhede
Priority: Critical
 Fix For: 3.4.6

 Attachments: ZOOKEEPER-1382_3.3.4.patch, ZOOKEEPER-1382.patch


 I've observed that zookeeper server holds onto expired session ids in the 
 watcher data structures. The result is the wchp command reports session ids 
 that cannot be found through cons/dump and those expired session ids sit 
 there maybe until the server is restarted. Here are snippets from the client 
 and the server logs that lead to this state, for one particular session id 
 0x134485fd7bcb26f -
 There are 4 servers in the zookeeper cluster - 223, 224, 225 (leader), 226 
 and I'm using ZkClient to connect to the cluster
 From the application log -
 application.log.2012-01-26-325.gz:2012/01/26 04:56:36.177 INFO [ClientCnxn] 
 [main-SendThread(223.prod:12913)] [application Session establishment complete 
 on server 223.prod/172.17.135.38:12913, sessionid = 0x134485fd7bcb26f, 
 negotiated timeout = 6000
 application.log.2012-01-27.gz:2012/01/27 09:52:37.714 INFO [ClientCnxn] 
 [main-SendThread(223.prod:12913)] [application] Client session timed out, 
 have not heard from server in 9827ms for sessionid 0x134485fd7bcb26f, closing 
 socket connection and attempting reconnect
 application.log.2012-01-27.gz:2012/01/27 09:52:38.191 INFO [ClientCnxn] 
 [main-SendThread(226.prod:12913)] [application] Unable to reconnect to 
 ZooKeeper service, session 0x134485fd7bcb26f has expired, closing socket 
 connection
 On the leader zk, 225 -
 zookeeper.log.2012-01-27-leader-225.gz:2012-01-27 09:52:34,010 - INFO  
 [SessionTracker:ZooKeeperServer@314] - Expiring session 0x134485fd7bcb26f, 
 timeout of 6000ms exceeded
 zookeeper.log.2012-01-27-leader-225.gz:2012-01-27 09:52:34,010 - INFO  
 [ProcessThread:-1:PrepRequestProcessor@391] - Processed session termination 
 for sessionid: 0x134485fd7bcb26f
 On the server, the client was initially connected to, 223 -
 zookeeper.log.2012-01-26-223.gz:2012-01-26 04:56:36,173 - INFO  
 [CommitProcessor:1:NIOServerCnxn@1580] - Established session 
 0x134485fd7bcb26f with negotiated timeout 6000 for client /172.17.136.82:45020
 zookeeper.log.2012-01-27-223.gz:2012-01-27 09:52:34,018 - INFO  
 [CommitProcessor:1:NIOServerCnxn@1435] - Closed socket connection for client 
 /172.17.136.82:45020 which had sessionid 0x134485fd7bcb26f
 Here are the log snippets from 226, which is the server, the client 
 reconnected to, before getting session expired event -
 2012-01-27 09:52:38,190 - INFO  
 [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:12913:NIOServerCnxn@770] - Client 
 attempting to renew session 0x134485fd7bcb26f at /172.17.136.82:49367
 2012-01-27 09:52:38,191 - INFO  
 [QuorumPeer:/0.0.0.0:12913:NIOServerCnxn@1573] - Invalid session 
 0x134485fd7bcb26f for client /172.17.136.82:49367, probably expired
 2012-01-27 09:52:38,191 - INFO  
 [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:12913:NIOServerCnxn@1435] - Closed 
 socket connection for client /172.17.136.82:49367 which had sessionid 
 0x134485fd7bcb26f
 wchp output from 226, taken on 01/30 -
 nnarkhed-ld:zk-cons-wchp-2012013000 nnarkhed$ grep 0x134485fd7bcb26f 
 *226.*wchp* | wc -l
 3
 wchp output from 223, taken on 01/30 -
 nnarkhed-ld:zk-cons-wchp-2012013000 nnarkhed$ grep 0x134485fd7bcb26f 
 *223.*wchp* | wc -l
 0
 cons output from 223 and 226, taken on 01/30 -
 nnarkhed-ld:zk-cons-wchp-2012013000 nnarkhed$ grep 0x134485fd7bcb26f 
 *226.*cons* | wc -l
 0
 nnarkhed-ld:zk-cons-wchp-2012013000 nnarkhed$ grep 0x134485fd7bcb26f 
 *223.*cons* | wc -l
 0
 So, what seems to have happened is that the client was able to re-register 
 the watches on the new server (226), after it got disconnected from 223, 
 inspite of having an expired session id. 
 In NIOServerCnxn, I saw that after suspecting that a session is expired, a 
 server removes the cnxn and its 

[jira] [Updated] (ZOOKEEPER-1382) Zookeeper server holds onto dead/expired session ids in the watch data structures

2013-03-03 Thread Michael Morello (JIRA)

 [ 
https://issues.apache.org/jira/browse/ZOOKEEPER-1382?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Morello updated ZOOKEEPER-1382:
---

Attachment: (was: ZOOKEEPER-1382.patch)

 Zookeeper server holds onto dead/expired session ids in the watch data 
 structures
 -

 Key: ZOOKEEPER-1382
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1382
 Project: ZooKeeper
  Issue Type: Bug
  Components: server
Affects Versions: 3.4.5
Reporter: Neha Narkhede
Assignee: Neha Narkhede
Priority: Critical
 Fix For: 3.4.6

 Attachments: ZOOKEEPER-1382_3.3.4.patch


 I've observed that zookeeper server holds onto expired session ids in the 
 watcher data structures. The result is the wchp command reports session ids 
 that cannot be found through cons/dump and those expired session ids sit 
 there maybe until the server is restarted. Here are snippets from the client 
 and the server logs that lead to this state, for one particular session id 
 0x134485fd7bcb26f -
 There are 4 servers in the zookeeper cluster - 223, 224, 225 (leader), 226 
 and I'm using ZkClient to connect to the cluster
 From the application log -
 application.log.2012-01-26-325.gz:2012/01/26 04:56:36.177 INFO [ClientCnxn] 
 [main-SendThread(223.prod:12913)] [application Session establishment complete 
 on server 223.prod/172.17.135.38:12913, sessionid = 0x134485fd7bcb26f, 
 negotiated timeout = 6000
 application.log.2012-01-27.gz:2012/01/27 09:52:37.714 INFO [ClientCnxn] 
 [main-SendThread(223.prod:12913)] [application] Client session timed out, 
 have not heard from server in 9827ms for sessionid 0x134485fd7bcb26f, closing 
 socket connection and attempting reconnect
 application.log.2012-01-27.gz:2012/01/27 09:52:38.191 INFO [ClientCnxn] 
 [main-SendThread(226.prod:12913)] [application] Unable to reconnect to 
 ZooKeeper service, session 0x134485fd7bcb26f has expired, closing socket 
 connection
 On the leader zk, 225 -
 zookeeper.log.2012-01-27-leader-225.gz:2012-01-27 09:52:34,010 - INFO  
 [SessionTracker:ZooKeeperServer@314] - Expiring session 0x134485fd7bcb26f, 
 timeout of 6000ms exceeded
 zookeeper.log.2012-01-27-leader-225.gz:2012-01-27 09:52:34,010 - INFO  
 [ProcessThread:-1:PrepRequestProcessor@391] - Processed session termination 
 for sessionid: 0x134485fd7bcb26f
 On the server, the client was initially connected to, 223 -
 zookeeper.log.2012-01-26-223.gz:2012-01-26 04:56:36,173 - INFO  
 [CommitProcessor:1:NIOServerCnxn@1580] - Established session 
 0x134485fd7bcb26f with negotiated timeout 6000 for client /172.17.136.82:45020
 zookeeper.log.2012-01-27-223.gz:2012-01-27 09:52:34,018 - INFO  
 [CommitProcessor:1:NIOServerCnxn@1435] - Closed socket connection for client 
 /172.17.136.82:45020 which had sessionid 0x134485fd7bcb26f
 Here are the log snippets from 226, which is the server, the client 
 reconnected to, before getting session expired event -
 2012-01-27 09:52:38,190 - INFO  
 [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:12913:NIOServerCnxn@770] - Client 
 attempting to renew session 0x134485fd7bcb26f at /172.17.136.82:49367
 2012-01-27 09:52:38,191 - INFO  
 [QuorumPeer:/0.0.0.0:12913:NIOServerCnxn@1573] - Invalid session 
 0x134485fd7bcb26f for client /172.17.136.82:49367, probably expired
 2012-01-27 09:52:38,191 - INFO  
 [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:12913:NIOServerCnxn@1435] - Closed 
 socket connection for client /172.17.136.82:49367 which had sessionid 
 0x134485fd7bcb26f
 wchp output from 226, taken on 01/30 -
 nnarkhed-ld:zk-cons-wchp-2012013000 nnarkhed$ grep 0x134485fd7bcb26f 
 *226.*wchp* | wc -l
 3
 wchp output from 223, taken on 01/30 -
 nnarkhed-ld:zk-cons-wchp-2012013000 nnarkhed$ grep 0x134485fd7bcb26f 
 *223.*wchp* | wc -l
 0
 cons output from 223 and 226, taken on 01/30 -
 nnarkhed-ld:zk-cons-wchp-2012013000 nnarkhed$ grep 0x134485fd7bcb26f 
 *226.*cons* | wc -l
 0
 nnarkhed-ld:zk-cons-wchp-2012013000 nnarkhed$ grep 0x134485fd7bcb26f 
 *223.*cons* | wc -l
 0
 So, what seems to have happened is that the client was able to re-register 
 the watches on the new server (226), after it got disconnected from 223, 
 inspite of having an expired session id. 
 In NIOServerCnxn, I saw that after suspecting that a session is expired, a 
 server removes the cnxn and its watches from its internal data structures. 
 But before that it allows more requests to be processed even if the session 
 is expired -
 // Now that the session is ready we can start receiving packets
 synchronized (this.factory) {
 sk.selector().wakeup();
 enableRecv();
 }
 } catch (Exception e) {
 LOG.warn(Exception while establishing session, closing, e);
 close();
 

[jira] [Updated] (ZOOKEEPER-1382) Zookeeper server holds onto dead/expired session ids in the watch data structures

2013-03-03 Thread Michael Morello (JIRA)

 [ 
https://issues.apache.org/jira/browse/ZOOKEEPER-1382?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Morello updated ZOOKEEPER-1382:
---

Attachment: ZOOKEEPER-1382-branch-3.4.patch

Current patch is for branch 3.4
Sorry

 Zookeeper server holds onto dead/expired session ids in the watch data 
 structures
 -

 Key: ZOOKEEPER-1382
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1382
 Project: ZooKeeper
  Issue Type: Bug
  Components: server
Affects Versions: 3.4.5
Reporter: Neha Narkhede
Assignee: Neha Narkhede
Priority: Critical
 Fix For: 3.4.6

 Attachments: ZOOKEEPER-1382_3.3.4.patch, 
 ZOOKEEPER-1382-branch-3.4.patch


 I've observed that zookeeper server holds onto expired session ids in the 
 watcher data structures. The result is the wchp command reports session ids 
 that cannot be found through cons/dump and those expired session ids sit 
 there maybe until the server is restarted. Here are snippets from the client 
 and the server logs that lead to this state, for one particular session id 
 0x134485fd7bcb26f -
 There are 4 servers in the zookeeper cluster - 223, 224, 225 (leader), 226 
 and I'm using ZkClient to connect to the cluster
 From the application log -
 application.log.2012-01-26-325.gz:2012/01/26 04:56:36.177 INFO [ClientCnxn] 
 [main-SendThread(223.prod:12913)] [application Session establishment complete 
 on server 223.prod/172.17.135.38:12913, sessionid = 0x134485fd7bcb26f, 
 negotiated timeout = 6000
 application.log.2012-01-27.gz:2012/01/27 09:52:37.714 INFO [ClientCnxn] 
 [main-SendThread(223.prod:12913)] [application] Client session timed out, 
 have not heard from server in 9827ms for sessionid 0x134485fd7bcb26f, closing 
 socket connection and attempting reconnect
 application.log.2012-01-27.gz:2012/01/27 09:52:38.191 INFO [ClientCnxn] 
 [main-SendThread(226.prod:12913)] [application] Unable to reconnect to 
 ZooKeeper service, session 0x134485fd7bcb26f has expired, closing socket 
 connection
 On the leader zk, 225 -
 zookeeper.log.2012-01-27-leader-225.gz:2012-01-27 09:52:34,010 - INFO  
 [SessionTracker:ZooKeeperServer@314] - Expiring session 0x134485fd7bcb26f, 
 timeout of 6000ms exceeded
 zookeeper.log.2012-01-27-leader-225.gz:2012-01-27 09:52:34,010 - INFO  
 [ProcessThread:-1:PrepRequestProcessor@391] - Processed session termination 
 for sessionid: 0x134485fd7bcb26f
 On the server, the client was initially connected to, 223 -
 zookeeper.log.2012-01-26-223.gz:2012-01-26 04:56:36,173 - INFO  
 [CommitProcessor:1:NIOServerCnxn@1580] - Established session 
 0x134485fd7bcb26f with negotiated timeout 6000 for client /172.17.136.82:45020
 zookeeper.log.2012-01-27-223.gz:2012-01-27 09:52:34,018 - INFO  
 [CommitProcessor:1:NIOServerCnxn@1435] - Closed socket connection for client 
 /172.17.136.82:45020 which had sessionid 0x134485fd7bcb26f
 Here are the log snippets from 226, which is the server, the client 
 reconnected to, before getting session expired event -
 2012-01-27 09:52:38,190 - INFO  
 [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:12913:NIOServerCnxn@770] - Client 
 attempting to renew session 0x134485fd7bcb26f at /172.17.136.82:49367
 2012-01-27 09:52:38,191 - INFO  
 [QuorumPeer:/0.0.0.0:12913:NIOServerCnxn@1573] - Invalid session 
 0x134485fd7bcb26f for client /172.17.136.82:49367, probably expired
 2012-01-27 09:52:38,191 - INFO  
 [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:12913:NIOServerCnxn@1435] - Closed 
 socket connection for client /172.17.136.82:49367 which had sessionid 
 0x134485fd7bcb26f
 wchp output from 226, taken on 01/30 -
 nnarkhed-ld:zk-cons-wchp-2012013000 nnarkhed$ grep 0x134485fd7bcb26f 
 *226.*wchp* | wc -l
 3
 wchp output from 223, taken on 01/30 -
 nnarkhed-ld:zk-cons-wchp-2012013000 nnarkhed$ grep 0x134485fd7bcb26f 
 *223.*wchp* | wc -l
 0
 cons output from 223 and 226, taken on 01/30 -
 nnarkhed-ld:zk-cons-wchp-2012013000 nnarkhed$ grep 0x134485fd7bcb26f 
 *226.*cons* | wc -l
 0
 nnarkhed-ld:zk-cons-wchp-2012013000 nnarkhed$ grep 0x134485fd7bcb26f 
 *223.*cons* | wc -l
 0
 So, what seems to have happened is that the client was able to re-register 
 the watches on the new server (226), after it got disconnected from 223, 
 inspite of having an expired session id. 
 In NIOServerCnxn, I saw that after suspecting that a session is expired, a 
 server removes the cnxn and its watches from its internal data structures. 
 But before that it allows more requests to be processed even if the session 
 is expired -
 // Now that the session is ready we can start receiving packets
 synchronized (this.factory) {
 sk.selector().wakeup();
 enableRecv();
 }
 } catch (Exception e) {
 

[jira] [Commented] (ZOOKEEPER-1382) Zookeeper server holds onto dead/expired session ids in the watch data structures

2013-03-04 Thread Michael Morello (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-1382?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13592254#comment-13592254
 ] 

Michael Morello commented on ZOOKEEPER-1382:


It seems that the behavior of disableRecv() / enableRecv() has changed a bit in 
3.5 : there is a new thread (SelectorThread) in charge of interest op updates 
and updates are processed in a non blocking fashion (updates are queued to be 
handled later by the SelectorThread, if i understand correctly it is more a 
throttle mechanism than something else). The unit test is no more relevant.

If we want to fix the two branches (3.4 and 3.5) it is imho no more a good idea 
to rely on enableRecv().
May be it would be cleaner to introduce a new function 
ServerCnxn.setValidSession(boolean valid) that informs the implementation 
(NIOServerCnxn or NettyServerCnxn) that the session has been (re)validated and 
then allow subsequent messages to be processed.

If you don't mind i will try to provide an other patch and a unit test for both 
branches.




 Zookeeper server holds onto dead/expired session ids in the watch data 
 structures
 -

 Key: ZOOKEEPER-1382
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1382
 Project: ZooKeeper
  Issue Type: Bug
  Components: server
Affects Versions: 3.4.5
Reporter: Neha Narkhede
Assignee: Neha Narkhede
Priority: Critical
 Fix For: 3.4.6

 Attachments: ZOOKEEPER-1382_3.3.4.patch, 
 ZOOKEEPER-1382-branch-3.4.patch


 I've observed that zookeeper server holds onto expired session ids in the 
 watcher data structures. The result is the wchp command reports session ids 
 that cannot be found through cons/dump and those expired session ids sit 
 there maybe until the server is restarted. Here are snippets from the client 
 and the server logs that lead to this state, for one particular session id 
 0x134485fd7bcb26f -
 There are 4 servers in the zookeeper cluster - 223, 224, 225 (leader), 226 
 and I'm using ZkClient to connect to the cluster
 From the application log -
 application.log.2012-01-26-325.gz:2012/01/26 04:56:36.177 INFO [ClientCnxn] 
 [main-SendThread(223.prod:12913)] [application Session establishment complete 
 on server 223.prod/172.17.135.38:12913, sessionid = 0x134485fd7bcb26f, 
 negotiated timeout = 6000
 application.log.2012-01-27.gz:2012/01/27 09:52:37.714 INFO [ClientCnxn] 
 [main-SendThread(223.prod:12913)] [application] Client session timed out, 
 have not heard from server in 9827ms for sessionid 0x134485fd7bcb26f, closing 
 socket connection and attempting reconnect
 application.log.2012-01-27.gz:2012/01/27 09:52:38.191 INFO [ClientCnxn] 
 [main-SendThread(226.prod:12913)] [application] Unable to reconnect to 
 ZooKeeper service, session 0x134485fd7bcb26f has expired, closing socket 
 connection
 On the leader zk, 225 -
 zookeeper.log.2012-01-27-leader-225.gz:2012-01-27 09:52:34,010 - INFO  
 [SessionTracker:ZooKeeperServer@314] - Expiring session 0x134485fd7bcb26f, 
 timeout of 6000ms exceeded
 zookeeper.log.2012-01-27-leader-225.gz:2012-01-27 09:52:34,010 - INFO  
 [ProcessThread:-1:PrepRequestProcessor@391] - Processed session termination 
 for sessionid: 0x134485fd7bcb26f
 On the server, the client was initially connected to, 223 -
 zookeeper.log.2012-01-26-223.gz:2012-01-26 04:56:36,173 - INFO  
 [CommitProcessor:1:NIOServerCnxn@1580] - Established session 
 0x134485fd7bcb26f with negotiated timeout 6000 for client /172.17.136.82:45020
 zookeeper.log.2012-01-27-223.gz:2012-01-27 09:52:34,018 - INFO  
 [CommitProcessor:1:NIOServerCnxn@1435] - Closed socket connection for client 
 /172.17.136.82:45020 which had sessionid 0x134485fd7bcb26f
 Here are the log snippets from 226, which is the server, the client 
 reconnected to, before getting session expired event -
 2012-01-27 09:52:38,190 - INFO  
 [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:12913:NIOServerCnxn@770] - Client 
 attempting to renew session 0x134485fd7bcb26f at /172.17.136.82:49367
 2012-01-27 09:52:38,191 - INFO  
 [QuorumPeer:/0.0.0.0:12913:NIOServerCnxn@1573] - Invalid session 
 0x134485fd7bcb26f for client /172.17.136.82:49367, probably expired
 2012-01-27 09:52:38,191 - INFO  
 [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:12913:NIOServerCnxn@1435] - Closed 
 socket connection for client /172.17.136.82:49367 which had sessionid 
 0x134485fd7bcb26f
 wchp output from 226, taken on 01/30 -
 nnarkhed-ld:zk-cons-wchp-2012013000 nnarkhed$ grep 0x134485fd7bcb26f 
 *226.*wchp* | wc -l
 3
 wchp output from 223, taken on 01/30 -
 nnarkhed-ld:zk-cons-wchp-2012013000 nnarkhed$ grep 0x134485fd7bcb26f 
 *223.*wchp* | wc -l
 0
 cons output from 223 and 226, taken on 01/30 -
 nnarkhed-ld:zk-cons-wchp-2012013000 nnarkhed$ grep 

[jira] [Commented] (ZOOKEEPER-1382) Zookeeper server holds onto dead/expired session ids in the watch data structures

2013-03-08 Thread Michael Morello (JIRA)

[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-1382?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13596979#comment-13596979
 ] 

Michael Morello commented on ZOOKEEPER-1382:


My bad, I have not spend enough time on the new branch, you are right. The same 
fix could be applied.
Nevertheless the unit test is no more relevant.
I'll try to work on that.

 Zookeeper server holds onto dead/expired session ids in the watch data 
 structures
 -

 Key: ZOOKEEPER-1382
 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1382
 Project: ZooKeeper
  Issue Type: Bug
  Components: server
Affects Versions: 3.4.5
Reporter: Neha Narkhede
Assignee: Neha Narkhede
Priority: Critical
 Fix For: 3.4.6

 Attachments: ZOOKEEPER-1382_3.3.4.patch, 
 ZOOKEEPER-1382-branch-3.4.patch


 I've observed that zookeeper server holds onto expired session ids in the 
 watcher data structures. The result is the wchp command reports session ids 
 that cannot be found through cons/dump and those expired session ids sit 
 there maybe until the server is restarted. Here are snippets from the client 
 and the server logs that lead to this state, for one particular session id 
 0x134485fd7bcb26f -
 There are 4 servers in the zookeeper cluster - 223, 224, 225 (leader), 226 
 and I'm using ZkClient to connect to the cluster
 From the application log -
 application.log.2012-01-26-325.gz:2012/01/26 04:56:36.177 INFO [ClientCnxn] 
 [main-SendThread(223.prod:12913)] [application Session establishment complete 
 on server 223.prod/172.17.135.38:12913, sessionid = 0x134485fd7bcb26f, 
 negotiated timeout = 6000
 application.log.2012-01-27.gz:2012/01/27 09:52:37.714 INFO [ClientCnxn] 
 [main-SendThread(223.prod:12913)] [application] Client session timed out, 
 have not heard from server in 9827ms for sessionid 0x134485fd7bcb26f, closing 
 socket connection and attempting reconnect
 application.log.2012-01-27.gz:2012/01/27 09:52:38.191 INFO [ClientCnxn] 
 [main-SendThread(226.prod:12913)] [application] Unable to reconnect to 
 ZooKeeper service, session 0x134485fd7bcb26f has expired, closing socket 
 connection
 On the leader zk, 225 -
 zookeeper.log.2012-01-27-leader-225.gz:2012-01-27 09:52:34,010 - INFO  
 [SessionTracker:ZooKeeperServer@314] - Expiring session 0x134485fd7bcb26f, 
 timeout of 6000ms exceeded
 zookeeper.log.2012-01-27-leader-225.gz:2012-01-27 09:52:34,010 - INFO  
 [ProcessThread:-1:PrepRequestProcessor@391] - Processed session termination 
 for sessionid: 0x134485fd7bcb26f
 On the server, the client was initially connected to, 223 -
 zookeeper.log.2012-01-26-223.gz:2012-01-26 04:56:36,173 - INFO  
 [CommitProcessor:1:NIOServerCnxn@1580] - Established session 
 0x134485fd7bcb26f with negotiated timeout 6000 for client /172.17.136.82:45020
 zookeeper.log.2012-01-27-223.gz:2012-01-27 09:52:34,018 - INFO  
 [CommitProcessor:1:NIOServerCnxn@1435] - Closed socket connection for client 
 /172.17.136.82:45020 which had sessionid 0x134485fd7bcb26f
 Here are the log snippets from 226, which is the server, the client 
 reconnected to, before getting session expired event -
 2012-01-27 09:52:38,190 - INFO  
 [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:12913:NIOServerCnxn@770] - Client 
 attempting to renew session 0x134485fd7bcb26f at /172.17.136.82:49367
 2012-01-27 09:52:38,191 - INFO  
 [QuorumPeer:/0.0.0.0:12913:NIOServerCnxn@1573] - Invalid session 
 0x134485fd7bcb26f for client /172.17.136.82:49367, probably expired
 2012-01-27 09:52:38,191 - INFO  
 [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:12913:NIOServerCnxn@1435] - Closed 
 socket connection for client /172.17.136.82:49367 which had sessionid 
 0x134485fd7bcb26f
 wchp output from 226, taken on 01/30 -
 nnarkhed-ld:zk-cons-wchp-2012013000 nnarkhed$ grep 0x134485fd7bcb26f 
 *226.*wchp* | wc -l
 3
 wchp output from 223, taken on 01/30 -
 nnarkhed-ld:zk-cons-wchp-2012013000 nnarkhed$ grep 0x134485fd7bcb26f 
 *223.*wchp* | wc -l
 0
 cons output from 223 and 226, taken on 01/30 -
 nnarkhed-ld:zk-cons-wchp-2012013000 nnarkhed$ grep 0x134485fd7bcb26f 
 *226.*cons* | wc -l
 0
 nnarkhed-ld:zk-cons-wchp-2012013000 nnarkhed$ grep 0x134485fd7bcb26f 
 *223.*cons* | wc -l
 0
 So, what seems to have happened is that the client was able to re-register 
 the watches on the new server (226), after it got disconnected from 223, 
 inspite of having an expired session id. 
 In NIOServerCnxn, I saw that after suspecting that a session is expired, a 
 server removes the cnxn and its watches from its internal data structures. 
 But before that it allows more requests to be processed even if the session 
 is expired -
 // Now that the session is ready we can start receiving packets
 synchronized