[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-1382?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13838906#comment-13838906
 ] 

Rakesh R commented on ZOOKEEPER-1382:
-------------------------------------

Hi German, 
Patch looks very nice, especially test cases are really good:)

Add few suggestions:
- Please add message while asserting, this will give more clarity, like
{code}
assertEquals("There must be NO watches!", 0, watchCount);
{code}
- Formatting is not proper in MockSelectorThread.java. Normally, between 
methods there will be a newline.
- This is a suggestion and FYI. I would like to use 
@RunWith(Parameterized.class), when there are situations to run the same test 
case with different input values. In the patch, we have testcases with 
sessiontimeout and without sessiontimeout flags. Please refer 
MultiTransactionTest to understand more on this pattern. This way it will again 
make the test cases very simple.


> Zookeeper server holds onto dead/expired session ids in the watch data 
> structures
> ---------------------------------------------------------------------------------
>
>                 Key: ZOOKEEPER-1382
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1382
>             Project: ZooKeeper
>          Issue Type: Bug
>          Components: server
>    Affects Versions: 3.4.5
>            Reporter: Neha Narkhede
>            Assignee: Germán Blanco
>            Priority: Critical
>             Fix For: 3.4.6, 3.5.0
>
>         Attachments: ZOOKEEPER-1382-branch-3.4.patch, 
> ZOOKEEPER-1382-branch-3.4.patch, ZOOKEEPER-1382-branch-3.4.patch, 
> ZOOKEEPER-1382.patch, ZOOKEEPER-1382.patch, ZOOKEEPER-1382.patch, 
> ZOOKEEPER-1382.patch, ZOOKEEPER-1382_3.3.4.patch
>
>
> I've observed that zookeeper server holds onto expired session ids in the 
> watcher data structures. The result is the wchp command reports session ids 
> that cannot be found through cons/dump and those expired session ids sit 
> there maybe until the server is restarted. Here are snippets from the client 
> and the server logs that lead to this state, for one particular session id 
> 0x134485fd7bcb26f -
> There are 4 servers in the zookeeper cluster - 223, 224, 225 (leader), 226 
> and I'm using ZkClient to connect to the cluster
> From the application log -
> application.log.2012-01-26-325.gz:2012/01/26 04:56:36.177 INFO [ClientCnxn] 
> [main-SendThread(223.prod:12913)] [application Session establishment complete 
> on server 223.prod/172.17.135.38:12913, sessionid = 0x134485fd7bcb26f, 
> negotiated timeout = 6000
> application.log.2012-01-27.gz:2012/01/27 09:52:37.714 INFO [ClientCnxn] 
> [main-SendThread(223.prod:12913)] [application] Client session timed out, 
> have not heard from server in 9827ms for sessionid 0x134485fd7bcb26f, closing 
> socket connection and attempting reconnect
> application.log.2012-01-27.gz:2012/01/27 09:52:38.191 INFO [ClientCnxn] 
> [main-SendThread(226.prod:12913)] [application] Unable to reconnect to 
> ZooKeeper service, session 0x134485fd7bcb26f has expired, closing socket 
> connection
> On the leader zk, 225 -
> zookeeper.log.2012-01-27-leader-225.gz:2012-01-27 09:52:34,010 - INFO  
> [SessionTracker:ZooKeeperServer@314] - Expiring session 0x134485fd7bcb26f, 
> timeout of 6000ms exceeded
> zookeeper.log.2012-01-27-leader-225.gz:2012-01-27 09:52:34,010 - INFO  
> [ProcessThread:-1:PrepRequestProcessor@391] - Processed session termination 
> for sessionid: 0x134485fd7bcb26f
> On the server, the client was initially connected to, 223 -
> zookeeper.log.2012-01-26-223.gz:2012-01-26 04:56:36,173 - INFO  
> [CommitProcessor:1:NIOServerCnxn@1580] - Established session 
> 0x134485fd7bcb26f with negotiated timeout 6000 for client /172.17.136.82:45020
> zookeeper.log.2012-01-27-223.gz:2012-01-27 09:52:34,018 - INFO  
> [CommitProcessor:1:NIOServerCnxn@1435] - Closed socket connection for client 
> /172.17.136.82:45020 which had sessionid 0x134485fd7bcb26f
> Here are the log snippets from 226, which is the server, the client 
> reconnected to, before getting session expired event -
> 2012-01-27 09:52:38,190 - INFO  
> [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:12913:NIOServerCnxn@770] - Client 
> attempting to renew session 0x134485fd7bcb26f at /172.17.136.82:49367
> 2012-01-27 09:52:38,191 - INFO  
> [QuorumPeer:/0.0.0.0:12913:NIOServerCnxn@1573] - Invalid session 
> 0x134485fd7bcb26f for client /172.17.136.82:49367, probably expired
> 2012-01-27 09:52:38,191 - INFO  
> [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:12913:NIOServerCnxn@1435] - Closed 
> socket connection for client /172.17.136.82:49367 which had sessionid 
> 0x134485fd7bcb26f
> wchp output from 226, taken on 01/30 -
> nnarkhed-ld:zk-cons-wchp-2012013000 nnarkhed$ grep 0x134485fd7bcb26f 
> *226.*wchp* | wc -l
> 3
> wchp output from 223, taken on 01/30 -
> nnarkhed-ld:zk-cons-wchp-2012013000 nnarkhed$ grep 0x134485fd7bcb26f 
> *223.*wchp* | wc -l
> 0
> cons output from 223 and 226, taken on 01/30 -
> nnarkhed-ld:zk-cons-wchp-2012013000 nnarkhed$ grep 0x134485fd7bcb26f 
> *226.*cons* | wc -l
> 0
> nnarkhed-ld:zk-cons-wchp-2012013000 nnarkhed$ grep 0x134485fd7bcb26f 
> *223.*cons* | wc -l
> 0
> So, what seems to have happened is that the client was able to re-register 
> the watches on the new server (226), after it got disconnected from 223, 
> inspite of having an expired session id. 
> In NIOServerCnxn, I saw that after suspecting that a session is expired, a 
> server removes the cnxn and its watches from its internal data structures. 
> But before that it allows more requests to be processed even if the session 
> is expired -
>             // Now that the session is ready we can start receiving packets
>             synchronized (this.factory) {
>                 sk.selector().wakeup();
>                 enableRecv();
>             }
>         } catch (Exception e) {
>             LOG.warn("Exception while establishing session, closing", e);
>             close();
>         }
> I wonder if the client somehow sneaked in the set watches, right after the 
> server removed the connection through removeCnxn() API ?



--
This message was sent by Atlassian JIRA
(v6.1#6144)

Reply via email to