[jira] [Commented] (ZOOKEEPER-1449) Ephemeral znode not deleted after session has expired on one follower (quorum is in an inconsistent state)

Camille Fournier (Commented) (JIRA) Tue, 17 Apr 2012 12:03:44 -0700

    [ 
https://issues.apache.org/jira/browse/ZOOKEEPER-1449?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13255836#comment-13255836
 ]


Camille Fournier commented on ZOOKEEPER-1449:
---------------------------------------------

Can you reproduce it with a more recent release? 3.3.3 is a bit old at this 
point and we've fixed a few things between that and 3.3.5.
                
> Ephemeral znode not deleted after session has expired on one follower (quorum 
> is in an inconsistent state) 
> -----------------------------------------------------------------------------------------------------------
>
>                 Key: ZOOKEEPER-1449
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1449
>             Project: ZooKeeper
>          Issue Type: Bug
>            Reporter: Daniel Lord
>         Attachments: zk.zip
>
>
> I've been running in to this situation in our labs fairly regularly where 
> we'll get a single follower that will be in an inconsistent state with 
> dangling ephemeral znodes.  Here is all of the information that I have right 
> now.  Please ask if there is anything else that is useful.
> Here is a quick snapshot of the state of the ensemble where you can see it is 
> out of sync across several znodes: 
> -bash-3.2$ echo srvr | nc il23n04sa-zk001 2181
> Zookeeper version: 3.3.3-cdh3u2--1, built on 10/14/2011 05:17 GMT
> Latency min/avg/max: 0/7/25802
> Received: 64002
> Sent: 63985
> Outstanding: 0
> Zxid: 0x500000a41
> Mode: follower
> Node count: 497
> -bash-3.2$ echo srvr | nc il23n04sa-zk002 2181
> Zookeeper version: 3.3.3-cdh3u2--1, built on 10/14/2011 05:17 GMT
> Latency min/avg/max: 0/13/79032
> Received: 74320
> Sent: 74276
> Outstanding: 0
> Zxid: 0x500000a41
> Mode: leader
> Node count: 493
> -bash-3.2$ echo srvr | nc il23n04sa-zk003 2181
> Zookeeper version: 3.3.3-cdh3u2--1, built on 10/14/2011 05:17 GMT
> Latency min/avg/max: 0/2/25234
> Received: 187310
> Sent: 187320
> Outstanding: 0
> Zxid: 0x500000a41
> Mode: follower
> Node count: 493
> All of the zxids match up just fine but zk001 has 4 more nodes in its node 
> count than the other two (including the leader..).  When I use a zookeeper 
> client connect to connect directly to zk001 I can see the following znode 
> that should no longer exist: 
> [zk: localhost:2181(CONNECTED) 0] stat 
> /siri/Douroucouli/clients/il23n04sa-app004.siri.apple.com:38096
> cZxid = 0x40000001a
> ctime = Mon Apr 16 11:00:47 PDT 2012
> mZxid = 0x40000001a
> mtime = Mon Apr 16 11:00:47 PDT 2012
> pZxid = 0x40000001a
> cversion = 0
> dataVersion = 0
> aclVersion = 0
> ephemeralOwner = 0x236bc504cb50002
> dataLength = 0
> numChildren = 0
> This node does not exist using the client to connect to either of the other 
> two members of the ensemble.
> I searched through the logs for that session id and it looks like it was 
> established and closed cleanly.  There were several leadership/quorum 
> problems during the course of the session but it looks like it should have 
> been shut down properly.  Neither the session nor the znode show up in the 
> "dump" on the leader but the problem znode does show up in the "dump" on 
> zk001.
> 2012-04-16 11:00:47,637 - INFO  [CommitProcessor:2:NIOServerCnxn@1580] - 
> Established session 0x236bc504cb50002 with negotiated timeout 15000 for 
> client /17.202.71.201:38971
> 2012-04-16 11:20:59,341 - INFO  
> [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@770] - Client 
> attempting to renew session 0x236bc504cb50002 at /17.202.71.201:50841
> 2012-04-16 11:20:59,342 - INFO  
> [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@1580] - Established 
> session 0x236bc504cb50002 with negotiated timeout 15000 for client 
> /17.202.71.201:50841
> 2012-04-16 11:21:09,343 - WARN  
> [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@634] - 
> EndOfStreamException: Unable to read additional data from client sessionid 
> 0x236bc504cb50002, likely client has closed socket
> 2012-04-16 11:21:09,343 - INFO  
> [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@1435] - Closed 
> socket connection for client /17.202.71.201:50841 which had sessionid 
> 0x236bc504cb50002
> 2012-04-16 11:21:20,352 - INFO  
> [QuorumPeer:/0:0:0:0:0:0:0:0:2181:NIOServerCnxn@1435] - Closed socket 
> connection for client /17.202.71.201:38971 which had sessionid 
> 0x236bc504cb50002
> 2012-04-16 11:21:22,151 - INFO  
> [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@770] - Client 
> attempting to renew session 0x236bc504cb50002 at /17.202.71.201:38166
> 2012-04-16 11:21:22,152 - INFO  
> [QuorumPeer:/0:0:0:0:0:0:0:0:2181:NIOServerCnxn@1580] - Established session 
> 0x236bc504cb50002 with negotiated timeout 15000 for client 
> /17.202.71.201:38166
> 2012-04-16 11:27:17,902 - INFO  [ProcessThread:-1:PrepRequestProcessor@387] - 
> Processed session termination for sessionid: 0x236bc504cb50002
> 2012-04-16 11:27:17,904 - INFO  
> [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@1435] - Closed 
> socket connection for client /17.202.71.201:38166 which had sessionid 
> 0x236bc504cb50002
> The only way I've been able to recover from this situation is to shut down 
> the problem follower, delete its snapshots and let it resync with the leader.
> I'll attach the full log4j logs, the txn logs, the snapshots and the conf 
> files for each member of the ensemble.  Please let me know what other 
> information is useful.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (ZOOKEEPER-1449) Ephemeral znode not deleted after session has expired on one follower (quorum is in an inconsistent state)

Reply via email to