[jira] [Commented] (ZOOKEEPER-1319) Missing data after restarting+expanding a cluster

Mahadev konar (Commented) (JIRA) Fri, 09 Dec 2011 10:41:02 -0800

    [ 
https://issues.apache.org/jira/browse/ZOOKEEPER-1319?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13166420#comment-13166420
 ]


Mahadev konar commented on ZOOKEEPER-1319:
------------------------------------------

I am going ahead and checking in Pat's patch. I have opened ZOOKEEPER-1324 to 
track the duplicate NEWLEADER packets. Just being paranoid here and making 
minimal changes for the RC.
                
> Missing data after restarting+expanding a cluster
> -------------------------------------------------
>
>                 Key: ZOOKEEPER-1319
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1319
>             Project: ZooKeeper
>          Issue Type: Bug
>    Affects Versions: 3.4.0
>         Environment: Linux (Debian Squeeze)
>            Reporter: Jeremy Stribling
>            Assignee: Patrick Hunt
>            Priority: Blocker
>              Labels: cluster, data
>             Fix For: 3.5.0, 3.4.1
>
>         Attachments: ZOOKEEPER-1319.patch, ZOOKEEPER-1319.patch, 
> ZOOKEEPER-1319_trunk.patch, ZOOKEEPER-1319_trunk2.patch, logs.tgz
>
>
> I've been trying to update to ZK 3.4.0 and have had some issues where some 
> data become inaccessible after adding a node to a cluster.  My use case is a 
> bit strange (as explained before on this list) in that I try to grow the 
> cluster dynamically by having an external program automatically restart 
> Zookeeper servers in a controlled way whenever the list of participating ZK 
> servers needs to change.  This used to work just fine in 3.3.3 (and before), 
> so this represents a regression.
> The scenario I see is this:
> 1) Start up a 1-server ZK cluster (the server has ZK ID 0).
> 2) A client connects to the server, and makes a bunch of znodes, in 
> particular a znode called "/membership".
> 3) Shut down the cluster.
> 4) Bring up a 2-server ZK cluster, including the original server 0 with its 
> existing data, and a new server with ZK ID 1.
> 5) Node 0 has the highest zxid and is elected leader.
> 6) A client connecting to server 1 tries to "get /membership" and gets back a 
> -101 error code (no such znode).
> 7) The same client then tries to "create /membership" and gets back a -110 
> error code (znode already exists).
> 8) Clients connecting to server 0 can successfully "get /membership".
> I will attach a tarball with debug logs for both servers, annotating where 
> steps #1 and #4 happen.  You can see that the election involves a proposal 
> for zxid 110 from server 0, but immediately following the election server 1 
> has these lines:
> 2011-12-05 17:18:48,308 9299 [QuorumPeer[myid=1]/127.0.0.1:2901] WARN 
> org.apache.zookeeper.server.quorum.Learner  - Got zxid 0x100000001 expected 
> 0x1
> 2011-12-05 17:18:48,313 9304 [SyncThread:1] INFO 
> org.apache.zookeeper.server.persistence.FileTxnLog  - Creating new log file: 
> log.100000001
> Perhaps that's not relevant, but it struck me as odd.  At the end of server 
> 1's log you can see a repeated cycle of getData->create->getData as the 
> client tries to make sense of the inconsistent responses.
> The other piece of information is that if I try to use the on-disk 
> directories for either of the servers to start a new one-node ZK cluster, all 
> the data are accessible.
> I haven't tried writing a program outside of my application to reproduce 
> this, but I can do it very easily with some of my app's tests if anyone needs 
> more information.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (ZOOKEEPER-1319) Missing data after restarting+expanding a cluster

Reply via email to