[
https://issues.apache.org/jira/browse/ZOOKEEPER-1046?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13049817#comment-13049817
]
Benjamin Reed commented on ZOOKEEPER-1046:
------------------------------------------
nice observation flavio! i haven't seen anyone using cversion outside of the
sequence number on sequence znodes.
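
For anyone following along, here is a minimal sketch of why a stale cversion matters. It assumes, as the comment above suggests, that the server names a sequential znode by appending the parent's cversion as a zero-padded 10-digit suffix. The class and method names are hypothetical and are not ZooKeeper APIs; the numbers come from the stat output in the issue description quoted below.

```java
import java.util.Locale;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical illustration only; none of these names exist in ZooKeeper.
class SequenceCollisionSketch {

    // Assumption: the sequential suffix is the parent's cversion,
    // zero-padded to 10 digits.
    static String sequentialName(String prefix, int parentCversion) {
        return prefix + String.format(Locale.ENGLISH, "%010d", parentCversion);
    }

    // Parses the trailing 10-digit sequence number from a child name.
    static int sequenceOf(String childName) {
        return Integer.parseInt(childName.substring(childName.length() - 10));
    }

    // Simulates sequential creates starting from the given cversion.
    // Simplifying assumption: each successful create advances the counter by one.
    // Returns how many creates succeed before a generated name already exists,
    // or -1 if no collision occurs within maxCreates attempts.
    static int createsUntilCollision(Set<String> existing, String prefix,
                                     int cversion, int maxCreates) {
        Set<String> children = new TreeSet<>(existing); // leave the input untouched
        for (int i = 0; i < maxCreates; i++) {
            String name = sequentialName(prefix, cversion + i);
            if (children.contains(name)) {
                return i; // the server would answer NodeExists here
            }
            children.add(name);
        }
        return -1;
    }

    public static void main(String[] args) {
        // Figures from the quoted report: parent cversion and one existing child.
        int cversion = 120710;
        String existingChild = "000000000000002d_record0000120804";

        Set<String> children = new TreeSet<>();
        children.add(existingChild);

        System.out.println("existing child sequence = " + sequenceOf(existingChild));
        System.out.println("parent cversion         = " + cversion);
        System.out.println("creates before NodeExists = "
                + createsUntilCollision(children, "000000000000002d_record", cversion, 1000));
    }
}
```

With the quoted figures, the sketch reports a collision after 94 successful creates. Advancing the counter only on success is a simplifying assumption; under it, a retry computes the same name again, which would match the endless NodeExists loop described in the report.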
> Creating a new sequential node results in a ZNODEEXISTS error
> -------------------------------------------------------------
>
> Key: ZOOKEEPER-1046
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1046
> Project: ZooKeeper
> Issue Type: Bug
> Components: server
> Affects Versions: 3.3.2, 3.3.3
> Environment: A 3 node-cluster running Debian squeeze.
> Reporter: Jeremy Stribling
> Assignee: Vishal K
> Priority: Blocker
> Labels: sequence
> Fix For: 3.4.0
>
> Attachments: ZOOKEEPER-1046-for333, ZOOKEEPER-1046.patch,
> ZOOKEEPER-1046.patch, ZOOKEEPER-1046.patch1, ZOOKEEPER-1046.tgz
>
>
> On several occasions, I've seen a create() with the sequential flag set fail
> with a ZNODEEXISTS error, and I don't think that should ever be possible. In
> past runs, I've been able to closely inspect the state of the system with the
> command line client, and saw that the parent znode's cversion was smaller than
> the sequence number of an existing child znode under that parent. In one
> example:
> {noformat}
> [zk:<ip:port>(CONNECTED) 3] stat /zkrsm
> cZxid = 0x5
> ctime = Mon Jan 17 18:28:19 PST 2011
> mZxid = 0x5
> mtime = Mon Jan 17 18:28:19 PST 2011
> pZxid = 0x1d819
> cversion = 120710
> dataVersion = 0
> aclVersion = 0
> ephemeralOwner = 0x0
> dataLength = 0
> numChildren = 2955
> {noformat}
> However, the znode /zkrsm/000000000000002d_record0000120804 already existed
> on disk; its sequence number (120804) is larger than the cversion (120710).
> In a recent run, I was able to capture the Zookeeper logs, and I will attach
> them to this JIRA. The logs are named as nodeX.<zxid_prefixes>.log, and each
> new log represents an application process restart.
> Here's the scenario:
> # There's a cluster with nodes 1, 2, and 3 at zxid prefix (epoch) 0x3.
> # All three nodes restart, forming a cluster at epoch 0x4.
> # Node 3 restarts, leading to a cluster at epoch 0x5.
> At this point, it seems like node 1 is the leader of the 0x5 epoch. In its
> log (node1.0x4-0x5.log) you can see the first (of many) instances of the
> following message:
> {noformat}
> 2011-04-11 21:16:12,607 16649 [ProcessThread:-1] INFO
> org.apache.zookeeper.server.PrepRequestProcessor - Got user-level
> KeeperException when processing sessionid:0x512f466bd44e0002 type:create
> cxid:0x4da376ab zxid:0xfffffffffffffffe txntype:unknown reqpath:n/a Error
> Path:/zkrsm/00000000000000b2_record0001761440 Error:KeeperErrorCode =
> NodeExists for /zkrsm/00000000000000b2_record0001761440
> {noformat}
> This then repeats forever, because my application never expects to get this
> error on a sequential node create and simply keeps retrying. The
> message even transfers over to node3.0x5-0x6.log once the 0x6 epoch comes
> into play.
> I don't see anything terribly fishy in the transition between the epochs; the
> correct snapshots seem to be getting transferred, etc. Unfortunately I don't
> have a ZK snapshot/log that exhibits the problem when starting with a fresh
> system.
> Some oddities you might notice in these logs:
> * Between epochs 0x3 and 0x4, the zookeeper IDs of the nodes changed due to a
> bug in our application code. (They are assigned randomly, but are supposed
> to be consistent across restarts.)
> * We manage node membership dynamically, and our application restarts the
> ZooKeeperServer classes whenever a new node wants to join (without restarting
> the entire application process). This is why you'll see messages like the
> following in node1.0x4-0x5.log before a new election begins:
> {noformat}
> 2011-04-11 21:16:00,762 4804 [QuorumPeer:/0.0.0.0:2888] INFO
> org.apache.zookeeper.server.quorum.Learner - shutdown called
> {noformat}
> * There is in fact one of these dynamic membership changes in
> node1.0x4-0x5.log, just before the 0x4 epoch is formed. I'm not sure how
> this would be related though, as no transactions are done during this period.