[ https://issues.apache.org/jira/browse/ZOOKEEPER-1090?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13082061#comment-13082061 ]
Hudson commented on ZOOKEEPER-1090:
-----------------------------------

Integrated in ZooKeeper-trunk #1258 (See [https://builds.apache.org/job/ZooKeeper-trunk/1258/])
    ZOOKEEPER-1090. Race condition while taking snapshot can lead to not restoring data tree correctly.

breed : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1151738
Files :
* /zookeeper/trunk/src/java/main/org/apache/zookeeper/server/DataTree.java
* /zookeeper/trunk/src/java/test/org/apache/zookeeper/test/LoadFromLogTest.java
* /zookeeper/trunk/CHANGES.txt

> Race condition while taking snapshot can lead to not restoring data tree correctly
> ----------------------------------------------------------------------------------
>
>                 Key: ZOOKEEPER-1090
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1090
>             Project: ZooKeeper
>          Issue Type: Bug
>          Components: server
>    Affects Versions: 3.3.3
>            Reporter: Vishal Kher
>            Assignee: Vishal Kher
>            Priority: Critical
>              Labels: persistence, server, snapshot
>             Fix For: 3.4.0
>
>         Attachments: ZOOKEEPER-1090
>
>
> I think I have found a bug in the snapshot mechanism.
> The problem occurs because dt.lastProcessedZxid is not synchronized (or
> rather, because it is set before the data tree is modified):
>
> FileTxnSnapLog:
> {code}
>     public void save(DataTree dataTree,
>             ConcurrentHashMap<Long, Integer> sessionsWithTimeouts)
>         throws IOException {
>         long lastZxid = dataTree.lastProcessedZxid;
>         LOG.info("Snapshotting: " + Long.toHexString(lastZxid));
>         File snapshot=new File(
>                 snapDir, Util.makeSnapshotName(lastZxid));
>         // <=== the DataTree may not yet have the modification for lastProcessedZxid
>         snapLog.serialize(dataTree, sessionsWithTimeouts, snapshot);
>     }
> {code}
>
> DataTree:
> {code}
>     public ProcessTxnResult processTxn(TxnHeader header, Record txn) {
>         ProcessTxnResult rc = new ProcessTxnResult();
>         String debug = "";
>         try {
>             rc.clientId = header.getClientId();
>             rc.cxid = header.getCxid();
>             rc.zxid = header.getZxid();
>             rc.type = header.getType();
>             rc.err = 0;
>             if (rc.zxid > lastProcessedZxid) {
>                 lastProcessedZxid = rc.zxid;
>             }
>             [...modify data tree...]
>     }
> {code}
>
> The lastProcessedZxid must be set after the modification is done.
>
> As a result, if the server crashes after taking the snapshot (and the
> snapshot does not contain the change corresponding to lastProcessedZxid),
> restore will not restore the data tree correctly:
> {code}
>     public long restore(DataTree dt, Map<Long, Integer> sessions,
>             PlayBackListener listener) throws IOException {
>         snapLog.deserialize(dt, sessions);
>         FileTxnLog txnLog = new FileTxnLog(dataDir);
>         // <=== assumes the change for lastProcessedZxid was deserialized
>         TxnIterator itr = txnLog.read(dt.lastProcessedZxid+1);
>     }
> {code}
>
> I have had an offline discussion with Ben and Camille on this. I will be
> posting the discussion shortly.
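
The ordering the report asks for (modify the tree first, then advance lastProcessedZxid) can be illustrated with a small, self-contained sketch. This is not the committed patch and not ZooKeeper code: `SnapshotOrderingSketch`, `ToyDataTree`, `Txn`, and `Snapshot` are hypothetical names made up for illustration, the "tree" is a plain map, and the sketch assumes a single thread applies transactions while another may take snapshots.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class SnapshotOrderingSketch {

    /** One logged transaction: set `path` to `data` at `zxid` (hypothetical type). */
    static final class Txn {
        final long zxid;
        final String path;
        final String data;

        Txn(long zxid, String path, String data) {
            this.zxid = zxid;
            this.path = path;
            this.data = data;
        }
    }

    /** Toy stand-in for DataTree; assumes one thread calls processTxn. */
    static final class ToyDataTree {
        final Map<String, String> nodes = new TreeMap<>();
        volatile long lastProcessedZxid = 0;

        void processTxn(Txn txn) {
            synchronized (nodes) {
                nodes.put(txn.path, txn.data);   // modify the tree first
            }
            if (txn.zxid > lastProcessedZxid) {
                lastProcessedZxid = txn.zxid;    // advance the zxid only afterwards
            }
        }
    }

    /** Toy snapshot: the zxid it is named after, plus a copy of the nodes. */
    static final class Snapshot {
        final long zxid;
        final Map<String, String> nodes;

        Snapshot(long zxid, Map<String, String> nodes) {
            this.zxid = zxid;
            this.nodes = nodes;
        }
    }

    /** Reads the zxid first, then copies the tree, mirroring the order in save(). */
    static Snapshot save(ToyDataTree tree) {
        long lastZxid = tree.lastProcessedZxid;
        Map<String, String> copy;
        synchronized (tree.nodes) {
            copy = new TreeMap<>(tree.nodes);
        }
        return new Snapshot(lastZxid, copy);
    }

    /** Loads the snapshot, then replays every logged txn with zxid > snapshot zxid. */
    static ToyDataTree restore(Snapshot snap, List<Txn> log) {
        ToyDataTree tree = new ToyDataTree();
        tree.nodes.putAll(snap.nodes);
        tree.lastProcessedZxid = snap.zxid;
        for (Txn txn : log) {
            if (txn.zxid > snap.zxid) {
                tree.processTxn(txn);
            }
        }
        return tree;
    }

    public static void main(String[] args) {
        ToyDataTree live = new ToyDataTree();
        List<Txn> log = new ArrayList<>();
        log.add(new Txn(1, "/a", "1"));
        log.add(new Txn(2, "/b", "2"));
        live.processTxn(log.get(0));
        live.processTxn(log.get(1));

        Snapshot snap = save(live);          // snapshot named after zxid 2

        log.add(new Txn(3, "/a", "3"));
        live.processTxn(log.get(2));

        ToyDataTree restored = restore(snap, log);
        System.out.println(restored.nodes.equals(live.nodes)); // prints "true"
    }
}
```

With this ordering, a snapshot named after zxid Z contains every change up to Z, so replaying the log from Z+1 does not skip a transaction. The snapshot may also contain changes later than Z; in this toy that is harmless because replaying a "set" is idempotent.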