[ https://issues.apache.org/jira/browse/ZOOKEEPER-2325?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15713016#comment-15713016 ]
ASF GitHub Bot commented on ZOOKEEPER-2325:
-------------------------------------------

Github user hanm commented on a diff in the pull request:

    https://github.com/apache/zookeeper/pull/117#discussion_r90516950

    --- Diff: src/java/main/org/apache/zookeeper/server/persistence/FileTxnSnapLog.java ---
    @@ -165,8 +165,22 @@ public File getSnapDir() {
          */
         public long restore(DataTree dt, Map<Long, Integer> sessions,
                 PlayBackListener listener) throws IOException {
    -        snapLog.deserialize(dt, sessions);
    +        long deserializeResult = snapLog.deserialize(dt, sessions);
             FileTxnLog txnLog = new FileTxnLog(dataDir);
    +        if (-1L == deserializeResult) {
    +            /* this means that we couldn't find any snapshot, so we need to
    +             * initialize an empty database */
    +            if (txnLog.getLastLoggedZxid() != -1) {
    +                throw new IOException(
    +                        "No snapshot found, but there are log entries. " +
    +                        "Something is broken!");
    +            }
    +            /* TODO: (br33d) we should either put a ConcurrentHashMap on restore()
    +             * or use Map on save() */
    +            save(dt, (ConcurrentHashMap<Long, Integer>)sessions);
    --- End diff --

    I think we need it here because if we get here, the zxid of this server
    must be -1, so it would not win leader election if at least one other
    server is sane (with a valid snapshot/txn log to recover from). This
    server will then become a follower and sync the (non-empty) snapshot from
    the leader. If all servers have empty snapshots, then this save is also
    required to bootstrap the recovery process.

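For readers following the review, the branch under discussion can be sketched in isolation. This is only an illustration of the decision logic visible in the diff, not the ZooKeeper implementation: the class `RestoreSketch`, the enum `Action`, and the method `decide` are hypothetical names, and the sketch assumes (as the diff does) that -1 signals "no valid snapshot" and "no log entries" respectively.

```java
import java.io.IOException;

/** Hypothetical stand-alone sketch; none of these names exist in ZooKeeper. */
public class RestoreSketch {

    /** Hypothetical outcomes of the startup check. */
    public enum Action { REPLAY_LOG, SAVE_EMPTY_SNAPSHOT }

    /**
     * Mirrors the branch added in the diff.
     *
     * @param snapshotZxid   value returned by the snapshot load; assumed to be
     *                       -1 when no valid snapshot was found
     * @param lastLoggedZxid highest zxid in the transaction log; assumed to be
     *                       -1 when the log has no entries
     */
    public static Action decide(long snapshotZxid, long lastLoggedZxid)
            throws IOException {
        if (snapshotZxid == -1L) {
            // No snapshot, yet the log has entries: refuse to start rather
            // than replay the log on top of an empty tree.
            if (lastLoggedZxid != -1L) {
                throw new IOException(
                        "No snapshot found, but there are log entries. "
                        + "Something is broken!");
            }
            // Nothing on disk at all: write an empty snapshot to bootstrap.
            return Action.SAVE_EMPTY_SNAPSHOT;
        }
        // A valid snapshot was loaded; the normal path replays the log.
        return Action.REPLAY_LOG;
    }
}
```

Under these assumptions, `decide(-1L, -1L)` selects the empty-snapshot save that the comment argues for, while `decide(-1L, 42L)` throws, which is the case the original bug report describes.
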
> Data inconsistency if all snapshots empty or missing
> ----------------------------------------------------
>
>                 Key: ZOOKEEPER-2325
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2325
>             Project: ZooKeeper
>          Issue Type: Bug
>          Components: server
>    Affects Versions: 3.4.6
>            Reporter: Andrew Grasso
>            Assignee: Andrew Grasso
>            Priority: Critical
>         Attachments: ZOOKEEPER-2325-test.patch, ZOOKEEPER-2325.001.patch, zk.patch
>
>   Original Estimate: 1h
>  Remaining Estimate: 1h
>
> When loading state from snapshots on startup, FileTxnSnapLog.java ignores the
> result of FileSnap.deserialize, which is -1L if no valid snapshots are found.
> Recovery proceeds with dt.lastProcessed == 0, its initial value.
> The result is that Zookeeper will process the transaction logs and then begin
> serving requests with a different state than the rest of the ensemble.
> To reproduce:
> In a healthy zookeeper cluster of size >= 3, shut down one node.
> Either delete all snapshots for this node or change all to be empty files.
> Restart the node.
> We believe this can happen organically if a node runs out of disk space.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)