[ https://issues.apache.org/jira/browse/ZOOKEEPER-3642?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Mate Szalay-Beko updated ZOOKEEPER-3642:
----------------------------------------
    Fix Version/s: 3.5.10

> Data inconsistency when the leader crashes right after sending SNAP sync
> ------------------------------------------------------------------------
>
>                 Key: ZOOKEEPER-3642
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-3642
>             Project: ZooKeeper
>          Issue Type: Bug
>          Components: server
>    Affects Versions: 3.6.0, 3.7.0, 3.5.5, 3.5.6
>         Environment: Linux 4.19.29 x86_64
>            Reporter: Alex Mirgorodskiy
>            Assignee: Fangmin Lv
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 3.5.10, 3.7.0
>
>          Time Spent: 3h 10m
>  Remaining Estimate: 0h
>
> If the leader crashes after sending a SNAP sync to a learner, but before
> sending the NEWLEADER message, the learner will not save the snapshot to
> disk. It will, however, advance its lastProcessedZxid to that of the
> snapshot (call it zxid X).
>
> A new leader will get elected, and it will resync our learner again
> immediately. But this time it will use the incremental DIFF method, starting
> from zxid X. A DIFF-based resync does not trigger snapshots, so the learner
> is still holding the original snapshot purely in memory. If the learner
> restarts after that, it will silently lose all the data up to zxid X.
>
> An easy way to reproduce this is to insert System.exit into LearnerHandler.java
> right before sending the NEWLEADER message (on the one instance that is
> currently running the leader, but not the others):
> {noformat}
>           LOG.debug("Sending NEWLEADER message to " + sid);
> +         if (leader.self.getId() == 1 && sid == 3) {
> +             LOG.debug("Bail when server.1 resyncs server.3");
> +             System.exit(0);
> +         }
> {noformat}
> If I remember right, the repro steps are as follows. Run with that patch in a
> four-instance ensemble where server.3 is an Observer, the rest are voting
> members, and server.1 is the current Leader. Start server.3 after the other
> instances are up.
> It will get the initial snapshot from server.1, and server.1 will stop
> immediately because of the patch. Say server.2 takes over as the new Leader.
> Server.3 will receive a DIFF resync from server.2, but will skip persisting
> the snapshot. A subsequent restart of server.3 will make that instance come
> up with a blank data tree.
>
> The above steps assume that server.3 is an Observer, but the same can
> presumably happen for voting members too; it would just need a five-instance
> ensemble.
>
> Our workaround is to take the snapshot unconditionally on receiving NEWLEADER:
> {noformat}
> -         if (snapshotNeeded) {
> +         // Take the snapshot unconditionally. The first leader may have crashed
> +         // after sending us a SNAP, but before sending NEWLEADER. The second
> +         // leader will send us a DIFF, and we'd still like to take a snapshot,
> +         // even though the upstream code used to skip it.
> +         if (true || snapshotNeeded) {
>               zk.takeSnapshot();
>           }
> {noformat}
> This is what the 3.4.x series used to do, but I assume it is not the ideal
> fix, since it essentially disables the "snapshotNeeded" optimization.

--
This message was sent by Atlassian Jira
(v8.20.7#820007)
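The failure sequence described in the issue can be sketched as a tiny standalone simulation. This is plain Java, not ZooKeeper code: the class and method names (`SnapSyncLossDemo`, `receiveSnapWithoutNewLeader`, etc.) are illustrative, and two maps stand in for the on-disk snapshot file and the in-memory data tree.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical model of the learner's state, assuming SNAP loads data into
// memory only, DIFF skips zk.takeSnapshot(), and restart reloads from disk.
public class SnapSyncLossDemo {
    final Map<String, String> memory = new HashMap<>(); // in-memory data tree
    final Map<String, String> disk = new HashMap<>();   // last persisted snapshot
    long lastProcessedZxid = 0;

    // SNAP sync where the leader dies before NEWLEADER: the learner fills its
    // in-memory tree and advances lastProcessedZxid, but persists nothing.
    void receiveSnapWithoutNewLeader(Map<String, String> snapshot, long zxid) {
        memory.clear();
        memory.putAll(snapshot);
        lastProcessedZxid = zxid;
    }

    // DIFF resync from the new leader. With snapshotUnconditionally == false
    // (upstream behavior, snapshotNeeded optimization), nothing is persisted;
    // with true (the workaround), the in-memory tree is snapshotted to disk.
    void receiveDiffThenNewLeader(boolean snapshotUnconditionally) {
        if (snapshotUnconditionally) {
            disk.clear();
            disk.putAll(memory);
        }
    }

    // Restart: the in-memory tree is rebuilt from the last snapshot on disk.
    void restart() {
        memory.clear();
        memory.putAll(disk);
    }

    public static void main(String[] args) {
        Map<String, String> snap = Map.of("/a", "1", "/b", "2");

        SnapSyncLossDemo buggy = new SnapSyncLossDemo();
        buggy.receiveSnapWithoutNewLeader(snap, 100);
        buggy.receiveDiffThenNewLeader(false);
        buggy.restart();
        // prints 0: all data up to zxid X is silently lost
        System.out.println("without workaround: " + buggy.memory.size());

        SnapSyncLossDemo fixed = new SnapSyncLossDemo();
        fixed.receiveSnapWithoutNewLeader(snap, 100);
        fixed.receiveDiffThenNewLeader(true);
        fixed.restart();
        // prints 2: the unconditional snapshot preserved the data
        System.out.println("with workaround: " + fixed.memory.size());
    }
}
```

Under these assumptions, the simulation shows why the reporter's `if (true || snapshotNeeded)` change closes the window: data loss only occurs when the SNAP-loaded tree exists nowhere but memory at restart time.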