[ https://issues.apache.org/jira/browse/ZOOKEEPER-3642?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Mate Szalay-Beko updated ZOOKEEPER-3642:
----------------------------------------
    Fix Version/s: 3.5.10

> Data inconsistency when the leader crashes right after sending SNAP sync
> ------------------------------------------------------------------------
>
>                 Key: ZOOKEEPER-3642
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-3642
>             Project: ZooKeeper
>          Issue Type: Bug
>          Components: server
>    Affects Versions: 3.6.0, 3.7.0, 3.5.5, 3.5.6
>         Environment: Linux 4.19.29 x86_64
>            Reporter: Alex Mirgorodskiy
>            Assignee: Fangmin Lv
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 3.5.10, 3.7.0
>
>          Time Spent: 3h 10m
>  Remaining Estimate: 0h
>
> If the leader crashes after sending a SNAP sync to a learner, but before 
> sending the NEWLEADER message, the learner will not save the snapshot to 
> disk, yet it will still advance its lastProcessedZxid to the zxid of that 
> snapshot (call it Zxid X).
> A new leader will then get elected, and it will resync our learner again 
> immediately. But this time it will use the incremental DIFF method, starting 
> from Zxid X. A DIFF-based resync does not trigger a snapshot, so the learner 
> is still holding the original snapshot data purely in memory. If the learner 
> restarts after that, it will silently lose all the data up to Zxid X.
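> For context, here is roughly where this happens on the learner side (a 
> simplified, paraphrased sketch of Learner.syncWithLeader(), not the verbatim 
> upstream code). The SNAP case only populates the in-memory database; nothing 
> reaches disk until NEWLEADER arrives:
> {noformat}
> // Simplified sketch of Learner.syncWithLeader(); paraphrased, not verbatim.
> boolean snapshotNeeded = true;
> switch (qp.getType()) {
> case Leader.DIFF:
>     // Incremental sync: the follow-up txns get logged normally,
>     // so no snapshot has to be taken at the end of the sync.
>     snapshotNeeded = false;
>     break;
> case Leader.SNAP:
>     // Full sync: deserialize the leader's snapshot straight into the
>     // in-memory data tree and advance lastProcessedZxid (our Zxid X).
>     // Nothing is written to disk at this point.
>     zk.getZKDatabase().deserializeSnapshot(leaderIs);
>     zk.getZKDatabase().setlastProcessedZxid(qp.getZxid());
>     break;
> }
>
> // Later, on receiving Leader.NEWLEADER: the snapshot is persisted only
> // here, so a leader crash before NEWLEADER leaves the SNAP data in
> // memory only, while lastProcessedZxid already points at Zxid X.
> if (snapshotNeeded) {
>     zk.takeSnapshot();
> }
> {noformat}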
> An easy way to reproduce is to insert a System.exit call into 
> LearnerHandler.java right before sending the NEWLEADER message (only on the 
> instance that is currently the leader, not on the others):
> {noformat}
>              LOG.debug("Sending NEWLEADER message to " + sid);
> +            if (leader.self.getId() == 1 && sid == 3) {
> +               LOG.debug("Bail when server.1 resyncs server.3");
> +               System.exit(0);
> +            }
> {noformat}
> If I remember right, the repro steps are as follows. Run with that patch in a 
> 4-instance ensemble where server.3 is an Observer, the rest are voting 
> members, and server.1 is the current Leader. Start server.3 after the other 
> instances are up. It will get the initial snapshot from server.1, and 
> server.1 will then exit immediately because of the patch. Say server.2 takes 
> over as the new Leader. Server.3 will receive a DIFF resync from server.2, 
> but will skip persisting the snapshot. A subsequent restart of server.3 will 
> make that instance come up with a blank data tree.
> The above steps assume that server.3 is an Observer, but the same can 
> presumably happen to voting members too; it would just take a 5-instance 
> ensemble.
> Our workaround is to take the snapshot unconditionally on receiving NEWLEADER:
> {noformat}
> -                   if (snapshotNeeded) {
> +                   // Take the snapshot unconditionally. The first leader may have
> +                   // crashed after sending us a SNAP, but before sending NEWLEADER.
> +                   // The second leader will send us a DIFF, and we'd still like to
> +                   // take a snapshot, even though the upstream code used to skip it.
> +                   if (true || snapshotNeeded) {
>                         zk.takeSnapshot();
>                     }
> {noformat}
> This is what the 3.4.x series used to do, but I assume it is not the ideal 
> fix, since it essentially disables the "snapshotNeeded" optimization.
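> A narrower alternative might be to persist the snapshot right away in the 
> SNAP branch of Learner.syncWithLeader(), and leave the NEWLEADER path and 
> its "snapshotNeeded" optimization untouched. Just a sketch of the idea, not 
> necessarily the fix that will land upstream:
> {noformat}
>              case Leader.SNAP:
>                  zk.getZKDatabase().deserializeSnapshot(leaderIs);
>                  zk.getZKDatabase().setlastProcessedZxid(qp.getZxid());
> +                // Persist the freshly received snapshot immediately, so a
> +                // leader crash before NEWLEADER cannot leave it in memory
> +                // only.
> +                zk.takeSnapshot();
>                  break;
> {noformat}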



--
This message was sent by Atlassian Jira
(v8.20.7#820007)
