[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2310?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14984796#comment-14984796
 ] 

Abhishek Rai commented on ZOOKEEPER-2310:
-----------------------------------------

Thanks for your response [~fpj].  I think my claim about the "diff" being 
present in the snapshot and not in the log was incorrect.  When pushing a 
diff, the leader (LearnerHandler) pushes individual transactions, which the 
follower writes to its log (Learner.syncWithLeader).  The leader eventually 
sends a "NEWLEADER" message, and in response the follower takes a snapshot.  
Ultimately, the diff is visible in both the log and the snapshot.

But consider the case of the leader (LearnerHandler) pushing a full snapshot to 
the follower.  In this case, the follower does not receive the individual 
transactions contributing to that snapshot.  In fact, it's not practical to do 
so - by design, the snapshot is sent when the diff is too large.  Thus, the 
follower can have a snapshot which reflects some transactions that are not 
present in the log.  After writing the snapshot, the follower continues writing 
subsequent transactions to the log.

Imagine a crash and recovery at this point, such that the latest snapshot 
file is incomplete or non-existent.  The follower would try to load the 
preceding healthy snapshot and replay the log from that point on.  Since the 
log lacks some of the transactions reflected in the missing snapshot file, 
the follower would never find out about them.  This would cause the 
inconsistency scenario I described above.
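
To make the recovery path above concrete, here is a minimal model of it.  It 
is only a sketch, not ZooKeeper code: the class RecoverySimulation and its 
methods are hypothetical, zxids are treated as plain counters (real zxids 
also carry an epoch), and the numbers (snapshot at 10, full snapshot at 30, 
logged transactions 11 to 20 and 31) are chosen purely for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical model of "load latest valid snapshot, then replay the log".
class RecoverySimulation {

    // snapshots maps a snapshot's last zxid to whether the file survived intact.
    // Returns the set of zxids reflected in the recovered state.
    static TreeSet<Long> recover(TreeMap<Long, Boolean> snapshots, List<Long> log) {
        long base = 0;
        // Walk snapshots from newest to oldest and take the first valid one.
        for (Map.Entry<Long, Boolean> e : snapshots.descendingMap().entrySet()) {
            if (e.getValue()) {
                base = e.getKey();
                break;
            }
        }
        TreeSet<Long> applied = new TreeSet<>();
        for (long z = 1; z <= base; z++) {
            applied.add(z); // state covered by the snapshot
        }
        for (long z : log) {
            if (z > base) {
                applied.add(z); // replay only transactions after the snapshot
            }
        }
        return applied;
    }

    // Lists the zxids up to the last applied one that were never applied.
    static List<Long> missing(TreeSet<Long> applied) {
        List<Long> gaps = new ArrayList<>();
        for (long z = 1; z <= applied.last(); z++) {
            if (!applied.contains(z)) {
                gaps.add(z);
            }
        }
        return gaps;
    }

    public static void main(String[] args) {
        List<Long> log = new ArrayList<>();
        for (long z = 11; z <= 20; z++) {
            log.add(z); // logged before the full snapshot arrived
        }
        log.add(31L);   // logged after the full snapshot was written

        TreeMap<Long, Boolean> snapshots = new TreeMap<>();
        snapshots.put(10L, true);  // older snapshot, safely on disk
        snapshots.put(30L, false); // full snapshot from the leader, lost in the crash

        TreeSet<Long> applied = recover(snapshots, log);
        System.out.println("last zxid = " + applied.last());
        System.out.println("missing   = " + missing(applied));
    }
}
```

In this model the follower ends up reporting 31 as its last zxid while 
transactions 21 through 30 are absent from its state.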

Without syncing the snapshot file (and its parent directory) to disk, we cannot 
guarantee that the snapshot file exists during recovery.  And the loss of a 
finalized snapshot file can result in data loss, since not all transactions 
may be present in the log.
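
As a rough illustration of what syncing the file and its parent directory 
could look like in Java, here is a minimal sketch.  It is not the attached 
patch: the class DurableSnapshotWriter and its methods are hypothetical.  It 
assumes a POSIX-like platform where a directory can be opened read-only as a 
FileChannel and forced; on Windows that open fails, so the sketch skips the 
directory sync there.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical helper: write a file, then force its contents and its
// directory entry to stable storage.
class DurableSnapshotWriter {

    static Path writeDurably(Path dir, String name, byte[] data) throws IOException {
        Path target = dir.resolve(name);
        try (FileChannel ch = FileChannel.open(target,
                StandardOpenOption.CREATE,
                StandardOpenOption.WRITE,
                StandardOpenOption.TRUNCATE_EXISTING)) {
            ByteBuffer buf = ByteBuffer.wrap(data);
            while (buf.hasRemaining()) {
                ch.write(buf);
            }
            // Flush file contents and metadata before the file counts as final.
            ch.force(true);
        }
        // The new directory entry must also reach disk, or the file may not
        // be visible in the namespace after a crash.
        syncDirectory(dir);
        return target;
    }

    static void syncDirectory(Path dir) throws IOException {
        try (FileChannel ch = FileChannel.open(dir, StandardOpenOption.READ)) {
            ch.force(true);
        } catch (IOException e) {
            // Assumption: Windows cannot open a directory this way, so the
            // directory sync is skipped there; other platforms report the error.
            String os = System.getProperty("os.name", "");
            if (!os.startsWith("Windows")) {
                throw e;
            }
        }
    }
}
```

A caller would invoke something like 
DurableSnapshotWriter.writeDurably(dataDir, "snapshot.30", bytes) before 
treating the snapshot as finalized; the extra syncs trade some latency for the 
guarantee discussed above.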

> Snapshot files must be synced to prevent inconsistency or data loss
> -------------------------------------------------------------------
>
>                 Key: ZOOKEEPER-2310
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2310
>             Project: ZooKeeper
>          Issue Type: Bug
>          Components: server
>    Affects Versions: 3.4.6
>            Reporter: Abhishek Rai
>            Assignee: Abhishek Rai
>         Attachments: zookeeper-2310.patch
>
>
> Today, the ZooKeeper server syncs transaction log files to disk by default, 
> but does not sync snapshot files.  Consequently, an untimely crash may result in 
> a lost or incomplete snapshot file.  During recovery, if the server finds a 
> valid older snapshot file, it will load it and replay subsequent log(s), 
> skipping the incomplete snapshot file.  It's possible that the skipped file 
> had some transactions which are not present in the replayed transaction logs. 
>  Since quorum synchronization is based on last transaction ID of each server, 
> this will never get noticed, resulting in inconsistency between servers and 
> possible data loss.
> The following sequence of events describes a sample scenario where this can 
> happen:
> # Server F is a follower in a Zookeeper ensemble.
> # F's most recent valid snapshot file is named "snapshot.10" containing state 
> up to zxid = 10.  F is currently writing to the transaction log file 
> "log.11", with the most recent zxid = 20.
> # Fresh round of election.
> # F receives a few new transactions 21 to 30 from new leader L as the "diff". 
>  Current server behavior is to dump current state plus diff to a new snapshot 
> file, "snapshot.30".
> # F finalizes the snapshot file, but file contents are still buffered in OS 
> caches.  Zookeeper does not sync snapshot file contents to disk.
> # F receives a new transaction 31 from the leader, which it appends to the 
> existing transaction log file, "log.11" and syncs the file to disk.
> # Server machine crashes or is cold rebooted.
> # After recovery, snapshot file "snapshot.30" may not exist or may be empty.  
> See below for why that may happen.
> # In either case, F looks for the last finalized snapshot file, finds and 
> loads "snapshot.10".  It then replays transactions from "log.11".  
> Ultimately, its last seen zxid will be 31, but it would not have replayed 
> transactions 21 to 30 received via the "diff" from the leader.
> # Clients which are connected to F may see different data than clients 
> connected to other members of the ensemble, violating single system image 
> invariant.  Also, if F were to become a leader at some point, it could use 
> its state to seed other servers, and they all could lose the writes in the 
> missing interval above.
> *Notes:*
> - Reason why snapshot file may be missing or incomplete:
> -- Zookeeper does not sync the data directory after creating a snapshot file. 
>  Even if a newly created file is synced to disk, if the corresponding 
> directory entry is not, then the file will not be visible in the namespace.
> -- Zookeeper does not sync snapshot files.  So, they may be empty or 
> incomplete during recovery from an untimely crash.
> - In step (6) above, the server could also have written the new transaction 
> 31 to a new log file, "log.31".  The final outcome would still be the same.
> We are able to deterministically reproduce this problem using the following 
> steps:
> # Create a new Zookeeper ensemble on 3 hosts: A, B, and C.
> # Ensure each server has at least one snapshot file in its data dir.
> # Stop Zookeeper process on server A.
> # Slow down disk syncs on server A (see example script below). This ensures 
> that snapshot files written by Zookeeper don't make it to disk spontaneously. 
>  Log files will be written to disk as Zookeeper explicitly issues a sync call 
> on such files.
> # Connect to server B and create a new znode /test1.
> # Start Zookeeper process on A, wait for it to write a new snapshot to its 
> datadir.  This snapshot would contain /test1 but it won’t be synced to disk 
> yet.
> # Connect to A and verify that /test1 is visible.
> # Connect to B and create another znode /test2.  This will cause A’s 
> transaction log to grow further to receive /test2.
> # Cold reboot A.
> # A’s last snapshot is a zero-sized file or is missing altogether since it 
> did not get synced to disk before reboot.  We have seen both in different 
> runs.
> # Connect to A and verify that /test1 does not exist.  It exists on B and C.
> Slowing down disk syncs:
> {noformat}
> echo 360000 | sudo tee /proc/sys/vm/dirty_writeback_centisecs
> echo 360000 | sudo tee /proc/sys/vm/dirty_expire_centisecs
> echo 99 | sudo tee /proc/sys/vm/dirty_background_ratio
> echo 99 | sudo tee /proc/sys/vm/dirty_ratio
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
