[
https://issues.apache.org/jira/browse/ZOOKEEPER-2310?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15269602#comment-15269602
]
Abhishek Rai commented on ZOOKEEPER-2310:
-----------------------------------------
Thanks for bringing this up [~zhangyongxyz]. As you pointed out, FileChannel
does not provide a way of accomplishing this in Windows. There are conflicting
opinions online about whether it's even necessary for Windows based on how it
automatically handles updates to folders.
I've provided a modified patch (zookeeper-2310-version-2.patch) which skips
syncing of directory on Windows. The pattern I used has been used elsewhere in
Zookeeper source, so should be safe.
> Snapshot files must be synced to prevent inconsistency or data loss
> -------------------------------------------------------------------
>
> Key: ZOOKEEPER-2310
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2310
> Project: ZooKeeper
> Issue Type: Bug
> Components: server
> Affects Versions: 3.4.6
> Reporter: Abhishek Rai
> Assignee: Abhishek Rai
> Attachments: zookeeper-2310-version-2.patch, zookeeper-2310.patch
>
>
> Today, Zookeeper server syncs transaction log files to disk by default, but
> does not sync snapshot files. Consequently, an untimely crash may result in
> a lost or incomplete snapshot file. During recovery, if the server finds a
> valid older snapshot file, it will load it and replay subsequent log(s),
> skipping the incomplete snapshot file. It's possible that the skipped file
> had some transactions which are not present in the replayed transaction logs.
> Since quorum synchronization is based on last transaction ID of each server,
> this will never get noticed, resulting in inconsistency between servers and
> possible data loss.
> Following sequence of events describes a sample scenario where this can
> happen:
> # Server F is a follower in a Zookeeper ensemble.
> # F's most recent valid snapshot file is named "snapshot.10" containing state
> up to zxid = 10. F is currently writing to the transaction log file
> "log.11", with the most recent zxid = 20.
> # Fresh round of election.
> # F receives a few new transactions 21 to 30 from new leader L as the "diff".
> Current server behavior is to dump current state plus diff to a new snapshot
> file, "snapshot.30".
> # F finalizes the snapshot file, but file contents are still buffered in OS
> caches. Zookeeper does not sync snapshot file contents to disk.
> # F receives a new transaction 31 from the leader, which it appends to the
> existing transaction log file, "log.11" and syncs the file to disk.
> # Server machine crashes or is cold rebooted.
> # After recovery, snapshot file "snapshot.30" may not exist or may be empty.
> See below for why that may happen.
> # In either case, F looks for the last finalized snapshot file, finds and
> loads "snapshot.10". It then replays transactions from "log.11".
> Ultimately, its last seen zxid will be 31, but it would not have replayed
> transactions 21 to 30 received via the "diff" from the leader.
> # Clients which are connected to F may see different data than clients
> connected to other members of the ensemble, violating single system image
> invariant. Also, if F were to become a leader at some point, it could use
> its state to seed other servers, and they all could lose the writes in the
> missing interval above.
> *Notes:*
> - Reason why snapshot file may be missing or incomplete:
> -- Zookeeper does not sync the data directory after creating a snapshot file.
> Even if a newly created file is synced to disk, if the corresponding
> directory entry is not, then the file will not be visible in the namespace.
> -- Zookeeper does not sync snapshot files. So, they may be empty or
> incomplete during recovery from an untimely crash.
> - In step (6) above, the server could also have written the new transaction
> 31 to a new log file, "log.31". The final outcome would still be the same.
> We are able to deterministically reproduce this problem using the following
> steps:
> # Create a new Zookeeper ensemble on 3 hosts: A, B, and C.
> # Ensured each server has at least one snapshot file in its data dir.
> # Stop Zookeeper process on server A.
> # Slow down disk syncs on server A (see example script below). This ensures
> that snapshot files written by Zookeeper don't make it to disk spontaneously.
> Log files will be written to disk as Zookeeper explicitly issues a sync call
> on such files.
> # Connect to server B and create a new znode /test1.
> # Start Zookeeper process on A, wait for it to write a new snapshot to its
> datadir. This snapshot would contain /test1 but it won’t be synced to disk
> yet.
> # Connect to A and verify that /test1 is visible.
> # Connect to B and create another znode /test2. This will cause A’s
> transaction log to grow further to receive /test2.
> # Cold reboot A.
> # A’s last snapshot is a zero-sized file or is missing altogether since it
> did not get synced to disk before reboot. We have seen both in different
> runs.
> # Connect to A and verify that /test1 does not exist. It exists on B and C.
> Slowing down disk syncs:
> {noformat}
> echo 360000 | sudo tee /proc/sys/vm/dirty_writeback_centisecs
> echo 360000 | sudo tee /proc/sys/vm/dirty_expire_centisecs
> echo 99 | sudo tee /proc/sys/vm/dirty_background_ratio
> echo 99 | sudo tee /proc/sys/vm/dirty_ratio
> {noformat}
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)