This paper
https://www.usenix.org/conference/osdi14/technical-sessions/presentation/zheng_mai
describes a potential crash vulnerability in LMDB due to its use of fdatasync
instead of fsync when syncing writes to the data file. The vulnerability
exists because fdatasync omits syncs of the file metadata; if the data file
needed to grow as a result of any writes then this requires a metadata update.
This is a well-understood issue in LMDB; we briefly touched on it in this
earlier email thread
http://www.openldap.org/lists/openldap-technical/201402/msg00111.html and it's
been a topic of discussion on IRC ever since the first multi-FS
microbenchmarks we conducted back in 2012. http://symas.com/mdb/microbench/july/
It's worth noting that this vulnerability doesn't exist on Windows, MacOSX,
Android, or *BSD, because none of these OSs have a function equivalent to
fdatasync in the first place - they always use fsync (or the Windows
equivalent). (Android is an oddball; the underlying Linux kernel of course
supports fdatasync, but the C library, bionic, does not.)
We have a couple approaches for Linux:
1) provide an option to preallocate the file, using fallocate().
Unfortunately this doesn't completely eliminate metadata updates - filesystem
drivers tend to try to be "smart" and make fallocate cheap; they allocate the
space in the FS metadata but they also mark it as "unseen." The first time a
process accesses an unseen page, it gets zeroed out. Up until that point,
whatever old contents of the disk page are still present. The act of marking a
page from "unseen" to "seen" requires a metadata update of its own.
We had a discussion of this FS mis-feature a while ago, but it was fruitless.
https://lkml.org/lkml/2012/12/7/396
2) preallocate the file by explicitly writing zeros to it. This has a
couple other disadvantages:
a) on SSDs, doing such a write needlessly contributes to wearout of the
flash.
b) Windows detects all-zero writes and compresses them out, creating a
sparse file, thus defeating the attempt at preallocation.
3) track the allocated size of the file, and toggle between fsync and
fdatasync depending on whether the allocated size actually grows or not. This
is the approach I'm currently taking in a development branch. Whether we add
this to a new 0.9.x release, or just in 1.0, I haven't yet decided.
As another footnote, I plan to add support for LMDB on a raw partition in 1.x.
Naturally, fsync vs fdatasync will be irrelevant in that case.
--
-- Howard Chu
CTO, Symas Corp. http://www.symas.com
Director, Highland Sun http://highlandsun.com/hyc/
Chief Architect, OpenLDAP http://www.openldap.org/project/