On Aug 11, 2009, at 11:42 PM, Jens Alfke wrote:
> On Aug 11, 2009, at 5:03 PM, Damien Katz wrote:
>
>>> The worst problem is that the disk controller will reorder sector
>>> writes to reduce seek time, which in effect means that if power is
>>> lost, some random subset of the last writes may not happen. So you
>>> won't just end up with a truncated file — you could have a file
>>> that seems intact and has a correct header at the end, but has 4k
>>> bytes of garbage somewhere within the last transaction. Does
>>> CouchDB's file structure guard against that?
>>
>> First we fsync all the data and indexes, then we write and fsync
>> the headers in a separate step.
>
> Cool. From my discussions with Apple filesystem guru Dominic
> Giampaolo, I gather that this two-phase approach is the right way to
> guarantee consistency. (It's also used by the HFS+ filesystem to
> secure its journal.)
>
> The caveat is that the fsyncs have to be the paranoid kind that
> flush the disk-controller cache, not just the OS kernel cache. (This
> is what the nonstandard F_FULLFSYNC mode does in Darwin/OS X;
> hopefully CouchDB knows to use that when built for that platform.)
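
For anyone following along, the ordering Damien describes boils down to
something like the sketch below. This is illustration only, not the
actual couch_file code: the fixed-size text header, the commit()
signature, and the file layout are invented for the example.

/* Illustration only -- not CouchDB's couch_file code.  The point is
 * the ordering: the header that makes new data reachable is appended
 * and flushed only after the data itself has been flushed, so losing
 * power between the two steps leaves the previous header (and the
 * consistent prior state it describes) as the last valid one. */
#include <stdio.h>
#include <unistd.h>

#define HEADER_SIZE 64

/* fd is assumed to be an append-only file positioned at end-of-file. */
int commit(int fd, const void *data, size_t len)
{
    char header[HEADER_SIZE] = {0};

    /* Phase 1: append the new data and index nodes, then flush.
     * (On Darwin/OS X these fsync() calls would need to be the
     * F_FULLFSYNC variant sketched further down.) */
    if (write(fd, data, len) != (ssize_t)len) return -1;
    if (fsync(fd) != 0) return -1;

    /* Phase 2: only now append and flush the header that commits
     * the transaction. */
    snprintf(header, sizeof header, "data_len=%zu", len);
    if (write(fd, header, sizeof header) != (ssize_t)sizeof header)
        return -1;
    return fsync(fd);
}

The only property that matters here is that the second flush happens
strictly after the first one has returned, so the header never refers
to data that isn't already on disk.
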
Yep, we know to use that flag -- it was Jan who supplied a patch to
the Erlang/OTP team to get F_FULLFSYNC used by Erlang's standard file
module. If I recall correctly, we knew there was a problem when
CouchDB was claiming ~200 fsyncs/second on our laptops :-)
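
In case it's useful, here is roughly what that paranoid flush looks
like at the C level, plus a crude rate check in the spirit of that
~200 fsyncs/second observation. Again, just a sketch: the file name,
the 3-second window, and the figures in the comments are mine, not
CouchDB's.

/* A "paranoid" flush: on Darwin/OS X, plain fsync() only pushes data
 * as far as the drive, so the nonstandard F_FULLFSYNC fcntl is needed
 * to ask the drive to flush its own write cache too.  Fall back to
 * fsync() where it's unavailable or the filesystem rejects it. */
#include <fcntl.h>
#include <stdio.h>
#include <time.h>
#include <unistd.h>

static int full_flush(int fd)
{
#ifdef F_FULLFSYNC
    if (fcntl(fd, F_FULLFSYNC) != -1)
        return 0;
#endif
    return fsync(fd);
}

/* Rough sanity check: a genuine full flush on a 2009-era laptop
 * spinning disk takes on the order of 10 ms or more, so a sustained
 * rate of hundreds per second means the data is not actually
 * reaching the platters. */
int main(void)
{
    int fd = open("flush_test.dat", O_CREAT | O_WRONLY | O_TRUNC, 0644);
    if (fd < 0) return 1;

    int count = 0;
    time_t start = time(NULL);
    while (time(NULL) - start < 3) {          /* run for roughly 3 s */
        if (write(fd, "x", 1) != 1 || full_flush(fd) != 0) return 1;
        count++;
    }
    printf("~%d flushed writes per second\n", count / 3);
    close(fd);
    return 0;
}

On a laptop spinning disk, the reported rate should drop dramatically
once F_FULLFSYNC is actually in effect compared with plain fsync().
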
Cheers, Adam