On 04/21/2011 04:26 AM, Simon Riggs wrote:
> However, that begs the question of what happens with WAL. At present,
> we do nothing to ensure that "the entry in the directory containing
> the file has also reached disk".


Well, we do, but it's not obvious why that is unless you've stared at this for far too many hours. A clear description of the possible issue you and Dan are raising showed up on LKML a few years ago: http://lwn.net/Articles/270891/

Here's the most relevant part, which directly addresses the WAL case:

"[fsync] is unsafe for write-ahead logging, because it doesn't really guarantee any _ordering_ for the writes at the hard storage level. So aside from losing committed data, it can also corrupt structural metadata. With ext3 it's quite easy to verify that fsync/fdatasync don't always write a journal entry. (Apart from looking at the kernel code :-)

Just write some data, fsync(), and observe the number of writes in /proc/diskstats. If the current mtime second _hasn't_ changed, the inode isn't written. If you write data, say, 10 times a second to the same place followed by fsync(), you'll see a little more than 10 write I/Os, and less than 20."
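
If you want to try that experiment yourself, a loop like this does it. Just a quick sketch; the file name, write size, and rate are arbitrary choices for the test, not anything PostgreSQL does:

/* Reproduce the /proc/diskstats test described above: write and fsync
 * the same spot 10 times a second for ~10 seconds, while you watch the
 * writes-completed column for the disk in /proc/diskstats. */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
    int fd = open("fsync-test.dat", O_WRONLY | O_CREAT, 0644);
    if (fd < 0) { perror("open"); return 1; }

    char buf[512];
    memset(buf, 'x', sizeof(buf));

    for (int i = 0; i < 100; i++) {
        if (pwrite(fd, buf, sizeof(buf), 0) != (ssize_t) sizeof(buf))
            perror("pwrite");
        if (fsync(fd) != 0)
            perror("fsync");
        usleep(100 * 1000);          /* 10 write+fsync pairs per second */
    }
    close(fd);
    return 0;
}

Per the quote, on ext3 you should count a little more than 10 write I/Os per second while this runs, not the 20 you'd see if every fsync also committed a journal entry.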

There's a terrible hack suggested there, where you run fchmod to force the journal out on the next fsync, that makes me want to track the poster down and shoot him; but the part quoted above raises a reasonable question.

The main issue he's complaining about here is a moot one for PostgreSQL. If the WAL writes have been reordered but have not all completed, the minute replay hits the spot with a missing block, the CRC32 check will fail and replay stops there. The fact that he assumes a database would have such a naive WAL implementation that it would corrupt itself if blocks were written out of order in between fsync calls returning is one of the reasons this whole idea never got more traction--it's hard to get excited about a proposal whose fundamentals rest on an assumption that doesn't turn out to be true on real databases.
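
To make that concrete, here's a minimal sketch of the kind of check involved. This is emphatically not PostgreSQL's actual xlog code; the record layout and the names are invented for the example, and zlib's crc32() stands in for the real CRC routine:

/* Illustrative only -- not PostgreSQL's WAL replay code. A record here
 * is a header (payload length plus the CRC computed at write time)
 * followed by the payload. Link with -lz for crc32(). */
#include <stdint.h>
#include <stdio.h>
#include <zlib.h>

typedef struct {
    uint32_t len;    /* payload length */
    uint32_t crc;    /* CRC32 of the payload, computed when written */
} WalRecordHeader;

/* Returns 1 after replaying one record, 0 when replay must stop. */
static int replay_one_record(FILE *wal)
{
    WalRecordHeader hdr;
    unsigned char buf[8192];

    if (fread(&hdr, sizeof(hdr), 1, wal) != 1 || hdr.len > sizeof(buf))
        return 0;                        /* end of usable WAL */
    if (fread(buf, 1, hdr.len, wal) != hdr.len)
        return 0;                        /* truncated record */

    uint32_t crc = (uint32_t) crc32(crc32(0L, Z_NULL, 0), buf, hdr.len);
    if (crc != hdr.crc)
        return 0;   /* a reordered or missing block breaks the checksum */

    /* apply_record(buf, hdr.len);  -- hypothetical redo step */
    return 1;
}

Replay loops over records like this and treats the first bad checksum as end of log, so blocks that went out in the wrong order behind an incomplete fsync never get mistaken for committed data.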

There's still the "fsync'd a data block but not the directory entry yet" issue as fall-out from this too. Why doesn't PostgreSQL run into this problem? Because the exact code sequence used is this one:

open
write
fsync
close
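
Fleshed out with error handling, that sequence looks something like this. A generic sketch; write_durably is an illustrative name, not a function from the source tree:

#include <fcntl.h>
#include <unistd.h>

int write_durably(const char *path, const void *data, size_t len)
{
    int fd = open(path, O_WRONLY | O_CREAT, 0600);
    if (fd < 0)
        return -1;
    if (write(fd, data, len) != (ssize_t) len || fsync(fd) != 0) {
        close(fd);
        return -1;
    }
    return close(fd);    /* close only after a successful fsync */
}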

And Linux shouldn't ever screw that up, or the similar rename path. Here's what the close man page says, from http://linux.die.net/man/2/close :

"A successful close does not guarantee that the data has been successfully saved to disk, as the kernel defers writes. It is not common for a filesystem to flush the buffers when the stream is closed. If you need to be sure that the data is physically stored use fsync(2). (It will depend on the disk hardware at this point.)"

What this is alluding to is that if you fsync before closing, the close will write all the metadata out too. You're busted if your write cache lies, but we already know all about that issue.
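
For the similar rename path, the belt-and-suspenders version of the same pattern also opens and fsyncs the directory itself, which is exactly what pushes the directory entry Simon was asking about out to disk. Another sketch, with the helper name and arguments being my own:

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int rename_durably(const char *tmppath, const char *finalpath,
                   const char *dirpath)
{
    int fd = open(tmppath, O_WRONLY);
    if (fd < 0)
        return -1;
    if (fsync(fd) != 0) {        /* file contents first */
        close(fd);
        return -1;
    }
    close(fd);

    if (rename(tmppath, finalpath) != 0)
        return -1;

    int dfd = open(dirpath, O_RDONLY);   /* then the directory entry */
    if (dfd < 0)
        return -1;
    if (fsync(dfd) != 0) {
        close(dfd);
        return -1;
    }
    return close(dfd);
}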

There was a discussion of issues around this on LKML a few years ago, with Alan Cox getting the good pull quote at http://lkml.org/lkml/2009/3/27/268 : "fsync/close() as a pair allows the user to correctly indicate their requirements." While fsync doesn't guarantee that metadata is written out, and neither does close, kernel developers all seem to agree that fsync-before-close means you want everything on disk. Filesystems that don't honor that will break all sorts of software.

It is of course possible there are bugs in some part of this code path, where a clever enough test case might expose a window of strange file/metadata ordering. I think that's too theoretical a problem to go specifically chasing after, though.

--
Greg Smith   2ndQuadrant US    g...@2ndquadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support  www.2ndQuadrant.us


