On 04/21/2011 04:26 AM, Simon Riggs wrote:
> However, that begs the question of what happens with WAL. At present,
> we do nothing to ensure that "the entry in the directory containing
> the file has also reached disk".


Well, we do, but it's not obvious why that is unless you've stared at this for far too many hours. A clear description of the possible issue you and Dan are raising showed up on LKML a few years ago: http://lwn.net/Articles/270891/

Here's the most relevant part, which directly addresses the WAL case:

"[fsync] is unsafe for write-ahead logging, because it doesn't really guarantee any _ordering_ for the writes at the hard storage level. So aside from losing committed data, it can also corrupt structural metadata. With ext3 it's quite easy to verify that fsync/fdatasync don't always write a journal entry. (Apart from looking at the kernel code :-)

Just write some data, fsync(), and observe the number of writes in /proc/diskstats. If the current mtime second _hasn't_ changed, the inode isn't written. If you write data, say, 10 times a second to the same place followed by fsync(), you'll see a little more than 10 write I/Os, and less than 20."
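
If you want to try that experiment yourself, a loop like this does it. Just a quick sketch; the file name, write size, and rate are arbitrary choices for the test, not anything PostgreSQL does:

/* Reproduce the /proc/diskstats test described above: write and fsync
 * the same spot 10 times a second for ~10 seconds, while you watch the
 * writes-completed column for the disk in /proc/diskstats. */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
    int fd = open("fsync-test.dat", O_WRONLY | O_CREAT, 0644);
    if (fd < 0) { perror("open"); return 1; }

    char buf[512];
    memset(buf, 'x', sizeof(buf));

    for (int i = 0; i < 100; i++) {
        if (pwrite(fd, buf, sizeof(buf), 0) != (ssize_t) sizeof(buf))
            perror("pwrite");
        if (fsync(fd) != 0)
            perror("fsync");
        usleep(100 * 1000);          /* 10 write+fsync pairs per second */
    }
    close(fd);
    return 0;
}

Per the quote, on ext3 you should count a little more than 10 write I/Os per second while this runs, not the 20 you'd see if every fsync also committed a journal entry.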

There's a terrible hack suggested there, where you run fchmod to force the journal out on the next fsync, that makes me want to track the poster down and shoot him; but the part quoted above raises a reasonable question.

The main issue he's complaining about here is a moot one for PostgreSQL. If the WAL writes have been reordered but have not all completed, the minute replay hits the spot with a missing block, the CRC32 check will fail and replay stops there. The fact that he assumes a database would have such a naive WAL implementation that it would corrupt itself if blocks were written out of order in between fsync calls returning is one of the reasons this whole idea never got more traction--it's hard to get excited about a proposal whose fundamentals rest on an assumption that doesn't turn out to be true on real databases.
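
To make that concrete, here's a minimal sketch of the kind of check involved. This is emphatically not PostgreSQL's actual xlog code; the record layout and the names are invented for the example, and zlib's crc32() stands in for the real CRC routine:

/* Illustrative only -- not PostgreSQL's WAL replay code. A record here
 * is a header (payload length plus the CRC computed at write time)
 * followed by the payload. Link with -lz for crc32(). */
#include <stdint.h>
#include <stdio.h>
#include <zlib.h>

typedef struct {
    uint32_t len;    /* payload length */
    uint32_t crc;    /* CRC32 of the payload, computed when written */
} WalRecordHeader;

/* Returns 1 after replaying one record, 0 when replay must stop. */
static int replay_one_record(FILE *wal)
{
    WalRecordHeader hdr;
    unsigned char buf[8192];

    if (fread(&hdr, sizeof(hdr), 1, wal) != 1 || hdr.len > sizeof(buf))
        return 0;                        /* end of usable WAL */
    if (fread(buf, 1, hdr.len, wal) != hdr.len)
        return 0;                        /* truncated record */

    uint32_t crc = (uint32_t) crc32(crc32(0L, Z_NULL, 0), buf, hdr.len);
    if (crc != hdr.crc)
        return 0;   /* a reordered or missing block breaks the checksum */

    /* apply_record(buf, hdr.len);  -- hypothetical redo step */
    return 1;
}

Replay loops over records like this and treats the first bad checksum as end of log, so blocks that went out in the wrong order behind an incomplete fsync never get mistaken for committed data.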

There's still the "fsync'd a data block but not the directory entry yet" issue as fall-out from this too. Why doesn't PostgreSQL run into this problem? Because the exact code sequence used is this one:

open
write
fsync
close
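
Fleshed out with error handling, that sequence looks something like this. A generic sketch; write_durably is an illustrative name, not a function from the source tree:

#include <fcntl.h>
#include <unistd.h>

int write_durably(const char *path, const void *data, size_t len)
{
    int fd = open(path, O_WRONLY | O_CREAT, 0600);
    if (fd < 0)
        return -1;
    if (write(fd, data, len) != (ssize_t) len || fsync(fd) != 0) {
        close(fd);
        return -1;
    }
    return close(fd);    /* close only after a successful fsync */
}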

And Linux shouldn't ever screw that up, or the similar rename path. Here's what the close man page says, from http://linux.die.net/man/2/close :

"A successful close does not guarantee that the data has been successfully saved to disk, as the kernel defers writes. It is not common for a filesystem to flush the buffers when the stream is closed. If you need to be sure that the data is physically stored use fsync(2). (It will depend on the disk hardware at this point.)"

What this is alluding to is that if you fsync before closing, the close will write all the metadata out too. You're busted if your write cache lies, but we already know all about that issue.
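
For the similar rename path, the belt-and-suspenders version of the same pattern also opens and fsyncs the directory itself, which is exactly what pushes the directory entry Simon was asking about out to disk. Another sketch, with the helper name and arguments being my own:

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int rename_durably(const char *tmppath, const char *finalpath,
                   const char *dirpath)
{
    int fd = open(tmppath, O_WRONLY);
    if (fd < 0)
        return -1;
    if (fsync(fd) != 0) {        /* file contents first */
        close(fd);
        return -1;
    }
    close(fd);

    if (rename(tmppath, finalpath) != 0)
        return -1;

    int dfd = open(dirpath, O_RDONLY);   /* then the directory entry */
    if (dfd < 0)
        return -1;
    if (fsync(dfd) != 0) {
        close(dfd);
        return -1;
    }
    return close(dfd);
}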

There was a discussion of issues around this on LKML a few years ago, with Alan Cox getting the good pull quote at http://lkml.org/lkml/2009/3/27/268 : "fsync/close() as a pair allows the user to correctly indicate their requirements." While fsync doesn't guarantee that metadata is written out, and neither does close, kernel developers all seem to agree that fsync-before-close means you want everything on disk. Filesystems that don't honor that will break all sorts of software.

It is of course possible there are bugs in some part of this code path, where a clever enough test case might expose a window of strange file/metadata ordering. I think that's too theoretical a problem to go specifically chasing after, though.

--
Greg Smith   2ndQuadrant US    g...@2ndquadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support  www.2ndQuadrant.us


