I seem to recall going to a seminar at LCA09 where some file system people were discussing the fact that fsync() is the only way of guaranteeing that a file write has actually stored the data on the "spinning rust".
(They loved the term "spinning rust" because they were the ones doing
the work for the SSD stuff.)

--
-------------------------------------------------
Zane Gilmore
Development and Web Infrastructure Team Leader
DDI: 325 9631
Cell: 0276 319 206
The New Zealand Institute for Plant and Food Research Limited:
Rangahau Ahumaara Kai
Postal Address: Plant & Food Research Lincoln
Private Bag 4704, Christchurch Mail Centre, CHRISTCHURCH 8140
Physical Address: Plant & Food Research Lincoln
Canterbury Agriculture & Science Centre, Gerald St, Lincoln 7608
http://www.plantandfood.co.nz/
--------------------------------------------------

>>> On 12/03/2009 at 12:50 p.m., in message <[email protected]>, John Carter <[email protected]> wrote:
> Programmers of the world (and especially those who teach programmers),
> make a bookmark of this one... (Even if you're a Windows programmer,
> make a note!)
>
> Theodore Ts'o, the ext4 developer, has written this very informative
> comment on an alleged ext4 data loss bug.
>
> Ted quite correctly observes that all ext4 has done is make preexisting
> application bugs more obvious by increasing the window of time in
> which the bug can strike, and reminds us of what the Posixly correct
> way of doing things is.
>
> To save launchpad a hammering, I have copied & pasted the whole thing
> here...
>
> Ext4 data loss
>
> Theodore Ts'o wrote on 2009-03-07:
>
> https://bugs.edge.launchpad.net/ubuntu/+source/linux/+bug/317781/comments/54
>
> OK, so let me explain what's going on a bit more explicitly.
> There are application programmers who are rewriting application files
> like this:
>
> 1.a) open and read file ~/.kde/foo/bar/baz
> 1.b) fd = open("~/.kde/foo/bar/baz", O_WRONLY|O_TRUNC|O_CREAT) --- this
>      truncates the file
> 1.c) write(fd, buf-of-new-contents-of-file, size-of-new-contents-of-file)
> 1.d) close(fd)
>
> Slightly more sophisticated application writers will do this:
>
> 2.a) open and read file ~/.kde/foo/bar/baz
> 2.b) fd = open("~/.kde/foo/bar/baz.new", O_WRONLY|O_TRUNC|O_CREAT)
> 2.c) write(fd, buf-of-new-contents-of-file, size-of-new-contents-of-file)
> 2.d) close(fd)
> 2.e) rename("~/.kde/foo/bar/baz.new", "~/.kde/foo/bar/baz")
>
> What emacs (and very sophisticated, careful application writers) will
> do is this:
>
> 3.a) open and read file ~/.kde/foo/bar/baz
> 3.b) fd = open("~/.kde/foo/bar/baz.new", O_WRONLY|O_TRUNC|O_CREAT)
> 3.c) write(fd, buf-of-new-contents-of-file, size-of-new-contents-of-file)
> 3.d) fsync(fd) --- and check the error return from the fsync
> 3.e) close(fd)
> 3.f) rename("~/.kde/foo/bar/baz", "~/.kde/foo/bar/baz~") --- this is optional
> 3.g) rename("~/.kde/foo/bar/baz.new", "~/.kde/foo/bar/baz")
>
> The fact that series (1) and (2) work at all is an accident. Ext3 in
> its default configuration happens to have the property that 5 seconds
> after (1) and (2) complete, the data is safely on disk. (3) is the
> ***only*** thing which is guaranteed not to lose data. For example, if
> you are using laptop mode, the 5 seconds is extended to 30 seconds.
>
> Now the one downside with (3) is that fsync() is a heavyweight operation.
> If your application is stupid, and has hundreds of dot files in your
> home directory, each one taking up a 4k disk block even though it is
> only storing 4 to 12 bytes of data in each singleton dot file, and you
> have to repeat (3) for each of your one hundred dot files --- and
> worse yet, your application for some stupid, unknown reason is writing
> all of these hundred+ dot files every few seconds, then (3) will be
> very painful. But it is painful because the application is stupidly
> written --- not for any fundamental filesystem fault. It's like if you
> had a robot which was delivering mail to mail box numbers 1, 2, 3, 4,
> 5, and crossing the street for each mail box; on a busy road, this is
> unsafe, and the robot was getting run over when it kept on jaywalking
> --- so you can tell the robot to only cross at crosswalks, when the
> "walk" light is on, which is safe, but slow --- OR, you could rewrite
> the robot's algorithm so it delivers the mail more intelligently
> (i.e., one side of the street, and then cross, safely at the
> crosswalk, and then do the other side of the street).
>
> Is that clear? The file system is not "truncating" files. The
> application is truncating the files, or is constantly overwriting the
> files using the rename system call. This is a fundamentally unsafe
> thing to do, and ext3 just happened to paper things over. But *both*
> XFS and ext4 do delayed allocation, which means that data blocks
> don't get allocated right away, and they don't get written right
> away. Btrfs will be doing delayed allocation as well; all modern
> filesystems will do this, because it's how you get better
> performance. Applications are expected to use fsync() or fdatasync(),
> and if that impacts their performance too much, to use a single berkdb
> or other binary database file, and not do something stupid with
> hundreds of tiny text files that only hold a few bytes of data in each
> text file.
>
> John Carter                             Phone : (64)(3) 358 6639
> Tait Electronics                        Fax   : (64)(3) 359 4632
> PO Box 1645 Christchurch                Email : [email protected]
> New Zealand
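
For anyone who wants to try it, here is a rough Python sketch of sequence (3) from Ted's comment. The function name atomic_write and the 0o644 file mode are my own assumptions, not something from the thread. The os calls are thin wrappers over the same open/write/fsync/close/rename system calls, and the optional backup rename in 3.f is left out.

```python
import os


def atomic_write(path, data):
    """Replace the file at `path` with the bytes in `data`, as in sequence (3).

    `atomic_write` is a hypothetical helper name used for illustration.
    """
    # Keep the temporary file next to the target so the final rename
    # stays within one filesystem.
    tmp_path = path + ".new"

    # 3.b) create (or truncate) the temporary file; the mode is an assumption
    fd = os.open(tmp_path, os.O_WRONLY | os.O_TRUNC | os.O_CREAT, 0o644)
    try:
        # 3.c) write() may accept fewer bytes than offered, so loop
        view = memoryview(data)
        while view:
            written = os.write(fd, view)
            view = view[written:]
        # 3.d) push the data to stable storage; os.fsync raises OSError
        #      on failure, which is how the error return gets checked here
        os.fsync(fd)
    finally:
        # 3.e) close the descriptor whether or not the steps above worked
        os.close(fd)

    # 3.g) only reached if write and fsync succeeded: swap the new file in
    os.replace(tmp_path, path)
```

Note that open() does not expand "~" (the shell does that), so a call would look something like atomic_write(os.path.expanduser("~/.kde/foo/bar/baz"), new_contents). If the write or the fsync fails, the exception propagates and the original file is left alone, although a stale baz.new may remain.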

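
And a sketch of the other suggestion, keeping many small settings in one database file rather than in hundreds of dot files. This uses sqlite3 from the Python standard library as a stand-in for the "berkdb or other binary database file" Ted mentions. The table layout and the function names (open_settings, save_settings, load_settings) are hypothetical.

```python
import sqlite3


def open_settings(db_path):
    """Open (or create) a single database file holding all the settings."""
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS settings ("
        "key TEXT PRIMARY KEY, value TEXT NOT NULL)"
    )
    conn.commit()
    return conn


def save_settings(conn, settings):
    """Store every key/value pair from the dict in one transaction."""
    # Using the connection as a context manager commits on success and
    # rolls back on an exception, so the batch is written as one unit
    # instead of one small file (and one fsync) per setting.
    with conn:
        conn.executemany(
            "INSERT OR REPLACE INTO settings (key, value) VALUES (?, ?)",
            settings.items(),
        )


def load_settings(conn):
    """Return all stored settings as a dict."""
    return dict(conn.execute("SELECT key, value FROM settings"))
```

The point of the design is that the application hands the whole batch to the database and lets its commit take care of getting the data to disk, rather than repeating sequence (3) for each tiny file.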