I am going to use this opportunity to comment a bit on why issuing the disk flush command is so important for meta-data updates on filesystems which implement instant recovery after a crash. FreeBSD issues the command virtually nowhere: not on UFS fsync(), not with softupdates, and apparently not with the new journaling code, which I am happy to see is making progress.
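For anyone who hasn't bumped into it, here is roughly what "issuing the disk flush command" means from userland, as opposed to a plain fsync(). This is only an illustrative sketch: it assumes FreeBSD's DIOCGFLUSH ioctl from sys/disk.h (which, as I understand it, asks the drive to empty its write cache) and a raw disk device node you can open. In real life the filesystem has to issue the flush itself at the right points in its commit sequence, not an application after the fact.

/*
 * Sketch only: fsync() pushes the dirty pages to the drive, but on the
 * UFS behavior described above nothing forces the drive's own write
 * cache onto the media.  The second step below is the part FreeBSD
 * mostly never does.
 */
#include <sys/ioctl.h>
#include <sys/disk.h>           /* DIOCGFLUSH (FreeBSD, flush write cache) */
#include <err.h>
#include <fcntl.h>
#include <unistd.h>

int
main(int argc, char **argv)
{
        if (argc != 3)
                errx(1, "usage: %s <file> <raw-disk-device>", argv[0]);

        int fd = open(argv[1], O_WRONLY | O_APPEND);
        if (fd < 0)
                err(1, "open %s", argv[1]);

        const char rec[] = "commit record\n";
        if (write(fd, rec, sizeof(rec) - 1) != (ssize_t)(sizeof(rec) - 1))
                err(1, "write");

        /* Step 1: push the dirty buffers down to the drive. */
        if (fsync(fd) < 0)
                err(1, "fsync");

        /*
         * Step 2: ask the drive itself to commit its write cache to the
         * media.  Without this (or a BIO_FLUSH issued by the filesystem)
         * the data can still be lost on a power failure.
         */
        int dfd = open(argv[2], O_RDONLY);
        if (dfd < 0)
                err(1, "open %s", argv[2]);
        if (ioctl(dfd, DIOCGFLUSH) < 0)
                err(1, "DIOCGFLUSH");

        close(dfd);
        close(fd);
        return 0;
}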
Rather than use the thread on the FreeBSD lists to comment, which would surely lead to a flame war and devalue the excellent work being done in FreeBSD to add journaling, I am going to describe the issue here.

Writes to disk can be broken down into two categories:

(1) Meta-data and other structural media writes.

(2) Log writes in general. This covers forward-log logical operations such as write()+fsync(), which do not (necessarily) have to involve meta-data updates on-media in order to be flushable (also known as an intent log); UNDO (rollback) log operations; journaling (forward log operations for meta-data, similar to an intent log but not operating as a logical layer); and other methods.

Meta-data on the physical media presents a recovery burden after a crash. This is why UFS has fsck, why softupdates was written, why HAMMER has an UNDO FIFO, and why the new UFS journaling code is being worked on in FreeBSDland. Different structural forms for meta-data have varying degrees of fragility. For example, the single-B-Tree mechanic that HAMMER uses is more fragile than the cylinder group / blockmap mechanic that UFS uses.

When flushing data to a hard drive it is vastly important to minimize the meta-data corruption which might occur due to a crash. This is what softupdates tries to do, for example. HAMMER's somewhat more fragile meta-data structures on-media require us to 100% eliminate any possibility of meta-data corruption, so we use disk flush commands to actually flush the disk cache onto the media and separate the flush stages (see the sketch below). It could be argued that UFS's somewhat less fragile format can make do without issuing such flushes, but even so people have lost filesystems to softupdates, and in the general case, as storage gets larger and larger, the margin for error gets smaller and smaller. You just CANNOT AFFORD for a filesystem to go bad due to an event (crash, powerdown, etc) on a multi-terabyte filesystem.

It is considerably LESS important to avoid data loss when the only data loss possible is the last few write()'s to a file, as long as the entire previous state of the contents of the file can be recovered (as if those write()'s did not occur), with no misordered or partial writes and no meta-data lost. This less important data loss case is the one which most BSDs, including FreeBSD and DragonFly, use for UFS write()+fsync(). Under UFS a fsync() does not issue a media flush; it simply issues the I/O and leaves the data sitting in the drive cache. HAMMER defaults to full synchronization semantics (and, as I said, I will be adding a sysctl to allow the particular write()+fsync() case to devolve to just a log write without a full disk sync command).

--

Ok, now my comment on UFS, softupdates, and the new journaling work being done in FreeBSD. Here's my comment: "Kudos on the work! But for god's sake implement proper disk synchronization mechanics!". Here's why:

* You can implement the most important mechanics, those for database-style write()/fsync() operations, using only your journal with relaxed media flush requirements, without endangering any meta-data. i.e. anything related to meta-data would always use full media flush mechanics. In other words, you can have your cake and eat it too! So don't stop with it 90% done.

* All my comments above on the fragility of meta-data updates in the face of out-of-order commits to disk, which is what you will get, apply.
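To make the "separate the flush stages" point concrete, here is a small compilable sketch. Everything in it is a hypothetical stand-in -- these are not HAMMER's or softupdates' actual interfaces -- the only thing it is meant to show is where the real cache flush commands (BIO_FLUSH or equivalent) have to sit relative to the log writes and the meta-data writes.

/*
 * Hypothetical flusher for one commit group.  The helpers just print
 * what a real flusher would queue; the structure is the point.
 */
#include <stdio.h>

static void
start_write(const char *what)
{
        printf("  queue async write: %s\n", what);
}

static void
wait_complete(void)
{
        printf("  wait for queued writes to reach the drive\n");
}

static void
media_flush(void)
{
        /* The critical step: force the drive's write cache onto the media. */
        printf("  MEDIA FLUSH (drive cache -> platters)\n");
}

/*
 * Commit one flush group: log/UNDO/journal records first, then the
 * meta-data they cover, then the commit point (e.g. a volume header
 * update).  Remove the media_flush() calls and the drive is free to
 * reorder the stages in its cache -- exactly the SOFT2 -> SOFT1 ->
 * (partial) JOURNAL1 style of breakage described below.
 */
static void
commit_flush_group(void)
{
        printf("stage 1: log/UNDO/journal records\n");
        start_write("journal records");
        wait_complete();
        media_flush();

        printf("stage 2: meta-data covered by stage 1\n");
        start_write("meta-data buffers");
        wait_complete();
        media_flush();

        printf("stage 3: commit point\n");
        start_write("volume header / commit record");
        wait_complete();
        media_flush();
}

int
main(void)
{
        commit_flush_group();
        return 0;
}

The relaxed write()+fsync() case I keep talking about is the one that can stop after stage 1, or even skip the media flush there if the administrator explicitly asks for the weaker guarantee. Nothing that touches meta-data should ever skip it.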
A simple write()/fsync() operation with a forward log could use relaxed semantics, but you are playing with fire if you try to do that with meta-data updates. Softupdates ALREADY assumes ordering between flush groups, and this has frankly bitten me on numerous occasions in past years. That is, it waits for X parallel I/O's to complete before initiating the next block of Y parallel I/O's. This is ALREADY broken to some degree. This is ALREADY too fragile. Don't make it *MORE* fragile by assuming ordering between journal updates and meta-data updates queued by softupdates.

JOURNAL1 -> SOFT1 -> JOURNAL2 -> SOFT2 could end up being ordered on the disk as: SOFT2 -> SOFT1 -> JOURNAL2 -> (partial) JOURNAL1. In that example, since the journal is strictly ordered from a recovery standpoint, the journal will be empty. In another example, say the actual order of the writes to the media is SOFT2 -> SOFT1 -> JOURNAL1 -> JOURNAL2; now your journal is trying to undo operations related to SOFT1 that may have already been overwritten by SOFT2, for which no journal exists. Too much fire.

* A large number of your installations will be running systems without a UPS or without shutdown signaling mechanics. The enterprise systems will not, but these operating systems are not designed JUST for enterprise use. How about the home client or server? What about turnkey systems trying to minimize costs?

* As drives age and start to use more renamed sectors, write flushes take longer. The longer write flushes take, the higher the probability that you will lose data sitting in the drive's write cache.

* Intermediate caching (iSCSI devices running on UNIX, for example). It is impossible to optimize those operations if the targets cannot make any assumptions with regard to synchronization mechanics, requiring fully synchronized writes for each I/O individually.

* Port-powered devices. I'd mention USB, but USB doesn't handle the disk sync command very well anyway; still, there are numerous plug'n'play E-SATA devices which, while separately powered, provide a means of quickly disconnecting the device. Hmm, I think E-SATA disk keys exist now too, in fact. The easier it is to disconnect the device, the higher the chance of the device getting disconnected at a bad time, including power (hot-swap), UPS or not. Particularly for port-multipliers, and also for SSDs or any externalizable device, it is far easier than you might imagine to depower a device accidentally. Human error.

* Battery-backed RAID systems are nice, and expensive, but that's no reason to throw away the more typical installation where the drive cache is used directly. This is particularly true for people using SSDs. Sure, a few years from now I expect most SSDs will be able to flush unwritten dirty data to local flash. It hasn't happened yet, and it doesn't help with layered caches in the storage path anyway. I will reiterate that when one is playing with multi-terabyte filesystems, the margin for error is significantly reduced. Power loss events WILL OCCUR. Firmware crashes WILL OCCUR. Power supplies still blow up. It makes no sense to ignore these sources of error.

* UPSs are great, I have one... but properly powering down systems attached to a UPS is actually not trivial. In all my years using UPSes through power failures, systems have only powered down properly 75% of the time. The other 25%... those wound up being hard power-downs. Bye bye disk cache (and often bye-bye drive, but that's another matter).
* NFS and other multi-layered filesystems depend on proper synchronization mechanics for reboot recovery to work properly. The more layers you have, the more likely something will break and all your assumptions will go flying out the window.

* VMs can't cache or optimize I/O's if you are forced to use sync-to-media for every I/O because you can't depend on disk flushing. For example, a FreeBSD client running on a linux host: goodbye intermediate caching layer if the linux host dies. I've run VMs on windows boxes where the windows box hardlocks and completely destroys the 'drive cache'. I found out the hard way that some VMs ignore the disk synchronization command, even! But I don't expect that to last long as VMs become more important.

Basically it comes down to: (1) retaining the ability for devices and intermediate platforms to properly cache and optimize write I/O, (2) the fallacy of the assumption that nothing matters unless caches are battery-backed, and (3) in large scale systems an assumption of data integrity has extremely serious consequences if that promise of integrity turns out to be not quite true in all circumstances. Human error guarantees that.

So I would not-so-humbly suggest that proper media flush semantics be implemented for any UFS journaling implementation, particularly one done on top of softupdates. PARTICULARLY if you want to get rid of fsck for real. For meta-data, of course. write()+fsync() operations which can be flushed with a single log entry and no meta-data writes could use relaxed semantics (though IMHO not as a default).

-Matt

Matthew Dillon <dil...@backplane.com>