This mail is about an issue that has been of concern to me for quite a while and I think it is (well past) time to air it more widely and try to come to a resolution.
This issue is how write barriers (the block-device kind, not the
memory-barrier kind) should be handled by the various layers.

The following is my understanding, which could well be wrong in
various specifics.  Corrections and other comments are more than
welcome.

------------

What are barriers?
==================

Barriers (as generated by requests with BIO_RW_BARRIER) are intended
to ensure that the data in the barrier request is not visible until
all writes submitted earlier are safe on the media, and that the data
is safe on the media before any subsequently submitted requests are
visible on the device.

This is achieved by tagging requests in the elevator (or any other
request queue) so that no re-ordering is performed around a
BIO_RW_BARRIER request, and by sending appropriate commands to the
device so that any write-behind caching is defeated by the barrier
request.

Alongside BIO_RW_BARRIER is blkdev_issue_flush, which calls
q->issue_flush_fn.  This can be used to achieve similar effects.

There is no guarantee that a device can support BIO_RW_BARRIER - it is
always possible that a request will fail with EOPNOTSUPP.

Conversely, blkdev_issue_flush must be supported on any device that
uses write-behind caching (if it cannot be supported, then
write-behind caching should be turned off, at least by default).

We can think of there being three types of devices:

1/ SAFE.  With a SAFE device, there is no write-behind cache, or if
   there is it is non-volatile.  Once a write completes it is
   completely safe.  Such a device does not require barriers
   or ->issue_flush_fn, and can respond to them either by a
   no-op or with -EOPNOTSUPP (the former is preferred).

2/ FLUSHABLE.
   A FLUSHABLE device may have a volatile write-behind cache.
   This cache can be flushed with a call to blkdev_issue_flush.
   It may not support barrier requests.

3/ BARRIER.
   A BARRIER device supports both blkdev_issue_flush and
   BIO_RW_BARRIER.
   Either may be used to synchronise any write-behind cache to
   non-volatile storage (media).

Handling of SAFE and FLUSHABLE devices is essentially the same and can
work on a BARRIER device.  The BARRIER device has the option of more
efficient handling.

How does a filesystem use this?
===============================

A filesystem will often have a concept of a 'commit' block which makes
an assertion about the correctness of other blocks in the filesystem.
In the most gross sense, this could be the writing of the superblock
of an ext2 filesystem, with the "dirty" bit clear.  This write commits
all other writes to the filesystem that precede it.

More subtle/useful is the commit block in a journal as with ext3 and
others.  This write commits some number of preceding writes in the
journal or elsewhere.

The filesystem will want to ensure that all preceding writes are safe
before writing the commit block.  There are two ways to achieve this.

1/ Issue all 'preceding writes', wait for them to complete
   (bi_end_io called), call blkdev_issue_flush, issue the commit
   write, wait for it to complete, call blkdev_issue_flush a second
   time.
   (This is needed for FLUSHABLE)

2/ Set the BIO_RW_BARRIER bit in the write request for the commit
   block.
   (This is more efficient on BARRIER).

The second, while much easier, can fail.  So a filesystem should be
prepared to deal with that failure by falling back to the first
option.
Thus the general sequence might be:

  a/ issue all "preceding writes".
  b/ issue the commit write with BIO_RW_BARRIER
  c/ wait for the commit to complete.
     If it was successful - done.
     If it failed other than with EOPNOTSUPP, abort
     else continue
  d/ wait for all 'preceding writes' to complete
  e/ call blkdev_issue_flush
  f/ issue commit write without BIO_RW_BARRIER
  g/ wait for commit write to complete
     if it failed, abort
  h/ call blkdev_issue_flush
  DONE

Steps b and c can be left out if it is known that the device does not
support barriers.
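
As a rough illustration of steps a/ to h/, here is a minimal userspace
sketch in C.  It is not kernel code: struct mock_dev, mock_write,
mock_flush and commit_transaction are hypothetical names invented for
this example.  mock_flush stands in for blkdev_issue_flush and the
'barrier' argument stands in for BIO_RW_BARRIER.  Writes complete
synchronously in this model, so the wait in step d/ is implicit.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of a block device; not a kernel structure. */
struct mock_dev {
	bool supports_barrier;	/* false: barrier writes get -EOPNOTSUPP */
	int cached;		/* writes held in the volatile cache */
	int on_media;		/* writes that have reached the media */
	int flushes;		/* explicit flush calls seen so far */
};

/* Move everything in the volatile cache to the media. */
static void drain_cache(struct mock_dev *d)
{
	d->on_media += d->cached;
	d->cached = 0;
}

/* Stand-in for blkdev_issue_flush(). */
static int mock_flush(struct mock_dev *d)
{
	drain_cache(d);
	d->flushes++;
	return 0;
}

/* Stand-in for submitting one write and waiting for it to complete. */
static int mock_write(struct mock_dev *d, bool barrier)
{
	if (!barrier) {
		d->cached++;		/* completes into the cache only */
		return 0;
	}
	if (!d->supports_barrier)
		return -EOPNOTSUPP;
	drain_cache(d);			/* earlier writes safe first */
	d->on_media++;			/* then the barrier write itself */
	return 0;
}

/* Steps a/ to h/.  Returns 0 on success or a negative errno. */
static int commit_transaction(struct mock_dev *d, int nr_preceding)
{
	int i, err;

	assert(d != NULL);

	/* a/ issue all preceding writes */
	for (i = 0; i < nr_preceding; i++) {
		err = mock_write(d, false);
		if (err)
			return err;
	}

	/* b/ and c/ try the commit write as a barrier */
	err = mock_write(d, true);
	if (err != -EOPNOTSUPP)
		return err;	/* 0 means done, anything else aborts */

	/* d/ is implicit here: mock writes have already completed */

	/* e/ flush the preceding writes to media */
	err = mock_flush(d);
	if (err)
		return err;

	/* f/ and g/ commit write without the barrier flag */
	err = mock_write(d, false);
	if (err)
		return err;

	/* h/ flush the commit write itself */
	return mock_flush(d);
}
```

In this model a device with supports_barrier clear ends up with two
explicit flushes per commit, while a device with it set needs none,
which is the efficiency difference between FLUSHABLE and BARRIER
described above.
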
The only way to discover whether a device supports barriers is to try
one and see if it fails.

I don't think any filesystem follows all these steps.

 ext3 has the right structure, but it doesn't include steps e and h.
 reiserfs is similar.  It does have a call to blkdev_issue_flush, but
          that is only on the fsync path, so it isn't really
          protecting general journal commits.
 XFS - I'm less sure.  I think it does 'a' then 'd', then 'b' or 'f'
   depending on whether it thinks the device handles barriers,
   and finally 'g'.

 I haven't looked at other filesystems.

So for devices that support BIO_RW_BARRIER, and for devices that don't
need any flush, they work OK, but for devices that need flushing, but
don't support BIO_RW_BARRIER, none of them work.  This should be easy
to fix.

HOW DO MD or DM USE THIS
========================

1/ striping devices.
   This includes md/raid0 md/linear dm-linear dm-stripe and probably
   others.

   These devices can easily support blkdev_issue_flush by simply
   calling blkdev_issue_flush on all component devices.

   These devices would find it very hard to support BIO_RW_BARRIER.
   Doing this would require keeping track of all in-flight requests
   (which some, possibly all, of the above don't) and then:
     When a BIO_RW_BARRIER request arrives:
        wait for all pending writes to complete
        call blkdev_issue_flush on all devices
        issue the barrier write to the target device(s)
           as BIO_RW_BARRIER,
        if that is -EOPNOTSUPP, re-issue, wait, flush.

   Currently none of the listed modules do that.  md/raid0 and
   md/linear fail any BIO_RW_BARRIER with -EOPNOTSUPP.  dm-linear and
   dm-stripe simply pass the BIO_RW_BARRIER flag down, which means
   data may not be flushed correctly: the commit block might be
   written to one device before a preceding block is written to
   another device.

   I think the best approach for this class of devices is to return
   -EOPNOTSUPP.  If the filesystem does the wait (which they all do
   already) and the blkdev_issue_flush (which is easy to add), these
   devices don't need to support BIO_RW_BARRIER.

2/ Mirror devices.
   This includes md/raid1 and dm-raid1.

   These devices can trivially implement blkdev_issue_flush much like
   the striping devices, and can support BIO_RW_BARRIER to some
   extent.
   md/raid1 currently tries.  I'm not sure about dm-raid1.

   md/raid1 determines if the underlying devices can handle
   BIO_RW_BARRIER.  If any cannot, it rejects such requests
   (EOPNOTSUPP) itself.
   If all underlying devices do appear to support barriers, md/raid1
   will pass a barrier-write down to all devices.
   The difficulty comes if it fails on one device, but not all
   devices.  In this case it is not clear what to do.  Failing the
   request is a lie, because some data has been written (possibly too
   early).  Succeeding the request (after re-submitting the failed
   requests) is also a lie as the barrier wasn't really honoured.
   md/raid1 currently takes the latter approach, but will only do it
   once - after that it fails all barrier requests.

   Hopefully this is unlikely to happen.  What device would work
   correctly with barriers once, and then not the next time?
   The answer is md/raid1.  If you remove a failed device and add a
   new device that doesn't support barriers, md/raid1 will notice and
   stop supporting barriers.
   If md/raid1 can change from supporting barriers to not, then maybe
   some other device could too?

   I'm not sure what to do about this - maybe just ignore it...

3/ Other modules

   Other md and dm modules (raid5, mpath, crypt) do not add anything
   interesting to the above.  Either handling BIO_RW_BARRIER is
   trivial, or extremely difficult.

HOW DO LOW LEVEL DEVICES HANDLE THIS
====================================

This is part of the picture that I haven't explored greatly.  My
feeling is that most if not all devices support blkdev_issue_flush
properly, and support barriers reasonably well providing that the
hardware does.
There is an exception I recently found though.
For devices that don't support QUEUE_ORDERED_TAG (i.e.
commands sent to the controller can be tagged as barriers), SCSI will
use the SYNCHRONIZE_CACHE command to flush the cache after the barrier
request (a bit like the filesystem calling blkdev_issue_flush, but at
a lower level).  However it does this without setting the SYNC_NV bit.
This means that a device with a non-volatile cache will be required
-- needlessly -- to flush that cache to media.

So: some questions to help encourage response:

 - Is the above substantially correct?  Totally correct?
 - Should the various filesystems be "fixed" as suggested above?  Is
   someone willing to do that?
 - Is the approach to barriers taken by md appropriate?  Should dm
   do the same?  Who will do that?
 - Is setting the SYNC_NV bit really the right thing to do?  Are there
   any other places where the wrong sort of sync might be happening?
   Are there any callers that require SYNC_NV to be clear?
 - The comment above blkdev_issue_flush says "Caller must run
   wait_for_completion() on its own".  What does that mean?
 - Are there other bits that we could handle better?
   BIO_RW_FAILFAST?  BIO_RW_SYNC?  What exactly do they mean?

Thank you for your attention.

NeilBrown