Stephen C. Tweedie
Fri, 29 Oct 1999 08:59:42 -0700
Hi all,

There seems to be a conflict between journaling filesystem requirements (both ext3 and reiserfs), and the current raid code when it comes to write ordering in the buffer cache.

The current ext3 code adds debugging checks to ll_rw_block designed to detect any cases where blocks are being written to disk in an order which breaks the filesystem's transaction ordering guarantees. A couple of hours ago it was triggered during a test run here by the raid background resync daemon.

Raid resync basically works by reading, and rewriting, the entire raid device stripe by stripe. The write pass is unconditional. Even if the block is marked as reserved for journaling, and so is bypassed by bdflush, even if the block is clean: it gets written to disk.

ext3 uses a separate buffer list for journaled buffers to avoid bdflush writing them back early. As I understand it (correct me if I'm wrong, Chris), reiserfs journaling simply avoids setting the dirty bit on the buffer_head until the log record has been written. Neither case stops raid resync from flushing the buffer to disk.

As far as I can see, the current raid resync simply cannot observe any write ordering requirements being placed on the buffer cache. This is something which will have to be addressed in the raid code --- the only alternative appears to be to avoid placing any uncommitted transactional data in the buffer cache at all, which would require massive rewrites of ext3 (and probably no less trauma in reiserfs).

This isn't a bug in either the raid code or the journaling --- it's just that the raid code changes semantics which non-journaling filesystems don't care about. Journaling adds extra requirements to the buffer cache, and raid changes the semantics in an incompatible way. Put the two together and you have serious problems during a background raid sync.

Ingo, can we work together to address this?
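
To make the ordering problem concrete, here is a minimal toy model in C. It is only a sketch and not kernel code: struct toy_buffer, toy_bdflush() and toy_resync() are hypothetical names invented for illustration, and the "journaled" flag stands in for whatever mechanism a filesystem uses to hold a buffer back (a separate list in ext3, an unset dirty bit in reiserfs).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-in for a cached block; NOT the kernel's struct buffer_head. */
struct toy_buffer {
    int  mem_data;   /* contents held in the buffer cache */
    int  disk_data;  /* contents currently on disk */
    bool dirty;      /* ordinary writeback would pick this buffer up */
    bool journaled;  /* held back until its transaction commits */
};

/* bdflush-style writeback: skips anything the journal is holding back. */
void toy_bdflush(struct toy_buffer *b, size_t n)
{
    assert(b != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        if (b[i].dirty && !b[i].journaled) {
            b[i].disk_data = b[i].mem_data;
            b[i].dirty = false;
        }
    }
}

/* Resync-style pass: rewrites every block from the cache, unconditionally.
 * Neither the dirty bit nor the journaled flag is consulted. */
void toy_resync(struct toy_buffer *b, size_t n)
{
    assert(b != NULL || n == 0);
    for (size_t i = 0; i < n; i++)
        b[i].disk_data = b[i].mem_data;
}
```

In this model, a buffer with journaled set keeps its old disk_data across toy_bdflush(), but a following toy_resync() copies the uncommitted mem_data to disk anyway, which is the ordering violation the ext3 debugging check caught.
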
One solution would be the ability to mark a buffer_head as "pinned" against being written to disk. Raid resync would then use a temporary buffer head when updating that block, and would use the on-disk copy, not the in-memory one, to update the disk (guaranteeing that the in-memory copy doesn't hit disk). You will have a much better understanding of the locking requirements necessary to ensure that the two copies don't cause mayhem, but I'm willing to help on the implementation.

Fixing this in raid seems far, far preferable to fixing it in the filesystems. The filesystem should be allowed to use the buffer cache for metadata and should be able to assume that there is a way to prevent those buffers from being written to disk until it is ready.

--Stephen
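
The "pinned" proposal can be sketched in the same toy style. This is an illustration under stated assumptions, not a design for the raid code: struct toy_block, the pinned flag and toy_resync_pinned() are hypothetical, the device is assumed to be a two-way mirror, and mirror 0 is assumed to hold the copy that resync reads. Locking between the two copies, which the proposal leaves open, is not modelled at all.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_MIRRORS 2  /* assumption: a simple two-way mirror */

/* Toy block with one cached copy and one copy per mirror. */
struct toy_block {
    int  mem_data;           /* copy in the buffer cache */
    int  disk[TOY_MIRRORS];  /* copy stored on each mirror */
    bool pinned;             /* hypothetical flag: cache copy must not hit disk */
};

/* Resync that honours the pin: for a pinned block the data is taken from
 * the on-disk copy via a temporary buffer, never from the cache. */
void toy_resync_pinned(struct toy_block *blk, size_t n)
{
    assert(blk != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        int src;
        if (blk[i].pinned) {
            int tmp = blk[i].disk[0];  /* temporary buffer, read from disk */
            src = tmp;
        } else {
            src = blk[i].mem_data;     /* unpinned: cache copy is fair game */
        }
        for (size_t m = 0; m < TOY_MIRRORS; m++)
            blk[i].disk[m] = src;      /* bring every mirror into sync */
    }
}
```

With this change a pinned block still ends up consistent across the mirrors, but its uncommitted in-memory contents stay in the cache until the filesystem releases the pin.
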