There seems to be a conflict between the requirements of journaling
filesystems (both ext3 and reiserfs) and the current raid code when it
comes to write ordering in the buffer cache.

The current ext3 code adds debugging checks to ll_rw_block designed to
detect any cases where blocks are being written to disk in an order
which breaks the filesystem's transaction ordering guarantees.
A couple of hours ago one of these checks was triggered during a test
run here by the raid background resync daemon.

Raid resync basically works by reading, and rewriting, the entire raid
device stripe by stripe. The write pass is unconditional. Even if the
block is marked as reserved for journaling (and so is bypassed by
bdflush), and even if the block is clean, it gets written to disk.

ext3 uses a separate buffer list for journaled buffers to avoid bdflush
writing them back early. As I understand it (correct me if I'm wrong,
Chris), reiserfs journaling simply avoids setting the dirty bit on the
buffer_head until the log record has been written. Neither case stops
raid resync from flushing the buffer to disk.
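
To make the interaction concrete, here is a toy user-space model in C.
It is only a sketch, not kernel code: struct toy_buffer, toy_bdflush()
and toy_resync() are names made up for illustration, and the real
buffer_head and resync paths are considerably more involved. It assumes
the behaviour described above: a bdflush-like pass respects the
filesystem's hints, while a resync-like pass rewrites every block from
the cache regardless.

```c
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-in for a cached block; NOT the kernel's struct buffer_head. */
struct toy_buffer {
    int  cached;     /* contents currently held in the buffer cache      */
    int  on_disk;    /* contents last written to the disk                */
    bool dirty;      /* reiserfs-style hint: left clear until log is out */
    bool journaled;  /* ext3-style hint: buffer sits on a private list   */
};

/*
 * bdflush-like pass: writes back only buffers that are dirty and not
 * held back by the journal. Returns the number of buffers written.
 */
static size_t toy_bdflush(struct toy_buffer *bufs, size_t n)
{
    size_t written = 0;

    for (size_t i = 0; i < n; i++) {
        if (bufs[i].dirty && !bufs[i].journaled) {
            bufs[i].on_disk = bufs[i].cached;
            bufs[i].dirty = false;
            written++;
        }
    }
    return written;
}

/*
 * Resync-like pass: rewrites every block from the cached copy and
 * looks at neither hint. Returns the number of buffers written.
 */
static size_t toy_resync(struct toy_buffer *bufs, size_t n)
{
    for (size_t i = 0; i < n; i++)
        bufs[i].on_disk = bufs[i].cached;
    return n;
}
```

With one buffer hidden the ext3 way and one hidden the reiserfs way,
toy_bdflush() writes nothing, but toy_resync() pushes both uncommitted
copies to disk.
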
As far as I can see, the current raid resync simply cannot observe any
write ordering requirements being placed on the buffer cache. This is
something which will have to be addressed in the raid code --- the only
alternative appears to be to avoid placing any uncommitted transactional
data in the buffer cache at all, which would require massive rewrites of
ext3 (and probably no less trauma in reiserfs).

This isn't a bug in either the raid code or the journaling --- it's just
that the raid code changes semantics which non-journaling filesystems
don't care about. Journaling adds extra requirements to the buffer
cache, and raid changes the semantics in an incompatible way. Put the
two together and you have serious problems during a background raid
resync.

Ingo, can we work together to address this? One solution would be the
ability to mark a buffer_head as "pinned" against being written to disk,
and to have raid resync use a temporary buffer head when updating that
block and use the on-disk copy, not the in-memory one, to update the
disk (guaranteeing that the in-memory copy doesn't hit disk). You will
have a much better understanding of the locking requirements necessary
to ensure that the two copies don't cause mayhem, but I'm willing to
help on the implementation.
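
A rough sketch of the "pinned" idea, again as a toy user-space model in
C rather than a patch. Everything here is hypothetical: the pinned
flag, struct pin_buffer and pin_aware_resync() do not exist in the raid
code, a two-way mirror is assumed for simplicity, and locking is
ignored entirely.

```c
#include <stdbool.h>
#include <stddef.h>

/* Toy model of one block on a two-way mirror; names are hypothetical. */
struct pin_buffer {
    int  cached;     /* in-memory copy, possibly uncommitted      */
    int  mirror[2];  /* contents of the block on each mirror half */
    bool pinned;     /* filesystem asks: do not write this yet    */
};

/*
 * Resync-like pass that honours the pinned flag. For a pinned block
 * the data is taken from a temporary copy of what is already on disk
 * (mirror[0] here), so the in-memory copy never reaches the disks.
 * Returns how many blocks were resynced from the on-disk copy.
 */
static size_t pin_aware_resync(struct pin_buffer *bufs, size_t n)
{
    size_t from_disk = 0;

    for (size_t i = 0; i < n; i++) {
        int tmp;  /* stands in for the temporary buffer head */

        if (bufs[i].pinned) {
            tmp = bufs[i].mirror[0];
            from_disk++;
        } else {
            tmp = bufs[i].cached;
        }
        bufs[i].mirror[0] = tmp;
        bufs[i].mirror[1] = tmp;
    }
    return from_disk;
}
```

The mirrors still end up consistent with each other, which is all
resync needs, while the pinned block's in-memory contents stay off the
disk until the filesystem unpins it.
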
Fixing this in raid seems far, far preferable to fixing it in the
filesystems. The filesystem should be allowed to use the buffer cache
for metadata and should be able to assume that there is a way to prevent
those buffers from being written to disk until it is ready.