I found myself in a cave where dragons were sleeping. I crept forward
trying to be as quiet as possible, but before I had crossed to the
other side one of them rolled over and belched. It gazed at me; I
gazed back. As smoke began to drift lazily from its nostrils I
raised my sword...
Back in mundane reality, I just finished putting the final piece of
(non-incremental) tailmerging in place: unmerge on file extend. Then
one of my test cases started failing. Can't be, I thought. I can see
all the data go where it's supposed to, do what it's supposed to.
Then I realized: Oh. Buffer alias.
So now on to the analysis. This is a deep problem; it has nothing to
do with making stupid mistakes. It has everything to do with the
recent unification of the buffer and page caches.
Simply stated, the new cache design divides filesystem blocks into two
classes: those that can be memory-mapped and those that can't. There
is no defined way to move a given block from one class to the other.
This is the sequence of events for a tail unmerge:
- Fix up various inode fields
- bread the tail block
- Allocate new block from ext2, map it into a buffer with getblk
- Copy the tail fragment to the buffer of the newly allocated block
- Mark the buffer dirty and go away
So far so good. If I remount now, I have a good filesystem. But then
I append to the file and the following happens:
- generic_file_write can't find the tail page in the page cache
- so it creates the page, creating the page buffers
- ext2_write_page->block_write_full_page->ext2_get_block
- ext2_get_block fills in the page buffer for the tail block (*)
- we now have our alias, races, the whole nine yards
Reading and mmap operations also run into trouble in similar ways.
First off: I can solve this problem in a nasty, slow way, by waiting
until the new tail block finishes writing, then killing the buffer.
Ugh... all operations on the file have to stop and wait and the block
has to be re-read.
A better solution is to kill the alias on the line marked (*). And
after my sword is all bloody from slaying that dragon, will it stay
dead?
It seems I'm fixing a symptom here. The deep problem is the lack of
mobility between the page cache and buffer cache. Sometimes you just
need to do things to individual blocks that are outside the scope of
read, write and mmap. Sometimes you want data blocks to become
metadata blocks and vice versa. Maybe this has been avoided in the
specific filesystems now in the kernel tree, but for the general case
it's too restrictive.
--
Daniel