It's time to get out of "those who can't do, teach" mode and do some more doing
on my tailmerging project.
First, I'll summarize the previous discussion yet again because (a) I fscked it
up the first time and (b) it's awfully instructive. At the time, I understood
about 10% of it. (I'm not shy about admitting that - the reason is that what
was being talked about wasn't documented anywhere.) Now, after going through a
couple of iterations of "Documenting the VFS" I've pretty well got it all under
control but map/unmap and synchronization. Pretty big "but". I guess I know
what the next two installments in the Documenting series will be about.
D: Daniel Phillips
A: Alexander Viro
S: Stephen Tweedie
D> Stephen asked me some sharp questions about how this would work,
D> and after I answered them to his satisfaction he asked me if I
D> would have time to implement this feature. I said yes, and went
D> on to write an initial design document describing the required
D> modifications to Ext's handling of inodes, and a prototype
D> algorithm for doing the tail merging.
A> Here is one more for you:
A> Suppose we grow the last fragment/tail/whatever. Do you copy the
A> data out of that shared block? If so, how do you update buffer_heads in
A> pages that cover the relocated data? (Same goes for reiserfs, if they are
A> doing something similar). BTW, our implementation of UFS is fucked up in
A> that respect, so variant from there will not work.
D> I'm not sure what you mean there...
A> I mean that UFS has the same problem (relocation of the last fragment) and
A> our implementation is fucked up (== does not deal with that properly and
A> eats data). So if you will look for existing solutions - forget about the
A> UFS one; it isn't. UFS will need fixing, but that's a separate story...
S> For tail writes, I'd imagine we would just end up using the page cache
S> as a virtual cache as NFS uses it, and doing plain copy into the
S> buffer cache pages.
A> Ouch. I _really_ don't like it - we end up with special behaviour on one
A> page in the pagecache.
S> Correct. But it's all inside the filesystem, so there is zero VFS
S> impact. And we're talking about non-block-aligned data for tails, so
S> we simply don't have a choice in this case.
A> <shrug> Sure, it's not a VFS problem (albeit it _will_ require accurate
A> playing with unmap_....() in buffer.c), but ext2 problems are pretty
A> interesting too...
The thread forked at this point; the same exchange continued down another branch:
S> For tail writes, I'd imagine we would just end up using the page cache
S> as a virtual cache as NFS uses it, and doing plain copy into the
S> buffer cache pages.
A> Ouch. I _really_ don't like it - we end up with special behaviour on one
A> page in the pagecache. And getting data migration from buffer cache to
A> page cache, which is Not Nice(tm).
S> Not preferred for bulk data, perhaps, but the VFS should cope just
S> fine.
A> Yuck... Besides, when do we decide that tail is going to be, erm,
A> merged? What will happen with the page then?
S> To the page? Nothing. To the buffer? It gets updated with the new
S> contents of disk. Page == virtual contents. Buffer == physical
S> contents. Plain and simple.
A> Erm? Consider that: huge lseek() + write past the end of file. Woops - got
A> to unmerge the tail (it's an internal block now) and we've got no
A> knowledge of IO going on the page. Again, IO may be asynchronous - no
A> protection from i_sem for us. After that page becomes a regular one,
A> right? Looks like a change of state to me...
S> Naturally, and that change of state must be made atomically by the
S> filesystem.
A> Yep. Which is the point - there _are_ dragons. I believe that it's doable,
A> but I really want to repeat: Daniel, watch out for races at the moments
A> when page state changes, it needs more accurate approach than usual
A> pagecache-using fs. It can be done, but it will take some reading (and
A> yes, Stephen, I know that _you_ know it ;-)
The discussion also took another branch:
A> If so, how do you update buffer_heads in pages that cover the relocated data?
D> We have to be sure that if blocks are buffered then they are buffered in
D> exactly one place and you always access them through the buffer hash
D> table. So far so good, but the picture gets murkier for me when you talk...
A> Not. Data normally is in page. Buffer_heads are not included into buffer
A> cache. They are referred from the struct page and their ->b_data just
A> points to appropriate pieces of page. You can not get them via bread().
A> At all. Buffer cache is only for metadata.
S> Only in the default usage. There's no reason at all why we can't use
S> separate buffer and page cache aliases of the same data for tails as a
S> special case.
A> In theory - yes, but doing that will require a _lot_ of accurate thinking
A> about possible races. IOW, I'm afraid that transitions tail<->normal block
A> will be race-prone. Paint me over-cautious, but after you-know-what... Oh,
A> well... I'm not saying that it's impossible, but I _really_ recommend to
A> take a hard look at race scenarios - there is a potential for plenty of
A> them.
I received advice from Chris Mason (ReiserFS) as well:
D: Daniel Phillips
C: Chris Mason
D> I don't know exactly what ReiserFS does - I just heard Hans mention the term
D> 'tail merging' and I could see that it was a good idea.
C> I'll give the quick and dirty answer, if people want more details, let me
C> know. In 2.2, reiserfs_file_write deals directly with tails. It appends
C> to them if there is room in the packed block, or converts them if there
C> isn't. If reiserfs_file_write is called with a buffer size > 512 bytes,
C> it tries to write into full blocks instead of tails. This limits the
C> overhead when you cp/untar to create new files.
C>
C> In both releases, there is locking on the tail to prevent races, and we
C> don't bother with tails on files > 16k (configurable).
C>
C> For 2.4, the functions work like this:
C>
C> reiserfs_get_block converts the tail into its own block and points
C> the buffer head at the new block.
C>
C> reiserfs_readpage reads directly from the tail into the page, leaves the
C> buffer head mapped, and sets b_blocknr to 0.
C>
C> reiserfs_writepage and reiserfs_prepare_write both check for mapped buffer
C> heads with a block number of 0 in the page. If found, they are unmapped.
C> Then block_write_full_page or block_prepare_write is called.
C>
C> reiserfs_truncate deals directly with the tail. If the last block is
C> packed back into the tail, it is unmapped from the page cache.
C>
C> reiserfs_file_release will check to see if the tail needs to be repacked,
C> and use truncate (without changing i_size) to pack the tail.
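To make the readpage/writepage bookkeeping concrete, here is a minimal
2.4-flavored sketch of the unmapping pass Chris describes. The function name
is mine and reiserfs's real code surely differs in detail; the point is just
the walk over the page's buffer ring looking for the fake block number 0:

#include <linux/fs.h>
#include <linux/mm.h>

/* Before block_write_full_page/block_prepare_write: walk the page's
 * buffer ring and unmap any buffer that readpage left mapped with the
 * fake block number 0, i.e. whose contents came from a packed tail.
 * A sketch of the pattern Chris describes, not actual reiserfs code. */
static void unmap_packed_tail_buffers(struct page *page)
{
	struct buffer_head *bh, *head;

	if (!page->buffers)
		return;
	bh = head = page->buffers;
	do {
		if (buffer_mapped(bh) && bh->b_blocknr == 0)
			clear_bit(BH_Mapped, &bh->b_state);
		bh = bh->b_this_page;
	} while (bh != head);
}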
OK, the first time I read this I didn't understand much of it. After doing my
research I came up with an approach that is... almost exactly what Chris
described. So I guess I'm reinventing the wheel, at least this part of it.
It's still worth pushing on though, because this wheel now has to work on a
different car.
D> After digging a little deeper I can see that using the read actor
D> won't work because the read actor doesn't take the inode, or
D> anything that can be dereferenced to find the inode, as a parameter.
D> So it's not possible to do the tail offset check and adjustment
D> there.
D>
D> That's ok - it's the wrong place to do it anyway because the check
D> then has to be performed each time around the loop. A much better
D> way is to replace generic_file_read in the Ext2 file_operations
D> struct by a new ext2_file_read:
D>
D> proposed_ext2_file_read:
D> - generic_file_read stopping before any tail with nonzero offset
D> - If necessary, generic_file_read of the tail with source offset
C> For reading the tail, take a look at how these functions interact:
C> get_block
C> generic_file_read
C> block_read_full_page (ext2's readpage func)
C>
C> Putting the tail knowledge into ext2_file_read won't be enough, it
C> won't cover mmaps. You have to make sure your readpage/writepage
C> functions keep the page and buffer caches in sync. Reiserfs does
C> most of this from get_block...
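Concretely, ext2's readpage in 2.4 is essentially a one-line wrapper, so any
per-buffer tail knowledge has to live in the get_block callback it passes
down. That's also what makes the get_block approach cover mmap: page faults
come in through readpage, never through file_operations read:

static int ext2_readpage(struct file *file, struct page *page)
{
	return block_read_full_page(page, ext2_get_block);
}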
and later...
D> Now I have to address the question of how tail blocks can
D> be shared between files...
C> You have two real choices. Unpack the tail before any writes to it, then
C> repack later (file close, whatever). This allows you to use all the
C> generic functions for writing data, and keeps the synchronization down
C> (only write to shared data on pack/unpack).
D> Yes, that was the preferred design right from the beginning even before I
D> started thinking of it in terms of page cache operations vs buffer cache.
C> Or, change your prepare, commit, writepage, and get_block routines to
C> write directly to the shared block. This is somewhat more difficult, and
C> I suspect slower since you'll have to touch all the inodes in the ring as
C> you shift data around for each write.
D> Ick - I think we agree on which is best. Incidentally, since the ring is
D> now defined to be double-linked, you will only touch two inodes besides
D> the one you're writing to. Furthermore, if those two inodes happen to
D> be on the same inode page (8 inodes/page in a 1K/block fs; 32 with 4K)
D> then you don't even have to hit the disk an extra time. The tailmerging
D> algorithm could easily be optimized to favor this arrangement if desired.
C> Ok, that makes a lot of sense. The other possible optimization comes
C> with allocate on flush. Instead of unpacking the tail into a new block,
C> you could just make sure a new block will be available, and leave the
C> tail packed. get_block will return a page that is written into, and when
C> the flush happens, data from the page is either allocated into a new
C> block (the unpack commits), or repacked into the tail.
C> The benefit is that programs that open, write, close will generally finish with
C> the file before the flush, thus making the unpack close to a noop.
D> I see we're now at the point that Alex warned me about - the
D> change-of-state when a merged tail on the page has to be unmerged
D> due to a write further up the file.
C> Yes. I think reiserfs just does the unmerge directly into the buffer
C> cache. If the unmerge needs to be done, we know there are no pages that
C> have the data unpacked for writing (repacking only happens when there are
C> no writers). Writers unmap the buffer heads in the page struct before
C> changing anything, so the readers will get their data invalidated at
C> that time.
In Ext2 when you unmerge you also have to drill down into the inode's index tree
and reallocate the tail block. Unless someone thinks this is a really bad idea,
I'm going to handle this by letting the 'create' parameter of ext2_get_block
take 3 values instead of 2:
0: don't create a block, return null if it's not there (read)
1: create a block if there isn't one there (write)
2: create a block even if there's already one there (copy on write)
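Spelled out as constants (names invented here, purely for illustration - in
the code 'create' stays a plain int):

#define EXT2_GET_BLOCK_READ	0	/* don't create; null if not there */
#define EXT2_GET_BLOCK_WRITE	1	/* create a block if there isn't one */
#define EXT2_GET_BLOCK_COW	2	/* create one even if a block is there */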
I'm not quite ready to code this yet because I'm not clear on what the special
handling of 'metadata' is all about. But getting close.
This was the general plan, and it hasn't changed:
proposed ext2 file write:
writing past tail?
if so, unmerge tail block
generic file write
additions to write full page:
(within the per-buffer write)
is the file tailmerged?
if so, does this part map the tail block?
if so, get the tail block, copy in the tail and dirty it (sketched below)
additions to read full page:
(within the per-buffer read)
is the file tailmerged?
if so, does this part map the tail block?
if so, get the tail block and copy the tail to the page
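For the write side - the "additions to write full page" above - the per-buffer
tail case might reduce to a copy through the buffer cache along these lines.
Every ext2_tail_* helper here is an invented name, and a highmem-safe version
would need kmap rather than page_address:

#include <linux/fs.h>
#include <linux/mm.h>
#include <linux/string.h>

/* Copy the tail bytes from the page into the shared tail block and
 * dirty it, instead of writing the page's buffer to disk directly.
 * All ext2_tail_* helpers are hypothetical. */
static int ext2_write_merged_tail(struct inode *inode, struct page *page)
{
	struct buffer_head *bh = bread(inode->i_dev,
		ext2_tail_physical(inode), inode->i_sb->s_blocksize);

	if (!bh)
		return -EIO;
	memcpy(bh->b_data + ext2_tail_offset(inode),
		(char *) page_address(page) + ext2_tail_page_offset(inode),
		ext2_tail_size(inode));
	mark_buffer_dirty(bh);	/* shared block goes out via the buffer cache */
	brelse(bh);
	return 0;
}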
Here are the implementation details, perhaps not exactly right yet, but
getting close.
I wish I could have ext2_readpage, which currently just calls
block_read_full_page, instead read as many buffers on the page as can be read
normally, then handle a merged tail block (if present) specially by bread'ing it
and copying it onto the page. Unfortunately, I'd have to change the VFS to do
that because the only VFS tool available at this level is block_read_full_page,
and it does exactly what it says: reads a full page. (I really don't think this
is sufficiently flexible; a 'start/number' pair of ints would fix the problem.)
So in order to stay strictly inside the filesystem I have to do exactly what
Chris suggested: gimmick ext2_get_block to recognize the special case. So
there's going to be an ext2_get_block_on_page that knows it's being called
from read_full_page and does:
static int ext2_get_block_on_page (struct inode *inode, long iblock,
        struct buffer_head *bh_result, int create)
{
        <is this a merged tail block?>
                <were we called from ext2_read_full_page, i.e., create = 0?>
                <yes: bread it, copy the tail fragment to the page and exit>
                <no: copy the tail from the page to the getblk'd tail block
                 and dirty it>
                ext2_get_block (inode, iblock, bh_result, 2); /* force tail
                                                            to new block */
        else
                ext2_get_block (inode, iblock, bh_result, create);
}
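Filled in a little against the 2.4 interfaces, the read branch might look
something like the sketch below. Every ext2_tail_* helper is an invented name
for wherever the merged tail's block number, in-block offset and length end up
being recorded, and the buffer fiddling copies the reiserfs trick Chris
described (mapped + uptodate with b_blocknr = 0), so block_read_full_page
submits no I/O for that buffer:

#include <linux/fs.h>
#include <linux/string.h>

static int ext2_get_block_on_page(struct inode *inode, long iblock,
	struct buffer_head *bh_result, int create)
{
	/* Not the merged tail block: fall through to the normal path. */
	if (iblock != ext2_tail_block(inode))
		return ext2_get_block(inode, iblock, bh_result, create);

	if (!create) {
		/* Read: bread the shared block and copy our fragment onto
		   the page (bh_result->b_data points into the page here;
		   a highmem-safe version would need kmap). */
		struct buffer_head *bh = bread(inode->i_dev,
			ext2_tail_physical(inode),
			inode->i_sb->s_blocksize);
		if (!bh)
			return -EIO;
		memset(bh_result->b_data, 0, inode->i_sb->s_blocksize);
		memcpy(bh_result->b_data,
			bh->b_data + ext2_tail_offset(inode),
			ext2_tail_size(inode));
		brelse(bh);
		/* Fake the buffer up the way reiserfs does, so that
		   block_read_full_page sees it as already read. */
		bh_result->b_blocknr = 0;
		bh_result->b_state |= (1UL << BH_Mapped) | (1UL << BH_Uptodate);
		return 0;
	}
	/* Write: unmerge - force the tail into a block of its own;
	   block_write_full_page then copies the page's data out. */
	return ext2_get_block(inode, iblock, bh_result, 2);
}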
This is a hack, but at least it's not very complicated. If an inode has a
formerly merged tail block that all the other inodes have removed themselves
from, but it still has a nonzero tail offset, or if the block's buffer is not
on the page (it was obtained by bread, it's still hanging around, and it could
be in use), then the block will be handled as if it were still merged; in
other words, a copy of the tail block will be forced. It's tempting to try to
move a buffer that's been forced off the page by a bread back onto the page,
and that could be seen as a refinement. We'll see if it matters in terms of
efficiency.
This is perhaps not the most efficient solution, but it should lead to a fairly
simple situation that is robust in the face of unusual circumstances, e.g.,
I/O in progress on the old tail block just at the moment we want to write to
a place further down the file.
The unmerge that happens in ext2_file_write looks like this:
<start exclusive>
ext2_get_block (inode, iblock, bh_result, 2) /* force tail to new block */
<clear this inode's tail offset>
<delete this inode from the merge list>
<end exclusive>
The reason the unmerge is done here and not at a lower level in ext2_get_block
is that I don't want get_block diving down into the innards of a page it's not
being asked to do I/O on. That could get pretty messy.
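As a sketch, the exclusive region could be wrapped up like this. The locking
and merge-ring helpers are invented names, since exactly what protects the
ring (i_sem alone won't cover async I/O, per Alex's warning) is still open:

#include <linux/fs.h>

/* Force a merged tail into its own block and take this inode out of
 * the merge ring. All ext2_*merge* helpers are hypothetical. */
static int ext2_unmerge_tail(struct inode *inode, long iblock)
{
	struct buffer_head dummy;
	int err;

	dummy.b_state = 0;
	dummy.b_blocknr = 0;

	ext2_lock_merge_ring(inode);			/* start exclusive */
	err = ext2_get_block(inode, iblock, &dummy, 2);	/* force tail to new block */
	if (!err) {
		ext2_set_tail_offset(inode, 0);		/* clear tail offset */
		ext2_unlink_from_merge_ring(inode);	/* delete from merge list */
		mark_inode_dirty(inode);
	}
	ext2_unlock_merge_ring(inode);			/* end exclusive */
	return err;
}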
--
Daniel