Daniel Phillips wrote:
> There are two obvious ways to do filesystem-specific special handling of the
> tail block: (1) in the 'read actor' that does the actual copy-to-user (see? yet
> another problem solved by an extra level of indirection!) or (2) in the
> inode->i_mapping->a_ops->readpage function that retrieves the page: essentially
> we would do a copy-on-write to produce a new page that has the tail fragment in
> the correct place.
> 
> I have a very distinct preference for (1), and I'll proceed on the assumption
> that that's what I'll be doing unless someone shows me why it's bad.

After digging a little deeper I can see that using the read actor won't work
because the read actor doesn't take the inode, or anything that can be
dereferenced to find the inode, as a parameter.  So it's not possible to do the
tail offset check and adjustment there.

That's ok - it's the wrong place to do it anyway because the check then has to
be performed each time around the loop.  A much better way is to replace
generic_file_read in the Ext2 file_operations struct by a new ext2_file_read:

proposed_ext2_file_read:
  - generic_file_read stopping before any tail with nonzero offset
  - If necessary, generic_file_read of the tail with source offset

This imposes just one extra check when the tail isn't merged or happens to be at
the beginning of the tail block, so read overhead for tailmerging is negligible
when the feature isn't used.
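
To make the two-step read concrete, here is a minimal userspace sketch, not kernel code.  Everything in it is hypothetical: struct toy_inode, toy_generic_read (a stand-in for generic_file_read), toy_ext2_file_read, the tiny BLOCK_SIZE, and the flat 'data' array that models the file's full blocks followed by its (possibly shared) tail block.  It only illustrates the split into "read up to the tail" and "read the tail with a source offset", with a single check guarding the ordinary path.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define BLOCK_SIZE 16   /* tiny block size, chosen only to keep the model readable */

/* Hypothetical stand-in for an inode whose tail may be merged. */
struct toy_inode {
    const unsigned char *data;  /* full blocks, then the (possibly shared) tail block */
    size_t size;                /* file size in bytes */
    size_t tail_offset;         /* where the tail starts inside the tail block; 0 = not merged */
};

/*
 * Stand-in for generic_file_read: copy up to 'count' bytes starting at *ppos,
 * never going past 'limit'.  'src_shift' is added to the source address and
 * models "generic_file_read of the tail with source offset".
 */
static size_t toy_generic_read(const struct toy_inode *inode, unsigned char *buf,
                               size_t count, size_t *ppos,
                               size_t limit, size_t src_shift)
{
    if (*ppos >= limit)
        return 0;
    if (count > limit - *ppos)
        count = limit - *ppos;
    memcpy(buf, inode->data + *ppos + src_shift, count);
    *ppos += count;
    return count;
}

/* Sketch of the proposed read: body first, then the tail if necessary. */
static size_t toy_ext2_file_read(const struct toy_inode *inode, unsigned char *buf,
                                 size_t count, size_t *ppos)
{
    size_t tail_len = inode->size % BLOCK_SIZE;
    size_t tail_start = inode->size - tail_len;
    size_t done;

    /* The one extra check: tail not merged, or sitting at offset zero. */
    if (inode->tail_offset == 0)
        return toy_generic_read(inode, buf, count, ppos, inode->size, 0);

    /* A merged tail must fit inside its block. */
    assert(inode->tail_offset + tail_len <= BLOCK_SIZE);

    /* Step 1: ordinary read, stopping before the tail. */
    done = toy_generic_read(inode, buf, count, ppos, tail_start, 0);

    /* Step 2: if the caller wants more, read the tail with the source offset. */
    if (done < count)
        done += toy_generic_read(inode, buf + done, count - done, ppos,
                                 inode->size, inode->tail_offset);
    return done;
}
```

Note that the offset check is made once per read call rather than once per page, which is the point of doing it here instead of in the read actor.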

Now I have to address the question of how tail blocks can be shared between
files.  This does not seem to me to be an easy question at all.  I'll summarize
the previous discussion here...

A: Alexander Viro
S: Stephen Tweedie

A> Here is one more for you:
A>     Suppose we grow the last fragment/tail/whatever. Do you copy the
A> data out of that shared block? If so, how do you update buffer_heads in
A> pages that cover the relocated data? (Same goes for reiserfs, if they are
A> doing something similar). BTW, our implementation of UFS is fucked up in
A> that respect, so variant from there will not work.
 
S> For tail writes, I'd imagine we would just end up using the page cache
S> as a virtual cache as NFS uses it, and doing plain copy into the
S> buffer cache pages.

A> Ouch. I _really_ don't like it - we end up with special behaviour on one
A> page in the pagecache. And getting data migration from buffer cache to
A> page cache, which is Not Nice(tm). Yuck... Besides, when do we decide that
A> tail is going to be, erm, merged? What will happen with the page then?

S> Correct.  But it's all inside the filesystem, so there is zero VFS
S> impact.  And we're talking about non-block-aligned data for tails, so
S> we simply don't have a choice in this case.

A> <shrug> Sure, it's not a VFS problem (albeit it _will_ require accurate
A> playing with unmap_....() in buffer.c), but ext2 problems are pretty
A> interesting too...

A> ...And getting data migration from buffer cache to
A> page cache, which is Not Nice(tm).
 
S> Not preferred for bulk data, perhaps, but the VFS should cope just
S> fine.
 
A> Yuck... Besides, when do we decide that
A> tail is going to be, erm, merged? What will happen with the page then?
 
S> To the page?  Nothing.  To the buffer?  It gets updated with the new
S> contents of disk.  Page == virtual contents.  Buffer == physical
S> contents.  Plain and simple.

A> Erm? Consider that: huge lseek() + write past the end of file. Woops - got
A> to unmerge the tail (it's an internal block now) and we've got no
A> knowledge of IO going on the page. Again, IO may be asynchronous - no
A> protection from i_sem for us. After that page becomes a regular one,
A> right? Looks like a change of state to me...

S> Naturally, and that change of state must be made atomically by the
S> filesystem.

A> Yep. Which is the point - there _are_ dragons. I believe that it's doable,
A> but I really want to repeat: Daniel, watch out for races at the moments
A> when page state changes, it needs more accurate approach than usual
A> pagecache-using fs. It can be done, but it will take some reading (and
A> yes, Stephen, I know that _you_ know it ;-)

Exactly.  OK, till tomorrow...

-- 
Daniel
