Daniel Phillips wrote:
> So now it's time to start asking questions.

I'm now partway through lxr'ing the 2.4.0 VFS, and all I can say is wow!
This really is a big improvement over 2.2.

Please treat all of the rest of this post as speculation, since I'm not sure
how accurate any of it is, and feel free to make corrections.

I'm focussing on three functions:

 - generic_file_read
 - generic_file_write
 - generic_file_mmap

Oddly enough, all three of these functions are defined in the mm tree, not the
fs tree.

An observation: in 2.2, generic_file_read was used by Ext2 but
generic_file_write wasn't.  Now I see that all three of these functions apply
not only to Ext2 but to a majority of the filesystems, with the notable
exception of NTFS - can somebody tell me why that is?

Each of these functions works its magic through a 'mapping' defined by an
address_space struct.  In OO terminology this would be a memory mapping class
with instances denoting the mapping between a section of memory and a particular
kind of underlying secondary storage.  So what we have looks a lot like a
generalization of a swap file - all (or the vast majority of) file I/O in Linux
is now done by memory-mapping, whether you ask for it or not.  (This is good)
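
For reference, the operations table behind a mapping looks something like
this.  I'm paraphrasing from my reading of 2.4.0's include/linux/fs.h, so
treat the exact signatures as approximate and check the header itself:

  /* Paraphrased from include/linux/fs.h (2.4.0); not verbatim. */
  struct address_space_operations {
          int (*writepage)(struct page *);
          int (*readpage)(struct file *, struct page *);
          int (*sync_page)(struct page *);
          int (*prepare_write)(struct file *, struct page *,
                               unsigned, unsigned);
          int (*commit_write)(struct file *, struct page *,
                              unsigned, unsigned);
          int (*bmap)(struct address_space *, long);
  };

A filesystem that wants the generic functions just fills in this table; the
generic code never has to know how the blocks actually get on and off the
disk.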

Who was it that said "any problem, no matter how complex, can be solved by
adding another level of indirection"?  Anyway, we now have
inode->i_mapping->a_ops->readpage instead of 2.2's inode->i_op->readpage
(which, incidentally, leaves the comment above the function out of date).
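
In code, the extra hop looks something like this (a sketch - I'm going from
memory on the exact 2.2 signature, so take it as approximate):

  /* 2.2: readpage hangs directly off the inode's operations */
  error = inode->i_op->readpage(file, page);

  /* 2.4: one more level of indirection, through the mapping */
  error = inode->i_mapping->a_ops->readpage(file, page);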

The three functions in question make use of a 'page cache', which appears to
be a set of memory page 'heads' (my terminology for an instance of struct
page, since I haven't seen any other name so far) indexed by the offset
expressed in units of PAGE_SIZE (4096 on most architectures), munged together
with the address of the mapping struct.  In other words, we hash the page
offset within the file together with a unique characteristic of the file (the
address of its mapping struct), then do a linear search down the hash chain
to find a piece of memory that has already been set up to map the file/offset
we're interested in.
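
To make that concrete, here is an illustrative sketch of the kind of lookup I
mean.  The names and the bit-mixing below are invented for illustration; the
real code lives in include/linux/pagemap.h and mm/filemap.c and munges the
bits differently:

  #include <linux/mm.h>           /* struct page */
  #include <linux/fs.h>           /* struct address_space */

  /* Illustrative only - not the kernel's actual hash function. */
  #define PAGE_HASH_BITS  12
  #define PAGE_HASH_SIZE  (1 << PAGE_HASH_BITS)

  static struct page *page_hash_table[PAGE_HASH_SIZE];

  static unsigned long hashfn(struct address_space *mapping,
                              unsigned long index)
  {
          /* Munge the mapping's address together with the page index */
          unsigned long hash = (unsigned long) mapping + index;
          return (hash + (hash >> PAGE_HASH_BITS)) & (PAGE_HASH_SIZE - 1);
  }

  static struct page *lookup_page(struct address_space *mapping,
                                  unsigned long index)
  {
          struct page *page = page_hash_table[hashfn(mapping, index)];

          /* Linear search down the hash chain for an exact match */
          for (; page; page = page->next_hash)
                  if (page->mapping == mapping && page->index == index)
                          break;
          return page;
  }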

I've only looked at generic_file_read in any detail so far.  In simple terms,
its algorithm is:

  while there is more to transfer:
    look in the hash table for a page that maps the current offset
      if there isn't one, create one and add it to the page hash table
    check that the page is up to date
      if it isn't, read it in via the above-mentioned mapping function
    copy the page (or part) to user space via a 'read actor'
    advance by PAGE_SIZE 

This is considerably more complicated in the actual implementation because of
the requirements of concurrency and readahead optimization.
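
Stripped of all that, the core loop might be rendered in C roughly as
follows.  This is a sketch only: lookup_page is the toy function from above,
alloc_and_hash_page is a hypothetical helper I invented for brevity, and the
real code in mm/filemap.c is a good deal hairier:

  /* Sketch of generic_file_read's core loop; locking, error handling
   * and readahead are all omitted.  See mm/filemap.c for the truth. */
  static void sketch_generic_read(struct file *file, loff_t *ppos,
                                  read_descriptor_t *desc, read_actor_t actor)
  {
          struct address_space *mapping = file->f_dentry->d_inode->i_mapping;

          while (desc->count) {
                  unsigned long index  = *ppos >> PAGE_SHIFT;
                  unsigned long offset = *ppos & ~PAGE_MASK;
                  struct page *page = lookup_page(mapping, index);

                  if (!page)                /* miss: create and hash a new page */
                          page = alloc_and_hash_page(mapping, index);
                  if (!Page_Uptodate(page)) /* not yet read: pull it in */
                          mapping->a_ops->readpage(file, page);

                  /* The actor copies the page (or part of it) to user space
                   * and advances desc->buf, desc->count and desc->written. */
                  *ppos += actor(desc, page, offset, PAGE_SIZE - offset);
          }
  }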

As a next step, I hope to be able to state some of the conditions that have to
hold throughout the execution of the above algorithm.

Another thing I hope to be able to define pretty soon is the relationship
between the buffer cache and the page cache, including a few rules you have to
follow in order to keep the two consistent.  I suspect from the discussion I've
seen so far that this relationship is still evolving.  Nonetheless, in order to
be able to do filesystem work without breaking things, the rules have to be
stated and understood in a crystal-clear way.

OK, I'm within sight of being able to discuss what to do with that pesky final
block in a tail-merged file, or at least to understand the discussion that's
gone on so far.  The obvious place to handle the special requirements for the
last page would be in the above read algorithm.  The fly in that ointment:
doing so would require the VFS to be changed, IOW, work that should stay
within a particular filesystem wouldn't.

There are two obvious ways to do filesystem-specific special handling of the
tail block: (1) in the 'read actor' that does the actual copy-to-user (see? yet
another problem solved by an extra level of indirection!) or (2) in the
inode->i_mapping->a_ops->readpage function that retrieves the page: essentially
we would do a copy-on-write to produce a new page that has the tail fragment in
the correct place.
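
To illustrate (1), a filesystem could pass its own actor to the generic read
loop.  In the sketch below, page_in_tail and tail_read are invented names for
the filesystem-specific pieces; only the actor signature and the descriptor
bookkeeping follow the generic file_read_actor in mm/filemap.c:

  /* Hypothetical tail-aware read actor.  page_in_tail() and tail_read()
   * are made-up names standing in for filesystem-specific code. */
  static int tailfs_read_actor(read_descriptor_t *desc, struct page *page,
                               unsigned long offset, unsigned long size)
  {
          struct inode *inode = page->mapping->host;

          if (size > desc->count)
                  size = desc->count;

          if (page_in_tail(inode, page->index)) {
                  /* Copy the tail fragment from wherever the filesystem
                   * keeps it, rather than from the page itself. */
                  tail_read(inode, desc->buf, offset, size);
          } else {
                  char *kaddr = kmap(page);
                  if (copy_to_user(desc->buf, kaddr + offset, size))
                          desc->error = -EFAULT;
                  kunmap(page);
          }

          desc->count   -= size;
          desc->written += size;
          desc->buf     += size;
          return size;
  }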

I have a very distinct preference for (1), and I'll proceed on the assumption
that that's what I'll be doing unless someone shows me why it's bad.

Now that I know a little more about the page cache I'll have a good think about
what's going to happen when files start sharing tail blocks.

-- 
Daniel
