Hans Reiser wrote:
> Post your document on the reiserfs mailing list when you finish it,
> the ReiserFS team will enjoy reading it.

OK, and I hope your team will review it and post corrections - I'm
hoping somebody picks this up on the documentation lists and makes a
tutorial out of it, so it should be halfway technically accurate. 
Here's the next installment...

Previously I was talking about three functions:
 
 1) generic_file_read
 2) generic_file_write
 3) generic_file_mmap

and had this to say:

"Each of these functions works its magic through a 'mapping' defined by
an address_space struct.  In OO terminology this would be a memory
mapping class with instances denoting the mapping between a section of
memory and a particular kind of underlying secondary storage.  So what
we have looks a lot like a generalization of a swap file - all (or the
vast majority of) file I/O in Linux is now done by memory-mapping..."

"The three functions in question make use of a 'page cache', which
appears to be a set of memory page 'heads' (page head -> my terminology,
since I haven't seen any other so far -> an instance of struct page)
indexed by offset expressed in units of PAGE_SIZE (4096 on most
architectures), munged together with the address of the mapping struct. 
So we use the page offset within a file plus a unique characteristic of
the file (its mapping struct), followed by a linear search to find a
piece of memory that has already been set up to map the file/offset
we're interested in."

I gave the following algorithm for generic_file_read:

  while there is more to transfer:
    look in the hash table for a page that maps the current offset
      if there isn't one, create one and add it to the page hash table
    check that the page is up to date
      if it isn't, read it in via the above-mentioned mapping function
    copy the page (or part) to user space via a 'read actor'
    advance by PAGE_SIZE 

Nobody objected, and it seems to be correct as far as it goes.  It's not
the whole story, though.  Far from it!  I'll press on now.
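
Before pressing on, here is roughly what that loop looks like as C,
condensed from do_generic_file_read (mm/filemap.c).  Readahead, locking
and error handling are left out, and the helpers marked "paraphrase"
(lookup_page_in_hash, add_new_page_to_hash, wait_until_uptodate) are my
stand-ins for inline code in the real function, not kernel functions:

    /*
     * Condensed paraphrase of the loop in do_generic_file_read().
     * Needs linux/fs.h, linux/mm.h, linux/pagemap.h.  Assumes we never
     * try to read past the end of the file.
     */
    static void generic_file_read_sketch(struct file *file, loff_t pos,
                                         size_t count, read_descriptor_t *desc,
                                         read_actor_t actor)
    {
            struct address_space *mapping = file->f_dentry->d_inode->i_mapping;

            while (count > 0) {
                    unsigned long index  = pos >> PAGE_CACHE_SHIFT;  /* which page of the file */
                    unsigned long offset = pos & ~PAGE_CACHE_MASK;   /* where within that page */
                    unsigned long nr = PAGE_CACHE_SIZE - offset;     /* how much this page can give us */
                    struct page *page;

                    if (nr > count)
                            nr = count;

                    page = lookup_page_in_hash(mapping, index);          /* paraphrase: the hash probe */
                    if (!page)
                            page = add_new_page_to_hash(mapping, index); /* paraphrase: allocate + hash */

                    if (!Page_Uptodate(page)) {
                            mapping->a_ops->readpage(file, page);  /* read it in via the mapping */
                            wait_until_uptodate(page);             /* paraphrase: sleep on page->wait */
                    }

                    actor(desc, page, offset, nr);    /* the 'read actor' copies to user space */
                    pos += nr;
                    count -= nr;
                    page_cache_release(page);         /* drop the reference the lookup took */
            }
    }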

Understanding the mapping between pages and buffers is crucial to
understanding those three functions.  In (1) and (2) operations are
performed directly on the pages and buffers, and these operations are
quite easy to understand in spite of the fact that the function call
chain wanders around and goes very deep at times.  At least it's all
synchronous.

Function (3), mmap, does its work asynchronously, by specifying a set of
mapping operations that will be invoked later through the VM subsystem. 
The memory management header mm.h has this to say about that:

(http://lxr.linux.no/source/include/linux/mm.h?v=2.4.0-test2#L115)
110 /*
111  * These are the virtual MM functions - opening of an area, closing and
112  * unmapping it (needed to keep files on disk up-to-date etc), pointer
113  * to the functions called when a no-page or a wp-page exception occurs.
114  */
115 struct vm_operations_struct {
116         void (*open)
117         void (*close)
118         void (*unmap)
119         void (*protect)
120         int (*sync)
121         struct page * (*nopage)
122         struct page * (*wppage)
123         int (*swapout)
124 };
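
To make this concrete: a filesystem that relies on generic_file_mmap
ends up with an operations table along the following lines.  I'm
paraphrasing mm/filemap.c from memory - the real code may use more than
one table (e.g. for shared writable mappings) - but filemap_nopage is
the actual no-page handler, and the _sketch names are mine:

    /* Paraphrase: the generic file mapping needs little more than a
       no-page handler; the VM calls it when a process faults on an
       address inside the mapping and the page isn't resident yet. */
    static struct vm_operations_struct generic_file_vm_ops_sketch = {
            nopage: filemap_nopage,   /* find or read in the page backing the fault */
    };

    int generic_file_mmap_sketch(struct file *file, struct vm_area_struct *vma)
    {
            /* ... checks that the file can actually be mapped omitted ... */
            vma->vm_ops = &generic_file_vm_ops_sketch;  /* invoked later, from the fault path */
            return 0;
    }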

At this point I'm far from being able to say anything intelligent about
the details of how this works - that will require taking a fairly long
trip through the vm subsystem.  So I'll just leave it alone for now and
hope that by following existing practice in my own code I won't break
anything.  Emphasis on the "for now".

I'll take a closer look at how generic_file_read works.  The interesting
part is "read it in via the above-mentioned mapping function".  This
starts us on a merry chase:

  do_generic_file_read (mm/filemap.c) ->
     inode->i_mapping->a_ops->readpage ->
        ext2_readpage (fs/ext2/inode.c) ->
        block_read_full_page (fs/buffer.c) ->
           ext2_get_block (fs/ext2/inode.c) ->
              ext2_block_map (fs/ext2/inode.c) ->
                 inode_bmap
                 block_bmap
        ll_rw_block (drivers/block/ll_rw_blk.c)

Notice how the call chain crosses subsystem boundaries repeatedly:
starting in the memory manager, it first goes into the VFS (fs), then
into Ext2 (fs/ext2), back out to the VFS, then immediately back into
Ext2, stays there for a while, returns to the VFS (block_read_full_page)
and finally descends deep into the innards of the block driver subsystem.
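
The glue that lets the chain bounce between layers so cheaply is tiny.
Ext2's readpage method (quoted from memory, so treat it as a paraphrase)
does nothing but hand Ext2's own block-mapping function to the generic
helper in the VFS:

    /* fs/ext2/inode.c (paraphrased): the filesystem-specific part of
       reading a page is just the get_block callback. */
    static int ext2_readpage(struct file *file, struct page *page)
    {
            return block_read_full_page(page, ext2_get_block);
    }

Most of the back-and-forth above boils down to passing a function
pointer back down a level.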

In general, the VFS calls through function tables wherever functionality
must be filesystem-specific, and the specific filesystems call back into the VFS
whenever the VFS can handle something in a generic way, perhaps with the
help of some filesystem-specific function tables.  If that sounds subtle
and complex, it's because it is.  Fortunately, most of what is
interesting here happens in one function: block_read_full_page.
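
Boiled down, its shape is this.  This is a sketch of the structure only,
not the real function: locking, the partial block at end-of-file and
error handling are all omitted, and the _sketch suffix marks it as mine:

    /* Sketch of the shape of block_read_full_page() (fs/buffer.c):
       attach buffers to the page, map each one to a disk block via the
       filesystem's get_block, then submit the reads to the block layer. */
    static int block_read_full_page_sketch(struct page *page, get_block_t *get_block)
    {
            struct inode *inode = page->mapping->host;
            unsigned int blocksize = inode->i_sb->s_blocksize;
            unsigned long block = page->index * (PAGE_CACHE_SIZE / blocksize);
            struct buffer_head *bh, *head, *arr[MAX_BUF_PER_PAGE];
            int nr = 0;

            if (!page->buffers)                       /* no buffers yet: attach fresh ones */
                    create_empty_buffers(page, inode->i_dev, blocksize);
            head = bh = page->buffers;

            do {                                      /* walk the circular list of buffers */
                    if (!buffer_mapped(bh))           /* filesystem fills in bh->b_blocknr */
                            get_block(inode, block, bh, 0);
                    if (!buffer_uptodate(bh))         /* this block still needs reading */
                            arr[nr++] = bh;
                    block++;
                    bh = bh->b_this_page;
            } while (bh != head);

            if (nr)                                   /* hand the lot to the block layer */
                    ll_rw_block(READ, nr, arr);
            return 0;
    }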

The next essential part of the equation is the data structure
framework.  By knowing the main data structures we can understand what
those functions are trying to do to them, and when.  Roughly speaking,
there are two kinds of actors in this drama:

  - Memory Pages
  - File Buffers

In OO terminology, a memory page is an instance of class "MemoryPage"
(because this isn't C++, you won't see this class defined anywhere -
it's actually there, but only in the minds of the developers).  A
MemoryPage is instantiated by allocating a new piece of memory of type
"struct page".  In more concrete terms, you could call this a page
head.  In any event, a page object doesn't necessarily have any actual
memory assigned to it, virtual or physical.  Having memory assigned to
it is just one of the things that can happen in the life cycle of a page
object.

Similarly, we can think about the (imaginary) class FileBuffer,
instantiated by allocating and initializing a "struct buffer_head".  A
buffer object can be associated with a disk-block sized piece of memory,
and/or a physical block on disk during its life cycle.

Here is the key thing to know about the page cache and buffer cache in
2.4.0: whenever possible (and that now means "nearly always") page
objects and buffer objects share the same data memory.  This is the same
as saying "the page cache and buffer cache are now unified".

It took me a long time to figure out this simple thing.  I hope that by
writing this I can save all that effort for at least one other
developer.  Now I'll look at exactly how these objects share memory.

Here is the memory page struct definition:

(http://lxr.linux.no/source/include/linux/mm.h?v=2.4.0-test2#L143)
143 typedef struct page {
144         struct list_head list;
145         struct address_space *mapping;
146         unsigned long index;
147         struct page *next_hash;
148         atomic_t count;
149         unsigned long flags;    /* atomic flags, some possibly updated asynchronously */
150         struct list_head lru;
151         wait_queue_head_t wait;
152         struct page **pprev_hash;
153         struct buffer_head * buffers;
154         unsigned long virtual; /* nonzero if kmapped */
155         struct zone_struct *zone;
156 } mem_map_t;

I'm only interested in a few fields at the moment:
  - buffers - list of buffers sharing the memory assigned to this page
  - next_hash - hash chain link used to find the page for a given file and offset

To find the page (if any) that's mapping a given part of a file we
compute a hash as I described before, incorporating both the offset
within the file and the address of the file's memory mapping.  (Clever,
huh?)  Then we search the hash bucket (i.e., the collision list) using
the following algorithm:

   Repeat:
      If we're at the end of the hash list, quit and fail out.
      Does this page belong to the file we're interested in?
      (i.e., does it have the same instance of struct address_space?)
         If so, is it mapped to the part of the file we're
         interested in? (i.e., does it have the correct offset,
         expressed in units of memory pages?)
            If so, success: quit and return it.
      Follow the link to the next page in this bucket.
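
In C the probe looks roughly like this.  This is an illustrative sketch,
not the kernel's exact code - in particular, the real hash function in
pagemap.h munges the bits more carefully than anything shown here:

    /* Sketch of the page cache probe.  'bucket' is the head of the hash
       chain selected by munging the mapping's address together with the
       page-sized offset (index). */
    static struct page *find_page_sketch(struct page *bucket,
                                         struct address_space *mapping,
                                         unsigned long index)
    {
            struct page *page;

            for (page = bucket; page; page = page->next_hash) {
                    if (page->mapping != mapping)     /* belongs to some other file */
                            continue;
                    if (page->index != index)         /* right file, wrong offset */
                            continue;
                    return page;                      /* found: maps the file+offset we want */
            }
            return NULL;                              /* fail out: not in the page cache */
    }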

This is how generic_file_read associates memory pages with file
locations.  Easy.  Now, how are buffers associated with (1) file+offset
and (2) memory pages?

Also easy:  The page has a (circular) list of buffers attached to it. 
Each file buffer is used to access one of the physical file blocks
associated with a memory page, which has most probably been mapped to a
file by one of the three generic file i/o functions being discussed
here.
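
Because the list is circular, the idiom for walking it is a do/while
that follows b_this_page until it comes back to the first buffer (the
do_something callback here is a hypothetical stand-in):

    /* Visit every buffer attached to a page; the list is circular, so
       stop when we're back at the first buffer. */
    static void for_each_buffer_sketch(struct page *page,
                                       void (*do_something)(struct buffer_head *))
    {
            struct buffer_head *bh, *head;

            head = bh = page->buffers;
            do {
                    /* each bh covers one block-sized slice of the page's
                       memory; bh->b_blocknr says where it lives on disk */
                    do_something(bh);
                    bh = bh->b_this_page;
            } while (bh != head);
    }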

Now I'll back up and give a short description of the structure of a file
buffer.  Basically, a file buffer is a node on a hash list that points
at a piece of memory that can store one block of a file.  File buffers
have a lot of history behind them, and they've had a lot of time to
accumulate fields that are used for various purposes.  Here is the file
buffer head struct declaration in all its glory:

(http://lxr.linux.no/source/include/linux/fs.h#L201)
220 struct buffer_head {
221         /* First cache line: */
222         struct buffer_head *b_next;     /* Hash queue list */
223         unsigned long b_blocknr;        /* block number */
224         unsigned short b_size;          /* block size */
225         unsigned short b_list;          /* List that this buffer appears */
226         kdev_t b_dev;                   /* device (B_FREE = free) */
227 
228         atomic_t b_count;               /* users using this block */
229         kdev_t b_rdev;                  /* Real device */
230         unsigned long b_state;          /* buffer state bitmap (see above) */
231         unsigned long b_flushtime;      /* Time when (dirty) buffer should be written */
232 
233         struct buffer_head *b_next_free;/* lru/free list linkage */
234         struct buffer_head *b_prev_free;/* doubly linked list of buffers */
235         struct buffer_head *b_this_page;/* circular list of buffers in one page */
236         struct buffer_head *b_reqnext;  /* request queue */
237 
238         struct buffer_head **b_pprev;   /* doubly linked list of hash-queue */
239         char * b_data;                  /* pointer to data block (512 byte) */
240         struct page *b_page;            /* the page this bh is mapped to */
241         void (*b_end_io)(struct buffer_head *bh, int uptodate); /* I/O completion */
242         void *b_dev_id;
243 
244         unsigned long b_rsector;        /* Real buffer location on disk */
245         wait_queue_head_t b_wait;
246         struct kiobuf * b_kiobuf;       /* kiobuf which owns this IO */
247 };

The fields relevant to this discussion are:
  
  b_page - the memory page, if any, that has this buffer on its list
  b_next - hash chain link used to find the buffer for a given device and block

When looked at this way, file buffers and memory pages look pretty
symmetric.  This isn't an accident.  To find a piece of a file in memory
you associate through the page cache.  To find a piece of a file on disk
you associate through the buffer cache.  A buffer object is defined by a
buffer head, and a page object is defined by a [page head] (my
terminology).  The two caches are tied together by having the two kinds
of object heads point at the same piece of memory, and by having them
point at each other.  Simple, huh?  ;-)
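
For the simple case of one 4K buffer on a 4K page, the cross links can
be written down in a few lines.  This is a sketch of the invariants,
assuming blocksize == PAGE_SIZE and that page_address() is valid for
this page (i.e., it's kmapped):

    /* Sketch: check the unified-cache invariants for a page that has a
       single buffer covering the whole page. */
    static void check_unified_page_sketch(struct page *page)
    {
            struct buffer_head *bh = page->buffers;

            if (bh->b_page != page)                         /* the two heads point at each other */
                    BUG();
            if (bh->b_data != (char *) page_address(page))  /* ...and share the same data memory */
                    BUG();
            if (bh->b_this_page != bh)                      /* a circular list of exactly one buffer */
                    BUG();
    }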

I'll note now that memory page size and file block size are often the
same, i.e., 4K, so there will be exactly one buffer on a page's buffer
list.  This is a simple situation that makes everything easy to
understand, but it's not the only possible arrangement.  Before hard
disks grew to today's astronomical sizes, 1K physical block sizes were
common.  In the future (after the problem of disk wastage is solved by
tail merging:-) file block sizes much larger than memory page sizes will
become common.  The VFS and MM systems must deal efficiently with all
the implied combinations, and this is a partial explanation for the
complexity of these systems.

One final note before I go home: A file buffer does not necessarily have
to be associated with a page that's mapped to a file.  I'm going to rely
on this in my implementation of tail merging, where I break the
lovely one-to-one symmetry of physical block <=> memory block.

-- 
Daniel
