[EMAIL PROTECTED] writes:
> Previously I was talking about three functions:
>  
>  1) generic_file_read
>  2) generic_file_write
>  3) generic_file_mmap

Here are some comments of mine, just because I'd like to check my
own understanding of this sort of thing, and trying to explain it in
writing is a good way of doing so.

> and had this to say:
> 
> "Each of these functions works its magic through a 'mapping' defined by
> an address_space struct.  In OO terminology this would be a memory
> mapping class with instances denoting the mapping between a section of
> memory and a particular kind of underlying secondary storage.  So what
> we have looks a lot like a generalization of a swap file - all (or the
> vast majority of) file I/O in Linux is now done by memory-mapping..."
> 
> "The three functions in question make use of a 'page cache', which
> appears to be a set of memory page 'heads' (page head -> my terminology,
> since I haven't seen any other so far -> an instance of struct page)

The "struct page" represents (and is the unit of currency) of the
physical page. You mention the idea of a MemoryPage lower down in the
context of allocating memory but the struct page is even more basic.
Every single physical page of "real" memory accessible to the kernel
has/is a "struct page". Since there's no architecture-portable way of
tacking on extra bits of data to a physical page, Linux does it by
having a separate array (mem_map[]) indexed by page frame number and
mem_map[1000] is a structure containging various bits and pieces of
information associated with page frame number 1000. In passing, an
architecture like S390 *does* in fact store some extra fields for
each physical page: a key and some ref/access bits. So thinking of a
struct page as a "real physical page" along with some private
attributes (in an OO sense) is a good idea.
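
To make the mem_map[] relationship concrete, here's a minimal sketch
(my own illustration; it assumes a flat mem_map[] with none of the
discontiguous-memory complications, and the helper names are made up):

    #include <linux/mm.h>

    /* Sketch only: the frame number <-> struct page relationship. */
    struct page *frame_to_page(unsigned long pfn)
    {
            return &mem_map[pfn];     /* mem_map[] is indexed by frame number */
    }

    unsigned long page_to_frame(struct page *page)
    {
            return page - mem_map;    /* pointer arithmetic recovers the pfn */
    }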

A struct page represents a physical page of memory and has nothing to
do with addresses: one may not even be able to directly write to that
physical page from the kernel. For example, with PAE36 on x86, "high"
memory pages (roughly speaking those beyond about page number 256000,
or 1GB into physical memory) aren't mapped directly in the kernel's
address space. They are reached by mapping them temporarily with a
sequence of calls roughly like
    addr = kmap(page);
    ... can now use address addr to write/read to/from the page
    kunmap(page);
or else they are mapped into a process address space with
remap_page_range. The "addr = kmap(); ...; kunmap()" way of
accessing memory reminds me of the PalmOS way of allocating and
reading/writing memory where you are only granted temporary
permission to write to memory, possibly via a different virtual
address each time. It's a nice way to separate physical and
virtual memory both as an API and while thinking about it.
Obviously, Linux uses various caches and cleverness behind the
scenes to avoid too much of a map/unmap/map/unmap performance hit.
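
As a concrete (made-up) example of that kmap pattern, a little 2.4-ish
helper that copies data into a page which may live in high memory
would look roughly like:

    #include <linux/highmem.h>
    #include <linux/string.h>

    /* Hypothetical helper: the page may have no permanent kernel
     * mapping, so map it temporarily around the access. */
    static void copy_into_page(struct page *page, const void *src, size_t len)
    {
            char *addr = kmap(page);   /* usable kernel virtual address */
            memcpy(addr, src, len);    /* ordinary access while mapped */
            kunmap(page);              /* hand the temporary mapping back */
    }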

> indexed by offset expressed in units of PAGE_SIZE (4096 on most
> architectures), munged together with the address of the mapping struct. 
> So we use the page offset within a file plus a unique characteristic of
> the file (its mapping struct), followed by a linear search to find a
> piece of memory that has already been set up to map the file/offset
> we're interested in."

The page cache is simply a fast hash lookup from
(address_space, offset) to the struct page holding that data. It used
to be keyed by (inode, offset) but, from an OO point of view, there's
no reason why the "lookup key" needs to be an inode, and so inventing
a new type of "address_space" object (an inode's data being one
example) works nicely.
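
Stripped of the locking and reference counting, the lookup amounts to
something like this (a deliberately simplified sketch, not the real
code in mm/filemap.c; hash_fn stands in for the real hashing macro):

    /* Simplified sketch of the (mapping, offset) -> struct page lookup. */
    static struct page *lookup_in_page_cache(struct address_space *mapping,
                                             unsigned long offset)
    {
            struct page *page = page_hash_table[hash_fn(mapping, offset)];

            for (; page != NULL; page = page->next_hash) {
                    if (page->mapping == mapping && page->index == offset)
                            return page;    /* already cached in memory */
            }
            return NULL;                    /* not cached: must be read in */
    }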

> Understanding the mapping between pages and buffers is crucial to
> understanding those three functions.  In (1) and (2) operations are
> performed directly on the pages and buffers, and these operations are
> quite easy to understand in spite of the fact that the function call
> chain wanders around and goes very deep at times.  At least it's all
> synchronous.

Being synchronous may make the coding simpler (and hence possibly
easier to follow when reading through it) but I find the underlying
abstractions easier to understand without worrying about whether
it's asynchronous/synchronous. Basically, you can
  (a) initiate I/O to read or write a physical page,
  (b) ask whether a particular page is uptodate/dirty,
  (c) do a blocking wait for a particular page to become up to date, and
  (d) (sort of) ask to have your own function called back when an I/O
      you initiated is complete.
Thinking about it that way lets you separate the vm, filesystem,
and I/O request subsystems without them interfering with each
other. Again, nice from the OO point of view.
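
In 2.4-ish terms, (a)-(c) at the page level look roughly like the
sketch below (the wrapper function is mine; a snippet of the real
interface appears further down, and (d) shows up later as the
b_end_io callback on buffer_heads):

    #include <linux/fs.h>
    #include <linux/mm.h>
    #include <linux/pagemap.h>
    #include <linux/errno.h>

    /* Sketch: initiate a read, then block (if needed) until it completes. */
    static int read_page_and_wait(struct file *file, struct page *page)
    {
            struct address_space *mapping = page->mapping;

            if (mapping->a_ops->readpage(file, page))    /* (a) initiate I/O */
                    return -EIO;

            if (!Page_Uptodate(page))                    /* (b) done already? */
                    wait_on_page(page);                  /* (c) blocking wait */

            return Page_Uptodate(page) ? 0 : -EIO;
    }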

> 115 struct vm_operations_struct {
> 116         void (*open)
> 117         void (*close)
> 118         void (*unmap)
> 119         void (*protect)
> 120         int (*sync)
> 121         struct page * (*nopage)
> 122         struct page * (*wppage)
> 123         int (*swapout)
> 124 };
> 
> At this point I'm far from being able to say anything intelligent about
> the details of how this works - that will require taking a fairly long
> trip through the vm subsystem.  So I'll just leave it alone for now and
> hope that by following existing practice in my own code I won't break
> anything.  Emphasis on the "for now".

A vma is (represents) roughly a contiguous area of address space
within a process. More precisely, each process(/task/thread) has a
pointer to an "mm" object (struct mm_struct) which represents its
address space. The main attributes of the mm object are
(1) (pointers to) the set of vma objects within the address space
    (via some fancy AVL structures and algorithms to do fast
    lookups/insertions/split/merge etc.)
(2) A pointer to its page tables
(3) various numbers and stats
(4) an architecture-dependent bit
When a task is running and a page fault happens because it tried to
access an address which the page tables said didn't have a
corresponding physical page, the low-level Linux trap handlers
look up which vma "owns" that address (i.e. find the vma whose
(start,end) addresses surround the faulting address) and call the
nopage method on that object. (You'd think that if the process got
a "write-protect fault" it would call the "wppage" method, but
in fact Linux handles such faults in a vma-independent way and the
wppage method member isn't even defined in 2.4.) The nopage method
is supposed to arrange for a real physical page of memory (a struct
page) to contain whatever the process thought it was writing to or
reading from, and to return that struct page. The main fault handler
then fixes up the mm's page tables to point at that physical page
and returns out of the fault handler so that the userland process
can retry accessing the address it faulted on: successfully this
time, though.
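
For example, a driver that maps its own buffer into userspace might
implement nopage roughly like this (names entirely made up, 2.4-style
signature, locking and suchlike omitted; the exact nopage signature
varies a little between kernel versions):

    #include <linux/mm.h>

    struct mydrv_buf {                      /* hypothetical per-mapping state */
            unsigned long size;
            struct page **pages;
    };

    static struct page *mydrv_nopage(struct vm_area_struct *vma,
                                     unsigned long address, int write)
    {
            struct mydrv_buf *buf = vma->vm_private_data;   /* set at mmap time */
            unsigned long offset = address - vma->vm_start; /* offset into vma */
            struct page *page;

            if (offset >= buf->size)
                    return NULL;            /* out of range: process gets SIGBUS */

            page = buf->pages[offset >> PAGE_SHIFT];
            get_page(page);                 /* fault handler expects a held reference */
            return page;                    /* it gets wired into the page tables */
    }

    static struct vm_operations_struct mydrv_vm_ops = {
            nopage: mydrv_nopage,
    };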

One sort of vma is the "filemap" vma which corresponds to mapping a
file into a process address space. Its methods are implemented in
mm/filemap.c.  filemap_nopage is passed the faulting address, from
which it calculates the offset of the fault within the vma, and from
that the offset within the file (since mmap() lets you specify an
offset from the start of the file). From that, filemap_nopage works
out where in the page cache the required page should live and sees
if it's there. If it is (and if the process is either only reading
or is writing to a shared mapping) then it returns that struct page.
If it's not in the page cache, it asks the address space (which asks
the filesystem, but never mind that here) to load the appropriate
page into the page cache and returns that (or possibly a copy if
it's writing to a COW mapping).
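
The offset arithmetic it does is roughly this (my wrapper function;
vm_pgoff is the real 2.4 field holding the mmap() file offset in
page-sized units):

    /* Sketch: turn a faulting address into a page index within the file. */
    static unsigned long fault_to_file_index(struct vm_area_struct *vma,
                                             unsigned long address)
    {
            unsigned long pgoff;

            pgoff = (address - vma->vm_start) >> PAGE_SHIFT; /* page within vma */
            pgoff += vma->vm_pgoff;       /* plus the offset handed to mmap() */
            return pgoff;                 /* index used for the page cache lookup */
    }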

Now the nice thing about all of that is that there's been no need to
mention the filesystem or I/O requests or block devices at all. It's
all modular and separate. The interface at this level is

        if (!mapping->a_ops->readpage(file, page)) {
                wait_on_page(page);
                if (Page_Uptodate(page))
                        goto success;
        }

so you can see there the "initiate read" and "wait for page to be
uptodate" operations I mentioned earlier. (You get woken if there's
an I/O error too, hence the check for Page_Uptodate, but never mind
that here.)

> I'll take a closer look at how generic_file_read works.  The interesting
> part is "read it in via the above-mentioned mapping function".  This
> starts us on a merry chase:
> 
>   do_generic_file_read (mm/filemap.c) ->
>      inode->i_mapping->a_ops->readpage ->
>         ext2_readpage (fs/ext2/inode.c) ->
>         block_read_full_page (fs/buffer.c) ->
>            ext2_get_block (fs/ext2/inode.c) ->
>               ext2_block_map (fs/ext2/inode.c) ->>
>                  inode_bmap
>                  block_bmap
>         ll_rw_block (drivers/block/ll_rw_blk.c)
> 
> Notice how the call chain crosses subsystem boundaries repeatedly:
> starting in the memory manager, it first goes into the VFS (fs), then
> into Ext2 (fs/ext2), back out to VFS then immediately back into Ext2,
> stays there for a while, returns to the VFS (block_read_full_page) and
> finally descends deep into the innards of the block driver subsystem.
> 
> In general, the VFS calls through function tables wherever functionality
> must be filesystem-specific, and the specific filesystems call the VFS
> whenever the VFS can handle something in a generic way, perhaps with the
> help of some filesystem-specific function tables.  If that sounds subtle
> and complex, it's because it is.

It's less complex if you think of the "backward calls" as "callbacks".
OK, it's only wording, but the intention is clearer. I'm less
familiar with 2.2/2.4 than 2.0 here so I'll sketch how it used to be
done, for simplicity. The filesystem has been told to read a page's
worth of file and put the result into a given page (or vice versa).
Now often that consists of:
  1) split the page into blocks (e.g. 4 1K blocks on a 4K page):
     say we want logical blocks 12, 13, 14, 15 of the file to address
     0xd0000000.
  2) find the physical disk blocks corresponding to those blocks:
     say 7000, 7001, 9004, 9005 (maybe no longer contiguous)
  3) do an I/O request to get the block layer to read those blocks
     to address 0xd0000000.
  4) wait for the I/O to complete and return
Now only step (2) there is filesystem-specific, so you can have a
generic function do all of it except (2) and have the filesystem
provide a logical-block-to-physical-block method, called bmap, to
map 12->7000, 13->7001, 14->9004, 15->9005. In 2.2 and 2.4 it's a
bit different and more complex (and ext2 doesn't use some of the
generic functions, either to optimise things or because the generic
ones haven't quite caught up yet, probably; I need to look and see
why) so the call trace looks less "clean".
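
In 2.4 that per-filesystem step (2) hook is a "get_block" function
which the generic block_read_full_page calls for each block of the
page, so the wiring looks roughly like this (paraphrased, not
verbatim ext2 code; the myfs_* names and myfs_lookup_block are made
up):

    #include <linux/fs.h>

    /* Hypothetical filesystem-specific mapping of a logical block of
     * the file to a physical block on the device (this is where ext2
     * walks its indirect blocks). */
    extern unsigned long myfs_lookup_block(struct inode *inode, long iblock);

    static int myfs_get_block(struct inode *inode, long iblock,
                              struct buffer_head *bh_result, int create)
    {
            bh_result->b_dev = inode->i_dev;
            bh_result->b_blocknr = myfs_lookup_block(inode, iblock);
            bh_result->b_state |= (1UL << BH_Mapped);   /* mapping is valid */
            return 0;
    }

    /* The generic code does the rest: split the page into blocks, call
     * myfs_get_block for each, issue the I/O and handle completion. */
    static int myfs_readpage(struct file *file, struct page *page)
    {
            return block_read_full_page(page, myfs_get_block);
    }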

>   - Memory Pages
>   - File Buffers
> 
> In OO terminology, a memory page is an instance of class "MemoryPage"
> (because this isn't C++, you won't see this class defined anywhere -
> it's actually there, but only in the minds of the developers).  A
> MemoryPage is instantiated by kalloc'ing a new piece of memory of type
> "struct page".  In more concrete terms, you could call this a page
> head.  In any event, a page object doesn't necessarily have any actual
> memory assigned to it, virtual or physical.  Having memory assigned to
> it is just one of the things that can happen in the life cycle of a page
> object.

No, a struct page object really does correspond one-to-one with a
physical page. Always.

> Similarly, we can think about the (imaginary) class FileBuffer,
> instantiated by allocating and initializing a "struct buffer_head".  A
> buffer object can be associated with a disk-block sized piece of memory,
> and/or a physical block on disk during its life cycle.

This is where history and abstraction start clashing a bit. In the
old days, the buffer cache used buffer_head structures to
(1) represent disk blocks, (2) cache them and (3) do I/O. That's
really mixing three abstractions. These days, the buffer cache in
sense (2) is still used a little--to hold metadata for some
filesystems--although I think there's a suggestion that the page
cache can/should be used for that in some cases where (a) you can
find a convenient index (e.g. negative block numbers) and (b) the
fact that it's page-sized buffers doesn't kill you. However, the
main use for buffer_heads now is to do I/O (i.e. sense (3)) even
when they're not really buffers held in the buffer cache. There's a
new way to do I/O coming along (kiobufs) and that should separate
the abstractions nicely again.

> Now I'll back up and give a short description of the structure of a file
> buffer.  Basically, a file buffer is a node on a hash list that points
> at a piece of memory that can store one block of a file.  File buffers
> have a lot of history behind them, and they've had a lot of time to
> accumulate fields that are used for various purposes.  Here is the file
> buffer head struct declaration in all its glory:
[...]
> The fields relevant to this discussion are:
>   
>   page - the memory page, if any, that has this buffer in its list
>   b_next - associates the buffer with a given block of a given device
> 
> When looked at this way, file buffers and memory pages look pretty
> symmetric.  This isn't an accident.  To find a piece of a file in memory
> you associate through the page cache.  To find a piece of a file on disk
> you associate through the buffer cache.  A buffer object is defined by a
> buffer head, and a page object is defined by a [page head] (my
> terminology).  The two caches are tied together by having the two kinds
> of object heads point at the same piece of memory, and by having them
> point at each other.  Simple, huh?  ;-)

The nasty thing is that the structure for holding cached disk blocks
is (currently) the same as the structure used for doing I/O. That's
partly why there are so many fields in struct buffer_head and why the
I/O and buffer cache abstractions are a bit mixed up. For the I/O
subsystem, you issue a request
     ll_rw_block(dir, n, &bh);
where dir is READ or WRITE and bh is a list of n buffer_heads.
In each buffer head you fill in the device you want to read/write,
the block number you want to read/write, the block size, the
source/destination address and a function to be called back when
the I/O is complete (b_end_io). You call ll_rw_block and the I/O
subsystem gets on with your request (by doing the appropriate stuff
and calling the block device layer). When it's done what you asked,
it calls back your function. Note that nowhere in that description
is the buffer cache mentioned or necessary. In fact, I/O is done
directly to the page cache by pointing the source/destination
addresses of the buffer_heads into the mapped page cache page. Look
at block_read_full_page (in 2.4) which does
        if (!page->buffers)
                create_empty_buffers(page, inode, blocksize);
and create_empty_buffers calls create_buffers which calls
set_bh_page which sets
                bh->b_data = (char *)(page_address(page) + offset);
So the buffer_heads aren't "pointing into" space managed by the
buffer cache at all: they're pointing directly into the page cache
pages.

In 2.4, sct has come up with a new I/O-related object called a
kiobuf (see iobuf.h). It is a general "container object" holding the
(physical) pages on which you want to do I/O; those pages can be
kernel or user pages, and the kiobuf has an end_io callback. A
kiovec structure puts together a list of those objects (e.g. for
scatter/gather). In the new scheme of doing things, you won't
need to worry about the mixing-up of the buffer cache and the I/O
subsystem because you call
    int brw_kiovec(int rw, int nr, struct kiobuf *iovec[], 
                   kdev_t dev, unsigned long b[], int size)
(see fs/buffer.c). (The other place where subsystems "meet" is
where the I/O subsystem meets the buffer cache in bread(), which
reads a block up through the buffer cache given a block number.)
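
As a concrete example of buffer_heads in their I/O-descriptor role,
here's a sketch of reading one block synchronously the classical way
(more or less what bread() does for you internally, with the error
handling pared down):

    #include <linux/fs.h>
    #include <linux/locks.h>

    static struct buffer_head *read_one_block(kdev_t dev, int block, int size)
    {
            /* getblk finds/creates the buffer cache entry for (dev, block);
             * it may or may not already contain valid data. */
            struct buffer_head *bh = getblk(dev, block, size);

            if (buffer_uptodate(bh))
                    return bh;          /* cache hit: no I/O needed */

            /* Hand the descriptor to the I/O subsystem: it locks the
             * buffer, queues the request, and the b_end_io completion
             * callback later marks it up to date and unlocks it. */
            ll_rw_block(READ, 1, &bh);

            wait_on_buffer(bh);         /* sleep until that callback has run */

            if (buffer_uptodate(bh))
                    return bh;          /* bh->b_data now holds the block */

            brelse(bh);                 /* I/O error: drop our reference */
            return NULL;
    }

Code that wants its own completion behaviour sets b_end_io itself
and submits the buffer_heads directly; as far as I can see that's
essentially what block_read_full_page does for page cache pages.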

I hope this has been useful for someone other than me (I've
firmed up some stuff in my mind and reminded myself where 2.4 has
hidden/renamed a few things by writing this, so it's been useful
to me even if no one else reads it :-) Thanks for provoking me
into brain-dump-to-email mode.

--Malcolm


-- 
Malcolm Beattie <[EMAIL PROTECTED]>
Unix Systems Programmer
Oxford University Computing Services
