| Back in October, 2006, LWN covered the proposed
fault() method for virtual memory areas. This API change was
put forward as part of a fix for an obscure (but real) race condition
within the kernel. Such a fix would seem important, but, even so, it
took
the better part of a year for fault() to make it into the
mainline. Now that the patch has been merged for 2.6.23, it is worth
taking a look at the API which was adopted.
A virtual memory area (VMA) in the kernel represents a piece
of a process's
virtual address space. Each VMA is mapped in its own way; most VMAs are
mapped to files on the disk, but there are also anonymous VMAs (mapped
to
swap space, for all practical purposes), device memory mappings, and
more.
Each VMA must provide a handler for situations where a specific page in
that VMA is not resident in main memory; the handler must rectify the
situation or let the kernel know that it cannot be done. In most cases,
the nopfn() or older (but more heavily used) nopage()
methods fill that bill. They are called with the offset of the missing
page within the VMA and are expected to return a pointer to the
page structure for the missing page. For more complicated cases,
nonlinear VMAs in particular, the populate() method is invoked
instead.
The existence of three functions to perform the same task
suggests that
requirements have changed over time and that a cleanup is overdue. When
none of those interfaces are able to be extended to prevent a race
condition, the pressure for a new approach can only get stronger. That
new
approach, as created by Nick Piggin, is the fault() method, which
should, eventually, replace all three of the others. The prototype for
fault() is:
int (*fault)(struct vm_area_struct *vma, struct vm_fault *vmf);
Most of the information of interest can be found in the new
vm_fault structure, which looks like this:
struct vm_fault {
unsigned int flags;
pgoff_t pgoff;
void __user *virtual_address;
struct page *page;
};
The fault() method should, like its predecessors, arrange for the
missing page to exist and return its address to the kernel. The
interface
used is rather more flexible, though.
The offset of the missing page can be found in the pgoff
field.
Fault handlers can also find the corresponding user-space address in
virtual_address, but anybody who is tempted to use that field
should be prepared to justify that use to a crowd of skeptical kernel
developers. Most handlers should not care where the page lives in user
space, and use of virtual_address will make it impossible to
support nonlinear VMAs. So, if at all possible, virtual_address
should be ignored. If your code only uses pgoff, it should also
set the VM_CAN_NONLINEAR flag in the VMA's vm_flags field
to let the kernel know that it is playing by the rules.
The flags field has two possible flags:
- FAULT_FLAG_WRITE indicates that the page fault happened on
a write access.
- FAULT_FLAG_NONLINEAR says that the given VMA is a
nonlinear mapping.
After fault() has done its work, it should store a pointer to the
page structure for the faulted-in page in the page field
- but see below for an exception. The return value from fault()
is a set of flags which can indicate a number of things:
- VM_FAULT_OOM: the fault could not be handled because the
handler was unable to allocate the required memory.
- VM_FAULT_SIGBUS: the page offset is out of range, so the
fault could not be handled.
- VM_FAULT_MAJOR: marks a "major" page fault - usually one
which required reading data from disk.
- VM_FAULT_WRITE: a copy-on-write mapping was broken to
satisfy the fault.
- VM_FAULT_NOPAGE: set if the handler has installed the page
table entry directly. In this case, the page field returned in the
vm_fault structure has no meaning. Among other uses, this flag allows
fault() to be used with mappings that have no associated page
structures - mappings of device memory, for example.
- VM_FAULT_LOCKED: the returned page has been locked by the
handler and should be unlocked by the caller. It is used with
file-backed mappings to prevent races with other parts of the kernel
which may be trying to access the same page.
All callers of the populate() VMA operation have been changed, and
that method no longer exists. There is an entry in the feature removal
schedule for nopage() indicating that it will go away "as soon as
possible." The kernel still has a number of nopage()
implementations, though, so getting rid of it may take a little while
yet.
Longer-term plans call for the removal of nopfn() as well, though
no date has been set for this change. Certainly any new code which
implements mmap() should be written to handle faults with
fault() rather than one of the older functions.
|