http://linux-security.cn/index.php?option=com_content&task=view&id=1474&Itemid=42

fault()
2007-08-23
Back in October, 2006, LWN covered the proposed fault() method for virtual memory areas. This API change was put forward as part of a fix for an obscure (but real) race condition within the kernel. Such a fix would seem important, but, even so, it took the better part of a year for fault() to make it into the mainline. Now that the patch has been merged for 2.6.23, it is worth taking a look at the API which was adopted.

A virtual memory area (VMA) in the kernel represents a piece of a process's virtual address space. Each VMA is mapped in its own way; most VMAs are mapped to files on the disk, but there are also anonymous VMAs (mapped to swap space, for all practical purposes), device memory mappings, and more. Each VMA must provide a handler for situations where a specific page in that VMA is not resident in main memory; the handler must rectify the situation or let the kernel know that it cannot be done. In most cases, the nopfn() or older (but more heavily used) nopage() methods fill that bill. They are called with the offset of the missing page within the VMA and are expected to return a pointer to the page structure for the missing page. For more complicated cases, nonlinear VMAs in particular, the populate() method is invoked instead.

The existence of three functions to perform the same task suggests that requirements have changed over time and that a cleanup is overdue. When none of those interfaces can be extended to prevent a race condition, the pressure for a new approach only grows. That new approach, created by Nick Piggin, is the fault() method, which should eventually replace all three of the others. The prototype for fault() is:

    int (*fault)(struct vm_area_struct *vma, struct vm_fault *vmf);

Most of the information of interest can be found in the new vm_fault structure, which looks like this:

    struct vm_fault {
        unsigned int flags;
        pgoff_t pgoff;
        void __user *virtual_address;
        struct page *page;
    };

The fault() method should, like its predecessors, arrange for the missing page to exist and return its address to the kernel. The interface used is rather more flexible, though.

The offset of the missing page can be found in the pgoff field. Fault handlers can also find the corresponding user-space address in virtual_address, but anybody who is tempted to use that field should be prepared to justify that use to a crowd of skeptical kernel developers. Most handlers should not care where the page lives in user space, and use of virtual_address will make it impossible to support nonlinear VMAs. So, if at all possible, virtual_address should be ignored. If your code only uses pgoff, it should also set the VM_CAN_NONLINEAR flag in the VMA's vm_flags field to let the kernel know that it is playing by the rules.
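As a rough illustration of a handler that plays by those rules, here is a minimal sketch which compiles in user space. Everything prefixed with demo_ is hypothetical, the structures are cut-down stand-ins for the kernel's own, the numeric value given to VM_CAN_NONLINEAR is a placeholder, and returning 0 for "handled, nothing special to report" is an assumption. Reference counting and locking are left out entirely.

```c
#include <assert.h>
#include <stddef.h>

/* User-space stand-ins for kernel types; illustrative only. */
typedef unsigned long pgoff_t;
struct page { int id; };

struct vm_fault {
    unsigned int flags;
    pgoff_t pgoff;
    void *virtual_address;      /* __user annotation dropped in this mock */
    struct page *page;
};

#define VM_CAN_NONLINEAR 0x08000000UL   /* placeholder value for the mock */

struct vm_area_struct {
    unsigned long vm_flags;
    void *vm_private_data;
};

/* Hypothetical driver state: a fixed array of backing pages. */
struct demo_buffer {
    struct page *pages;
    pgoff_t npages;
};

/* Locate the page from pgoff alone; virtual_address is never consulted. */
static int demo_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
{
    struct demo_buffer *buf = vma->vm_private_data;

    /* Stand-in for range checking; a real handler would report an error. */
    assert(vmf->pgoff < buf->npages);

    vmf->page = &buf->pages[vmf->pgoff];
    return 0;   /* assumed: no flags set means the fault was handled */
}

/* Hypothetical mmap() hook: advertise that the handler only uses pgoff. */
static int demo_mmap(struct vm_area_struct *vma, struct demo_buffer *buf)
{
    vma->vm_private_data = buf;
    vma->vm_flags |= VM_CAN_NONLINEAR;
    return 0;
}
```

Because the handler never looks at virtual_address, the same code would work no matter where in the process's address space a given page ends up, which is the property that VM_CAN_NONLINEAR advertises.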

Two flags can be set in the flags field:

  • FAULT_FLAG_WRITE indicates that the page fault happened on a write access.

  • FAULT_FLAG_NONLINEAR says that the given VMA is a nonlinear mapping.
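Testing these flags is ordinary bit masking. The sketch below is a user-space illustration only: the numeric values are placeholders for the mock, and demo_stats and demo_account_fault() are hypothetical names, not anything found in the kernel.

```c
#include <assert.h>
#include <stddef.h>

/* Placeholder values for this mock; the kernel headers are authoritative. */
#define FAULT_FLAG_WRITE     0x01
#define FAULT_FLAG_NONLINEAR 0x02

/* Hypothetical per-mapping bookkeeping. */
struct demo_stats {
    unsigned long write_faults;
    unsigned long read_faults;
    unsigned long nonlinear_faults;
};

/* Classify one fault by examining the flags passed in struct vm_fault. */
static void demo_account_fault(unsigned int flags, struct demo_stats *st)
{
    assert(st != NULL);

    if (flags & FAULT_FLAG_WRITE)
        st->write_faults++;
    else
        st->read_faults++;

    /* The two flags are independent bits and may be set together. */
    if (flags & FAULT_FLAG_NONLINEAR)
        st->nonlinear_faults++;
}
```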

After fault() has done its work, it should store a pointer to the page structure for the faulted-in page in the page field - but see below for an exception. The return value from fault() is a set of flags which can indicate a number of things:

  • VM_FAULT_OOM: the fault could not be handled because the handler was unable to allocate the required memory.

  • VM_FAULT_SIGBUS: the page offset is out of range, so the fault could not be handled.

  • VM_FAULT_MAJOR: marks a "major" page fault - usually one which required reading data from disk.

  • VM_FAULT_WRITE: a copy-on-write mapping was broken to satisfy the fault.

  • VM_FAULT_NOPAGE: set if the handler has installed the page table entry directly. In this case, the page field returned in the vm_fault structure has no meaning. Among other uses, this flag allows fault() to be used with mappings that have no associated page structures - mappings of device memory, for example.

  • VM_FAULT_LOCKED: the returned page has been locked by the handler and should be unlocked by the caller. It is used with file-backed mappings to prevent races with other parts of the kernel which may be trying to access the same page.
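To make the return-value conventions a little more concrete, here is a hedged user-space sketch of a handler which reports three of these conditions. The demo_cache structure, DEMO_NPAGES, and the use of calloc() to stand in for page allocation are all hypothetical; the flag values are placeholders; and a return of 0 is assumed to mean an ordinary, successfully handled fault. VM_FAULT_WRITE, VM_FAULT_NOPAGE and VM_FAULT_LOCKED are not exercised here.

```c
#include <assert.h>
#include <stdlib.h>

/* User-space stand-ins for kernel types; illustrative only. */
typedef unsigned long pgoff_t;
struct page { int uptodate; };

struct vm_fault {
    unsigned int flags;
    pgoff_t pgoff;
    void *virtual_address;
    struct page *page;
};

struct vm_area_struct {
    void *vm_private_data;
};

/* Placeholder values for this mock; the kernel headers are authoritative. */
#define VM_FAULT_OOM    0x0001
#define VM_FAULT_SIGBUS 0x0002
#define VM_FAULT_MAJOR  0x0004

#define DEMO_NPAGES 4

/* Hypothetical cache of already-populated pages, indexed by page offset. */
struct demo_cache {
    struct page *slots[DEMO_NPAGES];
};

static int demo_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
{
    struct demo_cache *cache = vma->vm_private_data;
    int ret = 0;

    /* Offset beyond the end of the backing object. */
    if (vmf->pgoff >= DEMO_NPAGES)
        return VM_FAULT_SIGBUS;

    if (!cache->slots[vmf->pgoff]) {
        struct page *pg = calloc(1, sizeof(*pg));

        if (!pg)
            return VM_FAULT_OOM;
        pg->uptodate = 1;       /* pretend the data was read from disk */
        cache->slots[vmf->pgoff] = pg;
        ret |= VM_FAULT_MAJOR;  /* filling the page required "I/O" */
    }

    vmf->page = cache->slots[vmf->pgoff];
    return ret;
}
```

In this sketch the first fault on a given offset comes back as a major fault, while a repeat fault on the same offset finds the page already present and returns 0.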

All callers of the populate() VMA operation have been changed, and that method no longer exists. There is an entry in the feature removal schedule for nopage() indicating that it will go away "as soon as possible." The kernel still has a number of nopage() implementations, though, so getting rid of it may take a little while yet. Longer-term plans call for the removal of nopfn() as well, though no date has been set for this change. Certainly any new code which implements mmap() should be written to handle faults with fault() rather than one of the older functions.



