http://lwn.net/Articles/160201/
Page migration is the act of moving a process's pages from one part of the system to another. Often, the motivation is moving pages between NUMA nodes in the hope of improving performance. When this page last looked at the page migration patch set, it worked by forcing target pages out to the swap device. When the owning process later faults them in, these pages will end up on the desired node. This technique works, but it is not optimal: it would be nicer to avoid having to write the pages to disk and read them back in.

Christoph Lameter has now followed up with the direct migration patch set, which does away with the side-trip to the swap device. A look at the patch shows why things were not done this way in the first place; direct page migration involves rather more than simply copying the data over.

The first step, after choosing a target page, is to lock that page so that nobody else will mess with it. There might currently be I/O active which involves that page, so the kernel must wait for any such I/O to complete. Only then can the real migration work begin. The kernel must establish a swap cache entry for the page, even though it intends to avoid writing the page to swap. This entry will cause the right thing to happen if a process faults on the page while it is being moved. Then all references to the page (page table entries) are unmapped. With luck, all references will go away; if references remain for any reason, the page cannot be moved.

Actually moving the page involves copying a subset of the page status bits over, copying the page data itself, then copying the rest of the status bits. The old page is cleared out and freed. If any writeback has been queued up for the new page, it is set in motion. Then it's just a matter of cleaning up, and the page has been successfully moved.

If the kernel runs out of free pages on the target node, it will fall back to the swap-based mechanism. So that stage of this patch's evolution remains useful.

With this code in place, the kernel has the support it needs to try to keep a process's pages in local memory. The migration code might also prove useful for hotplug memory uses, where all pages must be vacated from a given region. Indeed, some of this code was originally written for hotplug applications. But, at this point, the migration is done on a best-effort basis. For NUMA systems, failure to move a page results in worse performance, but nothing particularly severe.
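
The sequence of steps in a direct migration may be easier to follow as a user-space toy model. What follows is only a sketch of the ordering described above, not code from the patch: every name in it (struct toy_page, the TOY_* bits, toy_migrate_page(), toy_move_to_node()) is hypothetical, and the choice of which status bits are copied before the data is an assumption made for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define TOY_PAGE_SIZE 4096

/* Hypothetical status bits; the kernel's real page flags differ. */
enum {
    TOY_UPTODATE  = 1 << 0,
    TOY_DIRTY     = 1 << 1,
    TOY_SWAPCACHE = 1 << 2,
    TOY_WB_QUEUED = 1 << 3   /* writeback has been queued for this page */
};

struct toy_page {
    bool locked;
    bool io_active;
    bool writeback_started;
    unsigned flags;
    int mappings;   /* page table entries referring to the page */
    int pinned;     /* of those, references that cannot be unmapped */
    unsigned char data[TOY_PAGE_SIZE];
};

enum toy_result { TOY_MOVED_DIRECT, TOY_FELL_BACK_TO_SWAP, TOY_NOT_MOVED };

/* Returns 0 if src was moved into dst, -1 if references remained. */
int toy_migrate_page(struct toy_page *src, struct toy_page *dst)
{
    int saved_mappings = src->mappings;

    src->locked = true;            /* lock the page */
    src->io_active = false;        /* toy stand-in for waiting on I/O */
    src->flags |= TOY_SWAPCACHE;   /* entry that catches concurrent faults */

    src->mappings = src->pinned;   /* unmap every reference that will go */
    if (src->mappings > 0) {
        /* References remain: give up and put things back (toy cleanup). */
        src->mappings = saved_mappings;
        src->flags &= ~TOY_SWAPCACHE;
        src->locked = false;
        return -1;
    }

    /* Assumption: which bits are copied before the data is illustrative. */
    dst->flags = src->flags & TOY_UPTODATE;
    memcpy(dst->data, src->data, TOY_PAGE_SIZE);
    dst->flags |= src->flags & (TOY_DIRTY | TOY_WB_QUEUED);

    /* Clear out the old page; a real kernel would also free it. */
    memset(src, 0, sizeof(*src));

    /* Set any queued writeback in motion on the new page. */
    if (dst->flags & TOY_WB_QUEUED) {
        dst->flags &= ~TOY_WB_QUEUED;
        dst->writeback_started = true;
    }
    return 0;
}

/* dst == NULL models "no free pages on the target node". */
enum toy_result toy_move_to_node(struct toy_page *src, struct toy_page *dst)
{
    if (dst == NULL)
        return TOY_FELL_BACK_TO_SWAP;
    return toy_migrate_page(src, dst) == 0 ? TOY_MOVED_DIRECT : TOY_NOT_MOVED;
}
```

In this model a page with a nonzero pinned count comes back as TOY_NOT_MOVED and is left where it was, which is the best-effort behavior a NUMA system can live with.
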
For hotplug memory, by contrast, this sort of failure will block a memory remove operation altogether. Moving all pages in a region with 100% certainty remains a difficult problem without a complete solution at this time.

One of the pieces of such a solution might be active memory defragmentation which, among other things, works to keep non-movable memory allocations out of memory regions which might be removed. When we looked at active defragmentation last week, that patch set looked like it was in trouble. The overhead of the defragmentation code seemed to be too high, and a number of developers (Linus included) felt that this sort of functionality should be implemented using the kernel's zone system, rather than with a new layer in the memory allocator.

Defragmentation hacker Mel Gorman doesn't give up that easily, however. He has posted a new, "light" version of the defragmentation patch which, he hopes, will be better received. As he describes it:

    This is a much simplified anti-defragmentation
    approach that simply tries to keep kernel allocations in groups of
    2^(MAX_ORDER-1) and easily reclaimed allocations in groups of
    2^(MAX_ORDER-1). It uses no balancing, tunables special reserves and it
    introduces no new branches in the main path. For small memory systems,
    it can be disabled via a config option. In total, it adds 275 new lines
    of code with minimum changes made to the main path.

In this version of the patch, a new GFP flag (__GFP_EASYRCLM) is added; its presence indicates an allocation which the kernel can easily get back should the need arise. It is used for user-space pages (which can usually be forced out to backing store) and in a few other situations, such as for some kernel buffers. The buddy allocator already keeps track of memory in large chunks; the new code simply steers reclaimable allocations toward some chunks, while keeping the non-reclaimable allocations in others. In this way, it is hoped, there will be no situations where one non-movable page blocks the freeing of the large, contiguous region in which it is located.

The patch works by creating a "usemap" array tracking which kind of allocation is being done from each large chunk of memory. Mel also had to split the per-CPU free lists which are used to perform fast single-page allocations; now there are two such lists, one for each allocation type. From there, it is just a matter of taking allocations from the proper pile, depending on the __GFP_EASYRCLM flag.

This version certainly reduces the footprint and overhead of the defragmentation patches. It is still not the zone-based approach that others were pushing for, however. So it remains to be seen whether "active defragmentation lite" is, in the end, better received than its predecessors.
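
The usemap idea can be illustrated with a small user-space model. This is a sketch under stated assumptions rather than anything taken from Mel's patch: TOY_MAX_ORDER is assumed to be 11 (so a chunk is 2^10 = 1024 pages, or 4MB with 4KB pages), TOY_GFP_EASYRCLM is a hypothetical stand-in for __GFP_EASYRCLM, and the allocator only ever hands out the next unused page in a chunk, with no freeing and no per-CPU lists.

```c
#include <assert.h>

#define TOY_MAX_ORDER    11                          /* assumed value */
#define CHUNK_PAGES      (1 << (TOY_MAX_ORDER - 1))  /* pages per chunk */
#define NR_CHUNKS        4
#define TOY_GFP_EASYRCLM 0x1u  /* hypothetical stand-in for __GFP_EASYRCLM */

enum chunk_type { CHUNK_UNUSED, CHUNK_KERNEL, CHUNK_EASYRCLM };

/* One entry per large chunk: which kind of allocation it serves. */
static enum chunk_type usemap[NR_CHUNKS];
static int used_pages[NR_CHUNKS];

/* Returns a page frame number, or -1 if no suitable chunk has room. */
long toy_alloc_page(unsigned gfp_flags)
{
    enum chunk_type want =
        (gfp_flags & TOY_GFP_EASYRCLM) ? CHUNK_EASYRCLM : CHUNK_KERNEL;
    int i;

    /* First choice: a chunk already serving this allocation type. */
    for (i = 0; i < NR_CHUNKS; i++)
        if (usemap[i] == want && used_pages[i] < CHUNK_PAGES)
            return (long)i * CHUNK_PAGES + used_pages[i]++;

    /* Otherwise claim an unused chunk for this type. */
    for (i = 0; i < NR_CHUNKS; i++)
        if (usemap[i] == CHUNK_UNUSED) {
            usemap[i] = want;
            return (long)i * CHUNK_PAGES + used_pages[i]++;
        }

    return -1;
}

/* Which chunk a page frame number belongs to. */
int toy_chunk_of(long pfn)
{
    return (int)(pfn / CHUNK_PAGES);
}
```

Interleaving kernel and easily-reclaimed requests in this model leaves each type in its own chunk, so no chunk marked CHUNK_EASYRCLM ever holds a page that cannot be reclaimed. What the real code does when no chunk of the right type has room is not covered in the text above; the model simply fails the allocation.
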

VM followup: page migration and fragmentation avoidance
Posted Nov 24, 2005 12:29 UTC (Thu) by markryde (guest, #33361)

> Often, the motivation is moving pages between NUMA nodes in the hope of
> improving performance.

Does anybody know if there is any other use for page migration apart from moving pages between NUMA nodes? (in clusters or virtualization solutions maybe?)

VM followup: page migration and fragmentation avoidance
Posted Nov 25, 2005 11:30 UTC (Fri) by farnz (subscriber, #17727)

Uses I can think of (other people will no doubt correct me when I've got things wrong):
