Page migration
(http://lwn.net/Articles/157066/)
NUMA systems have, by design, memory which is local to specific nodes (groups of processors). While all memory is accessible, local memory is faster to work with than remote memory. The kernel takes NUMA behavior into account by attempting to allocate local memory for processes, and by avoiding moving processes between nodes whenever possible. Sometimes processes must be moved, however, with the result that the local-allocation optimization can quickly become a pessimization instead. What would be nice, in such situations, would be the ability to move a process's memory when the process itself is shifted to a new node.
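For readers who have not worked with NUMA interfaces, here is a minimal sketch of what local allocation looks like from user space, using the libnuma API shipped with the numactl package. The use of libnuma here is an assumption for illustration; the article itself does not mention it.

    /*
     * Minimal sketch (not from the article): allocate memory with the
     * kernel's default local-allocation behavior made explicit, using
     * libnuma from the numactl package.  Compile with -lnuma.
     */
    #include <numa.h>
    #include <stdio.h>
    #include <stdlib.h>

    int main(void)
    {
        if (numa_available() < 0) {
            fprintf(stderr, "NUMA is not available on this system\n");
            return EXIT_FAILURE;
        }

        /* Allocate a buffer on the node the calling task is running on,
         * mirroring the kernel's default local-allocation policy. */
        size_t size = 4096 * 16;
        void *buf = numa_alloc_local(size);
        if (!buf) {
            perror("numa_alloc_local");
            return EXIT_FAILURE;
        }

        printf("allocated %zu bytes with a local-allocation policy\n", size);
        numa_free(buf, size);
        return EXIT_SUCCESS;
    }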
Memory migration patches have been circulating for some time now. The latest version is this patch set posted by Christoph Lameter. This patch deliberately does not solve the entire problem, but it does try to establish enough infrastructure that a full migration solution can be evolved eventually. This patch does not automatically migrate memory for processes which have been moved; instead, it leaves the migration decision to user space. There is a new system call:

    long migrate_pages(pid_t pid, unsigned long maxnode,
                       unsigned long *old_nodes,
                       unsigned long *new_nodes);
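The article does not show how the call would be invoked; what follows is a minimal sketch, assuming a raw syscall(2) invocation and a __NR_migrate_pages syscall number, neither of which is given in the patch description. The node arguments are bitmasks with one bit per node.

    /*
     * Illustrative only: ask the kernel to move a process's pages from
     * node 0 to node 1.  The syscall(2) wiring and __NR_migrate_pages
     * constant are assumptions, not part of the patch description above.
     */
    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/syscall.h>
    #include <unistd.h>

    int main(int argc, char *argv[])
    {
        /* Target process: the PID given on the command line, or ourselves. */
        pid_t pid = (argc > 1) ? (pid_t)atoi(argv[1]) : getpid();

        /* One-bit-per-node masks: vacate node 0, allocate on node 1. */
        unsigned long old_nodes = 1UL << 0;
        unsigned long new_nodes = 1UL << 1;
        unsigned long maxnode = sizeof(unsigned long) * 8;

        long ret = syscall(__NR_migrate_pages, pid, maxnode,
                           &old_nodes, &new_nodes);
        if (ret < 0)
            perror("migrate_pages");  /* a negative return indicates failure */
        else
            printf("migrate_pages returned %ld\n", ret);
        return 0;
    }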
This call will attempt to move any pages belonging to the given process from old_nodes to new_nodes. There is also a new MPOL_MF_MOVE option to the set_mempolicy() system call which can be used to the same effect (sketched below). Either way, user space can request that a given process vacate a set of nodes. This operation can be performed in response to an explicit move of the process itself (which might be done by a system scheduling daemon, for example), or in response to other events, such as the impending shutdown and removal of a node.

The implementation is simple for now: the code iterates over the process's memory and attempts to force each page needing migration to be swapped out. When the process faults the page back in, it should then be allocated on the process's current node. The force-out process actually takes a few passes over the list; initially it passes over locked pages and concerns itself only with pages which are easy to evict. In later passes, it will wait for locked pages and do the hard work of getting the final pages out of memory.

Migrating pages by way of the swap device is not the most efficient way of moving them across a NUMA system. Later work on the patch will be aimed at adding direct node-to-node migration, among other features. In the meantime, however, the developers would like to see the current implementation merged in time for 2.6.15. Andrew Morton has expressed some reservations, however: he would like to see an explanation of how this code can be made to work with near-complete reliability. There are a number of things which can prevent the migration of pages; these include pages locked in place by user space, pages undergoing direct I/O, and more. Christoph responded that the patch will get there, eventually. Whether this claim is sufficiently convincing to get the migration patches into 2.6.15 remains to be seen.
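A sketch of the MPOL_MF_MOVE path, with one hedge: the article describes the flag as an option to set_mempolicy(), but the interface that eventually reached mainline passes it to mbind(2) instead, and that is the form this example assumes. The declarations come from <numaif.h> in the numactl package.

    /*
     * Sketch of moving already-resident pages with a memory policy call.
     * Assumes the mbind(2) form of the MPOL_MF_MOVE interface, as merged
     * in later kernels; link with -lnuma.
     */
    #include <numaif.h>
    #include <stdio.h>
    #include <stdlib.h>

    int main(void)
    {
        size_t len = 4096 * 16;
        void *addr = NULL;

        /* mbind() requires a page-aligned range. */
        if (posix_memalign(&addr, 4096, len) != 0)
            return EXIT_FAILURE;

        /* Bind the range to node 1 and ask the kernel to move any pages
         * already resident elsewhere (the "vacate a set of nodes" case). */
        unsigned long nodemask = 1UL << 1;
        if (mbind(addr, len, MPOL_BIND, &nodemask,
                  sizeof(nodemask) * 8, MPOL_MF_MOVE) != 0)
            perror("mbind");

        free(addr);
        return EXIT_SUCCESS;
    }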
Page migration
Posted Oct 27, 2005 22:31 UTC (Thu) by Duncan (guest, #6647)

> Migrating pages by way of the swap device is not the most efficient
> way of moving them across a NUMA system.

... Especially for those of us who have swap disabled! It'll be nice to have migration working, but what's this with forcing pages out to slow swap so they can be faulted back in on the other node? It's likely that's so slow the cost of doing it exceeds the potential benefit of NUMA-optimized memory accesses for the remaining lifetime of the process, in many cases!

Of course, if by "swapped out" and "faulted back in" it just means a trip out of directly allocated memory into cache memory and faulted back in from there, no big deal (unless having no swap means it's disabled), but if pages are actually written to disk, that's /quite/ a bit of extra latency to make up in NUMA optimization; enough that it's not likely to be worth it save for processes running (and accessing that memory) for more than an hour, anyway.

So... I guess my view on this depends on the defined value of "eventually", although I think I'd still prefer it wait out a version, to be merged with .16 (or later), if it's all going to be swap-dependent for .15. After all, it's not like the patches won't still be there for those who want them with .15, anyway, just not mainlined (-mm might be fine).

Duncan
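Duncan's latency argument is easy to make concrete. The figures below are assumed, illustrative numbers only; they come from neither the article nor the comment.

    /*
     * Back-of-envelope only: how many remote accesses must local
     * allocation save before one disk round trip per migrated page
     * pays for itself?  All numbers are assumed for illustration.
     */
    #include <stdio.h>

    int main(void)
    {
        double disk_round_trip_ns = 8e6;   /* ~8 ms seek + rotation per page */
        double remote_penalty_ns  = 100.0; /* assumed extra cost per remote access */

        /* With these numbers: 80,000 saved accesses per migrated page. */
        printf("break-even: %.0f saved remote accesses per page\n",
               disk_round_trip_ns / remote_penalty_ns);
        return 0;
    }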
Why a system call?
Posted Oct 28, 2005 0:32 UTC (Fri) by xoddam (subscriber, #2322)

I don't understand how userspace is expected to know when and why it would be worthwhile to move pages from node to node. We don't have explicit system calls requesting that pages be written out to the swap device or faulted back; the OS handles it automatically, and it's the job of the VM system to do it well. Recent optimisations like swap readahead are progress in that direction.

I imagine the intermediate use of the swap device is merely a stepping stone for the implementor. I'm sure Christoph is aware of the price of disk latency! Perhaps nonlocal memory should be treated as a fast swap device after a process has been migrated: pages can be faulted across to local memory when they are used on the new node. As an optimisation this could be 'extra lazy', since all pages are actually accessible; e.g. only pages written to by the new node (and, if there is a good heuristic for this, the most frequently read pages) need be copied to local memory.

Do NUMA systems have memory-to-memory copy operations which don't trash the processor caches? I can imagine a "DMA" between node memories could take place while the source pages remain readable by the processor.
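On the closing question: x86 processors do offer non-temporal store instructions that bypass the cache hierarchy, which is one (CPU-driven rather than DMA) way to copy between node memories without displacing cache contents. A sketch using SSE2 intrinsics follows; it is purely illustrative and not part of the migration patches under discussion.

    /*
     * Copy a buffer with SSE2 streaming stores so the destination lines
     * are not pulled into the processor caches.  Illustrative only.
     */
    #include <emmintrin.h>
    #include <stddef.h>

    void copy_nontemporal(void *dst, const void *src, size_t len)
    {
        __m128i *d = (__m128i *)dst;
        const __m128i *s = (const __m128i *)src;

        /* 16 bytes per SSE2 register; assumes 16-byte-aligned buffers
         * and a length that is a multiple of 16. */
        for (size_t i = 0; i < len / 16; i++) {
            __m128i v = _mm_load_si128(&s[i]); /* normal (cached) load */
            _mm_stream_si128(&d[i], v);        /* streaming store, no cache fill */
        }
        _mm_sfence(); /* make the streamed stores globally visible */
    }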
