On Mon, 2007-01-15 at 21:47 -0800, Christoph Lameter wrote:
> Currently cpusets are not able to do proper writeback since
> dirty ratio calculations and writeback are all done for the system
> as a whole. This may result in a large percentage of a cpuset
> becoming dirty without writeout being triggered. Under NFS
> this can lead to OOM conditions.
>
> Writeback will occur during the LRU scans. But such writeout
> is not effective since we write page by page and not in inode page
> order (regular writeback).
>
> In order to fix the problem we first of all introduce a method to
> establish a map of nodes that contain dirty pages for each
> inode mapping.
>
> Secondly we modify the dirty limit calculation to be based
> on the active cpuset.
>
> If we are in a cpuset then we select only inodes for writeback
> that have pages on the nodes of the cpuset.
>
> After we have the cpuset throttling in place we can then make
> further fixups:
>
> A. We can do inode based writeout from direct reclaim,
>    avoiding single page writes to the filesystem.
>
> B. We add a new counter NR_UNRECLAIMABLE that is subtracted
>    from the available pages in a node. This allows us to
>    accurately calculate the dirty ratio even if large portions
>    of the node have been allocated for huge pages or for
>    slab pages.

What about mlock'ed pages?
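If I read the series right, the heart of the throttling is a dirty limit
computed over the cpuset's nodes only. A minimal userspace model of that
idea (made-up structures and numbers, not the actual patch) shows why the
machine-wide ratio misses a dirty cpuset:

/*
 * Userspace model of a per-cpuset dirty limit (hypothetical names and
 * numbers, not the kernel code).  Instead of comparing the global dirty
 * page count against a ratio of total memory, sum the per-node counters
 * over the nodes of the cpuset and derive the threshold from that
 * subset only.
 */
#include <stdbool.h>
#include <stdio.h>

#define MAX_NODES 4

struct node_stats {
        unsigned long free_pages;
        unsigned long file_pages;
        unsigned long dirty_pages;
};

static struct node_stats nodes[MAX_NODES] = {
        { 1000, 3000, 200 },    /* node 0 */
        { 1200, 2800, 900 },    /* node 1 */
        { 5000, 1000,  10 },    /* node 2 */
        { 5000, 1000,  10 },    /* node 3 */
};

/* Is writeback throttling needed for a cpuset spanning @mask? */
static bool cpuset_over_dirty_limit(unsigned long mask, int dirty_ratio)
{
        unsigned long available = 0, dirty = 0;
        int node;

        for (node = 0; node < MAX_NODES; node++) {
                if (!(mask & (1UL << node)))
                        continue;
                available += nodes[node].free_pages + nodes[node].file_pages;
                dirty += nodes[node].dirty_pages;
        }
        return dirty > available * dirty_ratio / 100;
}

int main(void)
{
        /* Cpuset confined to nodes 0 and 1: heavily dirtied. */
        printf("nodes 0-1 over limit: %d\n",
               cpuset_over_dirty_limit(0x3, 10));
        /* Whole machine: the global ratio would not trigger writeback. */
        printf("all nodes over limit: %d\n",
               cpuset_over_dirty_limit(0xf, 10));
        return 0;
}

With a 10% ratio the two-node cpuset is well over its limit while the
machine-wide calculation still sees only about 5% dirty, which is exactly
the OOM scenario described above.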
> There are a couple of points where some better ideas could be used:
>
> 1. The nodemask expands the inode structure significantly if the
>    architecture allows a high number of nodes. This is only an issue
>    for IA64. For that platform we expand the inode structure by 128
>    bytes (to support 1024 nodes). The last patch attempts to address
>    the issue by using the knowledge about the maximum possible number
>    of nodes determined at bootup to shrink the nodemask.

Not the prettiest indeed; no better ideas here, though.

> 2. The calculation of the per-cpuset limits can require looping over a
>    number of nodes, which may bring the performance of get_dirty_limits
>    back to roughly its pre-2.6.18 level (before the introduction of the
>    ZVC counters), although only for the cpuset-based limit calculation.
>    There is no way of keeping these counters per cpuset since cpusets
>    may overlap.

Well, you gain functionality and lose some runtime; sad, but probably
worth it.

Otherwise it all looks good.

Acked-by: Peter Zijlstra <[EMAIL PROTECTED]>
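As an aside on point 1 above, a rough userspace model of the bootup
shrinking (helper names are made up; the real patch works on nodemask_t
inside the kernel): size the per-mapping dirty-node map to the node count
discovered at boot rather than to the compile-time maximum.

/*
 * Userspace model of the nodemask-shrinking idea (hypothetical names,
 * not the actual patch).  With 1024 possible nodes a full mask embedded
 * in every inode costs 128 bytes; if bootup determines that only
 * nr_node_ids nodes can ever be online, a bitmap sized to that count is
 * enough to track which nodes hold dirty pages for a mapping.
 */
#include <limits.h>
#include <stdio.h>
#include <stdlib.h>

#define BITS_PER_LONG (sizeof(unsigned long) * CHAR_BIT)
#define BITS_TO_LONGS(n) (((n) + BITS_PER_LONG - 1) / BITS_PER_LONG)

static int nr_node_ids;         /* determined once at "boot" */

static unsigned long *alloc_dirty_nodes(void)
{
        return calloc(BITS_TO_LONGS(nr_node_ids), sizeof(unsigned long));
}

static void set_dirty_node(unsigned long *map, int node)
{
        map[node / BITS_PER_LONG] |= 1UL << (node % BITS_PER_LONG);
}

static int node_is_dirty(const unsigned long *map, int node)
{
        return !!(map[node / BITS_PER_LONG] & (1UL << (node % BITS_PER_LONG)));
}

int main(void)
{
        nr_node_ids = 4;        /* say bootup found 4 possible nodes */

        unsigned long *dirty_nodes = alloc_dirty_nodes();

        set_dirty_node(dirty_nodes, 1);
        printf("node 1 dirty: %d, node 2 dirty: %d\n",
               node_is_dirty(dirty_nodes, 1),
               node_is_dirty(dirty_nodes, 2));
        printf("per-mapping cost: %zu bytes instead of %zu\n",
               BITS_TO_LONGS(nr_node_ids) * sizeof(unsigned long),
               BITS_TO_LONGS(1024) * sizeof(unsigned long));

        free(dirty_nodes);
        return 0;
}

On a 4-node box that is 8 bytes per mapping instead of the 128 bytes a
full 1024-node mask costs.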