On Mon, 2007-01-15 at 21:47 -0800, Christoph Lameter wrote:
> Currently cpusets are not able to do proper writeback since
> dirty ratio calculations and writeback are all done for the system
> as a whole. This may result in a large percentage of a cpuset
> becoming dirty without writeout being triggered. Under NFS
> this can lead to OOM conditions.
>
> Writeback will occur during the LRU scans. But such writeout
> is not effective since we write page by page and not in inode page
> order (regular writeback).
>
> In order to fix the problem we first of all introduce a method to
> establish a map of nodes that contain dirty pages for each
> inode mapping.
>
> Secondly we modify the dirty limit calculation to be based
> on the active cpuset.
>
> If we are in a cpuset then we select only inodes for writeback
> that have pages on the nodes of the cpuset.
>
> After we have the cpuset throttling in place we can then make
> further fixups:
>
> A. We can do inode based writeout from direct reclaim,
>    avoiding single page writes to the filesystem.
>
> B. We add a new counter NR_UNRECLAIMABLE that is subtracted
>    from the available pages in a node. This allows us to
>    accurately calculate the dirty ratio even if large portions
>    of the node have been allocated for huge pages or for
>    slab pages.

What about mlock'ed pages?
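If I read the series right, the heart of the throttling is a dirty limit
computed over the cpuset's nodes only. A minimal userspace model of that
idea (made-up structures and numbers, not the actual patch) shows why the
machine-wide ratio misses a dirty cpuset:

/*
 * Userspace model of a per-cpuset dirty limit (hypothetical names and
 * numbers, not the kernel code).  Instead of comparing the global dirty
 * page count against a ratio of total memory, sum the per-node counters
 * over the nodes of the cpuset and derive the threshold from that
 * subset only.
 */
#include <stdbool.h>
#include <stdio.h>

#define MAX_NODES 4

struct node_stats {
        unsigned long free_pages;
        unsigned long file_pages;
        unsigned long dirty_pages;
};

static struct node_stats nodes[MAX_NODES] = {
        { 1000, 3000, 200 },    /* node 0 */
        { 1200, 2800, 900 },    /* node 1 */
        { 5000, 1000,  10 },    /* node 2 */
        { 5000, 1000,  10 },    /* node 3 */
};

/* Is writeback throttling needed for a cpuset spanning @mask? */
static bool cpuset_over_dirty_limit(unsigned long mask, int dirty_ratio)
{
        unsigned long available = 0, dirty = 0;
        int node;

        for (node = 0; node < MAX_NODES; node++) {
                if (!(mask & (1UL << node)))
                        continue;
                available += nodes[node].free_pages + nodes[node].file_pages;
                dirty += nodes[node].dirty_pages;
        }
        return dirty > available * dirty_ratio / 100;
}

int main(void)
{
        /* Cpuset confined to nodes 0 and 1: heavily dirtied. */
        printf("nodes 0-1 over limit: %d\n",
               cpuset_over_dirty_limit(0x3, 10));
        /* Whole machine: the global ratio would not trigger writeback. */
        printf("all nodes over limit: %d\n",
               cpuset_over_dirty_limit(0xf, 10));
        return 0;
}

With a 10% ratio the two-node cpuset is well over its limit while the
machine-wide calculation still sees only about 5% dirty, which is exactly
the OOM scenario described above.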
> There are a couple of points where some better ideas could be used:
>
> 1. The nodemask expands the inode structure significantly if the
>    architecture allows a high number of nodes. This is only an issue
>    for IA64. For that platform we expand the inode structure by 128
>    bytes (to support 1024 nodes). The last patch attempts to address
>    the issue by using the knowledge about the maximum possible number
>    of nodes determined at bootup to shrink the nodemask.

Not the prettiest indeed; no better ideas here, though.

> 2. The calculation of the per-cpuset limits can require looping over a
>    number of nodes, which may bring the performance of get_dirty_limits
>    back to roughly its pre-2.6.18 level (before the introduction of the
>    ZVC counters), although only for the cpuset-based limit calculation.
>    There is no way of keeping these counters per cpuset since cpusets
>    may overlap.

Well, you gain functionality and lose some runtime; sad, but probably
worth it.

Otherwise it all looks good.

Acked-by: Peter Zijlstra <[EMAIL PROTECTED]>
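As an aside on point 1 above, a rough userspace model of the bootup
shrinking (helper names are made up; the real patch works on nodemask_t
inside the kernel): size the per-mapping dirty-node map to the node count
discovered at boot rather than to the compile-time maximum.

/*
 * Userspace model of the nodemask-shrinking idea (hypothetical names,
 * not the actual patch).  With 1024 possible nodes a full mask embedded
 * in every inode costs 128 bytes; if bootup determines that only
 * nr_node_ids nodes can ever be online, a bitmap sized to that count is
 * enough to track which nodes hold dirty pages for a mapping.
 */
#include <limits.h>
#include <stdio.h>
#include <stdlib.h>

#define BITS_PER_LONG (sizeof(unsigned long) * CHAR_BIT)
#define BITS_TO_LONGS(n) (((n) + BITS_PER_LONG - 1) / BITS_PER_LONG)

static int nr_node_ids;         /* determined once at "boot" */

static unsigned long *alloc_dirty_nodes(void)
{
        return calloc(BITS_TO_LONGS(nr_node_ids), sizeof(unsigned long));
}

static void set_dirty_node(unsigned long *map, int node)
{
        map[node / BITS_PER_LONG] |= 1UL << (node % BITS_PER_LONG);
}

static int node_is_dirty(const unsigned long *map, int node)
{
        return !!(map[node / BITS_PER_LONG] & (1UL << (node % BITS_PER_LONG)));
}

int main(void)
{
        nr_node_ids = 4;        /* say bootup found 4 possible nodes */

        unsigned long *dirty_nodes = alloc_dirty_nodes();

        set_dirty_node(dirty_nodes, 1);
        printf("node 1 dirty: %d, node 2 dirty: %d\n",
               node_is_dirty(dirty_nodes, 1),
               node_is_dirty(dirty_nodes, 2));
        printf("per-mapping cost: %zu bytes instead of %zu\n",
               BITS_TO_LONGS(nr_node_ids) * sizeof(unsigned long),
               BITS_TO_LONGS(1024) * sizeof(unsigned long));

        free(dirty_nodes);
        return 0;
}

On a 4-node box that is 8 bytes per mapping instead of the 128 bytes a
full 1024-node mask costs.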