[Devel] Re: + memcg-fix-deadlock-between-cpuset-and-memcg.patch added to -mm tree
(resend, adding Cc: containers and linux-mm)

Hi, Andrew. Thank you for picking up this patch. But this version has a small race problem. This is a fix-up patch.

===
From: Daisuke Nishimura nishim...@mxp.nes.nec.co.jp

We must clear mc.moving_task before waking up all waiters at the end of task migration. Otherwise, there can be a small race like:

	mem_cgroup_clear_mc()          |  mem_cgroup_wait_acct_move()
	-------------------------------+-------------------------------
	__mem_cgroup_clear_mc()        |
	  wake_up_all()                |
	                               |  prepare_to_wait()
	                               |  if (mc.moving_task) -> true
	                               |    schedule()
	                               |    -> no one wakes it up.
	mc.moving_task = NULL          |

Signed-off-by: Daisuke Nishimura nishim...@mxp.nes.nec.co.jp
---
 mm/memcontrol.c | 6 +-
 1 files changed, 5 insertions(+), 1 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index b108b30..61678be 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -4714,12 +4714,16 @@ static void mem_cgroup_clear_mc(void)
 {
 	struct mem_cgroup *from = mc.from;
 
+	/*
+	 * We must clear moving_task before waking up waiters at the end of
+	 * task migration.
+	 */
+	mc.moving_task = NULL;
 	__mem_cgroup_clear_mc();
 	spin_lock(&mc.lock);
 	mc.from = NULL;
 	mc.to = NULL;
 	spin_unlock(&mc.lock);
-	mc.moving_task = NULL;
 	mem_cgroup_end_move(from);
 }
--
1.7.1

___
Containers mailing list
contain...@lists.linux-foundation.org
https://lists.linux-foundation.org/mailman/listinfo/containers

___
Devel mailing list
Devel@openvz.org
https://openvz.org/mailman/listinfo/devel
[Devel] [RFC][BUGFIX] memcg: fix dead lock between cpuset and memcg (Re: [PATCH v5 3/3] cgroups: make procs file writable)
It looks to me like when memcg holds the mmap_sem the whole time, it's just to avoid the deadlock, not that there's some need for the stuff under mmap_sem not to change between can_attach and attach. But if there is such a need, then the write-side in mpol_rebind_mm may conflict even with my proposed solution. Regardless, the best way would be to avoid holding the mmap_sem across the whole window, possibly by solving the move_charge deadlock some other internal way, if at all possible?

I made a patch to fix these problems (the deadlock between cpuset and memcg which commit b1dd693e introduces, and the deadlock which that commit fixed). I'll test and resend this after the new year holidays in Japan.

===
From: Daisuke Nishimura nishim...@mxp.nes.nec.co.jp

Commit b1dd693e (memcg: avoid deadlock between move charge and try_charge()) can cause another deadlock on mmap_sem at task migration if cpuset and memcg are mounted onto the same mount point. After the commit, cgroup_attach_task() has a sequence like:

	cgroup_attach_task()
	  ss->can_attach()
	    cpuset_can_attach()
	    mem_cgroup_can_attach()
	      down_read(&mmap_sem)        (1)
	  ss->attach()
	    cpuset_attach()
	      mpol_rebind_mm()
	        down_write(&mmap_sem)     (2)
	        up_write(&mmap_sem)
	      cpuset_migrate_mm()
	        do_migrate_pages()
	          down_read(&mmap_sem)
	          up_read(&mmap_sem)
	    mem_cgroup_move_task()
	      mem_cgroup_clear_mc()
	        up_read(&mmap_sem)

We can deadlock at (2) because we've already acquired the mmap_sem at (1). But the commit itself is necessary to fix deadlocks which had existed before it, like:

Ex.1)
	           move charge            |            try charge
	---------------------------------+------------------------------------
	mem_cgroup_can_attach()          |  down_write(&mmap_sem)
	  mc.moving_task = current       |    ..
	  mem_cgroup_precharge_mc()      |  __mem_cgroup_try_charge()
	    mem_cgroup_count_precharge() |    prepare_to_wait()
	      down_read(&mmap_sem)       |    if (mc.moving_task)
	      -> cannot acquire the lock |    -> true
	                                 |      schedule()
	                                 |      -> move charge should wake it up

Ex.2)
	           move charge            |            try charge
	---------------------------------+------------------------------------
	mem_cgroup_can_attach()          |
	  mc.moving_task = current       |
	  mem_cgroup_precharge_mc()      |
	    mem_cgroup_count_precharge() |
	      down_read(&mmap_sem)       |
	      ..                         |
	      up_read(&mmap_sem)         |
	                                 |  down_write(&mmap_sem)
	mem_cgroup_move_task()           |    ..
	  mem_cgroup_move_charge()       |  __mem_cgroup_try_charge()
	    down_read(&mmap_sem)         |    prepare_to_wait()
	    -> cannot acquire the lock   |    if (mc.moving_task)
	                                 |    -> true
	                                 |      schedule()
	                                 |      -> move charge should wake it up

This patch fixes all of these problems by:
1. reverting the commit.
2. To fix Ex.1, setting mc.moving_task only after mem_cgroup_count_precharge() has released the mmap_sem.
3. To fix Ex.2, using down_read_trylock() instead of down_read() in mem_cgroup_move_charge() and, if it fails to acquire the lock, cancelling all extra charges, waking up all waiters, and retrying the trylock.
Reported-by: Ben Blum bb...@andrew.cmu.edu
Signed-off-by: Daisuke Nishimura nishim...@mxp.nes.nec.co.jp
---
 mm/memcontrol.c | 78 +++
 1 files changed, 44 insertions(+), 34 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 7a22b41..b108b30 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -292,7 +292,6 @@ static struct move_charge_struct {
 	unsigned long moved_charge;
 	unsigned long moved_swap;
 	struct task_struct *moving_task;	/* a task moving charges */
-	struct mm_struct *mm;
 	wait_queue_head_t waitq;		/* a waitq for other context */
 } mc = {
 	.lock = __SPIN_LOCK_UNLOCKED(mc.lock),
@@ -4639,7 +4638,7 @@ static unsigned long mem_cgroup_count_precharge(struct mm_struct *mm)
 	unsigned long precharge;
 	struct vm_area_struct *vma;
 
-	/* We've already held the mmap_sem */
+	down_read(&mm->mmap_sem);
 	for (vma = mm->mmap; vma; vma = vma->vm_next) {
 		struct mm_walk mem_cgroup_count_precharge_walk = {
 			.pmd_entry = mem_cgroup_count_precharge_pte_range,
@@ -4651,6 +4650,7 @@ static unsigned long mem_cgroup_count_precharge(struct mm_struct *mm)
 		walk_page_range(vma->vm_start, vma->vm_end,
 				&mem_cgroup_count_precharge_walk
[Devel] Re: [PATCH v5 3/3] cgroups: make procs file writable
On Fri, 24 Dec 2010 21:55:08 -0500 Ben Blum bb...@andrew.cmu.edu wrote:

> On Thu, Dec 23, 2010 at 10:33:52PM -0500, Ben Blum wrote:
> > On Thu, Dec 16, 2010 at 12:26:03AM -0800, Andrew Morton wrote:
> > > Patches have gone a bit stale, sorry. Refactoring in kernel/cgroup_freezer.c necessitates a refresh and retest please.
> >
> > commit 53feb29767c29c877f9d47dcfe14211b5b0f7ebd changed a bunch of stuff in kernel/cpuset.c to allocate nodemasks with NODEMASK_ALLOC (which wraps kmalloc) instead of on the stack.
> >
> > 1. All these functions have 'void' return values, indicating that calling them must not fail. Sure there are bailout cases, but no semblance of cross-function error propagation. Most importantly, cpuset_attach is a subsystem callback, which MUST not fail given the way it's used in cgroups, so relying on kmalloc is not safe.
> >
> > 2. I'm working on a patch series which needs to hold tasklist_lock across ss->attach callbacks (in cpuset_attach's "if (threadgroup)" case, this is how safe traversal of tsk->thread_group will be ensured), and kmalloc isn't allowed while holding a spin-lock.
> >
> > Why do we need heap-allocation here at all? In each case their scope is exactly the function's scope, and neither the commit nor the surrounding patch series gives any explanation. I'd like to revert the patch if possible. cc'ing Miao Xie (author) and David Rientjes (acker).
> >
> > -- Ben
>
> Well, even with the proposed solution to this there is still another problem that I see - that of mmap_sem. cpuset_attach() calls into mpol_rebind_mm() and do_migrate_pages(), which take mm->mmap_sem for writing and reading respectively. This is going to conflict with tasklist_lock... but moreover, the memcontrol subsys also touches the task's mm->mmap_sem, holding onto it between mem_cgroup_can_attach() and mem_cgroup_move_task() - as of b1dd693e5b9348bd68a80e679e03cf9c0973b01b.
> So we have (currently, even without my patches):
>
> 	cgroup_attach_task
> 	(1) cpuset_can_attach
> 	(2) mem_cgroup_can_attach
> 	      -> down_read(&mm->mmap_sem);
> 	(3) cpuset_attach
> 	      -> mpol_rebind_mm
> 	           -> down_write(&mm->mmap_sem);
> 	           -> up_write(&mm->mmap_sem);
> 	      -> cpuset_migrate_mm
> 	           -> do_migrate_pages
> 	                -> down_read(&mm->mmap_sem);
> 	                -> up_read(&mm->mmap_sem);
> 	(4) mem_cgroup_move_task
> 	      -> mem_cgroup_clear_mc
> 	           -> up_read(...);

hmm, nice catch.

> Is there some interdependency I'm missing here that guarantees recursive locking/deadlock will be avoided? It all looks like typical-case code.

Unfortunately, not. I couldn't hit this because I mount all subsystems onto different mount points in my environment.

> I think we should move taking the mmap_sem all the way up into cgroup_attach_task and cgroup_attach_proc; it will be held for writing the whole time. I don't quite understand the mempolicy stuff but maybe there can be ways to use mpol_rebind_mm and do_migrate_pages when the lock is already held.

I agree.

Another solution (just an idea): avoid enabling both the move_charge feature of memcg and memory_migrate of cpuset at the same time iff they are mounted under the same mount point. But, hmm... it's not a good idea to make a subsystem take account of another subsystem, IMHO.

Thanks,
Daisuke Nishimura.
[Devel] Re: [PATCH v2][memcg+dirtylimit] Fix overwriting global vm dirty limit setting by memcg (Re: [PATCH v3 00/11] memcg: per cgroup dirty page accounting
> > limit, and b) system usage is at system limit. Then the above patch will not trigger writeback. Example with two memcg:
> >
> > 	       limit  usage
> > 	sys    10     10
> > 	B       7      6
> > 	C       5      4
> >
> > If B wants to grow, the system will exceed the system limit of 10 and should be throttled. However, the smaller limit (7) will be selected and applied to the memcg usage (6), which indicates no need to throttle, so the system could get as bad as:
> >
> > 	       limit  usage
> > 	sys    10     12
> > 	B       7      7
> > 	C       5      5
> >
> > In this case the system usage exceeds the system limit because the per-memcg checks see no per-memcg problems.
>
> nice catch !
>
> To solve this I propose we create a new structure to aggregate both dirty limit and usage data:
>
> 	struct dirty_limits {
> 		unsigned long dirty_thresh;
> 		unsigned long background_thresh;
> 		unsigned long nr_reclaimable;
> 		unsigned long nr_writeback;
> 	};
>
> global_dirty_limits() would then query both the global and memcg limits, and use the dirty usage of the one that is closest to its limit. This change makes global_dirty_limits() look like:
>
> 	void global_dirty_limits(struct dirty_limits *limits)
> 	{
> 		unsigned long background;
> 		unsigned long dirty;
> 		unsigned long nr_reclaimable;
> 		unsigned long nr_writeback;
> 		unsigned long available_memory = determine_dirtyable_memory();
> 		struct task_struct *tsk;
>
> 		if (vm_dirty_bytes)
> 			dirty = DIV_ROUND_UP(vm_dirty_bytes, PAGE_SIZE);
> 		else
> 			dirty = (vm_dirty_ratio * available_memory) / 100;
>
> 		if (dirty_background_bytes)
> 			background = DIV_ROUND_UP(dirty_background_bytes, PAGE_SIZE);
> 		else
> 			background = (dirty_background_ratio * available_memory) / 100;
>
> 		nr_reclaimable = global_page_state(NR_FILE_DIRTY) +
> 				 global_page_state(NR_UNSTABLE_NFS);
> 		nr_writeback = global_page_state(NR_WRITEBACK);
>
> 		if (mem_cgroup_dirty_limits(available_memory, limits) &&
> 		    dirty_available(limits->dirty_thresh, limits->nr_reclaimable,
> 				    limits->nr_writeback) &&
> 		    dirty_available(dirty, nr_reclaimable, nr_writeback)) {
> 			dirty = min(dirty, limits->dirty_thresh);
> 			background = min(background, limits->background_thresh);
> 		} else {
> 			limits->nr_reclaimable = nr_reclaimable;
> 			limits->nr_writeback = nr_writeback;
> 		}
>
> 		if (background >= dirty)
> 			background = dirty / 2;
> 		tsk = current;
> 		if (tsk->flags & PF_LESS_THROTTLE || rt_task(tsk)) {
> 			background += background / 4;
> 			dirty += dirty / 4;
> 		}
> 		limits->background_thresh = background;
> 		limits->dirty_thresh = dirty;
> 	}
>
> Because this approach considers both the memcg and system limits, the problem described above is avoided. I have this change integrated into the memcg dirty limit series (-v3 was the last post; v4 is almost ready with this change). I will post -v4 with this approach if there is no strong objection.

I have one concern now. I've noticed that global_dirty_limits() is called by bdi_debug_stats_show(), so the output of "cat /sys/kernel/debug/bdi/.../stats" changes depending on the memcg where the command is executed. Can you take it into account too?

Thanks,
Daisuke Nishimura.
[Devel] Re: [PATCH v2][memcg+dirtylimit] Fix overwriting global vm dirty limit setting by memcg (Re: [PATCH v3 00/11] memcg: per cgroup dirty page accounting
On Wed, 20 Oct 2010 14:02:55 +0900 KAMEZAWA Hiroyuki kamezawa.hir...@jp.fujitsu.com wrote:

> Fixed one here.
> ==
> From: KAMEZAWA Hiroyuki kamezawa.hir...@jp.fujitsu.com
>
> Now, at calculating the dirty limit, vm_dirty_param() is called. This function returns dirty-limit related parameters considering memory cgroup settings.
>
> Now, assume that vm_dirty_bytes=100M (global dirty limit) and a memory cgroup has 1G of pages and 40 as dirty_ratio; its dirtyable memory is 500MB. In this case, global_dirty_limits will consider the dirty_limit as 500 * 0.4 = 200MB. This is bad... memory cgroup is not a back door.
>
> This patch limits the return value of vm_dirty_param() considering global settings.
>
> Changelog:
> - fixed an argument "mem" from int to u64
> - fixed to use global available memory to cap memcg's value.
>
> Signed-off-by: KAMEZAWA Hiroyuki kamezawa.hir...@jp.fujitsu.com

Acked-by: Daisuke Nishimura nishim...@mxp.nes.nec.co.jp
[Devel] Re: [PATCH v3 06/11] memcg: add kernel calls for memcg dirty page stats
On Mon, 18 Oct 2010 17:39:39 -0700 Greg Thelen gthe...@google.com wrote:

> Add calls into memcg dirty page accounting. Notify memcg when pages transition between clean, file dirty, writeback, and unstable NFS. This allows the memory controller to maintain an accurate view of the amount of its memory that is dirty.
>
> Signed-off-by: Greg Thelen gthe...@google.com
> Signed-off-by: Andrea Righi ari...@develer.com

Reviewed-by: Daisuke Nishimura nishim...@mxp.nes.nec.co.jp
[Devel] Re: [PATCH v3 02/11] memcg: document cgroup dirty memory interfaces
> +6. Hierarchy support
>
>  The memory controller supports a deep hierarchy and hierarchical accounting.
> --
> 1.7.1

Can you clarify whether we can limit the total dirty pages under a hierarchy in the use_hierarchy==1 case? If we can, I think it would be better to note it in this documentation.

Thanks,
Daisuke Nishimura.
[Devel] Re: [PATCH v3 02/11] memcg: document cgroup dirty memory interfaces
On Wed, 20 Oct 2010 09:11:09 +0900 KAMEZAWA Hiroyuki kamezawa.hir...@jp.fujitsu.com wrote:

> On Tue, 19 Oct 2010 14:00:58 -0700 Greg Thelen gthe...@google.com wrote:
> (snip)
> > +When use_hierarchy=0, each cgroup has independent dirty memory usage and limits.
> > +
> > +When use_hierarchy=1, a parent cgroup increasing its dirty memory usage will
> > +compare its total_dirty memory (which includes sum of all child cgroup dirty
> > +memory) to its dirty limits. This keeps a parent from explicitly exceeding its
> > +dirty limits. However, a child cgroup can increase its dirty usage without
> > +considering the parent's dirty limits. Thus the parent's total_dirty can exceed
> > +the parent's dirty limits as a child dirties pages.
>
> Hmm. In short, dirty_ratio in the use_hierarchy=1 case doesn't work as a user expects. Is this a spec or a current implementation?
>
> I think as following:
> - add a limitation such that, at setting a child's dirty_ratio, it must be below the parent's. If it exceeds the parent's dirty_ratio, EINVAL is returned.
>
> Could you modify the memory.dirty_ratio setting code? Then, the parent's dirty_ratio will never be exceeded. (If I understand correctly.) memory.dirty_limit_in_bytes will be a bit more complicated, but I think you can.

I agree.

At first impression, this limitation seemed a bit overkill to me, because we allow memory.limit_in_bytes of a child bigger than that of the parent now. But considering it more, the situation is different, because usage_in_bytes never exceeds limit_in_bytes.

Thanks,
Daisuke Nishimura.
[Devel] Re: [PATCH v3 07/11] memcg: add dirty limits to mem_cgroup
> +static unsigned long long
> +memcg_hierarchical_free_pages(struct mem_cgroup *mem)
> +{
> +	struct cgroup *cgroup;
> +	unsigned long long min_free, free;
> +
> +	min_free = res_counter_read_u64(&mem->res, RES_LIMIT) -
> +		   res_counter_read_u64(&mem->res, RES_USAGE);
> +	cgroup = mem->css.cgroup;
> +	if (!mem->use_hierarchy)
> +		goto out;
> +
> +	while (cgroup->parent) {
> +		cgroup = cgroup->parent;
> +		mem = mem_cgroup_from_cont(cgroup);
> +		if (!mem->use_hierarchy)
> +			break;
> +		free = res_counter_read_u64(&mem->res, RES_LIMIT) -
> +		       res_counter_read_u64(&mem->res, RES_USAGE);
> +		min_free = min(min_free, free);
> +	}
> +out:
> +	/* Translate free memory in pages */
> +	return min_free >> PAGE_SHIFT;
> +}

I think you can simplify this function using parent_mem_cgroup():

	unsigned long long free, min_free = ULLONG_MAX;

	while (mem) {
		free = res_counter_read_u64(&mem->res, RES_LIMIT) -
		       res_counter_read_u64(&mem->res, RES_USAGE);
		min_free = min(min_free, free);
		mem = parent_mem_cgroup(mem);
	}
	/* Translate free memory in pages */
	return min_free >> PAGE_SHIFT;

And, IMHO, we should return min(global_page_state(NR_FREE_PAGES), min_free >> PAGE_SHIFT), because we are allowed to set no limit (or a very big limit) in memcg, so min_free can be very big if we don't set a limit on all the memcgs in the hierarchy.

Thanks,
Daisuke Nishimura.
[Devel] Re: [PATCH v3 05/11] memcg: add dirty page accounting infrastructure
On Mon, 18 Oct 2010 17:39:38 -0700 Greg Thelen gthe...@google.com wrote:

> Add memcg routines to track dirty, writeback, and unstable_NFS pages. These routines are not yet used by the kernel to count such pages. A later change adds kernel calls to these new routines.
>
> Signed-off-by: Greg Thelen gthe...@google.com
> Signed-off-by: Andrea Righi ari...@develer.com

Acked-by: Daisuke Nishimura nishim...@mxp.nes.nec.co.jp
[Devel] Re: [PATCH v3 09/11] memcg: add cgroupfs interface to memcg dirty limits
On Mon, 18 Oct 2010 17:39:42 -0700 Greg Thelen gthe...@google.com wrote:

> Add cgroupfs interface to memcg dirty page limits:
>
> Direct write-out is controlled with:
> - memory.dirty_ratio
> - memory.dirty_limit_in_bytes
>
> Background write-out is controlled with:
> - memory.dirty_background_ratio
> - memory.dirty_background_limit_bytes
>
> Other memcg cgroupfs files support 'M', 'm', 'k', 'K', 'g' and 'G' suffixes for byte counts. This patch provides the same functionality for memory.dirty_limit_in_bytes and memory.dirty_background_limit_bytes.
>
> Signed-off-by: Andrea Righi ari...@develer.com
> Signed-off-by: Balbir Singh bal...@linux.vnet.ibm.com
> Signed-off-by: Greg Thelen gthe...@google.com

Acked-by: Daisuke Nishimura nishim...@mxp.nes.nec.co.jp

One question: shouldn't we return -EINVAL when writing to dirty(_background)_limit_bytes a value bigger than the global one (if any)? Or do you intentionally set the input value without comparing it with the global value? But, hmm..., IMHO we should check it in __mem_cgroup_dirty_param() or somewhere so as not to allow more dirty pages than the global limit.

Thanks,
Daisuke Nishimura.
[Devel] Re: [PATCH v3 09/11] memcg: add cgroupfs interface to memcg dirty limits
On Wed, 20 Oct 2010 12:31:10 +0900 Daisuke Nishimura nishim...@mxp.nes.nec.co.jp wrote:

> One question: shouldn't we return -EINVAL when writing to dirty(_background)_limit_bytes a value bigger than the global one (if any)? Or do you intentionally set the input value without comparing it with the global value? But, hmm..., IMHO we should check it in __mem_cgroup_dirty_param() or somewhere so as not to allow more dirty pages than the global limit.

Oh, Kamezawa-san has just sent a fix for this problem :) Please ignore this comment.

Thanks,
Daisuke Nishimura.
[Devel] Re: [PATCH v3 02/11] memcg: document cgroup dirty memory interfaces
On Wed, 20 Oct 2010 11:24:31 +0900 KAMEZAWA Hiroyuki kamezawa.hir...@jp.fujitsu.com wrote:

> On Wed, 20 Oct 2010 10:14:21 +0900 KAMEZAWA Hiroyuki kamezawa.hir...@jp.fujitsu.com wrote:
> > On Wed, 20 Oct 2010 09:48:21 +0900 Daisuke Nishimura nishim...@mxp.nes.nec.co.jp wrote:
> > > (snip - the discussion, quoted earlier in this thread, of requiring a child's dirty_ratio to stay below its parent's and returning EINVAL otherwise)
> >
> > I'd like to consider a patch. Please mention that the use_hierarchy=1 case depends on the implementation, for now.
>
> BTW, how about supporting dirty_limit_in_bytes when use_hierarchy=0, or leaving it as broken when use_hierarchy=1? It seems we can only support dirty_ratio when hierarchy is used.

It's all right for me. This feature would be useful even w/o hierarchy support.

Thanks,
Daisuke Nishimura.
[Devel] Re: [PATCH v3 10/11] writeback: make determine_dirtyable_memory() static.
On Mon, 18 Oct 2010 17:39:43 -0700 Greg Thelen gthe...@google.com wrote:

> The determine_dirtyable_memory() function is not used outside of page writeback. Make the routine static. No functional change. Just a cleanup in preparation for a change that adds memcg dirty limits consideration into global_dirty_limits().
>
> Signed-off-by: Andrea Righi ari...@develer.com
> Signed-off-by: Greg Thelen gthe...@google.com

Reviewed-by: Daisuke Nishimura nishim...@mxp.nes.nec.co.jp
[Devel] Re: [PATCH v3 08/11] memcg: CPU hotplug lockdep warning fix
On Mon, 18 Oct 2010 17:39:41 -0700 Greg Thelen gthe...@google.com wrote:

> From: Balbir Singh bal...@linux.vnet.ibm.com
>
> memcg has lockdep warnings (sleep inside rcu lock)
>
> Recent move to get_online_cpus() ends up calling get_online_cpus() from mem_cgroup_read_stat(). However mem_cgroup_read_stat() is called under the rcu lock, and get_online_cpus() can sleep. The dirty limit patches expose this BUG more readily due to their usage of mem_cgroup_page_stat().
>
> This patch addresses this issue as identified by lockdep and moves the hotplug protection to a higher layer. This might increase the time required to hotplug, but not by much.
>
> Warning messages:
>
> BUG: sleeping function called from invalid context at kernel/cpu.c:62
> in_atomic(): 0, irqs_disabled(): 0, pid: 6325, name: pagetest
> 2 locks held by pagetest/6325:
>  do_page_fault+0x27d/0x4a0
>  mem_cgroup_page_stat+0x0/0x23f
> Pid: 6325, comm: pagetest Not tainted 2.6.36-rc5-mm1+ #201
> Call Trace:
>  [81041224] __might_sleep+0x12d/0x131
>  [8104f4af] get_online_cpus+0x1c/0x51
>  [8110eedb] mem_cgroup_read_stat+0x27/0xa3
>  [811125d2] mem_cgroup_page_stat+0x131/0x23f
>  [811124a1] ? mem_cgroup_page_stat+0x0/0x23f
>  [810d57c3] global_dirty_limits+0x42/0xf8
>  [810d58b3] throttle_vm_writeout+0x3a/0xb4
>  [810dc2f8] shrink_zone+0x3e6/0x3f8
>  [81074a35] ? ktime_get_ts+0xb2/0xbf
>  [810dd1aa] do_try_to_free_pages+0x106/0x478
>  [810dd601] try_to_free_mem_cgroup_pages+0xe5/0x14c
>  [8110f947] mem_cgroup_hierarchical_reclaim+0x314/0x3a2
>  [8b31] __mem_cgroup_try_charge+0x29b/0x593
>  [894a] ? __mem_cgroup_try_charge+0xb4/0x593
>  [81071258] ? local_clock+0x40/0x59
>  [81009015] ? sched_clock+0x9/0xd
>  [810710d5] ? sched_clock_local+0x1c/0x82
>  [8111398a] mem_cgroup_charge_common+0x4b/0x76
>  [81141469] ? bio_add_page+0x36/0x38
>  [81113ba9] mem_cgroup_cache_charge+0x1f4/0x214
>  [810cd195] add_to_page_cache_locked+0x4a/0x148
>
> Acked-by: Greg Thelen gthe...@google.com
> Signed-off-by: Balbir Singh bal...@linux.vnet.ibm.com

Acked-by: Daisuke Nishimura nishim...@mxp.nes.nec.co.jp
[Devel] Re: [PATCH v3 11/11] memcg: check memcg dirty limits in page writeback
On Wed, 20 Oct 2010 13:18:57 +0900 KAMEZAWA Hiroyuki kamezawa.hir...@jp.fujitsu.com wrote:

> On Tue, 19 Oct 2010 10:00:15 +0900 KAMEZAWA Hiroyuki kamezawa.hir...@jp.fujitsu.com wrote:
> > On Mon, 18 Oct 2010 17:39:44 -0700 Greg Thelen gthe...@google.com wrote:
> > > If the current process is in a non-root memcg, then global_dirty_limits() will consider the memcg dirty limit. This allows different cgroups to have distinct dirty limits which trigger direct and background writeback at different levels.
> > >
> > > Signed-off-by: Andrea Righi ari...@develer.com
> > > Signed-off-by: Greg Thelen gthe...@google.com
> >
> > Reviewed-by: KAMEZAWA Hiroyuki kamezawa.hir...@jp.fujitsu.com
>
> Why is FREEPAGES in memcg not counted as dirtyable?

It's counted as memcg_hierarchical_free_pages() in mem_cgroup_page_stat(). As for FREEPAGES, we should count ancestors, not children.

mem_cgroup_page_stat():

	if (mem && !mem_cgroup_is_root(mem)) {
		/*
		 * If we're looking for dirtyable pages we need to evaluate
		 * free pages depending on the limit and usage of the parents
		 * first of all.
		 */
		if (item == MEMCG_NR_DIRTYABLE_PAGES)
			value = memcg_hierarchical_free_pages(mem);
		else
			value = 0;
		/*
		 * Recursively evaluate page statistics against all cgroup
		 * under hierarchy tree
		 */
		for_each_mem_cgroup_tree(iter, mem)
			value += mem_cgroup_local_page_stat(iter, item);
[Devel] Re: [PATCH v3 11/11] memcg: check memcg dirty limits in page writeback
On Mon, 18 Oct 2010 17:39:44 -0700 Greg Thelen gthe...@google.com wrote:

> If the current process is in a non-root memcg, then global_dirty_limits() will consider the memcg dirty limit. This allows different cgroups to have distinct dirty limits which trigger direct and background writeback at different levels.
>
> Signed-off-by: Andrea Righi ari...@develer.com
> Signed-off-by: Greg Thelen gthe...@google.com

Reviewed-by: Daisuke Nishimura nishim...@mxp.nes.nec.co.jp
[Devel] Re: [PATCH v3 01/11] memcg: add page_cgroup flags for dirty page tracking
On Mon, 18 Oct 2010 17:39:34 -0700 Greg Thelen gthe...@google.com wrote:

> Add additional flags to page_cgroup to track dirty pages within a mem_cgroup.
>
> Signed-off-by: KAMEZAWA Hiroyuki kamezawa.hir...@jp.fujitsu.com
> Signed-off-by: Andrea Righi ari...@develer.com
> Signed-off-by: Greg Thelen gthe...@google.com

Acked-by: Daisuke Nishimura nishim...@mxp.nes.nec.co.jp
[Devel] Re: [PATCH v3 03/11] memcg: create extensible page stat update routines
On Mon, 18 Oct 2010 17:39:36 -0700 Greg Thelen gthe...@google.com wrote:

> Replace usage of the mem_cgroup_update_file_mapped() memcg statistic update routine with two new routines:
> * mem_cgroup_inc_page_stat()
> * mem_cgroup_dec_page_stat()
>
> As before, only the file_mapped statistic is managed. However, these more general interfaces allow for new statistics to be more easily added. New statistics are added with memcg dirty page accounting.
>
> Signed-off-by: Greg Thelen gthe...@google.com
> Signed-off-by: Andrea Righi ari...@develer.com

Acked-by: Daisuke Nishimura nishim...@mxp.nes.nec.co.jp
[Devel] Re: [PATCH v3 04/11] memcg: add lock to synchronize page accounting and migration
On Mon, 18 Oct 2010 17:39:37 -0700 Greg Thelen gthe...@google.com wrote:

> From: KAMEZAWA Hiroyuki kamezawa.hir...@jp.fujitsu.com
>
> Introduce a new bit spin lock, PCG_MOVE_LOCK, to synchronize the page accounting and migration code. This reworks the locking scheme of _update_stat() and _move_account() by adding a new lock bit PCG_MOVE_LOCK, which is always taken under IRQ disable.
>
> 1. If pages are being migrated from a memcg, then updates to that memcg's page statistics are protected by grabbing PCG_MOVE_LOCK using move_lock_page_cgroup(). In an upcoming commit, memcg dirty page accounting will be updating memcg page accounting (specifically: num writeback pages) from IRQ context (softirq). Avoid a deadlocking nested spin lock attempt by disabling irq on the local processor when grabbing the PCG_MOVE_LOCK.
>
> 2. The lock for update_page_stat is used only for avoiding races with move_account(). So, IRQ awareness of lock_page_cgroup() itself is not a problem. The problem is between mem_cgroup_update_page_stat() and mem_cgroup_move_account_page().
>
> Trade-off:
> * Changing lock_page_cgroup() to always disable IRQ (or local_bh) has some impacts on performance and I think it's bad to disable IRQ when it's not necessary.
> * Adding a new lock makes move_account() slower.
>
> Score is here. Performance impact: moving an 8G anon process.
>
> Before:
> 	real	0m0.792s
> 	user	0m0.000s
> 	sys	0m0.780s
>
> After:
> 	real	0m0.854s
> 	user	0m0.000s
> 	sys	0m0.842s
>
> This score is bad but planned patches for optimization can reduce this impact.
>
> Signed-off-by: KAMEZAWA Hiroyuki kamezawa.hir...@jp.fujitsu.com
> Signed-off-by: Greg Thelen gthe...@google.com

This approach is more straightforward and easy to understand.

Acked-by: Daisuke Nishimura nishim...@mxp.nes.nec.co.jp
[Devel] Re: [PATCH v4] memcg: reduce lock time at move charge
On Tue, 12 Oct 2010 14:48:01 +0900 KAMEZAWA Hiroyuki kamezawa.hir...@jp.fujitsu.com wrote: On Tue, 12 Oct 2010 12:56:13 +0900 Daisuke Nishimura nishim...@mxp.nes.nec.co.jp wrote: +err_out: + for (; mt < info + num; mt++) + if (mt->type == MC_TARGET_PAGE) { + putback_lru_page(mt->val.page); Is this putback_lru_page() necessary ? is_target_pte_for_mc() doesn't isolate the page. Ok, v4 here. Tested failure path and success path. Looks good to me. Acked-by: Daisuke Nishimura nishim...@mxp.nes.nec.co.jp Thanks, Daisuke Nishimura.
[Devel] Re: [PATCH v2] memcg: reduce lock time at move charge (Was Re: [PATCH 04/10] memcg: disable local interrupts in lock_page_cgroup()
+err_out: + for (; mt < info + num; mt++) + if (mt->type == MC_TARGET_PAGE) { + putback_lru_page(mt->val.page); Is this putback_lru_page() necessary ? is_target_pte_for_mc() doesn't isolate the page. Thanks, Daisuke Nishimura. + put_page(mt->val.page); + } + goto out; } static void mem_cgroup_move_charge(struct mm_struct *mm)
[Devel] Re: [PATCH v2] memcg: reduce lock time at move charge (Was Re: [PATCH 04/10] memcg: disable local interrupts in lock_page_cgroup()
On Thu, 7 Oct 2010 16:14:54 -0700 Andrew Morton a...@linux-foundation.org wrote: On Thu, 7 Oct 2010 17:04:05 +0900 KAMEZAWA Hiroyuki kamezawa.hir...@jp.fujitsu.com wrote: Now, at task migration among cgroups, memory cgroup scans the page table and moves accounting if flags are properly set. The core code, mem_cgroup_move_charge_pte_range(), does pte_offset_map_lock(); for all ptes in a page table: 1. look into page table, find_and_get a page 2. remove it from LRU. 3. move charge. 4. putback to LRU. put_page() pte_offset_map_unlock(); for pte entries on a 3rd level(?) page table. This pte_offset_map_lock section seems a bit long. This patch modifies the routine to work in batches of 32 pages: pte_offset_map_lock() find_and_get a page, record it pte_offset_map_unlock() for all recorded pages: isolate from LRU, move charge, putback to LRU; for all recorded pages: put_page() The patch makes the code larger, more complex and slower! Before this patch: text data bss dec hex filename 27163 11782 4100 43045 a825 mm/memcontrol.o After this patch: text data bss dec hex filename 27307 12294 4100 43701 aab5 mm/memcontrol.o hmm, allocating mc.target[] statically might be bad, but I'm now wondering whether I could allocate mc itself dynamically (I'll try). I do think we're owed a more complete description of its benefits than seems a bit long. Have problems been observed? Any measurements taken? IIUC, this patch is necessary for [PATCH] memcg: lock-free clear page writeback later, but I agree we should describe it. Thanks, Daisuke Nishimura.
[Devel] Re: [RFC] Restrict size of page_cgroup->flags
On Wed, 6 Oct 2010 19:53:14 +0530 Balbir Singh bal...@linux.vnet.ibm.com wrote: I propose restricting page_cgroup.flags to 16 bits. The patch for the same is below. Comments? Restrict the bits usage in page_cgroup.flags From: Balbir Singh bal...@linux.vnet.ibm.com Restricting the flags helps control their unbounded growth. Restricting it to 16 bits gives us the possibility of merging the cgroup id with the flags (atomicity permitting) and saving a whole long word in page_cgroup. I agree that reducing the size of page_cgroup would be good and important. But, wouldn't it be better to remove ->page, if possible ? Thanks, Daisuke Nishimura. Signed-off-by: Balbir Singh bal...@linux.vnet.ibm.com --- include/linux/page_cgroup.h |3 +++ mm/page_cgroup.c|1 + 2 files changed, 4 insertions(+), 0 deletions(-) diff --git a/include/linux/page_cgroup.h b/include/linux/page_cgroup.h index 872f6b1..10c37b4 100644 --- a/include/linux/page_cgroup.h +++ b/include/linux/page_cgroup.h @@ -44,8 +44,11 @@ enum { PCG_FILE_WRITEBACK, /* page is under writeback */ PCG_FILE_UNSTABLE_NFS, /* page is NFS unstable */ PCG_MIGRATION, /* under page migration */ + PCG_MAX_NR, }; +#define PCG_MAX_BIT_SIZE 16 + #define TESTPCGFLAG(uname, lname)\ static inline int PageCgroup##uname(struct page_cgroup *pc) \ { return test_bit(PCG_##lname, &pc->flags); } diff --git a/mm/page_cgroup.c b/mm/page_cgroup.c index 5bffada..e16ad2e 100644 --- a/mm/page_cgroup.c +++ b/mm/page_cgroup.c @@ -258,6 +258,7 @@ void __init page_cgroup_init(void) unsigned long pfn; int fail = 0; + BUILD_BUG_ON(PCG_MAX_NR >= PCG_MAX_BIT_SIZE); if (mem_cgroup_disabled()) return; -- Three Cheers, Balbir
[Devel] Re: [PATCH 04/10] memcg: disable local interrupts in lock_page_cgroup()
On Thu, 7 Oct 2010 09:35:45 +0900 KAMEZAWA Hiroyuki kamezawa.hir...@jp.fujitsu.com wrote: On Wed, 6 Oct 2010 09:15:34 +0900 Minchan Kim minchan@gmail.com wrote: First of all, we could add your patch as it is and I don't expect any regression report about interrupt latency. That's because many embedded guys don't use mmotm and have a tendency to not report VM regressions. They don't even use memcg. Hmm... I pass the decision to the maintainers, Kame and Balbir. Thanks for the detailed explanation. Hmm. IRQ delay is a concern. So, my option is this. How do you think ? 1. remove local_irq_save()/restore() in lock/unlock_page_cgroup(). yes, I don't like it. 2. At moving charge, do this: a) lock_page() or trylock_page() b) wait_on_page_writeback() c) do move_account under lock_page_cgroup(). d) unlock_page() Then, writeback updates will never come from IRQ context while lock/unlock_page_cgroup() is held by move_account(). There will be no race. hmm, if we'll do that, I think we need to do that under pte_lock in mem_cgroup_move_charge_pte_range(). But, we can't do wait_on_page_writeback() under pte_lock, right? Or, we need to re-organize the current move-charge implementation. Thanks, Daisuke Nishimura.
[Devel] Re: [RFC] Restrict size of page_cgroup->flags
On Thu, 7 Oct 2010 08:44:59 +0530 Balbir Singh bal...@linux.vnet.ibm.com wrote: * nishim...@mxp.nes.nec.co.jp nishim...@mxp.nes.nec.co.jp [2010-10-07 09:54:58]: On Wed, 6 Oct 2010 19:53:14 +0530 Balbir Singh bal...@linux.vnet.ibm.com wrote: I propose restricting page_cgroup.flags to 16 bits. The patch for the same is below. Comments? Restrict the bits usage in page_cgroup.flags From: Balbir Singh bal...@linux.vnet.ibm.com Restricting the flags helps control their unbounded growth. Restricting it to 16 bits gives us the possibility of merging the cgroup id with the flags (atomicity permitting) and saving a whole long word in page_cgroup. I agree that reducing the size of page_cgroup would be good and important. But, wouldn't it be better to remove ->page, if possible ? Without the page pointer, how do we go from pc to page for reclaim? We store page_cgroups in arrays now, so I suppose we can implement pc_to_pfn() using a calculation similar to what page_to_pfn() does. IIRC, KAMEZAWA-san talked about it in another thread. Thanks, Daisuke Nishimura.
[Devel] Re: [RFC] Restrict size of page_cgroup->flags
On Thu, 7 Oct 2010 13:22:33 +0900 KAMEZAWA Hiroyuki kamezawa.hir...@jp.fujitsu.com wrote: On Thu, 7 Oct 2010 09:26:08 +0530 Balbir Singh bal...@linux.vnet.ibm.com wrote: * KAMEZAWA Hiroyuki kamezawa.hir...@jp.fujitsu.com [2010-10-07 12:18:16]: On Thu, 7 Oct 2010 08:42:04 +0530 Balbir Singh bal...@linux.vnet.ibm.com wrote: * KAMEZAWA Hiroyuki kamezawa.hir...@jp.fujitsu.com [2010-10-07 08:58:58]: On Wed, 6 Oct 2010 19:53:14 +0530 Balbir Singh bal...@linux.vnet.ibm.com wrote: I propose restricting page_cgroup.flags to 16 bits. The patch for the same is below. Comments? Restrict the bits usage in page_cgroup.flags From: Balbir Singh bal...@linux.vnet.ibm.com Restricting the flags helps control growth of the flags unbound. Restriciting it to 16 bits gives us the possibility of merging cgroup id with flags (atomicity permitting) and saving a whole long word in page_cgroup Signed-off-by: Balbir Singh bal...@linux.vnet.ibm.com Doesn't make sense until you show the usage of existing bits. ?? Limiting something for NOT EXISTING PATCH doesn't make sense, in general. And I guess 16bit may be too large on 32bit systems. too large on 32 bit systems? My intention is to keep the flags to 16 bits and then use cgroup id for the rest and see if we can remove mem_cgroup pointer You can't use flags field to store mem_cgroup_id while we use lock bit on it. We have to store something more stable...as pfn or node-id or zone-id. It's very racy. Yes, correct it is racy, there is no easy way from what I know we can write the upper 16 bits of the flag without affecting the lower 16 bits, if the 16 bits are changing. One of the techniques could be to have lock for the unsigned long word itself - but I don't know what performance overhead that would add. Having said that I would like to explore techniques that allow me to merge the two. to store pfn, we need to limit under 12bit. I'm sorry if I miss something, but is it valid in PAE case too ? 
I think it would be better to store a section id (or node id) rather than the pfn. I'll schedule my patch if the dirty_ratio one goes in. Agreed. I think we should do the dirty_ratio patches first. Thanks, Daisuke Nishimura.
[Devel] Re: [PATCH 01/10] memcg: add page_cgroup flags for dirty page tracking
On Sun, 3 Oct 2010 23:57:56 -0700 Greg Thelen gthe...@google.com wrote: Add additional flags to page_cgroup to track dirty pages within a mem_cgroup. Signed-off-by: KAMEZAWA Hiroyuki kamezawa.hir...@jp.fujitsu.com Signed-off-by: Andrea Righi ari...@develer.com Signed-off-by: Greg Thelen gthe...@google.com Acked-by: Daisuke Nishimura nishim...@mxp.nes.nec.co.jp Thanks, Daisuke Nishimura. --- include/linux/page_cgroup.h | 23 +++ 1 files changed, 23 insertions(+), 0 deletions(-) diff --git a/include/linux/page_cgroup.h b/include/linux/page_cgroup.h index 5bb13b3..b59c298 100644 --- a/include/linux/page_cgroup.h +++ b/include/linux/page_cgroup.h @@ -40,6 +40,9 @@ enum { PCG_USED, /* this object is in use. */ PCG_ACCT_LRU, /* page has been accounted for */ PCG_FILE_MAPPED, /* page is accounted as mapped */ + PCG_FILE_DIRTY, /* page is dirty */ + PCG_FILE_WRITEBACK, /* page is under writeback */ + PCG_FILE_UNSTABLE_NFS, /* page is NFS unstable */ PCG_MIGRATION, /* under page migration */ }; @@ -59,6 +62,10 @@ static inline void ClearPageCgroup##uname(struct page_cgroup *pc) \ static inline int TestClearPageCgroup##uname(struct page_cgroup *pc) \ { return test_and_clear_bit(PCG_##lname, &pc->flags); } +#define TESTSETPCGFLAG(uname, lname) \ +static inline int TestSetPageCgroup##uname(struct page_cgroup *pc) \ + { return test_and_set_bit(PCG_##lname, &pc->flags); } + TESTPCGFLAG(Locked, LOCK) /* Cache flag is set only once (at allocation) */ @@ -80,6 +87,22 @@ SETPCGFLAG(FileMapped, FILE_MAPPED) CLEARPCGFLAG(FileMapped, FILE_MAPPED) TESTPCGFLAG(FileMapped, FILE_MAPPED) +SETPCGFLAG(FileDirty, FILE_DIRTY) +CLEARPCGFLAG(FileDirty, FILE_DIRTY) +TESTPCGFLAG(FileDirty, FILE_DIRTY) +TESTCLEARPCGFLAG(FileDirty, FILE_DIRTY) +TESTSETPCGFLAG(FileDirty, FILE_DIRTY) + +SETPCGFLAG(FileWriteback, FILE_WRITEBACK) +CLEARPCGFLAG(FileWriteback, FILE_WRITEBACK) +TESTPCGFLAG(FileWriteback, FILE_WRITEBACK) + +SETPCGFLAG(FileUnstableNFS, FILE_UNSTABLE_NFS) +CLEARPCGFLAG(FileUnstableNFS, FILE_UNSTABLE_NFS) +TESTPCGFLAG(FileUnstableNFS, FILE_UNSTABLE_NFS) +TESTCLEARPCGFLAG(FileUnstableNFS, FILE_UNSTABLE_NFS) +TESTSETPCGFLAG(FileUnstableNFS, FILE_UNSTABLE_NFS) + SETPCGFLAG(Migration, MIGRATION) CLEARPCGFLAG(Migration, MIGRATION) TESTPCGFLAG(Migration, MIGRATION) -- 1.7.1
[Devel] Re: [PATCH 02/10] memcg: document cgroup dirty memory interfaces
On Sun, 3 Oct 2010 23:57:57 -0700 Greg Thelen gthe...@google.com wrote: Document cgroup dirty memory interfaces and statistics. Signed-off-by: Andrea Righi ari...@develer.com Signed-off-by: Greg Thelen gthe...@google.com I think you will change nfs to nfs_unstable, but anyway, Acked-by: Daisuke Nishimura nishim...@mxp.nes.nec.co.jp Thanks Daisuke Nishimura. --- Documentation/cgroups/memory.txt | 37 + 1 files changed, 37 insertions(+), 0 deletions(-) diff --git a/Documentation/cgroups/memory.txt b/Documentation/cgroups/memory.txt index 7781857..eab65e2 100644 --- a/Documentation/cgroups/memory.txt +++ b/Documentation/cgroups/memory.txt @@ -385,6 +385,10 @@ mapped_file - # of bytes of mapped file (includes tmpfs/shmem) pgpgin - # of pages paged in (equivalent to # of charging events). pgpgout - # of pages paged out (equivalent to # of uncharging events). swap - # of bytes of swap usage +dirty- # of bytes that are waiting to get written back to the disk. +writeback- # of bytes that are actively being written back to the disk. +nfs - # of bytes sent to the NFS server, but not yet committed to + the actual storage. inactive_anon- # of bytes of anonymous memory and swap cache memory on LRU list. active_anon - # of bytes of anonymous and swap cache memory on active @@ -453,6 +457,39 @@ memory under it will be reclaimed. You can reset failcnt by writing 0 to failcnt file. # echo 0 .../memory.failcnt +5.5 dirty memory + +Control the maximum amount of dirty pages a cgroup can have at any given time. + +Limiting dirty memory is like fixing the max amount of dirty (hard to reclaim) +page cache used by a cgroup. So, in case of multiple cgroup writers, they will +not be able to consume more than their designated share of dirty pages and will +be forced to perform write-out if they cross that limit. + +The interface is equivalent to the procfs interface: /proc/sys/vm/dirty_*. 
It +is possible to configure a limit to trigger both a direct writeback or a +background writeback performed by per-bdi flusher threads. The root cgroup +memory.dirty_* control files are read-only and match the contents of +the /proc/sys/vm/dirty_* files. + +Per-cgroup dirty limits can be set using the following files in the cgroupfs: + +- memory.dirty_ratio: the amount of dirty memory (expressed as a percentage of + cgroup memory) at which a process generating dirty pages will itself start + writing out dirty data. + +- memory.dirty_bytes: the amount of dirty memory (expressed in bytes) in the + cgroup at which a process generating dirty pages will start itself writing out + dirty data. + +- memory.dirty_background_ratio: the amount of dirty memory of the cgroup + (expressed as a percentage of cgroup memory) at which background writeback + kernel threads will start writing out dirty data. + +- memory.dirty_background_bytes: the amount of dirty memory (expressed in bytes) + in the cgroup at which background writeback kernel threads will start writing + out dirty data. + 6. Hierarchy support The memory controller supports a deep hierarchy and hierarchical accounting. -- 1.7.1
[Devel] Re: Cgroup OOM bug
On Mon, 10 May 2010 09:01:31 +0900, KAMEZAWA Hiroyuki kamezawa.hir...@jp.fujitsu.com wrote: On Fri, 7 May 2010 14:30:21 -0700 Jeremy Hanmer jer...@hq.newdream.net wrote: I've found a reproducible case where you can create an infinite loop inside the oom killer if you create a cgroup with a single process that has oom_adj=-17 and trigger an OOM situation inside that cgroup. It seems that oom_adj should probably be ignored in the case where it makes the OOM situation unsolvable. Here's an excerpt from the console output when the bug is triggered: Memory cgroup out of memory: kill process 25306 (memeater) score 0 or a child Memory cgroup out of memory: kill process 25306 (memeater) score 0 or a child Memory cgroup out of memory: kill process 25306 (memeater) score 0 or a child Memory cgroup out of memory: kill process 25306 (memeater) score 0 or a child Memory cgroup out of memory: kill process 25306 (memeater) score 0 or a child Memory cgroup out of memory: kill process 25306 (memeater) score 0 or a child Memory cgroup out of memory: kill process 25306 (memeater) score 0 or a child Memory cgroup out of memory: kill process 25306 (memeater) score 0 or a child Memory cgroup out of memory: kill process 25306 (memeater) score 0 or a child Memory cgroup out of memory: kill process 25306 (memeater) score 0 or a child Memory cgroup out of memory: kill process 25306 (memeater) score 0 or a child Memory cgroup out of memory: kill process 25306 (memeater) score 0 or a child Now, in the -mm tree, 2 options are supported and they are under test. 1. oom_notifier ... an event notifier which notifies OOM. 2. oom_disable ... disable the OOM-killer. All threads will wait for the end of OOM. No cpu consumption. With 1+2, a user (a user-land daemon) can implement its own OOM handler. Please wait for it. And, you also need memcg-make-oom-killer-a-no-op-when-no-killable-task-can-be-found.patch. IIUC, we will be stuck in the infinite loop inside mem_cgroup_out_of_memory() otherwise.
Thanks, Daisuke Nishimura.
[Devel] Re: [PATCH -mmotm 1/5] memcg: disable irq at page cgroup lock
On Wed, 14 Apr 2010 21:48:25 -0700, Greg Thelen gthe...@google.com wrote: On Wed, Apr 14, 2010 at 7:40 PM, Daisuke Nishimura nishim...@mxp.nes.nec.co.jp wrote: On Wed, 14 Apr 2010 13:14:07 -0700, Greg Thelen gthe...@google.com wrote: Vivek Goyal vgo...@redhat.com writes: On Tue, Apr 13, 2010 at 11:55:12PM -0700, Greg Thelen wrote: On Thu, Mar 18, 2010 at 8:00 PM, KAMEZAWA Hiroyuki kamezawa.hir...@jp.fujitsu.com wrote: On Fri, 19 Mar 2010 08:10:39 +0530 Balbir Singh bal...@linux.vnet.ibm.com wrote: * KAMEZAWA Hiroyuki kamezawa.hir...@jp.fujitsu.com [2010-03-19 10:23:32]: On Thu, 18 Mar 2010 21:58:55 +0530 Balbir Singh bal...@linux.vnet.ibm.com wrote: * KAMEZAWA Hiroyuki kamezawa.hir...@jp.fujitsu.com [2010-03-18 13:35:27]: Then, no problem. It's ok to add mem_cgroup_update_stat() independent from mem_cgroup_update_file_mapped(). The look may be messy but it's not your fault. But please write why you add a new function in the patch description. I'm sorry for wasting your time. Do we need to go down this route? We could check the stat and do the correct thing. In case of FILE_MAPPED, always grab page_cgroup_lock and for others potentially look at trylock. It is OK for different stats to be protected via different locks. I _don't_ want to see a mixture of spinlock and trylock in a function. A well documented well written function can help. The other thing is to of course solve this correctly by introducing different locking around the statistics. Are you suggesting the latter? No. As I wrote: - don't modify codes around FILE_MAPPED in this series. - add new functions for new statistics. Then, - think about clean up later, after we confirm all things work as expected. I have ported Andrea Righi's memcg dirty page accounting patches to the latest mmotm-2010-04-05-16-09. In doing so I had to address this locking issue. Does the following look good? I will (of course) submit the entire patch for review, but I wanted to make sure I was aiming in the right direction.
void mem_cgroup_update_page_stat(struct page *page, enum mem_cgroup_write_page_stat_item idx, bool charge) { static int seq; struct page_cgroup *pc; if (mem_cgroup_disabled()) return; pc = lookup_page_cgroup(page); if (!pc || mem_cgroup_is_root(pc->mem_cgroup)) return; /* * This routine does not disable irq when updating stats. So it is * possible that a stat update from within an interrupt routine could * deadlock. Use trylock_page_cgroup() to avoid such deadlock. This * makes the memcg counters fuzzy. More complicated, or lower * performing locking solutions avoid this fuzziness, but are not * currently needed. */ if (irqs_disabled()) { ^ Or may be in_interrupt()? Good catch. I will replace irqs_disabled() with in_interrupt(). I think you should check both. __remove_from_page_cache(), which will update DIRTY, is called with irq disabled (iow, under mapping->tree_lock) but not in interrupt context. The only reason to use trylock in this case is to prevent deadlock when running in a context that may have preempted or interrupted a routine that already holds the bit locked. In __remove_from_page_cache() irqs are disabled, but that does not imply that a routine holding the spinlock has been preempted. When the bit is locked, preemption is disabled. The only way to interrupt a holder of the bit is for an interrupt to occur (I/O, timer, etc). So I think that in_interrupt() is sufficient. Am I missing something? IIUC, it would be enough to prevent a deadlock where one CPU tries to acquire the same page cgroup lock. But there is still some possibility where 2 CPUs can deadlock with each other (please see commit e767e056). IOW, my point is: don't call lock_page_cgroup() under mapping->tree_lock. Thanks, Daisuke Nishimura.
[Devel] Re: [PATCH -mmotm 1/5] memcg: disable irq at page cgroup lock
On Wed, 14 Apr 2010 13:14:07 -0700, Greg Thelen gthe...@google.com wrote: Vivek Goyal vgo...@redhat.com writes: On Tue, Apr 13, 2010 at 11:55:12PM -0700, Greg Thelen wrote: On Thu, Mar 18, 2010 at 8:00 PM, KAMEZAWA Hiroyuki kamezawa.hir...@jp.fujitsu.com wrote: On Fri, 19 Mar 2010 08:10:39 +0530 Balbir Singh bal...@linux.vnet.ibm.com wrote: * KAMEZAWA Hiroyuki kamezawa.hir...@jp.fujitsu.com [2010-03-19 10:23:32]: On Thu, 18 Mar 2010 21:58:55 +0530 Balbir Singh bal...@linux.vnet.ibm.com wrote: * KAMEZAWA Hiroyuki kamezawa.hir...@jp.fujitsu.com [2010-03-18 13:35:27]: Then, no problem. It's ok to add mem_cgroup_update_stat() independent from mem_cgroup_update_file_mapped(). The look may be messy but it's not your fault. But please write why you add a new function in the patch description. I'm sorry for wasting your time. Do we need to go down this route? We could check the stat and do the correct thing. In case of FILE_MAPPED, always grab page_cgroup_lock and for others potentially look at trylock. It is OK for different stats to be protected via different locks. I _don't_ want to see a mixture of spinlock and trylock in a function. A well documented well written function can help. The other thing is to of course solve this correctly by introducing different locking around the statistics. Are you suggesting the latter? No. As I wrote: - don't modify codes around FILE_MAPPED in this series. - add new functions for new statistics. Then, - think about clean up later, after we confirm all things work as expected. I have ported Andrea Righi's memcg dirty page accounting patches to the latest mmotm-2010-04-05-16-09. In doing so I had to address this locking issue. Does the following look good? I will (of course) submit the entire patch for review, but I wanted to make sure I was aiming in the right direction.
void mem_cgroup_update_page_stat(struct page *page, enum mem_cgroup_write_page_stat_item idx, bool charge) { static int seq; struct page_cgroup *pc; if (mem_cgroup_disabled()) return; pc = lookup_page_cgroup(page); if (!pc || mem_cgroup_is_root(pc->mem_cgroup)) return; /* * This routine does not disable irq when updating stats. So it is * possible that a stat update from within an interrupt routine could * deadlock. Use trylock_page_cgroup() to avoid such deadlock. This * makes the memcg counters fuzzy. More complicated, or lower * performing locking solutions avoid this fuzziness, but are not * currently needed. */ if (irqs_disabled()) { ^ Or may be in_interrupt()? Good catch. I will replace irqs_disabled() with in_interrupt(). I think you should check both. __remove_from_page_cache(), which will update DIRTY, is called with irq disabled (iow, under mapping->tree_lock) but not in interrupt context. Anyway, I tend to agree with KAMEZAWA-san: use trylock always (except for FILE_MAPPED), or add some new interfaces (e.g. mem_cgroup_update_stat_locked/safe...). Thanks, Daisuke Nishimura.
[Devel] Re: [PATCH -mmotm 1/5] memcg: disable irq at page cgroup lock
On Thu, 18 Mar 2010 09:45:19 +0900, KAMEZAWA Hiroyuki kamezawa.hir...@jp.fujitsu.com wrote: On Thu, 18 Mar 2010 08:54:11 +0900 KAMEZAWA Hiroyuki kamezawa.hir...@jp.fujitsu.com wrote: On Wed, 17 Mar 2010 17:28:55 +0530 Balbir Singh bal...@linux.vnet.ibm.com wrote: * Andrea Righi ari...@develer.com [2010-03-15 00:26:38]: From: KAMEZAWA Hiroyuki kamezawa.hir...@jp.fujitsu.com Now, file-mapped is maintained. But a more generic update function will be needed for dirty page accounting. For accounting page status, we have to guarantee lock_page_cgroup() will never be called with tree_lock held. To guarantee that, we use trylock when updating status. By this, we do fuzzy accounting, but in almost all cases, it's correct. I don't like this at all, but in almost all cases is not acceptable for statistics, since decisions will be made on them and having them incorrect is really bad. Could we do a form of deferred statistics and fix this. Please show your implementation which has no performance regression. For me, I don't need file_mapped accounting, at all. If we can remove that, we can add a simple migration lock. file_mapped is a feature you added. please improve it. BTW, I should explain how accurate this accounting is in this patch itself. Now, lock_page_cgroup/unlock_page_cgroup happens at - charge/uncharge/migrate/move accounting Then, the lock contention (trylock failure) seems to occur in conflict with - charge, uncharge, migrate, move accounting About dirty accounting, charge/uncharge/migrate are operations synchronous with the radix-tree (holding tree_lock etc). Then no account leak. move accounting is the only source of inaccuracy...but I don't think this move-task is critical. moreover, we don't move any file pages at task-move, now. (But Nishimura-san has a plan to do so.) So, contention will happen only in conflict with force_empty. About FILE_MAPPED accounting, it's not synchronous with radix-tree operations.
Then, an accounting miss seems possible at charge/uncharge/migrate/account move. But charge: we don't account a page as FILE_MAPPED before it's charged. uncharge: the usual operation in truncation is unmap -> remove-from-radix-tree. Then, it's sequential in almost all cases. The race exists when... Assume there are 2 threads A and B. A truncates a file, B maps/unmaps it. This is a very unusual conflict. migrate: we do try_to_unmap before migrating pages. Then, FILE_MAPPED is properly handled. move account: we don't have move-account-mapped-file, yet. FILE_MAPPED is updated under pte lock. OTOH, move account is also done under pte lock. page cgroup lock is held under pte lock in both cases, so move account is not so much of a problem for FILE_MAPPED. Thanks, Daisuke Nishimura. Then, this trylock contention happens only in conflict with force_empty and truncate. Then, the main issue for contention is force_empty. But it's called for removing a memcg, and accounting for such a memcg is not important. Then, I say this accounting is okay. To be more accurate, we may need another migration lock. But to get better performance for the root cgroup, we have to call mem_cgroup_is_root() before taking the lock and there will be another complicated race. Bye, -Kame
[Devel] Re: [PATCH -mmotm 2/5] memcg: dirty memory documentation
On Mon, 15 Mar 2010 00:26:39 +0100, Andrea Righi ari...@develer.com wrote: Document cgroup dirty memory interfaces and statistics. Signed-off-by: Andrea Righi ari...@develer.com --- Documentation/cgroups/memory.txt | 36 1 files changed, 36 insertions(+), 0 deletions(-) diff --git a/Documentation/cgroups/memory.txt b/Documentation/cgroups/memory.txt index 49f86f3..38ca499 100644 --- a/Documentation/cgroups/memory.txt +++ b/Documentation/cgroups/memory.txt @@ -310,6 +310,11 @@ cache- # of bytes of page cache memory. rss - # of bytes of anonymous and swap cache memory. pgpgin - # of pages paged in (equivalent to # of charging events). pgpgout - # of pages paged out (equivalent to # of uncharging events). +filedirty- # of pages that are waiting to get written back to the disk. +writeback- # of pages that are actively being written back to the disk. +writeback_tmp- # of pages used by FUSE for temporary writeback buffers. +nfs - # of NFS pages sent to the server, but not yet committed to + the actual storage. active_anon - # of bytes of anonymous and swap cache memory on active lru list. inactive_anon- # of bytes of anonymous memory and swap cache memory on @@ -345,6 +350,37 @@ Note: - a cgroup which uses hierarchy and it has child cgroup. - a cgroup which uses hierarchy and not the root of hierarchy. +5.4 dirty memory + + Control the maximum amount of dirty pages a cgroup can have at any given time. + + Limiting dirty memory is like fixing the max amount of dirty (hard to + reclaim) page cache used by any cgroup. So, in case of multiple cgroup writers, + they will not be able to consume more than their designated share of dirty + pages and will be forced to perform write-out if they cross that limit. + + The interface is equivalent to the procfs interface: /proc/sys/vm/dirty_*. + It is possible to configure a limit to trigger both a direct writeback or a + background writeback performed by per-bdi flusher threads. 
+ + Per-cgroup dirty limits can be set using the following files in the cgroupfs: + + - memory.dirty_ratio: contains, as a percentage of cgroup memory, the +amount of dirty memory at which a process which is generating disk writes +inside the cgroup will start itself writing out dirty data. + + - memory.dirty_bytes: the amount of dirty memory of the cgroup (expressed in +bytes) at which a process generating disk writes will start itself writing +out dirty data. + + - memory.dirty_background_ratio: contains, as a percentage of the cgroup +memory, the amount of dirty memory at which background writeback kernel +threads will start writing out dirty data. + + - memory.dirty_background_bytes: the amount of dirty memory of the cgroup (in +bytes) at which background writeback kernel threads will start writing out +dirty data. + It would be better to note what those files mean for the root cgroup. We cannot write any value to them; IOW, we cannot control the dirty limits of the root cgroup. And they show the same value as the global one (strictly speaking, that's not true because the global values can change. We need a hook in mem_cgroup_dirty_read()?). Thanks, Daisuke Nishimura. 6. Hierarchy support -- 1.6.3.3
[Devel] Re: [PATCH -mmotm 4/5] memcg: dirty pages accounting and limiting infrastructure
On Tue, 16 Mar 2010 10:11:50 -0400 Vivek Goyal vgo...@redhat.com wrote:

On Tue, Mar 16, 2010 at 11:32:38AM +0900, Daisuke Nishimura wrote:
[..]
+ * mem_cgroup_page_stat() - get memory cgroup file cache statistics
+ * @item:	memory statistic item exported to the kernel
+ *
+ * Return the accounted statistic value, or a negative value in case of error.
+ */
+s64 mem_cgroup_page_stat(enum mem_cgroup_read_page_stat_item item)
+{
+	struct mem_cgroup_page_stat stat = {};
+	struct mem_cgroup *mem;
+
+	rcu_read_lock();
+	mem = mem_cgroup_from_task(current);
+	if (mem && !mem_cgroup_is_root(mem)) {
+		/*
+		 * If we're looking for dirtyable pages we need to evaluate
+		 * free pages depending on the limit and usage of the parents
+		 * first of all.
+		 */
+		if (item == MEMCG_NR_DIRTYABLE_PAGES)
+			stat.value = memcg_get_hierarchical_free_pages(mem);
+		/*
+		 * Recursively evaluate page statistics against all cgroups
+		 * under the hierarchy tree.
+		 */
+		stat.item = item;
+		mem_cgroup_walk_tree(mem, &stat, mem_cgroup_page_stat_cb);
+	} else
+		stat.value = -EINVAL;
+	rcu_read_unlock();
+
+	return stat.value;
+}
+

hmm, mem_cgroup_page_stat() can return a negative value, but you place a BUG_ON() in [5/5] to check whether it returns a negative value. What happens if current is moved to root between mem_cgroup_has_dirty_limit() and mem_cgroup_page_stat()? How about making mem_cgroup_has_dirty_limit() return the target mem_cgroup, and passing that mem_cgroup to mem_cgroup_page_stat()?

Hmm, if mem_cgroup_has_dirty_limit() returns a pointer to the memcg, then the caller will have to use rcu_read_lock(), and that will look ugly.

agreed.

Why don't we simply look at the return value and, if it is negative, fall back to using global stats and get rid of the BUG_ON()? Or, modify mem_cgroup_page_stat() to return global stats if it can't determine the per-cgroup stat for some reason (mem == NULL or root cgroup etc.).

I don't have any objection as long as we don't hit the BUG_ON.

Thanks,
Daisuke Nishimura.
Vivek
[Devel] Re: [PATCH -mmotm 4/5] memcg: dirty pages accounting and limiting infrastructure
 		    !do_swap_account)
 			continue;
+		if (i >= MCS_FILE_STAT_START && i <= MCS_FILE_STAT_END &&
+		    mem_cgroup_is_root(mem_cont))
+			continue;
 		cb->fill(cb, memcg_stat_strings[i].local_name, mystat.stat[i]);
 	}

... so as not to show the file stats in the root cgroup? They are meaningless values anyway. Of course, you'd better mention it in [2/5] too.

Thanks,
Daisuke Nishimura.

 	/* per zone stat */
 	val = mem_cgroup_get_local_zonestat(mem, LRU_INACTIVE_ANON);
@@ -3583,6 +3917,63 @@ unlock:
 	return ret;
 }

+static u64 mem_cgroup_dirty_read(struct cgroup *cgrp, struct cftype *cft)
+{
+	struct mem_cgroup *memcg = mem_cgroup_from_cont(cgrp);
+
+	switch (cft->private) {
+	case MEM_CGROUP_DIRTY_RATIO:
+		return memcg->dirty_param.dirty_ratio;
+	case MEM_CGROUP_DIRTY_BYTES:
+		return memcg->dirty_param.dirty_bytes;
+	case MEM_CGROUP_DIRTY_BACKGROUND_RATIO:
+		return memcg->dirty_param.dirty_background_ratio;
+	case MEM_CGROUP_DIRTY_BACKGROUND_BYTES:
+		return memcg->dirty_param.dirty_background_bytes;
+	default:
+		BUG();
+	}
+}
+
+static int
+mem_cgroup_dirty_write(struct cgroup *cgrp, struct cftype *cft, u64 val)
+{
+	struct mem_cgroup *memcg = mem_cgroup_from_cont(cgrp);
+	int type = cft->private;
+
+	if (cgrp->parent == NULL)
+		return -EINVAL;
+	if ((type == MEM_CGROUP_DIRTY_RATIO ||
+	    type == MEM_CGROUP_DIRTY_BACKGROUND_RATIO) && val > 100)
+		return -EINVAL;
+	/*
+	 * TODO: provide a validation check routine. And retry if validation
+	 * fails.
+	 */
+	switch (type) {
+	case MEM_CGROUP_DIRTY_RATIO:
+		memcg->dirty_param.dirty_ratio = val;
+		memcg->dirty_param.dirty_bytes = 0;
+		break;
+	case MEM_CGROUP_DIRTY_BYTES:
+		memcg->dirty_param.dirty_ratio = 0;
+		memcg->dirty_param.dirty_bytes = val;
+		break;
+	case MEM_CGROUP_DIRTY_BACKGROUND_RATIO:
+		memcg->dirty_param.dirty_background_ratio = val;
+		memcg->dirty_param.dirty_background_bytes = 0;
+		break;
+	case MEM_CGROUP_DIRTY_BACKGROUND_BYTES:
+		memcg->dirty_param.dirty_background_ratio = 0;
+		memcg->dirty_param.dirty_background_bytes = val;
+		break;
+	default:
+		BUG();
+		break;
+	}
+	return 0;
+}
+
 static struct cftype mem_cgroup_files[] = {
 	{
 		.name = "usage_in_bytes",
@@ -3634,6 +4025,30 @@ static struct cftype mem_cgroup_files[] = {
 		.write_u64 = mem_cgroup_swappiness_write,
 	},
 	{
+		.name = "dirty_ratio",
+		.read_u64 = mem_cgroup_dirty_read,
+		.write_u64 = mem_cgroup_dirty_write,
+		.private = MEM_CGROUP_DIRTY_RATIO,
+	},
+	{
+		.name = "dirty_bytes",
+		.read_u64 = mem_cgroup_dirty_read,
+		.write_u64 = mem_cgroup_dirty_write,
+		.private = MEM_CGROUP_DIRTY_BYTES,
+	},
+	{
+		.name = "dirty_background_ratio",
+		.read_u64 = mem_cgroup_dirty_read,
+		.write_u64 = mem_cgroup_dirty_write,
+		.private = MEM_CGROUP_DIRTY_BACKGROUND_RATIO,
+	},
+	{
+		.name = "dirty_background_bytes",
+		.read_u64 = mem_cgroup_dirty_read,
+		.write_u64 = mem_cgroup_dirty_write,
+		.private = MEM_CGROUP_DIRTY_BACKGROUND_BYTES,
+	},
+	{
 		.name = "move_charge_at_immigrate",
 		.read_u64 = mem_cgroup_move_charge_read,
 		.write_u64 = mem_cgroup_move_charge_write,
@@ -3892,8 +4307,21 @@ mem_cgroup_create(struct cgroup_subsys *ss, struct cgroup *cont)
 	mem->last_scanned_child = 0;
 	spin_lock_init(&mem->reclaim_param_lock);

-	if (parent)
+	if (parent) {
 		mem->swappiness = get_swappiness(parent);
+		mem->dirty_param = parent->dirty_param;
+	} else {
+		while (1) {
+			get_global_vm_dirty_param(&mem->dirty_param);
+			/*
+			 * Since global dirty parameters are not protected we
+			 * try to speculatively read them and retry if we get
+			 * inconsistent
+			 * values.
+			 */
+			if (likely(dirty_param_is_valid(&mem->dirty_param)))
+				break;
+		}
+	}
 	atomic_set(&mem->refcnt, 1);
 	mem->move_charge_at_immigrate = 0;
 	mutex_init(&mem->thresholds_lock);
--
1.6.3.3
[Devel] Re: [PATCH mmotm 2.5/4] memcg: disable irq at page cgroup lock (Re: [PATCH -mmotm 3/4] memcg: dirty pages accounting and limiting infrastructure)
On Thu, 11 Mar 2010 15:15:11 +0900, KAMEZAWA Hiroyuki kamezawa.hir...@jp.fujitsu.com wrote:

On Thu, 11 Mar 2010 14:13:00 +0900 KAMEZAWA Hiroyuki kamezawa.hir...@jp.fujitsu.com wrote:

On Thu, 11 Mar 2010 13:58:47 +0900 Daisuke Nishimura nishim...@mxp.nes.nec.co.jp wrote:

I'll consider yet another fix for the race in account migration if I can.

me too.

How about this? Assume that the race is very rare.

1. Use trylock when updating statistics. If trylock fails, don't account it.
2. Add a PCG_ flag for each status:

+	PCG_ACCT_FILE_MAPPED,	/* page is accounted as file rss */
+	PCG_ACCT_DIRTY,		/* page is dirty */
+	PCG_ACCT_WRITEBACK,	/* page is being written back to disk */
+	PCG_ACCT_WRITEBACK_TEMP, /* page is used as temporary buffer for FUSE */
+	PCG_ACCT_UNSTABLE_NFS,	/* NFS page not yet committed to the server */

3. At reducing the counter, check the PCG_xxx flags by TESTCLEARPCGFLAG().

This is similar to an _used_ method of LRU accounting. And we can expect this method's error range never to get too bad.

I agree with you. I've been thinking whether we can remove the page cgroup lock in update_stat as we do in the lru handling code. I think this kind of fuzzy accounting is enough for writeback status. Does anyone need strict accounting? IMHO, we don't need strict accounting.

How does this look?

I agree with this direction. One concern is that we re-introduce trylock again.. Some comments are inlined.

==
From: KAMEZAWA Hiroyuki kamezawa.hir...@jp.fujitsu.com

Now, file-mapped is maintained. But a more generic update function will be needed for dirty page accounting. For accounting page status, we have to guarantee lock_page_cgroup() will never be called under tree_lock held. To guarantee that, we use trylock at updating status. By this, we do fuzzy accounting, but in almost all cases, it's correct.
Signed-off-by: KAMEZAWA Hiroyuki kamezawa.hir...@jp.fujitsu.com
---
 include/linux/memcontrol.h  |    7 +++
 include/linux/page_cgroup.h |   15 +++
 mm/memcontrol.c             |   88 +---
 mm/rmap.c                   |    4 +-
 4 files changed, 90 insertions(+), 24 deletions(-)

Index: mmotm-2.6.34-Mar9/mm/memcontrol.c
===
--- mmotm-2.6.34-Mar9.orig/mm/memcontrol.c
+++ mmotm-2.6.34-Mar9/mm/memcontrol.c
@@ -1348,30 +1348,79 @@ bool mem_cgroup_handle_oom(struct mem_cg
  * Currently used to update mapped file statistics, but the routine can be
  * generalized to update other statistics as well.
  */
-void mem_cgroup_update_file_mapped(struct page *page, int val)
+void __mem_cgroup_update_stat(struct page_cgroup *pc, int idx, bool charge)
 {
 	struct mem_cgroup *mem;
-	struct page_cgroup *pc;
-
-	pc = lookup_page_cgroup(page);
-	if (unlikely(!pc))
-		return;
+	int val;

-	lock_page_cgroup(pc);
 	mem = pc->mem_cgroup;
-	if (!mem)
-		goto done;
+	if (!mem || !PageCgroupUsed(pc))
+		return;

-	if (!PageCgroupUsed(pc))
-		goto done;
+	if (charge)
+		val = 1;
+	else
+		val = -1;
+	switch (idx) {
+	case MEMCG_NR_FILE_MAPPED:
+		if (charge) {
+			if (!PageCgroupFileMapped(pc))
+				SetPageCgroupFileMapped(pc);
+			else
+				val = 0;
+		} else {
+			if (PageCgroupFileMapped(pc))
+				ClearPageCgroupFileMapped(pc);
+			else
+				val = 0;
+		}

Using !TestSetPageCgroupFileMapped(pc) or TestClearPageCgroupFileMapped(pc) is better?

+		idx = MEM_CGROUP_STAT_FILE_MAPPED;
+		break;
+	default:
+		BUG();
+		break;
+	}
 	/*
 	 * Preemption is already disabled.
 	 * We can use __this_cpu_xxx
 	 */
-	__this_cpu_add(mem->stat->count[MEM_CGROUP_STAT_FILE_MAPPED], val);
+	__this_cpu_add(mem->stat->count[idx], val);
+}

-done:
-	unlock_page_cgroup(pc);
+void mem_cgroup_update_stat(struct page *page, int idx, bool charge)
+{
+	struct page_cgroup *pc;
+
+	pc = lookup_page_cgroup(page);
+	if (unlikely(!pc))
+		return;
+
+	if (trylock_page_cgroup(pc)) {
+		__mem_cgroup_update_stat(pc, idx, charge);
+		unlock_page_cgroup(pc);
+	}
+	return;
+}
+
+static void mem_cgroup_migrate_stat(struct page_cgroup *pc,
+	struct mem_cgroup *from, struct mem_cgroup *to)
+{
+	preempt_disable();
+	if (PageCgroupFileMapped(pc)) {
+		__this_cpu_dec(from->stat->count[MEM_CGROUP_STAT_FILE_MAPPED]);
+		__this_cpu_inc(to->stat->count[MEM_CGROUP_STAT_FILE_MAPPED]);
+	}
+	preempt_enable
[Devel] Re: [PATCH -mmotm 0/5] memcg: per cgroup dirty limit (v6)
On Thu, 11 Mar 2010 18:42:44 +0900, KAMEZAWA Hiroyuki kamezawa.hir...@jp.fujitsu.com wrote:

On Thu, 11 Mar 2010 18:25:00 +0900 KAMEZAWA Hiroyuki kamezawa.hir...@jp.fujitsu.com wrote:

Then, it's no problem to check whether pc->mem_cgroup is the root cgroup or not without the spinlock.
==
void mem_cgroup_update_stat(struct page *page, int idx, bool charge)
{
	pc = lookup_page_cgroup(page);
	if (unlikely(!pc) || mem_cgroup_is_root(pc->mem_cgroup))
		return;
	...
}
==
This can be handled by the same logic as the lock failure path. And we just ignore the accounting. There will be no spinlocks. To do more than this, I think we have to use struct page rather than struct page_cgroup.

Hmm.. like this? The bad point of this patch is that it will corrupt the FILE_MAPPED status in the root cgroup. This kind of change is not very good. So, one way is to use this kind of function only for the new parameters.

Hmm. IMHO, if we disable accounting of file stats in the root cgroup, it would be better not to show them in memory.stat, to avoid confusing users. But, hmm, I think accounting them in the root cgroup isn't so meaningless. Isn't making mem_cgroup_has_dirty_limit() return false in case of the root cgroup enough?

==
From: KAMEZAWA Hiroyuki kamezawa.hir...@jp.fujitsu.com

Now, file-mapped is maintained. But a more generic update function will be needed for dirty page accounting. For accounting page status, we have to guarantee lock_page_cgroup() will never be called under tree_lock held. To guarantee that, we use trylock at updating status. By this, we do fuzzy accounting, but in almost all cases, it's correct.

Changelog:
 - removed unnecessary preempt_disable()
 - added root cgroup check. By this, we do no lock/account in root cgroup.

Signed-off-by: KAMEZAWA Hiroyuki kamezawa.hir...@jp.fujitsu.com

Looks good overall.

Thanks,
Daisuke Nishimura.
---
 include/linux/memcontrol.h  |    7 ++-
 include/linux/page_cgroup.h |   15 +++
 mm/memcontrol.c             |   92 +---
 mm/rmap.c                   |    4 -
 4 files changed, 94 insertions(+), 24 deletions(-)

Index: mmotm-2.6.34-Mar9/mm/memcontrol.c
===
--- mmotm-2.6.34-Mar9.orig/mm/memcontrol.c
+++ mmotm-2.6.34-Mar9/mm/memcontrol.c
@@ -1348,30 +1348,83 @@ bool mem_cgroup_handle_oom(struct mem_cg
  * Currently used to update mapped file statistics, but the routine can be
  * generalized to update other statistics as well.
  */
-void mem_cgroup_update_file_mapped(struct page *page, int val)
+void __mem_cgroup_update_stat(struct page_cgroup *pc, int idx, bool charge)
 {
 	struct mem_cgroup *mem;
-	struct page_cgroup *pc;
-
-	pc = lookup_page_cgroup(page);
-	if (unlikely(!pc))
-		return;
+	int val;

-	lock_page_cgroup(pc);
 	mem = pc->mem_cgroup;
-	if (!mem)
-		goto done;
+	if (!mem || !PageCgroupUsed(pc))
+		return;

-	if (!PageCgroupUsed(pc))
-		goto done;
+	if (charge)
+		val = 1;
+	else
+		val = -1;
+	switch (idx) {
+	case MEMCG_NR_FILE_MAPPED:
+		if (charge) {
+			if (!PageCgroupFileMapped(pc))
+				SetPageCgroupFileMapped(pc);
+			else
+				val = 0;
+		} else {
+			if (PageCgroupFileMapped(pc))
+				ClearPageCgroupFileMapped(pc);
+			else
+				val = 0;
+		}
+		idx = MEM_CGROUP_STAT_FILE_MAPPED;
+		break;
+	default:
+		BUG();
+		break;
+	}
 	/*
 	 * Preemption is already disabled.
 	 * We can use __this_cpu_xxx
 	 */
-	__this_cpu_add(mem->stat->count[MEM_CGROUP_STAT_FILE_MAPPED], val);
+	__this_cpu_add(mem->stat->count[idx], val);
+}

-done:
-	unlock_page_cgroup(pc);
+void mem_cgroup_update_stat(struct page *page, int idx, bool charge)
+{
+	struct page_cgroup *pc;
+
+	pc = lookup_page_cgroup(page);
+	if (!pc || mem_cgroup_is_root(pc->mem_cgroup))
+		return;
+
+	if (trylock_page_cgroup(pc)) {
+		__mem_cgroup_update_stat(pc, idx, charge);
+		unlock_page_cgroup(pc);
+	}
+	return;
+}
+
+static void mem_cgroup_migrate_stat(struct page_cgroup *pc,
+	struct mem_cgroup *from, struct mem_cgroup *to)
+{
+	if (PageCgroupFileMapped(pc)) {
+		__this_cpu_dec(from->stat->count[MEM_CGROUP_STAT_FILE_MAPPED]);
+		if (!mem_cgroup_is_root(to)) {
+			__this_cpu_inc(to->stat->count[MEM_CGROUP_STAT_FILE_MAPPED]);
+		} else {
+			ClearPageCgroupFileMapped(pc);
+		}
+	}
+}
+
+static void
+__mem_cgroup_stat_fixup
[Devel] Re: [PATCH mmotm 2.5/4] memcg: disable irq at page cgroup lock (Re: [PATCH -mmotm 3/4] memcg: dirty pages accounting and limiting infrastructure)
On Wed, 10 Mar 2010 09:26:24 +0530, Balbir Singh bal...@linux.vnet.ibm.com wrote:

* nishim...@mxp.nes.nec.co.jp nishim...@mxp.nes.nec.co.jp [2010-03-10 10:43:09]:

Please please measure the performance overhead of this change.

here. I made a patch below and measured the time (average of 10 times) of kernel build on tmpfs (make -j8 on an 8-CPU machine with the 2.6.33 defconfig).

before
- root cgroup: 190.47 sec
- child cgroup: 192.81 sec

after
- root cgroup: 191.06 sec
- child cgroup: 193.06 sec

after2 (local_irq_save/restore)
- root cgroup: 191.42 sec
- child cgroup: 193.55 sec

hmm, I think it's in the error range, but I can see a tendency, by testing several times, that it's getting slower as I add additional code. Using local_irq_disable()/enable() except in mem_cgroup_update_file_mapped() (it can be the only candidate to be called with irq disabled in future) might be the choice.

Error range would depend on things like standard deviation and repetition. It might be good to keep update_file_mapped and see the impact. My concern is that with large systems, the difference might be larger.

--
Three Cheers,
Balbir

I made a patch (attached) using both local_irq_disable/enable and local_irq_save/restore. local_irq_save/restore is used only in mem_cgroup_update_file_mapped. And I attached a histogram graph of 30 kernel builds in the root cgroup for each.

before_root: no irq operation (original)
after_root: local_irq_disable/enable for all
after2_root: local_irq_save/restore for all
after3_root: mixed version (attached)

hmm, there seems to be a tendency that before < after < after3 < after2 ? Should I replace the save/restore version with the mixed version?

Thanks,
Daisuke Nishimura.
===
 include/linux/page_cgroup.h |   28 ++--
 mm/memcontrol.c             |   36 ++--
 2 files changed, 44 insertions(+), 20 deletions(-)

diff --git a/include/linux/page_cgroup.h b/include/linux/page_cgroup.h
index 30b0813..c0aca62 100644
--- a/include/linux/page_cgroup.h
+++ b/include/linux/page_cgroup.h
@@ -83,16 +83,40 @@ static inline enum zone_type page_cgroup_zid(struct page_cgroup *pc)
 	return page_zonenum(pc->page);
 }

-static inline void lock_page_cgroup(struct page_cgroup *pc)
+static inline void __lock_page_cgroup(struct page_cgroup *pc)
 {
 	bit_spin_lock(PCG_LOCK, &pc->flags);
 }

-static inline void unlock_page_cgroup(struct page_cgroup *pc)
+static inline void __unlock_page_cgroup(struct page_cgroup *pc)
 {
 	bit_spin_unlock(PCG_LOCK, &pc->flags);
 }

+#define lock_page_cgroup_irq(pc)		\
+	do {					\
+		local_irq_disable();		\
+		__lock_page_cgroup(pc);		\
+	} while (0)
+
+#define unlock_page_cgroup_irq(pc)		\
+	do {					\
+		__unlock_page_cgroup(pc);	\
+		local_irq_enable();		\
+	} while (0)
+
+#define lock_page_cgroup_irqsave(pc, flags)	\
+	do {					\
+		local_irq_save(flags);		\
+		__lock_page_cgroup(pc);		\
+	} while (0)
+
+#define unlock_page_cgroup_irqrestore(pc, flags)	\
+	do {						\
+		__unlock_page_cgroup(pc);		\
+		local_irq_restore(flags);		\
+	} while (0)
+
 #else /* CONFIG_CGROUP_MEM_RES_CTLR */
 struct page_cgroup;

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 02ea959..11d483e 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1354,12 +1354,13 @@ void mem_cgroup_update_file_mapped(struct page *page, int val)
 {
 	struct mem_cgroup *mem;
 	struct page_cgroup *pc;
+	unsigned long flags;

 	pc = lookup_page_cgroup(page);
 	if (unlikely(!pc))
 		return;

-	lock_page_cgroup(pc);
+	lock_page_cgroup_irqsave(pc, flags);
 	mem = pc->mem_cgroup;
 	if (!mem)
 		goto done;
@@ -1373,7 +1374,7 @@ void mem_cgroup_update_file_mapped(struct page *page, int val)
 	__this_cpu_add(mem->stat->count[MEM_CGROUP_STAT_FILE_MAPPED], val);

 done:
-	unlock_page_cgroup(pc);
+	unlock_page_cgroup_irqrestore(pc, flags);
 }

 /*
@@ -1711,7 +1712,7 @@ struct
 mem_cgroup *try_get_mem_cgroup_from_page(struct page *page)
 	VM_BUG_ON(!PageLocked(page));

 	pc = lookup_page_cgroup(page);
-	lock_page_cgroup(pc);
+	lock_page_cgroup_irq(pc);
 	if (PageCgroupUsed(pc)) {
 		mem = pc->mem_cgroup;
 		if (mem && !css_tryget(&mem->css))
@@ -1725,7 +1726,7 @@ struct mem_cgroup
[Devel] Re: [PATCH mmotm 2.5/4] memcg: disable irq at page cgroup lock (Re: [PATCH -mmotm 3/4] memcg: dirty pages accounting and limiting infrastructure)
On Thu, 11 Mar 2010 13:49:08 +0900, KAMEZAWA Hiroyuki kamezawa.hir...@jp.fujitsu.com wrote:

On Thu, 11 Mar 2010 13:31:23 +0900 Daisuke Nishimura nishim...@mxp.nes.nec.co.jp wrote:

On Wed, 10 Mar 2010 09:26:24 +0530, Balbir Singh bal...@linux.vnet.ibm.com wrote:

* nishim...@mxp.nes.nec.co.jp nishim...@mxp.nes.nec.co.jp [2010-03-10 10:43:09]:

I made a patch (attached) using both local_irq_disable/enable and local_irq_save/restore. local_irq_save/restore is used only in mem_cgroup_update_file_mapped. And I attached a histogram graph of 30 kernel builds in the root cgroup for each.

before_root: no irq operation (original)
after_root: local_irq_disable/enable for all
after2_root: local_irq_save/restore for all
after3_root: mixed version (attached)

hmm, there seems to be a tendency that before < after < after3 < after2 ? Should I replace the save/restore version with the mixed version?

IMHO, starting from the after2_root version is the easiest. If there is a chance that lock/unlock of page_cgroup can be called in interrupt context, we _have to_ disable IRQs anyway. And if we have to do this, I prefer migration_lock rather than this mixture.

I see.

BTW, how big is your system? Balbir-san's concern is for bigger machines. But I'm not sure this change is affected by the size of the machine. I'm sorry, I have no big machine now. My test machine has 8 CPUs, and I run all the tests with make -j8.

Sorry, I don't have easy access to a huge machine either. I'll consider yet another fix for the race in account migration if I can.

me too.

Thanks,
Daisuke Nishimura.
[Devel] Re: [PATCH mmotm 2.5/4] memcg: disable irq at page cgroup lock (Re: [PATCH -mmotm 3/4] memcg: dirty pages accounting and limiting infrastructure)
Please please measure the performance overhead of this change.

here. I made a patch below and measured the time (average of 10 times) of kernel build on tmpfs (make -j8 on an 8-CPU machine with the 2.6.33 defconfig).

before
- root cgroup: 190.47 sec
- child cgroup: 192.81 sec

after
- root cgroup: 191.06 sec
- child cgroup: 193.06 sec

after2 (local_irq_save/restore)
- root cgroup: 191.42 sec
- child cgroup: 193.55 sec

hmm, I think it's in the error range, but I can see a tendency, by testing several times, that it's getting slower as I add additional code. Using local_irq_disable()/enable() except in mem_cgroup_update_file_mapped() (it can be the only candidate to be called with irq disabled in future) might be the choice.

Thanks,
Daisuke Nishimura.

On Tue, 9 Mar 2010 10:20:58 +0530, Balbir Singh bal...@linux.vnet.ibm.com wrote:

* nishim...@mxp.nes.nec.co.jp nishim...@mxp.nes.nec.co.jp [2010-03-09 10:29:28]:

On Tue, 9 Mar 2010 09:19:14 +0900, KAMEZAWA Hiroyuki kamezawa.hir...@jp.fujitsu.com wrote:

On Tue, 9 Mar 2010 01:12:52 +0100 Andrea Righi ari...@develer.com wrote:

On Mon, Mar 08, 2010 at 05:31:00PM +0900, KAMEZAWA Hiroyuki wrote:

On Mon, 8 Mar 2010 17:07:11 +0900 Daisuke Nishimura nishim...@mxp.nes.nec.co.jp wrote:

On Mon, 8 Mar 2010 11:37:11 +0900, KAMEZAWA Hiroyuki kamezawa.hir...@jp.fujitsu.com wrote:

On Mon, 8 Mar 2010 11:17:24 +0900 Daisuke Nishimura nishim...@mxp.nes.nec.co.jp wrote:

But IIRC, clear_writeback is done under treelock. No?

The place where NR_WRITEBACK is updated is out of tree_lock.
1311 int test_clear_page_writeback(struct page *page)
1312 {
1313 	struct address_space *mapping = page_mapping(page);
1314 	int ret;
1315
1316 	if (mapping) {
1317 		struct backing_dev_info *bdi = mapping->backing_dev_info;
1318 		unsigned long flags;
1319
1320 		spin_lock_irqsave(&mapping->tree_lock, flags);
1321 		ret = TestClearPageWriteback(page);
1322 		if (ret) {
1323 			radix_tree_tag_clear(&mapping->page_tree,
1324 						page_index(page),
1325 						PAGECACHE_TAG_WRITEBACK);
1326 			if (bdi_cap_account_writeback(bdi)) {
1327 				__dec_bdi_stat(bdi, BDI_WRITEBACK);
1328 				__bdi_writeout_inc(bdi);
1329 			}
1330 		}
1331 		spin_unlock_irqrestore(&mapping->tree_lock, flags);
1332 	} else {
1333 		ret = TestClearPageWriteback(page);
1334 	}
1335 	if (ret)
1336 		dec_zone_page_state(page, NR_WRITEBACK);
1337 	return ret;
1338 }

We can move this up to under tree_lock. Considering memcg, all our targets have a mapping. If we newly account bounce buffers (for NILFS, FUSE, etc.), which have no ->mapping, we need a much more complex new charge/uncharge theory. But yes, adding a new lock scheme seems complicated. (Sorry, Andrea.) My concern is performance. We may need some new re-implementation of locks/migrate/charge/uncharge.

I agree. Performance is my concern too. I made a patch below and measured the time (average of 10 times) of kernel build on tmpfs (make -j8 on an 8-CPU machine with the 2.6.33 defconfig).

before
- root cgroup: 190.47 sec
- child cgroup: 192.81 sec

after
- root cgroup: 191.06 sec
- child cgroup: 193.06 sec

Hmm... about 0.3% slower for root, 0.1% slower for child. Hmm... acceptable? (sounds like it's in the error range) BTW, why local_irq_disable()? Isn't local_irq_save()/restore() better?

Probably there's not the overhead of saving flags?

maybe. Anyway, it would make the code much more readable...

ok. please go ahead in this direction. Nishimura-san, would you post an independent patch? If no, Andrea-san, please.

This is the updated version. Andrea-san, can you merge this into your patch set?
Please please measure the performance overhead of this change.

--
Three Cheers,
Balbir
[Devel] Re: [PATCH -mmotm 3/4] memcg: dirty pages accounting and limiting infrastructure
On Mon, 8 Mar 2010 11:37:11 +0900, KAMEZAWA Hiroyuki kamezawa.hir...@jp.fujitsu.com wrote:

On Mon, 8 Mar 2010 11:17:24 +0900 Daisuke Nishimura nishim...@mxp.nes.nec.co.jp wrote:

But IIRC, clear_writeback is done under treelock. No?

The place where NR_WRITEBACK is updated is out of tree_lock.

1311 int test_clear_page_writeback(struct page *page)
1312 {
1313 	struct address_space *mapping = page_mapping(page);
1314 	int ret;
1315
1316 	if (mapping) {
1317 		struct backing_dev_info *bdi = mapping->backing_dev_info;
1318 		unsigned long flags;
1319
1320 		spin_lock_irqsave(&mapping->tree_lock, flags);
1321 		ret = TestClearPageWriteback(page);
1322 		if (ret) {
1323 			radix_tree_tag_clear(&mapping->page_tree,
1324 						page_index(page),
1325 						PAGECACHE_TAG_WRITEBACK);
1326 			if (bdi_cap_account_writeback(bdi)) {
1327 				__dec_bdi_stat(bdi, BDI_WRITEBACK);
1328 				__bdi_writeout_inc(bdi);
1329 			}
1330 		}
1331 		spin_unlock_irqrestore(&mapping->tree_lock, flags);
1332 	} else {
1333 		ret = TestClearPageWriteback(page);
1334 	}
1335 	if (ret)
1336 		dec_zone_page_state(page, NR_WRITEBACK);
1337 	return ret;
1338 }

We can move this up to under tree_lock. Considering memcg, all our targets have a mapping. If we newly account bounce buffers (for NILFS, FUSE, etc.), which have no ->mapping, we need a much more complex new charge/uncharge theory. But yes, adding a new lock scheme seems complicated. (Sorry, Andrea.) My concern is performance. We may need some new re-implementation of locks/migrate/charge/uncharge.

I agree. Performance is my concern too. I made a patch below and measured the time (average of 10 times) of kernel build on tmpfs (make -j8 on an 8-CPU machine with the 2.6.33 defconfig).

before
- root cgroup: 190.47 sec
- child cgroup: 192.81 sec

after
- root cgroup: 191.06 sec
- child cgroup: 193.06 sec

Hmm... about 0.3% slower for root, 0.1% slower for child.
===
From: Daisuke Nishimura nishim...@mxp.nes.nec.co.jp

In the current implementation, we don't have to disable irqs at lock_page_cgroup() because the lock is never acquired in interrupt context. But we are going to do that in a later patch, so this patch encloses all of the lock_page_cgroup()/unlock_page_cgroup() pairs with local_irq_disable()/local_irq_enable().

Signed-off-by: Daisuke Nishimura nishim...@mxp.nes.nec.co.jp
---
 mm/memcontrol.c |   17 +
 1 files changed, 17 insertions(+), 0 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 02ea959..e5ae1a1 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1359,6 +1359,7 @@ void mem_cgroup_update_file_mapped(struct page *page, int val)
 	if (unlikely(!pc))
 		return;

+	local_irq_disable();
 	lock_page_cgroup(pc);
 	mem = pc->mem_cgroup;
 	if (!mem)
@@ -1374,6 +1375,7 @@ void mem_cgroup_update_file_mapped(struct page *page, int val)

 done:
 	unlock_page_cgroup(pc);
+	local_irq_enable();
 }

 /*
@@ -1711,6 +1713,7 @@ struct mem_cgroup *try_get_mem_cgroup_from_page(struct page *page)
 	VM_BUG_ON(!PageLocked(page));

 	pc = lookup_page_cgroup(page);
+	local_irq_disable();
 	lock_page_cgroup(pc);
 	if (PageCgroupUsed(pc)) {
 		mem = pc->mem_cgroup;
@@ -1726,6 +1729,7 @@ struct mem_cgroup *try_get_mem_cgroup_from_page(struct page *page)
 		rcu_read_unlock();
 	}
 	unlock_page_cgroup(pc);
+	local_irq_enable();

 	return mem;
 }
@@ -1742,9 +1746,11 @@ static void __mem_cgroup_commit_charge(struct mem_cgroup *mem,
 	if (!mem)
 		return;

+	local_irq_disable();
 	lock_page_cgroup(pc);
 	if (unlikely(PageCgroupUsed(pc))) {
 		unlock_page_cgroup(pc);
+		local_irq_enable();
 		mem_cgroup_cancel_charge(mem);
 		return;
 	}
@@ -1775,6 +1781,7 @@ static void __mem_cgroup_commit_charge(struct mem_cgroup *mem,
 	mem_cgroup_charge_statistics(mem, pc, true);

 	unlock_page_cgroup(pc);
+	local_irq_enable();
 	/*
 	 * charge_statistics updated event counter. Then, check it.
 	 * Insert ancestor (and ancestor's ancestors), to softlimit RB-tree.
@@ -1844,12 +1851,14 @@ static int mem_cgroup_move_account(struct page_cgroup *pc,
 	struct mem_cgroup *from, struct mem_cgroup *to, bool uncharge)
 {
 	int ret = -EINVAL;
+	local_irq_disable();
 	lock_page_cgroup(pc
[Devel] Re: [PATCH -mmotm 3/4] memcg: dirty pages accounting and limiting infrastructure
On Mon, 8 Mar 2010 17:31:00 +0900, KAMEZAWA Hiroyuki kamezawa.hir...@jp.fujitsu.com wrote:

On Mon, 8 Mar 2010 17:07:11 +0900 Daisuke Nishimura nishim...@mxp.nes.nec.co.jp wrote:

On Mon, 8 Mar 2010 11:37:11 +0900, KAMEZAWA Hiroyuki kamezawa.hir...@jp.fujitsu.com wrote:

On Mon, 8 Mar 2010 11:17:24 +0900 Daisuke Nishimura nishim...@mxp.nes.nec.co.jp wrote:

But IIRC, clear_writeback is done under treelock. No?

The place where NR_WRITEBACK is updated is out of tree_lock.

1311 int test_clear_page_writeback(struct page *page)
1312 {
1313 	struct address_space *mapping = page_mapping(page);
1314 	int ret;
1315
1316 	if (mapping) {
1317 		struct backing_dev_info *bdi = mapping->backing_dev_info;
1318 		unsigned long flags;
1319
1320 		spin_lock_irqsave(&mapping->tree_lock, flags);
1321 		ret = TestClearPageWriteback(page);
1322 		if (ret) {
1323 			radix_tree_tag_clear(&mapping->page_tree,
1324 						page_index(page),
1325 						PAGECACHE_TAG_WRITEBACK);
1326 			if (bdi_cap_account_writeback(bdi)) {
1327 				__dec_bdi_stat(bdi, BDI_WRITEBACK);
1328 				__bdi_writeout_inc(bdi);
1329 			}
1330 		}
1331 		spin_unlock_irqrestore(&mapping->tree_lock, flags);
1332 	} else {
1333 		ret = TestClearPageWriteback(page);
1334 	}
1335 	if (ret)
1336 		dec_zone_page_state(page, NR_WRITEBACK);
1337 	return ret;
1338 }

We can move this up to under tree_lock. Considering memcg, all our targets have a mapping. If we newly account bounce buffers (for NILFS, FUSE, etc.), which have no ->mapping, we need a much more complex new charge/uncharge theory. But yes, adding a new lock scheme seems complicated. (Sorry, Andrea.) My concern is performance. We may need some new re-implementation of locks/migrate/charge/uncharge.

I agree. Performance is my concern too. I made a patch below and measured the time (average of 10 times) of kernel build on tmpfs (make -j8 on an 8-CPU machine with the 2.6.33 defconfig).

before
- root cgroup: 190.47 sec
- child cgroup: 192.81 sec

after
- root cgroup: 191.06 sec
- child cgroup: 193.06 sec

Hmm...
about 0.3% slower for root, 0.1% slower for child.

Hmm... acceptable? (sounds like it's in the error range) BTW, why local_irq_disable()? Isn't local_irq_save()/restore() better?

I don't have any strong reason. All of lock_page_cgroup() is *now* called w/o irq disabled, so I used just disable()/enable() instead of save()/restore(). I think disable()/enable() is better in those cases because we need not save/restore the eflags register by pushf/popf, but, though I don't have any numbers, there wouldn't be a big difference in performance.

Thanks,
Daisuke Nishimura.
[Devel] Re: [PATCH -mmotm 3/4] memcg: dirty pages accounting and limiting infrastructure
On Tue, 9 Mar 2010 09:20:54 +0900, KAMEZAWA Hiroyuki kamezawa.hir...@jp.fujitsu.com wrote:

On Tue, 9 Mar 2010 09:18:45 +0900 Daisuke Nishimura nishim...@mxp.nes.nec.co.jp wrote:

On Mon, 8 Mar 2010 17:31:00 +0900, KAMEZAWA Hiroyuki kamezawa.hir...@jp.fujitsu.com wrote:

On Mon, 8 Mar 2010 17:07:11 +0900 Daisuke Nishimura nishim...@mxp.nes.nec.co.jp wrote:

Hmm... acceptable? (sounds like it's in the error range) BTW, why local_irq_disable()? Isn't local_irq_save()/restore() better?

I don't have any strong reason. All of lock_page_cgroup() is *now* called w/o irq disabled, so I used just disable()/enable() instead of save()/restore().

My point is, this will be used under treelock soon.

I agree. I'll update the patch using save()/restore(), and repost later.

Thanks,
Daisuke Nishimura.
[Devel] [PATCH mmotm 2.5/4] memcg: disable irq at page cgroup lock (Re: [PATCH -mmotm 3/4] memcg: dirty pages accounting and limiting infrastructure)
On Tue, 9 Mar 2010 09:19:14 +0900, KAMEZAWA Hiroyuki kamezawa.hir...@jp.fujitsu.com wrote: On Tue, 9 Mar 2010 01:12:52 +0100 Andrea Righi ari...@develer.com wrote: On Mon, Mar 08, 2010 at 05:31:00PM +0900, KAMEZAWA Hiroyuki wrote: On Mon, 8 Mar 2010 17:07:11 +0900 Daisuke Nishimura nishim...@mxp.nes.nec.co.jp wrote: On Mon, 8 Mar 2010 11:37:11 +0900, KAMEZAWA Hiroyuki kamezawa.hir...@jp.fujitsu.com wrote: On Mon, 8 Mar 2010 11:17:24 +0900 Daisuke Nishimura nishim...@mxp.nes.nec.co.jp wrote: But IIRC, clear_writeback is done under treelock No ? The place where NR_WRITEBACK is updated is out of tree_lock. 1311 int test_clear_page_writeback(struct page *page) 1312 { 1313 struct address_space *mapping = page_mapping(page); 1314 int ret; 1315 1316 if (mapping) { 1317 struct backing_dev_info *bdi = mapping-backing_dev_info; 1318 unsigned long flags; 1319 1320 spin_lock_irqsave(mapping-tree_lock, flags); 1321 ret = TestClearPageWriteback(page); 1322 if (ret) { 1323 radix_tree_tag_clear(mapping-page_tree, 1324 page_index(page), 1325 PAGECACHE_TAG_WRITEBACK); 1326 if (bdi_cap_account_writeback(bdi)) { 1327 __dec_bdi_stat(bdi, BDI_WRITEBACK); 1328 __bdi_writeout_inc(bdi); 1329 } 1330 } 1331 spin_unlock_irqrestore(mapping-tree_lock, flags); 1332 } else { 1333 ret = TestClearPageWriteback(page); 1334 } 1335 if (ret) 1336 dec_zone_page_state(page, NR_WRITEBACK); 1337 return ret; 1338 } We can move this up to under tree_lock. Considering memcg, all our target has mapping. If we newly account bounce-buffers (for NILFS, FUSE, etc..), which has no -mapping, we need much more complex new charge/uncharge theory. But yes, adding new lock scheme seems complicated. (Sorry Andrea.) My concerns is performance. We may need somehing new re-implementation of locks/migrate/charge/uncharge. I agree. Performance is my concern too. I made a patch below and measured the time(average of 10 times) of kernel build on tmpfs(make -j8 on 8 CPU machine with 2.6.33 defconfig). 
before - root cgroup: 190.47 sec - child cgroup: 192.81 sec after - root cgroup: 191.06 sec - child cgroup: 193.06 sec Hmm... about 0.3% slower for root, 0.1% slower for child. Hmm...accepatable ? (sounds it's in error-range) BTW, why local_irq_disable() ? local_irq_save()/restore() isn't better ? Probably there's not the overhead of saving flags? maybe. Anyway, it would make the code much more readable... ok. please go ahead in this direction. Nishimura-san, would you post an independent patch ? If no, Andrea-san, please. This is the updated version. Andrea-san, can you merge this into your patch set ? === From: Daisuke Nishimura nishim...@mxp.nes.nec.co.jp In current implementation, we don't have to disable irq at lock_page_cgroup() because the lock is never acquired in interrupt context. But we are going to call it in later patch in an interrupt context or with irq disabled, so this patch disables irq at lock_page_cgroup() and enables it at unlock_page_cgroup(). Signed-off-by: Daisuke Nishimura nishim...@mxp.nes.nec.co.jp --- include/linux/page_cgroup.h | 16 ++-- mm/memcontrol.c | 43 +-- 2 files changed, 39 insertions(+), 20 deletions(-) diff --git a/include/linux/page_cgroup.h b/include/linux/page_cgroup.h index 30b0813..0d2f92c 100644 --- a/include/linux/page_cgroup.h +++ b/include/linux/page_cgroup.h @@ -83,16 +83,28 @@ static inline enum zone_type page_cgroup_zid(struct page_cgroup *pc) return page_zonenum(pc-page); } -static inline void lock_page_cgroup(struct page_cgroup *pc) +static inline void __lock_page_cgroup(struct page_cgroup *pc) { bit_spin_lock(PCG_LOCK, pc-flags); } -static inline void unlock_page_cgroup(struct page_cgroup *pc) +static inline void __unlock_page_cgroup(struct page_cgroup *pc) { bit_spin_unlock(PCG_LOCK, pc-flags); } +#define lock_page_cgroup(pc, flags)\ + do {\ + local_irq_save(flags
[Devel] Re: [PATCH -mmotm 3/4] memcg: dirty pages accounting and limiting infrastructure
+/* + * mem_cgroup_update_page_stat_locked() - update memcg file cache's accounting + * @page:the page involved in a file cache operation. + * @idx: the particular file cache statistic. + * @charge: true to increment, false to decrement the statistic specified + * by @idx. + * + * Update memory cgroup file cache's accounting from a locked context. + * + * NOTE: must be called with mapping-tree_lock held. + */ +void mem_cgroup_update_page_stat_locked(struct page *page, + enum mem_cgroup_write_page_stat_item idx, bool charge) +{ + struct address_space *mapping = page_mapping(page); + struct page_cgroup *pc; + + if (mem_cgroup_disabled()) + return; + WARN_ON_ONCE(!irqs_disabled()); + WARN_ON_ONCE(mapping !spin_is_locked(mapping-tree_lock)); + I think this is a wrong place to insert assertion. The problem about page cgroup lock is that it can be interrupted in current implementation. So, a) it must not be aquired under another lock which can be aquired in interrupt context, such as mapping-tree_lock, to avoid: context1context2 lock_page_cgroup(pcA) spin_lock_irq(tree_lock) lock_page_cgroup(pcA) interrupted =fail spin_lock_irqsave(tree_lock) =fail b) it must not be aquired in interrupt context to avoid: lock_page_cgroup(pcA) interrupted lock_page_cgroup(pcA) =fail I think something like this would be better: @@ -83,8 +83,14 @@ static inline enum zone_type page_cgroup_zid(struct page_cgroup *pc) return page_zonenum(pc-page); } +#include linux/irqflags.h +#include linux/hardirq.h static inline void lock_page_cgroup(struct page_cgroup *pc) { +#ifdef CONFIG_DEBUG_VM + WARN_ON_ONCE(irqs_disabled()); + WARN_ON_ONCE(in_interrupt()); +#endif bit_spin_lock(PCG_LOCK, pc-flags); } + pc = lookup_page_cgroup(page); + if (unlikely(!pc) || !PageCgroupUsed(pc)) + return; + mem_cgroup_update_page_stat(pc, idx, charge); +} +EXPORT_SYMBOL_GPL(mem_cgroup_update_page_stat_locked); + +/* + * mem_cgroup_update_page_stat_unlocked() - update memcg file cache's accounting + * @page:the page 
involved in a file cache operation. + * @idx: the particular file cache statistic. + * @charge: true to increment, false to decrement the statistic specified + * by @idx. + * + * Update memory cgroup file cache's accounting from an unlocked context. + */ +void mem_cgroup_update_page_stat_unlocked(struct page *page, + enum mem_cgroup_write_page_stat_item idx, bool charge) +{ + struct page_cgroup *pc; + + if (mem_cgroup_disabled()) + return; + pc = lookup_page_cgroup(page); + if (unlikely(!pc) || !PageCgroupUsed(pc)) + return; + lock_page_cgroup(pc); + mem_cgroup_update_page_stat(pc, idx, charge); unlock_page_cgroup(pc); } +EXPORT_SYMBOL_GPL(mem_cgroup_update_page_stat_unlocked); IIUC, test_clear_page_writeback(at least) can be called under interrupt context. This means lock_page_cgroup() is called under interrupt context, that is, the case b) above can happen. hmm... I don't have any good idea for now except disabling irq around page cgroup lock to avoid all of these mess things. Thanks, Daisuke Nishimura. ___ Containers mailing list contain...@lists.linux-foundation.org https://lists.linux-foundation.org/mailman/listinfo/containers ___ Devel mailing list Devel@openvz.org https://openvz.org/mailman/listinfo/devel
[Devel] Re: [PATCH -mmotm 3/4] memcg: dirty pages accounting and limiting infrastructure
On Mon, 8 Mar 2010 10:56:41 +0900, KAMEZAWA Hiroyuki kamezawa.hir...@jp.fujitsu.com wrote: On Mon, 8 Mar 2010 10:44:47 +0900 Daisuke Nishimura nishim...@mxp.nes.nec.co.jp wrote: +/* + * mem_cgroup_update_page_stat_locked() - update memcg file cache's accounting + * @page:the page involved in a file cache operation. + * @idx: the particular file cache statistic. + * @charge: true to increment, false to decrement the statistic specified + * by @idx. + * + * Update memory cgroup file cache's accounting from a locked context. + * + * NOTE: must be called with mapping-tree_lock held. + */ +void mem_cgroup_update_page_stat_locked(struct page *page, + enum mem_cgroup_write_page_stat_item idx, bool charge) +{ + struct address_space *mapping = page_mapping(page); + struct page_cgroup *pc; + + if (mem_cgroup_disabled()) + return; + WARN_ON_ONCE(!irqs_disabled()); + WARN_ON_ONCE(mapping !spin_is_locked(mapping-tree_lock)); + I think this is a wrong place to insert assertion. The problem about page cgroup lock is that it can be interrupted in current implementation. 
So, a) it must not be aquired under another lock which can be aquired in interrupt context, such as mapping-tree_lock, to avoid: context1context2 lock_page_cgroup(pcA) spin_lock_irq(tree_lock) lock_page_cgroup(pcA) interrupted =fail spin_lock_irqsave(tree_lock) =fail b) it must not be aquired in interrupt context to avoid: lock_page_cgroup(pcA) interrupted lock_page_cgroup(pcA) =fail I think something like this would be better: @@ -83,8 +83,14 @@ static inline enum zone_type page_cgroup_zid(struct page_cgroup *pc) return page_zonenum(pc-page); } +#include linux/irqflags.h +#include linux/hardirq.h static inline void lock_page_cgroup(struct page_cgroup *pc) { +#ifdef CONFIG_DEBUG_VM + WARN_ON_ONCE(irqs_disabled()); + WARN_ON_ONCE(in_interrupt()); +#endif bit_spin_lock(PCG_LOCK, pc-flags); } + pc = lookup_page_cgroup(page); + if (unlikely(!pc) || !PageCgroupUsed(pc)) + return; + mem_cgroup_update_page_stat(pc, idx, charge); +} +EXPORT_SYMBOL_GPL(mem_cgroup_update_page_stat_locked); + +/* + * mem_cgroup_update_page_stat_unlocked() - update memcg file cache's accounting + * @page:the page involved in a file cache operation. + * @idx: the particular file cache statistic. + * @charge: true to increment, false to decrement the statistic specified + * by @idx. + * + * Update memory cgroup file cache's accounting from an unlocked context. + */ +void mem_cgroup_update_page_stat_unlocked(struct page *page, + enum mem_cgroup_write_page_stat_item idx, bool charge) +{ + struct page_cgroup *pc; + + if (mem_cgroup_disabled()) + return; + pc = lookup_page_cgroup(page); + if (unlikely(!pc) || !PageCgroupUsed(pc)) + return; + lock_page_cgroup(pc); + mem_cgroup_update_page_stat(pc, idx, charge); unlock_page_cgroup(pc); } +EXPORT_SYMBOL_GPL(mem_cgroup_update_page_stat_unlocked); IIUC, test_clear_page_writeback(at least) can be called under interrupt context. This means lock_page_cgroup() is called under interrupt context, that is, the case b) above can happen. hmm... 
I don't have any good idea for now except disabling irq around page cgroup lock to avoid all of these mess things. Hmm...simply IRQ-off for all updates ? I think so in current code. But after these changes, we must use local_irq_save()/restore() instead of local_irq_disable()/enable() in mem_cgroup_update_page_stat(). But IIRC, clear_writeback is done under treelock No ? The place where NR_WRITEBACK is updated is out of tree_lock. 1311 int test_clear_page_writeback(struct page *page) 1312 { 1313 struct address_space *mapping = page_mapping(page); 1314 int ret; 1315 1316 if (mapping) { 1317 struct backing_dev_info *bdi = mapping-backing_dev_info; 1318 unsigned long flags; 1319 1320 spin_lock_irqsave(mapping-tree_lock, flags); 1321 ret = TestClearPageWriteback(page); 1322 if (ret) { 1323 radix_tree_tag_clear(mapping-page_tree, 1324 page_index(page), 1325 PAGECACHE_TAG_WRITEBACK); 1326 if (bdi_cap_account_writeback(bdi)) { 1327
[Devel] Re: [PATCH -mmotm 3/4] memcg: dirty pages accounting and limiting infrastructure
+ ClearPageCgroupDirty(pc); + break; + case MEM_CGROUP_STAT_WRITEBACK: + if (val 0) + SetPageCgroupWriteback(pc); + else + ClearPageCgroupWriteback(pc); + break; + case MEM_CGROUP_STAT_WRITEBACK_TEMP: + if (val 0) + SetPageCgroupWritebackTemp(pc); + else + ClearPageCgroupWritebackTemp(pc); + break; + case MEM_CGROUP_STAT_UNSTABLE_NFS: + if (val 0) + SetPageCgroupUnstableNFS(pc); + else + ClearPageCgroupUnstableNFS(pc); + break; + default: + BUG(); + break; + } + mem = pc-mem_cgroup; + if (likely(mem)) + __this_cpu_add(mem-stat-count[idx], val); + unlock_page_cgroup_migrate(pc); } +EXPORT_SYMBOL_GPL(mem_cgroup_update_stat); /* * size of first charge trial. 32 comes from vmscan.c's magic value. @@ -1701,6 +1885,45 @@ static void __mem_cgroup_commit_charge(struct mem_cgroup *mem, memcg_check_events(mem, pc-page); } +/* + * Update file cache accounted statistics on task migration. + * + * TODO: We don't move charges of file (including shmem/tmpfs) pages for now. + * So, at the moment this function simply returns without updating accounted + * statistics, because we deal only with anonymous pages here. + */ This function is not unique to task migration. It's called from rmdir() too. So this comment isn't needed. 
+static void __mem_cgroup_update_file_stat(struct page_cgroup *pc, + struct mem_cgroup *from, struct mem_cgroup *to) +{ + struct page *page = pc-page; + + if (!page_mapped(page) || PageAnon(page)) + return; + + if (PageCgroupFileMapped(pc)) { + __this_cpu_dec(from-stat-count[MEM_CGROUP_STAT_FILE_MAPPED]); + __this_cpu_inc(to-stat-count[MEM_CGROUP_STAT_FILE_MAPPED]); + } + if (PageCgroupDirty(pc)) { + __this_cpu_dec(from-stat-count[MEM_CGROUP_STAT_FILE_DIRTY]); + __this_cpu_inc(to-stat-count[MEM_CGROUP_STAT_FILE_DIRTY]); + } + if (PageCgroupWriteback(pc)) { + __this_cpu_dec(from-stat-count[MEM_CGROUP_STAT_WRITEBACK]); + __this_cpu_inc(to-stat-count[MEM_CGROUP_STAT_WRITEBACK]); + } + if (PageCgroupWritebackTemp(pc)) { + __this_cpu_dec( + from-stat-count[MEM_CGROUP_STAT_WRITEBACK_TEMP]); + __this_cpu_inc(to-stat-count[MEM_CGROUP_STAT_WRITEBACK_TEMP]); + } + if (PageCgroupUnstableNFS(pc)) { + __this_cpu_dec( + from-stat-count[MEM_CGROUP_STAT_UNSTABLE_NFS]); + __this_cpu_inc(to-stat-count[MEM_CGROUP_STAT_UNSTABLE_NFS]); + } +} + /** * __mem_cgroup_move_account - move account of the page * @pc: page_cgroup of the page. 
@@ -1721,22 +1944,16 @@ static void __mem_cgroup_commit_charge(struct mem_cgroup *mem, static void __mem_cgroup_move_account(struct page_cgroup *pc, struct mem_cgroup *from, struct mem_cgroup *to, bool uncharge) { - struct page *page; - VM_BUG_ON(from == to); VM_BUG_ON(PageLRU(pc-page)); VM_BUG_ON(!PageCgroupLocked(pc)); VM_BUG_ON(!PageCgroupUsed(pc)); VM_BUG_ON(pc-mem_cgroup != from); - page = pc-page; - if (page_mapped(page) !PageAnon(page)) { - /* Update mapped_file data for mem_cgroup */ - preempt_disable(); - __this_cpu_dec(from-stat-count[MEM_CGROUP_STAT_FILE_MAPPED]); - __this_cpu_inc(to-stat-count[MEM_CGROUP_STAT_FILE_MAPPED]); - preempt_enable(); - } + preempt_disable(); + lock_page_cgroup_migrate(pc); + __mem_cgroup_update_file_stat(pc, from, to); + mem_cgroup_charge_statistics(from, pc, false); if (uncharge) /* This is not cancel, but cancel_charge does all we need. */ @@ -1745,6 +1962,8 @@ static void __mem_cgroup_move_account(struct page_cgroup *pc, /* caller should have done css_get */ pc-mem_cgroup = to; mem_cgroup_charge_statistics(to, pc, true); + unlock_page_cgroup_migrate(pc); + preempt_enable(); Glad to see this cleanup :) But, hmm, I don't think preempt_disable/enable() is enough(and bit_spin_lock/unlock() does it anyway). lock/unlock_page_cgroup_migrate() can be called under irq context (e.g. end_page_writeback()), so I think we must local_irq_disable()/enable() here. Thanks, Daisuke Nishimura. /* * We charges against to which may not have any tasks. Then, to * can be under rmdir(). But in current implementation, caller of @@ -3042,6 +3261,10 @@ enum { MCS_PGPGIN, MCS_PGPGOUT, MCS_SWAP, + MCS_FILE_DIRTY, + MCS_WRITEBACK, + MCS_WRITEBACK_TEMP
[Devel] Re: [PATCH -mmotm 3/3] memcg: dirty pages instrumentation
+ ClearPageCgroupAcctDirty(pc); + break; + case MEM_CGROUP_STAT_WBACK: + if (set) + SetPageCgroupAcctWB(pc); + else + ClearPageCgroupAcctWB(pc); + break; + case MEM_CGROUP_STAT_WBACK_TEMP: + if (set) + SetPageCgroupAcctWBTemp(pc); + else + ClearPageCgroupAcctWBTemp(pc); + break; + case MEM_CGROUP_STAT_UNSTABLE_NFS: + if (set) + SetPageCgroupAcctUnstableNFS(pc); + else + ClearPageCgroupAcctUnstableNFS(pc); + break; + default: + BUG(); + break; + } + mem = pc-mem_cgroup; + if (set) + __this_cpu_inc(mem-stat-count[idx]); + else + __this_cpu_dec(mem-stat-count[idx]); + unlock_page_cgroup_migrate(pc); +} + +static void move_acct_information(struct mem_cgroup *from, + struct mem_cgroup *to, + struct page_cgroup *pc) +{ + /* preemption is disabled, migration_lock is held. */ + if (PageCgroupAcctDirty(pc)) { + __this_cpu_dec(from-stat-count[MEM_CGROUP_STAT_DIRTY]); + __this_cpu_inc(to-stat-count[MEM_CGROUP_STAT_DIRTY]); + } + if (PageCgroupAcctWB(pc)) { + __this_cpu_dec(from-stat-count[MEM_CGROUP_STAT_WBACK]); + __this_cpu_inc(to-stat-count[MEM_CGROUP_STAT_WBACK]); + } + if (PageCgroupAcctWBTemp(pc)) { + __this_cpu_dec(from-stat-count[MEM_CGROUP_STAT_WBACK_TEMP]); + __this_cpu_inc(to-stat-count[MEM_CGROUP_STAT_WBACK_TEMP]); + } + if (PageCgroupAcctUnstableNFS(pc)) { + __this_cpu_dec(from-stat-count[MEM_CGROUP_STAT_UNSTABLE_NFS]); + __this_cpu_inc(to-stat-count[MEM_CGROUP_STAT_UNSTABLE_NFS]); + } +} + +/* * size of first charge trial. 32 comes from vmscan.c's magic value. * TODO: maybe necessary to use big numbers in big irons. 
*/ @@ -1794,15 +1878,16 @@ static void __mem_cgroup_move_account(st VM_BUG_ON(!PageCgroupUsed(pc)); VM_BUG_ON(pc-mem_cgroup != from); + preempt_disable(); + lock_page_cgroup_migrate(pc); page = pc-page; if (page_mapped(page) !PageAnon(page)) { /* Update mapped_file data for mem_cgroup */ - preempt_disable(); __this_cpu_dec(from-stat-count[MEM_CGROUP_STAT_FILE_MAPPED]); __this_cpu_inc(to-stat-count[MEM_CGROUP_STAT_FILE_MAPPED]); - preempt_enable(); } mem_cgroup_charge_statistics(from, pc, false); + move_acct_information(from, to, pc); Kame-san, a question. According to is_target_pte_for_mc() it seems we don't move file pages across cgroups for now. If !PageAnon(page) we just return 0 and the page won't be selected for migration in mem_cgroup_move_charge_pte_range(). You're right. It's my TODO to move file pages at task migration. So, if I've understood well the code is correct in perspective, but right now it's unnecessary. File pages are not moved on task migration across cgroups and, at the moment, there's no way for file page accounted statistics to go negative. Or am I missing something? __mem_cgroup_move_account() will be called not only at task migration but also at rmdir, so I think it would be better to handle file pages anyway. Thanks, Daisuke Nishimura. ___ Containers mailing list contain...@lists.linux-foundation.org https://lists.linux-foundation.org/mailman/listinfo/containers ___ Devel mailing list Devel@openvz.org https://openvz.org/mailman/listinfo/devel
[Devel] Re: [PATCH -mmotm 3/3] memcg: dirty pages instrumentation
On Tue, 2 Mar 2010 09:01:58 +0100, Andrea Righi ari...@develer.com wrote: On Tue, Mar 02, 2010 at 09:23:09AM +0900, KAMEZAWA Hiroyuki wrote: On Mon, 1 Mar 2010 22:23:40 +0100 Andrea Righi ari...@develer.com wrote: Apply the cgroup dirty pages accounting and limiting infrastructure to the opportune kernel functions. Signed-off-by: Andrea Righi ari...@develer.com Seems nice. Hmm. the last problem is moving account between memcg. Right ? Correct. This was actually the last item of the TODO list. Anyway, I'm still considering if it's correct to move dirty pages when a task is migrated from a cgroup to another. Currently, dirty pages just remain in the original cgroup and are flushed depending on the original cgroup settings. That is not totally wrong... at least moving the dirty pages between memcgs should be optional (move_charge_at_immigrate?). FYI, I'm planning to add file-cache and shmem/tmpfs support for move_charge feature for 2.6.35. But, hmm, it would be complicated if we try to move dirty account too. Thanks, Daisuke Nishimura.
[Devel] Re: [PATCH -mmotm 3/3] memcg: dirty pages instrumentation
On Tue, 2 Mar 2010 23:18:23 +0100, Andrea Righi ari...@develer.com wrote: On Tue, Mar 02, 2010 at 07:20:26PM +0530, Balbir Singh wrote: * KAMEZAWA Hiroyuki kamezawa.hir...@jp.fujitsu.com [2010-03-02 17:23:16]: On Tue, 2 Mar 2010 09:01:58 +0100 Andrea Righi ari...@develer.com wrote: On Tue, Mar 02, 2010 at 09:23:09AM +0900, KAMEZAWA Hiroyuki wrote: On Mon, 1 Mar 2010 22:23:40 +0100 Andrea Righi ari...@develer.com wrote: Apply the cgroup dirty pages accounting and limiting infrastructure to the opportune kernel functions. Signed-off-by: Andrea Righi ari...@develer.com Seems nice. Hmm. the last problem is moving account between memcg. Right ? Correct. This was actually the last item of the TODO list. Anyway, I'm still considering if it's correct to move dirty pages when a task is migrated from a cgroup to another. Currently, dirty pages just remain in the original cgroup and are flushed depending on the original cgroup settings. That is not totally wrong... at least moving the dirty pages between memcgs should be optional (move_charge_at_immigrate?). My concern is - migration between memcg is already suppoted - at task move - at rmdir Then, if you leave DIRTY_PAGE accounting to original cgroup, the new cgroup (migration target)'s Dirty page accounting may goes to be negative, or incorrect value. Please check FILE_MAPPED implementation in __mem_cgroup_move_account() As if (page_mapped(page) !PageAnon(page)) { /* Update mapped_file data for mem_cgroup */ preempt_disable(); __this_cpu_dec(from-stat-count[MEM_CGROUP_STAT_FILE_MAPPED]); __this_cpu_inc(to-stat-count[MEM_CGROUP_STAT_FILE_MAPPED]); preempt_enable(); } then, FILE_MAPPED never goes negative. Absolutely! I am not sure how complex dirty memory migration will be, but one way of working around it would be to disable migration of charges when the feature is enabled (dirty* is set in the memory cgroup). We might need additional logic to allow that to happen. I've started to look at dirty memory migration. 
First attempt is to add DIRTY, WRITEBACK, etc. to page_cgroup flags and handle them in __mem_cgroup_move_account(). Probably I'll have something ready for the next version of the patch. I still need to figure out if this can work as expected... I agree it's the right direction (in fact, I have been planning to post a patch in that direction), so I leave it to you. Can you add a PCG_FILE_MAPPED flag too? I think this flag can be handled in the same way as the other flags you're trying to add, and we can change if (page_mapped(page) && !PageAnon(page)) to if (PageCgroupFileMapped(pc)) in __mem_cgroup_move_account(). It would be cleaner than the current code, IMHO. Thanks, Daisuke Nishimura.
[Devel] Re: [PATCH -mmotm 3/3] memcg: dirty pages instrumentation
diff --git a/mm/filemap.c b/mm/filemap.c index fe09e51..f85acae 100644 --- a/mm/filemap.c +++ b/mm/filemap.c @@ -135,6 +135,7 @@ void __remove_from_page_cache(struct page *page) * having removed the page entirely. */ if (PageDirty(page) mapping_cap_account_dirty(mapping)) { + mem_cgroup_update_stat(page, MEM_CGROUP_STAT_FILE_DIRTY, -1); dec_zone_page_state(page, NR_FILE_DIRTY); dec_bdi_stat(mapping-backing_dev_info, BDI_DIRTY); } (snip) @@ -1096,6 +1113,7 @@ int __set_page_dirty_no_writeback(struct page *page) void account_page_dirtied(struct page *page, struct address_space *mapping) { if (mapping_cap_account_dirty(mapping)) { + mem_cgroup_update_stat(page, MEM_CGROUP_STAT_FILE_DIRTY, 1); __inc_zone_page_state(page, NR_FILE_DIRTY); __inc_bdi_stat(mapping-backing_dev_info, BDI_DIRTY); task_dirty_inc(current); As long as I can see, those two functions(at least) calls mem_cgroup_update_state(), which acquires page cgroup lock, under mapping-tree_lock. But as I fixed before in commit e767e056, page cgroup lock must not acquired under mapping-tree_lock. hmm, we should call those mem_cgroup_update_state() outside mapping-tree_lock, or add local_irq_save/restore() around lock/unlock_page_cgroup() to avoid dead-lock. Thanks, Daisuke Nishimura. On Mon, 1 Mar 2010 22:23:40 +0100, Andrea Righi ari...@develer.com wrote: Apply the cgroup dirty pages accounting and limiting infrastructure to the opportune kernel functions. 
Signed-off-by: Andrea Righi ari...@develer.com --- fs/fuse/file.c |5 +++ fs/nfs/write.c |4 ++ fs/nilfs2/segment.c | 10 +- mm/filemap.c|1 + mm/page-writeback.c | 84 -- mm/rmap.c |4 +- mm/truncate.c |2 + 7 files changed, 76 insertions(+), 34 deletions(-) diff --git a/fs/fuse/file.c b/fs/fuse/file.c index a9f5e13..dbbdd53 100644 --- a/fs/fuse/file.c +++ b/fs/fuse/file.c @@ -11,6 +11,7 @@ #include linux/pagemap.h #include linux/slab.h #include linux/kernel.h +#include linux/memcontrol.h #include linux/sched.h #include linux/module.h @@ -1129,6 +1130,8 @@ static void fuse_writepage_finish(struct fuse_conn *fc, struct fuse_req *req) list_del(req-writepages_entry); dec_bdi_stat(bdi, BDI_WRITEBACK); + mem_cgroup_update_stat(req-pages[0], + MEM_CGROUP_STAT_WRITEBACK_TEMP, -1); dec_zone_page_state(req-pages[0], NR_WRITEBACK_TEMP); bdi_writeout_inc(bdi); wake_up(fi-page_waitq); @@ -1240,6 +1243,8 @@ static int fuse_writepage_locked(struct page *page) req-inode = inode; inc_bdi_stat(mapping-backing_dev_info, BDI_WRITEBACK); + mem_cgroup_update_stat(tmp_page, + MEM_CGROUP_STAT_WRITEBACK_TEMP, 1); inc_zone_page_state(tmp_page, NR_WRITEBACK_TEMP); end_page_writeback(page); diff --git a/fs/nfs/write.c b/fs/nfs/write.c index b753242..7316f7a 100644 --- a/fs/nfs/write.c +++ b/fs/nfs/write.c @@ -439,6 +439,7 @@ nfs_mark_request_commit(struct nfs_page *req) req-wb_index, NFS_PAGE_TAG_COMMIT); spin_unlock(inode-i_lock); + mem_cgroup_update_stat(req-wb_page, MEM_CGROUP_STAT_UNSTABLE_NFS, 1); inc_zone_page_state(req-wb_page, NR_UNSTABLE_NFS); inc_bdi_stat(req-wb_page-mapping-backing_dev_info, BDI_UNSTABLE); __mark_inode_dirty(inode, I_DIRTY_DATASYNC); @@ -450,6 +451,7 @@ nfs_clear_request_commit(struct nfs_page *req) struct page *page = req-wb_page; if (test_and_clear_bit(PG_CLEAN, (req)-wb_flags)) { + mem_cgroup_update_stat(page, MEM_CGROUP_STAT_UNSTABLE_NFS, -1); dec_zone_page_state(page, NR_UNSTABLE_NFS); dec_bdi_stat(page-mapping-backing_dev_info, BDI_UNSTABLE); return 1; @@ -1273,6 
+1275,8 @@ nfs_commit_list(struct inode *inode, struct list_head *head, int how) req = nfs_list_entry(head-next); nfs_list_remove_request(req); nfs_mark_request_commit(req); + mem_cgroup_update_stat(req-wb_page, + MEM_CGROUP_STAT_UNSTABLE_NFS, -1); dec_zone_page_state(req-wb_page, NR_UNSTABLE_NFS); dec_bdi_stat(req-wb_page-mapping-backing_dev_info, BDI_UNSTABLE); diff --git a/fs/nilfs2/segment.c b/fs/nilfs2/segment.c index ada2f1b..aef6d13 100644 --- a/fs/nilfs2/segment.c +++ b/fs/nilfs2/segment.c @@ -1660,8 +1660,11 @@ nilfs_copy_replace_page_buffers(struct page *page, struct list_head *out) } while (bh = bh-b_this_page, bh2 = bh2-b_this_page, bh != head); kunmap_atomic(kaddr, KM_USER0); - if (!TestSetPageWriteback(clone_page)) + if (!TestSetPageWriteback(clone_page
[Devel] Re: [PATCH -mmotm 3/3] memcg: dirty pages instrumentation
On Wed, 3 Mar 2010 12:29:06 +0900, KAMEZAWA Hiroyuki kamezawa.hir...@jp.fujitsu.com wrote: On Wed, 3 Mar 2010 11:12:38 +0900 Daisuke Nishimura nishim...@mxp.nes.nec.co.jp wrote: diff --git a/mm/filemap.c b/mm/filemap.c index fe09e51..f85acae 100644 --- a/mm/filemap.c +++ b/mm/filemap.c @@ -135,6 +135,7 @@ void __remove_from_page_cache(struct page *page) * having removed the page entirely. */ if (PageDirty(page) mapping_cap_account_dirty(mapping)) { + mem_cgroup_update_stat(page, MEM_CGROUP_STAT_FILE_DIRTY, -1); dec_zone_page_state(page, NR_FILE_DIRTY); dec_bdi_stat(mapping-backing_dev_info, BDI_DIRTY); } (snip) @@ -1096,6 +1113,7 @@ int __set_page_dirty_no_writeback(struct page *page) void account_page_dirtied(struct page *page, struct address_space *mapping) { if (mapping_cap_account_dirty(mapping)) { + mem_cgroup_update_stat(page, MEM_CGROUP_STAT_FILE_DIRTY, 1); __inc_zone_page_state(page, NR_FILE_DIRTY); __inc_bdi_stat(mapping-backing_dev_info, BDI_DIRTY); task_dirty_inc(current); As long as I can see, those two functions(at least) calls mem_cgroup_update_state(), which acquires page cgroup lock, under mapping-tree_lock. But as I fixed before in commit e767e056, page cgroup lock must not acquired under mapping-tree_lock. hmm, we should call those mem_cgroup_update_state() outside mapping-tree_lock, or add local_irq_save/restore() around lock/unlock_page_cgroup() to avoid dead-lock. Ah, good catch! But hmm... This account_page_dirtted() seems to be called under IRQ-disabled. About __remove_from_page_cache(), I think page_cgroup should have its own DIRTY flag, then, mem_cgroup_uncharge_page() can handle it automatically. But. there are no guarantee that following never happens. lock_page_cgroup() === interrupt. - mapping-tree_lock() Even if mapping-tree_lock is held with IRQ-disabled. Then, if we add local_irq_save(), we have to add it to all lock_page_cgroup(). Then, hm...some kind of new trick ? as.. (Follwoing patch is not tested!!) 
If we can verify that all callers of mem_cgroup_update_stat() have always either aquired or not aquired tree_lock, this direction will work fine. But if we can't, we have to add local_irq_save() to lock_page_cgroup() like below. === include/linux/page_cgroup.h |8 ++-- mm/memcontrol.c | 43 +-- 2 files changed, 31 insertions(+), 20 deletions(-) diff --git a/include/linux/page_cgroup.h b/include/linux/page_cgroup.h index 30b0813..51da916 100644 --- a/include/linux/page_cgroup.h +++ b/include/linux/page_cgroup.h @@ -83,15 +83,19 @@ static inline enum zone_type page_cgroup_zid(struct page_cgroup *pc) return page_zonenum(pc-page); } -static inline void lock_page_cgroup(struct page_cgroup *pc) +static inline void __lock_page_cgroup(struct page_cgroup *pc) { bit_spin_lock(PCG_LOCK, pc-flags); } +#define lock_page_cgroup(pc, flags) \ + do { local_irq_save(flags); __lock_page_cgroup(pc); } while (0) -static inline void unlock_page_cgroup(struct page_cgroup *pc) +static inline void __unlock_page_cgroup(struct page_cgroup *pc) { bit_spin_unlock(PCG_LOCK, pc-flags); } +#define unlock_page_cgroup(pc, flags) \ + do { __unlock_page_cgroup(pc); local_irq_restore(flags); } while (0) #else /* CONFIG_CGROUP_MEM_RES_CTLR */ struct page_cgroup; diff --git a/mm/memcontrol.c b/mm/memcontrol.c index 00ed4b1..40b9be4 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -1327,12 +1327,13 @@ void mem_cgroup_update_file_mapped(struct page *page, int val) { struct mem_cgroup *mem; struct page_cgroup *pc; + unsigned long flags; pc = lookup_page_cgroup(page); if (unlikely(!pc)) return; - lock_page_cgroup(pc); + lock_page_cgroup(pc, flags); mem = pc-mem_cgroup; if (!mem) goto done; @@ -1346,7 +1347,7 @@ void mem_cgroup_update_file_mapped(struct page *page, int val) __this_cpu_add(mem-stat-count[MEM_CGROUP_STAT_FILE_MAPPED], val); done: - unlock_page_cgroup(pc); + unlock_page_cgroup(pc, flags); } /* @@ -1680,11 +1681,12 @@ struct mem_cgroup *try_get_mem_cgroup_from_page(struct page *page) struct 
page_cgroup *pc; unsigned short id; swp_entry_t ent; + unsigned long flags; VM_BUG_ON(!PageLocked(page)); pc = lookup_page_cgroup(page); - lock_page_cgroup(pc); + lock_page_cgroup(pc, flags); if (PageCgroupUsed(pc)) { mem = pc-mem_cgroup; if (mem !css_tryget(mem-css)) @@ -1698,7 +1700,7 @@ struct mem_cgroup *try_get_mem_cgroup_from_page(struct page *page) mem = NULL; rcu_read_unlock
[Devel] Re: [PATCH -mmotm 1/2] memcg: dirty pages accounting and limiting infrastructure
On Fri, 26 Feb 2010 23:52:30 +0100, Andrea Righi ari...@develer.com wrote: Infrastructure to account dirty pages per cgroup and add dirty limit interfaces in the cgroupfs: - Active write-out: memory.dirty_ratio, memory.dirty_bytes - Background write-out: memory.dirty_background_ratio, memory.dirty_background_bytes It looks good for me in general. Signed-off-by: Andrea Righi ari...@develer.com --- include/linux/memcontrol.h | 74 +- mm/memcontrol.c| 354 2 files changed, 399 insertions(+), 29 deletions(-) (snip) +s64 mem_cgroup_page_stat(enum mem_cgroup_page_stat_item item) +{ + struct mem_cgroup_page_stat stat = {}; + struct mem_cgroup *memcg; + I think it would be better to add if (mem_cgroup_disabled()). + rcu_read_lock(); + memcg = mem_cgroup_from_task(current); + if (memcg) { + /* + * Recursively evaulate page statistics against all cgroup + * under hierarchy tree + */ + stat.item = item; + mem_cgroup_walk_tree(memcg, stat, mem_cgroup_page_stat_cb); + } else + stat.value = -ENOMEM; + rcu_read_unlock(); + + return stat.value; +} + static int mem_cgroup_count_children_cb(struct mem_cgroup *mem, void *data) { int *val = data; @@ -1263,10 +1419,10 @@ static void record_last_oom(struct mem_cgroup *mem) } /* - * Currently used to update mapped file statistics, but the routine can be - * generalized to update other statistics as well. + * Generalized routine to update memory cgroup statistics. */ -void mem_cgroup_update_file_mapped(struct page *page, int val) +void mem_cgroup_update_stat(struct page *page, + enum mem_cgroup_stat_index idx, int val) { struct mem_cgroup *mem; struct page_cgroup *pc; ditto. @@ -1286,7 +1442,8 @@ void mem_cgroup_update_file_mapped(struct page *page, int val) /* * Preemption is already disabled. We can use __this_cpu_xxx */ - __this_cpu_add(mem-stat-count[MEM_CGROUP_STAT_FILE_MAPPED], val); + VM_BUG_ON(idx = MEM_CGROUP_STAT_NSTATS); + __this_cpu_add(mem-stat-count[idx], val); done: unlock_page_cgroup(pc); Thanks, Daisuke Nishimura. 
___ Containers mailing list contain...@lists.linux-foundation.org https://lists.linux-foundation.org/mailman/listinfo/containers ___ Devel mailing list Devel@openvz.org https://openvz.org/mailman/listinfo/devel
[Devel] Re: [PATCH v4 3/4] memcg: rework usage of stats by soft limit
On Sun, 27 Dec 2009 04:09:01 +0200, Kirill A. Shutemov kir...@shutemov.name wrote: Instead of incrementing a counter on each page in/out and comparing it with a constant, we set the counter to the constant, decrement it on each page in/out, and compare it with zero. We want to make the comparison as fast as possible. On many RISC systems (probably not only RISC) comparing with zero is more effective than comparing with a constant, since not every constant can be an immediate operand for a compare instruction. Also, I've renamed MEM_CGROUP_STAT_EVENTS to MEM_CGROUP_STAT_SOFTLIMIT, since it's really not a generic counter. Signed-off-by: Kirill A. Shutemov kir...@shutemov.name --- mm/memcontrol.c | 19 ++- 1 files changed, 14 insertions(+), 5 deletions(-) diff --git a/mm/memcontrol.c b/mm/memcontrol.c index 1d71cb4..36eb7af 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -69,8 +69,9 @@ enum mem_cgroup_stat_index { MEM_CGROUP_STAT_FILE_MAPPED, /* # of pages charged as file rss */ MEM_CGROUP_STAT_PGPGIN_COUNT, /* # of pages paged in */ MEM_CGROUP_STAT_PGPGOUT_COUNT, /* # of pages paged out */ - MEM_CGROUP_STAT_EVENTS, /* sum of pagein + pageout for internal use */ MEM_CGROUP_STAT_SWAPOUT, /* # of pages, swapped out */ + MEM_CGROUP_STAT_SOFTLIMIT, /* decrements on each page in/out. + used by soft limit implementation */ MEM_CGROUP_STAT_NSTATS, }; @@ -90,6 +91,13 @@ __mem_cgroup_stat_reset_safe(struct mem_cgroup_stat_cpu *stat, stat->count[idx] = 0; } +static inline void +__mem_cgroup_stat_set(struct mem_cgroup_stat_cpu *stat, + enum mem_cgroup_stat_index idx, s64 val) +{ + stat->count[idx] = val; +} + I think it would be better to name it __mem_cgroup_stat_set_safe. And could you remove the definition of __mem_cgroup_stat_reset? Thanks, Daisuke Nishimura. 
static inline s64 __mem_cgroup_stat_read_local(struct mem_cgroup_stat_cpu *stat, enum mem_cgroup_stat_index idx) @@ -380,9 +388,10 @@ static bool mem_cgroup_soft_limit_check(struct mem_cgroup *mem) cpu = get_cpu(); cpustat = &mem->stat.cpustat[cpu]; - val = __mem_cgroup_stat_read_local(cpustat, MEM_CGROUP_STAT_EVENTS); - if (unlikely(val > SOFTLIMIT_EVENTS_THRESH)) { - __mem_cgroup_stat_reset_safe(cpustat, MEM_CGROUP_STAT_EVENTS); + val = __mem_cgroup_stat_read_local(cpustat, MEM_CGROUP_STAT_SOFTLIMIT); + if (unlikely(val < 0)) { + __mem_cgroup_stat_set(cpustat, MEM_CGROUP_STAT_SOFTLIMIT, + SOFTLIMIT_EVENTS_THRESH); ret = true; } put_cpu(); @@ -515,7 +524,7 @@ static void mem_cgroup_charge_statistics(struct mem_cgroup *mem, else __mem_cgroup_stat_add_safe(cpustat, MEM_CGROUP_STAT_PGPGOUT_COUNT, 1); - __mem_cgroup_stat_add_safe(cpustat, MEM_CGROUP_STAT_EVENTS, 1); + __mem_cgroup_stat_add_safe(cpustat, MEM_CGROUP_STAT_SOFTLIMIT, -1); put_cpu(); } -- 1.6.5.7
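The decrement-and-test scheme in the patch above can be sketched in plain C. This is a userspace illustration of the technique, not kernel code; the struct name and the fixed threshold are made up for the example.

```c
#include <assert.h>
#include <stdbool.h>

#define SOFTLIMIT_EVENTS_THRESH 1000

/* stand-in for one CPU's statistics slot */
struct event_counter {
    long long count;
};

/* Count one page-in/page-out event.  The counter starts at the
 * threshold and is decremented; when it drops below zero we reload
 * it and tell the caller to run the (expensive) slow path.  The
 * hot-path comparison is against zero, which many instruction sets
 * encode more cheaply than a compare against an arbitrary constant. */
static bool count_event(struct event_counter *ec)
{
    ec->count -= 1;
    if (ec->count < 0) {
        ec->count = SOFTLIMIT_EVENTS_THRESH;
        return true;   /* threshold reached: update the softlimit tree */
    }
    return false;
}
```

Initialized to the threshold, the slow path fires once every 1001 events, exactly as the increment-and-compare version would with `val > SOFTLIMIT_EVENTS_THRESH`.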
[Devel] Re: [PATCH v4 4/4] memcg: implement memory thresholds
On Sun, 27 Dec 2009 04:09:02 +0200, Kirill A. Shutemov kir...@shutemov.name wrote: It allows to register multiple memory and memsw thresholds and gets notifications when it crosses. To register a threshold application need: - create an eventfd; - open memory.usage_in_bytes or memory.memsw.usage_in_bytes; - write string like event_fd memory.usage_in_bytes threshold to cgroup.event_control. Application will be notified through eventfd when memory usage crosses threshold in any direction. It's applicable for root and non-root cgroup. It uses stats to track memory usage, simmilar to soft limits. It checks if we need to send event to userspace on every 100 page in/out. I guess it's good compromise between performance and accuracy of thresholds. Signed-off-by: Kirill A. Shutemov kir...@shutemov.name --- Documentation/cgroups/memory.txt | 19 +++- mm/memcontrol.c | 275 ++ 2 files changed, 293 insertions(+), 1 deletions(-) diff --git a/Documentation/cgroups/memory.txt b/Documentation/cgroups/memory.txt index b871f25..195af07 100644 --- a/Documentation/cgroups/memory.txt +++ b/Documentation/cgroups/memory.txt @@ -414,7 +414,24 @@ NOTE1: Soft limits take effect over a long period of time, since they involve NOTE2: It is recommended to set the soft limit always below the hard limit, otherwise the hard limit will take precedence. -8. TODO +8. Memory thresholds + +Memory controler implements memory thresholds using cgroups notification +API (see cgroups.txt). It allows to register multiple memory and memsw +thresholds and gets notifications when it crosses. + +To register a threshold application need: + - create an eventfd using eventfd(2); + - open memory.usage_in_bytes or memory.memsw.usage_in_bytes; + - write string like event_fd memory.usage_in_bytes threshold to + cgroup.event_control. + +Application will be notified through eventfd when memory usage crosses +threshold in any direction. + +It's applicable for root and non-root cgroup. + +9. TODO 1. 
Add support for accounting huge pages (as a separate controller) 2. Make per-cgroup scanner reclaim not-shared pages first diff --git a/mm/memcontrol.c b/mm/memcontrol.c index 36eb7af..3a0a6a1 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c It would be a nitpick, but my patch (http://marc.info/?l=linux-mm-commits&m=126152804420992&w=2) has already modified here. I think it might be better for you to apply my patches by hand or wait for the next mmotm to be released to avoid bothering Andrew. (There is enough time left till the next merge window :)) (snip) +static void __mem_cgroup_threshold(struct mem_cgroup *memcg, bool swap) +{ + struct mem_cgroup_threshold_ary *thresholds; + u64 usage = mem_cgroup_usage(memcg, swap); + int i, cur; + I think calling mem_cgroup_usage() after checking if (!thresholds) decreases the overhead a little when we don't set any thresholds. I've confirmed that the change makes the assembler output different. + rcu_read_lock(); + if (!swap) { + thresholds = rcu_dereference(memcg->thresholds); + } else { + thresholds = rcu_dereference(memcg->memsw_thresholds); + } + + if (!thresholds) + goto unlock; + + cur = atomic_read(&thresholds->cur); + + /* Check if a threshold crossed in any direction */ + + for (i = cur; i >= 0 && + unlikely(thresholds->entries[i].threshold > usage); i--) { + atomic_dec(&thresholds->cur); + eventfd_signal(thresholds->entries[i].eventfd, 1); + } + + for (i = cur + 1; i < thresholds->size && + unlikely(thresholds->entries[i].threshold <= usage); i++) { + atomic_inc(&thresholds->cur); + eventfd_signal(thresholds->entries[i].eventfd, 1); + } +unlock: + rcu_read_unlock(); +} + Thanks, Daisuke Nishimura.
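The userspace side of this notification rests entirely on eventfd(2): the application registers an eventfd through cgroup.event_control, and eventfd_signal() in the patch above increments it, waking any reader. A minimal sketch of that primitive follows; the kernel side is simulated here with a plain write(2), and the cgroup.event_control plumbing itself is not exercised.

```c
#include <stdint.h>
#include <sys/eventfd.h>
#include <unistd.h>

/* Returns the event count a threshold watcher would observe after
 * one signal.  In the real flow the watcher passes efd to
 * cgroup.event_control and the kernel calls eventfd_signal(efd, 1);
 * here we stand in for the kernel with a write(2) of the increment. */
static uint64_t simulate_threshold_event(void)
{
    int efd = eventfd(0, 0);          /* fd the app would register */
    uint64_t kick = 1, seen = 0;

    if (efd < 0)
        return 0;
    write(efd, &kick, sizeof(kick));  /* kernel-side signal, simulated */
    read(efd, &seen, sizeof(seen));   /* watcher wakes up here */
    close(efd);
    return seen;
}
```

A real watcher would block in read(2) (or poll the fd) until the memcg usage crosses one of its registered thresholds.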
[Devel] Re: [PATCH RFC v2 3/4] memcg: rework usage of stats by soft limit
On Sat, 12 Dec 2009 15:06:52 +0200 Kirill A. Shutemov kir...@shutemov.name wrote: On Sat, Dec 12, 2009 at 5:50 AM, Daisuke Nishimura d-nishim...@mtf.biglobe.ne.jp wrote: Sorry, I disagree with this change. mem_cgroup_soft_limit_check() is used for checking how much the current usage exceeds soft_limit_in_bytes and updating the softlimit tree asynchronously, instead of checking on every charge/uncharge. What if you change soft_limit_in_bytes, but the numbers of charges and uncharges are very balanced afterwards? The softlimit tree will not be updated for a long time. I don't see how my patch affects the logic you've described. Statistics updates and checks are in the same place. It just uses decrement instead of increment. Ah... my bad. Ignore this comment, please. I misunderstood this patch. And IIUC, it's the same for your threshold feature, right? I think it would be better to: - discard this change. - in 4/4, rename mem_cgroup_soft_limit_check to mem_cgroup_event_check, and instead of adding a new STAT counter, do like: if (mem_cgroup_event_check(mem)) { mem_cgroup_update_tree(mem, page); mem_cgroup_threshold(mem); } I think that mem_cgroup_update_tree() and mem_cgroup_threshold() should be run with different frequency. How to share MEM_CGROUP_STAT_EVENTS between soft limits and thresholds in this case? hmm, both softlimit and your threshold count events at the same places (charge and uncharge). So, I think those events can be shared. Is there any reason they should run at different frequencies? Thanks, Daisuke Nishimura.
[Devel] Re: [PATCH RFC v2 3/4] memcg: rework usage of stats by soft limit
On Sat, 12 Dec 2009 21:46:08 +0200 Kirill A. Shutemov kir...@shutemov.name wrote: On Sat, Dec 12, 2009 at 4:34 PM, Daisuke Nishimura d-nishim...@mtf.biglobe.ne.jp wrote: On Sat, 12 Dec 2009 15:06:52 +0200 Kirill A. Shutemov kir...@shutemov.name wrote: On Sat, Dec 12, 2009 at 5:50 AM, Daisuke Nishimura d-nishim...@mtf.biglobe.ne.jp wrote: And IIUC, it's the same for your threshold feature, right? I think it would be better to: - discard this change. - in 4/4, rename mem_cgroup_soft_limit_check to mem_cgroup_event_check, and instead of adding a new STAT counter, do like: if (mem_cgroup_event_check(mem)) { mem_cgroup_update_tree(mem, page); mem_cgroup_threshold(mem); } I think that mem_cgroup_update_tree() and mem_cgroup_threshold() should be run with different frequency. How to share MEM_CGROUP_STAT_EVENTS between soft limits and thresholds in this case? hmm, both softlimit and your threshold count events at the same places (charge and uncharge). So, I think those events can be shared. Is there any reason they should run at different frequencies? SOFTLIMIT_EVENTS_THRESH is 1000. If we use the same value for thresholds, a threshold can be exceeded by 1000*nr_cpu_ids pages. It's too many. I think that 100 is a reasonable value. O.K. I see. mem_cgroup_soft_limit_check() resets MEM_CGROUP_STAT_EVENTS when it reaches SOFTLIMIT_EVENTS_THRESH. If I do the same thing for THRESHOLDS_EVENTS_THRESH (which is 100), mem_cgroup_event_check() will never be 'true'. Any idea how to share MEM_CGROUP_STAT_EVENTS in this case? It's impossible if they have different frequencies, as you say. Thank you for your clarification. Daisuke Nishimura.
[Devel] Re: [PATCH RFC v2 4/4] memcg: implement memory thresholds
On Sat, 12 Dec 2009 00:59:19 +0200 Kirill A. Shutemov kir...@shutemov.name wrote: It allows to register multiple memory and memsw thresholds and gets notifications when it crosses. To register a threshold application need: - create an eventfd; - open memory.usage_in_bytes or memory.memsw.usage_in_bytes; - write string like event_fd memory.usage_in_bytes threshold to cgroup.event_control. Application will be notified through eventfd when memory usage crosses threshold in any direction. It's applicable for root and non-root cgroup. It uses stats to track memory usage, simmilar to soft limits. It checks if we need to send event to userspace on every 100 page in/out. I guess it's good compromise between performance and accuracy of thresholds. Signed-off-by: Kirill A. Shutemov kir...@shutemov.name --- mm/memcontrol.c | 263 +++ 1 files changed, 263 insertions(+), 0 deletions(-) diff --git a/mm/memcontrol.c b/mm/memcontrol.c index c6081cc..5ba2140 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -6,6 +6,10 @@ * Copyright 2007 OpenVZ SWsoft Inc * Author: Pavel Emelianov xe...@openvz.org * + * Memory thresholds + * Copyright (C) 2009 Nokia Corporation + * Author: Kirill A. Shutemov + * * This program is free software; you can redistribute it and/or modify * it under the terms of the GNU General Public License as published by * the Free Software Foundation; either version 2 of the License, or @@ -38,6 +42,7 @@ #include linux/vmalloc.h #include linux/mm_inline.h #include linux/page_cgroup.h +#include linux/eventfd.h #include internal.h #include asm/uaccess.h @@ -56,6 +61,7 @@ static int really_do_swap_account __initdata = 1; /* for remember boot option*/ static DEFINE_MUTEX(memcg_tasklist); /* can be hold under cgroup_mutex */ This mutex has already removed in current mmotm. Please write a patch for memcg based on mmot. #define SOFTLIMIT_EVENTS_THRESH (1000) +#define THRESHOLDS_EVENTS_THRESH (100) /* * Statistics for memory cgroup. 
(snip) @@ -1363,6 +1395,11 @@ static int __mem_cgroup_try_charge(struct mm_struct *mm, if (mem_cgroup_soft_limit_check(mem)) mem_cgroup_update_tree(mem, page); done: + if (mem_cgroup_threshold_check(mem)) { + mem_cgroup_threshold(mem, false); + if (do_swap_account) + mem_cgroup_threshold(mem, true); + } return 0; nomem: css_put(&mem->css); @@ -1906,6 +1943,11 @@ __mem_cgroup_uncharge_common(struct page *page, enum charge_type ctype) if (mem_cgroup_soft_limit_check(mem)) mem_cgroup_update_tree(mem, page); + if (mem_cgroup_threshold_check(mem)) { + mem_cgroup_threshold(mem, false); + if (do_swap_account) + mem_cgroup_threshold(mem, true); + } /* at swapout, this memcg will be accessed to record to swap */ if (ctype != MEM_CGROUP_CHARGE_TYPE_SWAPOUT) css_put(&mem->css); Can the if (do_swap_account) check be moved into mem_cgroup_threshold()? Thanks, Daisuke Nishimura.
[Devel] Re: [PATCH RFC v2 3/4] memcg: rework usage of stats by soft limit
Sorry, I disagree this change. mem_cgroup_soft_limit_check() is used for checking how much current usage exceeds the soft_limit_in_bytes and updating softlimit tree asynchronously, instead of checking every charge/uncharge. What if you change the soft_limit_in_bytes, but the number of charges and uncharges are very balanced afterwards ? The softlimit tree will not be updated for a long time. And IIUC, it's the same for your threshold feature, right ? I think it would be better: - discard this change. - in 4/4, rename mem_cgroup_soft_limit_check to mem_cgroup_event_check, and instead of adding a new STAT counter, do like: if (mem_cgroup_event_check(mem)) { mem_cgroup_update_tree(mem, page); mem_cgroup_threshold(mem); } Ah, yes. Current code doesn't call mem_cgroup_soft_limit_check() for root cgroup in charge path as you said in http://marc.info/?l=linux-mmm=126021128400687w=2. I think you can change there as you want, I can change my patch (http://marc.info/?l=linux-mmm=126023467303178w=2, it has not yet sent to Andrew anyway) to check mem_cgroup_is_root() in mem_cgroup_update_tree(). Thanks, Daisuke Nishimura. On Sat, 12 Dec 2009 00:59:18 +0200 Kirill A. Shutemov kir...@shutemov.name wrote: Instead of incrementing counter on each page in/out and comparing it with constant, we set counter to constant, decrement counter on each page in/out and compare it with zero. We want to make comparing as fast as possible. On many RISC systems (probably not only RISC) comparing with zero is more effective than comparing with a constant, since not every constant can be immediate operand for compare instruction. Also, I've renamed MEM_CGROUP_STAT_EVENTS to MEM_CGROUP_STAT_SOFTLIMIT, since really it's not a generic counter. Signed-off-by: Kirill A. 
Shutemov kir...@shutemov.name --- mm/memcontrol.c | 19 ++- 1 files changed, 14 insertions(+), 5 deletions(-) diff --git a/mm/memcontrol.c b/mm/memcontrol.c index 0ff65ed..c6081cc 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -69,8 +69,9 @@ enum mem_cgroup_stat_index { MEM_CGROUP_STAT_MAPPED_FILE, /* # of pages charged as file rss */ MEM_CGROUP_STAT_PGPGIN_COUNT, /* # of pages paged in */ MEM_CGROUP_STAT_PGPGOUT_COUNT, /* # of pages paged out */ - MEM_CGROUP_STAT_EVENTS, /* sum of pagein + pageout for internal use */ MEM_CGROUP_STAT_SWAPOUT, /* # of pages, swapped out */ + MEM_CGROUP_STAT_SOFTLIMIT, /* decrements on each page in/out. + used by soft limit implementation */ MEM_CGROUP_STAT_NSTATS, }; @@ -90,6 +91,13 @@ __mem_cgroup_stat_reset_safe(struct mem_cgroup_stat_cpu *stat, stat->count[idx] = 0; } +static inline void +__mem_cgroup_stat_set(struct mem_cgroup_stat_cpu *stat, + enum mem_cgroup_stat_index idx, s64 val) +{ + stat->count[idx] = val; +} + static inline s64 __mem_cgroup_stat_read_local(struct mem_cgroup_stat_cpu *stat, enum mem_cgroup_stat_index idx) @@ -374,9 +382,10 @@ static bool mem_cgroup_soft_limit_check(struct mem_cgroup *mem) cpu = get_cpu(); cpustat = &mem->stat.cpustat[cpu]; - val = __mem_cgroup_stat_read_local(cpustat, MEM_CGROUP_STAT_EVENTS); - if (unlikely(val > SOFTLIMIT_EVENTS_THRESH)) { - __mem_cgroup_stat_reset_safe(cpustat, MEM_CGROUP_STAT_EVENTS); + val = __mem_cgroup_stat_read_local(cpustat, MEM_CGROUP_STAT_SOFTLIMIT); + if (unlikely(val < 0)) { + __mem_cgroup_stat_set(cpustat, MEM_CGROUP_STAT_SOFTLIMIT, + SOFTLIMIT_EVENTS_THRESH); ret = true; } put_cpu(); @@ -509,7 +518,7 @@ static void mem_cgroup_charge_statistics(struct mem_cgroup *mem, else __mem_cgroup_stat_add_safe(cpustat, MEM_CGROUP_STAT_PGPGOUT_COUNT, 1); - __mem_cgroup_stat_add_safe(cpustat, MEM_CGROUP_STAT_EVENTS, 1); + __mem_cgroup_stat_add_safe(cpustat, MEM_CGROUP_STAT_SOFTLIMIT, -1); put_cpu(); } -- 1.6.5.3
[Devel] Re: [PATCH RFC v0 2/3] res_counter: implement thresholds
Hi. On Thu, 26 Nov 2009 19:11:16 +0200, Kirill A. Shutemov kir...@shutemov.name wrote: It allows to setup two thresholds: one above current usage and one below. Callback threshold_notifier() will be called if a threshold is crossed. Signed-off-by: Kirill A. Shutemov kir...@shutemov.name --- include/linux/res_counter.h | 44 +++ kernel/res_counter.c|4 +++ 2 files changed, 48 insertions(+), 0 deletions(-) diff --git a/include/linux/res_counter.h b/include/linux/res_counter.h index fcb9884..bca99a5 100644 --- a/include/linux/res_counter.h +++ b/include/linux/res_counter.h @@ -9,6 +9,10 @@ * * Author: Pavel Emelianov xe...@openvz.org * + * Thresholds support + * Copyright (C) 2009 Nokia Corporation + * Author: Kirill A. Shutemov + * * See Documentation/cgroups/resource_counter.txt for more * info about what this counter is. */ @@ -42,6 +46,13 @@ struct res_counter { * the number of unsuccessful attempts to consume the resource */ unsigned long long failcnt; + + unsigned long long threshold_above; + unsigned long long threshold_below; + void (*threshold_notifier)(struct res_counter *counter, + unsigned long long usage, + unsigned long long threshold); + /* * the lock to protect all of the above. 
* the routines below consider this to be IRQ-safe @@ -145,6 +156,20 @@ static inline bool res_counter_soft_limit_check_locked(struct res_counter *cnt) return false; } +static inline void res_counter_threshold_notify_locked(struct res_counter *cnt) +{ + if (cnt->usage >= cnt->threshold_above) { + cnt->threshold_notifier(cnt, cnt->usage, cnt->threshold_above); + return; + } + + if (cnt->usage < cnt->threshold_below) { + cnt->threshold_notifier(cnt, cnt->usage, cnt->threshold_below); + return; + } +} + + /** * Get the difference between the usage and the soft limit * @cnt: The counter @@ -238,4 +263,23 @@ res_counter_set_soft_limit(struct res_counter *cnt, return 0; } +static inline int +res_counter_set_thresholds(struct res_counter *cnt, + unsigned long long threshold_above, + unsigned long long threshold_below) +{ + unsigned long flags; + int ret = -EINVAL; + + spin_lock_irqsave(&cnt->lock, flags); + if ((cnt->usage < threshold_above) && (cnt->usage >= threshold_below)) { + cnt->threshold_above = threshold_above; + cnt->threshold_below = threshold_below; + ret = 0; + } + spin_unlock_irqrestore(&cnt->lock, flags); + return ret; +} + #endif diff --git a/kernel/res_counter.c b/kernel/res_counter.c index bcdabf3..646c29c 100644 --- a/kernel/res_counter.c +++ b/kernel/res_counter.c @@ -20,6 +20,8 @@ void res_counter_init(struct res_counter *counter, struct res_counter *parent) spin_lock_init(&counter->lock); counter->limit = RESOURCE_MAX; counter->soft_limit = RESOURCE_MAX; + counter->threshold_above = RESOURCE_MAX; + counter->threshold_below = 0ULL; counter->parent = parent; } @@ -33,6 +35,7 @@ int res_counter_charge_locked(struct res_counter *counter, unsigned long val) counter->usage += val; if (counter->usage > counter->max_usage) counter->max_usage = counter->usage; + res_counter_threshold_notify_locked(counter); return 0; } @@ -73,6 +76,7 @@ void res_counter_uncharge_locked(struct res_counter *counter, unsigned long val) val = counter->usage; counter->usage -= val; +
res_counter_threshold_notify_locked(counter); } hmm.. this adds new checks to the hot path of the process life cycle. Do you have any numbers on the performance impact of these patches (without setting any threshold)? IMHO, it might be small enough to be ignored, because KAMEZAWA-san's coalesce charge/uncharge patches have decreased the charge/uncharge traffic on res_counter itself, but I want to know just to make sure. Regards, Daisuke Nishimura.
[Devel] Re: No limitation in physical memory in cgroups
Hi. Thank you for using memory cgroup :) On Wed, 20 May 2009 13:52:29 +0800, anqin anqin@gmail.com wrote: Hi all, I would like to bind given physical memory to a specific task group, however it does not take effect in my experiments. Is there something wrong in my experiments? My experiment is done under kernel 2.6.29.3 and I constructed my task group with the following commands: a) In shell #1, prepare a bash: # bash # echo $$ 2253 b) In shell #2, prepare the memory control via cgroupfs: # mount -t cgroup cgroup /mnt/mycgrp # cd /mnt/mycgrp # mkdir mycontainer # echo 0 > mycontainer/cpuset.mems # echo 0-1 > mycontainer/cpuset.cpus # echo 2252 > mycontainer/tasks # cat mycontainer/memory.usage_in_bytes 2875392 # echo 300 > mycontainer/memory.max_usage_in_bytes # cat mycontainer/memory.max_usage_in_bytes 3002368 You need to write the limit to memory.limit_in_bytes. max_usage_in_bytes means the max of the usage so far (you can reset it by writing to it). See Documentation/cgroups/memory.txt (it might be a bit old). Thanks, Daisuke Nishimura. c) In Shell #1, run a memory consumer (in which malloc() is called to allocate memory that is not freed until the program exits) to allocate 500M of memory: # /tmp/memoy_consumer_program 500 In Shell #2, the used memory ascends from the start point 2875392 when the program begins (from the number presented in memory.usage_in_bytes), but it returns to the start point when it touches the maximal boundary. On the other hand, I also run top to watch the memory held by memoy_consumer_program. In top, the memory (both virtual and rss) keeps growing without any limitation. Is this phenomenon the correct behavior of memory cgroups? 
Best Regards, Anqin
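A sketch of the corrected experiment, assuming the same mount point as above: memory.limit_in_bytes (not max_usage_in_bytes) is the knob that caps usage. The script only performs the write if the control file is actually present and writable, so it is safe to run outside a cgroup mount.

```shell
#!/bin/sh
# Hypothetical paths matching the experiment above; override with CG=...
CG=${CG:-/mnt/mycgrp/mycontainer}

LIMIT=$((300 * 1024 * 1024))   # 300M expressed in bytes
echo "setting limit to $LIMIT bytes"

if [ -w "$CG/memory.limit_in_bytes" ]; then
    echo "$LIMIT" > "$CG/memory.limit_in_bytes"
    cat "$CG/memory.limit_in_bytes"
fi
```

With the limit in place, the memory consumer's charged usage is capped at the limit (reclaim or OOM kicks in), whereas max_usage_in_bytes only records the high-water mark.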
[Devel] Re: memrlimit controller merge to mainline
On Wed, 30 Jul 2008 13:14:07 +0900, KAMEZAWA Hiroyuki [EMAIL PROTECTED] wrote: On Wed, 30 Jul 2008 12:11:15 +0900 KAMEZAWA Hiroyuki [EMAIL PROTECTED] wrote: On Wed, 30 Jul 2008 11:52:26 +0900 KAMEZAWA Hiroyuki [EMAIL PROTECTED] wrote: mem+swap controller means a shrink to memory resource controller (try_to_free_mem_cgroup_pages()) should drop only file caches. (Because kick-out-to-swap will never changes the usage.) right ? only global-lru can make a swap. maybe I can add optimization to do this. Hmm. I should see how OOM works under some situation. I'm thinking mem+swap controller in a different way: an add-on to mem controller, just as current swap controller. I mean adding memory.(mem+swap)_limit. (I'm sorry that I'm not a good writer of e-mail.) A brief summary about changes to mem controller. - mem+swap controller which limits the # sum of pages and swap_entries. - mem+swap controller just drops file caches when it reaches limit. - under mem+swap controller, recaliming Anon pages make no sense. Then, - LRU for Anon is not necessary. - LRU for tmpfs/shmem is not necessary. just showing account is better. - we should see try_to_free_mem_cgroup() again to avoid too much OOM. Maybe Retries=5 is too small because we never do swap under us. a problem like struck-into-ext3-journal can easily make file-cache reclaim difficult. - need some changes to documentation. - Should we have on/off switch of taking swap into account ? or should we implement mem+swap contoller in different name than memory controller ? If swap is not accounted, we need to do swap-out in memory reclaiming path, again. Then, mem+swap controller finally means - under mem+swap controller, program works with no swap. Only global LRU may make pages swapped-out. - If swap-accounting-mode is off, swap can be used unlimitedly. Hmm, sounds a bit differenct from what I want. How about others ? Thanks, Daisuke Nishimura. 
[Devel] Re: [RFC] Transactional CGroup task attachment
On Fri, 11 Jul 2008 09:20:58 +0900, KAMEZAWA Hiroyuki [EMAIL PROTECTED] wrote: Thank you for your effort. On Wed, 9 Jul 2008 23:46:33 -0700 Paul Menage [EMAIL PROTECTED] wrote: 3) memory Curently the memory cgroup only uses the mm-owner's cgroup at charge time, and keeps a reference to the cgroup on the page. However, patches have been proposed that would move all non-shared (page count == 1) pages to the destination cgroup when the mm-owner moves to a new cgroup. Since it's not possible to prevent page count changes without locking all mms on the system, even this transaction approach can't really give guarantees. However, something like the following would probably be suitable. It's very similar to the memrlimit approach, except for the fact that we have to handle the fact that the number of pages we finally move might not be exactly the same as the number of pages we thought we'd be moving. prepare_attach_sleep() { down_read(mm-mmap_sem); if (mm-owner != state-task) return 0; count = count_unshared_pages(mm); // save the count charged to the new cgroup state-subsys[memcgroup_subsys_id] = (void *)count; if ((ret = res_counter_charge(state-dest, count)) { up_read(mm-mmap_sem); } return ret; } commit_attach() { if (mm-owner == state-task) { final_count = move_unshared_pages(mm, state-dest); res_counter_uncharge(state-src, final_count); count = state-subsys[memcgroup_subsys_id]; res_counter_force_charge(state-dest, final_count - count); } up_read(mm-mmap_sem); } abort_attach_sleep() { if (mm-owner == state-task) { count = state-subsys[memcgroup_subsys_id]; res_counter_uncharge(state-dest, count); } up_read(mm-mmap_sem); } At frist look, maybe works well. we need some special codes (to move resource) but that's all. My small concern is a state change between prepare_attach_sleep() - commit_attach(). Hmm...but as you say, we cannot do down_write(mmap_sem). Maybe inserting some check codes to mem_cgroup_charge() to stop charge while move is the last thing we can do. 
I have two comments. - I think the page reclaiming code decreases the memory charge without holding mmap_sem (e.g. try_to_unmap(), __remove_mapping()). Shouldn't we handle these cases? - When the swap controller is merged, I should implement prepare_attach_nosleep(), which holds swap_lock. Anyway, if unwinding is supported officially, I think we can find a way to go. I think so too. Thanks, Daisuke Nishimura.
[Devel] Re: [RFC] Transactional CGroup task attachment
On Mon, 14 Jul 2008 16:54:44 +0900, KAMEZAWA Hiroyuki [EMAIL PROTECTED] wrote: On Mon, 14 Jul 2008 15:28:22 +0900 Daisuke Nishimura [EMAIL PROTECTED] wrote: On Fri, 11 Jul 2008 09:20:58 +0900, KAMEZAWA Hiroyuki [EMAIL PROTECTED] wrote: Thank you for your effort. On Wed, 9 Jul 2008 23:46:33 -0700 Paul Menage [EMAIL PROTECTED] wrote: 3) memory Curently the memory cgroup only uses the mm-owner's cgroup at charge time, and keeps a reference to the cgroup on the page. However, patches have been proposed that would move all non-shared (page count == 1) pages to the destination cgroup when the mm-owner moves to a new cgroup. Since it's not possible to prevent page count changes without locking all mms on the system, even this transaction approach can't really give guarantees. However, something like the following would probably be suitable. It's very similar to the memrlimit approach, except for the fact that we have to handle the fact that the number of pages we finally move might not be exactly the same as the number of pages we thought we'd be moving. prepare_attach_sleep() { down_read(mm-mmap_sem); if (mm-owner != state-task) return 0; count = count_unshared_pages(mm); // save the count charged to the new cgroup state-subsys[memcgroup_subsys_id] = (void *)count; if ((ret = res_counter_charge(state-dest, count)) { up_read(mm-mmap_sem); } return ret; } commit_attach() { if (mm-owner == state-task) { final_count = move_unshared_pages(mm, state-dest); res_counter_uncharge(state-src, final_count); count = state-subsys[memcgroup_subsys_id]; res_counter_force_charge(state-dest, final_count - count); } up_read(mm-mmap_sem); } abort_attach_sleep() { if (mm-owner == state-task) { count = state-subsys[memcgroup_subsys_id]; res_counter_uncharge(state-dest, count); } up_read(mm-mmap_sem); } At frist look, maybe works well. we need some special codes (to move resource) but that's all. My small concern is a state change between prepare_attach_sleep() - commit_attach(). 
Hmm...but as you say, we cannot do down_write(mmap_sem). Maybe inserting some check codes to mem_cgroup_charge() to stop charge while move is the last thing we can do. I have two comments. - I think page reclaiming code decreases the memory charge without holding mmap_sem(e.g. try_to_unmap(), __remove_mapping()). Shouldn't we handle these cases? I think decreasing is not problem, here. I don't like handle mmap-sem by some unclear way. I'd like to add some flag to mm_struct or page_struct to stop(skip/avoid) charge/uncharge while task move. It would be a good idea. - When swap controller is merged, I should implement prepare_attach_nosleep() which holds swap_lock. just making add_to_swap() fail during move is not enough ? This can only avoid increasing, I think. I thought it would be better to avoid decreasing too, just because some special handling on uncharged usage would be needed in rollback or commit. Anyway, I think it depends on how to implement move and rollback, and I will consider more. Thank you for your suggestion. Thanks, Daisuke Nishimura. ___ Containers mailing list [EMAIL PROTECTED] https://lists.linux-foundation.org/mailman/listinfo/containers ___ Devel mailing list Devel@openvz.org https://openvz.org/mailman/listinfo/devel
[Devel] Re: [PATCH -mm 1/5] swapcgroup (v3): add cgroup files
Hi, Dave-san. On Thu, 10 Jul 2008 13:35:36 -0700, Dave Hansen [EMAIL PROTECTED] wrote: On Fri, 2008-07-04 at 15:17 +0900, Daisuke Nishimura wrote: +config CGROUP_SWAP_RES_CTLR + bool "Swap Resource Controller for Control Groups" + depends on CGROUP_MEM_RES_CTLR && SWAP + help + Provides a swap resource controller that manages and limits swap usage. + Implemented as an add-on to Memory Resource Controller. Could you make this just plain depend on 'CGROUP_MEM_RES_CTLR && SWAP' and not make it configurable? I don't think the resource usage really justifies yet another .config knob to tune and break. :) I am not particular about using a kernel config option. As I said in my ToDo, I'm going to implement another method (a boot option or something) to disable (or enable?) this feature, so I can make this config not configurable after it. Thanks, Daisuke Nishimura.
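For reference, the non-configurable variant Dave suggests would drop the user-visible prompt entirely and turn the symbol on whenever both prerequisites are enabled; a hypothetical sketch in Kconfig syntax (not part of the posted patch):

```
config CGROUP_SWAP_RES_CTLR
	def_bool y
	depends on CGROUP_MEM_RES_CTLR && SWAP
```

With `def_bool` there is no .config knob to tune or break, and runtime control is left to the boot option Nishimura-san mentions.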
[Devel] Re: [PATCH -mm 5/5] swapcgroup (v3): implement force_empty
On Sat, 5 Jul 2008 13:29:44 +0900, KAMEZAWA Hiroyuki [EMAIL PROTECTED] wrote: On Fri, 4 Jul 2008 21:33:01 +0900 Daisuke Nishimura [EMAIL PROTECTED] wrote: On Fri, 4 Jul 2008 19:16:38 +0900, KAMEZAWA Hiroyuki [EMAIL PROTECTED] wrote: On Fri, 4 Jul 2008 15:24:23 +0900 Daisuke Nishimura [EMAIL PROTECTED] wrote: This patch implements force_empty of swapcgroup. Currently, it simply uncharges all the charges from the group. I think there can be other implementations. What I thought are: - move all the charges to its parent. - unuse(swap in) all the swap charged to the group. 3. move all swap back to memory (see swapoff.) Do you mean swapping in all the swap, including that used by other groups? swapping in all swap used by the group (not by all groups) O.K. I intended to say the same thing in 2. I'll try it, and I think some part of this implementation can be used by shrinking support too. (snip) Hmm...but handling limit_change (at least, returning -EBUSY) will be necessary. I think so too. But I'm not sure now whether it's good or bad to support shrinking at limit_change for swap. Shrinking swap means increasing the memory usage, and that may cause another swapout. yes. but who reduces the limit? it's the admin or users. At least, returning -EBUSY is necessary. You can use the res_counter: check limit change patch which I posted yesterday. I saw your patch, and I agree that returning -EBUSY is the first step. Do you have some magical way to move pages in swap back to memory? In this patch, I modified find_next_to_unuse() to find the entries charged to a specific group. It might be possible to modify try_to_unuse() (or define another function based on try_to_unuse()) to reduce the swap usage of a specified group down to some threshold. But, I think, one problem here is from which device the swap should be brought back to memory, or the usage balance between swap devices. Ah, that's maybe a difficult one.
As memcg has its own LRU, "add an MRU to swap" ...is not a choice ;( Swap devices are used in order of their priority, so storing per-device usage might be useful for this purpose... Anyway, I should consider more. In general, I like this set, but we can't change the limit on demand. (maybe) (just putting it on the TO-DO list is okay to me.) I'm sorry, but what do you mean by "change the limit on demand"? Could you explain more? In short, the administrator has to write a perfect plan setting each group's swap limit beforehand, because we cannot decrease used swap. The 1st problem is that the user cannot reduce the usage of swap by hand. (He can reduce it by killing processes or deleting shmem.) Once the swap usage of a group grows, other groups can't use much. The 2nd problem is that there is no entity who controls the total amount of swap. The user/admin has to check the amount of free swap space by himself when planning each group's swap limit, more carefully than in memcg. So, I think rich control of hierarchy will be of no use ;) All things should be planned before the system starts. In memcg, the amount of free memory is maintained by the global LRU. It does many jobs for us. But free swap space isn't. It's just used on demand. If we can't decrease the swap usage of a group by hand, the problem which this swap-space controller wants to fix will not be fixed at a pleasant level. Thank you for your explanation. I see your point and agree that shrinking support is desirable. I'll add it to my ToDo. Anyway, please return -EBUSY at setting limit < usage, at first :) That's enough for me, now. Yes, again. Thanks, Daisuke Nishimura.
[Devel] Re: [PATCH -mm 0/5] swapcgroup (v3)
Hi, Balbir-san. On Sat, 05 Jul 2008 12:22:25 +0530, Balbir Singh [EMAIL PROTECTED] wrote: Daisuke Nishimura wrote: Hi. This is a new version of swapcgroup. Major changes from the previous version - Rebased on 2.6.26-rc5-mm3. The new -mm has been released, but these patches can be applied on 2.6.26-rc8-mm1 too with only some offset warnings. I tested these patches on 2.6.26-rc5-mm3 with some fixes about memory, and it seems to work fine. - (NEW) Implemented force_empty. Currently, it simply uncharges all the charges from the group. Patches - [1/5] add cgroup files - [2/5] add a member to swap_info_struct - [3/5] implement charge and uncharge - [4/5] modify vm_swap_full() - [5/5] implement force_empty ToDo (in my thought. Feel free to add some others here.) - need some documentation Add to memory.txt? or create a new documentation file? I think memory.txt is good. But then, we'll need to add a Table of Contents to it, so that the swap controller documentation can be located easily. I think memory.txt is a self-contained document, so I don't want to change it, honestly. I'll write documentation for swap as a new file first for review. - add an option to disable only this feature I'm wondering if this option is needed. memcg already has a boot option to disable it. Is there any case where memory should be accounted but swap should not? That depends on what use case you are trying to provide. Let's say I needed backward compatibility with 2.6.25; then I would account for memory and leave out swap (even though we have the swap controller). O.K. I'll add the option. Thanks, Daisuke Nishimura.
[Devel] [PATCH -mm 0/5] swapcgroup (v3)
Hi. This is a new version of swapcgroup. Major changes from the previous version - Rebased on 2.6.26-rc5-mm3. The new -mm has been released, but these patches can be applied on 2.6.26-rc8-mm1 too with only some offset warnings. I tested these patches on 2.6.26-rc5-mm3 with some fixes about memory, and it seems to work fine. - (NEW) Implemented force_empty. Currently, it simply uncharges all the charges from the group. Patches - [1/5] add cgroup files - [2/5] add a member to swap_info_struct - [3/5] implement charge and uncharge - [4/5] modify vm_swap_full() - [5/5] implement force_empty ToDo (in my thought. Feel free to add some others here.) - need some documentation Add to memory.txt? or create a new documentation file? - add an option to disable only this feature I'm wondering if this option is needed. memcg already has a boot option to disable it. Is there any case where memory should be accounted but swap should not? - hierarchy support - move charges along with task Both of them need more discussion. Thanks, Daisuke Nishimura.
[Devel] [PATCH -mm 1/5] swapcgroup (v3): add cgroup files
Even if limiting memory usage by memory cgroup, swap space is shared, so resource isolation is not enough. If one group uses up most of the swap space, it can affect other groups anyway. The purpose of swapcgroup is limiting swap usage per group as memory cgroup limits the RSS memory usage. It's now implemented as a add-on to memory cgroup. This patch adds cgroup files(and a member to struct mem_cgroup) for swapcgroup. Files to be added by this patch are: - memory.swap_usage_in_bytes - memory.swap_max_usage_in_bytes - memory.swap_limit_in_bytes - memory.swap_failcnt The meaning of those files are the same as memory cgroup. Change log v2-v3 - Rebased on 2.6.26-rc5-mm3. v1-v2 - Rebased on 2.6.26-rc2-mm1. - Implemented as a add-on to memory cgroup. Signed-off-by: Daisuke Nishimura [EMAIL PROTECTED] --- init/Kconfig|7 + mm/memcontrol.c | 67 +++ 2 files changed, 74 insertions(+), 0 deletions(-) diff --git a/init/Kconfig b/init/Kconfig index 847931a..c604b6d 100644 --- a/init/Kconfig +++ b/init/Kconfig @@ -418,6 +418,13 @@ config CGROUP_MEMRLIMIT_CTLR memory RSS and Page Cache control. Virtual address space control is provided by this controller. +config CGROUP_SWAP_RES_CTLR + bool Swap Resource Controller for Control Groups + depends on CGROUP_MEM_RES_CTLR SWAP + help + Provides a swap resource controller that manages and limits swap usage. + Implemented as a add-on to Memory Resource Controller. + config SYSFS_DEPRECATED bool diff --git a/mm/memcontrol.c b/mm/memcontrol.c index b8fe33c..ddc842b 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -133,6 +133,12 @@ struct mem_cgroup { * statistics. 
*/ struct mem_cgroup_stat stat; +#ifdef CONFIG_CGROUP_SWAP_RES_CTLR + /* +* the counter to account for swap usage +*/ + struct res_counter swap_res; +#endif }; static struct mem_cgroup init_mem_cgroup; @@ -953,6 +959,39 @@ static int mem_control_stat_show(struct cgroup *cont, struct cftype *cft, return 0; } +#ifdef CONFIG_CGROUP_SWAP_RES_CTLR +static u64 swap_cgroup_read(struct cgroup *cont, struct cftype *cft) +{ + return res_counter_read_u64(mem_cgroup_from_cont(cont)-swap_res, + cft-private); +} + +static ssize_t swap_cgroup_write(struct cgroup *cont, struct cftype *cft, + struct file *file, const char __user *userbuf, + size_t nbytes, loff_t *ppos) +{ + return res_counter_write(mem_cgroup_from_cont(cont)-swap_res, + cft-private, userbuf, nbytes, ppos, + mem_cgroup_write_strategy); +} + +static int swap_cgroup_reset(struct cgroup *cont, unsigned int event) +{ + struct mem_cgroup *mem; + + mem = mem_cgroup_from_cont(cont); + switch (event) { + case RES_MAX_USAGE: + res_counter_reset_max(mem-swap_res); + break; + case RES_FAILCNT: + res_counter_reset_failcnt(mem-swap_res); + break; + } + return 0; +} +#endif + static struct cftype mem_cgroup_files[] = { { .name = usage_in_bytes, @@ -985,6 +1024,31 @@ static struct cftype mem_cgroup_files[] = { .name = stat, .read_map = mem_control_stat_show, }, +#ifdef CONFIG_CGROUP_SWAP_RES_CTLR + { + .name = swap_usage_in_bytes, + .private = RES_USAGE, + .read_u64 = swap_cgroup_read, + }, + { + .name = swap_max_usage_in_bytes, + .private = RES_MAX_USAGE, + .trigger = swap_cgroup_reset, + .read_u64 = swap_cgroup_read, + }, + { + .name = swap_limit_in_bytes, + .private = RES_LIMIT, + .write = swap_cgroup_write, + .read_u64 = swap_cgroup_read, + }, + { + .name = swap_failcnt, + .private = RES_FAILCNT, + .trigger = swap_cgroup_reset, + .read_u64 = swap_cgroup_read, + }, +#endif }; static int alloc_mem_cgroup_per_zone_info(struct mem_cgroup *mem, int node) @@ -1063,6 +1127,9 @@ mem_cgroup_create(struct cgroup_subsys *ss, struct 
cgroup *cont) } res_counter_init(&mem->res); +#ifdef CONFIG_CGROUP_SWAP_RES_CTLR + res_counter_init(&mem->swap_res); +#endif for_each_node_state(node, N_POSSIBLE) if (alloc_mem_cgroup_per_zone_info(mem, node))
[Devel] [PATCH -mm 2/5] swapcgroup (v3): add a member to swap_info_struct
This patch add a member to swap_info_struct for cgroup. This member, array of pointers to mem_cgroup, is used to remember to which cgroup each swap entries are charged. The memory for this array of pointers is allocated on swapon, and freed on swapoff. Change log v2-v3 - Rebased on 2.6.26-rc5-mm3 - add helper functions and removed #ifdef from sys_swapon()/sys_swapoff(). - add check on mem_cgroup_subsys.disabled v1-v2 - Rebased on 2.6.26-rc2-mm1 - Implemented as a add-on to memory cgroup. Signed-off-by: Daisuke Nishimura [EMAIL PROTECTED] --- include/linux/memcontrol.h | 20 +++- include/linux/swap.h |3 +++ mm/memcontrol.c| 36 mm/swapfile.c | 11 +++ 4 files changed, 69 insertions(+), 1 deletions(-) diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h index ee1b2fc..b6ff509 100644 --- a/include/linux/memcontrol.h +++ b/include/linux/memcontrol.h @@ -24,6 +24,7 @@ struct mem_cgroup; struct page_cgroup; struct page; struct mm_struct; +struct swap_info_struct; #ifdef CONFIG_CGROUP_MEM_RES_CTLR @@ -165,5 +166,22 @@ static inline long mem_cgroup_calc_reclaim(struct mem_cgroup *mem, } #endif /* CONFIG_CGROUP_MEM_CONT */ -#endif /* _LINUX_MEMCONTROL_H */ +#ifdef CONFIG_CGROUP_SWAP_RES_CTLR +extern struct mem_cgroup **swap_info_clear_memcg(struct swap_info_struct *p); +extern int swap_info_alloc_memcg(struct swap_info_struct *p, + unsigned long maxpages); +#else +static inline +struct mem_cgroup **swap_info_clear_memcg(struct swap_info_struct *p) +{ + return NULL; +} +static inline +int swap_info_alloc_memcg(struct swap_info_struct *p, unsigned long maxpages) +{ + return 0; +} +#endif + +#endif /* _LINUX_MEMCONTROL_H */ diff --git a/include/linux/swap.h b/include/linux/swap.h index a3af95b..6e1b03d 100644 --- a/include/linux/swap.h +++ b/include/linux/swap.h @@ -142,6 +142,9 @@ struct swap_info_struct { struct swap_extent *curr_swap_extent; unsigned old_block_size; unsigned short * swap_map; +#ifdef CONFIG_CGROUP_SWAP_RES_CTLR + struct mem_cgroup **memcg; 
+#endif unsigned int lowest_bit; unsigned int highest_bit; unsigned int cluster_next; diff --git a/mm/memcontrol.c b/mm/memcontrol.c index ddc842b..81bb7fa 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -1209,3 +1209,39 @@ struct cgroup_subsys mem_cgroup_subsys = { .attach = mem_cgroup_move_task, .early_init = 0, }; + +#ifdef CONFIG_CGROUP_SWAP_RES_CTLR +/* called with swap_lock held */ +struct mem_cgroup **swap_info_clear_memcg(struct swap_info_struct *p) +{ + struct mem_cgroup **mem; + + /* just clear p-memcg, without checking mem_cgroup_subsys.disabled */ + mem = p-memcg; + p-memcg = NULL; + + return mem; +} + +/* called without swap_lock held */ +int swap_info_alloc_memcg(struct swap_info_struct *p, unsigned long maxpages) +{ + int ret = 0; + + if (mem_cgroup_subsys.disabled) + goto out; + + p-memcg = vmalloc(maxpages * sizeof(struct mem_cgroup *)); + if (!p-memcg) { + /* make swapon fail */ + printk(KERN_ERR Unable to allocate memory for memcg\n); + ret = -ENOMEM; + goto out; + } + memset(p-memcg, 0, maxpages * sizeof(struct mem_cgroup *)); + +out: + return ret; +} +#endif + diff --git a/mm/swapfile.c b/mm/swapfile.c index bf7d13d..312c573 100644 --- a/mm/swapfile.c +++ b/mm/swapfile.c @@ -1228,6 +1228,7 @@ asmlinkage long sys_swapoff(const char __user * specialfile) unsigned short *swap_map; struct file *swap_file, *victim; struct address_space *mapping; + struct mem_cgroup **memcg = NULL; struct inode *inode; char * pathname; int i, type, prev; @@ -1328,10 +1329,12 @@ asmlinkage long sys_swapoff(const char __user * specialfile) p-max = 0; swap_map = p-swap_map; p-swap_map = NULL; + memcg = swap_info_clear_memcg(p); p-flags = 0; spin_unlock(swap_lock); mutex_unlock(swapon_mutex); vfree(swap_map); + vfree(memcg); inode = mapping-host; if (S_ISBLK(inode-i_mode)) { struct block_device *bdev = I_BDEV(inode); @@ -1475,6 +1478,7 @@ asmlinkage long sys_swapon(const char __user * specialfile, int swap_flags) unsigned long maxpages = 1; int swapfilesize; 
unsigned short *swap_map; + struct mem_cgroup **memcg = NULL; struct page *page = NULL; struct inode *inode = NULL; int did_down = 0; @@ -1498,6 +1502,7 @@ asmlinkage long sys_swapon(const char __user * specialfile, int swap_flags) p-swap_file = NULL; p-old_block_size = 0; p-swap_map = NULL; + swap_info_clear_memcg(p); p-lowest_bit = 0; p-highest_bit = 0; p
[Devel] [PATCH -mm 3/5] swapcgroup (v3): implement charge and uncharge
This patch implements charge and uncharge of swapcgroup. - what will be charged ? charge the number of swap entries in bytes. - when to charge/uncharge ? charge at get_swap_entry(), and uncharge at swap_entry_free(). - to what group charge the swap entry ? To determine to what mem_cgroup the swap entry should be charged, I changed the argument of get_swap_entry() from (void) to (struct page *). As a result, get_swap_entry() can determine to what mem_cgroup it should charge the swap entry by referring to page-page_cgroup-mem_cgroup. - from what group uncharge the swap entry ? I added in previous patch to swap_info_struct a member 'struct swap_cgroup **', array of pointer to which swap_cgroup the swap entry is charged. Change log v2-v3 - Rebased on 2.6.26-rc5-mm3 - make swap_cgroup_charge() fail when !pc. - add check on mem_cgroup_subsys.disabled v1-v2 - Rebased on 2.6.26-rc2-mm1 - Implemented as a add-on to memory cgroup. Signed-off-by: Daisuke Nishimura [EMAIL PROTECTED] --- include/linux/memcontrol.h | 17 ++ include/linux/swap.h |4 +- mm/memcontrol.c| 53 mm/shmem.c |2 +- mm/swap_state.c|2 +- mm/swapfile.c | 13 ++- 6 files changed, 86 insertions(+), 5 deletions(-) diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h index b6ff509..f3c84f7 100644 --- a/include/linux/memcontrol.h +++ b/include/linux/memcontrol.h @@ -170,6 +170,11 @@ static inline long mem_cgroup_calc_reclaim(struct mem_cgroup *mem, extern struct mem_cgroup **swap_info_clear_memcg(struct swap_info_struct *p); extern int swap_info_alloc_memcg(struct swap_info_struct *p, unsigned long maxpages); +extern int swap_cgroup_charge(struct page *page, + struct swap_info_struct *si, + unsigned long offset); +extern void swap_cgroup_uncharge(struct swap_info_struct *si, + unsigned long offset); #else static inline struct mem_cgroup **swap_info_clear_memcg(struct swap_info_struct *p) @@ -182,6 +187,18 @@ int swap_info_alloc_memcg(struct swap_info_struct *p, unsigned long maxpages) { return 0; } + 
+static inline int swap_cgroup_charge(struct page *page, + struct swap_info_struct *si, + unsigned long offset) +{ + return 0; +} + +static inline void swap_cgroup_uncharge(struct swap_info_struct *si, + unsigned long offset) +{ +} #endif #endif /* _LINUX_MEMCONTROL_H */ diff --git a/include/linux/swap.h b/include/linux/swap.h index 6e1b03d..f80255b 100644 --- a/include/linux/swap.h +++ b/include/linux/swap.h @@ -298,7 +298,7 @@ extern struct page *swapin_readahead(swp_entry_t, gfp_t, /* linux/mm/swapfile.c */ extern long total_swap_pages; extern void si_swapinfo(struct sysinfo *); -extern swp_entry_t get_swap_page(void); +extern swp_entry_t get_swap_page(struct page *); extern swp_entry_t get_swap_page_of_type(int); extern int swap_duplicate(swp_entry_t); extern int valid_swaphandles(swp_entry_t, unsigned long *); @@ -405,7 +405,7 @@ static inline int remove_exclusive_swap_page_ref(struct page *page) return 0; } -static inline swp_entry_t get_swap_page(void) +static inline swp_entry_t get_swap_page(struct page *page) { swp_entry_t entry; entry.val = 0; diff --git a/mm/memcontrol.c b/mm/memcontrol.c index 81bb7fa..d16d0a5 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -1243,5 +1243,58 @@ int swap_info_alloc_memcg(struct swap_info_struct *p, unsigned long maxpages) out: return ret; } + +int swap_cgroup_charge(struct page *page, + struct swap_info_struct *si, + unsigned long offset) +{ + int ret; + struct page_cgroup *pc; + struct mem_cgroup *mem; + + if (mem_cgroup_subsys.disabled) + return 0; + + lock_page_cgroup(page); + pc = page_get_page_cgroup(page); + if (unlikely(!pc)) { + /* make get_swap_page fail */ + unlock_page_cgroup(page); + return -ENOMEM; + } else { + mem = pc-mem_cgroup; + css_get(mem-css); + } + unlock_page_cgroup(page); + + ret = res_counter_charge(mem-swap_res, PAGE_SIZE); + if (!ret) + si-memcg[offset] = mem; + else + css_put(mem-css); + + return ret; +} + +void swap_cgroup_uncharge(struct swap_info_struct *si, + unsigned long offset) +{ 
+ struct mem_cgroup *mem = si-memcg[offset]; + + if (mem_cgroup_subsys.disabled) + return; + + /* mem would be NULL: +* 1. when get_swap_page() failed at charging swap_cgroup, +*and called swap_entry_free(). +* 2. when this swap entry had
[Devel] [PATCH -mm 4/5] swapcgroup (v3): modify vm_swap_full()
This patch modifies vm_swap_full() to calculate the rate of swap usage per cgroup. The purpose of this change is to free freeable swap caches (that is, swap entries) per cgroup, so that swap_cgroup_charge() fails less frequently. I tested whether this patch can reduce the swap usage or not, by running qsbench (8 x 40M) 10 times in mem:128M/swap:256M group and mem:256M/swap:128M group. And I confirmed that this patch can keep swap usage to (half of the limit usage the limit), whereas without this patch, swap usage tends to keep near to the limit. Change log v2-v3 - Rebased on 2.6.26-rc5-mm3 - take into account global swap usage. - change arg of vm_swap_full from page to memcg. - add check on mem_cgroup_subsys.disabled v1-v2 - new patch Signed-off-by: Daisuke Nishimura [EMAIL PROTECTED] --- include/linux/memcontrol.h | 12 include/linux/swap.h |4 +++- mm/memcontrol.c| 18 ++ mm/memory.c|4 +++- mm/swapfile.c | 38 +- mm/vmscan.c|6 +++--- 6 files changed, 76 insertions(+), 6 deletions(-) diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h index f3c84f7..1e6eb0c 100644 --- a/include/linux/memcontrol.h +++ b/include/linux/memcontrol.h @@ -175,6 +175,8 @@ extern int swap_cgroup_charge(struct page *page, unsigned long offset); extern void swap_cgroup_uncharge(struct swap_info_struct *si, unsigned long offset); +extern struct mem_cgroup *page_to_memcg(struct page *page); +extern int swap_cgroup_vm_swap_full(struct mem_cgroup *memcg); #else static inline struct mem_cgroup **swap_info_clear_memcg(struct swap_info_struct *p) @@ -199,6 +201,16 @@ static inline void swap_cgroup_uncharge(struct swap_info_struct *si, unsigned long offset) { } + +static inline struct mem_cgroup *page_to_memcg(struct page *page) +{ + return NULL; +} + +static inline int swap_cgroup_vm_swap_full(struct mem_cgroup *memcg) +{ + return 0; +} #endif #endif /* _LINUX_MEMCONTROL_H */ diff --git a/include/linux/swap.h b/include/linux/swap.h index f80255b..2b95df9 100644 --- 
a/include/linux/swap.h +++ b/include/linux/swap.h @@ -161,7 +161,9 @@ struct swap_list_t { }; /* Swap 50% full? Release swapcache more aggressively.. */ -#define vm_swap_full() (nr_swap_pages*2 total_swap_pages) +#define vm_swap_full(memcg) ((nr_swap_pages*2 total_swap_pages) \ + || swap_cgroup_vm_swap_full(memcg)) + /* linux/mm/page_alloc.c */ extern unsigned long totalram_pages; diff --git a/mm/memcontrol.c b/mm/memcontrol.c index d16d0a5..160fca1 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -1296,5 +1296,23 @@ void swap_cgroup_uncharge(struct swap_info_struct *si, css_put(mem-css); } } + +int swap_cgroup_vm_swap_full(struct mem_cgroup *memcg) +{ + u64 usage; + u64 limit; + + if (memcg) + css_get(memcg-css); + else + return 0; + + usage = res_counter_read_u64(memcg-swap_res, RES_USAGE); + limit = res_counter_read_u64(memcg-swap_res, RES_LIMIT); + + css_put(memcg-css); + + return usage * 2 limit; +} #endif diff --git a/mm/memory.c b/mm/memory.c index 536d748..afcf737 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -2230,7 +2230,9 @@ static int do_swap_page(struct mm_struct *mm, struct vm_area_struct *vma, page_add_anon_rmap(page, vma, address); swap_free(entry); - if (vm_swap_full() || (vma-vm_flags VM_LOCKED) || PageMlocked(page)) + if (vm_swap_full(page_to_memcg(page)) + || (vma-vm_flags VM_LOCKED) + || PageMlocked(page)) remove_exclusive_swap_page(page); unlock_page(page); diff --git a/mm/swapfile.c b/mm/swapfile.c index 57798c5..6a863bc 100644 --- a/mm/swapfile.c +++ b/mm/swapfile.c @@ -28,6 +28,7 @@ #include linux/capability.h #include linux/syscalls.h #include linux/memcontrol.h +#include linux/cgroup.h #include asm/pgtable.h #include asm/tlbflush.h @@ -447,7 +448,8 @@ void free_swap_and_cache(swp_entry_t entry) /* Only cache user (+us), or swap space full? Free it! 
*/ /* Also recheck PageSwapCache after page is locked (above) */ if (PageSwapCache(page) !PageWriteback(page) - (one_user || vm_swap_full())) { + (one_user + || vm_swap_full(page_to_memcg(page { delete_from_swap_cache(page); SetPageDirty(page); } @@ -1889,3 +1891,37 @@ int valid_swaphandles(swp_entry_t entry, unsigned long *offset) *offset = ++toff; return nr_pages? ++nr_pages: 0; } + +#ifdef CONFIG_CGROUP_SWAP_RES_CTLR +/* + * returns
[Devel] [PATCH -mm 5/5] swapcgroup (v3): implement force_empty
This patch implements force_empty of swapcgroup. Currently, it simply uncharges all the charges from the group. I think there can be other implementations. What I thought are: - move all the charges to its parent. - unuse(swap in) all the swap charged to the group. But in any case, I think before implementing this way, hierarchy and move charges support are needed. So I think this is enough for the first step. Change log v2-v3 - new patch Signed-off-by: Daisuke Nishimura [EMAIL PROTECTED] --- include/linux/memcontrol.h |5 mm/memcontrol.c| 47 +++ mm/swapfile.c | 52 +-- 3 files changed, 91 insertions(+), 13 deletions(-) diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h index 1e6eb0c..8c4bad0 100644 --- a/include/linux/memcontrol.h +++ b/include/linux/memcontrol.h @@ -177,6 +177,7 @@ extern void swap_cgroup_uncharge(struct swap_info_struct *si, unsigned long offset); extern struct mem_cgroup *page_to_memcg(struct page *page); extern int swap_cgroup_vm_swap_full(struct mem_cgroup *memcg); +extern void __swap_cgroup_force_empty(struct mem_cgroup *mem); #else static inline struct mem_cgroup **swap_info_clear_memcg(struct swap_info_struct *p) @@ -211,6 +212,10 @@ static inline int swap_cgroup_vm_swap_full(struct mem_cgroup *memcg) { return 0; } + +static inline void __swap_cgroup_force_empty(struct mem_cgroup *mem) +{ +} #endif #endif /* _LINUX_MEMCONTROL_H */ diff --git a/mm/memcontrol.c b/mm/memcontrol.c index 160fca1..f93907c 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -826,6 +826,34 @@ static void mem_cgroup_force_empty_list(struct mem_cgroup *mem, spin_unlock_irqrestore(mz-lru_lock, flags); } +static inline int mem_cgroup_in_use(struct mem_cgroup *mem) +{ + return mem-res.usage 0; +} + +static inline int swap_cgroup_in_use(struct mem_cgroup *mem) +{ +#ifdef CONFIG_CGROUP_SWAP_RES_CTLR + return mem-swap_res.usage 0; +#else + return 0; +#endif +} + +static void __mem_cgroup_force_empty(struct mem_cgroup *mem) +{ + int node, zid; + + 
for_each_node_state(node, N_POSSIBLE) + for (zid = 0; zid MAX_NR_ZONES; zid++) { + struct mem_cgroup_per_zone *mz; + enum lru_list l; + mz = mem_cgroup_zoneinfo(mem, node, zid); + for_each_lru(l) + mem_cgroup_force_empty_list(mem, mz, l); + } +} + /* * make mem_cgroup's charge to be 0 if there is no task. * This enables deleting this mem_cgroup. @@ -833,7 +861,6 @@ static void mem_cgroup_force_empty_list(struct mem_cgroup *mem, static int mem_cgroup_force_empty(struct mem_cgroup *mem) { int ret = -EBUSY; - int node, zid; css_get(mem-css); /* @@ -841,18 +868,17 @@ static int mem_cgroup_force_empty(struct mem_cgroup *mem) * active_list - inactive_list while we don't take a lock. * So, we have to do loop here until all lists are empty. */ - while (mem-res.usage 0) { + while (mem_cgroup_in_use(mem) || swap_cgroup_in_use(mem)) { if (atomic_read(mem-css.cgroup-count) 0) goto out; - for_each_node_state(node, N_POSSIBLE) - for (zid = 0; zid MAX_NR_ZONES; zid++) { - struct mem_cgroup_per_zone *mz; - enum lru_list l; - mz = mem_cgroup_zoneinfo(mem, node, zid); - for_each_lru(l) - mem_cgroup_force_empty_list(mem, mz, l); - } + + if (mem_cgroup_in_use(mem)) + __mem_cgroup_force_empty(mem); + + if (swap_cgroup_in_use(mem)) + __swap_cgroup_force_empty(mem); } + ret = 0; out: css_put(mem-css); @@ -1289,6 +1315,7 @@ void swap_cgroup_uncharge(struct swap_info_struct *si, *and called swap_entry_free(). * 2. when this swap entry had been assigned by *get_swap_page_of_type() (via SWSUSP?). +* 3. force empty can make such entries. 
*/ if (mem) { res_counter_uncharge(mem-swap_res, PAGE_SIZE); diff --git a/mm/swapfile.c b/mm/swapfile.c index 6a863bc..2263ae8 100644 --- a/mm/swapfile.c +++ b/mm/swapfile.c @@ -704,15 +704,30 @@ static int unuse_mm(struct mm_struct *mm, } /* + * return the mem_cgroup to which a entry is charged + */ +static inline struct mem_cgroup *entry_to_memcg(swp_entry_t entry) +{ +#ifdef CONFIG_CGROUP_SWAP_RES_CTLR + return swap_info[swp_type(entry)].memcg[swp_offset(entry)]; +#else + return NULL; +#endif +} + +/* * Scan swap_map from current position to next entry still
[Devel] Re: [PATCH -mm 5/5] swapcgroup (v3): implement force_empty
Hi, Yamamoto-san. Thank you for your comment. On Fri, 4 Jul 2008 15:54:31 +0900 (JST), [EMAIL PROTECTED] (YAMAMOTO Takashi) wrote: hi, +/* + * uncharge all the entries that are charged to the group. + */ +void __swap_cgroup_force_empty(struct mem_cgroup *mem) +{ + struct swap_info_struct *p; + int type; + + spin_lock(&swap_lock); + for (type = swap_list.head; type >= 0; type = swap_info[type].next) { + p = swap_info + type; + + if ((p->flags & SWP_ACTIVE) == SWP_ACTIVE) { + unsigned int i = 0; + + spin_unlock(&swap_lock); what prevents the device from being swapoff'ed while you drop swap_lock? Nothing. After searching for the entry to be uncharged (find_next_to_unuse below), I recheck under swap_lock whether the entry is charged to the group. Even if the device is swapoff'ed, swapoff must have uncharged the entry, so I don't think it's needed anyway. YAMAMOTO Takashi + while ((i = find_next_to_unuse(p, i, mem)) != 0) { + spin_lock(&swap_lock); + if (p->swap_map[i] && p->memcg[i] == mem) Ah, I think a check of !p->swap_map should be added to verify the device has not been swapoff'ed. Thanks, Daisuke Nishimura. + swap_cgroup_uncharge(p, i); + spin_unlock(&swap_lock); + } + spin_lock(&swap_lock); + } + } + spin_unlock(&swap_lock); + + return; +} #endif
[Devel] Re: [PATCH -mm 5/5] swapcgroup (v3): implement force_empty
On Fri, 4 Jul 2008 16:48:28 +0900 (JST), [EMAIL PROTECTED] (YAMAMOTO Takashi) wrote: Hi, Yamamoto-san. Thank you for your comment. On Fri, 4 Jul 2008 15:54:31 +0900 (JST), [EMAIL PROTECTED] (YAMAMOTO Takashi) wrote: hi, +/* + * uncharge all the entries that are charged to the group. + */ +void __swap_cgroup_force_empty(struct mem_cgroup *mem) +{ + struct swap_info_struct *p; + int type; + + spin_lock(&swap_lock); + for (type = swap_list.head; type >= 0; type = swap_info[type].next) { + p = swap_info + type; + + if ((p->flags & SWP_ACTIVE) == SWP_ACTIVE) { + unsigned int i = 0; + + spin_unlock(&swap_lock); what prevents the device from being swapoff'ed while you drop swap_lock? Nothing. After searching for the entry to be uncharged (find_next_to_unuse below), I recheck under swap_lock whether the entry is charged to the group. Even if the device is swapoff'ed, swapoff must have uncharged the entry, so I don't think it's needed anyway. YAMAMOTO Takashi + while ((i = find_next_to_unuse(p, i, mem)) != 0) { + spin_lock(&swap_lock); + if (p->swap_map[i] && p->memcg[i] == mem) Ah, I think a check of !p->swap_map should be added to verify the device has not been swapoff'ed. find_next_to_unuse seems to have fragile assumptions and can dereference p->swap_map as well. You're right. Thank you for pointing it out! I'll consider more. Thanks, Daisuke Nishimura.
[Devel] Re: [PATCH -mm 4/5] swapcgroup (v3): modify vm_swap_full()
Hi, Kamezawa-san.

On Fri, 4 Jul 2008 18:58:45 +0900, KAMEZAWA Hiroyuki [EMAIL PROTECTED] wrote:

On Fri, 4 Jul 2008 15:22:44 +0900 Daisuke Nishimura [EMAIL PROTECTED] wrote:

 /* Swap 50% full? Release swapcache more aggressively.. */
-#define vm_swap_full() (nr_swap_pages*2 < total_swap_pages)
+#define vm_swap_full(memcg) ((nr_swap_pages*2 < total_swap_pages) \
+				|| swap_cgroup_vm_swap_full(memcg))

Maybe a nitpick, but I like

==
vm_swap_full(page) ((nr_swap_pages*2 < total_swap_pages) || swap_cgroup_vm_swap_full_page(page))
==

rather than vm_swap_full(memcg).

Well, I used page in v2, but Kosaki-san said vm_swap_full() is not a page-granularity operation, so it should be changed. And more:

@@ -1317,7 +1317,7 @@ static void shrink_active_list(unsigned long nr_pages, struct zone *zone,
 		__mod_zone_page_state(zone, NR_LRU_BASE + lru, pgmoved);
 		pgmoved = 0;
 		spin_unlock_irq(&zone->lru_lock);
-		if (vm_swap_full())
+		if (vm_swap_full(sc->mem_cgroup))
 			pagevec_swap_free(&pvec);
 		__pagevec_release(&pvec);
 		spin_lock_irq(&zone->lru_lock);
@@ -1328,7 +1328,7 @@ static void shrink_active_list(unsigned long nr_pages, struct zone *zone,
 	__count_zone_vm_events(PGREFILL, zone, pgscanned);
 	__count_vm_events(PGDEACTIVATE, pgdeactivate);
 	spin_unlock_irq(&zone->lru_lock);
-	if (vm_swap_full())
+	if (vm_swap_full(sc->mem_cgroup))
 		pagevec_swap_free(&pvec);
 	pagevec_release(&pvec);

page cannot be determined in those places. I don't want to change pagevec_swap_free(), so I changed the argument of vm_swap_full().

And could you change this to an inline function?

Of course. I think it would be better.

Thanks, Daisuke Nishimura.
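The change under discussion — turning the vm_swap_full() macro into a function that also consults per-cgroup swap pressure — can be sketched as follows. This is a simplified userland model under stated assumptions: the real kernel symbols nr_swap_pages and total_swap_pages are modeled as plain globals, and the per-cgroup helper's "more than half of my swap limit used" rule is this sketch's guess at the intent, not the posted patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Modeled globals standing in for the kernel's swap counters. */
static long nr_swap_pages;      /* free swap pages, globally */
static long total_swap_pages;

struct mem_cgroup { long swap_usage, swap_limit; };

/* Hypothetical per-cgroup check: is this group's swap more than
 * half full against its own limit?  (0 limit == unlimited here.) */
static inline bool swap_cgroup_vm_swap_full(struct mem_cgroup *memcg)
{
    if (!memcg || !memcg->swap_limit)
        return false;
    return memcg->swap_usage * 2 > memcg->swap_limit;
}

/* The inline-function form of the macro: global pressure OR
 * per-cgroup pressure triggers aggressive swapcache release. */
static inline bool vm_swap_full(struct mem_cgroup *memcg)
{
    return nr_swap_pages * 2 < total_swap_pages ||
           swap_cgroup_vm_swap_full(memcg);
}
```

An inline function buys type checking of the memcg argument and evaluates it exactly once, which the macro form would not guarantee.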
[Devel] Re: [PATCH -mm 0/5] swapcgroup (v3)
On Fri, 4 Jul 2008 18:40:33 +0900, KAMEZAWA Hiroyuki [EMAIL PROTECTED] wrote: On Fri, 4 Jul 2008 15:15:36 +0900 Daisuke Nishimura [EMAIL PROTECTED] wrote: Hi. This is new version of swapcgroup. Major changes from previous version - Rebased on 2.6.26-rc5-mm3. The new -mm has been released, but these patches can be applied on 2.6.26-rc8-mm1 too with only some offset warnings. I tested these patches on 2.6.26-rc5-mm3 with some fixes about memory, and it seems to work fine. - (NEW) Implemented force_empty. Currently, it simply uncharges all the charges from the group. Patches - [1/5] add cgroup files - [2/5] add a member to swap_info_struct - [3/5] implement charge and uncharge - [4/5] modify vm_swap_full() - [5/5] implement force_empty ToDo(in my thought. Feel free to add some others here.) - need some documentation Add to memory.txt? or create a new documentation file? Maybe new documentation file is better. O.K. - add option to disable only this feature I'm wondering if this option is needed. memcg has already the boot option to disable it. Is there any case where memory should be accounted but swap should not? On x86-32, area for vmalloc() is very small and array of swap_info_struct[] will use much amount of it. If vmalloc() area is too small, the kernel cannot load modules. It would be critical. I'll add an option. Thanks, Daisuke Nishimura. ___ Containers mailing list [EMAIL PROTECTED] https://lists.linux-foundation.org/mailman/listinfo/containers ___ Devel mailing list Devel@openvz.org https://openvz.org/mailman/listinfo/devel
[Devel] Re: [PATCH -mm 5/5] swapcgroup (v3): implement force_empty
On Fri, 4 Jul 2008 19:16:38 +0900, KAMEZAWA Hiroyuki [EMAIL PROTECTED] wrote: On Fri, 4 Jul 2008 15:24:23 +0900 Daisuke Nishimura [EMAIL PROTECTED] wrote: This patch implements force_empty of swapcgroup. Currently, it simply uncharges all the charges from the group. I think there can be other implementations. What I thought are: - move all the charges to its parent. - unuse(swap in) all the swap charged to the group. 3. move all swap back to memory (see swapoff.) Do you mean swapping in all the swap including used by other groups? It would be one choice anyway. But in any case, I think before implementing this way, hierarchy and move charges support are needed. So I think this is enough for the first step. I don't think hierarchy/automatic-load-balancer for swap cg is necessary. It's the problem of how the hierarchy would be, I think. I'm saying hierarchy here just to mean some kind of feature where a parent includes their children. I think hierarchy is needed if we implement the choice 1 above, and I personally think it would be the best choice. Hmm...but handling limit_change (at least, returns -EBUSY) will be necessary. I think so too. But I'm not sure now it's good or bad to support shrinking at limit_change about swap. Shrinking swap means increasing the memory usage and that may cause another swapout. Do you consider a some magical way to move pages in swap back to memory ? In this patch, I modified the find_next_to_unuse() to find the entry charged to a specific group. It might be possible to modify try_to_unuse()(or define another function based on try_to_unuse()) to reduce swap usage of a specified group down to some threashold. But, I think, one problem here is from which device the swaps should be back to memory, or usage balance between swap devices. In general, I like this set but we can't change the limit on demand. (maybe) (just putting it to TO-DO-List is okay to me.) I'm sorry but what do you mean by change the limit on demand? 
Could you explain more? Thanks, Daisuke Nishimura.
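The idea floated above — extend find_next_to_unuse()/try_to_unuse() so a specified group's swap usage can be driven down to some threshold, e.g. on limit_change — can be modeled in miniature. Everything below is a hypothetical sketch: unuse_entry() stands in for the real swap-in path, and the single-device loop ignores the cross-device balancing problem the mail points out.

```c
#include <assert.h>
#include <stddef.h>

struct memcg { long swap_usage; };

/* One modeled swap device: owner[i] is the group charged for slot i. */
struct swapdev {
    struct memcg *owner[8];
    unsigned int  max;
};

/* Stand-in for swapping slot i back into memory and freeing it. */
static void unuse_entry(struct swapdev *p, unsigned int i)
{
    p->owner[i]->swap_usage--;
    p->owner[i] = NULL;
}

/* Swap mem's entries back in until its usage is <= threshold.
 * Returns the number of entries brought back to memory. */
static int shrink_swap_to(struct swapdev *p, struct memcg *mem, long threshold)
{
    int done = 0;
    for (unsigned int i = 0; i < p->max && mem->swap_usage > threshold; i++) {
        if (p->owner[i] == mem) {
            unuse_entry(p, i);
            done++;
        }
    }
    return done;
}
```

Note the trade-off raised in the thread: every unuse_entry() here increases memory usage, so in the real kernel this shrink could itself trigger reclaim and new swapout.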
[Devel] Re: [RFD][PATCH] memcg: Move Usage at Task Move
On Wed, 11 Jun 2008 13:57:34 +0530 Balbir Singh [EMAIL PROTECTED] wrote:

(snip)

2. Don't move any usage at task move. (current implementation.)
Pros.
- no complication in the code.
Cons.
- A task's usage is charged to the wrong cgroup.
- Not sure, but I believe the users don't want this.

I'd say stick with this unless there are strong arguments in favour of changing, based on concrete needs.

One reason is that I think a typical usage of the memory controller is fork()->move->exec(). (by libcg?) and exec() will flush all the usage.

Exactly - this is a good reason *not* to implement move - because then you drag all the usage of the middleware daemon into the new cgroup.

Yes.

The other thing is that charges will eventually fade away. Please see the cgroup implementation of page_referenced() and mark_page_accessed(). The original group on memory pressure will drop pages that were left behind by a task that migrates. The new group will pick it up if referenced.

Hum.. So, it seems that some kind of Lazy Mode (#3 of Kamezawa-san's) has been implemented already. But one of the reasons that I think usage should be moved is to make the usage as accurate as possible, that is, the size of memory used by processes in the group at the moment. I agree that statistics is not the purpose of memcg (and swap), but, IMHO, it's a useful feature of memcg. Administrators can know how busy or idle each group is by it.

Thanks, Daisuke Nishimura.
[Devel] Re: [PATCH] cgroup: support checking of subsystem dependencies
On Thu, 19 Jun 2008 08:40:24 +0800, Li Zefan [EMAIL PROTECTED] wrote: KAMEZAWA Hiroyuki wrote: On Wed, 18 Jun 2008 16:01:49 +0800 Li Zefan [EMAIL PROTECTED] wrote: This allows one subsystem to require that it only be mounted when some other subsystems are also present in the proposed hierarchy. For example if subsystem foo depends on bar, the following will fail: # mount -t cgroup -ofoo xxx /dev/cgroup You should mount with both subsystems: # mount -t cgroup -ofoo,bar xxx /dev/cgroup I'm just curious. May I ask Is there such cgroup subsystem now ? Currently no, but people who are developing or wants to develop new subsystems may find this useful, for example Daisuke Nishimura-san's swap controller. My current version is implemented as addon to memory controller, so it can be used by just mounting memory controller :) But anyway, this feature might be used in future by some subsystems. Thanks, Daisuke Nishimura. ___ Containers mailing list [EMAIL PROTECTED] https://lists.linux-foundation.org/mailman/listinfo/containers ___ Devel mailing list Devel@openvz.org https://openvz.org/mailman/listinfo/devel
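The dependency rule described above — subsystem foo depends on bar, so `mount -t cgroup -ofoo` must fail while `-ofoo,bar` succeeds — amounts to a bitmask check over the proposed hierarchy. A minimal hypothetical model (bit numbers, struct layout, and the helper name are all inventions of this sketch, not the posted patch):

```c
#include <assert.h>
#include <stdbool.h>

/* Modeled subsystem bits for a proposed hierarchy. */
enum { SUBSYS_FOO = 1 << 0, SUBSYS_BAR = 1 << 1, SUBSYS_MEM = 1 << 2 };

struct subsys {
    unsigned long bit;      /* this subsystem's bit */
    unsigned long depends;  /* bits of subsystems it requires */
};

/* Refuse the mount unless, for every subsystem included in the
 * proposed hierarchy, all of its required subsystems are included too. */
static bool can_mount(unsigned long proposed, const struct subsys *ss, int n)
{
    for (int i = 0; i < n; i++) {
        if (!(proposed & ss[i].bit))
            continue;               /* not part of this hierarchy */
        if ((proposed & ss[i].depends) != ss[i].depends)
            return false;           /* a required subsystem is missing */
    }
    return true;
}
```

A swap controller built as an add-on to the memory controller, as in this thread, sidesteps the need for such a check, since mounting memory implicitly brings swap along.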
[Devel] Re: [RFD][PATCH] memcg: Move Usage at Task Move
Hi, Kamezawa-san. Sorry for the late reply.

On Fri, 6 Jun 2008 10:52:35 +0900, KAMEZAWA Hiroyuki [EMAIL PROTECTED] wrote:

Move Usage at Task Move (just experimental, for discussion). I tested this but don't think it's bug-free.

In the current memcg, when a task moves to a new cg, the usage remains in the old cg. This is considered to be not good.

I agree.

This is a trial to move usage from the old cg to the new cg at task move. Finally, you'll see the problems we have to handle are failure and rollback. This one's basic algorithm is:
0. can_attach() is called.
1. count movable pages by scanning the page table. isolate all pages from LRU.
2. try to create enough room in the new memory cgroup
3. start moving page accounting
4. put back pages to LRU.
5. can_attach() for other cgroups are called.

You isolate pages and move charges of them by can_attach(), but it means that pages that are allocated between page isolation and moving tsk->cgroups remain charged to the old group, right? I think it would be better, if possible, to move charges by attach() as cpuset migrates pages by cpuset_attach(). But one of the problems of it is that attach() does not return any value, so there is no way to notify failure...

A case study.
group_A - limit=1G, task_X's usage=800M.
group_B - limit=1G, usage=500M.
For moving task_X from group_A to group_B:
- group_B should be reclaimed or have enough room.
While moving task_X from group_A to group_B:
- group_B's memory usage can be changed
- group_A's memory usage can be changed

We account the resource based on pages. Then, we can't move all resource usage at once. If group_B has no more room when we've moved 700M of task_X to group_B, we have to move 700M of task_X back to group_A. So I implemented roll-back. But another process may use up group_A's available resource at that point. For avoiding that, preserving 800M in group_B before moving task_X means that task_X can occupy 1600M of resource while moving. (So I don't do that in this patch.) This patch uses Best-Effort rollback.
Failure in rollback is ignored and the usage is just leaked.

If rollback is implemented in the kernel, I think it must not fail, to prevent leaks of usage. How about using charge_force for rollback? Or, instead of implementing rollback in the kernel, how about making the user (or middleware?) re-echo the pid to roll back on failure?

Roll-back can happen when
(a) in phase 3. cannot move a page to the new cgroup because of the limit.
(b) in phase 5. another cgroup subsys returns an error in can_attach().

Isn't rollback needed on failure between can_attach and attach (e.g. failure in find_css_set, ...)?

+int mem_cgroup_recharge_task(struct mem_cgroup *newcg,
+				struct task_struct *task)
+{
(snip)
+	/* create enough room before move */
+	necessary = info.count * PAGE_SIZE;
+
+	do {
+		spin_lock(&newcg->res.lock);
+		if (newcg->res.limit > necessary)
+			rc = -ENOMEM;

I think it should be (newcg->res.limit < necessary).

Thanks, Daisuke Nishimura.
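The rollback problem being debated — moving a task means many page-sized charges, and a mid-way failure must be unwound or usage leaks — can be made concrete with a small model. This is a hypothetical sketch: res_counter is reduced to a usage/limit pair, and move_charges() is this sketch's name, not a kernel function.

```c
#include <assert.h>
#include <stdbool.h>

/* Minimal model of a res_counter: charge fails past the limit. */
struct res_counter { long usage, limit; };

static bool charge(struct res_counter *c, long val)
{
    if (c->usage + val > c->limit)
        return false;
    c->usage += val;
    return true;
}

static void uncharge(struct res_counter *c, long val)
{
    c->usage -= val;
}

/* Try to move `pages` page charges from `from` to `to`; on failure,
 * roll back what was already moved.  The rollback direction
 * (re-charging `from`) can itself fail in a real system if others
 * consumed the freed room meanwhile -- exactly the leak the thread
 * worries about, and why charge_force was suggested. */
static bool move_charges(struct res_counter *from, struct res_counter *to,
                         int pages)
{
    int moved;
    for (moved = 0; moved < pages; moved++) {
        if (!charge(to, 1))
            break;
        uncharge(from, 1);
    }
    if (moved == pages)
        return true;
    /* best-effort rollback: assumes room still exists in `from` */
    while (moved-- > 0) {
        uncharge(to, 1);
        charge(from, 1);
    }
    return false;
}
```

On failure nothing has net-moved, which is the invariant an in-kernel rollback would have to guarantee unconditionally.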
[Devel] Re: [RFD][PATCH] memcg: Move Usage at Task Move
On Tue, 10 Jun 2008 17:26:37 +0900, KAMEZAWA Hiroyuki [EMAIL PROTECTED] wrote: This is a trial to move usage from old cg to new cg at task move. Finally, you'll see the problems we have to handle are failure and rollback. This one's Basic algorithm is 0. can_attach() is called. 1. count movable pages by scanning page table. isolate all pages from LRU. 2. try to create enough room in new memory cgroup 3. start moving page accouing 4. putback pages to LRU. 5. can_attach() for other cgroups are called. You isolate pages and move charges of them by can_attach(), but it means that pages that are allocated between page isolation and moving tsk-cgroups remains charged to old group, right? yes. I think it would be better if possible to move charges by attach() as cpuset migrates pages by cpuset_attach(). But one of the problem of it is that attch() does not return any value, so there is no way to notify failure... yes, here again. it makes roll-back more difficult. I think so too. That's why I said one of the problem. A case study. group_A - limit=1G, task_X's usage= 800M. group_B - limit=1G, usage=500M. For moving task_X from group_A to group_B. - group_B should be reclaimed or have enough room. While moving task_X from group_A to group_B. - group_B's memory usage can be changed - group_A's memory usage can be changed We accounts the resouce based on pages. Then, we can't move all resource usage at once. If group_B has no more room when we've moved 700M of task_X to group_B, we have to move 700M of task_X back to group_A. So I implemented roll-back. But other process may use up group_A's available resource at that point. For avoiding that, preserve 800M in group_B before moving task_X means that task_X can occupy 1600M of resource at moving. (So I don't do in this patch.) This patch uses Best-Effort rollback. Failure in rollback is ignored and the usage is just leaked. If implement rollback in kernel, I think it must not fail to prevent leak of usage. 
How about using charge_force for rollback?

means allowing to exceed limit ?

Yes. I agree that exceeding the limit is not good, but I just feel that it's better than leaking usage. Of course, I think usage should be decreased later by some method.

Or, instead of implementing rollback in the kernel, how about making the user (or middleware?) re-echo the pid to roll back on failure?

If the user does things well, the system working in a better way is O.K. If the user doesn't, the system working in a broken way is very bad.

Hum... I think users must know what they are doing. They must know that moving a process to another group that doesn't have enough room for it may fail in a half-done state, if that is the behavior of the kernel. And they should handle the error by themselves, IMHO.

Thanks, Daisuke Nishimura.
[Devel] Re: [RFD][PATCH] memcg: Move Usage at Task Move
On Wed, 11 Jun 2008 13:14:37 +0900, KAMEZAWA Hiroyuki [EMAIL PROTECTED] wrote:

On Wed, 11 Jun 2008 12:44:46 +0900 (JST) [EMAIL PROTECTED] (YAMAMOTO Takashi) wrote:

I'm now considering the following logic. How do you think?
Assume: move TASK from group:CURR to group:DEST.
==
move_task(TASK, CURR, DEST)
	if (DEST's limit is unlimited)
		moving TASK
		return success.
	usage = check_usage_of_task(TASK).
	/* try to reserve enough room in destination */
	if (try_to_reserve_enough_room(DEST, usage)) {
		move TASK to DEST and move pages AMAP.
		/* usage_of_task(TASK) can be changed while we do this.
		   Then, we move AMAP. */
		return success;
	}
	return failure.
==

AMAP means that you might leave some random charges in CURR?

yes. but we can reduce bad cases in some way:
- reserve more than necessary. or
- read_lock mm->mmap_sem while move.

I prefer the latter. Though it's expensive, I think moving a task would not happen so often.

Thanks, Daisuke Nishimura.
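Kamezawa-san's reserve-then-move pseudocode above can be rendered as a tiny C model. Everything here is an assumption of the sketch — LIMIT_UNLIMITED, the group struct, and the fact that the whole usage moves atomically (the real "AMAP" behavior would move only what it can, which is the residual-charge problem discussed):

```c
#include <assert.h>
#include <stdbool.h>

/* Sentinel for "no limit" in this model. */
#define LIMIT_UNLIMITED (-1L)

struct group { long usage, limit; };

/* Reserve the task's whole usage in the destination up front; an
 * unlimited destination always has room.  In the kernel, reclaim
 * would be tried here before giving up. */
static bool try_to_reserve_enough_room(struct group *dest, long usage)
{
    if (dest->limit == LIMIT_UNLIMITED)
        return true;
    return dest->usage + usage <= dest->limit;
}

static bool move_task(long task_usage, struct group *curr, struct group *dest)
{
    if (!try_to_reserve_enough_room(dest, task_usage))
        return false;           /* refuse the migration up front */
    curr->usage -= task_usage;  /* move pages "as much as possible" */
    dest->usage += task_usage;
    return true;
}
```

With the case-study numbers (group_A holds task_X's 800M, group_B already uses 500M of a 1G limit), the reservation check correctly refuses the move, while an unlimited destination accepts it.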
[Devel] Re: [PATCH 0/4] swapcgroup(v2)
On 2008/05/27 16:42 +0900, Balbir Singh wrote: YAMAMOTO Takashi wrote: hi, Thanks for looking into this. Yamamoto-San is also looking into a swap controller. Is there a consensus on the approach? Not yet, but I think we should have some consensus each other before going further. Thanks, Daisuke Nishimura. while nishimura-san's one still seems to have a lot of todo, it seems good enough as a start point to me. so i'd like to withdraw mine. nishimura-san, is it ok for you? Of cource. I'll work hard to make it better. I would suggest that me merge the good parts from both into the swap controller. Having said that I'll let the two of you decide on what the good aspects of both are. I cannot see any immediate overlap, but there might be some w.r.t. infrastructure used. Well, you mean you'll make another patch based on yamamoto-san's and mine? Basically, I think it's difficult to merge because we charge different objects. Thanks, Daisuke Nishimura. ___ Containers mailing list [EMAIL PROTECTED] https://lists.linux-foundation.org/mailman/listinfo/containers ___ Devel mailing list Devel@openvz.org https://openvz.org/mailman/listinfo/devel
[Devel] Re: [PATCH 0/4] swapcgroup(v2)
On 2008/05/27 22:18 +0900, Balbir Singh wrote: Daisuke Nishimura wrote: On 2008/05/27 16:42 +0900, Balbir Singh wrote: YAMAMOTO Takashi wrote: hi, Thanks for looking into this. Yamamoto-San is also looking into a swap controller. Is there a consensus on the approach? Not yet, but I think we should have some consensus each other before going further. Thanks, Daisuke Nishimura. while nishimura-san's one still seems to have a lot of todo, it seems good enough as a start point to me. so i'd like to withdraw mine. nishimura-san, is it ok for you? Of cource. I'll work hard to make it better. I would suggest that me merge the good parts from both into the swap controller. Having said that I'll let the two of you decide on what the good aspects of both are. I cannot see any immediate overlap, but there might be some w.r.t. infrastructure used. Well, you mean you'll make another patch based on yamamoto-san's and mine? Basically, I think it's difficult to merge because we charge different objects. Yes, I know - that's why I said infrastructure, to see the grouping and common data structure aspects. I'll try and review the code. OK. I'll wait for your patch. Thanks, Daisuke Nishimura. ___ Containers mailing list [EMAIL PROTECTED] https://lists.linux-foundation.org/mailman/listinfo/containers ___ Devel mailing list Devel@openvz.org https://openvz.org/mailman/listinfo/devel
[Devel] Re: [PATCH 0/4] swapcgroup(v2)
On 2008/05/27 22:46 +0900, Balbir Singh wrote: Daisuke Nishimura wrote: On 2008/05/27 22:18 +0900, Balbir Singh wrote: Daisuke Nishimura wrote: On 2008/05/27 16:42 +0900, Balbir Singh wrote: YAMAMOTO Takashi wrote: hi, Thanks for looking into this. Yamamoto-San is also looking into a swap controller. Is there a consensus on the approach? Not yet, but I think we should have some consensus each other before going further. Thanks, Daisuke Nishimura. while nishimura-san's one still seems to have a lot of todo, it seems good enough as a start point to me. so i'd like to withdraw mine. nishimura-san, is it ok for you? Of cource. I'll work hard to make it better. I would suggest that me merge the good parts from both into the swap controller. Having said that I'll let the two of you decide on what the good aspects of both are. I cannot see any immediate overlap, but there might be some w.r.t. infrastructure used. Well, you mean you'll make another patch based on yamamoto-san's and mine? Basically, I think it's difficult to merge because we charge different objects. Yes, I know - that's why I said infrastructure, to see the grouping and common data structure aspects. I'll try and review the code. OK. I'll wait for your patch. Hi, Daisuke-San I am not sending out any patch (sorry for the confusion, if I caused it). I am going to review the swapcgroup patchset. No problem :-) I would very appreciate it if you review swap cgroup patches. I'll update my patches based on your and others' comments. Thanks, Daisuke Nishimura. ___ Containers mailing list [EMAIL PROTECTED] https://lists.linux-foundation.org/mailman/listinfo/containers ___ Devel mailing list Devel@openvz.org https://openvz.org/mailman/listinfo/devel
[Devel] Re: [PATCH 3/4] swapcgroup: implement charge/uncharge
On 2008/05/22 16:37 +0900, KAMEZAWA Hiroyuki wrote:

On Thu, 22 May 2008 15:20:05 +0900 Daisuke Nishimura [EMAIL PROTECTED] wrote:

+#ifdef CONFIG_CGROUP_SWAP_RES_CTLR
+int swap_cgroup_charge(struct page *page,
+		struct swap_info_struct *si,
+		unsigned long offset)
+{
+	int ret;
+	struct page_cgroup *pc;
+	struct mem_cgroup *mem;
+
+	lock_page_cgroup(page);
+	pc = page_get_page_cgroup(page);
+	if (unlikely(!pc))
+		mem = &init_mem_cgroup;
+	else
+		mem = pc->mem_cgroup;
+	unlock_page_cgroup(page);

If !pc, the page is used before the memory controller is available. But is it good to be charged to init_mem_cgroup ?

I'm sorry, but I can't understand this situation. The memory controller is initialized at kernel initialization, so aren't processes created after it is initialized?

How about returning 'failure' in this case ?

I think returning 'failure' here is not so bad. Which of the below do you mean by 'failure'?
A. make it fail to get a swap entry, so the page cannot be swapped out.
B. don't charge this swap entry to any cgroup, but the page would be swapped out.
I don't want to do B, because I don't want to make such not-charged-to-anywhere entries. And I've seen several times that this condition (!pc) becomes true, so I charged to init_mem_cgroup.

BTW, I noticed that almost all of the functions I added by this patch set should check mem_cgroup_subsys.disabled first, because they depend on memory cgroup.

+
+	css_get(&mem->css);

moving this css_get() before unlock_page_cgroup() is safer.

OK, thanks.

Daisuke Nishimura.
[Devel] Re: [PATCH 4/4] swapcgroup: modify vm_swap_full for cgroup
On 2008/05/22 21:32 +0900, KOSAKI Motohiro wrote: perhaps, I don't understand your intention exactly. Why can't you make wrapper function? e.g. vm_swap_full(page_to_memcg(page)) OK. I'll try it. ___ Containers mailing list [EMAIL PROTECTED] https://lists.linux-foundation.org/mailman/listinfo/containers ___ Devel mailing list Devel@openvz.org https://openvz.org/mailman/listinfo/devel
[Devel] [PATCH 0/4] swapcgroup(v2)
Hi. I updated my swapcgroup patch.

Major changes from the previous version(*1):
- Rebased on 2.6.26-rc2-mm1 + KAMEZAWA-san's performance improvement patchset v4.
- Implemented as an add-on to memory cgroup. So, there is no need to add a new member to page_cgroup now.
- (NEW) Modified vm_swap_full() to calculate the rate of swap usage per cgroup.

Patches:
- [1/4] add cgroup files
- [2/4] add member to swap_info_struct for cgroup
- [3/4] implement charge/uncharge
- [4/4] modify vm_swap_full for cgroup

ToDo:
- handle force_empty.
- make it possible for users to select if they use this feature or not, and avoid overhead for users not using this feature.
- move charges along with task move between cgroups.

*1 https://lists.linux-foundation.org/pipermail/containers/2008-March/010216.html

Thanks, Daisuke Nishimura.
[Devel] [PATCH 1/4] swapcgroup: add cgroup files
This patch add cgroup files(and a member to struct mem_cgroup) for swapcgroup. The files to be added are: - memory.swap_usage_in_bytes - memory.swap_max_usage_in_bytes - memory.swap_limit_in_bytes - memory.swap_failcnt The meaning of those files are the same as memory cgroup. Signed-off-by: Daisuke Nishimura [EMAIL PROTECTED] --- init/Kconfig|7 + mm/memcontrol.c | 67 +++ 2 files changed, 74 insertions(+), 0 deletions(-) diff --git a/init/Kconfig b/init/Kconfig index 6135d07..aabbb83 100644 --- a/init/Kconfig +++ b/init/Kconfig @@ -407,6 +407,13 @@ config CGROUP_MEM_RES_CTLR This config option also selects MM_OWNER config option, which could in turn add some fork/exit overhead. +config CGROUP_SWAP_RES_CTLR + bool Swap Resource Controller for Control Groups + depends on CGROUP_MEM_RES_CTLR SWAP + help + Provides a swap resource controller that manages and limits swap usage. + Implemented as a add-on to Memory Resource Controller. + config SYSFS_DEPRECATED bool diff --git a/mm/memcontrol.c b/mm/memcontrol.c index a96577f..a837215 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -141,6 +141,12 @@ struct mem_cgroup { * statistics. 
*/ struct mem_cgroup_stat stat; +#ifdef CONFIG_CGROUP_SWAP_RES_CTLR + /* +* the counter to account for swap usage +*/ + struct res_counter swap_res; +#endif }; static struct mem_cgroup init_mem_cgroup; @@ -960,6 +966,39 @@ static int mem_control_stat_show(struct cgroup *cont, struct cftype *cft, return 0; } +#ifdef CONFIG_CGROUP_SWAP_RES_CTLR +static u64 swap_cgroup_read(struct cgroup *cont, struct cftype *cft) +{ + return res_counter_read_u64(mem_cgroup_from_cont(cont)-swap_res, + cft-private); +} + +static ssize_t swap_cgroup_write(struct cgroup *cont, struct cftype *cft, + struct file *file, const char __user *userbuf, + size_t nbytes, loff_t *ppos) +{ + return res_counter_write(mem_cgroup_from_cont(cont)-swap_res, + cft-private, userbuf, nbytes, ppos, + mem_cgroup_write_strategy); +} + +static int swap_cgroup_reset(struct cgroup *cont, unsigned int event) +{ + struct mem_cgroup *mem; + + mem = mem_cgroup_from_cont(cont); + switch (event) { + case RES_MAX_USAGE: + res_counter_reset_max(mem-swap_res); + break; + case RES_FAILCNT: + res_counter_reset_failcnt(mem-swap_res); + break; + } + return 0; +} +#endif + static struct cftype mem_cgroup_files[] = { { .name = usage_in_bytes, @@ -992,6 +1031,31 @@ static struct cftype mem_cgroup_files[] = { .name = stat, .read_map = mem_control_stat_show, }, +#ifdef CONFIG_CGROUP_SWAP_RES_CTLR + { + .name = swap_usage_in_bytes, + .private = RES_USAGE, + .read_u64 = swap_cgroup_read, + }, + { + .name = swap_max_usage_in_bytes, + .private = RES_MAX_USAGE, + .trigger = swap_cgroup_reset, + .read_u64 = swap_cgroup_read, + }, + { + .name = swap_limit_in_bytes, + .private = RES_LIMIT, + .write = swap_cgroup_write, + .read_u64 = swap_cgroup_read, + }, + { + .name = swap_failcnt, + .private = RES_FAILCNT, + .trigger = swap_cgroup_reset, + .read_u64 = swap_cgroup_read, + }, +#endif }; static int alloc_mem_cgroup_per_zone_info(struct mem_cgroup *mem, int node) @@ -1069,6 +1133,9 @@ mem_cgroup_create(struct cgroup_subsys *ss, struct 
cgroup *cont) } res_counter_init(mem-res); +#ifdef CONFIG_CGROUP_SWAP_RES_CTLR + res_counter_init(mem-swap_res); +#endif for_each_node_state(node, N_POSSIBLE) if (alloc_mem_cgroup_per_zone_info(mem, node)) ___ Containers mailing list [EMAIL PROTECTED] https://lists.linux-foundation.org/mailman/listinfo/containers ___ Devel mailing list Devel@openvz.org https://openvz.org/mailman/listinfo/devel
[Devel] [PATCH 2/4] swapcgroup: add member to swap_info_struct for cgroup
This patch add a member to swap_info_struct for cgroup. This member, array of pointers to mem_cgroup, is used to remember to which cgroup each swap entries are charged. The memory for this array of pointers is allocated on swapon, and freed on swapoff. Signed-off-by: Daisuke Nishimura [EMAIL PROTECTED] --- include/linux/swap.h |3 +++ mm/swapfile.c| 32 2 files changed, 35 insertions(+), 0 deletions(-) diff --git a/include/linux/swap.h b/include/linux/swap.h index de40f16..67de27b 100644 --- a/include/linux/swap.h +++ b/include/linux/swap.h @@ -141,6 +141,9 @@ struct swap_info_struct { struct swap_extent *curr_swap_extent; unsigned old_block_size; unsigned short * swap_map; +#ifdef CONFIG_CGROUP_SWAP_RES_CTLR + struct mem_cgroup **memcg; +#endif unsigned int lowest_bit; unsigned int highest_bit; unsigned int cluster_next; diff --git a/mm/swapfile.c b/mm/swapfile.c index d3caf3a..232bf20 100644 --- a/mm/swapfile.c +++ b/mm/swapfile.c @@ -1207,6 +1207,9 @@ asmlinkage long sys_swapoff(const char __user * specialfile) { struct swap_info_struct * p = NULL; unsigned short *swap_map; +#ifdef CONFIG_CGROUP_SWAP_RES_CTLR + struct mem_cgroup **memcg; +#endif struct file *swap_file, *victim; struct address_space *mapping; struct inode *inode; @@ -1309,10 +1312,17 @@ asmlinkage long sys_swapoff(const char __user * specialfile) p-max = 0; swap_map = p-swap_map; p-swap_map = NULL; +#ifdef CONFIG_CGROUP_SWAP_RES_CTLR + memcg = p-memcg; + p-memcg = NULL; +#endif p-flags = 0; spin_unlock(swap_lock); mutex_unlock(swapon_mutex); vfree(swap_map); +#ifdef CONFIG_CGROUP_SWAP_RES_CTLR + vfree(memcg); +#endif inode = mapping-host; if (S_ISBLK(inode-i_mode)) { struct block_device *bdev = I_BDEV(inode); @@ -1456,6 +1466,9 @@ asmlinkage long sys_swapon(const char __user * specialfile, int swap_flags) unsigned long maxpages = 1; int swapfilesize; unsigned short *swap_map; +#ifdef CONFIG_CGROUP_SWAP_RES_CTLR + struct mem_cgroup **memcg; +#endif struct page *page = NULL; struct inode *inode = 
NULL; int did_down = 0; @@ -1479,6 +1492,9 @@ asmlinkage long sys_swapon(const char __user * specialfile, int swap_flags) p-swap_file = NULL; p-old_block_size = 0; p-swap_map = NULL; +#ifdef CONFIG_CGROUP_SWAP_RES_CTLR + p-memcg = NULL; +#endif p-lowest_bit = 0; p-highest_bit = 0; p-cluster_nr = 0; @@ -1651,6 +1667,15 @@ asmlinkage long sys_swapon(const char __user * specialfile, int swap_flags) 1 /* header page */; if (error) goto bad_swap; + +#ifdef CONFIG_CGROUP_SWAP_RES_CTLR + p-memcg = vmalloc(maxpages * sizeof(struct mem_cgroup *)); + if (!p-memcg) { + error = -ENOMEM; + goto bad_swap; + } + memset(p-memcg, 0, maxpages * sizeof(struct mem_cgroup *)); +#endif } if (nr_good_pages) { @@ -1710,11 +1735,18 @@ bad_swap_2: swap_map = p-swap_map; p-swap_file = NULL; p-swap_map = NULL; +#ifdef CONFIG_CGROUP_SWAP_RES_CTLR + memcg = p-memcg; + p-memcg = NULL; +#endif p-flags = 0; if (!(swap_flags SWAP_FLAG_PREFER)) ++least_priority; spin_unlock(swap_lock); vfree(swap_map); +#ifdef CONFIG_CGROUP_SWAP_RES_CTLR + vfree(memcg); +#endif if (swap_file) filp_close(swap_file, NULL); out: ___ Containers mailing list [EMAIL PROTECTED] https://lists.linux-foundation.org/mailman/listinfo/containers ___ Devel mailing list Devel@openvz.org https://openvz.org/mailman/listinfo/devel
[Devel] [PATCH 3/4] swapcgroup: implement charge/uncharge
This patch implements charge and uncharge of swapcgroup.

- what will be charged ?
  charge the number of swap entries in bytes.

- when to charge/uncharge ?
  charge at get_swap_page(), and uncharge at swap_entry_free().

- to what group charge the swap entry ?
  To determine to what mem_cgroup the swap entry should be charged,
  I changed the argument of get_swap_page() from (void) to
  (struct page *). As a result, get_swap_page() can determine
  to what mem_cgroup it should charge the swap entry
  by referring to page->page_cgroup->mem_cgroup.

- from what group uncharge the swap entry ?
  I added to swap_info_struct a member 'struct swap_cgroup **',
  an array of pointers to which swap_cgroup the swap entry is charged.

Signed-off-by: Daisuke Nishimura [EMAIL PROTECTED]
---
 include/linux/memcontrol.h |   21 +++
 include/linux/swap.h       |    4 +-
 mm/memcontrol.c            |   47 ++++
 mm/shmem.c                 |    2 +-
 mm/swap_state.c            |    2 +-
 mm/swapfile.c              |   14 +-
 6 files changed, 85 insertions(+), 5 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index fdf3967..a7e6621 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -24,6 +24,7 @@ struct mem_cgroup;
 struct page_cgroup;
 struct page;
 struct mm_struct;
+struct swap_info_struct;
 
 #ifdef CONFIG_CGROUP_MEM_RES_CTLR
@@ -172,5 +173,25 @@ static inline long mem_cgroup_calc_reclaim_inactive(struct mem_cgroup *mem,
 }
 #endif /* CONFIG_CGROUP_MEM_CONT */
 
+#ifdef CONFIG_CGROUP_SWAP_RES_CTLR
+extern int swap_cgroup_charge(struct page *page,
+			struct swap_info_struct *si,
+			unsigned long offset);
+extern void swap_cgroup_uncharge(struct swap_info_struct *si,
+			unsigned long offset);
+#else /* CONFIG_CGROUP_SWAP_RES_CTLR */
+static inline int swap_cgroup_charge(struct page *page,
+			struct swap_info_struct *si,
+			unsigned long offset)
+{
+	return 0;
+}
+
+static inline void swap_cgroup_uncharge(struct swap_info_struct *si,
+			unsigned long offset)
+{
+}
+#endif /* CONFIG_CGROUP_SWAP_RES_CTLR */
+
 #endif /* _LINUX_MEMCONTROL_H */

diff --git a/include/linux/swap.h b/include/linux/swap.h
index 67de27b..18887f0 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -241,7 +241,7 @@ extern struct page *swapin_readahead(swp_entry_t, gfp_t,
 /* linux/mm/swapfile.c */
 extern long total_swap_pages;
 extern void si_swapinfo(struct sysinfo *);
-extern swp_entry_t get_swap_page(void);
+extern swp_entry_t get_swap_page(struct page *);
 extern swp_entry_t get_swap_page_of_type(int);
 extern int swap_duplicate(swp_entry_t);
 extern int valid_swaphandles(swp_entry_t, unsigned long *);
@@ -342,7 +342,7 @@ static inline int remove_exclusive_swap_page(struct page *p)
 	return 0;
 }
 
-static inline swp_entry_t get_swap_page(void)
+static inline swp_entry_t get_swap_page(struct page *page)
 {
 	swp_entry_t entry;
 	entry.val = 0;

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index a837215..84e803d 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1220,3 +1220,50 @@ struct cgroup_subsys mem_cgroup_subsys = {
 	.attach = mem_cgroup_move_task,
 	.early_init = 0,
 };
+
+#ifdef CONFIG_CGROUP_SWAP_RES_CTLR
+int swap_cgroup_charge(struct page *page,
+			struct swap_info_struct *si,
+			unsigned long offset)
+{
+	int ret;
+	struct page_cgroup *pc;
+	struct mem_cgroup *mem;
+
+	lock_page_cgroup(page);
+	pc = page_get_page_cgroup(page);
+	if (unlikely(!pc))
+		mem = &init_mem_cgroup;
+	else
+		mem = pc->mem_cgroup;
+	unlock_page_cgroup(page);
+
+	css_get(&mem->css);
+	ret = res_counter_charge(&mem->swap_res, PAGE_SIZE);
+	if (!ret)
+		si->memcg[offset] = mem;
+	else
+		css_put(&mem->css);
+
+	return ret;
+}
+
+void swap_cgroup_uncharge(struct swap_info_struct *si,
+			unsigned long offset)
+{
+	struct mem_cgroup *mem = si->memcg[offset];
+
+	/* mem would be NULL:
+	 * 1. when get_swap_page() failed at charging swap_cgroup,
+	 *    and called swap_entry_free().
+	 * 2. when this swap entry had been assigned by
+	 *    get_swap_page_of_type() (via SWSUSP?).
+	 */
+	if (mem) {
+		res_counter_uncharge(&mem->swap_res, PAGE_SIZE);
+		si->memcg[offset] = NULL;
+		css_put(&mem->css);
+	}
+}
+#endif
+
diff --git a/mm/shmem.c b/mm/shmem.c
index 95b056d..69f8909 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -1029,7 +1029,7 @@ static int shmem_writepage(struct page *page, struct writeback_control *wbc)
 	 * want to check if there's a redundant swappage
[Devel] [PATCH 4/4] swapcgroup: modify vm_swap_full for cgroup
This patch modifies vm_swap_full() to calculate the rate of swap usage per cgroup.

The purpose of this change is to free freeable swap caches (that is, swap entries) per cgroup, so that swap_cgroup_charge() fails less frequently.

I think I can free freeable swap caches more aggressively with one of Rik's splitlru patchset:

  [patch 04/14] free swap space on swap-in/activation

but I should verify whether this change to vm_swap_full() is valid.

Signed-off-by: Daisuke Nishimura [EMAIL PROTECTED]
---
 include/linux/memcontrol.h |    3 +++
 include/linux/swap.h       |    6 +-
 mm/memcontrol.c            |   10 ++
 mm/memory.c                |    2 +-
 mm/swapfile.c              |   35 ++-
 5 files changed, 53 insertions(+), 3 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index a7e6621..256b298 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -179,6 +179,9 @@ extern int swap_cgroup_charge(struct page *page,
 			unsigned long offset);
 extern void swap_cgroup_uncharge(struct swap_info_struct *si,
 			unsigned long offset);
+extern int swap_cgroup_vm_swap_full(struct page *page);
+extern u64 swap_cgroup_read_usage(struct mem_cgroup *mem);
+extern u64 swap_cgroup_read_limit(struct mem_cgroup *mem);
 #else /* CONFIG_CGROUP_SWAP_RES_CTLR */
 static inline int swap_cgroup_charge(struct page *page,
 			struct swap_info_struct *si,

diff --git a/include/linux/swap.h b/include/linux/swap.h
index 18887f0..ef156c9 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -159,8 +159,12 @@ struct swap_list_t {
 	int next;	/* swapfile to be used next */
 };
 
+#ifndef CONFIG_CGROUP_SWAP_RES_CTLR
 /* Swap 50% full? Release swapcache more aggressively.. */
-#define vm_swap_full() (nr_swap_pages*2 < total_swap_pages)
+#define vm_swap_full(page) (nr_swap_pages*2 < total_swap_pages)
+#else
+#define vm_swap_full(page) swap_cgroup_vm_swap_full(page)
+#endif
 
 /* linux/mm/page_alloc.c */
 extern unsigned long totalram_pages;

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 84e803d..58d72ca 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1265,5 +1265,15 @@ void swap_cgroup_uncharge(struct swap_info_struct *si,
 		css_put(&mem->css);
 	}
 }
+
+u64 swap_cgroup_read_usage(struct mem_cgroup *mem)
+{
+	return res_counter_read_u64(&mem->swap_res, RES_USAGE);
+}
+
+u64 swap_cgroup_read_limit(struct mem_cgroup *mem)
+{
+	return res_counter_read_u64(&mem->swap_res, RES_LIMIT);
+}
 #endif

diff --git a/mm/memory.c b/mm/memory.c
index df8f0e9..be2ff96 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2175,7 +2175,7 @@ static int do_swap_page(struct mm_struct *mm, struct vm_area_struct *vma,
 	page_add_anon_rmap(page, vma, address);
 
 	swap_free(entry);
-	if (vm_swap_full())
+	if (vm_swap_full(page))
 		remove_exclusive_swap_page(page);
 	unlock_page(page);

diff --git a/mm/swapfile.c b/mm/swapfile.c
index 682b71e..9256c2d 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -429,7 +429,7 @@ void free_swap_and_cache(swp_entry_t entry)
 		/* Only cache user (+us), or swap space full? Free it! */
 		/* Also recheck PageSwapCache after page is locked (above) */
 		if (PageSwapCache(page) && !PageWriteback(page) &&
-				(one_user || vm_swap_full())) {
+				(one_user || vm_swap_full(page))) {
 			delete_from_swap_cache(page);
 			SetPageDirty(page);
 		}
@@ -1892,3 +1892,36 @@ int valid_swaphandles(swp_entry_t entry, unsigned long *offset)
 		*offset = ++toff;
 	return nr_pages? ++nr_pages: 0;
 }
+
+#ifdef CONFIG_CGROUP_SWAP_RES_CTLR
+int swap_cgroup_vm_swap_full(struct page *page)
+{
+	int ret;
+	struct swap_info_struct *p;
+	struct mem_cgroup *mem;
+	u64 usage;
+	u64 limit;
+	swp_entry_t entry;
+
+	VM_BUG_ON(!PageLocked(page));
+	VM_BUG_ON(!PageSwapCache(page));
+
+	ret = 0;
+	entry.val = page_private(page);
+	p = swap_info_get(entry);
+	if (!p)
+		goto out;
+
+	mem = p->memcg[swp_offset(entry)];
+	usage = swap_cgroup_read_usage(mem) / PAGE_SIZE;
+	limit = swap_cgroup_read_limit(mem) / PAGE_SIZE;
+	limit = (limit < total_swap_pages) ? limit : total_swap_pages;
+
+	ret = usage * 2 > limit;
+
+	spin_unlock(&swap_lock);
+
+out:
+	return ret;
+}
+#endif

___
Containers mailing list
[EMAIL PROTECTED]
https://lists.linux-foundation.org/mailman/listinfo/containers
___
Devel mailing list
Devel@openvz.org
https://openvz.org/mailman/listinfo/devel
[Devel] Re: [PATCH 2/4] swapcgroup: add member to swap_info_struct for cgroup
Hi.

On 2008/05/22 16:23 +0900, KAMEZAWA Hiroyuki wrote:
> On Thu, 22 May 2008 15:18:51 +0900
> Daisuke Nishimura [EMAIL PROTECTED] wrote:
> > This patch adds a member to swap_info_struct for cgroup.
> >
> > This member, an array of pointers to mem_cgroup, is used to remember
> > to which cgroup each swap entry is charged.
> >
> > The memory for this array of pointers is allocated on swapon, and
> > freed on swapoff.
>
> Hi, in general, #ifdefs in the middle of functions are not good style.
> I'd like to comment some hints.

I completely agree that it's not good style.

> > Signed-off-by: Daisuke Nishimura [EMAIL PROTECTED]
> > ---
> >  include/linux/swap.h |    3 +++
> >  mm/swapfile.c        |   32 ++++++++++++++++++++
> >  2 files changed, 35 insertions(+), 0 deletions(-)
> >
> > diff --git a/include/linux/swap.h b/include/linux/swap.h
> > index de40f16..67de27b 100644
> > --- a/include/linux/swap.h
> > +++ b/include/linux/swap.h
> > @@ -141,6 +141,9 @@ struct swap_info_struct {
> >  	struct swap_extent *curr_swap_extent;
> >  	unsigned old_block_size;
> >  	unsigned short *swap_map;
> > +#ifdef CONFIG_CGROUP_SWAP_RES_CTLR
> > +	struct mem_cgroup **memcg;
> > +#endif
> >  	unsigned int lowest_bit;
> >  	unsigned int highest_bit;
> >  	unsigned int cluster_next;
> > diff --git a/mm/swapfile.c b/mm/swapfile.c
> > index d3caf3a..232bf20 100644
> > --- a/mm/swapfile.c
> > +++ b/mm/swapfile.c
> > @@ -1207,6 +1207,9 @@ asmlinkage long sys_swapoff(const char __user *specialfile)
> >  {
> >  	struct swap_info_struct *p = NULL;
> >  	unsigned short *swap_map;
> > +#ifdef CONFIG_CGROUP_SWAP_RES_CTLR
> > +	struct mem_cgroup **memcg;
> > +#endif
>
> Remove #ifdef.
>
> 	struct mem_cgroup **memcg = NULL;

good idea. I'll do it.

> >  	struct file *swap_file, *victim;
> >  	struct address_space *mapping;
> >  	struct inode *inode;
> > @@ -1309,10 +1312,17 @@ asmlinkage long sys_swapoff(const char __user *specialfile)
> >  	p->max = 0;
> >  	swap_map = p->swap_map;
> >  	p->swap_map = NULL;
> > +#ifdef CONFIG_CGROUP_SWAP_RES_CTLR
> > +	memcg = p->memcg;
> > +	p->memcg = NULL;
> > +#endif
>
> ==
> #ifdef CONFIG_CGROUP_SWAP_RES_CTR
> void swap_cgroup_init_memcg(p, memcg)
> {
> 	do something.
> }
> #else
> void swap_cgroup_init_memcg(p, memcg)
> {
> }
> #endif
> ==

I think swap_cgroup_init_memcg should return the old value of p->memcg,
and I would like to name it swap_cgroup_clear_memcg, because it is called
by sys_swapoff, so "clear" rather than "init" would be better.

How about something like this?

	struct mem_cgroup **swap_cgroup_clear_memcg(p, memcg)
	{
		struct mem_cgroup **mem;

		mem = p->memcg;
		p->memcg = NULL;

		return mem;
	}

and at sys_swapoff():

	struct mem_cgroup **memcg;
	:
	memcg = swap_cgroup_clear_memcg(p, memcg);
	:
	if (memcg)
		vfree(memcg);

> >  	p->flags = 0;
> >  	spin_unlock(&swap_lock);
> >  	mutex_unlock(&swapon_mutex);
> >  	vfree(swap_map);
> > +#ifdef CONFIG_CGROUP_SWAP_RES_CTLR
> > +	vfree(memcg);
> > +#endif
>
> 	if (memcg)
> 		vfree(memcg);

will do.

> >  	inode = mapping->host;
> >  	if (S_ISBLK(inode->i_mode)) {
> >  		struct block_device *bdev = I_BDEV(inode);
> > @@ -1456,6 +1466,9 @@ asmlinkage long sys_swapon(const char __user *specialfile, int swap_flags)
> >  	unsigned long maxpages = 1;
> >  	int swapfilesize;
> >  	unsigned short *swap_map;
> > +#ifdef CONFIG_CGROUP_SWAP_RES_CTLR
> > +	struct mem_cgroup **memcg;
> > +#endif
>
> Remove #ifdefs

will do.

> >  	struct page *page = NULL;
> >  	struct inode *inode = NULL;
> >  	int did_down = 0;
> > @@ -1479,6 +1492,9 @@ asmlinkage long sys_swapon(const char __user *specialfile, int swap_flags)
> >  	p->swap_file = NULL;
> >  	p->old_block_size = 0;
> >  	p->swap_map = NULL;
> > +#ifdef CONFIG_CGROUP_SWAP_RES_CTLR
> > +	p->memcg = NULL;
> > +#endif
>
> 	void init_swap_ctlr_memcg(p);

I would like to call this one swap_cgroup_init_memcg.

> >  	p->lowest_bit = 0;
> >  	p->highest_bit = 0;
> >  	p->cluster_nr = 0;
> > @@ -1651,6 +1667,15 @@ asmlinkage long sys_swapon(const char __user *specialfile, int swap_flags)
> >  			1 /* header page */;
> >  	if (error)
> >  		goto bad_swap;
> > +
> > +#ifdef CONFIG_CGROUP_SWAP_RES_CTLR
> > +	p->memcg = vmalloc(maxpages * sizeof(struct mem_cgroup *));
> > +	if (!p->memcg) {
> > +		error = -ENOMEM;
> > +		goto bad_swap;
> > +	}
> > +	memset(p->memcg, 0, maxpages * sizeof(struct mem_cgroup *));
> > +#endif
>
> 	void alloc_swap_ctlr_memcg(p)

OK. I'll implement swap_cgroup_alloc_memcg. But this implies swapon
will fail at memory shortage. Is it good ? Hum. Would it be better to
just disable this feature?

> >  	}
> >
> >  	if (nr_good_pages) {
> > @@ -1710,11 +1735,18 @@ bad_swap_2:
> >  	swap_map = p->swap_map;
> >  	p->swap_file = NULL;
> >  	p->swap_map = NULL;
> > +#ifdef CONFIG_CGROUP_SWAP_RES_CTLR
> > +	memcg = p->memcg;
> > +	p->memcg = NULL;
> > +#endif
> >  	p->flags = 0;
> >  	if (!(swap_flags & SWAP_FLAG_PREFER))
> >  		++least_priority;
> >  	spin_unlock(&swap_lock);
[Devel] Re: [PATCH 4/4] swapcgroup: modify vm_swap_full for cgroup
Hi,

On 2008/05/22 17:00 +0900, KOSAKI Motohiro wrote:
> Hi,
>
> > +#ifndef CONFIG_CGROUP_SWAP_RES_CTLR
> >  /* Swap 50% full? Release swapcache more aggressively.. */
> > -#define vm_swap_full() (nr_swap_pages*2 < total_swap_pages)
> > +#define vm_swap_full(page) (nr_swap_pages*2 < total_swap_pages)
> > +#else
> > +#define vm_swap_full(page) swap_cgroup_vm_swap_full(page)
> > +#endif
>
> I'd prefer #ifdef rather than #ifndef. so...
>
> #ifdef CONFIG_CGROUP_SWAP_RES_CTLR
> 	your definition
> #else
> 	original definition
> #endif

OK. I'll change it.

> and vm_swap_full() isn't a page-granularity operation. this is a
> memory (or swap) cgroup operation. this argument is slightly odd.

But what the callers of vm_swap_full() know is the page, not the
mem_cgroup. I don't want to add to callers something like:

	pc = get_page_cgroup(page);
	mem = pc->mem_cgroup;
	vm_swap_full(mem);

Thanks,
Daisuke Nishimura.
[Devel] Re: [PATCH 4/4] swapcgroup: modify vm_swap_full for cgroup
Hi,

On 2008/05/22 15:45 +0900, YAMAMOTO Takashi wrote:
> > @@ -1892,3 +1892,36 @@ int valid_swaphandles(swp_entry_t entry, unsigned long *offset)
> >  		*offset = ++toff;
> >  	return nr_pages? ++nr_pages: 0;
> >  }
> > +
> > +#ifdef CONFIG_CGROUP_SWAP_RES_CTLR
> > +int swap_cgroup_vm_swap_full(struct page *page)
> > +{
> > +	int ret;
> > +	struct swap_info_struct *p;
> > +	struct mem_cgroup *mem;
> > +	u64 usage;
> > +	u64 limit;
> > +	swp_entry_t entry;
> > +
> > +	VM_BUG_ON(!PageLocked(page));
> > +	VM_BUG_ON(!PageSwapCache(page));
> > +
> > +	ret = 0;
> > +	entry.val = page_private(page);
> > +	p = swap_info_get(entry);
> > +	if (!p)
> > +		goto out;
> > +
> > +	mem = p->memcg[swp_offset(entry)];
> > +	usage = swap_cgroup_read_usage(mem) / PAGE_SIZE;
> > +	limit = swap_cgroup_read_limit(mem) / PAGE_SIZE;
> > +	limit = (limit < total_swap_pages) ? limit : total_swap_pages;
> > +
> > +	ret = usage * 2 > limit;
> > +
> > +	spin_unlock(&swap_lock);
> > +
> > +out:
> > +	return ret;
> > +}
> > +#endif
>
> shouldn't it check the global usage (nr_swap_pages) as well?
>
> YAMAMOTO Takashi

I didn't check global usage because I didn't want some group to be
influenced by other groups. But in the above code, there would be some
cases where vm_swap_full() returns false even when more than half of
swap is used globally. Thank you for pointing it out.

How about something like this?

	:
	usage = swap_cgroup_read_usage(mem);	/* no need to align to number of pages */
	limit = swap_cgroup_read_limit(mem);	/* no need to align to number of pages */
	ret = (usage * 2 > limit) || (nr_swap_pages * 2 < total_swap_pages);
	:

Thanks,
Daisuke Nishimura.
[Devel] Re: [PATCH 0/4] swapcgroup(v2)
On 2008/05/22 16:44 +0900, KAMEZAWA Hiroyuki wrote:
> On Thu, 22 May 2008 15:13:41 +0900
> Daisuke Nishimura [EMAIL PROTECTED] wrote:
> > Hi. I updated my swapcgroup patch.
>
> seems good in general.

Thanks :-)

> > ToDo:
> > - handle force_empty.
>
> Without this, we cannot do rmdir() against a cgroup with swap. right ?

You are right. There are some cases where the cgroup dir cannot be
removed because some swap usage remains even when no tasks remain in
the dir. In such cases, the only way to remove the dir is currently to
do swapoff. So, I think this is the most important todo.

> > - make it possible for users to select if they use this feature or
> >   not, and avoid overhead for users not using this feature.
> > - move charges along with task move between cgroups.
>
> I think memory-controller's anon pages should also do this.
> But how do you think about shared entries ?

Yes. This is a big problem. I don't have any practical idea yet, but at
least I think we should avoid charging a shared entry to different
groups.

Thanks,
Daisuke Nishimura.
[Devel] Re: [PATCH 0/4] swapcgroup(v2)
(sorry, I sent the previous mail before completing it)

On 2008/05/23 11:10 +0900, Daisuke Nishimura wrote:
> On 2008/05/22 16:44 +0900, KAMEZAWA Hiroyuki wrote:
(snip)
> > > - make it possible for users to select if they use this feature or
> > >   not, and avoid overhead for users not using this feature.
> > > - move charges along with task move between cgroups.
> >
> > I think memory-controller's anon pages should also do this.

I want it too. Not only is it useful for users IMHO, but also I need it
to charge a swap entry to the group which the task belongs to at the
point of swapout.

Thanks,
Daisuke Nishimura.
[Devel] Re: [PATCH 0/4] swapcgroup(v2)
Hi.

On 2008/05/23 6:27 +0900, Balbir Singh wrote:
> Daisuke Nishimura wrote:
> > Hi. I updated my swapcgroup patch.
> >
> > Major changes from the previous version(*1):
> > - Rebased on 2.6.26-rc2-mm1 + KAMEZAWA-san's performance improvement
> >   patchset v4.
> > - Implemented as an add-on to the memory cgroup. So, there is no need
> >   to add a new member to page_cgroup now.
> > - (NEW) Modified vm_swap_full() to calculate the rate of swap usage
> >   per cgroup.
> >
> > Patches:
> > - [1/4] add cgroup files
> > - [2/4] add member to swap_info_struct for cgroup
> > - [3/4] implement charge/uncharge
> > - [4/4] modify vm_swap_full for cgroup
> >
> > ToDo:
> > - handle force_empty.
> > - make it possible for users to select if they use this feature or
> >   not, and avoid overhead for users not using this feature.
> > - move charges along with task move between cgroups.
>
> Thanks for looking into this. Yamamoto-San is also looking into a swap
> controller. Is there a consensus on the approach?

Not yet, but I think we should reach some consensus with each other
before going further.

Thanks,
Daisuke Nishimura.