Author: arekm                        Date: Thu Nov 10 05:57:46 2011 GMT
Module: packages                     Tag: LINUX_2_6_38
---- Log message:
- rel 6; cgroup memory limit fixes

---- Files affected:
packages/kernel:
   kernel-small_fixes.patch (1.25.2.7 -> 1.25.2.8) , kernel.spec (1.924.2.10 -> 1.924.2.11)

---- Diffs:

================================================================
Index: packages/kernel/kernel-small_fixes.patch
diff -u packages/kernel/kernel-small_fixes.patch:1.25.2.7 packages/kernel/kernel-small_fixes.patch:1.25.2.8
--- packages/kernel/kernel-small_fixes.patch:1.25.2.7	Mon Sep 26 09:04:03 2011
+++ packages/kernel/kernel-small_fixes.patch	Thu Nov 10 06:57:40 2011
@@ -454,3 +454,458 @@
  * Flags for inode locking.
  * Bit ranges:	1<<1 - 1<<16-1 -- iolock/ilock modes (bitfield)
  *		1<<16 - 1<<32-1 -- lockdep annotation (integers)
+commit 79dfdaccd1d5b40ff7cf4a35a0e63696ebb78b4d
+Author: Michal Hocko <[email protected]>
+Date:   Tue Jul 26 16:08:23 2011 -0700
+
+    memcg: make oom_lock 0 and 1 based rather than counter
+
+    Commit 867578cb ("memcg: fix oom kill behavior") introduced a oom_lock
+    counter which is incremented by mem_cgroup_oom_lock when we are about
+    to handle memcg OOM situation. mem_cgroup_handle_oom falls back to a
+    sleep if oom_lock > 1 to prevent from multiple oom kills at the same
+    time. The counter is then decremented by mem_cgroup_oom_unlock called
+    from the same function.
+
+    This works correctly but it can lead to serious starvations when we
+    have many processes triggering OOM and many CPUs available for them
+    (I have tested with 16 CPUs).
+
+    Consider a process (call it A) which gets the oom_lock (the first one
+    that got to mem_cgroup_handle_oom and grabbed memcg_oom_mutex) and
+    other processes that are blocked on the mutex. While A releases the
+    mutex and calls mem_cgroup_out_of_memory others will wake up (one
+    after another) and increase the counter and fall into sleep
+    (memcg_oom_waitq).
+
+    Once A finishes mem_cgroup_out_of_memory it takes the mutex again and
+    decreases oom_lock and wakes other tasks (if releasing memory by
+    somebody else - e.g. killed process - hasn't done it yet).
+
+    A testcase would look like:
+      Assume malloc XXX is a program allocating XXX Megabytes of memory
+      which touches all allocated pages in a tight loop
+      # swapoff SWAP_DEVICE
+      # cgcreate -g memory:A
+      # cgset -r memory.oom_control=0 A
+      # cgset -r memory.limit_in_bytes= 200M
+      # for i in `seq 100`
+      # do
+      #     cgexec -g memory:A malloc 10 &
+      # done
+
+    The main problem here is that all processes still race for the mutex
+    and there is no guarantee that we will get counter back to 0 for those
+    that got back to mem_cgroup_handle_oom. In the end the whole convoy
+    in/decreases the counter but we do not get to 1 that would enable
+    killing so nothing useful can be done. The time is basically unbounded
+    because it highly depends on scheduling and ordering on mutex (I have
+    seen this taking hours...).
+
+    This patch replaces the counter by a simple {un}lock semantic. As
+    mem_cgroup_oom_{un}lock works on the a subtree of a hierarchy we have
+    to make sure that nobody else races with us which is guaranteed by the
+    memcg_oom_mutex.
+
+    We have to be careful while locking subtrees because we can encounter
+    a subtree which is already locked: hierarchy:
+
+          A
+         / \
+        B   \
+       /\    \
+      C  D    E
+
+    B - C - D tree might be already locked. While we want to enable
+    locking E subtree because OOM situations cannot influence each other
+    we definitely do not want to allow locking A.
+
+    Therefore we have to refuse lock if any subtree is already locked and
+    clear up the lock for all nodes that have been set up to the failure
+    point.
+
+    On the other hand we have to make sure that the rest of the world will
+    recognize that a group is under OOM even though it doesn't have a lock.
+    Therefore we have to introduce under_oom variable which is incremented
+    and decremented for the whole subtree when we enter resp. leave
+    mem_cgroup_handle_oom. under_oom, unlike oom_lock, doesn't need be
+    updated under memcg_oom_mutex because its users only check a single
+    group and they use atomic operations for that.
+
+    This can be checked easily by the following test case:
+
+    # cgcreate -g memory:A
+    # cgset -r memory.use_hierarchy=1 A
+    # cgset -r memory.oom_control=1 A
+    # cgset -r memory.limit_in_bytes= 100M
+    # cgset -r memory.memsw.limit_in_bytes= 100M
+    # cgcreate -g memory:A/B
+    # cgset -r memory.oom_control=1 A/B
+    # cgset -r memory.limit_in_bytes=20M
+    # cgset -r memory.memsw.limit_in_bytes=20M
+    # cgexec -g memory:A/B malloc 30 &  #->this will be blocked by OOM of group B
+    # cgexec -g memory:A malloc 80 &    #->this will be blocked by OOM of group A
+
+    While B gets oom_lock A will not get it. Both of them go into sleep
+    and wait for an external action. We can make the limit higher for A
+    to enforce waking it up
+
+    # cgset -r memory.memsw.limit_in_bytes=300M A
+    # cgset -r memory.limit_in_bytes=300M A
+
+    malloc in A has to wake up even though it doesn't have oom_lock.
+
+    Finally, the unlock path is very easy because we always unlock only
+    the subtree we have locked previously while we always decrement
+    under_oom.
+
+    Signed-off-by: Michal Hocko <[email protected]>
+    Signed-off-by: KAMEZAWA Hiroyuki <[email protected]>
+    Cc: Balbir Singh <[email protected]>
+    Signed-off-by: Andrew Morton <[email protected]>
+    Signed-off-by: Linus Torvalds <[email protected]>
+
+diff --git a/mm/memcontrol.c b/mm/memcontrol.c
+index 8559966..95d6c25 100644
+--- a/mm/memcontrol.c
++++ b/mm/memcontrol.c
+@@ -246,7 +246,10 @@ struct mem_cgroup {
+ 	 * Should the accounting and control be hierarchical, per subtree?
+ 	 */
+ 	bool use_hierarchy;
+-	atomic_t	oom_lock;
++
++	bool		oom_lock;
++	atomic_t	under_oom;
++
+ 	atomic_t	refcnt;
+ 
+ 	int	swappiness;
+@@ -1722,37 +1725,83 @@ static int mem_cgroup_hierarchical_reclaim(struct mem_cgroup *root_mem,
+ /*
+  * Check OOM-Killer is already running under our hierarchy.
+  * If someone is running, return false.
++ * Has to be called with memcg_oom_mutex
+  */
+ static bool mem_cgroup_oom_lock(struct mem_cgroup *mem)
+ {
+-	int x, lock_count = 0;
+-	struct mem_cgroup *iter;
++	int lock_count = -1;
++	struct mem_cgroup *iter, *failed = NULL;
++	bool cond = true;
+ 
+-	for_each_mem_cgroup_tree(iter, mem) {
+-		x = atomic_inc_return(&iter->oom_lock);
+-		lock_count = max(x, lock_count);
++	for_each_mem_cgroup_tree_cond(iter, mem, cond) {
++		bool locked = iter->oom_lock;
++
++		iter->oom_lock = true;
++		if (lock_count == -1)
++			lock_count = iter->oom_lock;
++		else if (lock_count != locked) {
++			/*
++			 * this subtree of our hierarchy is already locked
++			 * so we cannot give a lock.
++			 */
++			lock_count = 0;
++			failed = iter;
++			cond = false;
++		}
+ 	}
+ 
+-	if (lock_count == 1)
+-		return true;
+-	return false;
++	if (!failed)
++		goto done;
++
++	/*
++	 * OK, we failed to lock the whole subtree so we have to clean up
++	 * what we set up to the failing subtree
++	 */
++	cond = true;
++	for_each_mem_cgroup_tree_cond(iter, mem, cond) {
++		if (iter == failed) {
++			cond = false;
++			continue;
++		}
++		iter->oom_lock = false;
++	}
++done:
++	return lock_count;
+ }
+ 
++/*
++ * Has to be called with memcg_oom_mutex
++ */
+ static int mem_cgroup_oom_unlock(struct mem_cgroup *mem)
+ {
+ 	struct mem_cgroup *iter;
+ 
++	for_each_mem_cgroup_tree(iter, mem)
++		iter->oom_lock = false;
++	return 0;
++}
++
++static void mem_cgroup_mark_under_oom(struct mem_cgroup *mem)
++{
++	struct mem_cgroup *iter;
++
++	for_each_mem_cgroup_tree(iter, mem)
++		atomic_inc(&iter->under_oom);
++}
++
++static void mem_cgroup_unmark_under_oom(struct mem_cgroup *mem)
++{
++	struct mem_cgroup *iter;
++
+ 	/*
+ 	 * When a new child is created while the hierarchy is under oom,
+ 	 * mem_cgroup_oom_lock() may not be called. We have to use
+ 	 * atomic_add_unless() here.
+ 	 */
+ 	for_each_mem_cgroup_tree(iter, mem)
+-		atomic_add_unless(&iter->oom_lock, -1, 0);
+-	return 0;
++		atomic_add_unless(&iter->under_oom, -1, 0);
+ }
+ 
+-
+ static DEFINE_MUTEX(memcg_oom_mutex);
+ static DECLARE_WAIT_QUEUE_HEAD(memcg_oom_waitq);
+ 
+@@ -1794,7 +1843,7 @@ static void memcg_wakeup_oom(struct mem_cgroup *mem)
+ 
+ static void memcg_oom_recover(struct mem_cgroup *mem)
+ {
+-	if (mem && atomic_read(&mem->oom_lock))
++	if (mem && atomic_read(&mem->under_oom))
+ 		memcg_wakeup_oom(mem);
+ }
+ 
+@@ -1812,6 +1861,8 @@ bool mem_cgroup_handle_oom(struct mem_cgroup *mem, gfp_t mask)
+ 	owait.wait.private = current;
+ 	INIT_LIST_HEAD(&owait.wait.task_list);
+ 	need_to_kill = true;
++	mem_cgroup_mark_under_oom(mem);
++
+ 	/* At first, try to OOM lock hierarchy under mem.*/
+ 	mutex_lock(&memcg_oom_mutex);
+ 	locked = mem_cgroup_oom_lock(mem);
+@@ -1835,10 +1886,13 @@ bool mem_cgroup_handle_oom(struct mem_cgroup *mem, gfp_t mask)
+ 		finish_wait(&memcg_oom_waitq, &owait.wait);
+ 	}
+ 	mutex_lock(&memcg_oom_mutex);
+-	mem_cgroup_oom_unlock(mem);
++	if (locked)
++		mem_cgroup_oom_unlock(mem);
+ 	memcg_wakeup_oom(mem);
+ 	mutex_unlock(&memcg_oom_mutex);
+ 
++	mem_cgroup_unmark_under_oom(mem);
++
+ 	if (test_thread_flag(TIF_MEMDIE) || fatal_signal_pending(current))
+ 		return false;
+ 	/* Give chance to dying process */
+@@ -4505,7 +4559,7 @@ static int mem_cgroup_oom_register_event(struct cgroup *cgrp,
+ 	list_add(&event->list, &memcg->oom_notify);
+ 
+ 	/* already in OOM ? */
+-	if (atomic_read(&memcg->oom_lock))
++	if (atomic_read(&memcg->under_oom))
+ 		eventfd_signal(eventfd, 1);
+ 	mutex_unlock(&memcg_oom_mutex);
+ 
+@@ -4540,7 +4594,7 @@ static int mem_cgroup_oom_control_read(struct cgroup *cgrp,
+ 
+ 	cb->fill(cb, "oom_kill_disable", mem->oom_kill_disable);
+ 
+-	if (atomic_read(&mem->oom_lock))
++	if (atomic_read(&mem->under_oom))
+ 		cb->fill(cb, "under_oom", 1);
+ 	else
+ 		cb->fill(cb, "under_oom", 0);
+commit 1d65f86db14806cf7b1218c7b4ecb8b4db5af27d
+Author: KAMEZAWA Hiroyuki <[email protected]>
+Date:   Mon Jul 25 17:12:27 2011 -0700
+
+    mm: preallocate page before lock_page() at filemap COW
+
+    Currently we are keeping faulted page locked throughout whole
+    __do_fault call (except for page_mkwrite code path) after calling file
+    system's fault code. If we do early COW, we allocate a new page which
+    has to be charged for a memcg (mem_cgroup_newpage_charge).
+
+    This function, however, might block for unbounded amount of time if
+    memcg oom killer is disabled or fork-bomb is running because the only
+    way out of the OOM situation is either an external event or
+    OOM-situation fix.
+
+    In the end we are keeping the faulted page locked and blocking other
+    processes from faulting it in which is not good at all because we are
+    basically punishing potentially an unrelated process for OOM condition
+    in a different group (I have seen stuck system because of ld-2.11.1.so
+    being locked).
+
+    We can do test easily.
+
+     % cgcreate -g memory:A
+     % cgset -r memory.limit_in_bytes=64M A
+     % cgset -r memory.memsw.limit_in_bytes=64M A
+     % cd kernel_dir; cgexec -g memory:A make -j
+
+    Then, the whole system will live-locked until you kill 'make -j'
+    by hands (or push reboot...) This is because some important page in a
+    a shared library are locked.
+
+    Considering again, the new page is not necessary to be allocated
+    with lock_page() held. And usual page allocation may dive into
+    long memory reclaim loop with holding lock_page() and can cause
+    very long latency.
+
+    There are 3 ways.
+      1. do allocation/charge before lock_page()
+         Pros. - simple and can handle page allocation in the same manner.
+           This will reduce holding time of lock_page() in general.
+         Cons. - we do page allocation even if ->fault() returns error.
+
+      2. do charge after unlock_page(). Even if charge fails, it's just OOM.
+         Pros. - no impact to non-memcg path.
+         Cons. - implemenation requires special cares of LRU and we need
+           to modify page_add_new_anon_rmap()...
+
+      3. do unlock->charge->lock again method.
+         Pros. - no impact to non-memcg path.
+         Cons. - This may kill LOCK_PAGE_RETRY optimization. We need to
+           release lock and get it again...
+
+    This patch moves "charge" and memory allocation for COW page
+    before lock_page(). Then, we can avoid scanning LRU with holding
+    a lock on a page and latency under lock_page() will be reduced.
+
+    Then, above livelock disappears.
+
+    [[email protected]: fix code layout]
+    Signed-off-by: KAMEZAWA Hiroyuki <[email protected]>
+    Reported-by: Lutz Vieweg <[email protected]>
+    Original-idea-by: Michal Hocko <[email protected]>
+    Cc: Michal Hocko <[email protected]>
+    Cc: Ying Han <[email protected]>
+    Cc: Johannes Weiner <[email protected]>
+    Cc: Daisuke Nishimura <[email protected]>
+    Signed-off-by: Andrew Morton <[email protected]>
+    Signed-off-by: Linus Torvalds <[email protected]>
+
+diff --git a/mm/memory.c b/mm/memory.c
+index a58bbeb..3c9f3aa 100644
+--- a/mm/memory.c
++++ b/mm/memory.c
+@@ -3093,14 +3093,34 @@ static int __do_fault(struct mm_struct *mm, struct vm_area_struct *vma,
+ 	pte_t *page_table;
+ 	spinlock_t *ptl;
+ 	struct page *page;
++	struct page *cow_page;
+ 	pte_t entry;
+ 	int anon = 0;
+-	int charged = 0;
+ 	struct page *dirty_page = NULL;
+ 	struct vm_fault vmf;
+ 	int ret;
+ 	int page_mkwrite = 0;
+ 
++	/*
++	 * If we do COW later, allocate page befor taking lock_page()
++	 * on the file cache page. This will reduce lock holding time.
++	 */
++	if ((flags & FAULT_FLAG_WRITE) && !(vma->vm_flags & VM_SHARED)) {
++
++		if (unlikely(anon_vma_prepare(vma)))
++			return VM_FAULT_OOM;
++
++		cow_page = alloc_page_vma(GFP_HIGHUSER_MOVABLE, vma, address);
++		if (!cow_page)
++			return VM_FAULT_OOM;
++
++		if (mem_cgroup_newpage_charge(cow_page, mm, GFP_KERNEL)) {
++			page_cache_release(cow_page);
++			return VM_FAULT_OOM;
++		}
++	} else
++		cow_page = NULL;
++
+ 	vmf.virtual_address = (void __user *)(address & PAGE_MASK);
+ 	vmf.pgoff = pgoff;
+ 	vmf.flags = flags;
+@@ -3109,12 +3129,13 @@ static int __do_fault(struct mm_struct *mm, struct vm_area_struct *vma,
+ 	ret = vma->vm_ops->fault(vma, &vmf);
+ 	if (unlikely(ret & (VM_FAULT_ERROR | VM_FAULT_NOPAGE |
+ 			    VM_FAULT_RETRY)))
+-		return ret;
++		goto uncharge_out;
+ 
+ 	if (unlikely(PageHWPoison(vmf.page))) {
+ 		if (ret & VM_FAULT_LOCKED)
+ 			unlock_page(vmf.page);
+-		return VM_FAULT_HWPOISON;
++		ret = VM_FAULT_HWPOISON;
++		goto uncharge_out;
+ 	}
+ 
+ 	/*
+@@ -3132,23 +3153,8 @@ static int __do_fault(struct mm_struct *mm, struct vm_area_struct *vma,
+ 	page = vmf.page;
+ 	if (flags & FAULT_FLAG_WRITE) {
+ 		if (!(vma->vm_flags & VM_SHARED)) {
++			page = cow_page;
+ 			anon = 1;
+-			if (unlikely(anon_vma_prepare(vma))) {
+-				ret = VM_FAULT_OOM;
+-				goto out;
+-			}
+-			page = alloc_page_vma(GFP_HIGHUSER_MOVABLE,
+-						vma, address);
+-			if (!page) {
+-				ret = VM_FAULT_OOM;
+-				goto out;
+-			}
+-			if (mem_cgroup_newpage_charge(page, mm, GFP_KERNEL)) {
+-				ret = VM_FAULT_OOM;
+-				page_cache_release(page);
+-				goto out;
+-			}
+-			charged = 1;
+ 			copy_user_highpage(page, vmf.page, address, vma);
+ 			__SetPageUptodate(page);
+ 		} else {
+@@ -3217,8 +3223,8 @@ static int __do_fault(struct mm_struct *mm, struct vm_area_struct *vma,
+ 		/* no need to invalidate: a not-present page won't be cached */
+ 		update_mmu_cache(vma, address, page_table);
+ 	} else {
+-		if (charged)
+-			mem_cgroup_uncharge_page(page);
++		if (cow_page)
++			mem_cgroup_uncharge_page(cow_page);
+ 		if (anon)
+ 			page_cache_release(page);
+ 		else
+@@ -3227,7 +3233,6 @@ static int __do_fault(struct mm_struct *mm, struct vm_area_struct *vma,
+ 
+ 	pte_unmap_unlock(page_table, ptl);
+ 
+-out:
+ 	if (dirty_page) {
+ 		struct address_space *mapping = page->mapping;
+ 
+@@ -3257,6 +3262,13 @@ out:
+ unwritable_page:
+ 	page_cache_release(page);
+ 	return ret;
++uncharge_out:
++	/* fs's fault handler get error */
++	if (cow_page) {
++		mem_cgroup_uncharge_page(cow_page);
++		page_cache_release(cow_page);
++	}
++	return ret;
+ }
+ 
+ static int do_linear_fault(struct mm_struct *mm, struct vm_area_struct *vma,
================================================================
Index: packages/kernel/kernel.spec
diff -u packages/kernel/kernel.spec:1.924.2.10 packages/kernel/kernel.spec:1.924.2.11
--- packages/kernel/kernel.spec:1.924.2.10	Wed Oct 5 21:39:08 2011
+++ packages/kernel/kernel.spec	Thu Nov 10 06:57:40 2011
@@ -95,7 +95,7 @@
 %define		basever		2.6.38
 %define		postver		.8
-%define		rel		5
+%define		rel		6
 
 %define		_enable_debug_packages	0
 
@@ -1577,6 +1577,9 @@
 All persons listed below can be reached at <cvs_login>@pld-linux.org
 
 $Log$
+Revision 1.924.2.11  2011/11/10 05:57:40  arekm
+- rel 6; cgroup memory limit fixes
+
 Revision 1.924.2.10  2011/10/05 19:39:08  glen
 - patch-2.5.9-3.amd64 fails to apply grsec patch, 2.6.1 applies fine, so
   assume 2.6.0 is needed

================================================================

---- CVS-web:
    http://cvs.pld-linux.org/cgi-bin/cvsweb.cgi/packages/kernel/kernel-small_fixes.patch?r1=1.25.2.7&r2=1.25.2.8&f=u
    http://cvs.pld-linux.org/cgi-bin/cvsweb.cgi/packages/kernel/kernel.spec?r1=1.924.2.10&r2=1.924.2.11&f=u

_______________________________________________
pld-cvs-commit mailing list
[email protected]
http://lists.pld-linux.org/mailman/listinfo/pld-cvs-commit
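For anyone reproducing the testcases quoted in the commit messages above: they assume a helper program called "malloc" that allocates the given number of megabytes and touches every allocated page in a tight loop. A minimal userspace sketch of such a helper follows; the program name and the megabyte argument come from the quoted description, while the argument parsing, page-size handling and endless touch loop are assumptions of this sketch, not part of the commits themselves.

    /*
     * Sketch of the "malloc NNN" test helper: allocate NNN megabytes and
     * keep touching every page so the memory stays resident and the memcg
     * limit is eventually hit.
     */
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>

    int main(int argc, char **argv)
    {
    	size_t mb = argc > 1 ? strtoul(argv[1], NULL, 10) : 1;
    	size_t len = mb << 20;
    	size_t page = (size_t)sysconf(_SC_PAGESIZE);
    	char *buf = malloc(len);

    	if (!buf) {
    		perror("malloc");
    		return 1;
    	}
    	/* touch all allocated pages in a tight loop, forever */
    	for (;;) {
    		size_t off;
    		for (off = 0; off < len; off += page)
    			buf[off] = 1;
    	}
    	return 0;
    }

Built with e.g. "gcc -o malloc malloc.c", it can stand in for the "malloc 10", "malloc 30" and "malloc 80" invocations in the cgexec lines of the quoted testcases.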
