Author: arekm                        Date: Thu Nov 10 05:57:46 2011 GMT
Module: packages                     Tag: LINUX_2_6_38
---- Log message:
- rel 6; cgroup memory limit fixes

---- Files affected:
packages/kernel:
   kernel-small_fixes.patch (1.25.2.7 -> 1.25.2.8) , kernel.spec (1.924.2.10 -> 1.924.2.11)

---- Diffs:

================================================================
Index: packages/kernel/kernel-small_fixes.patch
diff -u packages/kernel/kernel-small_fixes.patch:1.25.2.7 packages/kernel/kernel-small_fixes.patch:1.25.2.8
--- packages/kernel/kernel-small_fixes.patch:1.25.2.7	Mon Sep 26 09:04:03 2011
+++ packages/kernel/kernel-small_fixes.patch	Thu Nov 10 06:57:40 2011
@@ -454,3 +454,458 @@
  * Flags for inode locking.
  * Bit ranges:	1<<1 - 1<<16-1 -- iolock/ilock modes (bitfield)
  *		1<<16 - 1<<32-1 -- lockdep annotation (integers)
+commit 79dfdaccd1d5b40ff7cf4a35a0e63696ebb78b4d
+Author: Michal Hocko <[email protected]>
+Date:   Tue Jul 26 16:08:23 2011 -0700
+
+    memcg: make oom_lock 0 and 1 based rather than counter
+
+    Commit 867578cb ("memcg: fix oom kill behavior") introduced a oom_lock
+    counter which is incremented by mem_cgroup_oom_lock when we are about
+    to handle memcg OOM situation. mem_cgroup_handle_oom falls back to a
+    sleep if oom_lock > 1 to prevent from multiple oom kills at the same
+    time. The counter is then decremented by mem_cgroup_oom_unlock called
+    from the same function.
+
+    This works correctly but it can lead to serious starvations when we
+    have many processes triggering OOM and many CPUs available for them
+    (I have tested with 16 CPUs).
+
+    Consider a process (call it A) which gets the oom_lock (the first one
+    that got to mem_cgroup_handle_oom and grabbed memcg_oom_mutex) and
+    other processes that are blocked on the mutex. While A releases the
+    mutex and calls mem_cgroup_out_of_memory others will wake up (one
+    after another) and increase the counter and fall into sleep
+    (memcg_oom_waitq).
+
+    Once A finishes mem_cgroup_out_of_memory it takes the mutex again and
+    decreases oom_lock and wakes other tasks (if releasing memory by
+    somebody else - e.g. killed process - hasn't done it yet).
+
+    A testcase would look like:
+      Assume malloc XXX is a program allocating XXX Megabytes of memory
+      which touches all allocated pages in a tight loop
+      # swapoff SWAP_DEVICE
+      # cgcreate -g memory:A
+      # cgset -r memory.oom_control=0 A
+      # cgset -r memory.limit_in_bytes= 200M
+      # for i in `seq 100`
+      # do
+      #     cgexec -g memory:A malloc 10 &
+      # done
+
+    The main problem here is that all processes still race for the mutex
+    and there is no guarantee that we will get counter back to 0 for those
+    that got back to mem_cgroup_handle_oom. In the end the whole convoy
+    in/decreases the counter but we do not get to 1 that would enable
+    killing so nothing useful can be done. The time is basically unbounded
+    because it highly depends on scheduling and ordering on mutex (I have
+    seen this taking hours...).
+
+    This patch replaces the counter by a simple {un}lock semantic. As
+    mem_cgroup_oom_{un}lock works on the a subtree of a hierarchy we have
+    to make sure that nobody else races with us which is guaranteed by the
+    memcg_oom_mutex.
+
+    We have to be careful while locking subtrees because we can encounter
+    a subtree which is already locked: hierarchy:
+
+          A
+         / \
+        B   \
+       /\    \
+      C  D    E
+
+    B - C - D tree might be already locked. While we want to enable
+    locking E subtree because OOM situations cannot influence each other
+    we definitely do not want to allow locking A.
+
+    Therefore we have to refuse lock if any subtree is already locked and
+    clear up the lock for all nodes that have been set up to the failure
+    point.
+
+    On the other hand we have to make sure that the rest of the world will
+    recognize that a group is under OOM even though it doesn't have a lock.
+    Therefore we have to introduce under_oom variable which is incremented
+    and decremented for the whole subtree when we enter resp. leave
+    mem_cgroup_handle_oom. under_oom, unlike oom_lock, doesn't need be
+    updated under memcg_oom_mutex because its users only check a single
+    group and they use atomic operations for that.
+
+    This can be checked easily by the following test case:
+
+    # cgcreate -g memory:A
+    # cgset -r memory.use_hierarchy=1 A
+    # cgset -r memory.oom_control=1 A
+    # cgset -r memory.limit_in_bytes= 100M
+    # cgset -r memory.memsw.limit_in_bytes= 100M
+    # cgcreate -g memory:A/B
+    # cgset -r memory.oom_control=1 A/B
+    # cgset -r memory.limit_in_bytes=20M
+    # cgset -r memory.memsw.limit_in_bytes=20M
+    # cgexec -g memory:A/B malloc 30 &  #->this will be blocked by OOM of group B
+    # cgexec -g memory:A malloc 80 &    #->this will be blocked by OOM of group A
+
+    While B gets oom_lock A will not get it. Both of them go into sleep
+    and wait for an external action. We can make the limit higher for A
+    to enforce waking it up
+
+    # cgset -r memory.memsw.limit_in_bytes=300M A
+    # cgset -r memory.limit_in_bytes=300M A
+
+    malloc in A has to wake up even though it doesn't have oom_lock.
+
+    Finally, the unlock path is very easy because we always unlock only
+    the subtree we have locked previously while we always decrement
+    under_oom.
+
+    Signed-off-by: Michal Hocko <[email protected]>
+    Signed-off-by: KAMEZAWA Hiroyuki <[email protected]>
+    Cc: Balbir Singh <[email protected]>
+    Signed-off-by: Andrew Morton <[email protected]>
+    Signed-off-by: Linus Torvalds <[email protected]>
+
+diff --git a/mm/memcontrol.c b/mm/memcontrol.c
+index 8559966..95d6c25 100644
+--- a/mm/memcontrol.c
++++ b/mm/memcontrol.c
+@@ -246,7 +246,10 @@ struct mem_cgroup {
+ 	 * Should the accounting and control be hierarchical, per subtree?
+ 	 */
+ 	bool use_hierarchy;
+-	atomic_t	oom_lock;
++
++	bool		oom_lock;
++	atomic_t	under_oom;
++
+ 	atomic_t	refcnt;
+ 
+ 	int	swappiness;
+@@ -1722,37 +1725,83 @@ static int mem_cgroup_hierarchical_reclaim(struct mem_cgroup *root_mem,
+ /*
+  * Check OOM-Killer is already running under our hierarchy.
+  * If someone is running, return false.
++ * Has to be called with memcg_oom_mutex
+  */
+ static bool mem_cgroup_oom_lock(struct mem_cgroup *mem)
+ {
+-	int x, lock_count = 0;
+-	struct mem_cgroup *iter;
++	int lock_count = -1;
++	struct mem_cgroup *iter, *failed = NULL;
++	bool cond = true;
+ 
+-	for_each_mem_cgroup_tree(iter, mem) {
+-		x = atomic_inc_return(&iter->oom_lock);
+-		lock_count = max(x, lock_count);
++	for_each_mem_cgroup_tree_cond(iter, mem, cond) {
++		bool locked = iter->oom_lock;
++
++		iter->oom_lock = true;
++		if (lock_count == -1)
++			lock_count = iter->oom_lock;
++		else if (lock_count != locked) {
++			/*
++			 * this subtree of our hierarchy is already locked
++			 * so we cannot give a lock.
++			 */
++			lock_count = 0;
++			failed = iter;
++			cond = false;
++		}
+ 	}
+ 
+-	if (lock_count == 1)
+-		return true;
+-	return false;
++	if (!failed)
++		goto done;
++
++	/*
++	 * OK, we failed to lock the whole subtree so we have to clean up
++	 * what we set up to the failing subtree
++	 */
++	cond = true;
++	for_each_mem_cgroup_tree_cond(iter, mem, cond) {
++		if (iter == failed) {
++			cond = false;
++			continue;
++		}
++		iter->oom_lock = false;
++	}
++done:
++	return lock_count;
+ }
+ 
++/*
++ * Has to be called with memcg_oom_mutex
++ */
+ static int mem_cgroup_oom_unlock(struct mem_cgroup *mem)
+ {
+ 	struct mem_cgroup *iter;
+ 
++	for_each_mem_cgroup_tree(iter, mem)
++		iter->oom_lock = false;
++	return 0;
++}
++
++static void mem_cgroup_mark_under_oom(struct mem_cgroup *mem)
++{
++	struct mem_cgroup *iter;
++
++	for_each_mem_cgroup_tree(iter, mem)
++		atomic_inc(&iter->under_oom);
++}
++
++static void mem_cgroup_unmark_under_oom(struct mem_cgroup *mem)
++{
++	struct mem_cgroup *iter;
++
+ 	/*
+ 	 * When a new child is created while the hierarchy is under oom,
+ 	 * mem_cgroup_oom_lock() may not be called. We have to use
+ 	 * atomic_add_unless() here.
+ 	 */
+ 	for_each_mem_cgroup_tree(iter, mem)
+-		atomic_add_unless(&iter->oom_lock, -1, 0);
+-	return 0;
++		atomic_add_unless(&iter->under_oom, -1, 0);
+ }
+ 
+-
+ static DEFINE_MUTEX(memcg_oom_mutex);
+ static DECLARE_WAIT_QUEUE_HEAD(memcg_oom_waitq);
+ 
+@@ -1794,7 +1843,7 @@ static void memcg_wakeup_oom(struct mem_cgroup *mem)
+ 
+ static void memcg_oom_recover(struct mem_cgroup *mem)
+ {
+-	if (mem && atomic_read(&mem->oom_lock))
++	if (mem && atomic_read(&mem->under_oom))
+ 		memcg_wakeup_oom(mem);
+ }
+ 
+@@ -1812,6 +1861,8 @@ bool mem_cgroup_handle_oom(struct mem_cgroup *mem, gfp_t mask)
+ 	owait.wait.private = current;
+ 	INIT_LIST_HEAD(&owait.wait.task_list);
+ 	need_to_kill = true;
++	mem_cgroup_mark_under_oom(mem);
++
+ 	/* At first, try to OOM lock hierarchy under mem.*/
+ 	mutex_lock(&memcg_oom_mutex);
+ 	locked = mem_cgroup_oom_lock(mem);
+@@ -1835,10 +1886,13 @@ bool mem_cgroup_handle_oom(struct mem_cgroup *mem, gfp_t mask)
+ 		finish_wait(&memcg_oom_waitq, &owait.wait);
+ 	}
+ 	mutex_lock(&memcg_oom_mutex);
+-	mem_cgroup_oom_unlock(mem);
++	if (locked)
++		mem_cgroup_oom_unlock(mem);
+ 	memcg_wakeup_oom(mem);
+ 	mutex_unlock(&memcg_oom_mutex);
+ 
++	mem_cgroup_unmark_under_oom(mem);
++
+ 	if (test_thread_flag(TIF_MEMDIE) || fatal_signal_pending(current))
+ 		return false;
+ 	/* Give chance to dying process */
+@@ -4505,7 +4559,7 @@ static int mem_cgroup_oom_register_event(struct cgroup *cgrp,
+ 	list_add(&event->list, &memcg->oom_notify);
+ 
+ 	/* already in OOM ? */
+-	if (atomic_read(&memcg->oom_lock))
++	if (atomic_read(&memcg->under_oom))
+ 		eventfd_signal(eventfd, 1);
+ 	mutex_unlock(&memcg_oom_mutex);
+ 
+@@ -4540,7 +4594,7 @@ static int mem_cgroup_oom_control_read(struct cgroup *cgrp,
+ 
+ 	cb->fill(cb, "oom_kill_disable", mem->oom_kill_disable);
+ 
+-	if (atomic_read(&mem->oom_lock))
++	if (atomic_read(&mem->under_oom))
+ 		cb->fill(cb, "under_oom", 1);
+ 	else
+ 		cb->fill(cb, "under_oom", 0);
+commit 1d65f86db14806cf7b1218c7b4ecb8b4db5af27d
+Author: KAMEZAWA Hiroyuki <[email protected]>
+Date:   Mon Jul 25 17:12:27 2011 -0700
+
+    mm: preallocate page before lock_page() at filemap COW
+
+    Currently we are keeping faulted page locked throughout whole
+    __do_fault call (except for page_mkwrite code path) after calling file
+    system's fault code. If we do early COW, we allocate a new page which
+    has to be charged for a memcg (mem_cgroup_newpage_charge).
+
+    This function, however, might block for unbounded amount of time if
+    memcg oom killer is disabled or fork-bomb is running because the only
+    way out of the OOM situation is either an external event or
+    OOM-situation fix.
+
+    In the end we are keeping the faulted page locked and blocking other
+    processes from faulting it in which is not good at all because we are
+    basically punishing potentially an unrelated process for OOM condition
+    in a different group (I have seen stuck system because of ld-2.11.1.so
+    being locked).
+
+    We can do test easily.
+
+     % cgcreate -g memory:A
+     % cgset -r memory.limit_in_bytes=64M A
+     % cgset -r memory.memsw.limit_in_bytes=64M A
+     % cd kernel_dir; cgexec -g memory:A make -j
+
+    Then, the whole system will live-locked until you kill 'make -j'
+    by hands (or push reboot...) This is because some important page in a
+    a shared library are locked.
+
+    Considering again, the new page is not necessary to be allocated
+    with lock_page() held. And usual page allocation may dive into
+    long memory reclaim loop with holding lock_page() and can cause
+    very long latency.
+
+    There are 3 ways.
+      1. do allocation/charge before lock_page()
+         Pros. - simple and can handle page allocation in the same manner.
+           This will reduce holding time of lock_page() in general.
+         Cons. - we do page allocation even if ->fault() returns error.
+
+      2. do charge after unlock_page(). Even if charge fails, it's just OOM.
+         Pros. - no impact to non-memcg path.
+         Cons. - implemenation requires special cares of LRU and we need
+           to modify page_add_new_anon_rmap()...
+
+      3. do unlock->charge->lock again method.
+         Pros. - no impact to non-memcg path.
+         Cons. - This may kill LOCK_PAGE_RETRY optimization. We need to
+           release lock and get it again...
+
+    This patch moves "charge" and memory allocation for COW page
+    before lock_page(). Then, we can avoid scanning LRU with holding
+    a lock on a page and latency under lock_page() will be reduced.
+
+    Then, above livelock disappears.
+
+    [[email protected]: fix code layout]
+    Signed-off-by: KAMEZAWA Hiroyuki <[email protected]>
+    Reported-by: Lutz Vieweg <[email protected]>
+    Original-idea-by: Michal Hocko <[email protected]>
+    Cc: Michal Hocko <[email protected]>
+    Cc: Ying Han <[email protected]>
+    Cc: Johannes Weiner <[email protected]>
+    Cc: Daisuke Nishimura <[email protected]>
+    Signed-off-by: Andrew Morton <[email protected]>
+    Signed-off-by: Linus Torvalds <[email protected]>
+
+diff --git a/mm/memory.c b/mm/memory.c
+index a58bbeb..3c9f3aa 100644
+--- a/mm/memory.c
++++ b/mm/memory.c
+@@ -3093,14 +3093,34 @@ static int __do_fault(struct mm_struct *mm, struct vm_area_struct *vma,
+ 	pte_t *page_table;
+ 	spinlock_t *ptl;
+ 	struct page *page;
++	struct page *cow_page;
+ 	pte_t entry;
+ 	int anon = 0;
+-	int charged = 0;
+ 	struct page *dirty_page = NULL;
+ 	struct vm_fault vmf;
+ 	int ret;
+ 	int page_mkwrite = 0;
+ 
++	/*
++	 * If we do COW later, allocate page befor taking lock_page()
++	 * on the file cache page. This will reduce lock holding time.
++	 */
++	if ((flags & FAULT_FLAG_WRITE) && !(vma->vm_flags & VM_SHARED)) {
++
++		if (unlikely(anon_vma_prepare(vma)))
++			return VM_FAULT_OOM;
++
++		cow_page = alloc_page_vma(GFP_HIGHUSER_MOVABLE, vma, address);
++		if (!cow_page)
++			return VM_FAULT_OOM;
++
++		if (mem_cgroup_newpage_charge(cow_page, mm, GFP_KERNEL)) {
++			page_cache_release(cow_page);
++			return VM_FAULT_OOM;
++		}
++	} else
++		cow_page = NULL;
++
+ 	vmf.virtual_address = (void __user *)(address & PAGE_MASK);
+ 	vmf.pgoff = pgoff;
+ 	vmf.flags = flags;
+@@ -3109,12 +3129,13 @@ static int __do_fault(struct mm_struct *mm, struct vm_area_struct *vma,
+ 	ret = vma->vm_ops->fault(vma, &vmf);
+ 	if (unlikely(ret & (VM_FAULT_ERROR | VM_FAULT_NOPAGE |
+ 			    VM_FAULT_RETRY)))
+-		return ret;
++		goto uncharge_out;
+ 
+ 	if (unlikely(PageHWPoison(vmf.page))) {
+ 		if (ret & VM_FAULT_LOCKED)
+ 			unlock_page(vmf.page);
+-		return VM_FAULT_HWPOISON;
++		ret = VM_FAULT_HWPOISON;
++		goto uncharge_out;
+ 	}
+ 
+ 	/*
+@@ -3132,23 +3153,8 @@ static int __do_fault(struct mm_struct *mm, struct vm_area_struct *vma,
+ 	page = vmf.page;
+ 	if (flags & FAULT_FLAG_WRITE) {
+ 		if (!(vma->vm_flags & VM_SHARED)) {
++			page = cow_page;
+ 			anon = 1;
+-			if (unlikely(anon_vma_prepare(vma))) {
+-				ret = VM_FAULT_OOM;
+-				goto out;
+-			}
+-			page = alloc_page_vma(GFP_HIGHUSER_MOVABLE,
+-						vma, address);
+-			if (!page) {
+-				ret = VM_FAULT_OOM;
+-				goto out;
+-			}
+-			if (mem_cgroup_newpage_charge(page, mm, GFP_KERNEL)) {
+-				ret = VM_FAULT_OOM;
+-				page_cache_release(page);
+-				goto out;
+-			}
+-			charged = 1;
+ 			copy_user_highpage(page, vmf.page, address, vma);
+ 			__SetPageUptodate(page);
+ 		} else {
+@@ -3217,8 +3223,8 @@ static int __do_fault(struct mm_struct *mm, struct vm_area_struct *vma,
+ 		/* no need to invalidate: a not-present page won't be cached */
+ 		update_mmu_cache(vma, address, page_table);
+ 	} else {
+-		if (charged)
+-			mem_cgroup_uncharge_page(page);
++		if (cow_page)
++			mem_cgroup_uncharge_page(cow_page);
+ 		if (anon)
+ 			page_cache_release(page);
+ 		else
+@@ -3227,7 +3233,6 @@ static int __do_fault(struct mm_struct *mm, struct vm_area_struct *vma,
+ 
+ 	pte_unmap_unlock(page_table, ptl);
+ 
+-out:
+ 	if (dirty_page) {
+ 		struct address_space *mapping = page->mapping;
+ 
+@@ -3257,6 +3262,13 @@ out:
+ unwritable_page:
+ 	page_cache_release(page);
+ 	return ret;
++uncharge_out:
++	/* fs's fault handler get error */
++	if (cow_page) {
++		mem_cgroup_uncharge_page(cow_page);
++		page_cache_release(cow_page);
++	}
++	return ret;
+ }
+ 
+ static int do_linear_fault(struct mm_struct *mm, struct vm_area_struct *vma,
================================================================
Index: packages/kernel/kernel.spec
diff -u packages/kernel/kernel.spec:1.924.2.10 packages/kernel/kernel.spec:1.924.2.11
--- packages/kernel/kernel.spec:1.924.2.10	Wed Oct 5 21:39:08 2011
+++ packages/kernel/kernel.spec	Thu Nov 10 06:57:40 2011
@@ -95,7 +95,7 @@
 %define		basever		2.6.38
 %define		postver		.8
-%define		rel		5
+%define		rel		6
 
 %define		_enable_debug_packages	0
 
@@ -1577,6 +1577,9 @@
 All persons listed below can be reached at <cvs_login>@pld-linux.org
 
 $Log$
+Revision 1.924.2.11  2011/11/10 05:57:40  arekm
+- rel 6; cgroup memory limit fixes
+
 Revision 1.924.2.10  2011/10/05 19:39:08  glen
 - patch-2.5.9-3.amd64 fails to apply grsec patch, 2.6.1 applies fine, so
   assume 2.6.0 is needed

================================================================

---- CVS-web:
    http://cvs.pld-linux.org/cgi-bin/cvsweb.cgi/packages/kernel/kernel-small_fixes.patch?r1=1.25.2.7&r2=1.25.2.8&f=u
    http://cvs.pld-linux.org/cgi-bin/cvsweb.cgi/packages/kernel/kernel.spec?r1=1.924.2.10&r2=1.924.2.11&f=u

_______________________________________________
pld-cvs-commit mailing list
[email protected]
http://lists.pld-linux.org/mailman/listinfo/pld-cvs-commit
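For anyone reproducing the testcases quoted in the commit messages above: they assume a helper program called "malloc" that allocates the given number of megabytes and touches every allocated page in a tight loop. A minimal userspace sketch of such a helper follows; the program name and the megabyte argument come from the quoted description, while the argument parsing, page-size handling and endless touch loop are assumptions of this sketch, not part of the commits themselves.

    /*
     * Sketch of the "malloc NNN" test helper: allocate NNN megabytes and
     * keep touching every page so the memory stays resident and the memcg
     * limit is eventually hit.
     */
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>

    int main(int argc, char **argv)
    {
    	size_t mb = argc > 1 ? strtoul(argv[1], NULL, 10) : 1;
    	size_t len = mb << 20;
    	size_t page = (size_t)sysconf(_SC_PAGESIZE);
    	char *buf = malloc(len);

    	if (!buf) {
    		perror("malloc");
    		return 1;
    	}
    	/* touch all allocated pages in a tight loop, forever */
    	for (;;) {
    		size_t off;
    		for (off = 0; off < len; off += page)
    			buf[off] = 1;
    	}
    	return 0;
    }

Built with e.g. "gcc -o malloc malloc.c", it can stand in for the "malloc 10", "malloc 30" and "malloc 80" invocations in the cgexec lines of the quoted testcases.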
