[Devel] [PATCH RH7 2/9] mm: memcontrol: fix cgroup creation failure after many small jobs

Pavel Tikhomirov Tue, 04 Jul 2023 23:40:53 -0700

From: Johannes Weiner <[email protected]>

The memory controller has quite a bit of state that usually outlives the
cgroup and pins its CSS until said state disappears.  At the same time
it imposes a 16-bit limit on the CSS ID space to economically store IDs
in the wild.  Consequently, when we use cgroups to contain frequent but
small and short-lived jobs that leave behind some page cache, we quickly
run into the 64k limitations of outstanding CSSs.  Creating a new cgroup
fails with -ENOSPC while there are only a few, or even no user-visible
cgroups in existence.


Although pinning CSSs past cgroup removal is common, there are only two
instances that actually need an ID after a cgroup is deleted: cache
shadow entries and swapout records.

Cache shadow entries reference the ID weakly and can deal with the CSS
having disappeared when it's looked up later.  They pose no hurdle.

Swap-out records do need to pin the css to hierarchically attribute
swapins after the cgroup has been deleted; though the only pages that
remain swapped out after offlining are tmpfs/shmem pages.  And those
references are under the user's control, so they are manageable.

This patch introduces a private 16-bit memcg ID and switches swap and
cache shadow entries over to using that.  This ID can then be recycled
after offlining when the CSS remains pinned only by objects that don't
specifically need it.

This script demonstrates the problem by faulting one cache page in a new
cgroup and deleting it again:

  set -e
  mkdir -p pages
  for x in `seq 128000`; do
    [ $((x % 1000)) -eq 0 ] && echo $x
    mkdir /cgroup/foo
    echo $$ >/cgroup/foo/cgroup.procs
    echo trex >pages/$x
    echo $$ >/cgroup/cgroup.procs
    rmdir /cgroup/foo
  done

When run on an unpatched kernel, we eventually run out of possible IDs
even though there are no visible cgroups:

  [root@ham ~]# ./cssidstress.sh
  [...]
  65000
  mkdir: cannot create directory '/cgroup/foo': No space left on device

After this patch, the IDs get released upon cgroup destruction and the
cache and css objects get released once memory reclaim kicks in.

[[email protected]: init the IDR]
  Link: http://lkml.kernel.org/r/[email protected]
Fixes: b2052564e66d ("mm: memcontrol: continue cache reclaim from offlined groups")
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Johannes Weiner <[email protected]>
Reported-by: John Garcia <[email protected]>
Reviewed-by: Vladimir Davydov <[email protected]>
Acked-by: Tejun Heo <[email protected]>
Cc: Nikolay Borisov <[email protected]>
Cc: <[email protected]>    [3.19+]
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>

Changes:
- RHEL had already partially ported this patch, so some hunks are
  already present
- remove the synchronize_rcu() calls from the RHEL version
- skip the mem_cgroup_try_charge_swap hunk
- the original patch has no "if (!cont->parent) return 0;" early return
  in mem_cgroup_css_online; take the memcg->id.id reference before that
  return so as not to break refcounting on cgroups without a parent
  (e.g. root_mem_cgroup)

https://jira.vzint.dev/browse/PSBM-147036

(cherry picked from commit 73f576c04b9410ed19660f74f97521bee6e1c546)
Signed-off-by: Pavel Tikhomirov <[email protected]>
---
 mm/memcontrol.c | 62 +++++++++++++++++++++++++++++++++----------------
 1 file changed, 42 insertions(+), 20 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index a334e9a1a311..6356b6532163 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -176,6 +176,11 @@ enum mem_cgroup_events_target {
 #define SOFTLIMIT_EVENTS_TARGET 1024
 #define NUMAINFO_EVENTS_TARGET 1024
 
+struct mem_cgroup_id {
+       int id;
+       atomic_t ref;
+};
+
 static void mem_cgroup_id_put(struct mem_cgroup *memcg);
 
 struct mem_cgroup_stat_cpu {
@@ -302,7 +307,7 @@ struct mem_cgroup {
        struct cgroup_subsys_state css;
 
        /* Private memcg ID. Used to ID objects that outlive the cgroup */
-       unsigned short id;
+       struct mem_cgroup_id id;
 
        /*
         * the counter to account for memory usage
@@ -6665,14 +6670,24 @@ unsigned short mem_cgroup_id(struct mem_cgroup *memcg)
 {
        if (mem_cgroup_disabled())
                return 0;
-       return memcg->id;
+
+       return memcg->id.id;
+}
+
+static void mem_cgroup_id_get(struct mem_cgroup *memcg)
+{
+       atomic_inc(&memcg->id.ref);
 }
 
 static void mem_cgroup_id_put(struct mem_cgroup *memcg)
 {
-       idr_remove(&mem_cgroup_idr, memcg->id);
-       memcg->id = 0;
-       synchronize_rcu();
+       if (atomic_dec_and_test(&memcg->id.ref)) {
+               idr_remove(&mem_cgroup_idr, memcg->id.id);
+               memcg->id.id = 0;
+
+               /* Memcg ID pins CSS */
+               css_put(&memcg->css);
+       }
 }
 
 /**
@@ -6726,7 +6741,6 @@ static struct mem_cgroup *mem_cgroup_alloc(void)
 {
        struct mem_cgroup *memcg;
        size_t size;
-       int id;
        int i, ret;
        int node;
 
@@ -6737,14 +6751,12 @@ static struct mem_cgroup *mem_cgroup_alloc(void)
        if (!memcg)
                return NULL;
 
-       id = idr_alloc(&mem_cgroup_idr, NULL,
+       memcg->id.id = idr_alloc(&mem_cgroup_idr, NULL,
                       1, MEM_CGROUP_ID_MAX,
                       GFP_KERNEL);
-       if (id < 0)
+       if (memcg->id.id < 0)
                goto fail;
 
-       memcg->id = id;
-
        memcg->stat = alloc_percpu(struct mem_cgroup_stat_cpu);
        if (!memcg->stat)
                goto out_free;
@@ -6759,8 +6771,7 @@ static struct mem_cgroup *mem_cgroup_alloc(void)
                        goto out_pcpu_free;
        }
        spin_lock_init(&memcg->pcp_counter_lock);
-       idr_replace(&mem_cgroup_idr, memcg, memcg->id);
-       synchronize_rcu();
+       idr_replace(&mem_cgroup_idr, memcg, memcg->id.id);
        return memcg;
 
 out_pcpu_free:
@@ -6772,10 +6783,8 @@ static struct mem_cgroup *mem_cgroup_alloc(void)
        for_each_node(node)
                free_mem_cgroup_per_zone_info(memcg, node);
 
-       if (memcg->id > 0) {
-               idr_remove(&mem_cgroup_idr, memcg->id);
-               synchronize_rcu();
-       }
+       if (memcg->id.id > 0)
+               idr_remove(&mem_cgroup_idr, memcg->id.id);
 fail:
        kfree(memcg);
        return NULL;
@@ -6918,11 +6927,18 @@ mem_cgroup_css_online(struct cgroup *cont)
 {
        struct mem_cgroup *memcg, *parent;
 
-       if (!cont->parent)
-               return 0;
-
        mutex_lock(&memcg_create_mutex);
        memcg = mem_cgroup_from_cont(cont);
+
+       /* Online state pins memcg ID, memcg ID pins CSS */
+       mem_cgroup_id_get(memcg);
+       css_get(&memcg->css);
+
+       if (!cont->parent) {
+               mutex_unlock(&memcg_create_mutex);
+               return 0;
+       }
+
        parent = mem_cgroup_from_cont(cont->parent);
 
        memcg->use_hierarchy = parent->use_hierarchy;
@@ -7038,6 +7054,8 @@ static void mem_cgroup_css_offline(struct cgroup *cont)
         * no longer iterate over it.
         */
        release_oom_context(&memcg->oom_ctx);
+
+       mem_cgroup_id_put(memcg);
 }
 
 static void mem_cgroup_css_free(struct cgroup *cont)
@@ -7760,6 +7778,7 @@ void mem_cgroup_swapout(struct page *page, swp_entry_t entry)
        VM_BUG_ON_PAGE(!(pc->flags & PCG_MEMSW), page);
        memcg = pc->mem_cgroup;
 
+       mem_cgroup_id_get(memcg);
        oldid = swap_cgroup_record(entry, mem_cgroup_id(memcg));
        VM_BUG_ON_PAGE(oldid, page);
        mem_cgroup_swap_statistics(memcg, true);
@@ -7775,6 +7794,9 @@ void mem_cgroup_swapout(struct page *page, swp_entry_t entry)
 
        mem_cgroup_charge_statistics(memcg, page, -1);
        memcg_check_events(memcg, page);
+
+       if (!mem_cgroup_is_root(memcg))
+               css_put(&memcg->css);
 }
 
 /**
@@ -7799,7 +7821,7 @@ void mem_cgroup_uncharge_swap(swp_entry_t entry)
                        page_counter_uncharge(&memcg->memsw, 1);
                mem_cgroup_swap_statistics(memcg, false);
                this_cpu_inc(memcg->stat->events[MEM_CGROUP_EVENTS_PSWPIN]);
-               css_put(&memcg->css);
+               mem_cgroup_id_put(memcg);
        }
        rcu_read_unlock();
 }
-- 
2.40.1
