On Wed 27-09-17 14:09:34, Roman Gushchin wrote:
> Traditionally, the OOM killer is operating on a process level.
> Under oom conditions, it finds a process with the highest oom score
> and kills it.
> 
> This behavior doesn't suit well the system with many running
> containers:
> 
> 1) There is no fairness between containers. A small container with
> few large processes will be chosen over a large one with huge
> number of small processes.
> 
> 2) Containers often do not expect that some random process inside
> will be killed. In many cases much safer behavior is to kill
> all tasks in the container. Traditionally, this was implemented
> in userspace, but doing it in the kernel has some advantages,
> especially in a case of a system-wide OOM.
> 
> To address these issues, the cgroup-aware OOM killer is introduced.
> 
> Under OOM conditions, it looks for the biggest memory consumer:
> a leaf memory cgroup or a memory cgroup with the memory.oom_group
> option set. Then it kills either a task with the biggest memory
> footprint, either all belonging tasks, if memory.oom_group is set.
> If a cgroup has memory.oom_group set, all descendant cgroups
> implicitly inherit the memory.oom_group setting.

I think it would be better to separate oom_group into its own patch.
So this patch would just add the cgroup awareness and oom_group will
build on top of that.

Wrt. to the implicit inheritance you brought up in a separate email
thread [1]. Let me quote
: after some additional thinking I don't think anymore that implicit
: propagation of oom_group is a good idea.  Let me explain: assume we
: have memcg A with memory.max and memory.oom_group set, and nested
: memcg A/B with memory.max set. Let's imagine we have an OOM event if
: A/B. What is an expected system behavior?
: We have OOM scoped to A/B, and any action should be also scoped to A/B.
: We really shouldn't touch processes which are not belonging to A/B.
: That means we should either kill the biggest process in A/B, either all
: processes in A/B. It's natural to make A/B/memory.oom_group responsible
: for this decision. It's strange to make the depend on A/memory.oom_group, IMO.
: It really makes no sense, and makes oom_group knob really hard to describe.
: 
: Also, after some off-list discussion, we've realized that memory.oom_knob
: should be delegatable. The workload should have control over it to express
: dependency between processes.

OK, I have asked about this already but I am not sure the answer was
very explicit. So let me ask again. When exactly a subtree would
disagree with the parent on oom_group? In other words when do we want a
different cleanup based on the OOM root? I am not saying this is wrong
I am just curious about a practical example.

> Tasks with oom_score_adj set to -1000 are considered as unkillable.
> 
> The root cgroup is treated as a leaf memory cgroup, so it's score
> is compared with other leaf and oom_group memory cgroups.
> The oom_group option is not supported for the root cgroup.
> Due to memcg statistics implementation a special algorithm
> is used for estimating root cgroup oom_score: we define it
> as maximum oom_score of the belonging tasks.

[1] http://lkml.kernel.org/r/[email protected]

[...]
> +static long memcg_oom_badness(struct mem_cgroup *memcg,
> +                           const nodemask_t *nodemask,
> +                           unsigned long totalpages)
> +{
> +     long points = 0;
> +     int nid;
> +     pg_data_t *pgdat;
> +
> +     /*
> +      * We don't have necessary stats for the root memcg,
> +      * so we define it's oom_score as the maximum oom_score
> +      * of the belonging tasks.
> +      */

Why not a sum of all tasks which would more resemble what we do for
other memcgs? Sure this would require ignoring oom_score_adj so
oom_badness would have to be tweaked a bit (basically split it into
__oom_badness which calculates the value without the bias and
oom_badness on top adding the bias on top of the scaled value).

> +     if (memcg == root_mem_cgroup) {
> +             struct css_task_iter it;
> +             struct task_struct *task;
> +             long score, max_score = 0;
> +
> +             css_task_iter_start(&memcg->css, 0, &it);
> +             while ((task = css_task_iter_next(&it))) {
> +                     score = oom_badness(task, memcg, nodemask,
> +                                         totalpages);
> +                     if (score > max_score)
> +                             max_score = score;
> +             }
> +             css_task_iter_end(&it);
> +
> +             return max_score;
> +     }
> +
> +     for_each_node_state(nid, N_MEMORY) {
> +             if (nodemask && !node_isset(nid, *nodemask))
> +                     continue;
> +
> +             points += mem_cgroup_node_nr_lru_pages(memcg, nid,
> +                             LRU_ALL_ANON | BIT(LRU_UNEVICTABLE));
> +
> +             pgdat = NODE_DATA(nid);
> +             points += lruvec_page_state(mem_cgroup_lruvec(pgdat, memcg),
> +                                         NR_SLAB_UNRECLAIMABLE);
> +     }
> +
> +     points += memcg_page_state(memcg, MEMCG_KERNEL_STACK_KB) /
> +             (PAGE_SIZE / 1024);
> +     points += memcg_page_state(memcg, MEMCG_SOCK);
> +     points += memcg_page_state(memcg, MEMCG_SWAP);
> +
> +     return points;
> +}
> +
> +/*
> + * Checks if the given memcg is a valid OOM victim and returns a number,
> + * which means the folowing:
> + *   -1: there are inflight OOM victim tasks, belonging to the memcg
> + *    0: memcg is not eligible, e.g. all belonging tasks are protected
> + *       by oom_score_adj set to OOM_SCORE_ADJ_MIN
> + *   >0: memcg is eligible, and the returned value is an estimation
> + *       of the memory footprint
> + */
> +static long oom_evaluate_memcg(struct mem_cgroup *memcg,
> +                            const nodemask_t *nodemask,
> +                            unsigned long totalpages)
> +{
> +     struct css_task_iter it;
> +     struct task_struct *task;
> +     int eligible = 0;
> +
> +     /*
> +      * Memcg is OOM eligible if there are OOM killable tasks inside.
> +      *
> +      * We treat tasks with oom_score_adj set to OOM_SCORE_ADJ_MIN
> +      * as unkillable.
> +      *
> +      * If there are inflight OOM victim tasks inside the memcg,
> +      * we return -1.
> +      */
> +     css_task_iter_start(&memcg->css, 0, &it);
> +     while ((task = css_task_iter_next(&it))) {
> +             if (!eligible &&
> +                 task->signal->oom_score_adj != OOM_SCORE_ADJ_MIN)
> +                     eligible = 1;
> +
> +             if (tsk_is_oom_victim(task) &&
> +                 !test_bit(MMF_OOM_SKIP, &task->signal->oom_mm->flags)) {
> +                     eligible = -1;
> +                     break;
> +             }
> +     }
> +     css_task_iter_end(&it);
> +
> +     if (eligible <= 0)
> +             return eligible;
> +
> +     return memcg_oom_badness(memcg, nodemask, totalpages);
> +}
> +
> +static void select_victim_memcg(struct mem_cgroup *root, struct oom_control 
> *oc)
> +{
> +     struct mem_cgroup *iter, *parent;
> +
> +     /*
> +      * If OOM is memcg-wide, and the memcg or it's ancestor has
> +      * the oom_group flag, simple select the memcg as a victim.
> +      */
> +     if (oc->memcg && mem_cgroup_oom_group(oc->memcg)) {
> +             oc->chosen_memcg = oc->memcg;
> +             css_get(&oc->chosen_memcg->css);
> +             oc->chosen_points = oc->memcg->oom_score;
> +             return;
> +     }
> +
> +     oc->chosen_memcg = NULL;
> +
> +     /*
> +      * The oom_score is calculated for leaf memcgs and propagated upwards
> +      * by the tree.
> +      *
> +      * for_each_mem_cgroup_tree() walks the tree in pre-order,
> +      * so we simple reset oom_score for non-lead cgroups before
> +      * starting accumulating an actual value from underlying sub-tree.
> +      *
> +      * Root memcg is treated as a leaf memcg.
> +      */
> +     rcu_read_lock();
> +     for_each_mem_cgroup_tree(iter, root) {
> +             if (memcg_has_children(iter) && iter != root_mem_cgroup) {
> +                     iter->oom_score = 0;
> +                     continue;
> +             }
> +
> +             iter->oom_score = oom_evaluate_memcg(iter, oc->nodemask,
> +                                                  oc->totalpages);
> +
> +             /*
> +              * Ignore empty and non-eligible memory cgroups.
> +              */
> +             if (iter->oom_score == 0)
> +                     continue;
> +
> +             /*
> +              * If there are inflight OOM victims, we don't need to look
> +              * further for new victims.
> +              */
> +             if (iter->oom_score == -1) {
> +                     oc->chosen_memcg = INFLIGHT_VICTIM;
> +                     mem_cgroup_iter_break(root, iter);
> +                     break;
> +             }
> +
> +             if (iter->oom_score > oc->chosen_points) {
> +                     oc->chosen_memcg = iter;
> +                     oc->chosen_points = iter->oom_score;
> +             }
> +
> +             for (parent = parent_mem_cgroup(iter); parent && parent != root;
> +                  parent = parent_mem_cgroup(parent)) {
> +                     parent->oom_score += iter->oom_score;
> +
> +                     if (mem_cgroup_oom_group(parent) &&
> +                         parent->oom_score > oc->chosen_points) {
> +                             oc->chosen_memcg = parent;
> +                             oc->chosen_points = parent->oom_score;
> +                     }
> +             }
> +     }
> +
> +     if (oc->chosen_memcg && oc->chosen_memcg != INFLIGHT_VICTIM)
> +             css_get(&oc->chosen_memcg->css);
> +
> +     rcu_read_unlock();
> +}


As I've written in a private email, things will get much easier if you
get rid of memcg->oom_score and simply do the recursive oom_score
evaluation of eligible inter nodes. You would basically do
        for_each_mem_cgroup_tree(root, iter) {
                if (!memcg_oom_eligible(iter))
                        continue;

                oom_score = oom_evaluate_memcg(iter, mask);
                if (oom_score == -1) {
                        oc->chosen_memcg = INFLIGHT_VICTIM;
                        mem_cgroup_iter_break(root, iter);
                        break;
                }
                if (oom_score > oc->chosen_points) {
                        mark_new_oom_memcg(iter);
                }

                /* potential optimization to skip the whole subtree if
                 * iter is not leaf */
        }

where
bool memcg_oom_eligible(struct mem_cgroup *memcg)
{
        if (cgroup_has_tasks(memcg->css.cgroup))
                return true;
        if (mem_cgroup_oom_group(memcg))
                return true;
        return false;
}

unsigned long __oom_evaluate_memcg(struct mem_cgroup *memcg, mask)
{
        /* check eligible tasks - oom victims OOM_SCORE_ADJ_MIN */
        /* calculate badness */
}

unsigned long oom_evaluate_memcg(struct mem_cgroup *memcg, mask)
{
        unsigned long score = 0;

        if (memcg == root_mem_cgroup) {
                for_each_task()
                        score += __oom_badness(task, mask);
                return score
        }

        for_each_mem_cgroup_tree(memcg, iter) {
                unsigned long memcg_score = __oom_evaluate_memcg(iter, mask);
                if (memcg_score == -1) {
                        mem_cgroup_iter_break(memcg, iter);
                        return -1;
                }
        }

        return score;
}

This should be also simple to split for oom_group in a separate patch
while keeping the overall code structure.
Does this make any sense to you?

[...]
> @@ -962,6 +968,48 @@ static void oom_kill_process(struct oom_control *oc, 
> const char *message)
>       __oom_kill_process(victim);
>  }
>  
> +static int oom_kill_memcg_member(struct task_struct *task, void *unused)
> +{
> +     if (!tsk_is_oom_victim(task)) {

How can this happen?

> +             get_task_struct(task);
> +             __oom_kill_process(task);
> +     }
> +     return 0;
> +}
> +
> +static bool oom_kill_memcg_victim(struct oom_control *oc)
> +{
> +     static DEFINE_RATELIMIT_STATE(oom_rs, DEFAULT_RATELIMIT_INTERVAL,
> +                                   DEFAULT_RATELIMIT_BURST);
> +
> +     if (oc->chosen_memcg == NULL || oc->chosen_memcg == INFLIGHT_VICTIM)
> +             return oc->chosen_memcg;
> +
> +     /* Always begin with the task with the biggest memory footprint */
> +     oc->chosen_points = 0;
> +     oc->chosen_task = NULL;
> +     mem_cgroup_scan_tasks(oc->chosen_memcg, oom_evaluate_task, oc);
> +
> +     if (oc->chosen_task == NULL || oc->chosen_task == INFLIGHT_VICTIM)
> +             goto out;
> +
> +     if (__ratelimit(&oom_rs))
> +             dump_header(oc, oc->chosen_task);

Hmm, does the full dump_header really apply for the new heuristic? E.g.
does it make sense to dump_tasks()? Would it make sense to print stats
of all eligible memcgs instead?

> +
> +     __oom_kill_process(oc->chosen_task);
> +
> +     /* If oom_group flag is set, kill all belonging tasks */
> +     if (mem_cgroup_oom_group(oc->chosen_memcg))
> +             mem_cgroup_scan_tasks(oc->chosen_memcg, oom_kill_memcg_member,
> +                                   NULL);
> +
> +     schedule_timeout_killable(1);

I would prefer if we had this timeout at a single place in
out_of_memory()

Other than that the semantic (sans oom_group which needs more
clarification) makes sense to me.
-- 
Michal Hocko
SUSE Labs
--
To unsubscribe from this list: send the line "unsubscribe linux-doc" in
the body of a message to [email protected]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Reply via email to