If process forks some children when it has is_child_subreaper flag enabled they will inherit has_child_subreaper flag - first group, when is_child_subreaper is disabled forked children will not inherit it - second group. So child-subreaper does not reparent all his descendants when their parents die. Having these two differently behaving groups can lead to confusion. Also it is a problem for CRIU, as when we restore process tree we need to somehow determine which descendants belong to which group and much harder - to put them exactly to these group.
To simplify these we can add a propagation of has_child_subreaper flag on PR_SET_CHILD_SUBREAPER, walking all descendants of child- subreaper to setup has_child_subreaper flag. In common cases when process like systemd first sets itself to be a child-subreaper and only after that forks its services, we will have zero-length list of descendants to walk. Testing with binary subtree of 2^15 processes prctl took < 0.007 sec and has shown close to linear dependency(~0.2 * n * usec) on lower numbers of processes. Using csr_descendant list to collect descendants and do tree walk without recursion. Optimize: a) When descendant already has has_child_subreaper flag all his subtree has it too already. b) When some descendant is child_reaper, it's subtree is in different pidns from us(original child-subreaper) and processes from other pidns will never reparent to us. So we can skip their(a,b) subtree from walk. Signed-off-by: Pavel Tikhomirov <[email protected]> --- include/linux/sched.h | 2 ++ kernel/sys.c | 51 ++++++++++++++++++++++++++++++++++++++++++++++++++- 2 files changed, 52 insertions(+), 1 deletion(-) diff --git a/include/linux/sched.h b/include/linux/sched.h index 4d19052..9cb44c4 100644 --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -1715,6 +1715,8 @@ struct task_struct { struct signal_struct *signal; struct sighand_struct *sighand; + struct list_head csr_descendant; + sigset_t blocked, real_blocked; sigset_t saved_sigmask; /* restored if set_restore_sigmask() was used */ struct sigpending pending; diff --git a/kernel/sys.c b/kernel/sys.c index 842914e..05b6d7d 100644 --- a/kernel/sys.c +++ b/kernel/sys.c @@ -2063,6 +2063,55 @@ static int prctl_get_tid_address(struct task_struct *me, int __user **tid_addr) } #endif +static DEFINE_SPINLOCK(descendants_lock); + +static void prctl_set_child_subreaper(struct task_struct *reaper, bool arg2) +{ + LIST_HEAD(descendants); + + reaper->signal->is_child_subreaper = arg2; + if (!arg2) + return; + + spin_lock(&descendants_lock); + read_lock(&tasklist_lock); + + list_add(&reaper->csr_descendant, &descendants); + + while (!list_empty(&descendants)) { + struct task_struct *tsk; + struct task_struct *p; + + tsk = list_first_entry(&descendants, struct task_struct, + csr_descendant); + + list_for_each_entry(p, &tsk->children, sibling) { + /* + * If tsk has has_child_subreaper - all its decendants + * already have these flag too and new decendants will + * inherit it on fork, so nothing to be done here. + */ + if (p->signal->has_child_subreaper) + continue; + + /* + * If we've found child_reaper - skip descendants in + * it's subtree as they will never get out pidns + */ + if (is_child_reaper(task_pid(p))) + continue; + + p->signal->has_child_subreaper = 1; + list_add(&p->csr_descendant, &descendants); + } + + list_del_init(&tsk->csr_descendant); + } + + read_unlock(&tasklist_lock); + spin_unlock(&descendants_lock); +} + SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3, unsigned long, arg4, unsigned long, arg5) { @@ -2213,7 +2262,7 @@ SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3, error = prctl_get_tid_address(me, (int __user **)arg2); break; case PR_SET_CHILD_SUBREAPER: - me->signal->is_child_subreaper = !!arg2; + prctl_set_child_subreaper(me, !!arg2); break; case PR_GET_CHILD_SUBREAPER: error = put_user(me->signal->is_child_subreaper, -- 2.9.3

