On Wed, Jul 17, 2019 at 04:47:58PM -0400, Joel Fernandes wrote: > On Wed, Jul 17, 2019 at 11:09:59AM -0700, Suren Baghdasaryan wrote: > > On Wed, Jul 17, 2019 at 10:56 AM Christian Brauner <[email protected]> > > wrote: > > > > > > On Wed, Jul 17, 2019 at 01:21:00PM -0400, Joel Fernandes wrote: > > > > From: Suren Baghdasaryan <[email protected]> > > > > > > > > There is a race between reading task->exit_state in pidfd_poll and > > > > writing > > > > it after do_notify_parent calls do_notify_pidfd. Expected sequence of > > > > events is: > > > > > > > > CPU 0 CPU 1 > > > > ------------------------------------------------ > > > > exit_notify > > > > do_notify_parent > > > > do_notify_pidfd > > > > tsk->exit_state = EXIT_DEAD > > > > pidfd_poll > > > > if (tsk->exit_state) > > > > > > > > However nothing prevents the following sequence: > > > > > > > > CPU 0 CPU 1 > > > > ------------------------------------------------ > > > > exit_notify > > > > do_notify_parent > > > > do_notify_pidfd > > > > pidfd_poll > > > > if (tsk->exit_state) > > > > tsk->exit_state = EXIT_DEAD > > > > > > > > This causes a polling task to wait forever, since poll blocks because > > > > exit_state is 0 and the waiting task is not notified again. A stress > > > > test continuously doing pidfd poll and process exits uncovered this bug, > > > > and the below patch fixes it. > > > > > > > > To fix this, we set tsk->exit_state before calling do_notify_pidfd. > > > > > > > > Cc: [email protected] > > > > Signed-off-by: Suren Baghdasaryan <[email protected]> > > > > Signed-off-by: Joel Fernandes (Google) <[email protected]> > > > > > > That means in such a situation other users will see EXIT_ZOMBIE where > > > they didn't see that before until after the parent failed to get > > > notified. > > > > I'm a little nervous about that myself even though in my stress > > testing this worked fine. I think the safest change would be to move > > do_notify_pidfd() out of do_notify_parent() and call it after > > tsk->exit_state is finalized. The downside of that is that there are 4 > > My initial approach to pidfd polling did it this way, and I remember there > was a break in semantics where this does not work well. We want the > notification to happen in do_notify_parent() so that it is in sync with the > wait APIs.. > > I don't see a risk with this patch though. But let us see what Oleg's eyes > find.
I've been going through the various codepaths and that change should be fine. The places I looked at that worried me were release_task(), reparent_leader(), wait_consider_task() and their callers. But all of these either take read_lock(&tasklist_lock) or write_lock_irq(&tasklist_lock) themselves or are called with them held, same for ptrace_attach() and ptrace_detach(). And the whole sequence that switches to autoreaping when the parent ingores SIGCHLD in do_notify_parent() and wait_task_zombie() is under write_lock_irq(&tasklist_lock) as well so setting it to EXIT_ZOMBIE before do_notify_parent() and switching it to EXIT_DEAD when the parent ignores SIGCHLD should be safe. If we missed a sublety then we'll know pretty soon I'm sure. I'll pick this up now. We'll have some time anyway. > > > places we call do_notify_parent(), so instead of calling > > do_notify_pidfd() one time from do_notify_parent() we will be calling > > it 4 times now. > > > > Also my original patch had memory barriers to ensure correct ordering > > of tsk->exit_state writes before reads. In this final version Joel > > removed them, so I suppose he found out they are not needed. Joel, > > please clarify. > > The barriers were initially add by me to your patch, but then I felt it may > not be needed so I removed them before sending the patch. My initial concern > was something like the following: > > CPU 0 CPU 1 > ------------------------------------------------ > store tsk->exit_state = 1 > /* smp_wmb() ? */ > do_notify_parent > wake up > poll_wait() > /* smp_rmb(); ? */ > read tsk->exit_state = 0 > block... > > > I was initially concerned if tsk->exit_state write would be missed by the > polling thread and we would block forever (similar to this bug). > > I don't think this is possible anymore since wakeup implies release-barrier wake_up_all() which is used in do_notify_pidfd() implies a general memory barrier if something is actually woken up. > and waiting implies acquire barrier AFAIU. I am still not fully sure though, poll_wait() when used with eventpoll hits add_wait_queue which takes spin_lock_irqsave() which implies an acquire barrier if I remember memory_barriers right. > so yeah if you guys think it is an issue, let us add the memory barriers. As > such I know memory barrier additions to the kernel requires justification, > otherwise Linus calls it "Voodoo programming". So let us convince ourself > first if memory barriers are needed before adding them anyway. I didn't see it as an issue either. > > thanks, > > - Joel > > > > > > Thanks! > > > > > That's a rather subtle internal change. I was worried about > > > __ptrace_detach() since it explicitly checks for EXIT_ZOMBIE but it > > > seems to me that this is fine since we hold > > > write_lock_irq(&tasklist_lock); > > > at the point when we do set p->exit_signal. > > > > > > Acked-by: Christian Brauner <[email protected]> > > > > > > Once Oleg confirms that I'm right not to worty I'll pick this up. > > > > > > Thanks! > > > Christian > > > > > > > > > > > --- > > > > kernel/exit.c | 8 +++++--- > > > > 1 file changed, 5 insertions(+), 3 deletions(-) > > > > > > > > diff --git a/kernel/exit.c b/kernel/exit.c > > > > index a75b6a7f458a..740ceacb4b76 100644 > > > > --- a/kernel/exit.c > > > > +++ b/kernel/exit.c > > > > @@ -720,6 +720,7 @@ static void exit_notify(struct task_struct *tsk, > > > > int group_dead) > > > > if (group_dead) > > > > kill_orphaned_pgrp(tsk->group_leader, NULL); > > > > > > > > + tsk->exit_state = EXIT_ZOMBIE; > > > > if (unlikely(tsk->ptrace)) { > > > > int sig = thread_group_leader(tsk) && > > > > thread_group_empty(tsk) && > > > > @@ -1156,10 +1157,11 @@ static int wait_task_zombie(struct wait_opts > > > > *wo, struct task_struct *p) > > > > ptrace_unlink(p); > > > > > > > > /* If parent wants a zombie, don't release it now */ > > > > - state = EXIT_ZOMBIE; > > > > + p->exit_state = EXIT_ZOMBIE; > > > > if (do_notify_parent(p, p->exit_signal)) > > > > - state = EXIT_DEAD; > > > > - p->exit_state = state; > > > > + p->exit_state = EXIT_DEAD; > > > > + > > > > + state = p->exit_state; > > > > write_unlock_irq(&tasklist_lock); > > > > } > > > > if (state == EXIT_DEAD) > > > > -- > > > > 2.22.0.657.g960e92d24f-goog > > > >

