On Sat, Sep 14, 2019 at 07:33:58AM -0500, Eric W. Biederman wrote:
> 
> In the ordinary case today the rcu grace period for a task_struct is
> triggered when another process waits for its zombie and causes the
> kernel to call release_task().  As the waiting task has to receive a
> signal and then act upon it before this happens, typically this will
> occur after the original task has been removed from the runqueue.
> 
> Unfortunately, in some cases, such as self-reaping tasks, it can be
> shown that release_task() will be called, starting the grace period
> for task_struct, long before the task leaves the runqueue.
> 
> Therefore use put_task_struct_rcu_user in finish_task_switch to
> guarantee that there is an rcu lifetime after the task
> leaves the runqueue.
> 
> Besides the change in the start of the rcu grace period for the
> task_struct, this change may delay perf_event_delayed_put and
> trace_sched_process_free.  The function perf_event_delayed_put boils
> down to just a WARN_ON for cases that I assume never happen.  So
> I don't see any problem with delaying it.
> 
> The function trace_sched_process_free is a trace point and thus
> visible to user space.  Occasionally userspace has the strangest
> dependencies so this has a minuscule chance of causing a regression.
> This change only changes the timing of when the tracepoint is called.
> The change in timing arguably gives userspace a more accurate picture
> of what is going on.  So I don't expect there to be a regression.
> 
> In the case where a task self reaps we are pretty much guaranteed that
> the rcu grace period is delayed.  So we should get quite a bit of
> coverage of this worst case for the change in a normal threaded
> workload.  So I expect any issues to turn up quickly or not at all.
> 
> I have lightly tested this change and everything appears to work
> fine.
> 
> Inspired-by: Linus Torvalds <[email protected]>
> Inspired-by: Oleg Nesterov <[email protected]>
> Signed-off-by: "Eric W. Biederman" <[email protected]>
> ---
>  kernel/fork.c       | 11 +++++++----
>  kernel/sched/core.c |  2 +-
>  2 files changed, 8 insertions(+), 5 deletions(-)
> 
> diff --git a/kernel/fork.c b/kernel/fork.c
> index 9f04741d5c70..7a74ade4e7d6 100644
> --- a/kernel/fork.c
> +++ b/kernel/fork.c
> @@ -900,10 +900,13 @@ static struct task_struct *dup_task_struct(struct task_struct *orig, int node)
>       if (orig->cpus_ptr == &orig->cpus_mask)
>               tsk->cpus_ptr = &tsk->cpus_mask;
>  
> -     /* One for the user space visible state that goes away when reaped. */
> -     refcount_set(&tsk->rcu_users, 1);
> -     /* One for the rcu users, and one for the scheduler */
> -     refcount_set(&tsk->usage, 2);
> +     /*
> +      * One for the user space visible state that goes away when reaped.
> +      * One for the scheduler.
> +      */
> +     refcount_set(&tsk->rcu_users, 2);

OK, this would allow us to add a later decrement-and-test of
->rcu_users ...

> +     /* One for the rcu users */
> +     refcount_set(&tsk->usage, 1);
>  #ifdef CONFIG_BLK_DEV_IO_TRACE
>       tsk->btrace_seq = 0;
>  #endif
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 2b037f195473..69015b7c28da 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -3135,7 +3135,7 @@ static struct rq *finish_task_switch(struct task_struct *prev)
>               /* Task is done with its stack. */
>               put_task_stack(prev);
>  
> -             put_task_struct(prev);
> +             put_task_struct_rcu_user(prev);

... which is here.  And this looks to be invoked from the __schedule()
called from do_task_dead() at the very end of do_exit().

This looks plausible, but still requires that it no longer be possible to
enter an RCU read-side critical section that might increment ->rcu_users
after this point in time.  This might be enforced by a grace period
between the time that the task was removed from its lists and the current
time (which seems unlikely, though: in that case, why bother with
call_rcu()?) or by some other synchronization.
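
For readers following along, here is a rough userspace model of the two
counters as the patch sets them up, walked through in the self-reaping
order.  It is only a sketch: struct task_model and every model_*() name
are made up for illustration, C11 atomics stand in for refcount_t, and a
flag stands in for call_rcu().  In particular, model_try_get_rcu_user()
is a hypothetical increment-unless-zero reader added to show one form
the "other synchronization" could take.  It is not something the patch
adds, and I have not checked whether the kernel has an equivalent.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical userspace model of the two counters; not kernel code. */
struct task_model {
	atomic_int rcu_users;      /* models tsk->rcu_users */
	atomic_int usage;          /* models tsk->usage */
	bool grace_period_pending; /* stand-in for a queued call_rcu() callback */
	bool freed;                /* set when the last usage reference is dropped */
};

/* Initial counts as set in dup_task_struct() after the patch. */
static void task_model_init(struct task_model *t)
{
	atomic_init(&t->rcu_users, 2); /* user-visible state + scheduler */
	atomic_init(&t->usage, 1);     /* held on behalf of the rcu users */
	t->grace_period_pending = false;
	t->freed = false;
}

/* Models put_task_struct(): the last usage reference frees the task. */
static void model_put_task_struct(struct task_model *t)
{
	if (atomic_fetch_sub(&t->usage, 1) == 1)
		t->freed = true;
}

/* Models put_task_struct_rcu_user(): the last drop starts the grace period. */
static void model_put_task_struct_rcu_user(struct task_model *t)
{
	if (atomic_fetch_sub(&t->rcu_users, 1) == 1)
		t->grace_period_pending = true;
}

/* Stand-in for the grace period ending and the callback running. */
static void model_grace_period_elapses(struct task_model *t)
{
	if (t->grace_period_pending) {
		t->grace_period_pending = false;
		model_put_task_struct(t);
	}
}

/* Hypothetical reader: take an rcu_users reference unless it already hit zero. */
static bool model_try_get_rcu_user(struct task_model *t)
{
	int old = atomic_load(&t->rcu_users);

	while (old != 0) {
		/* On failure, old is reloaded with the current value. */
		if (atomic_compare_exchange_weak(&t->rcu_users, &old, old + 1))
			return true;
	}
	return false;
}

/* Self-reaping order; returns 0 if every step behaves as described. */
static int model_self_reap_demo(void)
{
	struct task_model t;

	task_model_init(&t);
	model_put_task_struct_rcu_user(&t);   /* release_task() runs first */
	if (t.grace_period_pending)
		return 1;                     /* must not start yet */
	model_put_task_struct_rcu_user(&t);   /* finish_task_switch() */
	if (!t.grace_period_pending)
		return 2;                     /* starts only now */
	if (model_try_get_rcu_user(&t))
		return 3;                     /* late reader must fail */
	if (t.freed)
		return 4;                     /* still live until the grace period ends */
	model_grace_period_elapses(&t);
	return t.freed ? 0 : 5;
}
```

Calling model_self_reap_demo() from a main() should return 0.  The
model is single-threaded, so it shows only the counting and the
ordering, not the concurrency question raised above.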

On to the next patch!

                                                        Thanx, Paul

>       }
>  
>       tick_nohz_task_switch();
> -- 
> 2.21.0.dirty
> 
