On Tue, Jun 26, 2018 at 11:29:50AM -0700, Paul E. McKenney wrote:
> On Tue, Jun 26, 2018 at 07:51:19PM +0200, Peter Zijlstra wrote:
> > On Tue, Jun 26, 2018 at 10:10:39AM -0700, Paul E. McKenney wrote:
> > > Without special fail-safe quiescent-state-propagation checks, grace-period
> > > hangs can result from the following scenario:
> > > 
> > > 1.        CPU 1 goes offline.
> > > 
> > > 2.        Because CPU 1 is the only CPU in the system blocking the current
> > >   grace period, as soon as rcu_cleanup_dying_idle_cpu()'s call to
> > >   rcu_report_qs_rnp() returns.
> > > 
> > > 3.        At this point, the leaf rcu_node structure's ->lock is no longer
> > >   held: rcu_report_qs_rnp() has released it, as it must in order
> > >   to awaken the RCU grace-period kthread.
> > > 
> > > 4.        At this point, that same leaf rcu_node structure's ->qsmaskinitnext
> > >   field still records CPU 1 as being online.  This is absolutely
> > >   necessary because the scheduler uses RCU, and ->qsmaskinitnext
> > 
> > Can you expand a bit on this, where does the scheduler care about the
> > online state of the CPU that's about to call into arch_cpu_idle_dead()?
> 
> Because the CPU does a context switch between the time that the CPU gets
> marked offline from the viewpoint of cpu_offline() and the time that
> the CPU finally makes it to arch_cpu_idle_dead().  Plus reporting the
> quiescent state (rcu_report_qs_rnp()) can result in waking up RCU's
> grace-period kthread.  During that context switch and that wakeup,
> the scheduler needs RCU to continue paying attention to the outgoing
> CPU, right?

And is the following a reasonable expansion?

                                                        Thanx, Paul

------------------------------------------------------------------------

commit 2e5b2ff4047b138d6b56e4e3ba91bc47503cdebe
Author: Paul E. McKenney <[email protected]>
Date:   Fri May 25 19:23:09 2018 -0700

    rcu: Fix grace-period hangs due to race with CPU offline
    
    Without special fail-safe quiescent-state-propagation checks, grace-period
    hangs can result from the following scenario:
    
    1.      CPU 1 goes offline.
    
    2.      Because CPU 1 is the only CPU in the system blocking the current
            grace period, the grace period ends as soon as
            rcu_cleanup_dying_idle_cpu()'s call to rcu_report_qs_rnp()
            returns.
    
    3.      At this point, the leaf rcu_node structure's ->lock is no longer
            held: rcu_report_qs_rnp() has released it, as it must in order
            to awaken the RCU grace-period kthread.
    
    4.      At this point, that same leaf rcu_node structure's ->qsmaskinitnext
            field still records CPU 1 as being online.  This is absolutely
            necessary because the scheduler uses RCU (in this case on the
            wake-up path while awakening RCU's grace-period kthread), and
            ->qsmaskinitnext contains RCU's idea as to which CPUs are online.
            Therefore, invoking rcu_report_qs_rnp() after clearing CPU 1's
            bit from ->qsmaskinitnext would result in a lockdep-RCU splat
            due to RCU being used from an offline CPU.
    
    5.      RCU's grace-period kthread awakens, sees that the old grace period
            has completed and that a new one is needed.  It therefore starts
            a new grace period, but because CPU 1's leaf rcu_node structure's
            ->qsmaskinitnext field still shows CPU 1 as being online, this new
            grace period is initialized to wait for a quiescent state from the
            now-offline CPU 1.
    
    6.      Without the fail-safe force-quiescent-state checks, there would
            be no quiescent state from the now-offline CPU 1, which would
            eventually result in RCU CPU stall warnings and memory exhaustion.
    
    It would be good to get rid of the special fail-safe quiescent-state
    propagation checks, and thus it would be good to fix things so that
    the above scenario cannot happen.  This commit therefore adds a new
    ->ofl_lock to the rcu_state structure.  This lock is held by rcu_gp_init()
    across the applying of buffered online and offline operations to the
    rcu_node tree, and it is also held by rcu_cleanup_dying_idle_cpu()
    when buffering a new offline operation.  This prevents rcu_gp_init()
    from acquiring the leaf rcu_node structure's lock during the interval
    between when rcu_cleanup_dying_idle_cpu() invokes rcu_report_qs_rnp(),
    which releases ->lock, and the re-acquisition of that same lock.
    This in turn prevents the failure scenario outlined above, and will
    hopefully eventually allow removal of the offline-CPU checks from the
    force-quiescent-state code path.
    
    Signed-off-by: Paul E. McKenney <[email protected]>

diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index 2cfd5d3da4f8..bb8f45c0fa68 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -101,6 +101,7 @@ struct rcu_state sname##_state = { \
        .abbr = sabbr, \
        .exp_mutex = __MUTEX_INITIALIZER(sname##_state.exp_mutex), \
        .exp_wake_mutex = __MUTEX_INITIALIZER(sname##_state.exp_wake_mutex), \
+       .ofl_lock = __SPIN_LOCK_UNLOCKED(sname##_state.ofl_lock), \
 }
 
 RCU_STATE_INITIALIZER(rcu_sched, 's', call_rcu_sched);
@@ -1900,11 +1901,13 @@ static bool rcu_gp_init(struct rcu_state *rsp)
         */
        rcu_for_each_leaf_node(rsp, rnp) {
                rcu_gp_slow(rsp, gp_preinit_delay);
+               spin_lock(&rsp->ofl_lock);
                raw_spin_lock_irq_rcu_node(rnp);
                if (rnp->qsmaskinit == rnp->qsmaskinitnext &&
                    !rnp->wait_blkd_tasks) {
                        /* Nothing to do on this leaf rcu_node structure. */
                        raw_spin_unlock_irq_rcu_node(rnp);
+                       spin_unlock(&rsp->ofl_lock);
                        continue;
                }
 
@@ -1940,6 +1943,7 @@ static bool rcu_gp_init(struct rcu_state *rsp)
                }
 
                raw_spin_unlock_irq_rcu_node(rnp);
+               spin_unlock(&rsp->ofl_lock);
        }
 
        /*
@@ -3747,6 +3751,7 @@ static void rcu_cleanup_dying_idle_cpu(int cpu, struct rcu_state *rsp)
 
        /* Remove outgoing CPU from mask in the leaf rcu_node structure. */
        mask = rdp->grpmask;
+       spin_lock(&rsp->ofl_lock);
        raw_spin_lock_irqsave_rcu_node(rnp, flags); /* Enforce GP memory-order guarantee. */
        if (rnp->qsmask & mask) { /* RCU waiting on outgoing CPU? */
                /* Report quiescent state -before- changing ->qsmaskinitnext! */
@@ -3755,6 +3760,7 @@ static void rcu_cleanup_dying_idle_cpu(int cpu, struct rcu_state *rsp)
        }
        rnp->qsmaskinitnext &= ~mask;
        raw_spin_unlock_irqrestore_rcu_node(rnp, flags);
+       spin_unlock(&rsp->ofl_lock);
 }
 
 /*
diff --git a/kernel/rcu/tree.h b/kernel/rcu/tree.h
index 3def94fc9c74..6683da6e4ecc 100644
--- a/kernel/rcu/tree.h
+++ b/kernel/rcu/tree.h
@@ -363,6 +363,10 @@ struct rcu_state {
        const char *name;                       /* Name of structure. */
        char abbr;                              /* Abbreviated name. */
        struct list_head flavors;               /* List of RCU flavors. */
+
+       spinlock_t ofl_lock ____cacheline_internodealigned_in_smp;
+                                               /* Synchronize offline with */
+                                               /*  GP pre-initialization. */
 };
 
 /* Values for rcu_state structure's gp_flags field. */
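
For readers who want to step through the scenario in the commit log
without booting a kernel, here is a rough single-threaded userspace
model.  It is only a sketch and not kernel code: the toy_* names, the
two-CPU setup, and the boolean standing in for ->ofl_lock are all
assumptions made for illustration, and real spinlocks, interrupts, and
the rcu_node tree are left out entirely.  The grace-period kthread's
wakeup is modeled as a direct call made in the window where the leaf
->lock would have been dropped.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy stand-in for one leaf rcu_node structure (hypothetical, simplified). */
struct toy_leaf {
	unsigned long qsmask;         /* CPUs the current GP still waits on. */
	unsigned long qsmaskinit;     /* Online CPUs as seen by the last GP init. */
	unsigned long qsmaskinitnext; /* Buffered view of which CPUs are online. */
};

/* Toy stand-in for rcu_state; a bool models whether ->ofl_lock is held. */
struct toy_state {
	bool use_ofl_lock;  /* Model the kernel with (true) or without the patch. */
	bool ofl_lock_held;
	struct toy_leaf leaf;
};

/* Model of GP initialization for the leaf.  Returns false if it would
 * have to wait for ->ofl_lock, in which case nothing is changed. */
static bool toy_gp_init(struct toy_state *s)
{
	if (s->use_ofl_lock && s->ofl_lock_held)
		return false;
	s->leaf.qsmaskinit = s->leaf.qsmaskinitnext; /* Apply buffered changes. */
	s->leaf.qsmask = s->leaf.qsmaskinit;         /* New GP waits on these CPUs. */
	return true;
}

/* Model of the outgoing CPU's offline path for the CPU(s) in mask. */
static void toy_cpu_offline(struct toy_state *s, unsigned long mask)
{
	bool gp_ended = false;
	bool gp_started = false;

	assert(s->leaf.qsmaskinitnext & mask); /* CPU must currently be online. */
	s->ofl_lock_held = s->use_ofl_lock;
	if (s->leaf.qsmask & mask) {
		/* Report the quiescent state before changing qsmaskinitnext. */
		s->leaf.qsmask &= ~mask;
		if (s->leaf.qsmask == 0) {
			/* Window: leaf lock dropped, GP kthread awakened. */
			gp_ended = true;
			gp_started = toy_gp_init(s);
		}
	}
	s->leaf.qsmaskinitnext &= ~mask; /* Only now mark the CPU offline. */
	s->ofl_lock_held = false;
	if (gp_ended && !gp_started)
		toy_gp_init(s); /* The kthread proceeds once the lock is free. */
}

/* Two CPUs online, CPU 0 already quiescent, CPU 1 (bit 0x2) goes offline.
 * Returns the set of CPUs the next grace period ends up waiting on. */
static unsigned long toy_run(bool use_ofl_lock)
{
	struct toy_state s = {
		.use_ofl_lock = use_ofl_lock,
		.ofl_lock_held = false,
		.leaf = { .qsmask = 0x2, .qsmaskinit = 0x3, .qsmaskinitnext = 0x3 },
	};

	toy_cpu_offline(&s, 0x2);
	return s.leaf.qsmask;
}
```

In this model, toy_run(false) leaves bit 0x2 set in the new grace
period's mask, which corresponds to steps 5 and 6 above: the new grace
period waits on the now-offline CPU 1.  With toy_run(true), the
modeled initialization is held off until after ->qsmaskinitnext has
been updated, so only CPU 0 remains in the mask.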
