On Tue, Jun 26, 2018 at 10:32:25PM +0200, Peter Zijlstra wrote:
> On Tue, Jun 26, 2018 at 01:26:15PM -0700, Paul E. McKenney wrote:
> > commit 2e5b2ff4047b138d6b56e4e3ba91bc47503cdebe
> > Author: Paul E. McKenney <[email protected]>
> > Date:   Fri May 25 19:23:09 2018 -0700
> > 
> >     rcu: Fix grace-period hangs due to race with CPU offline
> >     
> >     Without special fail-safe quiescent-state-propagation checks, 
> > grace-period
> >     hangs can result from the following scenario:
> >     
> >     1.      CPU 1 goes offline.
> >     
> >     2.      Because CPU 1 is the only CPU in the system blocking the current
> >             grace period, the grace period ends as soon as
> >             rcu_cleanup_dying_idle_cpu()'s call to rcu_report_qs_rnp()
> >             returns.
> 
> My current code doesn't have that call... So this is a new problem
> earlier in this series.
> 
> >     3.      At this point, the leaf rcu_node structure's ->lock is no longer
> >             held: rcu_report_qs_rnp() has released it, as it must in order
> >             to awaken the RCU grace-period kthread.
> >     
> >     4.      At this point, that same leaf rcu_node structure's 
> > ->qsmaskinitnext
> >             field still records CPU 1 as being online.  This is absolutely
> >             necessary because the scheduler uses RCU (in this case on the
> >             wake-up path while awakening RCU's grace-period kthread), and
> >             ->qsmaskinitnext contains RCU's idea as to which CPUs are 
> > online.
> >             Therefore, invoking rcu_report_qs_rnp() after clearing CPU 1's
> >             bit from ->qsmaskinitnext would result in a lockdep-RCU splat
> >             due to RCU being used from an offline CPU.
> 
> Argh.. so it's your own wakeup!

That it is.  The outgoing CPU might need to wake up RCU's grace-period
kthread, for example, if that CPU was the last thing holding up the
current grace period.

> This all still smells really bad. But let me try and figure out where
> you introduced the problem.

Well, I can't say that it is the code that I am the most proud of...

On the other hand, there are a number of constraints on this code:

o       The ->qsmask bits indicate which CPUs the current grace period
        is still waiting on.

o       The ->qsmaskinit bits indicate which CPUs the current grace
        period was waiting on at grace-period initialization time.
        In fact, the ->qsmask bits are initialized by copying from
        the corresponding ->qsmaskinit bits.

o       The ->qsmaskinitnext bits indicate which CPUs are online at
        a given time, from an RCU perspective.  RCU will consider
        an incoming CPU to be online before cpu_online_mask does,
        and RCU will also consider an outgoing CPU to be online after
        cpu_online_mask says that it is gone.

o       Only the leaf rcu_node structures' ->qsmaskinitnext bits have
        meaning.  Thus, although the ->qsmaskinit bits are initialized
        for a given grace period based on changes in the values of
        the ->qsmaskinitnext bits since the beginning of the last
        grace period, for the non-leaf ->qsmaskinit bits, this cannot
        be a simple copy operation.  (Instead, changes are pushed up
        the rcu_node tree.)

        Why not instead make the CPU-hotplug operations push the
        changes up the tree?  Because this could race with grace-period
        initialization, in which case it would result in inconsistent
        values for the ->qsmaskinit bits, and thus of the ->qsmask bits.
        These inconsistent values can result in too-short grace periods
        on the one hand and grace-period hangs on the other.

o       The purpose of these three bitmasks is to decouple RCU's
        grace-period initialization from CPU-hotplug operations.
        The reason for this decoupling is that various pieces of the
        CPU-hotplug code wait on RCU grace periods, which means that
        RCU grace-period initialization cannot block CPU hotplug,
        at least not without risk of deadlock.

o       The initial state of all the ->qsmask bits must be consistent (at
        least in the case where no CPU tries to report a quiescent state
        until all these bits are fully initialized).  That is to say,
        if the initial value of a given ->qsmask is all zero (and there
        are no preempted tasks referenced by the ->gp_tasks pointer,
        see below), then the initial value of the corresponding bit of
        the parent rcu_node structure must also be zero.  Similarly, if
        the initial value of a given ->qsmask has at least one non-zero
        bit (or there is at least one task referenced by the ->gp_tasks
        pointer), then the initial value of the corresponding bit of
        the parent rcu_node structure must also be non-zero.  Of course,

        Violation of this rule was the impetus for several of the
        patches under discussion.

o       Only the CPU-hotplug code is allowed to change the values of
        the ->qsmaskinitnext bits.  The grace-period initialization code
        reads these bits, but so does code checking for offline CPUs.
        The rcu_node structure's ->lock must be held when changing
        these bits.

o       Only the grace-period initialization code is allowed to access the
        ->qsmaskinit bits (give or take debug prints).  In theory, no lock
        is required when updating these bits, but in practice the rcu_node
        structure's ->lock is held due to the need to hold either the
        ->qsmask or the ->qsmaskinitnext bits still while carrying out
        the access.

o       Only the grace-period initialization code is allowed to change
        the values of the ->qsmask bits from zero to one.  Only something
        reporting a quiescent state is allowed to change the values of
        the corresponding ->qsmask bits from one to zero.  If the
        ->qsmask field of the root rcu_node structure becomes zero,
        the RCU grace-period kthread should be awakened (one execption
        being when that kthread cleared the last bit, and another where
        the root is also a leaf and there are tasks referenced by the
        ->gp_tasks pointer, again, see below).

o       Grace-period initialization first scans the leaf rcu_node
        structures for CPU-hotplug transitions (locking each such
        structure along the way and updating their ->qsmaskinit field),
        then does a breadth-first copy of ->qsmaskinit to ->qsmask for
        each rcu_node structure.  CPUs can come and go at any point
        in this process.  In fact, one CPU can go and another
        (or even the same CPU) can come back during this part of
        grace-period initialization.

o       Leaf rcu_node structures must concern themselves with tasks
        preempted within an RCU-preempt read-side critical section as well
        as CPUs passing through quiescent states.  These preempted tasks
        are placed on the corresponding structure's ->blkd_tasks list.
        It is possible that a grace period might start such that a given
        leaf must wait on a preempted task (non-NULL ->gp_tasks pointer)
        but no online CPUs.  In addition, a CPU could come online
        while this state was in effect.  The boolean ->wait_blkd_tasks
        field helps keep all this straight, allowing that leaf rcu_node
        structure to be taken out of circulation once all its CPUs are
        offline -and- its ->blkd_tasks list has become empty.

o       As soon as a ->qsmask bit is set in a leaf rcu_node structure,
        the CPU corresponding to that bit might immediately start trying
        to report a quiescent state.  This means that the ->qsmask bits
        must be initialized from the root down, in a breadth-first
        fashion.  Otherwise, a quiescent-state report could cause the
        ->qsmask bits to become inconsistent (if the report outran
        the bottom-to-top initialization).

o       In the past, the RCU grace-period kthread's force-quiescent-state
        scan would catch CPU-offline races, but the goal is to get rid
        of that fail-safe in order to reduce lock contention.  Therefore,
        if a given leaf rcu_node structure's ->qsmask field has a bit set,
        then that CPU needs to be online at least at the very beginning
        of the grace period.

The options I have considered are as follows:

1.      Stick with the no-failsafe approach, adding the lock as shown
        in this patch.  (I obviously prefer this approach.)

2.      Stick with the no-failsafe approach, but rely on RCU's grace-period
        kthread to wake up later due to its timed wait during the
        force-quiescent-state process.  This would be a bit obnoxious,
        as it requires passing a don't-wake flag (or some such) up the
        quiescent-state reporting mechanism.  It would also needlessly
        delay grace-period ends, especially on large systems (RCU scales
        up the FQS delay on larger systems to maintain limited CPU
        consumption per unit time).

3.      Stick with the no-failsafe approach, but have the quiescent-state
        reporting code hand back a value indicating that a wakeup is needed.
        Also a bit obnoxious, as this value would need to be threaded up
        the reporting code's return path.  Simple in theory, but a bit
        of an ugly change, especially for the many places in the code that
        currently expect quiescent-state reporting to be an unconditional
        fire-and-forget operation.

4.      Re-introduce the old fail-safe code, and don't report the
        quiescent state at CPU-offline time, relying on the fail-safe
        code to do so.  This also potentially introduces delays and can
        add extra FQS scans, which in turn increases lock contention.

So what did you have in mind?

                                                                Thanx, Paul

Reply via email to