Re: Memory-ordering recipes

Paul E. McKenney Wed, 18 Oct 2017 13:49:38 -0700

On Wed, Oct 18, 2017 at 03:07:48AM +0200, Andrea Parri wrote:
> On Tue, Oct 17, 2017 at 02:01:37PM -0700, Paul E. McKenney wrote:
> > On Sun, Sep 17, 2017 at 04:05:09PM -0700, Paul E. McKenney wrote:
> > > Hello!
> > > 
> > > The topic of memory-ordering recipes came up at the Linux Plumbers
> > > Conference microconference on Friday, so I thought that I should summarize
> > > what is currently "out there":
> > 
> > And here is an updated list of potential Linux-kernel examples for a
> > "recipes" document, and thank you for the feedback.  Please feel free
> > to counterpropose better examples.  In addition, if there is some other
> > pattern that is commonplace and important enough to be included in a
> > recipes document, please point it out.
> > 
> >                                                     Thanx, Paul
> > 
> > ------------------------------------------------------------------------
> > 
> > This document lists the litmus-test patterns that we have been discussing,
> > along with examples from the Linux kernel.  This is intended to feed into
> > the recipes document.  All examples are from v4.13.
> > 
> > 0.  Simple special cases
> > 
> >     If there is only one CPU on the one hand or only one variable
> >     on the other, the code will execute in order.  There are (as
> >     usual) some things to be careful of:
> > 
> >     a.      There are some aspects of the C language that are
> >             unordered.  For example, in the expression "f(x) + g(y)",
> >             the order in which f and g are called is not defined;
> >             the object code is allowed to use either order or even
> >             to interleave the computations.
> > 
> >     b.      Compilers are permitted to use the "as-if" rule.  That is,
> >             a compiler can emit whatever code it likes, as long as
> >             the results of a single-threaded execution appear just
> >             as if the compiler had followed all the relevant rules.
> >             To see this, compile with a high level of optimization
> >             and run the debugger on the resulting binary.
> > 
> >     c.      If there is only one variable but multiple CPUs, all
> >             that variable must be properly aligned and all accesses
> >             to that variable must be full sized.  Variables that
> >             straddle cachelines or pages void your full-ordering
> >             warranty, as do undersized accesses that load from or
> >             store to only part of the variable.
> > 
> > 1.  Another simple case: Locking.  [ Assuming you don't think too
> >     hard about it, that is! ]
> > 
> >     Any CPU that has acquired a given lock sees any changes previously
> >     made by any CPU prior to having released that same lock.
> > 
> >     [ Should we discuss chaining back through different locks,
> >       sort of like release-acquire chains? ]
> > 
> > 2.  MP (see test6.pdf for nickname translation)
> > 
> >     a.      smp_store_release() / smp_load_acquire()
> > 
> >             init_stack_slab() in lib/stackdepot.c uses release-acquire
> >             to handle initialization of a slab of the stack.  Working
> >             out the mutual-exclusion design is left as an exercise for
> >             the reader.
> > 
> >     b.      rcu_assign_pointer() / rcu_dereference()
> > 
> >             expand_to_next_prime() does the rcu_assign_pointer(),
> >             and next_prime_number() does the rcu_dereference().
> >             This mediates access to a bit vector that is expanded
> >             as additional primes are needed.  These two functions
> >             are in lib/prime_numbers.c.
> > 
> >     c.      smp_wmb() / smp_rmb()
> > 
> >             xlog_state_switch_iclogs() contains the following:
> > 
> >                     log->l_curr_block -= log->l_logBBsize;
> >                     ASSERT(log->l_curr_block >= 0);
> >                     smp_wmb();
> >                     log->l_curr_cycle++;
> > 
> >             And xlog_valid_lsn() contains the following:
> > 
> >                     cur_cycle = ACCESS_ONCE(log->l_curr_cycle);
> >                     smp_rmb();
> >                     cur_block = ACCESS_ONCE(log->l_curr_block);
> > 
> >             Alternatively, from the comment in perf_output_put_handle()
> >             in kernel/events/ring_buffer.c:
> > 
> >              *   kernel                             user
> >              *
> >              *   if (LOAD ->data_tail) {            LOAD ->data_head
> >              *                      (A)             smp_rmb()       (C)
> >              *      STORE $data                     LOAD $data
> >              *      smp_wmb()       (B)             smp_mb()        (D)
> >              *      STORE ->data_head               STORE ->data_tail
> >              *   }
> >              *
> >              * Where A pairs with D, and B pairs with C.
> > 
> >             The B/C pairing is MP with smp_wmb() and smp_rmb().
> > 
> >     d.      Replacing either of the above with smp_mb()
> > 
> >             Holding off on this one for the moment...
> > 
> > 3.  LB
> > 
> >     a.      LB+ctrl+mb
> > 
> >             Again, from the comment in perf_output_put_handle()
> >             in kernel/events/ring_buffer.c:
> > 
> >              *   kernel                             user
> >              *
> >              *   if (LOAD ->data_tail) {            LOAD ->data_head
> >              *                      (A)             smp_rmb()       (C)
> >              *      STORE $data                     LOAD $data
> >              *      smp_wmb()       (B)             smp_mb()        (D)
> >              *      STORE ->data_head               STORE ->data_tail
> >              *   }
> >              *
> >              * Where A pairs with D, and B pairs with C.
> > 
> >             The A/D pairing covers this one.
> > 
> > 4.  Release-acquire chains, AKA ISA2, Z6.2, LB, and 3.LB
> > 
> >     Lots of variety here, can in some cases substitute:
> >     
> >     a.      READ_ONCE() for smp_load_acquire()
> >     b.      WRITE_ONCE() for smp_store_release()
> >     c.      Dependencies for both smp_load_acquire() and
> >             smp_store_release().
> >     d.      smp_wmb() for smp_store_release() in first thread
> >             of ISA2 and Z6.2.
> >     e.      smp_rmb() for smp_load_acquire() in last thread of ISA2.
> > 
> >     The canonical illustration of LB involves the various memory
> >     allocators, where you don't want a load from about-to-be-freed
> >     memory to see a store initializing a later incarnation of that
> >     same memory area.  But the per-CPU caches make this a very
> >     long and complicated example.
> > 
> >     I am not aware of any three-CPU release-acquire chains in the
> >     Linux kernel.  There are three-CPU lock-based chains in RCU,
> >     but these are not at all simple, either.
> > 
> >     Thoughts?
> > 
> > 5.  SB
> > 
> >     a.      smp_mb(), as in lockless wait-wakeup coordination.
> >             And as in sys_membarrier()-scheduler coordination,
> >             for that matter.
> > 
> >             Examples seem to be lacking.  Most cases use locking.
> >             Here is one rather strange one from RCU:
> > 
> >             void call_rcu_tasks(struct rcu_head *rhp, rcu_callback_t func)
> >             {
> >                     unsigned long flags;
> >                     bool needwake;
> >                     bool havetask = READ_ONCE(rcu_tasks_kthread_ptr);
> > 
> >                     rhp->next = NULL;
> >                     rhp->func = func;
> >                     raw_spin_lock_irqsave(&rcu_tasks_cbs_lock, flags);
> >                     needwake = !rcu_tasks_cbs_head;
> >                     *rcu_tasks_cbs_tail = rhp;
> >                     rcu_tasks_cbs_tail = &rhp->next;
> >                     raw_spin_unlock_irqrestore(&rcu_tasks_cbs_lock, flags);
> >                     /* We can't create the thread unless interrupts are 
> > enabled. */
> >                     if ((needwake && havetask) ||
> >                         (!havetask && !irqs_disabled_flags(flags))) {
> >                             rcu_spawn_tasks_kthread();
> >                             wake_up(&rcu_tasks_cbs_wq);
> >                     }
> >             }
> > 
> >             And for the wait side, using synchronize_sched() to supply
> >             the barrier for both ends, with the preemption disabling
> >             due to raw_spin_lock_irqsave() serving as the read-side
> >             critical section:
> > 
> >             if (!list) {
> >                     wait_event_interruptible(rcu_tasks_cbs_wq,
> >                                              rcu_tasks_cbs_head);
> >                     if (!rcu_tasks_cbs_head) {
> >                             WARN_ON(signal_pending(current));
> >                             schedule_timeout_interruptible(HZ/10);
> >                     }
> >                     continue;
> >             }
> >             synchronize_sched();
> > 
> >             -----------------
> > 
> >             Here is another one that uses atomic_cmpxchg() as a
> >             full memory barrier:
> > 
> >             if (!wait_event_timeout(*wait, !atomic_read(stopping),
> >                                     msecs_to_jiffies(1000))) {
> >                     atomic_set(stopping, 0);
> >                     smp_mb();
> >                     return -ETIMEDOUT;
> >             }
> > 
> >             int omap3isp_module_sync_is_stopping(wait_queue_head_t *wait,
> >                                                  atomic_t *stopping)
> >             {
> >                     if (atomic_cmpxchg(stopping, 1, 0)) {
> >                             wake_up(wait);
> >                             return 1;
> >                     }
> > 
> >                     return 0;
> >             }
> > 
> >             -----------------
> > 
> >             And here is the generic pattern for the above two examples
> >             taken from waitqueue_active() in include/linux/wait.h:
> > 
> >  *      CPU0 - waker                    CPU1 - waiter
> >  *
> >  *                                      for (;;) {
> >  *      @cond = true;                     prepare_to_wait(&wq_head, &wait, 
> > state);
> >  *      smp_mb();                         // smp_mb() from 
> > set_current_state()
> >  *      if (waitqueue_active(wq_head))         if (@cond)
> >  *        wake_up(wq_head);                      break;
> >  *                                        schedule();
> >  *                                      }
> >  *                                      finish_wait(&wq_head, &wait);
> > 
> >             Note that prepare_to_wait() does the both the write
> >             and the set_current_state() that contains the smp_mb().
> >             The read is the waitqueue_active() on the one hand and
> >             the "if (@cond)" on the other.
> > 
> > 6.  W+RWC+porel+mb+mb
> > 
> >     See recipes-LKcode-63cae12bce986.txt.
> > 
> >     Mostly of historical interest -- as far as I know, this commit
> >     was the first to contain a litmus test.
> > 
> > 7.  Context switch and migration.  A bit specialized, so might leave
> >     this one out.
> > 
> >     When a thread moves from one CPU to another to another, the
> >     scheduler is required to do whatever is necessary for the thread
> >     to see any prior accesses that it executed on other CPUs.  This
> >     includes "interesting" interactions with wake_up() shown in the
> >     following comment from try_to_wake_up() in kernel/sched/core.c:
> > 
> >  * Notes on Program-Order guarantees on SMP systems.
> >  *
> >  *  MIGRATION
> >  *
> >  * The basic program-order guarantee on SMP systems is that when a task [t]
> >  * migrates, all its activity on its old CPU [c0] happens-before any 
> > subsequent
> >  * execution on its new CPU [c1].
> >  *
> >  * For migration (of runnable tasks) this is provided by the following 
> > means:
> >  *
> >  *  A) UNLOCK of the rq(c0)->lock scheduling out task t
> >  *  B) migration for t is required to synchronize *both* rq(c0)->lock and
> >  *     rq(c1)->lock (if not at the same time, then in that order).
> >  *  C) LOCK of the rq(c1)->lock scheduling in task
> >  *
> >  * Transitivity guarantees that B happens after A and C after B.
> >  * Note: we only require RCpc transitivity.
> >  * Note: the CPU doing B need not be c0 or c1
> >  *
> >  * Example:
> >  *
> >  *   CPU0            CPU1            CPU2
> >  *
> >  *   LOCK rq(0)->lock
> >  *   sched-out X
> >  *   sched-in Y
> >  *   UNLOCK rq(0)->lock
> >  *
> >  *                                   LOCK rq(0)->lock // orders against CPU0
> >  *                                   dequeue X
> >  *                                   UNLOCK rq(0)->lock
> >  *
> >  *                                   LOCK rq(1)->lock
> >  *                                   enqueue X
> >  *                                   UNLOCK rq(1)->lock
> >  *
> >  *                   LOCK rq(1)->lock // orders against CPU2
> >  *                   sched-out Z
> >  *                   sched-in X
> >  *                   UNLOCK rq(1)->lock
> 
> We might want to "specialize" this snippet to emphasize (or illustrate)
> a particular _cycle_; one possible way of "closing the chain" would be:
> 
>    CPU0
> 
>    smp_store_release(&X->on_cpu, 0); /* from sched-out X */
>    UNLOCK rq(0)->lock
> 
> 
>    CPU2
> 
>    LOCK rq(0)->lock
>    UNLOCK rq(1)->lock
> 
> 
>    CPU1
> 
>    LOCK rq(1)->lock
>    X->on_cpu = 1; /* from sched-in X */
> 
> 
>    BUG_ON("each LOCK reads from its UNLOCK" && X->on_cpu == 0);
> 
> (c.f., Z6.2 in test6.pdf)


I added this as a summary, pretty much in the form you have above.

> >  *
> >  *
> >  *  BLOCKING -- aka. SLEEP + WAKEUP
> >  *
> >  * For blocking we (obviously) need to provide the same guarantee as for
> >  * migration. However the means are completely different as there is no lock
> >  * chain to provide order. Instead we do:
> >  *
> >  *   1) smp_store_release(X->on_cpu, 0)
> >  *   2) smp_cond_load_acquire(!X->on_cpu)
> >  *
> >  * Example:
> >  *
> >  *   CPU0 (schedule)  CPU1 (try_to_wake_up) CPU2 (schedule)
> >  *
> >  *   LOCK rq(0)->lock LOCK X->pi_lock
> >  *   dequeue X
> >  *   sched-out X
> >  *   smp_store_release(X->on_cpu, 0);
> >  *
> >  *                    smp_cond_load_acquire(&X->on_cpu, !VAL);
> >  *                    X->state = WAKING
> >  *                    set_task_cpu(X,2)
> >  *
> >  *                    LOCK rq(2)->lock
> >  *                    enqueue X
> >  *                    X->state = RUNNING
> >  *                    UNLOCK rq(2)->lock
> >  *
> >  *                                          LOCK rq(2)->lock // orders 
> > against CPU1
> >  *                                          sched-out Z
> >  *                                          sched-in X
> >  *                                          UNLOCK rq(2)->lock
> >  *
> >  *                    UNLOCK X->pi_lock
> >  *   UNLOCK rq(0)->lock
> 
> Similarly:
> 
>    CPU0
> 
>    smp_store_release(&X->on_cpu, 0); /* from sched-out X */
> 
> 
>    CPU1
> 
>    smp_cond_load_acquire(&X->on_cpu, !VAL);
>    UNLOCK rq(2)->lock
> 
> 
>    CPU2
> 
>    LOCK rq(2)->lock
>    X->on_cpu = 1; // from sched-in X
> 
> 
>    BUG_ON("each LOCK reads from its UNLOCK" && X->on_cpu == 0);
> 
> (c.f., WWC in test6.pdf)

And same here.

                                                        Thanx, Paul

>   Andrea
> 
> 
> >  *
> >  *
> >  * However; for wakeups there is a second guarantee we must provide, namely 
> > we
> >  * must observe the state that lead to our wakeup. That is, not only must 
> > our
> >  * task observe its own prior state, it must also observe the stores prior 
> > to
> >  * its wakeup.
> >  *
> >  * This means that any means of doing remote wakeups must order the CPU 
> > doing
> >  * the wakeup against the CPU the task is going to end up running on. This,
> >  * however, is already required for the regular Program-Order guarantee 
> > above,
> >  * since the waking CPU is the one issueing the ACQUIRE 
> > (smp_cond_load_acquire).
> 
> > commit 63cae12bce9861cec309798d34701cf3da20bc71
> > Author: Peter Zijlstra <[email protected]>
> > Date:   Fri Dec 9 14:59:00 2016 +0100
> > 
> >     perf/core: Fix sys_perf_event_open() vs. hotplug
> >     
> >     There is problem with installing an event in a task that is 'stuck' on
> >     an offline CPU.
> >     
> >     Blocked tasks are not dis-assosciated from offlined CPUs, after all, a
> >     blocked task doesn't run and doesn't require a CPU etc.. Only on
> >     wakeup do we ammend the situation and place the task on a available
> >     CPU.
> >     
> >     If we hit such a task with perf_install_in_context() we'll loop until
> >     either that task wakes up or the CPU comes back online, if the task
> >     waking depends on the event being installed, we're stuck.
> >     
> >     While looking into this issue, I also spotted another problem, if we
> >     hit a task with perf_install_in_context() that is in the middle of
> >     being migrated, that is we observe the old CPU before sending the IPI,
> >     but run the IPI (on the old CPU) while the task is already running on
> >     the new CPU, things also go sideways.
> >     
> >     Rework things to rely on task_curr() -- outside of rq->lock -- which
> >     is rather tricky. Imagine the following scenario where we're trying to
> >     install the first event into our task 't':
> >     
> >     CPU0            CPU1            CPU2
> >     
> >                     (current == t)
> >     
> >     t->perf_event_ctxp[] = ctx;
> >     smp_mb();
> >     cpu = task_cpu(t);
> >     
> >                     switch(t, n);
> >                                     migrate(t, 2);
> >                                     switch(p, t);
> >     
> >                                     ctx = t->perf_event_ctxp[]; // must not 
> > be NULL
> >     
> >     smp_function_call(cpu, ..);
> >     
> >                     generic_exec_single()
> >                       func();
> >                         spin_lock(ctx->lock);
> >                         if (task_curr(t)) // false
> >     
> >                         add_event_to_ctx();
> >                         spin_unlock(ctx->lock);
> >     
> >                                     perf_event_context_sched_in();
> >                                       spin_lock(ctx->lock);
> >                                       // sees event
> >     
> >     So its CPU0's store of t->perf_event_ctxp[] that must not go 'missing'.
> >     Because if CPU2's load of that variable were to observe NULL, it would
> >     not try to schedule the ctx and we'd have a task running without its
> >     counter, which would be 'bad'.
> >     
> >     As long as we observe !NULL, we'll acquire ctx->lock. If we acquire it
> >     first and not see the event yet, then CPU0 must observe task_curr()
> >     and retry. If the install happens first, then we must see the event on
> >     sched-in and all is well.
> >     
> >     I think we can translate the first part (until the 'must not be NULL')
> >     of the scenario to a litmus test like:
> >     
> >       C C-peterz
> >     
> >       {
> >       }
> >     
> >       P0(int *x, int *y)
> >       {
> >               int r1;
> >     
> >               WRITE_ONCE(*x, 1);
> >               smp_mb();
> >               r1 = READ_ONCE(*y);
> >       }
> >     
> >       P1(int *y, int *z)
> >       {
> >               WRITE_ONCE(*y, 1);
> >               smp_store_release(z, 1);
> >       }
> >     
> >       P2(int *x, int *z)
> >       {
> >               int r1;
> >               int r2;
> >     
> >               r1 = smp_load_acquire(z);
> >               smp_mb();
> >               r2 = READ_ONCE(*x);
> >       }
> >     
> >       exists
> >       (0:r1=0 /\ 2:r1=1 /\ 2:r2=0)
> >     
> >     Where:
> >       x is perf_event_ctxp[],
> >       y is our tasks's CPU, and
> >       z is our task being placed on the rq of CPU2.
> >     
> >     The P0 smp_mb() is the one added by this patch, ordering the store to
> >     perf_event_ctxp[] from find_get_context() and the load of task_cpu()
> >     in task_function_call().
> >     
> >     The smp_store_release/smp_load_acquire model the RCpc locking of the
> >     rq->lock and the smp_mb() of P2 is the context switch switching from
> >     whatever CPU2 was running to our task 't'.
> >     
> >     This litmus test evaluates into:
> >     
> >       Test C-peterz Allowed
> >       States 7
> >       0:r1=0; 2:r1=0; 2:r2=0;
> >       0:r1=0; 2:r1=0; 2:r2=1;
> >       0:r1=0; 2:r1=1; 2:r2=1;
> >       0:r1=1; 2:r1=0; 2:r2=0;
> >       0:r1=1; 2:r1=0; 2:r2=1;
> >       0:r1=1; 2:r1=1; 2:r2=0;
> >       0:r1=1; 2:r1=1; 2:r2=1;
> >       No
> >       Witnesses
> >       Positive: 0 Negative: 7
> >       Condition exists (0:r1=0 /\ 2:r1=1 /\ 2:r2=0)
> >       Observation C-peterz Never 0 7
> >       Hash=e427f41d9146b2a5445101d3e2fcaa34
> >     
> >     And the strong and weak model agree.
> >     
> >     Reported-by: Mark Rutland <[email protected]>
> >     Tested-by: Mark Rutland <[email protected]>
> >     Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
> >     Cc: Alexander Shishkin <[email protected]>
> >     Cc: Arnaldo Carvalho de Melo <[email protected]>
> >     Cc: Arnaldo Carvalho de Melo <[email protected]>
> >     Cc: Jiri Olsa <[email protected]>
> >     Cc: Linus Torvalds <[email protected]>
> >     Cc: Peter Zijlstra <[email protected]>
> >     Cc: Sebastian Andrzej Siewior <[email protected]>
> >     Cc: Stephane Eranian <[email protected]>
> >     Cc: Thomas Gleixner <[email protected]>
> >     Cc: Vince Weaver <[email protected]>
> >     Cc: Will Deacon <[email protected]>
> >     Cc: [email protected]
> >     Link: 
> > http://lkml.kernel.org/r/[email protected]
> >     Signed-off-by: Ingo Molnar <[email protected]>
> > 
> > diff --git a/kernel/events/core.c b/kernel/events/core.c
> > index ab15509fab8c..72ce7d63e561 100644
> > --- a/kernel/events/core.c
> > +++ b/kernel/events/core.c
> > @@ -2249,7 +2249,7 @@ static int  __perf_install_in_context(void *info)
> >     struct perf_event_context *ctx = event->ctx;
> >     struct perf_cpu_context *cpuctx = __get_cpu_context(ctx);
> >     struct perf_event_context *task_ctx = cpuctx->task_ctx;
> > -   bool activate = true;
> > +   bool reprogram = true;
> >     int ret = 0;
> >  
> >     raw_spin_lock(&cpuctx->ctx.lock);
> > @@ -2257,27 +2257,26 @@ static int  __perf_install_in_context(void *info)
> >             raw_spin_lock(&ctx->lock);
> >             task_ctx = ctx;
> >  
> > -           /* If we're on the wrong CPU, try again */
> > -           if (task_cpu(ctx->task) != smp_processor_id()) {
> > -                   ret = -ESRCH;
> > -                   goto unlock;
> > -           }
> > +           reprogram = (ctx->task == current);
> >  
> >             /*
> > -            * If we're on the right CPU, see if the task we target is
> > -            * current, if not we don't have to activate the ctx, a future
> > -            * context switch will do that for us.
> > +            * If the task is running, it must be running on this CPU,
> > +            * otherwise we cannot reprogram things.
> > +            *
> > +            * If its not running, we don't care, ctx->lock will
> > +            * serialize against it becoming runnable.
> >              */
> > -           if (ctx->task != current)
> > -                   activate = false;
> > -           else
> > -                   WARN_ON_ONCE(cpuctx->task_ctx && cpuctx->task_ctx != 
> > ctx);
> > +           if (task_curr(ctx->task) && !reprogram) {
> > +                   ret = -ESRCH;
> > +                   goto unlock;
> > +           }
> >  
> > +           WARN_ON_ONCE(reprogram && cpuctx->task_ctx && cpuctx->task_ctx 
> > != ctx);
> >     } else if (task_ctx) {
> >             raw_spin_lock(&task_ctx->lock);
> >     }
> >  
> > -   if (activate) {
> > +   if (reprogram) {
> >             ctx_sched_out(ctx, cpuctx, EVENT_TIME);
> >             add_event_to_ctx(event, ctx);
> >             ctx_resched(cpuctx, task_ctx);
> > @@ -2328,13 +2327,36 @@ perf_install_in_context(struct perf_event_context 
> > *ctx,
> >     /*
> >      * Installing events is tricky because we cannot rely on ctx->is_active
> >      * to be set in case this is the nr_events 0 -> 1 transition.
> > +    *
> > +    * Instead we use task_curr(), which tells us if the task is running.
> > +    * However, since we use task_curr() outside of rq::lock, we can race
> > +    * against the actual state. This means the result can be wrong.
> > +    *
> > +    * If we get a false positive, we retry, this is harmless.
> > +    *
> > +    * If we get a false negative, things are complicated. If we are after
> > +    * perf_event_context_sched_in() ctx::lock will serialize us, and the
> > +    * value must be correct. If we're before, it doesn't matter since
> > +    * perf_event_context_sched_in() will program the counter.
> > +    *
> > +    * However, this hinges on the remote context switch having observed
> > +    * our task->perf_event_ctxp[] store, such that it will in fact take
> > +    * ctx::lock in perf_event_context_sched_in().
> > +    *
> > +    * We do this by task_function_call(), if the IPI fails to hit the task
> > +    * we know any future context switch of task must see the
> > +    * perf_event_ctpx[] store.
> >      */
> > -again:
> > +
> >     /*
> > -    * Cannot use task_function_call() because we need to run on the task's
> > -    * CPU regardless of whether its current or not.
> > +    * This smp_mb() orders the task->perf_event_ctxp[] store with the
> > +    * task_cpu() load, such that if the IPI then does not find the task
> > +    * running, a future context switch of that task must observe the
> > +    * store.
> >      */
> > -   if (!cpu_function_call(task_cpu(task), __perf_install_in_context, 
> > event))
> > +   smp_mb();
> > +again:
> > +   if (!task_function_call(task, __perf_install_in_context, event))
> >             return;
> >  
> >     raw_spin_lock_irq(&ctx->lock);
> > @@ -2348,12 +2370,16 @@ again:
> >             raw_spin_unlock_irq(&ctx->lock);
> >             return;
> >     }
> > -   raw_spin_unlock_irq(&ctx->lock);
> >     /*
> > -    * Since !ctx->is_active doesn't mean anything, we must IPI
> > -    * unconditionally.
> > +    * If the task is not running, ctx->lock will avoid it becoming so,
> > +    * thus we can safely install the event.
> >      */
> > -   goto again;
> > +   if (task_curr(task)) {
> > +           raw_spin_unlock_irq(&ctx->lock);
> > +           goto again;
> > +   }
> > +   add_event_to_ctx(event, ctx);
> > +   raw_spin_unlock_irq(&ctx->lock);
> >  }
> >  
> >  /*
>

Re: Memory-ordering recipes

Reply via email to