On Thu, Nov 12, 2015 at 04:00:58PM +0100, Oleg Nesterov wrote:
> On 11/12, Boqun Feng wrote:
[snip]
> >
> > Hmm.. probably incorrect.. because the ACQUIRE semantics of spin_lock()
> > only guarantees that the memory operations following spin_lock() can't
> > be reordered before the *LOAD* part of spin_lock(), not the *STORE* part,
> > i.e. the case below can happen (assuming spin_lock() is implemented
> > as an ll/sc loop):
> >
> >     spin_lock(&lock):
> >       r1 = *lock; // LL, r1 == 0
> >     o = READ_ONCE(object); // could be reordered here.
> >       *lock = 1; // SC
> >
> > This could happen because of the ACQUIRE semantics of spin_lock(), and
> > the current implementation of spin_lock() on PPC allows this to happen.
> >
> > (Cc PPC maintainers for their opinions on this one)
> 
> In this case the code above is obviously wrong. And I do not understand
> how we can rely on spin_unlock_wait() then.
> 
> And afaics do_exit() is buggy too then, see below.
> 
> > I think it's OK for it as an ACQUIRE(with a proper barrier) or even just
> > a control dependency to pair with spin_unlock(), for example, the
> > following snippet in do_exit() is OK, except the smp_mb() is redundant,
> > unless I'm missing something subtle:
> >
> >     /*
> >      * The setting of TASK_RUNNING by try_to_wake_up() may be delayed
> >      * when the following two conditions become true.
> >      *   - There is race condition of mmap_sem (It is acquired by
> >      *     exit_mm()), and
> >      *   - SMI occurs before setting TASK_RUNNING.
> >      *     (or hypervisor of virtual machine switches to other guest)
> >      *  As a result, we may become TASK_RUNNING after becoming TASK_DEAD
> >      *
> >      * To avoid it, we have to wait for releasing tsk->pi_lock which
> >      * is held by try_to_wake_up()
> >      */
> >     smp_mb();
> >     raw_spin_unlock_wait(&tsk->pi_lock);
> 
> Perhaps it is me who missed something. But I don't think we can remove
> this mb(). And at the same time it can't help on PPC if I understand

You are right, we need this smp_mb() to order the previous STORE of
->state against the LOAD of ->pi_lock. I missed that part because all
the explicit STOREs of ->state that I saw in do_exit() go through
set_current_state(), which issues an smp_mb() after the STORE.
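
To make the required ordering concrete, here is a rough userspace sketch
using C11 atomics. It is only an analogy: task_state, pi_lock_word and
exit_side_sketch() are made-up names, the state values are illustrative,
and a seq_cst fence is assumed to stand in for smp_mb(). The C11 model is
not the kernel's memory model, so take it as a picture of "STORE, full
barrier, then LOAD" and nothing more.

```c
#include <assert.h>
#include <stdatomic.h>

/* Illustrative values and names only; not the kernel's definitions. */
enum { TASK_RUNNING = 0, TASK_UNINTERRUPTIBLE = 2 };

static atomic_int task_state;   /* stands in for tsk->state */
static atomic_int pi_lock_word; /* stands in for tsk->pi_lock; 0 == unlocked */

/* Exit-side sequence: STORE ->state, full fence, then LOAD the lock word.
 * Returns how many times the wait loop had to spin. */
static int exit_side_sketch(void)
{
    int spins = 0;

    atomic_store_explicit(&task_state, TASK_RUNNING, memory_order_relaxed);

    /* Assumed stand-in for smp_mb(): orders the STORE above against
     * the LOADs of the lock word below. */
    atomic_thread_fence(memory_order_seq_cst);

    /* Assumed stand-in for raw_spin_unlock_wait(): LOADs only, no STORE. */
    while (atomic_load_explicit(&pi_lock_word, memory_order_relaxed) != 0)
        spins++;

    return spins;
}

/* Single-threaded smoke check: with the lock free, the wait returns at once. */
static void exit_side_smoke_check(void)
{
    assert(exit_side_sketch() == 0);
    assert(atomic_load(&task_state) == TASK_RUNNING);
}
```

Without the fence, nothing in this sketch stops the LOAD of the lock word
from being satisfied before the STORE to task_state becomes visible, which
is the reordering you describe below.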

> your explanation above correctly.
> 
> To simplify, let's ignore exit_mm/down_read/etc. The exiting task does
> 
> 
>       current->state = TASK_UNINTERRUPTIBLE;
>       // without schedule() in between
>       current->state = TASK_RUNNING;
> 
>       smp_mb();
>       spin_unlock_wait(pi_lock);
> 
>       current->state = TASK_DEAD;
>       schedule();
> 
> and we need to ensure that if we race with
> try_to_wake_up(TASK_UNINTERRUPTIBLE), it can't change TASK_DEAD back to
> RUNNING.
> 
> Without smp_mb() this can be reordered, spin_unlock_wait(pi_lock) can
> read the old "unlocked" state of pi_lock before we set UNINTERRUPTIBLE,
> so in fact we could have
> 
>       current->state = TASK_UNINTERRUPTIBLE;
>       
>       spin_unlock_wait(pi_lock);
> 
>       current->state = TASK_RUNNING;
> 
>       current->state = TASK_DEAD;
> 
> and this can obviously race with ttwu() which can take pi_lock and see
> state == TASK_UNINTERRUPTIBLE after spin_unlock_wait().
> 

Yep, my mistake ;-)

> And, if I understand you correctly, this smp_mb() can't help on PPC.
> try_to_wake_up() can read task->state before it writes to *pi_lock.
> To me this doesn't really differ from the code above,
> 
>       CPU 1 (do_exit)                         CPU_2 (ttwu)
> 
>                                               spin_lock(pi_lock):
>                                                 r1 = *pi_lock; // r1 == 0;
>       p->state = TASK_UNINTERRUPTIBLE;
>                                               state = p->state;
>       p->state = TASK_RUNNING;
>       mb();
>       spin_unlock_wait();
>                                               *pi_lock = 1;
> 
>       p->state = TASK_DEAD;
>                                               if (state & TASK_UNINTERRUPTIBLE) // true
>                                                       p->state = RUNNING;
> 
> No?
> 

do_exit() is surely buggy if spin_lock() could work in this way.
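
For the record, the interleaving above can be replayed step by step in
plain single-threaded C. This is just a deterministic replay of your
diagram, not a demonstration that any real CPU does this; the function
name and the numeric state values are made up for illustration.

```c
#include <assert.h>

/* Illustrative values only; not taken from the kernel headers. */
enum { TASK_RUNNING = 0, TASK_UNINTERRUPTIBLE = 2, TASK_DEAD = 64 };

/* Replays the CPU 1 / CPU 2 steps in the order drawn in the diagram
 * and returns the final value of p->state. */
static int replay_interleaving(void)
{
    int pi_lock = 0;            /* 0 == unlocked */
    int p_state = TASK_RUNNING;
    int r1, seen;

    r1 = pi_lock;                      /* CPU 2: LL part of spin_lock() */
    p_state = TASK_UNINTERRUPTIBLE;    /* CPU 1 */
    seen = p_state;                    /* CPU 2: LOAD done before the SC */
    p_state = TASK_RUNNING;            /* CPU 1 */

    /* CPU 1: mb(); spin_unlock_wait() still observes the lock as free. */
    assert(pi_lock == 0);

    /* CPU 2: SC part of spin_lock() succeeds, since the LL saw 0. */
    assert(r1 == 0);
    pi_lock = 1;

    p_state = TASK_DEAD;               /* CPU 1 */

    if (seen & TASK_UNINTERRUPTIBLE)   /* CPU 2: true */
        p_state = TASK_RUNNING;        /* TASK_DEAD is overwritten */

    return p_state;
}
```

In this replay the final state is TASK_RUNNING and not TASK_DEAD, which is
exactly the outcome do_exit() is trying to rule out.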

> And smp_mb__before_spinlock() looks wrong too then.
> 

Maybe not? smp_mb__before_spinlock() is used before a LOCK operation,
which has both a LOAD part and a STORE part, unlike spin_unlock_wait(),
which only has a LOAD part?
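
A rough userspace picture of that difference, again in C11 atomics and
again only an analogy (sketch_lock_t and the sketch_*() helpers are
hypothetical, not the kernel's or PPC's implementation): the lock is a
read-modify-write on the lock word, while the "unlock wait" never writes
to it.

```c
#include <assert.h>
#include <stdatomic.h>

/* Hypothetical lock word: 0 == unlocked, 1 == locked. */
typedef struct { atomic_int locked; } sketch_lock_t;

/* LOCK: a read-modify-write, so it has a LOAD part and a STORE part. */
static void sketch_lock(sketch_lock_t *l)
{
    int expected = 0;

    while (!atomic_compare_exchange_weak_explicit(&l->locked, &expected, 1,
                                                  memory_order_acquire,
                                                  memory_order_relaxed))
        expected = 0; /* a failed CAS stores the observed value; reset it */
}

/* UNLOCK: a single release STORE. */
static void sketch_unlock(sketch_lock_t *l)
{
    atomic_store_explicit(&l->locked, 0, memory_order_release);
}

/* "unlock wait": LOADs until the lock looks free, never STOREs. */
static void sketch_unlock_wait(sketch_lock_t *l)
{
    while (atomic_load_explicit(&l->locked, memory_order_acquire) != 0)
        ; /* spin */
}

/* Single-threaded smoke check of the three helpers. */
static void sketch_lock_smoke_check(void)
{
    sketch_lock_t l = { 0 };

    sketch_unlock_wait(&l);                 /* free lock: returns at once */
    assert(atomic_load(&l.locked) == 0);    /* and leaves the word alone */
    sketch_lock(&l);
    assert(atomic_load(&l.locked) == 1);    /* the STORE part happened */
    sketch_unlock(&l);
    assert(atomic_load(&l.locked) == 0);
}
```

So a barrier placed before the lock has a STORE on the lock word to order
against, whereas the waiter only ever has a LOAD.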

> Oleg.
> 
