On Mon, Nov 16, 2020 at 01:37:21PM +0000, Mel Gorman wrote:
> On Mon, Nov 16, 2020 at 01:11:03PM +0000, Will Deacon wrote:

> > Anyway, setting all that aside, I do agree with you that the bitfield usage
> > in task_struct looks pretty suspicious. For example, in __schedule() we
> > have:
> > 
> >     rq_lock(rq, &rf);
> >     smp_mb__after_spinlock();
> >     ...
> >     prev_state = prev->state;
> > 
> >     if (!preempt && prev_state) {
> >             if (signal_pending_state(prev_state, prev)) {
> >                     prev->state = TASK_RUNNING;
> >             } else {
> >                     prev->sched_contributes_to_load =
> >                             (prev_state & TASK_UNINTERRUPTIBLE) &&
> >                             !(prev_state & TASK_NOLOAD) &&
> >                             !(prev->flags & PF_FROZEN);
> >                     ...
> >                     deactivate_task(rq, prev, DEQUEUE_SLEEP | 
> > DEQUEUE_NOCLOCK);
> > 
> > where deactivate_task() updates p->on_rq directly:
> > 
> >     p->on_rq = (flags & DEQUEUE_SLEEP) ? 0 : TASK_ON_RQ_MIGRATING;
> > 
> 
> It used to be at least a WRITE_ONCE until 58877d347b58 ("sched: Better
> document ttwu()") which changed it. Not sure why that is and didn't
> think about it too deep as it didn't appear to be directly related to
> the problem and didn't have ordering consequences.

I'm confused; that commit didn't change deactivate_task(). Anyway,
->on_rq should be strictly under rq->lock. That said, since there is a
READ_ONCE() consumer of ->on_rq it makes sense to have the stores as
WRITE_ONCE().

> > __ttwu_queue_wakelist() we have:
> > 
> >     p->sched_remote_wakeup = !!(wake_flags & WF_MIGRATED);
> > 
> > which can be invoked on the try_to_wake_up() path if p->on_rq is first read
> > as zero and then p->on_cpu is read as 1. Perhaps these non-atomic bitfield
> > updates can race and cause the flags to be corrupted?
> > 
> 
> I think this is at least one possibility. I think at least that one
> should only be explicitly set on WF_MIGRATED and explicitly cleared in
> sched_ttwu_pending. While I haven't audited it fully, it might be enough
> to avoid a double write outside of the rq lock on the bitfield but I
> still need to think more about the ordering of sched_contributes_to_load
> and whether it's ordered by p->on_cpu or not.

The scenario you're worried about is something like:

        CPU0                                                    CPU1

        schedule()
                prev->sched_contributes_to_load = X;
                deactivate_task(prev);

                                                                try_to_wake_up()
                                                                        if 
(p->on_rq &&) // false
                                                                        if 
(smp_load_acquire(&p->on_cpu) && // true
                                                                            
ttwu_queue_wakelist())
                                                                                
p->sched_remote_wakeup = Y;

                smp_store_release(prev->on_cpu, 0);



And then the stores of X and Y clobber one another.. Hummph, seems
reasonable. One quick thing to test would be something like this:


diff --git a/include/linux/sched.h b/include/linux/sched.h
index 7abbdd7f3884..9844e541c94c 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -775,7 +775,9 @@ struct task_struct {
        unsigned                        sched_reset_on_fork:1;
        unsigned                        sched_contributes_to_load:1;
        unsigned                        sched_migrated:1;
+       unsigned                        :0;
        unsigned                        sched_remote_wakeup:1;
+       unsigned                        :0;
 #ifdef CONFIG_PSI
        unsigned                        sched_psi_wake_requeue:1;
 #endif

Reply via email to