On Thu, Sep 17, 2020 at 11:42:11AM +0200, Thomas Gleixner wrote:

> +static inline void update_nr_migratory(struct task_struct *p, long delta)
> +{
> +     if (p->nr_cpus_allowed > 1 && p->sched_class->update_migratory)
> +             p->sched_class->update_migratory(p, delta);
> +}

Right, so as you know, I totally hate this thing :-) It adds a second
(and radically different) version of changing affinity. I'm working on a
version that uses the normal *set_cpus_allowed*() interface.

> +/*
> + * The migrate_disable/enable() fastpath updates only the tasks migrate
> + * disable count which is sufficient as long as the task stays on the CPU.
> + *
> + * When a migrate disabled task is scheduled out it can become subject to
> + * load balancing. To prevent this, update task::cpus_ptr to point to the
> + * current CPUs cpumask and set task::nr_cpus_allowed to 1.
> + *
> + * If task::cpus_ptr does not point to task::cpus_mask then the update has
> + * been done already. This check is also used in migrate_enable() as an
> + * indicator to restore task::cpus_ptr to point to task::cpus_mask
> + */
> +static inline void sched_migration_ctrl(struct task_struct *prev, int cpu)
> +{
> +     if (!prev->migration_ctrl.disable_cnt ||
> +         prev->cpus_ptr != &prev->cpus_mask)
> +             return;
> +
> +     prev->cpus_ptr = cpumask_of(cpu);
> +     update_nr_migratory(prev, -1);
> +     prev->nr_cpus_allowed = 1;
> +}

So this thing is called from schedule(), with only rq->lock held, and
that violates the locking rules for changing the affinity.

I have a comment that explains how it's broken and why it's sort-of
working.
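
For anyone following along, the rule being referred to, as far as I
understand it, is that the affinity fields may only be written with both
p->pi_lock and rq->lock held, and are stable for a reader holding either
one. Below is a minimal userspace sketch of just that rule. Every model_*
name is hypothetical and exists only for illustration; none of it is
kernel API, and the lock state is tracked with plain flags rather than
real locks.

```c
#include <stdbool.h>

/* Hypothetical userspace model of a task; not the kernel's task_struct. */
struct model_task {
	bool pi_lock_held;	/* stand-in for holding p->pi_lock */
	bool rq_lock_held;	/* stand-in for holding rq->lock */
	int  nr_cpus_allowed;	/* stand-in for the affinity state */
};

/* Assumed rule: a writer must hold both locks. */
static bool model_may_write_affinity(const struct model_task *t)
{
	return t->pi_lock_held && t->rq_lock_held;
}

/* Assumed rule: a reader sees a stable value with either lock held. */
static bool model_may_read_affinity(const struct model_task *t)
{
	return t->pi_lock_held || t->rq_lock_held;
}

/*
 * Change the affinity only if the rule is satisfied. Returns false and
 * leaves the task untouched when the caller holds only one of the locks.
 */
static bool model_set_affinity(struct model_task *t, int nr_cpus)
{
	if (!model_may_write_affinity(t))
		return false;
	t->nr_cpus_allowed = nr_cpus;
	return true;
}
```

In this model the quoted sched_migration_ctrl() corresponds to calling
model_set_affinity() with only rq_lock_held set, which is refused.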

> +void migrate_disable(void)
> +{
> +     unsigned long flags;
> +
> +     if (!current->migration_ctrl.disable_cnt) {
> +             raw_spin_lock_irqsave(&current->pi_lock, flags);
> +             current->migration_ctrl.disable_cnt++;
> +             raw_spin_unlock_irqrestore(&current->pi_lock, flags);
> +     } else {
> +             current->migration_ctrl.disable_cnt++;
> +     }
> +}

That pi_lock seems unfortunate, and it isn't obvious what the point of
it is.

> +void migrate_enable(void)
> +{
> +     struct task_migrate_data *pending;
> +     struct task_struct *p = current;
> +     struct rq_flags rf;
> +     struct rq *rq;
> +
> +     if (WARN_ON_ONCE(p->migration_ctrl.disable_cnt <= 0))
> +             return;
> +
> +     if (p->migration_ctrl.disable_cnt > 1) {
> +             p->migration_ctrl.disable_cnt--;
> +             return;
> +     }
> +
> +     raw_spin_lock_irqsave(&p->pi_lock, rf.flags);
> +     p->migration_ctrl.disable_cnt = 0;
> +     pending = p->migration_ctrl.pending;
> +     p->migration_ctrl.pending = NULL;
> +
> +     /*
> +      * If the task was never scheduled out while in the migrate
> +      * disabled region and there is no migration request pending,
> +      * return.
> +      */
> +     if (!pending && p->cpus_ptr == &p->cpus_mask) {
> +             raw_spin_unlock_irqrestore(&p->pi_lock, rf.flags);
> +             return;
> +     }
> +
> +     rq = __task_rq_lock(p, &rf);
> +     /* Was it scheduled out while in a migrate disabled region? */
> +     if (p->cpus_ptr != &p->cpus_mask) {
> +             /* Restore the tasks CPU mask and update the weight */
> +             p->cpus_ptr = &p->cpus_mask;
> +             p->nr_cpus_allowed = cpumask_weight(&p->cpus_mask);
> +             update_nr_migratory(p, 1);
> +     }
> +
> +     /* If no migration request is pending, no further action required. */
> +     if (!pending) {
> +             task_rq_unlock(rq, p, &rf);
> +             return;
> +     }
> +
> +     /* Migrate self to the requested target */
> +     pending->res = set_cpus_allowed_ptr_locked(p, pending->mask,
> +                                                pending->check, rq, &rf);
> +     complete(pending->done);
> +}

So, what I'm missing with all this are the design constraints for this
trainwreck. Because the 'sane' solution was having migrate_disable()
imply cpus_read_lock(). But that didn't fly because we can't have
migrate_disable() / migrate_enable() schedule for raisins.

And if I'm not mistaken, the above migrate_enable() *does* require being
able to schedule, and our favourite piece of futex:

        raw_spin_lock_irq(&q.pi_state->pi_mutex.wait_lock);
        spin_unlock(q.lock_ptr);

is broken. Consider that spin_unlock() doing migrate_enable() with a
pending sched_setaffinity().
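
To make the scenario concrete, here is a small userspace sketch of that
sequence. It rests on two assumptions that should be read as such: that
spin_lock()/spin_unlock() imply migrate_disable()/migrate_enable() (as on
RT), and that serving a pending affinity request in migrate_enable()
needs to schedule, which is what the paragraph above suspects rather than
something established. All model_* names are hypothetical.

```c
#include <stdbool.h>

/* Hypothetical userspace model; every name here is illustrative only. */
struct model_ctx {
	int  atomic_depth;		/* > 0 while a raw spinlock is held, IRQs off */
	int  migrate_disable_cnt;	/* nesting depth of migrate_disable() */
	bool affinity_pending;		/* a sched_setaffinity() request is waiting */
	bool scheduled_while_atomic;	/* model would have had to block while atomic */
};

static void model_migrate_disable(struct model_ctx *c)
{
	c->migrate_disable_cnt++;
}

static void model_migrate_enable(struct model_ctx *c)
{
	/* Nested section: only drop the count. */
	if (--c->migrate_disable_cnt > 0)
		return;
	if (!c->affinity_pending)
		return;
	/* Assumption: serving the pending request requires scheduling. */
	if (c->atomic_depth > 0)
		c->scheduled_while_atomic = true;
	c->affinity_pending = false;
}

/* Assumption: the sleeping spinlock implies migrate_disable/enable. */
static void model_spin_lock(struct model_ctx *c)
{
	model_migrate_disable(c);
}

static void model_spin_unlock(struct model_ctx *c)
{
	model_migrate_enable(c);
}

/* A raw spinlock taken with IRQs off puts the model in atomic context. */
static void model_raw_spin_lock_irq(struct model_ctx *c)
{
	c->atomic_depth++;
}

static void model_raw_spin_unlock_irq(struct model_ctx *c)
{
	c->atomic_depth--;
}
```

Taking model_spin_lock(), setting affinity_pending, then calling
model_raw_spin_lock_irq() followed by model_spin_unlock() sets
scheduled_while_atomic, which is the futex ordering quoted above. Doing
the unlock without the raw lock held leaves the flag clear.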

Let me ponder this more..
