On Thu, May 21, 2020 at 11:38:16AM +0100, Mel Gorman wrote:
> IIUC, this patch front-loads as much work as possible before checking if
> the task is on_rq and then the waker/wakee shares a cache, queue task on
> the wake_list and otherwise do a direct wakeup.
> 
> The advantage is that spinning is avoided on p->on_rq when p does not
> share a cache. The disadvantage is that it may result in tasks being
> stacked but this should only happen when the domain is overloaded and
> select_task_rq() is unlikely to find an idle CPU. The load balancer would
> soon correct the situation anyway.
> 
> In terms of netperf for my testing, the benefit is marginal because the
> wakeups are primarily between tasks that share cache. It does trigger as
> perf indicates that some time is spent in ttwu_queue_remote with this
> patch, it's just that the overall time spent spinning on p->on_rq is
> very similar. I'm still waiting on other workloads to complete to see
> what the impact is.

So it might make sense to play with the exact conditions under which
we'll attempt this remote queue. If we see a large 'local' p->on_cpu
spin time, it might make sense to attempt the queue even in this case.

We could for example change it to:

        if (READ_ONCE(p->on_cpu) &&
            ttwu_queue_remote(p, cpu, wake_flags | WF_ON_CPU))
                goto unlock;

and then use that in ttwu_queue_remote() to differentiate between these
two cases.
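
As a rough illustration of that differentiation, here is a userspace model of
the queueing decision only. It is a sketch, not kernel code: the WF_ON_CPU
value, the llc_id[] table standing in for per_cpu(sd_llc_id, cpu), and the
helper name ttwu_should_queue() are all assumptions made for the example, and
the sched_feat(TTWU_QUEUE) check is left out.

```c
#include <assert.h>
#include <stdbool.h>

#define NR_CPUS   4
#define WF_ON_CPU 0x08	/* hypothetical flag value, for illustration only */

/* Stand-in for per_cpu(sd_llc_id, cpu): CPUs 0-1 share one LLC, CPUs 2-3 another. */
static const int llc_id[NR_CPUS] = { 0, 0, 2, 2 };

static bool cpus_share_cache(int this_cpu, int that_cpu)
{
	assert(this_cpu >= 0 && this_cpu < NR_CPUS);
	assert(that_cpu >= 0 && that_cpu < NR_CPUS);
	return llc_id[this_cpu] == llc_id[that_cpu];
}

/*
 * Model of the decision only: returns true when the wakeup would be handed
 * to the target CPU's wake_list instead of being completed by the waker.
 */
static bool ttwu_should_queue(int this_cpu, int cpu, int wake_flags)
{
	/* Existing condition: waker and wakee do not share a cache. */
	if (!cpus_share_cache(this_cpu, cpu))
		return true;

	/* Proposed extension: wakee is still on_cpu, so queue even when local. */
	if (wake_flags & WF_ON_CPU)
		return true;

	return false;
}
```

With this model, a wakeup from CPU 0 to CPU 2 is always queued, while a wakeup
from CPU 0 to CPU 1 is queued only when WF_ON_CPU is passed.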

Anyway, if it's a wash (atomic op vs spinning) then it's probably not
worth it.

Another optimization might be to forgo the IPI entirely in this case and
instead stick a sched_ttwu_pending() at the end of __schedule() or
something like that.

> However, intuitively at least, it makes sense to avoid spinning on p->on_rq
> when it's unnecessary and the other changes appear to be safe.  Even if
> wake_list should be used in some cases for local wakeups, it would make
> sense to put that on top of this patch. Do you want to slap a changelog
> around this and update the comments or do you want me to do it? I should
> have more results in a few hours even if they are limited to one machine
> but ideally Rik would test his workload too.

I've written you a Changelog, but please carry it in your set to
evaluate if it's actually worth it.

---
Subject: sched: Optimize ttwu() spinning on p->on_cpu
From: Peter Zijlstra <[email protected]>
Date: Fri, 15 May 2020 16:24:44 +0200

Both Rik and Mel reported seeing ttwu() spend significant time on:

  smp_cond_load_acquire(&p->on_cpu, !VAL);

Attempt to avoid this by queueing the wakeup on the CPU that owns the
p->on_cpu value. This will then allow the ttwu() to complete without
further waiting.

Since we run schedule() with interrupts disabled, the IPI is
guaranteed to happen after p->on_cpu is cleared; this is what makes it
safe to queue early.

Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
---
 kernel/sched/core.c |   45 ++++++++++++++++++++++++---------------------
 1 file changed, 24 insertions(+), 21 deletions(-)

--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2312,7 +2312,7 @@ static void wake_csd_func(void *info)
        sched_ttwu_pending();
 }
 
-static void ttwu_queue_remote(struct task_struct *p, int cpu, int wake_flags)
+static void __ttwu_queue_remote(struct task_struct *p, int cpu, int wake_flags)
 {
        struct rq *rq = cpu_rq(cpu);
 
@@ -2354,6 +2354,17 @@ bool cpus_share_cache(int this_cpu, int
 {
        return per_cpu(sd_llc_id, this_cpu) == per_cpu(sd_llc_id, that_cpu);
 }
+
+static bool ttwu_queue_remote(struct task_struct *p, int cpu, int wake_flags)
+{
+       if (sched_feat(TTWU_QUEUE) && !cpus_share_cache(smp_processor_id(), cpu)) {
+               sched_clock_cpu(cpu); /* Sync clocks across CPUs */
+               __ttwu_queue_remote(p, cpu, wake_flags);
+               return true;
+       }
+
+       return false;
+}
 #endif /* CONFIG_SMP */
 
 static void ttwu_queue(struct task_struct *p, int cpu, int wake_flags)
@@ -2362,11 +2373,8 @@ static void ttwu_queue(struct task_struc
        struct rq_flags rf;
 
 #if defined(CONFIG_SMP)
-       if (sched_feat(TTWU_QUEUE) && !cpus_share_cache(smp_processor_id(), cpu)) {
-               sched_clock_cpu(cpu); /* Sync clocks across CPUs */
-               ttwu_queue_remote(p, cpu, wake_flags);
+       if (ttwu_queue_remote(p, cpu, wake_flags))
                return;
-       }
 #endif
 
        rq_lock(rq, &rf);
@@ -2550,7 +2558,15 @@ try_to_wake_up(struct task_struct *p, un
        if (p->on_rq && ttwu_remote(p, wake_flags))
                goto unlock;
 
+       if (p->in_iowait) {
+               delayacct_blkio_end(p);
+               atomic_dec(&task_rq(p)->nr_iowait);
+       }
+
 #ifdef CONFIG_SMP
+       p->sched_contributes_to_load = !!task_contributes_to_load(p);
+       p->state = TASK_WAKING;
+
        /*
         * Ensure we load p->on_cpu _after_ p->on_rq, otherwise it would be
         * possible to, falsely, observe p->on_cpu == 0.
@@ -2581,15 +2597,10 @@ try_to_wake_up(struct task_struct *p, un
         * This ensures that tasks getting woken will be fully ordered against
         * their previous state and preserve Program Order.
         */
-       smp_cond_load_acquire(&p->on_cpu, !VAL);
-
-       p->sched_contributes_to_load = !!task_contributes_to_load(p);
-       p->state = TASK_WAKING;
+       if (READ_ONCE(p->on_cpu) && ttwu_queue_remote(p, cpu, wake_flags))
+               goto unlock;
 
-       if (p->in_iowait) {
-               delayacct_blkio_end(p);
-               atomic_dec(&task_rq(p)->nr_iowait);
-       }
+       smp_cond_load_acquire(&p->on_cpu, !VAL);
 
        cpu = select_task_rq(p, p->wake_cpu, SD_BALANCE_WAKE, wake_flags);
        if (task_cpu(p) != cpu) {
@@ -2597,14 +2608,6 @@ try_to_wake_up(struct task_struct *p, un
                psi_ttwu_dequeue(p);
                set_task_cpu(p, cpu);
        }
-
-#else /* CONFIG_SMP */
-
-       if (p->in_iowait) {
-               delayacct_blkio_end(p);
-               atomic_dec(&task_rq(p)->nr_iowait);
-       }
-
 #endif /* CONFIG_SMP */
 
        ttwu_queue(p, cpu, wake_flags);
