Re: [PATCH 00/13] Reconcile NUMA balancing decisions with the load balancer v6

Mel Gorman Fri, 15 May 2020 01:48:27 -0700

On Thu, May 14, 2020 at 05:31:22PM +0200, Peter Zijlstra wrote:
> On Wed, May 13, 2020 at 04:30:23PM +0100, Mel Gorman wrote:
> > Complete shot in the dark but restore adjust_numa_imbalance() and try
> > this
> > 
> > diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> > index 1a9983da4408..0b31f4468d5b 100644
> > --- a/kernel/sched/core.c
> > +++ b/kernel/sched/core.c
> > @@ -2393,7 +2393,7 @@ static void ttwu_queue(struct task_struct *p, int 
> > cpu, int wake_flags)
> >     struct rq_flags rf;
> >  
> >  #if defined(CONFIG_SMP)
> > -   if (sched_feat(TTWU_QUEUE) && !cpus_share_cache(smp_processor_id(), 
> > cpu)) {
> > +   if (sched_feat(TTWU_QUEUE)) {
> 
> just saying that this has the risk of regressing other workloads, see:
> 
>   518cd6234178 ("sched: Only queue remote wakeups when crossing cache 
> boundaries")
>


Sure, I didn't really think that this was appropriate but it was the best I
had at the time. This can be a lot more targetted but it took me a while
to put together a picture as it was difficult for me to analyse. I was
going to post this as a seperate RFC but given this thread, what do you
think of this?

---8<---
sched: Wake cache-local tasks via wake_list if wakee CPU is polling

There are two separate method for waking a task from idle, one for
tasks running on CPUs that share a cache and one for when the caches are
separate. The methods can loosely be called local and remote even though
this is not directly related to NUMA and instead is due to the expected
cost of accessing data that is cache-hot on another CPU. The "local" costs
are expected to be relatively cheap as they share the LLC in comparison to
a remote IPI that is potentially required when using the "remote" wakeup.
The problem is that the local wakeup is not always cheaper and it appears
to have degraded even further around the 4.19 mark.

There appears to be a couple of reasons why it can be slower.

The first is specific to pairs of tasks where one or both rapidly enters
idle. For example, on netperf UDP_STREAM, the client sends a bunch of
packets, wakes the server, the server processes some packets and goes
back to sleep. There is a relationship between the tasks but it's not
strictly synchronous. The timing is different if the client/server are on
separate NUMA nodes and netserver is more likely to enter idle (measured
as server entering idle 10% more times when tasks are local vs remote
but machine-specific). However, the wakeups are so rapid that the wakeup
happens while the server is descheduling. That forces the waker to spin
on smp_cond_load_acquire for longer. In this case, it can be cheaper to
add the task to the rq->wake_list even if that potentially requires an IPI.

The second is that the local wakeup path is simply not always
that fast. Using ftrace, the cost of the locks, update_rq_clock and
ttwu_do_activate was measured as roughly 4.5 microseconds. While it's
a single instance, the cost of the "remote" wakeup for try_to_wake_up
was roughly 2.5 microseconds versus 6.2 microseconds for the "fast" local
wakeup. When there are tens of thousands of wakeups per second, these costs
accumulate and manifest as a throughput regression in netperf UDP_STREAM.

The essential difference in costs comes down to whether the CPU is fully
idle, a task is descheduling or polling in poll_idle(). This patch
special cases ttwu_queue() to use the "remote" method if the CPUs
task is polling as it's generally cheaper to use the wake_list in that
case than the local method because an IPI should not be required. As it is
race-prone, a reschedule IPI may still be sent but in that case the local
wakeup would also have to send a reschedule IPI so it should be harmless.

The benefit is not universal as it requires the specific pattern of wakeups
that happen when the wakee is likely to be descheduling. netperf UDP_STREAM
is one of the more obvious examples and it is more obvious since the
load balancer was reconciled and the client and server are more likely to
be running on the same NUMA node which is why it was missed for so long.

netperf-udp
                                  5.7.0-rc5              5.7.0-rc5
                                    vanilla          wakelist-v1r1
Hmean     send-64         211.25 (   0.00%)      228.61 *   8.22%*
Hmean     send-128        413.49 (   0.00%)      458.00 *  10.77%*
Hmean     send-256        786.31 (   0.00%)      879.30 *  11.83%*
Hmean     send-1024      3055.75 (   0.00%)     3429.45 *  12.23%*
Hmean     send-2048      6052.79 (   0.00%)     6878.99 *  13.65%*
Hmean     send-3312      9208.09 (   0.00%)    10832.19 *  17.64%*
Hmean     send-4096     11268.45 (   0.00%)    12968.39 *  15.09%*
Hmean     send-8192     17943.12 (   0.00%)    19886.07 *  10.83%*
Hmean     send-16384    29732.94 (   0.00%)    32297.44 *   8.63%*
Hmean     recv-64         211.25 (   0.00%)      228.61 *   8.22%*
Hmean     recv-128        413.49 (   0.00%)      458.00 *  10.77%*
Hmean     recv-256        786.31 (   0.00%)      879.30 *  11.83%*
Hmean     recv-1024      3055.75 (   0.00%)     3429.45 *  12.23%*
Hmean     recv-2048      6052.79 (   0.00%)     6878.99 *  13.65%*
Hmean     recv-3312      9208.09 (   0.00%)    10832.19 *  17.64%*
Hmean     recv-4096     11268.45 (   0.00%)    12968.39 *  15.09%*
Hmean     recv-8192     17943.12 (   0.00%)    19886.06 *  10.83%*
Hmean     recv-16384    29731.75 (   0.00%)    32296.08 *   8.62%*

It's less obvious on something like TCP_STREAM as there is a stricter
relationship between the client and server but with the patch, it's
much less variable

netperf-tcp
                             5.7.0-rc5              5.7.0-rc5
                               vanilla          wakelist-v1r1
Hmean     64         769.28 (   0.00%)      779.96 *   1.39%*
Hmean     128       1497.54 (   0.00%)     1509.33 *   0.79%*
Hmean     256       2806.00 (   0.00%)     2787.59 (  -0.66%)
Hmean     1024      9700.80 (   0.00%)     9720.02 (   0.20%)
Hmean     2048     17071.28 (   0.00%)    16968.92 (  -0.60%)
Hmean     3312     22976.20 (   0.00%)    23012.80 (   0.16%)
Hmean     4096     25885.74 (   0.00%)    26152.49 *   1.03%*
Hmean     8192     34014.83 (   0.00%)    33878.67 (  -0.40%)
Hmean     16384    39482.14 (   0.00%)    40307.22 (   2.09%)
Stddev    64           2.60 (   0.00%)        0.17 (  93.31%)
Stddev    128          3.69 (   0.00%)        1.52 (  58.97%)
Stddev    256         31.06 (   0.00%)       14.76 (  52.47%)
Stddev    1024        96.50 (   0.00%)       46.43 (  51.89%)
Stddev    2048       115.98 (   0.00%)       62.59 (  46.03%)
Stddev    3312       293.28 (   0.00%)       41.92 (  85.71%)
Stddev    4096       173.45 (   0.00%)      123.19 (  28.98%)
Stddev    8192       783.59 (   0.00%)      129.62 (  83.46%)
Stddev    16384     1012.37 (   0.00%)      252.08 (  75.10%)

Signed-off-by: Mel Gorman <[email protected]>
---
 kernel/sched/core.c | 21 ++++++++++++++++++++-
 1 file changed, 20 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 9a2fbf98fd6f..59077c7c6660 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2380,13 +2380,32 @@ static void ttwu_queue(struct task_struct *p, int cpu, 
int wake_flags)
        struct rq_flags rf;
 
 #if defined(CONFIG_SMP)
-       if (sched_feat(TTWU_QUEUE) && !cpus_share_cache(smp_processor_id(), 
cpu)) {
+       if (sched_feat(TTWU_QUEUE)) {
+               /*
+                * A remote wakeup is often expensive as can require
+                * an IPI and the wakeup path is slow. However, in
+                * the specific case where the target CPU is idle
+                * and polling, the CPU is active and rapidly checking
+                * if a reschedule is needed. In this case, the idle
+                * task just needs to be marked for resched and p
+                * will rapidly be requeued which is less expensive
+                * than the direct wakeup path.
+                */
+               if (cpus_share_cache(smp_processor_id(), cpu)) {
+                       struct thread_info *ti = task_thread_info(p);
+                       typeof(ti->flags) val = READ_ONCE(ti->flags);
+
+                       if (val & _TIF_POLLING_NRFLAG)
+                               goto activate;
+               }
+
                sched_clock_cpu(cpu); /* Sync clocks across CPUs */
                ttwu_queue_remote(p, cpu, wake_flags);
                return;
        }
 #endif
 
+activate:
        rq_lock(rq, &rf);
        update_rq_clock(rq);
        ttwu_do_activate(rq, p, wake_flags, &rf);

Re: [PATCH 00/13] Reconcile NUMA balancing decisions with the load balancer v6

Reply via email to