On Wed, Dec 09, 2020 at 02:24:04PM +0800, Aubrey Li wrote:
> Add idle cpumask to track idle cpus in sched domain. Every time
> a CPU enters idle, the CPU is set in idle cpumask to be a wakeup
> target. And if the CPU is not in idle, the CPU is cleared in idle
> cpumask during scheduler tick to ratelimit idle cpumask update.
> 
> When a task wakes up to select an idle cpu, scanning idle cpumask
> has lower cost than scanning all the cpus in last level cache domain,
> especially when the system is heavily loaded.
> 
> Benchmarks including hackbench, schbench, uperf, sysbench mysql
> and kbuild were tested on a x86 4 socket system with 24 cores per
> socket and 2 hyperthreads per core, total 192 CPUs, no regression
> found.
> 

I ran this patch with tbench on top of the schedstat patches that
track SIS efficiency. The tracking adds overhead so it's not a perfect
performance comparison, but the expectation would be that the patch reduces
the number of runqueues that are scanned.

tbench4
                          5.10.0-rc6             5.10.0-rc6
                      schedstat-v1r1          idlemask-v7r1
Hmean     1        504.76 (   0.00%)      500.14 *  -0.91%*
Hmean     2       1001.22 (   0.00%)      970.37 *  -3.08%*
Hmean     4       1930.56 (   0.00%)     1880.96 *  -2.57%*
Hmean     8       3688.05 (   0.00%)     3537.72 *  -4.08%*
Hmean     16      6352.71 (   0.00%)     6439.53 *   1.37%*
Hmean     32     10066.37 (   0.00%)    10124.65 *   0.58%*
Hmean     64     12846.32 (   0.00%)    11627.27 *  -9.49%*
Hmean     128    22278.41 (   0.00%)    22304.33 *   0.12%*
Hmean     256    21455.52 (   0.00%)    20900.13 *  -2.59%*
Hmean     320    21802.38 (   0.00%)    21928.81 *   0.58%*

Not a very encouraging result. The schedstats indicate:

                                5.10.0-rc6     5.10.0-rc6
                            schedstat-v1r1  idlemask-v7r1
Ops TTWU Count               5599714302.00  5589495123.00
Ops TTWU Local               2687713250.00  2563662550.00
Ops SIS Search               5596677950.00  5586381168.00
Ops SIS Domain Search        3268344934.00  3229088045.00
Ops SIS Scanned             15909069113.00 16568899405.00
Ops SIS Domain Scanned      13580736097.00 14211606282.00
Ops SIS Failures             2944874939.00  2843113421.00
Ops SIS Core Search           262853975.00   311781774.00
Ops SIS Core Hit              185189656.00   216097102.00
Ops SIS Core Miss              77664319.00    95684672.00
Ops SIS Recent Used Hit       124265515.00   146021086.00
Ops SIS Recent Used Miss      338142547.00   403547579.00
Ops SIS Recent Attempts       462408062.00   549568665.00
Ops SIS Search Efficiency            35.18          33.72
Ops SIS Domain Search Eff            24.07          22.72
Ops SIS Fast Success Rate            41.60          42.20
Ops SIS Success Rate                 47.38          49.11
Ops SIS Recent Success Rate          26.87          26.57

The field I would expect to decrease is SIS Domain Scanned -- the number
of runqueues that were examined -- but it's actually worse, and graphing over
time shows it's worse for the client thread counts.  select_idle_cpu()
is definitely being called because "Domain Search" is 10 times higher than
"Core Search" and the "Core Miss" count is non-zero.

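As a rough cross-check of the table, the efficiency rows look like
plain ratios of the counters above them. This is only arithmetic on
the posted numbers: the helper names are made up for illustration, and
the reading that "Domain Search Eff" is Domain Search divided by Domain
Scanned is inferred from the figures, not taken from the schedstat
patches.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: `part` as a percentage of `whole`. */
static double pct(uint64_t part, uint64_t whole)
{
	assert(whole != 0);
	return 100.0 * (double)part / (double)whole;
}

/* Hypothetical helper: relative change from `base` to `val`, in percent. */
static double pct_change(uint64_t base, uint64_t val)
{
	assert(base != 0);
	return 100.0 * ((double)val - (double)base) / (double)base;
}
```

pct(3268344934, 13580736097) comes out at about 24.07 for the baseline
and pct(3229088045, 14211606282) at about 22.72 with the idle mask,
matching the "Domain Search Eff" row. pct_change() on SIS Domain Scanned
is roughly +4.6%, i.e. more runqueues examined for slightly fewer
searches.
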
I suspect the issue is that the mask is only marked busy from the tick
context, which is a very wide window. If select_idle_cpu() picks an idle
CPU from the mask, that CPU is still marked as idle in the mask until its
next tick.
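
To illustrate the suspicion, here is a toy userspace model, not the
patch itself. Everything in it (struct toy_domain, the toy_* functions,
the 8-CPU size) is hypothetical and only mimics the behaviour described
above: a CPU is set in the mask on idle entry, wakeups scan the mask
without updating it, and only the tick clears a busy CPU.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define NR_CPUS 8	/* toy size, not the 192-CPU test machine */

struct toy_domain {
	uint64_t idle_mask;	/* bit set: CPU is believed to be idle */
	bool busy[NR_CPUS];	/* ground truth: CPU is running a task */
};

/* Idle entry: the CPU becomes a wakeup target immediately. */
static void toy_enter_idle(struct toy_domain *sd, int cpu)
{
	assert(cpu >= 0 && cpu < NR_CPUS);
	sd->busy[cpu] = false;
	sd->idle_mask |= 1ULL << cpu;
}

/*
 * Wakeup: scan the CPUs set in the mask and take the first one that is
 * really idle. *scanned counts the runqueues examined. The mask is
 * deliberately left untouched, so the chosen CPU stays marked idle.
 */
static int toy_select_idle_cpu(struct toy_domain *sd, unsigned int *scanned)
{
	for (int cpu = 0; cpu < NR_CPUS; cpu++) {
		if (!(sd->idle_mask & (1ULL << cpu)))
			continue;
		(*scanned)++;
		if (!sd->busy[cpu]) {
			sd->busy[cpu] = true;
			return cpu;
		}
	}
	return -1;
}

/* Tick: the only place where a busy CPU is cleared from the mask. */
static void toy_tick(struct toy_domain *sd, int cpu)
{
	assert(cpu >= 0 && cpu < NR_CPUS);
	if (sd->busy[cpu])
		sd->idle_mask &= ~(1ULL << cpu);
}
```

In this model, with all 8 CPUs idle and 8 back-to-back wakeups before
any tick, the scans examine 1 + 2 + ... + 8 = 36 runqueues and the mask
still has every bit set afterwards, so a ninth wakeup examines all 8
and fails. Only after toy_tick() has run on each CPU does the mask
reflect reality.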

-- 
Mel Gorman
SUSE Labs
