With commit 247f2f6f3c70 ("sched/core: Don't schedule threads on pre-empted
vCPUs"), the scheduler avoids placing tasks on preempted vCPUs at wakeup.
This leads to a wrong choice of CPU, which in turn leads to larger wakeup
latencies and, eventually, to performance regressions in latency-sensitive
benchmarks like soltp, schbench etc.

On PowerPC, vcpu_is_preempted() only looks at yield_count. If
yield_count is odd, the vCPU is assumed to be preempted. However,
yield_count is also incremented whenever the LPAR enters CEDE state, so
any CPU that has entered CEDE state is assumed to be preempted.

Even if a vCPU of a dedicated LPAR is preempted/donated, the LPAR should
have right of first use, since it is supposed to own the vCPU. So report
a vCPU as preempted only on shared-processor LPARs.
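
As a rough illustration of the parity check and of the gating done in the
diff below, here is a minimal userspace sketch. It is not kernel code: the
model_* names and both helper functions are hypothetical, the counter is kept
host-endian (the real lppaca field is big-endian), a plain bool stands in for
the shared_processor static key, and the assumption that resuming bumps the
counter back to even is part of the model only.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the per-vCPU area holding yield_count. */
struct model_vcpu {
	uint32_t yield_count;	/* host-endian in this model */
};

/* Stand-in for the shared_processor key; false models a dedicated LPAR. */
static bool model_shared_processor;

/* Model a vCPU entering CEDE: the counter is bumped and becomes odd. */
static void model_cede(struct model_vcpu *v)
{
	assert(v != NULL);
	v->yield_count++;
}

/* Model assumption: resuming bumps the counter again, back to even. */
static void model_resume(struct model_vcpu *v)
{
	assert(v != NULL);
	v->yield_count++;
}

/* Parity-only check: an odd counter is read as "preempted". */
static bool parity_only_is_preempted(const struct model_vcpu *v)
{
	return (v->yield_count & 1) != 0;
}

/* Gated check: only trust the parity when the LPAR is shared. */
static bool gated_is_preempted(const struct model_vcpu *v)
{
	if (!model_shared_processor)
		return false;
	return (v->yield_count & 1) != 0;
}
```

In this model, a vCPU that has merely ceded looks preempted to the
parity-only check, while the gated check reports it as preempted only when
model_shared_processor is set.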

On a Power9 System with 32 cores
 # lscpu
Architecture:        ppc64le
Byte Order:          Little Endian
CPU(s):              128
On-line CPU(s) list: 0-127
Thread(s) per core:  8
Core(s) per socket:  1
Socket(s):           16
NUMA node(s):        2
Model:               2.2 (pvr 004e 0202)
Model name:          POWER9 (architected), altivec supported
Hypervisor vendor:   pHyp
Virtualization type: para
L1d cache:           32K
L1i cache:           32K
L2 cache:            512K
L3 cache:            10240K
NUMA node0 CPU(s):   0-63
NUMA node1 CPU(s):   64-127
 

  # perf stat -a -r 5 ./schbench
v5.4                                            v5.4 + patch
Latency percentiles (usec)                      Latency percentiles (usec)
        50.0000th: 47                                   50.0000th: 33
        75.0000th: 64                                   75.0000th: 44
        90.0000th: 76                                   90.0000th: 50
        95.0000th: 83                                   95.0000th: 53
        *99.0000th: 103                                 *99.0000th: 57
        99.5000th: 2124                                 99.5000th: 59
        99.9000th: 7976                                 99.9000th: 83
        min=0, max=10519                                min=0, max=117
Latency percentiles (usec)                      Latency percentiles (usec)
        50.0000th: 45                                   50.0000th: 34
        75.0000th: 61                                   75.0000th: 45
        90.0000th: 70                                   90.0000th: 52
        95.0000th: 77                                   95.0000th: 56
        *99.0000th: 504                                 *99.0000th: 62
        99.5000th: 4012                                 99.5000th: 64
        99.9000th: 8168                                 99.9000th: 79
        min=0, max=14500                                min=0, max=123
Latency percentiles (usec)                      Latency percentiles (usec)
        50.0000th: 48                                   50.0000th: 35
        75.0000th: 65                                   75.0000th: 47
        90.0000th: 76                                   90.0000th: 55
        95.0000th: 82                                   95.0000th: 59
        *99.0000th: 1098                                *99.0000th: 67
        99.5000th: 3988                                 99.5000th: 71
        99.9000th: 9360                                 99.9000th: 98
        min=0, max=19283                                min=0, max=137
Latency percentiles (usec)                      Latency percentiles (usec)
        50.0000th: 46                                   50.0000th: 35
        75.0000th: 63                                   75.0000th: 46
        90.0000th: 73                                   90.0000th: 53
        95.0000th: 78                                   95.0000th: 57
        *99.0000th: 113                                 *99.0000th: 63
        99.5000th: 2316                                 99.5000th: 65
        99.9000th: 7704                                 99.9000th: 83
        min=0, max=17976                                min=0, max=139
Latency percentiles (usec)                      Latency percentiles (usec)
        50.0000th: 46                                   50.0000th: 34
        75.0000th: 62                                   75.0000th: 46
        90.0000th: 73                                   90.0000th: 53
        95.0000th: 79                                   95.0000th: 57
        *99.0000th: 97                                  *99.0000th: 64
        99.5000th: 1398                                 99.5000th: 70
        99.9000th: 8136                                 99.9000th: 100
        min=0, max=10008                                min=0, max=142

Performance counter stats for 'system wide' (5 runs):

context-switches       42,604 ( +-  0.87% )       45,397 ( +-  0.25% )
cpu-migrations          0,195 ( +-  2.70% )          230 ( +-  7.23% )
page-faults            16,783 ( +- 14.87% )       16,781 ( +-  9.77% )

Waiman Long suggested using static_keys.

Reported-by: Parth Shah <pa...@linux.ibm.com>
Reported-by: Ihor Pasichnyk <ihor.pasich...@ibm.com>
Cc: Parth Shah <pa...@linux.ibm.com>
Cc: Ihor Pasichnyk <ihor.pasich...@ibm.com>
Cc: Juri Lelli <juri.le...@redhat.com>
Cc: Waiman Long <long...@redhat.com>
Signed-off-by: Srikar Dronamraju <sri...@linux.vnet.ibm.com>
---
 arch/powerpc/include/asm/spinlock.h | 5 +++--
 arch/powerpc/mm/numa.c              | 4 ++++
 2 files changed, 7 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/include/asm/spinlock.h b/arch/powerpc/include/asm/spinlock.h
index e9a960e28f3c..866f6ca0427a 100644
--- a/arch/powerpc/include/asm/spinlock.h
+++ b/arch/powerpc/include/asm/spinlock.h
@@ -35,11 +35,12 @@
 #define LOCK_TOKEN     1
 #endif
 
-#ifdef CONFIG_PPC_PSERIES
+#if defined(CONFIG_PPC_PSERIES) && defined(CONFIG_PPC_SPLPAR)
+DECLARE_STATIC_KEY_FALSE(shared_processor);
 #define vcpu_is_preempted vcpu_is_preempted
 static inline bool vcpu_is_preempted(int cpu)
 {
-       if (!firmware_has_feature(FW_FEATURE_SPLPAR))
+       if (!static_branch_unlikely(&shared_processor))
                return false;
        return !!(be32_to_cpu(lppaca_of(cpu).yield_count) & 1);
 }
diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
index 50d68d21ddcc..ffb971f3a63c 100644
--- a/arch/powerpc/mm/numa.c
+++ b/arch/powerpc/mm/numa.c
@@ -1568,9 +1568,13 @@ int prrn_is_enabled(void)
        return prrn_enabled;
 }
 
+DEFINE_STATIC_KEY_FALSE(shared_processor);
+EXPORT_SYMBOL_GPL(shared_processor);
+
 void __init shared_proc_topology_init(void)
 {
        if (lppaca_shared_proc(get_lppaca())) {
+               static_branch_enable(&shared_processor);
                bitmap_fill(cpumask_bits(&cpu_associativity_changes_mask),
                            nr_cpumask_bits);
                numa_update_cpu_topology(false);
-- 
2.18.1
