With commit 247f2f6f3c70 ("sched/core: Don't schedule threads on pre-empted
vCPUs"), the scheduler avoids scheduling tasks on preempted vCPUs at wakeup.
If a vCPU is wrongly reported as preempted, this leads to a wrong choice of
CPU, which in turn leads to larger wakeup latencies and, eventually, to
performance regressions in latency-sensitive benchmarks such as soltp and
schbench.

On powerpc, vcpu_is_preempted() only looks at yield_count: if the
yield_count is odd, the vCPU is assumed to be preempted. However,
yield_count is incremented whenever the LPAR enters CEDE state. So any
CPU that has entered CEDE state (i.e. gone idle) is also assumed to be
preempted.
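
For illustration, a minimal sketch of the pre-patch check (it mirrors the
code being replaced in the diff below; lppaca_of(), be32_to_cpu() and
firmware_has_feature() are existing kernel helpers):

static inline bool vcpu_is_preempted(int cpu)
{
	if (!firmware_has_feature(FW_FEATURE_SPLPAR))
		return false;
	/*
	 * Odd yield_count parity only says the vCPU is not currently
	 * dispatched; CEDE (idle) flips it just like a real preemption.
	 */
	return !!(be32_to_cpu(lppaca_of(cpu).yield_count) & 1);
}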

Even if a vCPU of a dedicated LPAR is preempted/donated, it should have
right of first use, since dedicated LPARs are supposed to own their
vCPUs.

On a Power9 system with 32 cores:
 # lscpu
Architecture:        ppc64le
Byte Order:          Little Endian
CPU(s):              128
On-line CPU(s) list: 0-127
Thread(s) per core:  8
Core(s) per socket:  1
Socket(s):           16
NUMA node(s):        2
Model:               2.2 (pvr 004e 0202)
Model name:          POWER9 (architected), altivec supported
Hypervisor vendor:   pHyp
Virtualization type: para
L1d cache:           32K
L1i cache:           32K
L2 cache:            512K
L3 cache:            10240K
NUMA node0 CPU(s):   0-63
NUMA node1 CPU(s):   64-127

  # perf stat -a -r 5 ./schbench
v5.4                                     v5.4 + patch
Latency percentiles (usec)               Latency percentiles (usec)
        50.0000th: 45                           50.0000th: 39
        75.0000th: 62                           75.0000th: 53
        90.0000th: 71                           90.0000th: 67
        95.0000th: 77                           95.0000th: 76
        *99.0000th: 91                          *99.0000th: 89
        99.5000th: 707                          99.5000th: 93
        99.9000th: 6920                         99.9000th: 118
        min=0, max=10048                        min=0, max=211
Latency percentiles (usec)               Latency percentiles (usec)
        50.0000th: 45                           50.0000th: 34
        75.0000th: 61                           75.0000th: 45
        90.0000th: 72                           90.0000th: 53
        95.0000th: 79                           95.0000th: 56
        *99.0000th: 691                         *99.0000th: 61
        99.5000th: 3972                         99.5000th: 63
        99.9000th: 8368                         99.9000th: 78
        min=0, max=16606                        min=0, max=228
Latency percentiles (usec)               Latency percentiles (usec)
        50.0000th: 45                           50.0000th: 34
        75.0000th: 61                           75.0000th: 45
        90.0000th: 71                           90.0000th: 53
        95.0000th: 77                           95.0000th: 57
        *99.0000th: 106                         *99.0000th: 63
        99.5000th: 2364                         99.5000th: 68
        99.9000th: 7480                         99.9000th: 100
        min=0, max=10001                        min=0, max=134
Latency percentiles (usec)               Latency percentiles (usec)
        50.0000th: 45                           50.0000th: 34
        75.0000th: 62                           75.0000th: 46
        90.0000th: 72                           90.0000th: 53
        95.0000th: 78                           95.0000th: 56
        *99.0000th: 93                          *99.0000th: 61
        99.5000th: 108                          99.5000th: 64
        99.9000th: 6792                         99.9000th: 85
        min=0, max=17681                        min=0, max=121
Latency percentiles (usec)               Latency percentiles (usec)
        50.0000th: 46                           50.0000th: 33
        75.0000th: 62                           75.0000th: 44
        90.0000th: 73                           90.0000th: 51
        95.0000th: 79                           95.0000th: 54
        *99.0000th: 113                         *99.0000th: 61
        99.5000th: 2724                         99.5000th: 64
        99.9000th: 6184                         99.9000th: 82
        min=0, max=9887                         min=0, max=121

 Performance counter stats for 'system wide' (5 runs):

                      v5.4                    v5.4 + patch
context-switches    43,373  ( +-  0.40% )   44,597 ( +-  0.55% )
cpu-migrations       1,211  ( +-  5.04% )      220 ( +-  6.23% )
page-faults         15,983  ( +-  5.21% )   15,360 ( +-  3.38% )

Waiman Long suggested using static_keys.
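
A condensed sketch of the resulting static-key pattern (init_shared_processor()
is a hypothetical wrapper used here for brevity; in the patch itself the key
is enabled in smp_cpus_done()):

DEFINE_STATIC_KEY_FALSE(shared_processor);

static void __init init_shared_processor(void)
{
	/* Flip the key once at boot if the partition is shared. */
	if (lppaca_shared_proc(get_lppaca()))
		static_branch_enable(&shared_processor);
}

static inline bool vcpu_is_preempted(int cpu)
{
	/* Compiles to a patched no-op branch on dedicated LPARs. */
	if (!static_branch_unlikely(&shared_processor))
		return false;
	return !!(be32_to_cpu(lppaca_of(cpu).yield_count) & 1);
}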

Fixes: 41946c86876e ("locking/core, powerpc: Implement vcpu_is_preempted(cpu)")
Cc: Parth Shah <pa...@linux.ibm.com>
Cc: Ihor Pasichnyk <ihor.pasich...@ibm.com>
Cc: Juri Lelli <juri.le...@redhat.com>
Cc: Phil Auld <pa...@redhat.com>
Cc: Waiman Long <long...@redhat.com>
Cc: Gautham R. Shenoy <e...@linux.vnet.ibm.com>
Cc: Vaidyanathan Srinivasan <sva...@linux.ibm.com>
Reported-by: Parth Shah <pa...@linux.ibm.com>
Reported-by: Ihor Pasichnyk <ihor.pasich...@ibm.com>
Tested-by: Juri Lelli <juri.le...@redhat.com>
Tested-by: Parth Shah <pa...@linux.ibm.com>
Acked-by: Waiman Long <long...@redhat.com>
Acked-by: Phil Auld <pa...@redhat.com>
Reviewed-by: Gautham R. Shenoy <e...@linux.vnet.ibm.com>
Reviewed-by: Vaidyanathan Srinivasan <sva...@linux.ibm.com>
Signed-off-by: Srikar Dronamraju <sri...@linux.vnet.ibm.com>
---
Changelog v1 (https://patchwork.ozlabs.org/patch/1204190/) -> v3:
Code is now under CONFIG_PPC_SPLPAR, as it depends on CONFIG_PPC_PSERIES.
This was suggested by Waiman Long.

Changelog v3 (https://patchwork.ozlabs.org/patch/1204526) -> v4:
Fix a build issue with CONFIG_NUMA=n, reported by Michael Ellerman, by
moving the relevant code from mm/numa.c to kernel/smp.c.

 arch/powerpc/include/asm/spinlock.h |  6 ++++--
 arch/powerpc/kernel/smp.c           | 19 ++++++++++++++-----
 arch/powerpc/mm/numa.c              |  8 +++-----
 3 files changed, 21 insertions(+), 12 deletions(-)

diff --git a/arch/powerpc/include/asm/spinlock.h b/arch/powerpc/include/asm/spinlock.h
index e9a960e28f3c..f318bfe3525f 100644
--- a/arch/powerpc/include/asm/spinlock.h
+++ b/arch/powerpc/include/asm/spinlock.h
@@ -17,6 +17,7 @@
  */
 #include <linux/irqflags.h>
 #ifdef CONFIG_PPC64
+#include <linux/jump_label.h>
 #include <asm/paca.h>
 #include <asm/hvcall.h>
 #endif
@@ -35,11 +36,12 @@
 #define LOCK_TOKEN     1
 #endif
 
-#ifdef CONFIG_PPC_PSERIES
+#ifdef CONFIG_PPC_SPLPAR
+DECLARE_STATIC_KEY_FALSE(shared_processor);
 #define vcpu_is_preempted vcpu_is_preempted
 static inline bool vcpu_is_preempted(int cpu)
 {
-       if (!firmware_has_feature(FW_FEATURE_SPLPAR))
+       if (!static_branch_unlikely(&shared_processor))
                return false;
        return !!(be32_to_cpu(lppaca_of(cpu).yield_count) & 1);
 }
diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c
index ea6adbf6a221..96a44157b935 100644
--- a/arch/powerpc/kernel/smp.c
+++ b/arch/powerpc/kernel/smp.c
@@ -1348,6 +1348,9 @@ static struct sched_domain_topology_level power9_topology[] = {
        { NULL, },
 };
 
+DEFINE_STATIC_KEY_FALSE(shared_processor);
+EXPORT_SYMBOL_GPL(shared_processor);
+
 void __init smp_cpus_done(unsigned int max_cpus)
 {
        /*
@@ -1359,11 +1362,17 @@ void __init smp_cpus_done(unsigned int max_cpus)
        if (smp_ops && smp_ops->bringup_done)
                smp_ops->bringup_done();
 
-       /*
-        * On a shared LPAR, associativity needs to be requested.
-        * Hence, get numa topology before dumping cpu topology
-        */
-       shared_proc_topology_init();
+#ifdef CONFIG_PPC_SPLPAR
+       if (lppaca_shared_proc(get_lppaca())) {
+               static_branch_enable(&shared_processor);
+
+               /*
+                * On a shared LPAR, associativity needs to be requested.
+                * Hence, get numa topology before dumping cpu topology
+                */
+               shared_proc_topology_init();
+       }
+#endif
        dump_numa_cpu_topology();
 
 #ifdef CONFIG_SCHED_SMT
diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
index 50d68d21ddcc..c352b1dfa99e 100644
--- a/arch/powerpc/mm/numa.c
+++ b/arch/powerpc/mm/numa.c
@@ -1570,11 +1570,9 @@ int prrn_is_enabled(void)
 
 void __init shared_proc_topology_init(void)
 {
-       if (lppaca_shared_proc(get_lppaca())) {
-               bitmap_fill(cpumask_bits(&cpu_associativity_changes_mask),
-                           nr_cpumask_bits);
-               numa_update_cpu_topology(false);
-       }
+       bitmap_fill(cpumask_bits(&cpu_associativity_changes_mask),
+                   nr_cpumask_bits);
+       numa_update_cpu_topology(false);
 }
 
 static int topology_read(struct seq_file *file, void *v)
-- 
2.18.1
