From: Wanpeng Li <[email protected]>

Wake-affine is a scheduler feature that tries to keep related processes
running close together, mainly to benefit from cache hits. When a waker
wakes up a wakee, the scheduler has to select a CPU to run the wakee on;
the wake-affine heuristic may select the CPU the waker is currently
running on instead of the prev CPU on which the wakee last ran.

However, in a virtualization scenario where multiple VMs over-subscribe
the host, this increases the probability of vCPU stacking, which means
that sibling vCPUs from the same VM end up stacked on one pCPU. I tested
three 80-vCPU VMs running on one 80-pCPU Skylake server (PLE is
supported); the ebizzy score can increase by 17% after disabling
wake-affine for the vCPU processes.

When qemu or another vCPU injects a virtual interrupt into the guest by
waking up a sleeping vCPU, scheduler wake-affine increases the
probability of stacking vCPUs/qemu. vCPU stacking can greatly increase
lock synchronization latency in a virtualized environment. This patch
disables wake-affine for vCPU processes to mitigate lock holder
preemption.

Cc: Peter Zijlstra <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Paolo Bonzini <[email protected]>
Cc: Radim Krčmář <[email protected]>
Signed-off-by: Wanpeng Li <[email protected]>
---
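For illustration only (not part of the patch): a minimal user-space sketch
of how the early return added to wake_wide() changes its result. The names
struct task_model, wake_wide_model() and the llc_size parameter are
hypothetical stand-ins for task_struct, wake_wide() and sd_llc_size; only
the flag value is taken from this patch.

```c
/* Hypothetical user-space model; none of this is kernel code. */
#define PF_NO_WAKE_AFFINE 0x20000000u	/* value proposed by this patch */

/* Stand-in for the two task_struct fields that wake_wide() looks at. */
struct task_model {
	unsigned int flags;
	unsigned int wakee_flips;
};

/*
 * Returns 1 when the wakeup should be treated as "wide" (the affine
 * fast path is skipped) and 0 when wake-affine may be considered.
 * llc_size stands in for the per-CPU sd_llc_size value.
 */
static int wake_wide_model(const struct task_model *waker,
			   const struct task_model *wakee,
			   unsigned int llc_size)
{
	unsigned int master = waker->wakee_flips;
	unsigned int slave = wakee->wakee_flips;

	/* The check added by the patch: flagged wakees are never affine. */
	if (wakee->flags & PF_NO_WAKE_AFFINE)
		return 1;

	/* Existing heuristic: compare the flip counts against LLC size. */
	if (master < slave) {
		unsigned int tmp = master;

		master = slave;
		slave = tmp;
	}
	if (slave < llc_size || master < slave * llc_size)
		return 0;
	return 1;
}
```

In this model a flagged wakee makes the function return 1 regardless of
the flip counts, which is how the patch is meant to steer vCPU wakeups
away from the wake-affine path.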
 include/linux/sched.h | 1 +
 kernel/sched/fair.c   | 3 +++
 virt/kvm/kvm_main.c   | 1 +
 3 files changed, 5 insertions(+)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 8dc1811..3dd33d8 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1468,6 +1468,7 @@ extern struct pid *cad_pid;
 #define PF_NO_SETAFFINITY      0x04000000      /* Userland is not allowed to meddle with cpus_mask */
 #define PF_MCE_EARLY           0x08000000      /* Early kill for mce process policy */
 #define PF_MEMALLOC_NOCMA      0x10000000      /* All allocation request will have _GFP_MOVABLE cleared */
+#define PF_NO_WAKE_AFFINE      0x20000000      /* This thread should not be wake affine */
 #define PF_FREEZER_SKIP                0x40000000      /* Freezer should not count it as freezable */
 #define PF_SUSPEND_TASK                0x80000000      /* This thread called freeze_processes() and should not be frozen */
 
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 036be95..18eb1fa 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5428,6 +5428,9 @@ static int wake_wide(struct task_struct *p)
        unsigned int slave = p->wakee_flips;
        int factor = this_cpu_read(sd_llc_size);
 
+       if (unlikely(p->flags & PF_NO_WAKE_AFFINE))
+               return 1;
+
        if (master < slave)
                swap(master, slave);
        if (slave < factor || master < slave * factor)
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 887f3b0..b9f75c3 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -2680,6 +2680,7 @@ static int kvm_vm_ioctl_create_vcpu(struct kvm *kvm, u32 id)
 
        mutex_unlock(&kvm->lock);
        kvm_arch_vcpu_postcreate(vcpu);
+       current->flags |= PF_NO_WAKE_AFFINE;
        return r;
 
 unlock_vcpu_destroy:
-- 
2.7.4
