Re: [PATCH v2 0/6] KVM: arm64: VCPU preempted check support
On 2020/1/15 22:14, Marc Zyngier wrote:
> On 2020-01-13 12:12, Will Deacon wrote:
>> [+PeterZ]
>>
>> On Thu, Dec 26, 2019 at 09:58:27PM +0800, Zengruan Ye wrote:
>>> This patch set aims to support the vcpu_is_preempted() functionality
>>> under KVM/arm64, which allows the guest to find out whether a VCPU is
>>> currently running or not. This will enhance lock performance on
>>> overcommitted hosts (more runnable VCPUs than physical CPUs in the
>>> system) as doing busy waits for preempted VCPUs will hurt system
>>> performance far worse than early yielding.
>>>
>>> We have observed some performance improvements in unix benchmark tests.
>>>
>>> unix benchmark result:
>>>   host:  kernel 5.5.0-rc1, HiSilicon Kunpeng920, 8 CPUs
>>>   guest: kernel 5.5.0-rc1, 16 VCPUs
>>>
>>> test-case                             | after-patch     | before-patch
>>> --------------------------------------+-----------------+----------------
>>> Dhrystone 2 using register variables  | 334600751.0 lps | 335319028.3 lps
>>> Double-Precision Whetstone            | 32856.1 MWIPS   | 32849.6 MWIPS
>>> Execl Throughput                      | 3662.1 lps      | 2718.0 lps
>>> File Copy 1024 bufsize 2000 maxblocks | 432906.4 KBps   | 158011.8 KBps
>>> File Copy 256 bufsize 500 maxblocks   | 116023.0 KBps   | 37664.0 KBps
>>> File Copy 4096 bufsize 8000 maxblocks | 1432769.8 KBps  | 441108.8 KBps
>>> Pipe Throughput                       | 6405029.6 lps   | 6021457.6 lps
>>> Pipe-based Context Switching          | 185872.7 lps    | 184255.3 lps
>>> Process Creation                      | 4025.7 lps      | 3706.6 lps
>>> Shell Scripts (1 concurrent)          | 6745.6 lpm      | 6436.1 lpm
>>> Shell Scripts (8 concurrent)          | 998.7 lpm       | 931.1 lpm
>>> System Call Overhead                  | 3913363.1 lps   | 3883287.8 lps
>>> --------------------------------------+-----------------+----------------
>>> System Benchmarks Index Score         | 1835.1          | 1327.6
>>
>> Interesting, thanks for the numbers.
>>
>> So it looks like there is a decent improvement to be had from targeted vCPU
>> wakeup, but I really dislike the explicit PV interface and it's already been
>> shown to interact badly with the WFE-based polling in smp_cond_load_*().
>>
>> Rather than expose a divergent interface, I would instead like to explore an
>> improvement to smp_cond_load_*() and see how that performs before we commit
>> to something more intrusive. Marc and I looked at this very briefly in the
>> past, and the basic idea is to register all of the WFE sites with the
>> hypervisor, indicating which register contains the address being spun on
>> and which register contains the "bad" value. That way, you don't bother
>> rescheduling a vCPU if the value at the address is still bad, because you
>> know it will exit immediately.
>>
>> Of course, the devil is in the details because when I say "address", that's
>> a guest virtual address, so you need to play some tricks in the hypervisor
>> so that you have a separate mapping for the lockword (it's enough to keep
>> track of the physical address).
>>
>> Our hacks are here but we basically ran out of time to work on them beyond
>> an unoptimised and hacky prototype:
>>
>> https://git.kernel.org/pub/scm/linux/kernel/git/maz/arm-platforms.git/log/?h=kvm-arm64/pvcy
>>
>> Marc -- how would you prefer to handle this?
>
> Let me try and rebase this thing to a modern kernel (I doubt it applies
> without conflicts to mainline). We can then have a discussion about its
> merit on the list once I post it. It'd be good to have a pointer to the
> benchmarks that have been used here.
>
> Thanks,
>
> M.

Hi Marc, Will,

My apologies for the slow reply. Just checking: what is the latest on this
PV cond yield prototype?

https://git.kernel.org/pub/scm/linux/kernel/git/maz/arm-platforms.git/log/?h=kvm-arm64/pvcy

The following are the unixbench test results of the PV cond yield prototype:

unix benchmark result:
  host:  kernel 5.10.0-rc6, HiSilicon Kunpeng920, 8 CPUs
  guest: kernel 5.10.0-rc6, 16 VCPUs

                                      | 5.10.0-rc6 | pv_cond_yield | vcpu_is_preempted
System Benchmarks Index Values        | INDEX      | INDEX         | INDEX
--------------------------------------+------------+---------------+------------------
Dhrystone 2 using register variables  | 29164.0    | 29156.9       | 29207.2
Double-Precision Whetstone            | 6807.6     | 6789.2        | 6912.1
Execl Throughput                      | 856.7      | 1195.6        | 863.1
File Copy 1024 bufsize 2000 maxblocks | 189.9      | 923.5         | 1094.2
File Copy 256 bufsize 500 maxblocks   | 121.9      | 578.4         | 588.7
File Copy 4096 bufsize 8000 maxblocks | 419.9      | 1992.0        | 2733.7
Re: [PATCH v2 0/6] KVM: arm64: VCPU preempted check support
On 2020/1/15 22:14, Marc Zyngier wrote:
> On 2020-01-13 12:12, Will Deacon wrote:
>> [+PeterZ]
>>
>> On Thu, Dec 26, 2019 at 09:58:27PM +0800, Zengruan Ye wrote:
>>> This patch set aims to support the vcpu_is_preempted() functionality
>>> under KVM/arm64, which allows the guest to find out whether a VCPU is
>>> currently running or not. This will enhance lock performance on
>>> overcommitted hosts (more runnable VCPUs than physical CPUs in the
>>> system) as doing busy waits for preempted VCPUs will hurt system
>>> performance far worse than early yielding.
>>>
>>> We have observed some performance improvements in unix benchmark tests.
>>>
>>> unix benchmark result:
>>>   host:  kernel 5.5.0-rc1, HiSilicon Kunpeng920, 8 CPUs
>>>   guest: kernel 5.5.0-rc1, 16 VCPUs
>>>
>>> test-case                             | after-patch     | before-patch
>>> --------------------------------------+-----------------+----------------
>>> Dhrystone 2 using register variables  | 334600751.0 lps | 335319028.3 lps
>>> Double-Precision Whetstone            | 32856.1 MWIPS   | 32849.6 MWIPS
>>> Execl Throughput                      | 3662.1 lps      | 2718.0 lps
>>> File Copy 1024 bufsize 2000 maxblocks | 432906.4 KBps   | 158011.8 KBps
>>> File Copy 256 bufsize 500 maxblocks   | 116023.0 KBps   | 37664.0 KBps
>>> File Copy 4096 bufsize 8000 maxblocks | 1432769.8 KBps  | 441108.8 KBps
>>> Pipe Throughput                       | 6405029.6 lps   | 6021457.6 lps
>>> Pipe-based Context Switching          | 185872.7 lps    | 184255.3 lps
>>> Process Creation                      | 4025.7 lps      | 3706.6 lps
>>> Shell Scripts (1 concurrent)          | 6745.6 lpm      | 6436.1 lpm
>>> Shell Scripts (8 concurrent)          | 998.7 lpm       | 931.1 lpm
>>> System Call Overhead                  | 3913363.1 lps   | 3883287.8 lps
>>> --------------------------------------+-----------------+----------------
>>> System Benchmarks Index Score         | 1835.1          | 1327.6
>>
>> Interesting, thanks for the numbers.
>>
>> So it looks like there is a decent improvement to be had from targeted vCPU
>> wakeup, but I really dislike the explicit PV interface and it's already been
>> shown to interact badly with the WFE-based polling in smp_cond_load_*().
>>
>> Rather than expose a divergent interface, I would instead like to explore an
>> improvement to smp_cond_load_*() and see how that performs before we commit
>> to something more intrusive. Marc and I looked at this very briefly in the
>> past, and the basic idea is to register all of the WFE sites with the
>> hypervisor, indicating which register contains the address being spun on
>> and which register contains the "bad" value. That way, you don't bother
>> rescheduling a vCPU if the value at the address is still bad, because you
>> know it will exit immediately.
>>
>> Of course, the devil is in the details because when I say "address", that's
>> a guest virtual address, so you need to play some tricks in the hypervisor
>> so that you have a separate mapping for the lockword (it's enough to keep
>> track of the physical address).
>>
>> Our hacks are here but we basically ran out of time to work on them beyond
>> an unoptimised and hacky prototype:
>>
>> https://git.kernel.org/pub/scm/linux/kernel/git/maz/arm-platforms.git/log/?h=kvm-arm64/pvcy
>>
>> Marc -- how would you prefer to handle this?
>
> Let me try and rebase this thing to a modern kernel (I doubt it applies
> without conflicts to mainline). We can then have a discussion about its
> merit on the list once I post it. It'd be good to have a pointer to the
> benchmarks that have been used here.

Hi Marc, Will,

My apologies for the slow reply. Just checking: what is the latest on this
PV cond yield prototype?

https://git.kernel.org/pub/scm/linux/kernel/git/maz/arm-platforms.git/log/?h=kvm-arm64/pvcy

Recently, I redid the unixbench test comparison between the vCPU preempted
check and PV cond yield. The results are as follows:

unix benchmark result:
  host:  kernel 5.10.0-rc6, HiSilicon Kunpeng920, 8 CPUs
  guest: kernel 5.10.0-rc6, 16 VCPUs

                                      | 5.10.0-rc6 | pv_cond_yield | vcpu_is_preempted
System Benchmarks Index Values        | INDEX      | INDEX         | INDEX
--------------------------------------+------------+---------------+------------------
Dhrystone 2 using register variables  | 29164.0    | 29156.9       | 29207.2
Double-Precision Whetstone            | 6807.6     | 6789.2        | 6912.1
Execl Throughput                      | 856.7      | 1195.6        | 863.1
File Copy 1024 bufsize 2000 maxblocks | 189.9      | 923.5         | 1094.2
File Copy 256 bufsize 500 maxblocks   | 121.9      | 578.4         | 588.7
File Copy 4096 bufsize 8000 maxblocks | 419.9      |
Re: [PATCH v2 0/6] KVM: arm64: VCPU preempted check support
[+PeterZ]

On Thu, Dec 26, 2019 at 09:58:27PM +0800, Zengruan Ye wrote:
> This patch set aims to support the vcpu_is_preempted() functionality
> under KVM/arm64, which allows the guest to find out whether a VCPU is
> currently running or not. This will enhance lock performance on
> overcommitted hosts (more runnable VCPUs than physical CPUs in the
> system) as doing busy waits for preempted VCPUs will hurt system
> performance far worse than early yielding.
>
> We have observed some performance improvements in unix benchmark tests.
>
> unix benchmark result:
>   host:  kernel 5.5.0-rc1, HiSilicon Kunpeng920, 8 CPUs
>   guest: kernel 5.5.0-rc1, 16 VCPUs
>
> test-case                             | after-patch     | before-patch
> --------------------------------------+-----------------+----------------
> Dhrystone 2 using register variables  | 334600751.0 lps | 335319028.3 lps
> Double-Precision Whetstone            | 32856.1 MWIPS   | 32849.6 MWIPS
> Execl Throughput                      | 3662.1 lps      | 2718.0 lps
> File Copy 1024 bufsize 2000 maxblocks | 432906.4 KBps   | 158011.8 KBps
> File Copy 256 bufsize 500 maxblocks   | 116023.0 KBps   | 37664.0 KBps
> File Copy 4096 bufsize 8000 maxblocks | 1432769.8 KBps  | 441108.8 KBps
> Pipe Throughput                       | 6405029.6 lps   | 6021457.6 lps
> Pipe-based Context Switching          | 185872.7 lps    | 184255.3 lps
> Process Creation                      | 4025.7 lps      | 3706.6 lps
> Shell Scripts (1 concurrent)          | 6745.6 lpm      | 6436.1 lpm
> Shell Scripts (8 concurrent)          | 998.7 lpm       | 931.1 lpm
> System Call Overhead                  | 3913363.1 lps   | 3883287.8 lps
> --------------------------------------+-----------------+----------------
> System Benchmarks Index Score         | 1835.1          | 1327.6

Interesting, thanks for the numbers.

So it looks like there is a decent improvement to be had from targeted vCPU
wakeup, but I really dislike the explicit PV interface and it's already been
shown to interact badly with the WFE-based polling in smp_cond_load_*().

Rather than expose a divergent interface, I would instead like to explore an
improvement to smp_cond_load_*() and see how that performs before we commit
to something more intrusive. Marc and I looked at this very briefly in the
past, and the basic idea is to register all of the WFE sites with the
hypervisor, indicating which register contains the address being spun on
and which register contains the "bad" value. That way, you don't bother
rescheduling a vCPU if the value at the address is still bad, because you
know it will exit immediately.

Of course, the devil is in the details because when I say "address", that's
a guest virtual address, so you need to play some tricks in the hypervisor
so that you have a separate mapping for the lockword (it's enough to keep
track of the physical address).

Our hacks are here but we basically ran out of time to work on them beyond
an unoptimised and hacky prototype:

https://git.kernel.org/pub/scm/linux/kernel/git/maz/arm-platforms.git/log/?h=kvm-arm64/pvcy

Marc -- how would you prefer to handle this?

Will
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization
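The wake-up filter described in the message above can be illustrated with a small user-space sketch in C. This is not code from the pvcy prototype: `struct wfe_wait`, `wfe_wait_should_resume()` and `pick_runnable_waiter()` are hypothetical names, and the sketch assumes the hypervisor has already resolved the guest virtual address of the lockword into a pointer it can read directly.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical record for one vCPU parked at a registered WFE site.
 * Assumption: the address and the "bad" value have already been read out
 * of the guest registers named at registration time.
 */
struct wfe_wait {
	const volatile uint64_t *lockword; /* hypervisor-side view of the lockword */
	uint64_t bad_value;                /* value that would make the vCPU spin again */
};

/* True if the vCPU would make progress when rescheduled. */
static bool wfe_wait_should_resume(const struct wfe_wait *w)
{
	assert(w != NULL && w->lockword != NULL);
	return *w->lockword != w->bad_value;
}

/*
 * Return the index of the first waiter worth rescheduling, or -1 if every
 * waiter would still see its "bad" value and exit again immediately.
 */
static int pick_runnable_waiter(const struct wfe_wait *waits, size_t n)
{
	for (size_t i = 0; i < n; i++) {
		if (wfe_wait_should_resume(&waits[i]))
			return (int)i;
	}
	return -1;
}
```

A scheduler hook built along these lines would skip any vCPU whose lockword still holds the bad value. Keeping the hypervisor-side mapping of the lockword valid is the hard part that the message calls out, and the sketch does not attempt it.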
[PATCH v2 0/6] KVM: arm64: VCPU preempted check support
This patch set aims to support the vcpu_is_preempted() functionality
under KVM/arm64, which allows the guest to find out whether a VCPU is
currently running or not. This will enhance lock performance on
overcommitted hosts (more runnable VCPUs than physical CPUs in the
system) as doing busy waits for preempted VCPUs will hurt system
performance far worse than early yielding.

We have observed some performance improvements in unix benchmark tests.

unix benchmark result:
  host:  kernel 5.5.0-rc1, HiSilicon Kunpeng920, 8 CPUs
  guest: kernel 5.5.0-rc1, 16 VCPUs

test-case                             | after-patch     | before-patch
--------------------------------------+-----------------+----------------
Dhrystone 2 using register variables  | 334600751.0 lps | 335319028.3 lps
Double-Precision Whetstone            | 32856.1 MWIPS   | 32849.6 MWIPS
Execl Throughput                      | 3662.1 lps      | 2718.0 lps
File Copy 1024 bufsize 2000 maxblocks | 432906.4 KBps   | 158011.8 KBps
File Copy 256 bufsize 500 maxblocks   | 116023.0 KBps   | 37664.0 KBps
File Copy 4096 bufsize 8000 maxblocks | 1432769.8 KBps  | 441108.8 KBps
Pipe Throughput                       | 6405029.6 lps   | 6021457.6 lps
Pipe-based Context Switching          | 185872.7 lps    | 184255.3 lps
Process Creation                      | 4025.7 lps      | 3706.6 lps
Shell Scripts (1 concurrent)          | 6745.6 lpm      | 6436.1 lpm
Shell Scripts (8 concurrent)          | 998.7 lpm       | 931.1 lpm
System Call Overhead                  | 3913363.1 lps   | 3883287.8 lps
--------------------------------------+-----------------+----------------
System Benchmarks Index Score         | 1835.1          | 1327.6

Changes from v1:
https://lore.kernel.org/lkml/20191217135549.3240-1-yezengr...@huawei.com/
 * The guest kernel no longer allocates the PV lock structure; instead it
   is allocated by user space to avoid lifetime issues around kexec.
 * Provide VCPU attributes for PV lock.
 * Update the SMC number of the PV lock features.
 * Report some basic validation at PV lock init.
 * Document the preempted field.
 * Bunch of typo fixes.
Zengruan Ye (6):
  KVM: arm64: Document PV-lock interface
  KVM: arm64: Add SMCCC paravirtualised lock calls
  KVM: arm64: Support pvlock preempted via shared structure
  KVM: arm64: Provide VCPU attributes for PV lock
  KVM: arm64: Add interface to support VCPU preempted check
  KVM: arm64: Support the VCPU preemption check

 Documentation/virt/kvm/arm/pvlock.rst   |  63
 Documentation/virt/kvm/devices/vcpu.txt |  14 +++
 arch/arm/include/asm/kvm_host.h         |  18
 arch/arm64/include/asm/kvm_host.h       |  28 ++
 arch/arm64/include/asm/paravirt.h       |  15 +++
 arch/arm64/include/asm/pvlock-abi.h     |  16
 arch/arm64/include/asm/spinlock.h       |   7 ++
 arch/arm64/include/uapi/asm/kvm.h       |   2 +
 arch/arm64/kernel/Makefile              |   2 +-
 arch/arm64/kernel/paravirt-spinlocks.c  |  13 +++
 arch/arm64/kernel/paravirt.c            | 121 +++-
 arch/arm64/kernel/setup.c               |   2 +
 arch/arm64/kvm/Makefile                 |   1 +
 arch/arm64/kvm/guest.c                  |   9 ++
 include/linux/arm-smccc.h               |  14 +++
 include/linux/cpuhotplug.h              |   1 +
 include/uapi/linux/kvm.h                |   2 +
 virt/kvm/arm/arm.c                      |   8 ++
 virt/kvm/arm/hypercalls.c               |   8 ++
 virt/kvm/arm/pvlock.c                   | 103
 20 files changed, 445 insertions(+), 2 deletions(-)
 create mode 100644 Documentation/virt/kvm/arm/pvlock.rst
 create mode 100644 arch/arm64/include/asm/pvlock-abi.h
 create mode 100644 arch/arm64/kernel/paravirt-spinlocks.c
 create mode 100644 virt/kvm/arm/pvlock.c

-- 
2.19.1
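To make the motivation in the cover letter concrete, here is a minimal user-space sketch in C of how a spinning waiter might use a preempted check to give up early. It is an illustration under stated assumptions, not the code from these patches: the `struct pv_vcpu_state` layout, the `pv_state[]` array, the `NR_VCPUS` value and `spin_on_owner()` are hypothetical, and only the `vcpu_is_preempted()` name and the existence of a `preempted` field come from the series description.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define NR_VCPUS 16 /* assumption: matches the 16-VCPU guest in the benchmark */

/*
 * Hypothetical per-vCPU structure shared between host and guest. The host
 * is assumed to set `preempted` to non-zero when it schedules the vCPU out.
 */
struct pv_vcpu_state {
	_Atomic uint64_t preempted;
};

static struct pv_vcpu_state pv_state[NR_VCPUS];

/* Guest-side check: is the given vCPU currently scheduled out by the host? */
static bool vcpu_is_preempted(int cpu)
{
	assert(cpu >= 0 && cpu < NR_VCPUS);
	return atomic_load_explicit(&pv_state[cpu].preempted,
				    memory_order_relaxed) != 0;
}

/*
 * Spin while another vCPU owns the lock (owner_cpu >= 0). Returns true if
 * the lock was seen free, false if the caller should stop busy-waiting,
 * either because the owner is preempted or the spin budget ran out.
 */
static bool spin_on_owner(atomic_int *owner_cpu, unsigned int max_spins)
{
	for (unsigned int i = 0; i < max_spins; i++) {
		int owner = atomic_load_explicit(owner_cpu, memory_order_acquire);

		if (owner < 0)
			return true;  /* lock released, try to take it */
		if (vcpu_is_preempted(owner))
			return false; /* owner cannot progress, yield early */
	}
	return false;
}
```

When `spin_on_owner()` returns false the caller would typically sleep or yield rather than keep burning its time slice. That is the early yielding the cover letter credits for the gains on overcommitted hosts.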