Hi Sean/Paolo/Vitaly,

Wanted to get your opinion on this change. Let me first introduce the scenario:

We have been testing Kata Containers (containers wrapped in a VM) in Azure and found some significant issues with TLB flushing. This is a popular workload and requires launching many nested VMs quickly. When testing on a 64-core Intel VM (D64s_v5, in case someone is wondering), spinning up some 150-ish nested VMs in parallel, performance gets worse the more nested VMs are already running: CPU usage spikes to 100% on all cores and doesn't settle even after all the nested VMs have booted. On an idle system a single nested VM boots within seconds, but once we have a couple dozen running (doing nothing inside), boot time gets longer and longer for each new nested VM, they start hitting startup timeouts, etc. In some cases we never reach the point where all nested VMs are up and running.

Investigating the issue, we found that this can't be reproduced on AMD, nor on Intel with EPT disabled. In both of those cases the scenario completes within 20s or so. TDP_MMU or not doesn't make a difference. With EPT=Y the case takes minutes. Out of curiosity I also ran the test case on an n4-standard-64 VM on GCP and found that EPT=Y runs in ~30s, while EPT=N runs in ~20s (which I found slightly interesting).

So that's when we started looking at the TLB flushing code and found that INVEPT.global is used on every CPU migration and that it's an expensive operation on Hyper-V. It also has an impact on every running nested VM, so we end up with lots of INVEPT.global calls - we reach 2000 calls/s before we're essentially stuck at 100% guest time. That's why I'm looking at tweaking the TLB flushing behavior to avoid it. I came across past discussions on this topic ([^1]) and after some thinking see two options:

1. Do you see a way to optimize this generically to avoid KVM_REQ_TLB_FLUSH on migration in current KVM? In nested (as in: KVM running nested) I think we rarely see CPU pinning used the way it is on bare metal, so migration is not that rare of an operation. Much has also changed since [^1], and with kvm_mmu_reset_context() still being called in many paths we might be over-flushing. Perhaps a loop flushing individual roots with roles that do not have a post_set_xxx hook that does flushing?

2. We can approach this in a Hyper-V specific way, using the dedicated flush hypercall, which is what the following RFC patch does. This hypercall acts as a broadcast INVEPT.single. I believe that using the flush hypercall in flush_tlb_current() is sufficient to ensure the right semantics and correctness. The one thing I haven't made up my mind about yet is whether we could still use a flush of the current root on migration or not - I can imagine at most an INVEPT.single, and I also haven't yet figured out how that could be plumbed in if it's really necessary (it can't be put in KVM_REQ_TLB_FLUSH because that would break the assumption that KVM_REQ_TLB_FLUSH is stronger than KVM_REQ_TLB_FLUSH_CURRENT). A rough sketch of what 2. looks like in vmx_flush_tlb_current() is included below the diffstat.

With 2. the performance is comparable to EPT=N on Intel: roughly 20s for the test scenario.

Let me know what you think about this and if you have any suggestions.

Best wishes,
Jeremi

[^1]: https://lore.kernel.org/kvm/yqljnbbp%2feous...@google.com/

Jeremi Piotrowski (1):
  KVM: VMX: Use Hyper-V EPT flush for local TLB flushes

 arch/x86/include/asm/kvm_host.h |  1 +
 arch/x86/kvm/vmx/vmx.c          | 20 +++++++++++++++++---
 arch/x86/kvm/vmx/vmx_onhyperv.h |  6 ++++++
 arch/x86/kvm/x86.c              |  3 +++
 4 files changed, 27 insertions(+), 3 deletions(-)

-- 
2.39.5
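
For illustration only, this is roughly the shape of the change to vmx_flush_tlb_current() in arch/x86/kvm/vmx/vmx.c that 2. implies. vmx_hv_use_flush_guest_mapping() is a made-up predicate name for this sketch (the actual patch plumbs the check through vmx_onhyperv.h); hyperv_flush_guest_mapping() is the existing wrapper around the HvFlushGuestPhysicalAddressSpace hypercall.

static void vmx_flush_tlb_current(struct kvm_vcpu *vcpu)
{
	struct kvm_mmu *mmu = vcpu->arch.mmu;
	u64 root_hpa = mmu->root.hpa;

	/* No flush required if the current context is invalid. */
	if (!VALID_PAGE(root_hpa))
		return;

	if (enable_ept) {
		u64 eptp = construct_eptp(vcpu, root_hpa,
					  mmu->root_role.level);

		/*
		 * When running on Hyper-V, replace INVEPT.single with the
		 * HvFlushGuestPhysicalAddressSpace hypercall for the current
		 * EPT root, falling back to INVEPT if the hypercall fails.
		 * vmx_hv_use_flush_guest_mapping() is hypothetical here.
		 */
		if (vmx_hv_use_flush_guest_mapping(vcpu) &&
		    !hyperv_flush_guest_mapping(eptp))
			return;

		ept_sync_context(eptp);
	} else {
		vpid_sync_context(vmx_get_current_vpid(vcpu));
	}
}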
We have been testing kata containers (containers wrapped in VM) in Azure, and found some significant issues with TLB flushing. This is a popular workload and requires launching many nested VMs quickly. When testing on a 64 core Intel VM (D64s_v5 in case someone is wondering), spinning up some 150-ish nested VMs in parallel, performance starts getting worse the more nested VMs are already running, CPU usage spikes to 100% on all cores and doesn't settle even when all nested VMs boot up. On an idle system a single nested VMs boots within seconds, but once we have a couple dozen running or so (doing nothing inside), boot time gets longer and longer for each new nested VM, they start hitting startup timeout etc. In some cases we never reach the point where all nested VMs are up and running. Investigating the issue we found that this can't be reproduced on AMD and on Intel when EPT is disabled. In both these cases the scenario completes within 20s or so. TPD_MMU or not doesn't make a difference. With EPT=Y the case takes minutes.Out of curiousity I also ran the test case on an n4-standard-64 VM on GCP and found that EPT=Y runs in ~30s, while EPT=N runs in ~20s (which I found slightly interesting). So that's when we starting looking at the TLB flushing code and found that INVEPT.global is used on every CPU migration and that it's an expensive function on Hyper-V. It also has an impact on every running nested VM, so we end up with lots of INVEPT.global calls - we reach 2000 calls/s before we're essentially stuck in 100% guest ttime. That's why I'm looking at tweaking the TLB flushing behavior to avoid it. I came across past discussions on this topic ([^1]) and after some thinking see two options: 1. Do you see a way to optimize this generically to avoid KVM_REQ_TLB_FLUSH on migration in current KVM? In nested (as in: KVM running nested) I think we rarely see CPU pinning used the way we it is on baremetal so it's not a rare of an operation. Much has also changed since [^1] and with kvm_mmu_reset_context() still being called in many paths we might be over flushing. Perhaps a loop flushing individual roots with roles that do not have a post_set_xxx hook that does flushing? 2. We can approach this in a Hyper-V specific way, using the dedicated flush hypercall, which is what the following RFC patch does. This hypercall acts as a broadcast INVEPT.single. I believe that using the flush hypercall in flush_tlb_current() is sufficient to ensure the right semantics and correctness. The one thing I haven't made up my mind about yet is whether we could still use a flush of the current root on migration or not - I can imagine at most an INVEPT.single, I also haven't yet figured out how that could be plumbed in if it's really necessary (can't put it in KVM_REQ_TLB_FLUSH because that would break the assumption that it is stronger than KVM_REQ_TLB_FLUSH_CURRENT). With 2. the performance is comparable to EPT=N on Intel, roughly 20s for the test scenario. Let me know what you think about this and if you have any suggestions. Best wishes, Jeremi [^1]: https://lore.kernel.org/kvm/yqljnbbp%2feous...@google.com/ Jeremi Piotrowski (1): KVM: VMX: Use Hyper-V EPT flush for local TLB flushes arch/x86/include/asm/kvm_host.h | 1 + arch/x86/kvm/vmx/vmx.c | 20 +++++++++++++++++--- arch/x86/kvm/vmx/vmx_onhyperv.h | 6 ++++++ arch/x86/kvm/x86.c | 3 +++ 4 files changed, 27 insertions(+), 3 deletions(-) -- 2.39.5