On Fri, May 02, 2025 at 05:03:45AM +0000, Manali Shukla wrote: > Add support for KVM_CAP_X86_BUS_LOCK_EXIT on SVM CPUs with Bus Lock > Threshold, which is close enough to VMX's Bus Lock Detection VM-Exit to > allow reusing KVM_CAP_X86_BUS_LOCK_EXIT. > > The biggest difference between the two features is that Threshold is > fault-like, whereas Detection is trap-like. To allow the guest to make > forward progress, Threshold provides a per-VMCB counter which is > decremented every time a bus lock occurs, and a VM-Exit is triggered if > and only if the counter is '0'. > > To provide Detection-like semantics, initialize the counter to '0', i.e. > exit on every bus lock, and when re-executing the guilty instruction, set > the counter to '1' to effectively step past the instruction. > > Note, in the unlikely scenario that re-executing the instruction doesn't > trigger a bus lock, e.g. because the guest has changed memory types or > patched the guilty instruction, the bus lock counter will be left at '1', > i.e. the guest will be able to do a bus lock on a different instruction. > In a perfect world, KVM would ensure the counter is '0' if the guest has > made forward progress, e.g. if RIP has changed. But trying to close that > hole would incur non-trivial complexity, for marginal benefit; the intent > of KVM_CAP_X86_BUS_LOCK_EXIT is to allow userspace rate-limit bus locks, > not to allow for precise detection of problematic guest code. And, it's > simply not feasible to fully close the hole, e.g. if an interrupt arrives > before the original instruction can re-execute, the guest could step past > a different bus lock. > > Suggested-by: Sean Christopherson <[email protected]> > Signed-off-by: Manali Shukla <[email protected]> > --- > Documentation/virt/kvm/api.rst | 5 +++++ > arch/x86/kvm/svm/nested.c | 34 ++++++++++++++++++++++++++++++ > arch/x86/kvm/svm/svm.c | 38 ++++++++++++++++++++++++++++++++++ > arch/x86/kvm/svm/svm.h | 1 + > 4 files changed, 78 insertions(+) > > diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst > index ad1859f4699e..f7d2d477c3cf 100644 > --- a/Documentation/virt/kvm/api.rst > +++ b/Documentation/virt/kvm/api.rst > @@ -7989,6 +7989,11 @@ apply some other policy-based mitigation. When exiting > to userspace, KVM sets > KVM_RUN_X86_BUS_LOCK in vcpu-run->flags, and conditionally sets the > exit_reason > to KVM_EXIT_X86_BUS_LOCK. > > +Due to differences in the underlying hardware implementation, the vCPU's RIP > at > +the time of exit diverges between Intel and AMD. On Intel hosts, RIP points > at > +the next instruction, i.e. the exit is trap-like. On AMD hosts, RIP points > at > +the offending instruction, i.e. the exit is fault-like. > + > Note! Detected bus locks may be coincident with other exits to userspace, > i.e. > KVM_RUN_X86_BUS_LOCK should be checked regardless of the primary exit reason > if > userspace wants to take action on all detected bus locks. > diff --git a/arch/x86/kvm/svm/nested.c b/arch/x86/kvm/svm/nested.c > index 834b67672d50..5369d9517fbb 100644 > --- a/arch/x86/kvm/svm/nested.c > +++ b/arch/x86/kvm/svm/nested.c > @@ -678,6 +678,33 @@ static void nested_vmcb02_prepare_control(struct > vcpu_svm *svm, > vmcb02->control.iopm_base_pa = vmcb01->control.iopm_base_pa; > vmcb02->control.msrpm_base_pa = vmcb01->control.msrpm_base_pa; > > + /* > + * Stash vmcb02's counter if the guest hasn't moved past the guilty > + * instrution; otherwise, reset the counter to '0'. > + * > + * In order to detect if L2 has made forward progress or not, track the > + * RIP at which a bus lock has occurred on a per-vmcb12 basis. If RIP > + * is changed, guest has clearly made forward progress, bus_lock_counter > + * still remained '1', so reset bus_lock_counter to '0'. Eg. In the > + * scenario, where a buslock happened in L1 before VMRUN, the bus lock > + * firmly happened on an instruction in the past. Even if vmcb01's > + * counter is still '1', (because the guilty instruction got patched), > + * the vCPU has clearly made forward progress and so KVM should reset > + * vmcb02's counter to '0'. > + * > + * If the RIP hasn't changed, stash the bus lock counter at nested VMRUN > + * to prevent the same guilty instruction from triggering a VM-Exit. Eg. > + * if userspace rate-limits the vCPU, then it's entirely possible that > + * L1's tick interrupt is pending by the time userspace re-runs the > + * vCPU. If KVM unconditionally clears the counter on VMRUN, then when > + * L1 re-enters L2, the same instruction will trigger a VM-Exit and the > + * entire cycle start over. > + */ > + if (vmcb02->save.rip && (svm->nested.ctl.bus_lock_rip == > vmcb02->save.rip))
Is vmcb02->save.rip the right thing to use here? IIUC, we want to find out if L2 made forward progress since the last bus lock #VMEXIT from L2 to L0, at which point we record bus_lock_rip. However, on nested VMRUN, vmcb02->save.rip is only populated from the vmcb12 in nested_vmcb02_prepare_save(), which doesn't happen until after nested_vmcb02_prepare_control(). So waht we're comparing against here is L2's RIP last time KVM ran it. It's even worse in the KVM_SET_NESTED_STATE path, because vmcb02->save.rip will be unintialized (probably zero). Probably you want to use vmcb12_rip here, as this is the RIP that L1 wants to run L2 with, and what will end up in vmcb02->save.rip. HOWEVER, that is also broken in the KVM_SET_NESTED_STATE path. In that path we pass in the uninitialized vmcb02->save.rip as vmcb12_rip anyway. Fixing this is non-trivial because KVM_SET_REGS could be called before or after KVM_SET_NESTED_STATE, so the RIP may not be available at all at that point. We probably want to set a flag in svm_set_nested_state() that the RIP needs fixing up, and the perhaps in svm_vcpu_pre_run() fix up the control fields in vmcb02 that depend on it: namely the bus_lock_counter, next_rip, and soft_int_* fields. It gets worse because I think next_rip is also not always properly saved/restored because we do not sync it from vmcb02 to cache in nested_sync_control_from_vmcb02() -- but that one is not relevant to bus_lock_counter AFAICT. > + vmcb02->control.bus_lock_counter = 1; > + else > + vmcb02->control.bus_lock_counter = 0; > + > /* Done at vmrun: asid. */ > > /* Also overwritten later if necessary. */
