On Fri, May 02, 2025 at 05:03:45AM +0000, Manali Shukla wrote:
> Add support for KVM_CAP_X86_BUS_LOCK_EXIT on SVM CPUs with Bus Lock
> Threshold, which is close enough to VMX's Bus Lock Detection VM-Exit to
> allow reusing KVM_CAP_X86_BUS_LOCK_EXIT.
> 
> The biggest difference between the two features is that Threshold is
> fault-like, whereas Detection is trap-like.  To allow the guest to make
> forward progress, Threshold provides a per-VMCB counter which is
> decremented every time a bus lock occurs, and a VM-Exit is triggered if
> and only if the counter is '0'.
> 
> To provide Detection-like semantics, initialize the counter to '0', i.e.
> exit on every bus lock, and when re-executing the guilty instruction, set
> the counter to '1' to effectively step past the instruction.
> 
> Note, in the unlikely scenario that re-executing the instruction doesn't
> trigger a bus lock, e.g. because the guest has changed memory types or
> patched the guilty instruction, the bus lock counter will be left at '1',
> i.e. the guest will be able to do a bus lock on a different instruction.
> In a perfect world, KVM would ensure the counter is '0' if the guest has
> made forward progress, e.g. if RIP has changed.  But trying to close that
> hole would incur non-trivial complexity, for marginal benefit; the intent
> of KVM_CAP_X86_BUS_LOCK_EXIT is to allow userspace rate-limit bus locks,
> not to allow for precise detection of problematic guest code.  And, it's
> simply not feasible to fully close the hole, e.g. if an interrupt arrives
> before the original instruction can re-execute, the guest could step past
> a different bus lock.
> 
> Suggested-by: Sean Christopherson <[email protected]>
> Signed-off-by: Manali Shukla <[email protected]>
> ---
>  Documentation/virt/kvm/api.rst |  5 +++++
>  arch/x86/kvm/svm/nested.c      | 34 ++++++++++++++++++++++++++++++
>  arch/x86/kvm/svm/svm.c         | 38 ++++++++++++++++++++++++++++++++++
>  arch/x86/kvm/svm/svm.h         |  1 +
>  4 files changed, 78 insertions(+)
> 
> diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
> index ad1859f4699e..f7d2d477c3cf 100644
> --- a/Documentation/virt/kvm/api.rst
> +++ b/Documentation/virt/kvm/api.rst
> @@ -7989,6 +7989,11 @@ apply some other policy-based mitigation. When exiting 
> to userspace, KVM sets
>  KVM_RUN_X86_BUS_LOCK in vcpu-run->flags, and conditionally sets the 
> exit_reason
>  to KVM_EXIT_X86_BUS_LOCK.
>  
> +Due to differences in the underlying hardware implementation, the vCPU's RIP 
> at
> +the time of exit diverges between Intel and AMD.  On Intel hosts, RIP points 
> at
> +the next instruction, i.e. the exit is trap-like.  On AMD hosts, RIP points 
> at
> +the offending instruction, i.e. the exit is fault-like.
> +
>  Note! Detected bus locks may be coincident with other exits to userspace, 
> i.e.
>  KVM_RUN_X86_BUS_LOCK should be checked regardless of the primary exit reason 
> if
>  userspace wants to take action on all detected bus locks.
> diff --git a/arch/x86/kvm/svm/nested.c b/arch/x86/kvm/svm/nested.c
> index 834b67672d50..5369d9517fbb 100644
> --- a/arch/x86/kvm/svm/nested.c
> +++ b/arch/x86/kvm/svm/nested.c
> @@ -678,6 +678,33 @@ static void nested_vmcb02_prepare_control(struct 
> vcpu_svm *svm,
>       vmcb02->control.iopm_base_pa = vmcb01->control.iopm_base_pa;
>       vmcb02->control.msrpm_base_pa = vmcb01->control.msrpm_base_pa;
>  
> +     /*
> +      * Stash vmcb02's counter if the guest hasn't moved past the guilty
> +      * instrution; otherwise, reset the counter to '0'.
> +      *
> +      * In order to detect if L2 has made forward progress or not, track the
> +      * RIP at which a bus lock has occurred on a per-vmcb12 basis.  If RIP
> +      * is changed, guest has clearly made forward progress, bus_lock_counter
> +      * still remained '1', so reset bus_lock_counter to '0'. Eg. In the
> +      * scenario, where a buslock happened in L1 before VMRUN, the bus lock
> +      * firmly happened on an instruction in the past. Even if vmcb01's
> +      * counter is still '1', (because the guilty instruction got patched),
> +      * the vCPU has clearly made forward progress and so KVM should reset
> +      * vmcb02's counter to '0'.
> +      *
> +      * If the RIP hasn't changed, stash the bus lock counter at nested VMRUN
> +      * to prevent the same guilty instruction from triggering a VM-Exit. Eg.
> +      * if userspace rate-limits the vCPU, then it's entirely possible that
> +      * L1's tick interrupt is pending by the time userspace re-runs the
> +      * vCPU.  If KVM unconditionally clears the counter on VMRUN, then when
> +      * L1 re-enters L2, the same instruction will trigger a VM-Exit and the
> +      * entire cycle start over.
> +      */
> +     if (vmcb02->save.rip && (svm->nested.ctl.bus_lock_rip == 
> vmcb02->save.rip))

Is vmcb02->save.rip the right thing to use here?

IIUC, we want to find out if L2 made forward progress since the last bus
lock #VMEXIT from L2 to L0, at which point we record bus_lock_rip.
However, on nested VMRUN, vmcb02->save.rip is only populated from the
vmcb12 in nested_vmcb02_prepare_save(), which doesn't happen until after
nested_vmcb02_prepare_control(). So waht we're comparing against here is
L2's RIP last time KVM ran it.

It's even worse in the KVM_SET_NESTED_STATE path, because
vmcb02->save.rip will be unintialized (probably zero).

Probably you want to use vmcb12_rip here, as this is the RIP that L1
wants to run L2 with, and what will end up in vmcb02->save.rip.

HOWEVER, that is also broken in the KVM_SET_NESTED_STATE path. In that
path we pass in the uninitialized vmcb02->save.rip as vmcb12_rip anyway.
Fixing this is non-trivial because KVM_SET_REGS could be called before
or after KVM_SET_NESTED_STATE, so the RIP may not be available at all at
that point.

We probably want to set a flag in svm_set_nested_state() that the RIP
needs fixing up, and the perhaps in svm_vcpu_pre_run() fix up the
control fields in vmcb02 that depend on it: namely the bus_lock_counter,
next_rip, and soft_int_* fields.

It gets worse because I think next_rip is also not always properly
saved/restored because we do not sync it from vmcb02 to cache in
nested_sync_control_from_vmcb02() -- but that one is not relevant to
bus_lock_counter AFAICT.

> +             vmcb02->control.bus_lock_counter = 1;
> +     else
> +             vmcb02->control.bus_lock_counter = 0;
> +
>       /* Done at vmrun: asid.  */
>  
>       /* Also overwritten later if necessary.  */

Reply via email to