Hi Sean,

Thanks for reviewing my patches.

On 10/15/2024 11:19 PM, Sean Christopherson wrote:
> On Fri, Oct 04, 2024, Manali Shukla wrote:
>> When a VMRUN instruction is executed, the bus lock threshold count is
>> loaded into an internal count register. Before the processor executes
>> a bus lock in the guest, it checks the value of this register:
>>
>>  - If the value is greater than '0', the processor successfully
>>    executes the bus lock and decrements the count.
>>
>>  - If the value is '0', the bus lock is not executed, and a #VMEXIT to
>>    the VMM is taken.
>>
>> The bus lock threshold #VMEXIT is reported to the VMM with the VMEXIT
>> code A5h, SVM_EXIT_BUS_LOCK.
> 
> I vote to split this into two patches: one to add the architectural collateral,
> with the above as the changelog, and a second to actually implement support in
> KVM.  Having the above background is useful, but it makes it quite hard to find
> information on the KVM design and implementation.
>  

Sure. I will split it into 2 patches.

>> This implementation is set up to intercept every guest bus lock.
> 
> "This implementation" is a variation on "This patch".  Drop it, and simply state
> what the patch is doing.
> 
>> The initial value of the Bus Lock Threshold counter is '0' in the VMCB, so
>> the very first bus lock will exit, set the Bus Lock Threshold counter
>> to '1' (so that the offending instruction can make forward progress)
>> but then the value is at '0' again so the next bus lock will exit.
>>
>> The generated SVM_EXIT_BUS_LOCK in kvm will exit to user space by
> 
> s/kvm/KVM
> 
> And again, the tone is wrong.
> 
> Something like this is what I would aim for:
> 
>   Add support for KVM_CAP_X86_BUS_LOCK_EXIT on SVM CPUs with Bus Lock
>   Threshold, which is close enough to VMX's Bus Lock Detection VM-Exit to
>   allow reusing KVM_CAP_X86_BUS_LOCK_EXIT.
> 
>   The biggest difference between the two features is that Threshold is
>   fault-like, whereas Detection is trap-like.  To allow the guest to make
>   forward progress, Threshold provides a per-VMCB counter which is
>   decremented every time a bus lock occurs, and a VM-Exit is triggered if
>   and only if the counter is '0'.
> 
>   To provide Detection-like semantics, initialize the counter to '0', i.e.
>   exit on every bus lock, and when re-executing the guilty instruction, set
>   the counter to '1' to effectively step past the instruction.
> 
>   Note, in the unlikely scenario that re-executing the instruction doesn't
>   trigger a bus lock, e.g. because the guest has changed memory types or
>   patched the guilty instruction, the bus lock counter will be left at '1',
>   i.e. the guest will be able to do a bus lock on a different instruction.
>   In a perfect world, KVM would ensure the counter is '0' if the guest has
>   made forward progress, e.g. if RIP has changed.  But trying to close that
>   hole would incur non-trivial complexity, for marginal benefit; the intent
>   of KVM_CAP_X86_BUS_LOCK_EXIT is to allow userspace to rate-limit bus locks,
>   not to allow for precise detection of problematic guest code.  And, it's
>   simply not feasible to fully close the hole, e.g. if an interrupt arrives
>   before the original instruction can re-execute, the guest could step past
>   a different bus lock.
> 
>> setting the KVM_RUN_BUS_LOCK flag in vcpu->run->flags, indicating to
>> the user space that a bus lock has been detected in the guest.
>>
>> Use the already available KVM capability KVM_CAP_X86_BUS_LOCK_EXIT to
>> enable the feature. This feature is disabled by default, it can be
>> enabled explicitly from user space.
>>
>> More details about the Bus Lock Threshold feature can be found in AMD
>> APM [1].
> 
> ...
> 
>> diff --git a/arch/x86/kvm/svm/nested.c b/arch/x86/kvm/svm/nested.c
>> index d5314cb7dff4..ca1c42201894 100644
>> --- a/arch/x86/kvm/svm/nested.c
>> +++ b/arch/x86/kvm/svm/nested.c
>> @@ -669,6 +669,11 @@ static void nested_vmcb02_prepare_control(struct vcpu_svm *svm,
>>      vmcb02->control.iopm_base_pa = vmcb01->control.iopm_base_pa;
>>      vmcb02->control.msrpm_base_pa = vmcb01->control.msrpm_base_pa;
>>  
>> +    /*
>> +     * The bus lock threshold count should keep running across nested
>> +     * transitions. Copy the buslock threshold count from vmcb01 to vmcb02.
> 
> No, it should be reset to '0'.  The bus lock can't have been on VMRUN, because KVM
> is emulating the VMRUN.  That leaves two possibilities: the bus lock happened in
> L1 on an instruction before VMRUN, or the bus lock happened in _an_ L2, before a
> nested VM-Exit to L1 occurred.
> 
> In the first case, the bus lock firmly happened on an instruction in the past.
> Even if vmcb01's counter is still '1', e.g. because the guilty instruction got
> patched, the vCPU has clearly made forward progress and so KVM should reset
> vmcb02's counter to '0'.
> 
> In the second case, KVM has no idea if L2 has made forward progress.  The only
> way to _try to_ detect if L2 has made forward progress would be to track the
> counter on a per-vmcb12 basis, but even that is flawed because KVM doesn't have
> visibility into L1's management of L2.
>  
> I do think we may need to stash vmcb02's counter though.  E.g. if userspace
> rate-limits the vCPU, then it's entirely possible that L1's tick interrupt is
> pending by the time userspace re-runs the vCPU.  If KVM unconditionally clears
> the counter on VMRUN, then when L1 re-enters L2, the same instruction will
> trigger a VM-Exit and the entire cycle starts over.
> 
> Anything we do is going to be imperfect, but the best idea I can come up with is
> also relatively simple, especially in conjunction with my suggestion below.  If
> KVM tracks the RIP that triggered the bus lock, then on nested VM-Exit KVM can
> propagate that RIP into svm_nested_state as appropriate.  E.g.
> 
>       if (vmcb02->control.bus_lock_counter &&
>           svm->bus_lock_rip == vmcb02->save.rip)
>               svm->nested.bus_lock_rip = svm->bus_lock_rip;
>       else
>               svm->nested.bus_lock_rip = INVALID_GVA; /* or '0', as much as that bugs me */
> 
> and then on nested VMRUN
> 
>       if (svm->nested.bus_lock_rip == vmcb02->save.rip) {
>               vmcb02->control.bus_lock_counter = 1;
>               svm->bus_lock_rip = svm->nested.bus_lock_rip;
>       } else {
>               vmcb02->control.bus_lock_counter = 0;
>       }
> 
>       svm->nested.bus_lock_rip = INVALID_GVA;
> 
> Again, it's imperfect, e.g. if L1 happens to run a different vCPU at the same
> RIP, then KVM will allow a bus lock for the wrong vCPU.  But the odds of that
> happening are absurdly low unless L1 is deliberately doing weird things, and so
> I don't think that's worth worrying about.
> 
>> +     */
>> +    vmcb02->control.bus_lock_counter = vmcb01->control.bus_lock_counter;
>>      /* Done at vmrun: asid.  */
>>  
>>      /* Also overwritten later if necessary.  */
>> @@ -1035,6 +1040,11 @@ int nested_svm_vmexit(struct vcpu_svm *svm)
>>  
>>      }
>>  
>> +    /*
>> +     * The bus lock threshold count should keep running across nested
>> +     * transitions. Copy the buslock threshold count from vmcb02 to vmcb01.
>> +     */
>> +    vmcb01->control.bus_lock_counter = vmcb02->control.bus_lock_counter;
> 
> KVM should definitely reset the counter to '0' on a nested VM-Exit, because L1
> can't be interrupted by L2, i.e. there is zero chance that KVM needs to allow a
> bus lock in L1 to ensure L1 makes forward progress.
> 
>>      nested_svm_copy_common_state(svm->nested.vmcb02.ptr, svm->vmcb01.ptr);
>>  
>>      svm_switch_vmcb(svm, &svm->vmcb01);
>> diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
>> index 9df3e1e5ae81..0d0407f1ee6a 100644
>> --- a/arch/x86/kvm/svm/svm.c
>> +++ b/arch/x86/kvm/svm/svm.c
>> @@ -1372,6 +1372,9 @@ static void init_vmcb(struct kvm_vcpu *vcpu)
>>              svm->vmcb->control.int_ctl |= V_GIF_ENABLE_MASK;
>>      }
>>  
>> +    if (vcpu->kvm->arch.bus_lock_detection_enabled)
>> +            svm_set_intercept(svm, INTERCEPT_BUSLOCK);
>> +
>>      if (sev_guest(vcpu->kvm))
>>              sev_init_vmcb(svm);
>>  
>> @@ -3286,6 +3289,24 @@ static int invpcid_interception(struct kvm_vcpu *vcpu)
>>      return kvm_handle_invpcid(vcpu, type, gva);
>>  }
>>  
>> +static int bus_lock_exit(struct kvm_vcpu *vcpu)
>> +{
>> +    struct vcpu_svm *svm = to_svm(vcpu);
>> +
>> +    vcpu->run->exit_reason = KVM_EXIT_X86_BUS_LOCK;
>> +    vcpu->run->flags |= KVM_RUN_X86_BUS_LOCK;
>> +
>> +    /*
>> +     * Reload the counter with value greater than '0'.
> 
> The value quite obviously must be exactly '1', not simply greater than '0'.  I
> also think this is the wrong place to set the counter.  Rather than set the
> counter at the time of exit, KVM should implement a
> vcpu->arch.complete_userspace_io callback and set the counter to '1' if and only
> if RIP (or LIP, but I have no objection to keeping things simple) is unchanged.
> It's a bit of extra complexity, but it will make it super obvious why KVM is
> setting the counter to '1'.  And, if userspace wants to stuff state and move
> past the instruction, e.g. by emulating the guilty instruction, then KVM won't
> unnecessarily allow a bus lock in the guest.
> 
> And then the comment can be:
> 
>       /*
>        * If userspace has NOT changed RIP, then KVM's ABI is to let the guest
>        * execute the bus-locking instruction.  Set the bus lock counter to '1'
>        * to effectively step past the bus lock.
>        */
> 

Thank you for highlighting these scenarios (for nested guests and normal
guests); I had not thought about them. I'm currently going through the comments
and trying to fully understand them. I'll try out the suggested changes and
get back to you.

- Manali


>> +     * The bus lock exit on SVM happens with RIP pointing to the guilty
>> +     * instruction. So, reloading the value of bus_lock_counter to '0'
>> +     * results into generating continuous bus lock exits.
>> +     */
>> +    svm->vmcb->control.bus_lock_counter = 1;
>> +
>> +    return 0;
>> +}
>> +
>>  static int (*const svm_exit_handlers[])(struct kvm_vcpu *vcpu) = {
>>      [SVM_EXIT_READ_CR0]                     = cr_interception,
>>      [SVM_EXIT_READ_CR3]                     = cr_interception,
>> @@ -3353,6 +3374,7 @@ static int (*const svm_exit_handlers[])(struct kvm_vcpu *vcpu) = {
>>      [SVM_EXIT_CR4_WRITE_TRAP]               = cr_trap,
>>      [SVM_EXIT_CR8_WRITE_TRAP]               = cr_trap,
>>      [SVM_EXIT_INVPCID]                      = invpcid_interception,
>> +    [SVM_EXIT_BUS_LOCK]                     = bus_lock_exit,
>>      [SVM_EXIT_NPF]                          = npf_interception,
>>      [SVM_EXIT_RSM]                          = rsm_interception,
>>      [SVM_EXIT_AVIC_INCOMPLETE_IPI]          = avic_incomplete_ipi_interception,
>> @@ -5227,6 +5249,11 @@ static __init void svm_set_cpu_caps(void)
>>              kvm_cpu_cap_set(X86_FEATURE_SVME_ADDR_CHK);
>>      }
>>  
>> +    if (cpu_feature_enabled(X86_FEATURE_BUS_LOCK_THRESHOLD)) {
>> +            pr_info("Bus Lock Threshold supported\n");
>> +            kvm_caps.has_bus_lock_exit = true;
>> +    }
>> +
>>      /* CPUID 0x80000008 */
>>      if (boot_cpu_has(X86_FEATURE_LS_CFG_SSBD) ||
>>          boot_cpu_has(X86_FEATURE_AMD_SSBD))
>> -- 
>> 2.34.1
>>

