On Mon, Feb 02, 2026, Yosry Ahmed wrote: > On Fri, May 02, 2025 at 05:03:45AM +0000, Manali Shukla wrote: > > diff --git a/arch/x86/kvm/svm/nested.c b/arch/x86/kvm/svm/nested.c > > index 834b67672d50..5369d9517fbb 100644 > > --- a/arch/x86/kvm/svm/nested.c > > +++ b/arch/x86/kvm/svm/nested.c > > @@ -678,6 +678,33 @@ static void nested_vmcb02_prepare_control(struct > > vcpu_svm *svm, > > vmcb02->control.iopm_base_pa = vmcb01->control.iopm_base_pa; > > vmcb02->control.msrpm_base_pa = vmcb01->control.msrpm_base_pa; > > > > + /* > > + * Stash vmcb02's counter if the guest hasn't moved past the guilty > > + * instrution; otherwise, reset the counter to '0'. > > + * > > + * In order to detect if L2 has made forward progress or not, track the > > + * RIP at which a bus lock has occurred on a per-vmcb12 basis. If RIP > > + * is changed, guest has clearly made forward progress, bus_lock_counter > > + * still remained '1', so reset bus_lock_counter to '0'. Eg. In the > > + * scenario, where a buslock happened in L1 before VMRUN, the bus lock > > + * firmly happened on an instruction in the past. Even if vmcb01's > > + * counter is still '1', (because the guilty instruction got patched), > > + * the vCPU has clearly made forward progress and so KVM should reset > > + * vmcb02's counter to '0'. > > + * > > + * If the RIP hasn't changed, stash the bus lock counter at nested VMRUN > > + * to prevent the same guilty instruction from triggering a VM-Exit. Eg. > > + * if userspace rate-limits the vCPU, then it's entirely possible that > > + * L1's tick interrupt is pending by the time userspace re-runs the > > + * vCPU. If KVM unconditionally clears the counter on VMRUN, then when > > + * L1 re-enters L2, the same instruction will trigger a VM-Exit and the > > + * entire cycle start over. > > + */ > > + if (vmcb02->save.rip && (svm->nested.ctl.bus_lock_rip == > > vmcb02->save.rip)) > > Is vmcb02->save.rip the right thing to use here?
Objection, leading the witness your honor. > IIUC, we want to find out if L2 made forward progress since the last bus > lock #VMEXIT from L2 to L0 More or less. >, at which point we record bus_lock_rip. > However, on nested VMRUN, vmcb02->save.rip is only populated from the > vmcb12 in nested_vmcb02_prepare_save(), which doesn't happen until after > nested_vmcb02_prepare_control(). So waht we're comparing against here is > L2's RIP last time KVM ran it. Hrm, that definitely seems like what past me intended[*], as this quite clearly suggests my intent was to check the RIP coming from vmcb12: : Again, it's imperfect, e.g. if L1 happens to run a different vCPU at the same : RIP, then KVM will allow a bus lock for the wrong vCPU. But the odds of that : happening are absurdly low unless L1 is deliberately doing weird things, and so : I don't think I have this vague feeling that I deliberately didn't check @vmcb12_rip, but the more I look at this, the more I'm convinced I simply didn't notice @vmcb12_rip when typing up the suggestion. [*] https://lore.kernel.org/all/[email protected] > It's even worse in the KVM_SET_NESTED_STATE path, because > vmcb02->save.rip will be unintialized (probably zero). FWIW, this one is arguably ideal, in the sense that KVM should probably assume the worst if userspace is loading guest state. > Probably you want to use vmcb12_rip here, as this is the RIP that L1 > wants to run L2 with, and what will end up in vmcb02->save.rip. > > HOWEVER, that is also broken in the KVM_SET_NESTED_STATE path. In that > path we pass in the uninitialized vmcb02->save.rip as vmcb12_rip anyway. > Fixing this is non-trivial because KVM_SET_REGS could be called before > or after KVM_SET_NESTED_STATE, so the RIP may not be available at all at > that point. Eh, this code is completely meaningless on KVM_SET_NESTED_STATE. There is no "right" answer, because KVM has no clue what constitutes forward progress. Luckily, no answer is completely "wrong" either, because the #VMEXIT is transparent to L1 and L2. E.g. it's like saying that sending an IRQ to CPU running a vCPU is "wrong"; it's never truly "wrong", just suboptimal (sometimes *very* suboptimal). > We probably want to set a flag in svm_set_nested_state() that the RIP > needs fixing up, and the perhaps in svm_vcpu_pre_run() fix up the > control fields in vmcb02 that depend on it: namely the bus_lock_counter, > next_rip, and soft_int_* fields. Nah, not unless there's a meaningful, negative impact on a real world setup. And as above, in practice the only impact is effectively a performance blip due to generating an unwanted userspace exit. If userspace is stuffing nested state so often that that's actually a performance problem, then userspace can set regs before nested state. There are a pile of edge cases and "what ifs" around this, none of which have a perfect answer since KVM doesn't know userspace's intent. And so I want to go with a solution that is as simple as possible without risking putting L2 into an infinite loop. > It gets worse because I think next_rip is also not always properly > saved/restored because we do not sync it from vmcb02 to cache in > nested_sync_control_from_vmcb02() -- but that one is not relevant to > bus_lock_counter AFAICT. Correct. All in all, I think just this? diff --git a/arch/x86/kvm/svm/nested.c b/arch/x86/kvm/svm/nested.c index de90b104a0dd..482092b2051c 100644 --- a/arch/x86/kvm/svm/nested.c +++ b/arch/x86/kvm/svm/nested.c @@ -809,7 +809,7 @@ static void nested_vmcb02_prepare_control(struct vcpu_svm *svm, * L1 re-enters L2, the same instruction will trigger a VM-Exit and the * entire cycle start over. */ - if (vmcb02->save.rip && (svm->nested.ctl.bus_lock_rip == vmcb02->save.rip)) + if (vmcb12_rip && (svm->nested.ctl.bus_lock_rip == vmcb12_rip)) vmcb02->control.bus_lock_counter = 1; else vmcb02->control.bus_lock_counter = 0;
