On Tue, Feb 03, 2026 at 06:58:08AM -0800, Sean Christopherson wrote:
> On Mon, Feb 02, 2026, Yosry Ahmed wrote:
> > On Fri, May 02, 2025 at 05:03:45AM +0000, Manali Shukla wrote:
> > > diff --git a/arch/x86/kvm/svm/nested.c b/arch/x86/kvm/svm/nested.c
> > > index 834b67672d50..5369d9517fbb 100644
> > > --- a/arch/x86/kvm/svm/nested.c
> > > +++ b/arch/x86/kvm/svm/nested.c
> > > @@ -678,6 +678,33 @@ static void nested_vmcb02_prepare_control(struct 
> > > vcpu_svm *svm,
> > >   vmcb02->control.iopm_base_pa = vmcb01->control.iopm_base_pa;
> > >   vmcb02->control.msrpm_base_pa = vmcb01->control.msrpm_base_pa;
> > >  
> > > + /*
> > > +  * Stash vmcb02's counter if the guest hasn't moved past the guilty
> > > +  * instrution; otherwise, reset the counter to '0'.
> > > +  *
> > > +  * In order to detect if L2 has made forward progress or not, track the
> > > +  * RIP at which a bus lock has occurred on a per-vmcb12 basis.  If RIP
> > > +  * is changed, guest has clearly made forward progress, bus_lock_counter
> > > +  * still remained '1', so reset bus_lock_counter to '0'. Eg. In the
> > > +  * scenario, where a buslock happened in L1 before VMRUN, the bus lock
> > > +  * firmly happened on an instruction in the past. Even if vmcb01's
> > > +  * counter is still '1', (because the guilty instruction got patched),
> > > +  * the vCPU has clearly made forward progress and so KVM should reset
> > > +  * vmcb02's counter to '0'.
> > > +  *
> > > +  * If the RIP hasn't changed, stash the bus lock counter at nested VMRUN
> > > +  * to prevent the same guilty instruction from triggering a VM-Exit. Eg.
> > > +  * if userspace rate-limits the vCPU, then it's entirely possible that
> > > +  * L1's tick interrupt is pending by the time userspace re-runs the
> > > +  * vCPU.  If KVM unconditionally clears the counter on VMRUN, then when
> > > +  * L1 re-enters L2, the same instruction will trigger a VM-Exit and the
> > > +  * entire cycle start over.
> > > +  */
> > > + if (vmcb02->save.rip && (svm->nested.ctl.bus_lock_rip == 
> > > vmcb02->save.rip))
> > 
> > Is vmcb02->save.rip the right thing to use here?
> 
> Objection, leading the witness your honor.

Withdrawn.

> 
> > IIUC, we want to find out if L2 made forward progress since the last bus
> > lock #VMEXIT from L2 to L0
> 
> More or less.
>   
> >, at which point we record bus_lock_rip.
> > However, on nested VMRUN, vmcb02->save.rip is only populated from the
> > vmcb12 in nested_vmcb02_prepare_save(), which doesn't happen until after
> > nested_vmcb02_prepare_control(). So waht we're comparing against here is
> > L2's RIP last time KVM ran it.
> 
> Hrm, that definitely seems like what past me intended[*], as this quite 
> clearly
> suggests my intent was to check the RIP coming from vmcb12:
> 
>  : Again, it's imperfect, e.g. if L1 happens to run a different vCPU at the 
> same
>  : RIP, then KVM will allow a bus lock for the wrong vCPU.  But the odds of 
> that
>  : happening are absurdly low unless L1 is deliberately doing weird things, 
> and so
>  : I don't think
> 
> I have this vague feeling that I deliberately didn't check @vmcb12_rip, but 
> the
> more I look at this, the more I'm convinced I simply didn't notice @vmcb12_rip
> when typing up the suggestion.

Yeah I looked through the history trying to figure out if the choice was
intentional but I couldn't find a clue.

> 
> [*] https://lore.kernel.org/all/[email protected]
> 
> > It's even worse in the KVM_SET_NESTED_STATE path, because
> > vmcb02->save.rip will be unintialized (probably zero).
> 
> FWIW, this one is arguably ideal, in the sense that KVM should probably assume
> the worst if userspace is loading guest state.
> 
> > Probably you want to use vmcb12_rip here, as this is the RIP that L1
> > wants to run L2 with, and what will end up in vmcb02->save.rip.
> >
> > HOWEVER, that is also broken in the KVM_SET_NESTED_STATE path. In that
> > path we pass in the uninitialized vmcb02->save.rip as vmcb12_rip anyway.
> 
> 
> > Fixing this is non-trivial because KVM_SET_REGS could be called before
> > or after KVM_SET_NESTED_STATE, so the RIP may not be available at all at
> > that point.
> 
> Eh, this code is completely meaningless on KVM_SET_NESTED_STATE.  There is no
> "right" answer, because KVM has no clue what constitutes forward progress.
> 
> Luckily, no answer is completely "wrong" either, because the #VMEXIT is 
> transparent
> to L1 and L2.  E.g. it's like saying that sending an IRQ to CPU running a 
> vCPU is
> "wrong"; it's never truly "wrong", just suboptimal (sometimes *very* 
> suboptimal).
> 
> > We probably want to set a flag in svm_set_nested_state() that the RIP
> > needs fixing up, and the perhaps in svm_vcpu_pre_run() fix up the
> > control fields in vmcb02 that depend on it: namely the bus_lock_counter,
> > next_rip, and soft_int_* fields.
> 
> Nah, not unless there's a meaningful, negative impact on a real world setup.
> And as above, in practice the only impact is effectively a performance blip 
> due
> to generating an unwanted userspace exit.  If userspace is stuffing nested 
> state
> so often that that's actually a performance problem, then userspace can set 
> regs
> before nested state.

But the wrong @vmcb12_rip also affects soft IRQ injection, which I know
is not relevant here, but is part of the reason I brought this up.

If the guest has NRIPS, we use @vmcb12_rip to figure out if we need to
reset next_rip after a VMRUN. If the guest does not have NRIPS, we put
it directly in next_rip.

Either way we could end up with a messed up next_rip, and that can mess
up the guest IIUC. Of course, all of that only happens if we migrate
with a pending soft IRQ which is probably unlikely, but could happen.

> 
> There are a pile of edge cases and "what ifs" around this, none of which have 
> a
> perfect answer since KVM doesn't know userspace's intent.  And so I want to go
> with a solution that is as simple as possible without risking putting L2 into 
> an
> infinite loop.
> 
> > It gets worse because I think next_rip is also not always properly
> > saved/restored because we do not sync it from vmcb02 to cache in
> > nested_sync_control_from_vmcb02() -- but that one is not relevant to
> > bus_lock_counter AFAICT.
> 
> Correct.
> 
> All in all, I think just this?

For the bus lock thing that's probably the right thing to do. It fixes
the usual nested VMRUN path and doesn't really make the
KVM_SET_NESTED_STATE path any worse (feel free to add a Reviewed-by from
me).

But I am wondering if we should also fix @vmcb12_rip in the
KVM_SET_NESTED_STATE path (separately), for the soft IRQ case as well.

> 
> diff --git a/arch/x86/kvm/svm/nested.c b/arch/x86/kvm/svm/nested.c
> index de90b104a0dd..482092b2051c 100644
> --- a/arch/x86/kvm/svm/nested.c
> +++ b/arch/x86/kvm/svm/nested.c
> @@ -809,7 +809,7 @@ static void nested_vmcb02_prepare_control(struct vcpu_svm 
> *svm,
>          * L1 re-enters L2, the same instruction will trigger a VM-Exit and 
> the
>          * entire cycle start over.
>          */
> -       if (vmcb02->save.rip && (svm->nested.ctl.bus_lock_rip == 
> vmcb02->save.rip))
> +       if (vmcb12_rip && (svm->nested.ctl.bus_lock_rip == vmcb12_rip))
>                 vmcb02->control.bus_lock_counter = 1;
>         else
>                 vmcb02->control.bus_lock_counter = 0;

Reply via email to