On Fri, 2026-05-08 at 15:40 -0700, Sean Christopherson wrote:
> On Mon, May 04, 2026, Dongli Zhang wrote:
> > KVM does not support vCPU hotplug. When a vCPU is removed, its
> > corresponding data structures are not freed by KVM. Instead, QEMU destroys
> > only the userspace state and the vCPU thread, while the KVM vCPU fd remains
> > open and parked in QEMU.
> > 
> > As a result, vcpu->arch.st.last_steal is not reset.
> > 
> > If the same vCPU is later re-created by QEMU, last_steal retains its old
> > value, while current->sched_info.run_delay starts from zero since a new
> > vCPU thread is created. This causes
> > current->sched_info.run_delay - vcpu->arch.st.last_steal to produce a
> > large, bogus value.
> > 
> > Fix this by resetting vcpu->arch.st.last_steal to
> > current->sched_info.run_delay when KVM steal time is enabled.
> 
> This is quite arbitrary.  E.g. if userspace hands the vCPU off to a
> different task without going through QEMU's hotplug dance, then
> current->sched_info.run_delay will also change.
> 
> Shouldn't x86 hook kvm_arch_vcpu_run_pid_change() and reset last_steal in 
> there?
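
Quite possibly, yes. Something along these lines, presumably? A
completely untested sketch (x86 would also need to select
CONFIG_HAVE_KVM_VCPU_RUN_PID_CHANGE for the hook to be invoked at all):

int kvm_arch_vcpu_run_pid_change(struct kvm_vcpu *vcpu)
{
	/*
	 * A new task is now running this vCPU, so the last_steal
	 * snapshot taken from the previous task's sched_info.run_delay
	 * is meaningless.  Re-baseline it so that record_steal_time()
	 * doesn't report the new task's entire accumulated run_delay
	 * as steal time.
	 */
	vcpu->arch.st.last_steal = current->sched_info.run_delay;
	return 0;
}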

I'd like to be sure that we get this right for live update and live migration.

I think we *do* get it right for the Xen runstate info. When the guest
is paused on the source (or before the kexec), the VMM reads out the
runstate info, which contains the current kvmclock time *and* the
cumulative time the vCPU has spent in each state.

On restore on the destination (or after kexec), userspace provides that
back to the kernel, along with the KVM clock information.
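
(Roughly this dance on the VMM side; an illustrative fragment only, not
QEMU's actual code, with vcpu_fd/vm_fd and the previously saved state
and clock assumed to be in hand:)

struct kvm_xen_vcpu_attr attr = {
	.type = KVM_XEN_VCPU_ATTR_TYPE_RUNSTATE_DATA,
	.u.runstate = {
		.state            = saved.state,
		.state_entry_time = saved.state_entry_time,
		.time_running     = saved.time_running,
		.time_runnable    = saved.time_runnable,
		.time_blocked     = saved.time_blocked,
		.time_offline     = saved.time_offline,
	},
};

/* Feed the saved runstate back in... */
if (ioctl(vcpu_fd, KVM_XEN_VCPU_SET_ATTR, &attr))
	err(1, "KVM_XEN_VCPU_SET_ATTR");

/* ...along with the saved KVM clock (struct kvm_clock_data). */
if (ioctl(vm_fd, KVM_SET_CLOCK, &saved_clock))
	err(1, "KVM_SET_CLOCK");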

As the KVM clock has naturally progressed in the intervening time, the
delta between the kvmclock saved in the runstate and the kvmclock when
the vCPU actually gets to run again is reported (correctly) as steal
time.
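
IOW, conceptually (not the literal code in arch/x86/kvm/xen.c):

/* kvmclock value captured in the saved runstate */
u64 then = saved.state_entry_time;

/* kvmclock once the vCPU gets to run again on the destination */
u64 now = get_kvmclock_ns(kvm);

/* That gap is accounted as RUNSTATE_runnable, i.e. as steal time */
runstate.time[RUNSTATE_runnable] += now - then;
runstate.state_entry_time = now;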

