On Tue, 2026-05-19 at 14:23 -0700, Dongli Zhang wrote:
> I think I now understand why I feel like I am always asking weird questions. I
> have been thinking about how to account for downtime, so I see
> KVM_SET_CLOCK_GUEST as a supplement to KVM_SET_CLOCK.

I do not believe in "downtime". There is no such thing.
There is only "steal time".

A CPU may be off in the weeds — a vCPU suffering steal time, or even a
pCPU in SMM which is effectively the same thing — but time doesn't
stop, and neither does the TSC.

> Suppose we are not going to account for any downtime. With 
> KVM_SET_CLOCK_GUEST:
> 
> 1. The masterclock is active, so gTSC is synchronized across vCPUs. All vCPUs
> share the same kvm_read_l1_tsc(v, ka->master_cycle_now).

Strictly, by the time we get to the end of my series, masterclock is
active *because* all the vCPUs are running at the same TSC rate (even
if the guest set them to different offsets). But OK.

> 2. Migrate the gTSC to the target VM however people want (either absolute
> value or offset value). (Optional) Account for downtime in gTSC however people
> want, even with KVM_SET_CLOCK/KVM_CLOCK_REALTIME, which you may not like.
>
> 3. Adjust kvm-clock (that is, ka->kvmclock_offset) with KVM_SET_CLOCK_GUEST.
> 
> That is why you think KVM_SET_CLOCK is no longer required if we have
> KVM_SET_CLOCK_GUEST. While I think KVM_SET_CLOCK is required because of
> KVM_CLOCK_REALTIME.

If I recall correctly what we described in
https://lore.kernel.org/all/[email protected]/
I don't think we actually needed KVM_SET_CLOCK at all, did we?

We *abuse* KVM_GET_CLOCK to give us a tuple of {realtime, host TSC}
because there's actually no other way for *userspace* to get that. We
don't actually *care* about the KVM clock part.

We use the {realtime, host TSC} pair to reconstitute the guest TSC
values to correctly reflect the passing of time while the guest was in
the ether.
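
A minimal sketch of that arithmetic, in Python for brevity. It assumes
the source and destination hosts agree on realtime (e.g. both are
NTP-synced), that the guest TSC runs at a known fixed rate, and that no
TSC scaling is involved, so guest TSC = host TSC + offset (mod 2^64).
The function and parameter names are hypothetical, not a KVM or QEMU
API:

```python
NSEC_PER_SEC = 1_000_000_000
U64_MASK = (1 << 64) - 1  # TSC values and offsets wrap at 64 bits


def guest_tsc_at(src_realtime_ns, src_guest_tsc, guest_tsc_hz, dst_realtime_ns):
    """Guest TSC value that should be visible at dst_realtime_ns.

    (src_realtime_ns, src_guest_tsc) is a snapshot taken on the source:
    the guest TSC derived from the host TSC half of the
    {realtime, host TSC} pair. Time kept passing while the guest was not
    running, so the TSC is advanced by the elapsed realtime.
    """
    elapsed_ns = dst_realtime_ns - src_realtime_ns
    elapsed_ticks = elapsed_ns * guest_tsc_hz // NSEC_PER_SEC
    return (src_guest_tsc + elapsed_ticks) & U64_MASK


def dest_tsc_offset(dst_host_tsc, wanted_guest_tsc):
    """Offset to program on the destination (assumes no TSC scaling)."""
    return (wanted_guest_tsc - dst_host_tsc) & U64_MASK
```

Here dst_realtime_ns and dst_host_tsc would be one {realtime, host TSC}
pair sampled on the destination, so that the offset is computed against
the same instant for which the guest TSC was calculated.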

> If it isn't required to account for any downtime in gTSC, or if there is
> another way to do so, KVM_SET_CLOCK_GUEST alone is enough.

Right. If you only want the guest to come back with the *same* values
in its TSC as before the migration, as if the TSC was *paused* during
the migration, then you can just restore those values and use
KVM_SET_CLOCK_GUEST. Assuming you are on modern hardware and have set
all vCPUs to the same rate (and are using this series so the *guest*
can't break masterclock for you, and you can trust that
KVM_SET_CLOCK_GUEST will work).

> > 
> > > Another scenario is when only MASTERCLOCK_UPDATE is pending and there is
> > > no pending CLOCK_UPDATE.
> > >
> > > In this scenario, is it fine to skip processing MASTERCLOCK_UPDATE before
> > > saving pvclock_vcpu_time_info?
> > > 
> > 
> > I'm not sure I understand that scenario. 
> > 
> > MASTERCLOCK_UPDATE means we have to actually recalculate the master
> > clock (which really *should* be rare, now!). And then any time we do
> > that, we also have to do a CLOCK_UPDATE on every vCPU to disseminate
> > the new information. Which is why kvm_end_pvclock_update() does exactly
> > that.
> > 
> > So your "MASTERCLOCK_UPDATE is pending and there is no pending
> > CLOCK_UPDATE" doesn't make much sense to me. If MASTERCLOCK_UPDATE is
> > pending, then there *will* be a CLOCK_UPDATE pending.
> 
> Suppose the VM is stopped and the master clock is active.

I don't know what it means for a VM to be 'stopped'. Do you mean that
all vCPUs happen to be experiencing steal time at the present moment?

> Suddenly, we change the host clocksource from TSC to HPET.
> pvclock_gtod_notify() may call pvclock_gtod_update_fn() to set a pending
> KVM_REQ_MASTERCLOCK_UPDATE for all vCPUs. Unless the pending
> KVM_REQ_MASTERCLOCK_UPDATE is processed by kvm_update_masterclock(),
> kvm_end_pvclock_update() will not set a pending KVM_REQ_CLOCK_UPDATE.

You say 'Unless'... do you mean 'Until'?

> Therefore, this is a scenario in which only KVM_REQ_MASTERCLOCK_UPDATE is
> pending.
>
> I do not think this scenario is important. I am just curious about the
> expected way to implement similar code in the future :)

I think that's working correctly. Until the master clock has *actually*
been updated, there's no point in setting CLOCK_UPDATE for each vCPU to
disseminate the new information to its own pvclock?



> > 
> > 
> > > > > 
> > > > > Would it be helpful to validate that the delta is within a reasonable
> > > > > range, e.g. that the drift can never be more than five minutes (forward
> > > > > or backward)?
> > > > 
> > > > If a guest has been running for months on a previous host and is
> > > > migrated to a new host, don't we expect that the KVM clock of the new
> > > > VM on the new host is tweaked from its default near-zero after
> > > > creation, to some large amount?
> > > > 
> > > 
> > > Regarding live migration, my own investigation does not show a
> > > proportional relationship between VM uptime and the amount of drift.
> > 
> > You're comparing the VM on the source host, with the VM on the
> > destination post-migration.
> 
> Apologies for making it confusing. I was just trying to explain why I think
> the kvm-clock drift will not be large.

Sure, but I don't care. If we have a sane API, the drift should be zero
:)

> We previously discussed the vCPU hotplug and kvm-clock drift issue. The longer
> the time interval between two vCPU hotplug events, the larger the drift.
> 
> For live migration (with QEMU), I provided the equation to show that the drift
> will not be large, because it is determined by something else rather than by
> how long the VM has been running on the source server.
>
> For the previous vCPU hotplug and kvm-clock bug, if we add more vCPUs to a
> guest that has been running for three months, the drift will be relatively
> larger.
> 
> For QEMU live migration, migrating a guest VM that has been running on the
> source host for *three months* versus one that has been running for *one day*
> will not cause much difference in kvm-clock drift.

Right.

> For the ideal live update case (on the same host), there may be no need to
> adjust gTSC so that it keeps incrementing. In that case, KVM_SET_CLOCK_GUEST
> can be used to adjust kvm-clock based on gTSC.

Right. You restore the gTSC using its *offset* from the host TSC which
hasn't stopped counting on the same host. Then use KVM_SET_CLOCK_GUEST
to restore the kvmclock in terms of the gTSC. And you have an
absolutely cycle-perfect migration.
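
To illustrate why that can be cycle-perfect: the kvmclock value a guest
computes is a pure function of its TSC reading and the fields of
pvclock_vcpu_time_info. Below is a simplified sketch of the guest-side
calculation from the pvclock ABI (it ignores the version/seqlock
handling and the flags field). The helper names are hypothetical, and
guest_tsc() assumes no TSC scaling:

```python
U64_MASK = (1 << 64) - 1  # all quantities are 64-bit in the ABI


def guest_tsc(host_tsc, tsc_offset):
    """Guest TSC as host TSC plus offset (assumes no TSC scaling)."""
    return (host_tsc + tsc_offset) & U64_MASK


def pvclock_ns(tsc, tsc_timestamp, system_time, tsc_to_system_mul, tsc_shift):
    """kvmclock in ns for a given guest TSC reading.

    tsc_to_system_mul is a 32-bit fixed-point multiplier and tsc_shift a
    signed pre-shift, as in struct pvclock_vcpu_time_info.
    """
    delta = (tsc - tsc_timestamp) & U64_MASK
    if tsc_shift >= 0:
        delta = (delta << tsc_shift) & U64_MASK
    else:
        delta >>= -tsc_shift
    return (system_time + ((delta * tsc_to_system_mul) >> 32)) & U64_MASK
```

If the same offset is restored on the same host, guest_tsc() returns
exactly what the guest would have read had nothing happened, and so
pvclock_ns() with the same saved fields continues without a step.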

> For the live migration scenario, the current QEMU implementation not only
> fails to account for downtime, but also has a drift issue. That is what I
> would like to address in QEMU.

Again, restore the gTSC as accurately as possible. Probably by working
out for *yourself* the relationships of the source and destination host
TSCs to real time, and then reconstituting on the destination using TSC
offset just as for live update.

And then use KVM_SET_CLOCK_GUEST too.

That's what I attempted to document in
https://lore.kernel.org/all/[email protected]/
and should probably revive.
