Re: [PATCH v4 04/30] KVM: x86: Add KVM_[GS]ET_CLOCK_GUEST for accurate KVM clock migration

Dongli Zhang Tue, 19 May 2026 14:25:26 -0700

On 2026-05-19 12:50 AM, David Woodhouse wrote:
> On Mon, 2026-05-18 at 17:57 -0700, Dongli Zhang wrote:
>> On 2026-05-18 1:48 AM, David Woodhouse wrote:
>>> ...
>>
>> I have fixed the Thunderbird configuration. Does it look better to you?
> 
> The date is certainly better, thank you. But although I *was* up late
> that night frowning at clocks, I didn't think I was up *quite* as late
> (almost 2am) as it suggests.
> 
> But I suspect that getting *that* right is beyond the limit of
> Thunderbird's configurability.
> 
> Thanks :)

Let me continue exploring how to add a timezone, such as "17:57 -0700".

> 
>> I really appreciate guidelines like the ones below.
>>
>> https://lore.kernel.org/all/[email protected]
>>
>> Assuming I am a user of the new API, I feel confused about whether the goal 
>> is
>> to replace KVM_SET_CLOCK with KVM_SET_CLOCK_GUEST, or whether the latter is
>> meant to supplement the former.
> 
> The issue is that KVM_SET_CLOCK_GUEST can only be used in 'masterclock'
> mode, when the TSC is reliable and the guest TSCs are all in sync.
> 
> Which ought to be *all* of the time, on modern hardware and sane
> configurations. And in this series, I don't even let the *guest* screw
> that over by setting different TSC offsets on different vCPUs any more
> (we stay in masterclock mode in that case now). But the VMM can cause
> its guest to come out of masterclock mode, by setting different TSC
> *speeds* on different vCPUs.
> 
> So there remain some pathological cases where the kvmclock actually
> still has a justification to exist, and those are the cases where it
> needs to be set in its own right as a function of host time
> (KVM_SET_CLOCK), not purely as a function of the guest TSC
> (KVM_SET_CLOCK_GUEST).

I think I now understand why I feel like I am always asking weird questions. I
have been thinking about how to account for downtime, so I see
KVM_SET_CLOCK_GUEST as a supplement to KVM_SET_CLOCK.

Suppose we are not going to account for any downtime. With KVM_SET_CLOCK_GUEST:

1. The masterclock is active, so gTSC is synchronized across vCPUs. All vCPUs
share the same kvm_read_l1_tsc(v, ka->master_cycle_now).

2. Migrate the gTSC to the target VM however people want (either ablolute value
or offset value). (Optional) Account for downtime in gTSC however people want,
even with KVM_SET_CLOCK/KVM_CLOCK_REALTIME, which you may not like.

3. Adjust kvm-clock (that is, ka->kvmclock_offset) with KVM_SET_CLOCK_GUEST.

That is why you think KVM_SET_CLOCK is no longer required if we have
KVM_SET_CLOCK_GUEST. While I think KVM_SET_CLOCK is required because of
KVM_CLOCK_REALTIME.

It it isn't required to account any downtime for gTSC or if there is another way
to do so, only KVM_SET_CLOCK_GUEST is enough.

> 
> 
>>
>> If we are going to use KVM_SET_CLOCK_GUEST when KVM_SET_CLOCK is not needed, 
>> I
>> would appreciate it if the API could carry more data in addition to struct
>> pvclock_vcpu_time_info.
>>
>> +#define KVM_SET_CLOCK_GUEST    _IOW(KVMIO, 0xd6, struct 
>> pvclock_vcpu_time_info)
>> +#define KVM_GET_CLOCK_GUEST    _IOR(KVMIO, 0xd7, struct 
>> pvclock_vcpu_time_info)
>>
>>

[snip]

>>
>> Another scenario is when only MASTERCLOCK_UPDATE is pending and there is no
>> pending CLOCK_UPDATE.
>>
>> In this scenario, is it fine to skip processing MASTERCLOCK_UPDATE before 
>> saving
>> pvclock_vcpu_time_info?
>>
> 
> I'm not sure I understand that scenario. 
> 
> MASTERCLOCK_UPDATE means we have to actually recalculate the master
> clock (which really *should* be rare, now!). And then any time we do
> that, we also have to do a CLOCK_UPDATE on every vCPU to disseminate
> the new information. Which is why kvm_end_pvclock_update() does exactly
> that.
> 
> So your "MASTERCLOCK_UPDATE is pending and there is no pending
> CLOCK_UPDATE" doesn't make much sense to me. If MASTERCLOCK_UPDATE is
> pending, then there *will* be a CLOCK_UPDATE pending.

Suppose the VM is stopped and the master clock is active.

Suddenly, we change the host clocksource from TSC to HPET. pvclock_gtod_notify()
may call pvclock_gtod_update_fn() to set a pending KVM_REQ_MASTERCLOCK_UPDATE
for all vCPUs. Unless the pending KVM_REQ_MASTERCLOCK_UPDATE is processed by
kvm_update_masterclock(), kvm_end_pvclock_update() will not set a pending
KVM_REQ_CLOCK_UPDATE.

Therefore, this is a scenario in which only KVM_REQ_MASTERCLOCK_UPDATE is 
pending.

I do not think this scenario is important. I am just curious about the expected
way to implement similar code in the future :)

> 
> 
>>>>
>>>> Would it be helpful to validate that the delta is within a reasonable 
>>>> range,
>>>> e.g. that the drift can never be more than five minutes (forward or 
>>>> backward)?
>>>
>>> If a guest has been running for months on a previous host and is
>>> migrated to a new host, don't we expect that the KVM clock of the new
>>> VM on the new host is tweaked from its default near-zero after
>>> creation, to some large amount?
>>>
>>
>> Regarding live migration, my own investigation does not show a proportional
>> relationship between VM uptime and the amount of drift.
> 
> You're comparing the VM on the source host, with the VM on the
> destination post-migration.

Apologies for making it confusing. I was just trying to explain why I think the
kvm-clock drift will not be large.

We previously discussed the vCPU hotplug and kvm-clock drift issue. The longer
the time interval between two vCPU hotplug events, the larger the drift.

For live migration (with QEMU), I provided the equation to show that the drift
will not be large, because it is determined by something else rather than by how
long the VM has been running on the source server.


For the previous vCPU hotplug and kvm-clock bug, if we add more vCPUs to a guest
that has been running for three months, the drift will be relatively larger.

For QEMU live migration, migrating a guest VM that has been running on the
source host for *three months* versus one that has been running for *one day*
will not cause much difference in kvm-clock drift.

> 
> Perhaps I misunderstood, but I thought your suggested validation of a
> 'reasonable range' would also apply when adjusting the kvmclock of the
> nascent VM on the destination host, from "newly created" to "has been
> running for months" while migrating the state of the actual guest onto
> a clean new slate.
> 
>> Just taking QEMU + KVM as an example: suppose TSC scaling is inactive, the
>> amount of drift does not depend on how long the VM has been running before 
>> live
>> migration.
>>
>> Instead, it depends on the delta between when we call MSR_IA32_TSC and
>> KVM_GET_CLOCK, and between MSR_IA32_TSC and KVM_SET_CLOCK.
>>
>> The guest TSC stops at P1 and resumes at P3.
>> The kvmclock stops at P2 and resumes at P4.
>>
>> We expect P1 == P2 and P3 == P4.
>>
>> On source host.
>>
>> - kvm_get_msr_common(MSR_IA32_TSC) for vCPU=0 ===> P1
> 
> Here's where it all starts going wrong. Line 1.
> 
> Any API which lets you get a single time value in isolation, and thus
> which is already out of date by the time the system call even returns,
> is fundamentally unsuitable for migration.
> 
>> - kvm_get_msr_common(MSR_IA32_TSC) for vCPU=1
>> - kvm_get_msr_common(MSR_IA32_TSC) for vCPU=2
>> - kvm_get_msr_common(MSR_IA32_TSC) for vCPU=3
>> - kvm_get_msr_common(MSR_IA32_TSC) for vCPU=4
>> ... ...
>> - kvm_get_msr_common(MSR_IA32_TSC) for vCPU=N
>> - KVM_GET_CLOCK                               ===> P2
>>
>> On target host.
>>
>> - kvm_set_msr_common(MSR_IA32_TSC) for vCPU=1 ===> P3
>> - kvm_set_msr_common(MSR_IA32_TSC) for vCPU=2
> 
> At this point, the nasty hack in the kernel steps in, realises that the
> value you're setting on vCPU 2 is within a second or so of the value
> you had previously set on vCPU 1, and snaps it back to be precisely the
> same. To work around the fundamental brokenness of this method.
> 
>> - kvm_set_msr_common(MSR_IA32_TSC) for vCPU=3
>> - kvm_set_msr_common(MSR_IA32_TSC) for vCPU=4
>> - kvm_set_msr_common(MSR_IA32_TSC) for vCPU=5
>> ... ...
>> - kvm_set_msr_common(MSR_IA32_TSC) for vCPU=N
>> - KVM_SET_CLOCK                               ====> P4
>>
>>
>> Here is my equiation to predict the drift.
> 
> I'm sure you're right, but I didn't get that far when looking at this.
> I'd already thrown up in my mouth a little bit by line one.
> 
> Here's my equation to predict the drift of a live update done correctly
> on the same host using the method I've now put in the documentation:
> 
> 0.

For the ideal live update case (on the same host), there may be no need to
adjust gTSC so that it keeps incrementing. In that case, KVM_SET_CLOCK_GUEST can
be used to adjust kvm-clock based on gTSC.

For the live migration scenario, the current QEMU implementation not only fails
to account for downtime, but also has a drift issue. That is what I would like
to address in QEMU.

Thank you very much!

Dongli Zhang
Re: [PATCH v4 04/30] KVM: x86: Add KVM_[GS]ET_CLOCK_GUEST for accurate KVM clock migration

Reply via email to