On 5/10/26 1:13 PM, David Woodhouse wrote:
> On Sun, 2026-05-10 at 12:11 -0700, H. Peter Anvin wrote:
>> On May 10, 2026 11:54:38 AM PDT, David Woodhouse <[email protected]> wrote:
>>> On Mon, 2026-05-04 at 17:30 -0700, Dongli Zhang wrote:
>>>> The KVM_CLOCK_REALTIME has been introduced to help track the downtime of
>>>> live migration. KVM uses that realtime value to advance guest clock, but
>>>> the same blackout is not reflected in KVM steal time.
>>>>
>>>> Account that same delta in steal time directly in kvm_vm_ioctl_set_clock(),
>>>> only when KVM_CLOCK_REALTIME is used. This keeps the KVM-only solution
>>>> self-contained and avoids adding a new KVM ioctl or requiring additional
>>>> userspace changes (i.e. QEMU).
>>>>
>>>> Record the per-VM downtime delta when KVM_SET_CLOCK receives
>>>> KVM_CLOCK_REALTIME, and fold it into the existing x86 steal accounting
>>>> path. Initialize each vCPU's local cursor
>>>> (vcpu->arch.st.last_downtime_steal) when the guest enables
>>>> MSR_KVM_STEAL_TIME so previously accumulated blackout is not charged.
>>>>
>>>> Note that this means a vCPU may observe additional steal time after
>>>> blackout even if the host side contribution from current->sched_info
>>>> did not increase during that interval.
>>>>
>>>> Signed-off-by: Dongli Zhang <[email protected]>
>>>
>>> I really don't want to see KVM_CLOCK_REALTIME used for anything more
>>> than it already is. Or, indeed, even for that.
>>>
>>> There is precisely *one* place where it's OK to use 'real time' as a
>>> comparator, and that's when setting the guest's TSC. And even then it
>>> should be using TAI not UTC unless you like your guests' clocks jumping
>>> around by a second if you migrate at the wrong time. KVM_CLOCK_REALTIME
>>> was never the right thing to use, for anything.
>>>
>>> The KVM clock is a function of the guest's TSC (see
>>> KVM_SET_CLOCK_GUEST), and steal time is a function of that (as it's
>>> measured in nanoseconds).
>>>
>>> Don't bring UTC into it *anywhere*.
>>>
>>>
>>
>> Unfortunately TAI is often unavailable. One can hope that the
>> proposal of abolishing leap seconds by 2035, fixing the TAI-UTC
>> offset permanently, actually happens.
>
> I was hoping for the opposite; it's just pandering to stupid bugs.
> Yes, leap seconds are fairly rare; instead maybe we should *always*
> have a leap second in one direction or the other at the end of the
> year. Otherwise it's just building up to be a bigger problem later.
>
>> The difference between atomic and solar time is better handled with
>> the already-existing "time zones" mechanism, which tends to change
>> far more frequently for entirely different reasons than the TAI-UT1
>> difference slowly accumulates.
>
> I have absolutely no faith in a 'time zones with second precision'
> model ever actually working either. Although maybe if we ditched UTC
> completely (and the pointless 37-second offset frozen in time for ever
> like the GPS offset), and our second-precision time zones were based on
> *TAI* we could exercise them from day one?
>
> Either way, as long as it isn't the awful abomination of *smearing*
> leap seconds and screwing up time precision, nobody actually needs to
> be nailed to anything.
>
> And none of it matters here for *steal time*, since the *only* thing in
> a migration that should be based on any kind of real time is the guest
> TSC, and everything else should be based purely on that (perhaps via
> the kvmclock).
>
> And even then it's only for live *migration* to a different host, as
> live update on the same host across kexec should be purely based on
> offsets from the host's TSC which remains unperturbed.

Thank you very much!

Based on my understanding, you have two main points:

1. KVM_CLOCK_REALTIME is not preferred for live migration or live update.
   Essentially, the only acceptable use of KVM_CLOCK_REALTIME is to adjust
   the guest TSC. After that, everything should rely on kvm-clock
   (especially after KVM_SET_CLOCK_GUEST).

2. Regardless of whether TAI or KVM_CLOCK_REALTIME is used to adjust the
   guest TSC, the steal-time delta should be calculated solely from the
   kvm-clock values before and after the adjustment.
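
To make point 2 concrete, here is a minimal userspace sketch of what I have
in mind. It is not actual KVM code: struct kvmclock_params, kvmclock_ns() and
blackout_steal_ns() are hypothetical names used only for illustration. The
scaling follows the usual pvclock convention (tsc_shift followed by the 32.32
fixed-point tsc_to_system_mul), and unsigned __int128 assumes a 64-bit
GCC/Clang build.

```c
#include <stdint.h>

/* Hypothetical, simplified mirror of the pvclock scaling parameters. */
struct kvmclock_params {
	uint64_t tsc_timestamp;     /* guest TSC at the reference point */
	uint64_t system_time_ns;    /* kvm-clock value at the reference point */
	uint32_t tsc_to_system_mul; /* 32.32 fixed-point ns-per-tick multiplier */
	int8_t   tsc_shift;         /* shift applied to the TSC delta first */
};

/* kvm-clock as a pure function of the guest TSC; no wall clock involved. */
static uint64_t kvmclock_ns(const struct kvmclock_params *p, uint64_t guest_tsc)
{
	uint64_t delta = guest_tsc - p->tsc_timestamp;

	if (p->tsc_shift >= 0)
		delta <<= p->tsc_shift;
	else
		delta >>= -p->tsc_shift;

	/* 64x32-bit multiply, keeping the integer part of the 32.32 result. */
	return p->system_time_ns +
	       (uint64_t)(((unsigned __int128)delta * p->tsc_to_system_mul) >> 32);
}

/*
 * Blackout to charge as steal time: the difference between the kvm-clock
 * value derived from the guest TSC before the adjustment and the one derived
 * from the guest TSC after it. Clamped to zero so that the guest never sees
 * steal time go backwards.
 */
static uint64_t blackout_steal_ns(const struct kvmclock_params *before_p,
				  uint64_t tsc_before,
				  const struct kvmclock_params *after_p,
				  uint64_t tsc_after)
{
	uint64_t before = kvmclock_ns(before_p, tsc_before);
	uint64_t after = kvmclock_ns(after_p, tsc_after);

	return after > before ? after - before : 0;
}
```

How the guest TSC itself gets advanced (TAI, KVM_CLOCK_REALTIME, or host TSC
offsets for same-host live update) is deliberately outside this sketch; only
the resulting kvm-clock values feed the steal-time delta.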

Would you mind confirming whether my understanding of your points is correct?

Dongli Zhang