Hi,
I'm seeing KVM_GET_CLOCK return values ~253 years in the future when
running KVM inside a Hyper-V VM (nested virtualization). I tracked
it down to an unsigned wraparound in __get_kvmclock() and have
bpftrace data showing the exact failure.
Setup:
- Intel i7-11800H laptop running Windows with Hyper-V
- L1 guest: Ubuntu 24.04, kernel 6.8.0, 4 vCPUs
- Clocksource: hyperv_clocksource_tsc_page (VDSO_CLOCKMODE_HVCLOCK)
- KVM running inside L1, hosting L2 guests
Root cause:
__get_kvmclock() does:
	hv_clock.tsc_timestamp = ka->master_cycle_now;
	hv_clock.system_time = ka->master_kernel_ns + ka->kvmclock_offset;
	...
	data->clock = __pvclock_read_cycles(&hv_clock, data->host_tsc);
and __pvclock_read_cycles() does:
delta = tsc - src->tsc_timestamp; /* unsigned */
master_cycle_now is a raw RDTSC captured by
pvclock_update_vm_gtod_copy() via the vgettsc() HVCLOCK path:
hv_read_tsc_page_tsc() computes a cross-CPU-consistent reference
counter from the TSC page scale/offset, but stores the *raw* RDTSC
into its tsc_timestamp out-parameter as a side effect, and that raw
value is what lands in master_cycle_now. host_tsc, by contrast, is a
raw RDTSC that __get_kvmclock() reads directly (rdtsc()) on whichever
CPU it happens to run on.
Under Hyper-V, raw RDTSC values are not consistent across vCPUs; the
hypervisor corrects them only through the TSC page scale/offset. So
if pvclock_update_vm_gtod_copy() runs on CPU 0 and __get_kvmclock()
later runs on CPU 1 where the raw TSC reads lower, the unsigned
subtraction wraps to a value just below 2^64.
I wrote a bpftrace tracer (included below) to instrument both
functions and captured two corruption events:
Event 1:
[GTOD_COPY] pid=2117649 cpu=0->0 use_master=1
mcn=598992030530137 mkn=259977082393200
[GET_CLOCK] pid=2117649 entry_cpu=1 exit_cpu=1 use_master=1
clock=8006399342167092479 host_tsc=598991848289183
master_cycle_now=598992030530137
system_time(mkn+off)=5175860260
TSC DEFICIT: 182240954 cycles
master_cycle_now captured on CPU 0, host_tsc read on CPU 1.
CPU 1's raw RDTSC was 182M cycles lower.
598991848289183 - 598992030530137 = 18446744073527310662 (u64)
Returned clock: 8,006,399,342,167,092,479 ns (~253.7 years)
Correct system_time: 5,175,860,260 ns (~5.2 seconds)
Event 2:
[GTOD_COPY] pid=2117953 cpu=0->0 use_master=1
mcn=599040238416510
[GET_CLOCK] pid=2117953 entry_cpu=3 exit_cpu=3 use_master=1
clock=8006399342464295526 host_tsc=599040211994220
master_cycle_now=599040238416510
TSC DEFICIT: 26422290 cycles
Same pattern, CPU 0 vs CPU 3, 26M cycle deficit.
kvm_get_wall_clock_epoch() has the same pattern -- fresh host_tsc
vs stale master_cycle_now passed to __pvclock_read_cycles().
The simplest fix I can think of is guarding the __pvclock_read_cycles
call in __get_kvmclock():
	if (data->host_tsc >= hv_clock.tsc_timestamp)
		data->clock = __pvclock_read_cycles(&hv_clock, data->host_tsc);
	else
		data->clock = hv_clock.system_time;
system_time (= master_kernel_ns + kvmclock_offset) was computed from
the TSC page's corrected reference counter and is accurate regardless
of CPU. The fallback loses sub-us interpolation but avoids a 253-year
jump. On systems with consistent cross-CPU TSC, the branch is never
taken.
One thing I wasn't sure about: when the fallback triggers,
KVM_CLOCK_TSC_STABLE is still set in data->flags. I left it alone
since the returned value is still correct (just less precise), but
I could see an argument for clearing it.
Disabling master clock entirely for HVCLOCK would also work but
seemed heavy -- it sacrifices PVCLOCK_TSC_STABLE_BIT, forces the
guest pvclock read into the atomic64_cmpxchg monotonicity guard,
and triggers KVM_REQ_GLOBAL_CLOCK_UPDATE on vCPU migration.
Reproducer bpftrace script (run while exercising KVM on a Hyper-V
host):
#!/usr/bin/env bpftrace
/*
* Detect host_tsc < master_cycle_now in __get_kvmclock.
*
* struct kvm_clock_data layout (for raw offset reads):
* offset 0: u64 clock
* offset 24: u64 host_tsc
*/
kprobe:__get_kvmclock
{
	$kvm = (struct kvm *)arg0;
	@get_data[tid] = (uint64)arg1;
	@get_use_master[tid] = (uint64)$kvm->arch.use_master_clock;
	@get_mcn[tid] = (uint64)$kvm->arch.master_cycle_now;
	@get_cpu[tid] = cpu;
}
kretprobe:__get_kvmclock
{
	$data_ptr = @get_data[tid];
	if ($data_ptr != 0) {
		$clock = *(uint64 *)($data_ptr);
		$host_tsc = *(uint64 *)($data_ptr + 24);
		$use_master = @get_use_master[tid];
		$mcn = @get_mcn[tid];
		if ($use_master && $host_tsc != 0 && $host_tsc < $mcn) {
			/* format string must be one literal: bpftrace does
			 * not concatenate adjacent string literals */
			printf("BUG: pid=%d cpu=%d->%d host_tsc=%lu mcn=%lu deficit=%lu clock=%lu\n",
			       pid, @get_cpu[tid], cpu, $host_tsc,
			       $mcn, $mcn - $host_tsc, $clock);
		}
	}
	delete(@get_data[tid]);
	delete(@get_use_master[tid]);
	delete(@get_mcn[tid]);
	delete(@get_cpu[tid]);
}
kprobe:pvclock_update_vm_gtod_copy
{
	@gtod_kvm[tid] = (uint64)arg0;
	@gtod_cpu[tid] = cpu;
}

kretprobe:pvclock_update_vm_gtod_copy
{
	$kvm = (struct kvm *)@gtod_kvm[tid];
	if ($kvm != 0) {
		printf("GTOD: pid=%d cpu=%d->%d mcn=%lu use_master=%d\n",
		       pid, @gtod_cpu[tid], cpu,
		       $kvm->arch.master_cycle_now,
		       $kvm->arch.use_master_clock);
	}
	delete(@gtod_kvm[tid]);
	delete(@gtod_cpu[tid]);
}
Thanks,
Thomas