On Sat, Jun 06, 2026, David Woodhouse wrote:
> On Sat, 2026-06-06 at 12:34 +0200, Thomas Gleixner wrote:
> > On Fri, May 29 2026 at 07:43, Sean Christopherson wrote:
> >
> > > Now that all paravirt code that explicitly specifies the TSC frequency
> > > also sets X86_FEATURE_TSC_KNOWN_FREQ, replace all of the one-off code
> > > and simply set X86_FEATURE_TSC_KNOWN_FREQ if the TSC frequency is known.
> > >
> > > Do NOT force set TSC_KNOWN_FREQ if the "known" TSC frequency was provided
> > > by the user. Per commit bd35c77e32e4 ("x86/tsc: Add tsc_early_khz command
> > > line parameter"), one of the goals of the param is to allow the refined
> > > calibration work "to do meaningful error checking".
> > >
> > > Note, preferring the user-provided TSC frequency over the frequency from
> > > the hypervisor or trusted firmware, while simultaneously not treating the
> > > user-provided frequency as gospel, is obviously incongruous. Sweep the
> > > problem under the rug for now to avoid opening a big can of worms that
> > > likely doesn't have a great answer.
> >
> > There is a good answer I think.
> >
> > early_tsc_khz exists to cater for the overclocking crowd. On their
> > modded systems the firmware supplied TSC frequency (CPUID/MSR) is not
> > matching reality anymore. So they work around that by supplying a close
> > enough tsc_early_khz and then they let the refined calibration work
> > figure it out.
> >
> > Arguably that's only relevant for bare metal systems and what's worse is
> > that in virtual environments the refined calibration work can fail,
> > which renders the TSC unstable.
> >
> > So I'd rather say we change this logic to:
> >
> >     if (!hypervisor_is_type(X86_HYPER_NATIVE)) {
> >             tsc_khz = x86_init.....();
> >             force(X86_FEATURE_TSC_KNOWN_FREQ);
> >     } else if (tsc_khz_early) {
> >             ....
> >     } else {
> >             ...
> >     }
> >
> > Along with:
> >
> >     if (!hypervisor_is_type(X86_HYPER_NATIVE)) {
> >             if (tsc_khz_early)
> >                     pr_warn("Ignoring non-sensical tsc_early_khz command line argument\n");
> >
> > or something daft like that.

Ya, I ended up in the same place once Sashiko pointed out that skipping the
SNP/TDX setup was hazardous[*], and also once I realized that tsc_early_khz
*complemented* the refinement instead of replacing it.

This is what I have locally:

	if (cc_platform_has(CC_ATTR_GUEST_SNP_SECURE_TSC))
		known_tsc_khz = snp_secure_tsc_init();
	else if (boot_cpu_has(X86_FEATURE_TDX_GUEST))
		known_tsc_khz = tdx_tsc_init();

	/*
	 * If the TSC frequency wasn't provided by trusted firmware, try to get
	 * it from the hypervisor (which is untrusted when running as a CoCo
	 * guest).
	 */
	if (!known_tsc_khz && x86_init.hyper.get_tsc_khz)
		known_tsc_khz = x86_init.hyper.get_tsc_khz();

	/*
	 * Mark the TSC frequency as known if it was obtained from a hypervisor
	 * or trusted firmware.  Don't mark the frequency as known if the user
	 * specified the frequency, as the user-provided frequency is intended
	 * as a "starting point", not a known, guaranteed frequency.
	 */
	if (known_tsc_khz && !tsc_early_khz)
		setup_force_cpu_cap(X86_FEATURE_TSC_KNOWN_FREQ);

	/*
	 * Ignore the user-provided TSC frequency if the exact frequency was
	 * obtained from trusted firmware or the hypervisor, as the user-
	 * provided frequency is intended as a "starting point", not a known,
	 * guaranteed frequency.
	 */
	if (!known_tsc_khz)
		known_tsc_khz = tsc_early_khz;
	else if (tsc_early_khz)
		pr_err("Ignoring 'tsc_early_khz' in favor of firmware/hypervisor.\n");

[*] https://lore.kernel.org/all/[email protected]
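
To make the resulting precedence easier to eyeball, here is a rough userspace
model of the snippet above. It is only a sketch: struct tsc_sources,
struct tsc_choice and pick_tsc_khz() are made-up names for illustration, a
value of 0 stands in for "this source did not provide a frequency", and the
known_freq flag merely models setup_force_cpu_cap(X86_FEATURE_TSC_KNOWN_FREQ).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical inputs; 0 means "source did not provide a frequency". */
struct tsc_sources {
	unsigned int firmware_khz;	/* models SNP Secure TSC / TDX */
	unsigned int hypervisor_khz;	/* models x86_init.hyper.get_tsc_khz() */
	unsigned int user_khz;		/* models tsc_early_khz */
};

/* Hypothetical result of the selection. */
struct tsc_choice {
	unsigned int khz;	/* frequency that ends up being used */
	bool known_freq;	/* models X86_FEATURE_TSC_KNOWN_FREQ */
	bool user_ignored;	/* models the pr_err() path */
};

static struct tsc_choice pick_tsc_khz(const struct tsc_sources *src)
{
	struct tsc_choice c = { 0 };
	unsigned int known;

	assert(src != NULL);

	/* Trusted firmware first, then the hypervisor. */
	known = src->firmware_khz;
	if (!known)
		known = src->hypervisor_khz;

	/* Same condition as the snippet: not "known" if the user gave a value. */
	c.known_freq = known && !src->user_khz;

	/* The user value is only a fallback; otherwise it is ignored. */
	if (!known) {
		c.khz = src->user_khz;
	} else {
		c.khz = known;
		c.user_ignored = src->user_khz != 0;
	}
	return c;
}
```

In this model, a guest that gets a frequency from the hypervisor and is also
booted with a user value ends up with the hypervisor frequency, user_ignored
set, and known_freq clear, which mirrors the two conditions in the snippet
as written.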
> > The kernel has for various reasons always tried to cater for the needs
> > of users who are plagued by bonkers firmware, but we have to stop to
> > prioritize or treating equal ancient and modded out of spec hardware.
> >
> > TBH, I consider that whole KVM clock nonsense to fall into the modded
> > out of spec hardware realm. Do a reality check:
> >
> > How many production systems are out there still which run VMs on CPUs
> > with a broken TSC and the lack of VM TSC scaling?
> >
> > I'm not saying that we should not support the few remaining systems
> > anymore, but our tendency to pretend that we can keep all of this
> > nonsense working and at the same time making progress is just a fallacy.

FWIW, I have the exact same sentiments about kvmclock, but I'm also trying my
best not to break folks who are happily running on what is effectively flawed,
ancient "hardware".

> I don't know that we can take the KVM (and Xen) clock away from guests,
> but all of the *horrid* part about it is the way it attempts to cope
> with the possibility that the *host* timekeeping might flip away from
> TSC-based mode at any point in time. By the end of my outstanding
> cleanup series, that is the *only* thing the gtod_notifier remains for.
>
> If we can trust the hardware *and* the host kernel, then KVM could
> theoretically hardwire the kvmclock into 'master clock mode' where it
> basically just advertises the TSC→kvmclock relationship *once* to all
> CPUs and it never changes.
>
> All the nonsense about updating it every time we enter a CPU could just
> go away completely.
But to Thomas' point, why bother? For actual old hardware, kvmclock is what it
is. For modern hardware, it's completely antiquated.
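
For anyone following along who hasn't stared at the pvclock ABI: the
"TSC→kvmclock relationship" David mentions boils down to a fixed linear
mapping. Below is a minimal userspace sketch of that conversion, assuming
parameters that mirror the tsc_timestamp, system_time, tsc_to_system_mul and
tsc_shift fields of struct pvclock_vcpu_time_info. struct clock_params and
tsc_to_ns() are hypothetical names, and the version/seqcount handshake that a
real guest needs is deliberately left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical container for the pvclock-style scaling parameters. */
struct clock_params {
	uint64_t tsc_timestamp;		/* TSC value at the reference point */
	uint64_t system_time;		/* nanoseconds at the reference point */
	uint32_t tsc_to_system_mul;	/* 32.32 fixed-point multiplier */
	int8_t tsc_shift;		/* pre-shift applied to the TSC delta */
};

/* Convert a raw TSC reading to nanoseconds using the fixed mapping. */
static uint64_t tsc_to_ns(const struct clock_params *p, uint64_t tsc)
{
	uint64_t delta, hi, lo;

	assert(p != NULL);

	delta = tsc - p->tsc_timestamp;
	if (p->tsc_shift >= 0)
		delta <<= p->tsc_shift;
	else
		delta >>= -p->tsc_shift;

	/*
	 * (delta * mul) >> 32 without a 128-bit type: split delta into
	 * 32-bit halves so that neither partial product can overflow.
	 */
	hi = delta >> 32;
	lo = delta & 0xffffffffu;

	return p->system_time +
	       hi * p->tsc_to_system_mul +
	       ((lo * p->tsc_to_system_mul) >> 32);
}
```

If the parameters never change, every vCPU can use the same struct forever,
which is the "advertise it once" scheme. With a multiplier of 0x80000000 and
a shift of 0, for example, each TSC tick counts as half a nanosecond, i.e. a
2 GHz TSC.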