On Mon, Aug 5, 2024 at 10:33 AM 'Vincent Donnefort' via kernel-team <[email protected]> wrote:
>
> On arm64 systems, the arch timer can be accessible by both EL1 and EL2.
> This means when running with nVHE or protected KVM, it is easy to
> generate clock values from the hypervisor, synchronized with the kernel.
>
> For tracing purpose, the boot clock is interesting as it doesn't stop on
> suspend. Export it as part of the time snapshot. This will later allow
> the hypervisor to add boot clock timestamps to its events.
>
> Signed-off-by: Vincent Donnefort <[email protected]>
>
> diff --git a/include/linux/timekeeping.h b/include/linux/timekeeping.h
> index fc12a9ba2c88..0fc6a61d64bd 100644
> --- a/include/linux/timekeeping.h
> +++ b/include/linux/timekeeping.h
> @@ -275,18 +275,24 @@ struct ktime_timestamps {
>   * counter value
>   * @cycles: Clocksource counter value to produce the system times
>   * @real: Realtime system time
> + * @boot: Boot time

So, adding the boottime to this kernel-internal snapshot seems
reasonable to me.

>   * @raw: Monotonic raw system time
>   * @cs_id: Clocksource ID
>   * @clock_was_set_seq: The sequence number of clock-was-set events
>   * @cs_was_changed_seq: The sequence number of clocksource change events
> + * @mono_shift: The monotonic clock slope shift
> + * @mono_mult: The monotonic clock slope mult

This bit, including the mult/shift pair however, isn't well explained
and is a little more worrying.

> @@ -1074,14 +1076,21 @@ void ktime_get_snapshot(struct system_time_snapshot *systime_snapshot)
>                  systime_snapshot->clock_was_set_seq = tk->clock_was_set_seq;
>                  base_real = ktime_add(tk->tkr_mono.base,
>                                        tk_core.timekeeper.offs_real);
> +                base_boot = ktime_add(tk->tkr_mono.base,
> +                                      tk_core.timekeeper.offs_boot);
>                  base_raw = tk->tkr_raw.base;
>                  nsec_real = timekeeping_cycles_to_ns(&tk->tkr_mono, now);
>                  nsec_raw = timekeeping_cycles_to_ns(&tk->tkr_raw, now);
> +                mono_mult = tk->tkr_mono.mult;
> +                mono_shift = tk->tkr_mono.shift;
>          } while (read_seqcount_retry(&tk_core.seq, seq));
>
>          systime_snapshot->cycles = now;
>          systime_snapshot->real = ktime_add_ns(base_real, nsec_real);
> +        systime_snapshot->boot = ktime_add_ns(base_boot, nsec_real);
>          systime_snapshot->raw = ktime_add_ns(base_raw, nsec_raw);
> +        systime_snapshot->mono_shift = mono_shift;
> +        systime_snapshot->mono_mult = mono_mult;
>  }
>  EXPORT_SYMBOL_GPL(ktime_get_snapshot);

So this looks like you're trying to stuff kernel timekeeping internal
values into the snapshot so you can skirt around the timekeeping
subsystem and generate your own timestamps.

This ends up duplicating logic, but in an incomplete way. For instance,
you don't have things like ntp state, etc, so the timestamps you
generate will not exactly match the kernel, and may have
discontinuities. :(

Now for many cases "close enough" is fine.
But the difficulty is that the expectation bar always rises, and
eventually "close enough" isn't, and we have a broken interface that
has to be fixed.

That said, I do get that the need for something like this is
legitimate. There have been a number of cases where external hardware
(PTP timestamps from NICs) or contexts (virt) are able to record
hardware clocksource timestamps on their own, and want to be able to
map them back to the kernel's (or maybe "a kernel's" if there are
multiple VMs) sense of time. Sometimes they even want to do this quite
a bit later after the timestamp was recorded. The ktime_get_snapshot()
logic was added in the first place for this reason.

Some more aggressive approaches try to dump a bunch of the internal
kernel timekeeping state out to userland and call it an API. See
https://lore.kernel.org/lkml/[email protected]/
for a recent (and thorough) effort there.

I'm very much not a fan of this approach, as it mimics older efforts
for userspace time calculations that were done before we settled on
VDSOs. Those older efforts were very fragile and required years of
keeping backwards compatibility logic to map the current kernel state
back to separate structures, plus expensive conversions to the
different units that userland expected. The benefit of the VDSO
interface is that while the data is exposed to userland, the structure
is not, and the logic is still kernel controlled, so changes to
internal state can be done without breaking userland.

Something I have been thinking about is that maybe it would be
beneficial to rework the timekeeping core so that given a clocksource
timestamp, it could calculate the time for that timestamp. While
existing APIs would still do a new read of the clocksource, so the
timestamps would always increase, an old timestamp could be used to
retro-calculate a past time. The thing that prevents this now is that
the timekeeping core doesn't keep any history, so we can't correctly
back-calculate times before the last state change.
But potentially we could keep a buffer of timekeeper states associated
with clocksource intervals, and so we could find the right state to use
for a given clocksource timestamp. Now, this would still only work to a
point, as we don't want to keep tons of historical state.

But then with this, maybe we could switch to something more VDSO-like
where the PTP drivers or host systems could request a time given a
timestamp (and probably some clocksource id so we can sanity check
everyone is using the same clock), and we could still provide what they
want without having to expose all of our state.

Unfortunately though, this is all hand waving and pontificating on my
part, as it would be a large rework. But it seems that something along
those lines, where we share opaque kernel state along with the logic,
with proper syscall-like APIs to do the calculations, would be a much
better approach than just exporting more kernel state as an API.

For a more short-term approach, since you can't be exact outside of the
timekeeping logic, why not interpolate from the data
ktime_get_snapshot() already provides to calculate your own sense of
the frequency?

thanks
-john
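
To make the short-term interpolation suggestion a bit more concrete, here
is a rough sketch in plain userspace C. It is not kernel code and not
part of the patch: struct snap_point and interp_ns() are hypothetical
names, and each point is assumed to hold a (cycles, time) pair copied
from one ktime_get_snapshot() call. The slope between two such points
stands in for the mult/shift pair, so the result is only "close enough"
and will not track NTP adjustments made between the two snapshots.

```c
#include <stdint.h>

/*
 * Hypothetical pair kept by the consumer from one snapshot: the counter
 * value and one of the snapshot's times converted to nanoseconds.
 * This is an illustration only, not a kernel structure.
 */
struct snap_point {
	uint64_t cycles;	/* counter value at snapshot time */
	int64_t  ns;		/* matching time in nanoseconds */
};

/*
 * Linearly interpolate (or extrapolate) the time for counter value
 * 'cycles' from two earlier points, with 'a' taken before 'b'.
 * Returns 0 on success, -1 if the two points cannot define a slope.
 * Counter wrap-around is not handled in this sketch.
 */
static int interp_ns(const struct snap_point *a, const struct snap_point *b,
		     uint64_t cycles, int64_t *out_ns)
{
	if (b->cycles <= a->cycles || b->ns < a->ns)
		return -1;

	uint64_t dcyc = b->cycles - a->cycles;
	uint64_t dns = (uint64_t)(b->ns - a->ns);

	/* Signed cycle offset from 'a'; 128-bit math (GCC/Clang extension)
	 * avoids overflow in the multiplication. */
	__int128 off = (__int128)cycles - (__int128)a->cycles;

	*out_ns = a->ns + (int64_t)(off * (__int128)dns / (__int128)dcyc);
	return 0;
}
```

Refreshing the two points periodically keeps the slope estimate from
drifting too far from what the timekeeping core is actually using.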
