On Mon, Aug 5, 2024 at 10:33 AM 'Vincent Donnefort' via kernel-team
<[email protected]> wrote:
>
> On arm64 systems, the arch timer can be accessible by both EL1 and EL2.
> This means when running with nVHE or protected KVM, it is easy to
> generate clock values from the hypervisor, synchronized with the kernel.
>
> For tracing purpose, the boot clock is interesting as it doesn't stop on
> suspend. Export it as part of the time snapshot. This will later allow
> the hypervisor to add boot clock timestamps to its events.
>
> Signed-off-by: Vincent Donnefort <[email protected]>
>
> diff --git a/include/linux/timekeeping.h b/include/linux/timekeeping.h
> index fc12a9ba2c88..0fc6a61d64bd 100644
> --- a/include/linux/timekeeping.h
> +++ b/include/linux/timekeeping.h
> @@ -275,18 +275,24 @@ struct ktime_timestamps {
>   *                              counter value
>   * @cycles:    Clocksource counter value to produce the system times
>   * @real:      Realtime system time
> + * @boot:      Boot time

So, adding the boottime to this kernel-internal snapshot seems reasonable to me.

>   * @raw:       Monotonic raw system time
>   * @cs_id:     Clocksource ID
>   * @clock_was_set_seq: The sequence number of clock-was-set events
>   * @cs_was_changed_seq:        The sequence number of clocksource change events
> + * @mono_shift:        The monotonic clock slope shift
> + * @mono_mult: The monotonic clock slope mult


This bit, adding the mult/shift pair, however, isn't well explained
and is a little more worrying.


> @@ -1074,14 +1076,21 @@ void ktime_get_snapshot(struct system_time_snapshot *systime_snapshot)
>                 systime_snapshot->clock_was_set_seq = tk->clock_was_set_seq;
>                 base_real = ktime_add(tk->tkr_mono.base,
>                                       tk_core.timekeeper.offs_real);
> +               base_boot = ktime_add(tk->tkr_mono.base,
> +                                     tk_core.timekeeper.offs_boot);
>                 base_raw = tk->tkr_raw.base;
>                 nsec_real = timekeeping_cycles_to_ns(&tk->tkr_mono, now);
>                 nsec_raw  = timekeeping_cycles_to_ns(&tk->tkr_raw, now);
> +               mono_mult = tk->tkr_mono.mult;
> +               mono_shift = tk->tkr_mono.shift;
>         } while (read_seqcount_retry(&tk_core.seq, seq));
>
>         systime_snapshot->cycles = now;
>         systime_snapshot->real = ktime_add_ns(base_real, nsec_real);
> +       systime_snapshot->boot = ktime_add_ns(base_boot, nsec_real);
>         systime_snapshot->raw = ktime_add_ns(base_raw, nsec_raw);
> +       systime_snapshot->mono_shift = mono_shift;
> +       systime_snapshot->mono_mult = mono_mult;
>  }
>  EXPORT_SYMBOL_GPL(ktime_get_snapshot);

So this looks like you're trying to stuff kernel timekeeping internal
values into the snapshot so you can skirt around the timekeeping
subsystem and generate your own timestamps.

This ends up duplicating logic, but in an incomplete way.  For
instance, you don't have things like ntp state, etc, so the timestamps
you generate will not exactly match the kernel, and may have
discontinuities. :(
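
To make that concern concrete, here is a rough userspace sketch (not
kernel code) of what happens when an outside consumer keeps converting
cycles with a mult it captured once, while the kernel's mult has since
been adjusted. The function names and numbers are made up for
illustration, and the real conversion also carries remainder and NTP
state that this leaves out.

```c
#include <stdint.h>

/*
 * Same general shape as a mult/shift cycles-to-ns conversion:
 * ns = (delta_cycles * mult) >> shift.
 * Hypothetical helper, simplified: no remainder or NTP error tracking.
 */
static inline uint64_t cyc_to_ns(uint64_t delta, uint32_t mult, uint32_t shift)
{
	return (delta * mult) >> shift;
}

/*
 * Difference (in ns) between converting with the current mult and
 * converting with a stale mult that was captured in an old snapshot.
 */
static inline int64_t stale_mult_error_ns(uint64_t delta, uint32_t stale_mult,
					  uint32_t cur_mult, uint32_t shift)
{
	return (int64_t)cyc_to_ns(delta, cur_mult, shift) -
	       (int64_t)cyc_to_ns(delta, stale_mult, shift);
}
```

With a hypothetical 1 GHz counter, shift = 24, and a mult that has
moved by 168 (roughly 10 ppm), one second's worth of cycles already
comes out about 10 us apart, and the gap shows up as a jump whenever
the consumer picks up a fresh snapshot.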

Now for many cases "close enough" is fine. But the difficulty is that
the bar for expectations always rises, and eventually "close enough"
isn't, and we have a broken interface that has to be fixed.

That said, I do get that the need for something like this is
legitimate. There have been a number of cases where external hardware
(PTP timestamps from NICs) or contexts (virt) are able to record
hardware clocksource timestamps on their own, and want to be able to
map that back to the kernel's (or maybe "a kernel's" if there are
multiple VMs) sense of time.  Sometimes even wanting to do this quite
a bit later after the timestamp was recorded. The ktime_get_snapshot()
logic was added in the first place for this reason.

Some more aggressive approaches try to dump a bunch of the internal
kernel timekeeping state out to userland and call it an api.
See 
https://lore.kernel.org/lkml/[email protected]/
for a recent (and thorough) effort there.

I'm very much not a fan of this approach, as it mimics older efforts
for userspace time calculations that were done before we settled on
VDSOs, which were very fragile and required years of keeping backwards
compatibility logic to map the current kernel state back to separate
structures and expensive conversions to different units that userland
expected.

The benefit of the VDSO interface is that while the data is exposed to
userland, the structure is not, and the logic is still kernel
controlled, so changes to internal state can be done without breaking
userland.

Something I have been thinking about is maybe it would be beneficial
to rework the timekeeping core so that given a clocksource timestamp,
it could calculate the time for that timestamp. While existing apis
would still do a new read of the clocksource, so the timestamps would
always increase, an old timestamp could be used to retro-calculate a
past time.  The thing that prevents this now is that the timekeeping
core doesn't keep any history, so we can't correctly back-calculate
times before the last state change. But potentially we could keep a
buffer of timekeeper states associated with clocksource intervals, and
so we could find the right state to use for a given clocksource
timestamp. Now, this would still only work to a point, as we don't
want to keep tons of historical state.  But then with this, maybe we
could switch to something more VDSO-like where the PTP drivers or host
systems could request a time given a timestamp (and probably some
clocksource id so we can sanity check everyone is using the same
clock), and we could still provide what they want without having to
expose all of our state.
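
Purely as hand waving in code form, the history idea might look
something like the userspace sketch below. Every name here (tk_hist,
tk_hist_entry, TK_HIST_LEN) is hypothetical, nothing like this exists
in the timekeeping core today, and a real version would need locking,
a clocksource id check, and the full timekeeper state rather than a
bare mult/shift.

```c
#include <stddef.h>
#include <stdint.h>

#define TK_HIST_LEN 4	/* hypothetical: keep only a few past states */

/* One past conversion state, valid from cycle_start onwards. */
struct tk_hist_entry {
	uint64_t cycle_start;	/* first counter value this state covers */
	uint64_t base_ns;	/* time at cycle_start */
	uint32_t mult;
	uint32_t shift;
};

/* Small ring buffer of states, oldest entries get overwritten. */
struct tk_hist {
	struct tk_hist_entry e[TK_HIST_LEN];
	size_t head;		/* next slot to write */
	size_t count;		/* number of valid entries */
};

static inline void tk_hist_push(struct tk_hist *h, struct tk_hist_entry ent)
{
	h->e[h->head] = ent;
	h->head = (h->head + 1) % TK_HIST_LEN;
	if (h->count < TK_HIST_LEN)
		h->count++;
}

/*
 * Find the newest state whose interval starts at or before @cycles and
 * convert with it. Returns 0 on success, -1 if @cycles predates the
 * oldest state we still remember.
 */
static inline int tk_hist_cycles_to_ns(const struct tk_hist *h,
				       uint64_t cycles, uint64_t *ns)
{
	size_t i;

	for (i = 0; i < h->count; i++) {
		/* Walk from the newest entry back to the oldest. */
		size_t idx = (h->head + TK_HIST_LEN - 1 - i) % TK_HIST_LEN;
		const struct tk_hist_entry *ent = &h->e[idx];

		if (cycles >= ent->cycle_start) {
			uint64_t delta = cycles - ent->cycle_start;

			*ns = ent->base_ns + ((delta * ent->mult) >> ent->shift);
			return 0;
		}
	}
	return -1;
}
```

The point of the shape is that callers hand in a timestamp and get a
time back, and a timestamp that is too old simply fails, so none of
the state has to be part of the interface.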

Unfortunately though, this is all hand waving and pontificating on my
part, as it would be a large rework. But it seems that something
closer to that, where we share opaque kernel state along with the
logic, behind proper syscall-like APIs that do the calculations, would
be a much better approach than just exporting more kernel state as an
API.

For a shorter-term approach, since you can't be exact outside of
the timekeeping logic, why not interpolate from the data
ktime_get_snapshot() already provides to calculate your own sense of
the frequency?
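
As a minimal sketch of what I mean, assuming you take two snapshots
and keep a (cycles, time) pair from each: the struct and function
names below are hypothetical, the ns field stands for whichever clock
value you pull out of the snapshot, and unsigned __int128 is a
GCC/Clang extension used here only to keep the multiply from
overflowing (kernel code would more likely use something like
mul_u64_u64_div_u64()).

```c
#include <stdint.h>

/* One (counter value, time in ns) pair taken from a snapshot. */
struct snap_point {
	uint64_t cycles;
	uint64_t ns;
};

/*
 * Estimate the time for @cycles from two snapshots @a (older) and @b
 * (newer), using the slope between them as the observed frequency.
 * Returns 0 on success, -1 if the points are unusable or @cycles is
 * older than @a.
 */
static inline int snap_interpolate_ns(const struct snap_point *a,
				      const struct snap_point *b,
				      uint64_t cycles, uint64_t *ns)
{
	uint64_t dc, dns;

	if (b->cycles <= a->cycles || b->ns < a->ns || cycles < a->cycles)
		return -1;

	dc = b->cycles - a->cycles;	/* cycles between snapshots */
	dns = b->ns - a->ns;		/* ns between snapshots */

	/* a->ns + elapsed_cycles * (dns / dc), done in 128 bits. */
	*ns = a->ns + (uint64_t)(((unsigned __int128)(cycles - a->cycles) *
				  dns) / dc);
	return 0;
}
```

This is still only "close enough", but it only depends on values the
snapshot hands out as times, not on how the kernel got to them.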

thanks
-john
