On 04/25 12:25:03, Brian Brooks wrote:
> On 04/24 08:07:58, Savolainen, Petri (Nokia - FI/Espoo) wrote:
> > > > diff --git a/platform/linux-generic/arch/x86/odp_cpu_arch.c
> > > b/platform/linux-generic/arch/x86/odp_cpu_arch.c
> > > > index c8cf27b6..9ba601a3 100644
> > > > --- a/platform/linux-generic/arch/x86/odp_cpu_arch.c
> > > > +++ b/platform/linux-generic/arch/x86/odp_cpu_arch.c
> > > > @@ -3,7 +3,14 @@
> > > >   *
> > > >   * SPDX-License-Identifier:     BSD-3-Clause
> > > >   */
> > > > +
> > > > +#include <odp_posix_extensions.h>
> > > > +
> > > >  #include <odp/api/cpu.h>
> > > > +#include <odp_time_internal.h>
> > > > +#include <odp_debug_internal.h>
> > > > +
> > > > +#include <time.h>
> > > >
> > > >  uint64_t odp_cpu_cycles(void)
> > > >  {
> > > > @@ -31,3 +38,55 @@ uint64_t odp_cpu_cycles_resolution(void)
> > > >  {
> > > >         return 1;
> > > >  }
> > > > +
> > > > +uint64_t cpu_global_time(void)
> > > > +{
> > > > +       return odp_cpu_cycles();
> > > 
> > > A cycle counter cannot always be used to measure time. Even on x86,
> > > odp_cpu_cycles() will return the value of RDTSC which is not actually
> > > representative of the cycle count. Even if the x86 processor is set
> > > to a fixed frequency, the Invariant TSC may run at a different fixed
> > > frequency. Please take a look at the odp_tick_t proposal here:
> > > 
> > > https://docs.google.com/document/d/1sY7rOxqCNu-bMqjBiT5_keAIohrX1ZW-
> > > eL0oGLAQ4OM/edit?usp=sharing
> > >
> > 
> > From coverletter:
> > "This patch set modifies time implementation to use TSC when running on a 
> > x86 
> > CPU that has invarint TSC CPU flag set. Otherwise, the same Linux system 
> > time 
> > is used as before. TSC is much more efficient both in performance and 
> > latency/jitter wise than Linux system call. This can be seen also with 
> > scheduler latency test which time stamps events with this API. All latency 
> > measurements (min, ave, max) improved significantly."
> > 
> > This function (cpu_global_time()) is called only when we have first checked 
> > that TSC is invariant. Also we measure the TSC frequency in that case. This 
> > function is defined in the same file as cpu_cycles(), and the file is x86 
> > specific. So, we know what we are doing, and just re-using the code to read 
> > TSC.

What sort of timing accuracy is expected from the app?

>From benchmarking the maximum single-threaded rate of these reads:

 x86_64:

   read       7 ns/op
   read_sync  22 ns/op

 A57:

   read       4 ns/op
   read_sync  26 ns/op

read_sync issues a synchronizing instruction for greater timing accuracy
but clearly takes more time to return the time value read from the core.

> > > > +}
> > > > +
> > > > +#define SEC_IN_NS 1000000000ULL
> > > > +
> > > > +/* Measure TSC frequency. Frequency information registers are defined
> > > for x86,
> > > > + * but those are often not enumerated. */
> > > > +uint64_t cpu_global_time_freq(void)
> > > > +{
> > > 
> > > The 0x40000010 leaf is standardized in Hyper-V as returning the TSC's
> > > frequency in kHz. It would be better to use this when running under a
> > > hypervisor. It appears to work under VMware as well.
> > 
> > 
> > Sure, that can be done later, if we want to add any code into odp-linux 
> > that would behave differently under VM vs. host. Today, we don't have any.
> > 
> > Also there's leaf 0x15, which should give you TSC frequency but apparently 
> > is not enumerated often. So, I ended up to do the same as DPDK - just 
> > measure it.
> > 
> > 
> > > > diff --git a/platform/linux-generic/include/odp/api/plat/time_types.h
> > > b/platform/linux-generic/include/odp/api/plat/time_types.h
> > > > index 4847f3b1..8ae7ce82 100644
> > > > --- a/platform/linux-generic/include/odp/api/plat/time_types.h
> > > > +++ b/platform/linux-generic/include/odp/api/plat/time_types.h
> > > > @@ -26,11 +26,28 @@ extern "C" {
> > > >   * the linux timespec structure, which is dependent on POSIX extension
> > > level.
> > > >   */
> > > >  typedef struct odp_time_t {
> > > > -       int64_t tv_sec;      /**< @internal Seconds */
> > > > -       int64_t tv_nsec;     /**< @internal Nanoseconds */
> > > > +       union {
> > > > +               /** @internal Posix timespec */
> > > > +               struct {
> > > > +                       /** @internal Seconds */
> > > > +                       int64_t tv_sec;
> > > > +
> > > > +                       /** @internal Nanoseconds */
> > > > +                       int64_t tv_nsec;
> > > > +               } spec;
> > > > +
> > > > +               /** @internal HW time counter */
> > > > +               struct {
> > > > +                       /** @internal Counter value */
> > > > +                       uint64_t count;
> > > > +
> > > > +                       /** @internal Reserved */
> > > > +                       uint64_t res;
> > > > +               } hw;
> > > > +       };
> > > 
> > > A processor's tick counter cannot be used to measure calendar time by
> > > itself. The delta between two ticks, however, can be converted to
> > > calendar time.
> > > 
> > > Please see the proposal that introduces odp_tick_t which eliminates
> > > the wasted u64 reserved field in the union above.
> > 
> > 
> > Nothing is wasted here today.
> 
> The 64-bit CPU time is stored in a 128-bit data structure where 64-bits
> are wasted. You can use just a 64-bit type and then convert that to
> a timespec (using a starting timestamp taken on global init) if needed.
> 
> > We need to support Posix timespec anyway. Timespec is used when there's no 
> > x86 invariant TSC available - so all non-x86 and even on some x86 systems.
> > 
> > Also "count" here starts from zero (we save cpu counter value at start). 
> > So, it takes 580 years to wrap at 1GHz. Why 64 bit counter value (that 
> > starts from zero) is not good for calendar time ?
> > 
> > 
> > 
> > 
> > > > +/*
> > > > + * Posix timespec based functions
> > > > + */
> > > >
> > > > -static inline odp_time_t time_diff(odp_time_t t2, odp_time_t t1)
> > > > +static inline odp_time_t time_spec_diff(odp_time_t t2, odp_time_t t1)
> > > 
> > > Static functions in .c files will always be considered for inlining, so
> > > the
> > > inline qualifier is unnecessary here. You can use always_inline compiler
> > > attribute if the disassembly and run time performance is not as expected.
> > > 
> > > Comment applies to nearly every function in this file.
> > 
> > 
> > This is because all our code (in this and other files) does "static inline" 
> > for functions that we want to be inlined. Static is used for functions that 
> > are static, but we don't care if those are inlined (slow path). Most time 
> > API functions are fast path.
> 
> I am saying that adding the 'inline' qualifier here doesn't actually do 
> anything
> because the compiler will consider static functions for inlining anyways.
> 
> > 
> > > >
> > > > -static inline uint64_t time_local_res(void)
> > > > +static inline uint64_t time_hw_res(void)
> > > 
> > > I think 'freq' is a more descriptive name than 'res'.
> > > 
> > > It would also be good to document at either the function declaration
> > > or the function definition (if multiple implementations differ) whether
> > > the frequency is in Hz or kHz. For example, on ARMv8 it is just a
> > > register read to get Hz. No estimation, and no lowering to kHz.
> > 
> > 
> > This implements resolution API functions ...
> > 
> > /**
> >  * Global time resolution in hertz
> >  *
> >  * @return      Global time resolution in hertz
> >  */
> > uint64_t odp_time_global_res(void);
> > 
> > 
> > ... so, from that context you know that it's Hz. In this implementation, I 
> > lower the measured Hz to be conservative.
> > 
> > 
> > > > +
> > > > +static inline odp_time_t time_hw_sum(odp_time_t t1, odp_time_t t2)
> > > > +{
> > > > +       odp_time_t time;
> > > > +
> > > > +       time.hw.res = 0;
> > > > +       time.hw.count = t1.hw.count + t2.hw.count;
> > > > +
> > > > +       return time;
> > > > +}
> > > 
> > > If a single u64 were used this code wouldn't need to exist.
> > 
> > Which code? hw.res = 0 ? It not actually used for anything and could be 
> > removed. Although, the performance gain would not be huge. Anyway, I'll 
> > remove it for v2.
> 
> If the CPU time was stored in a 64-bit type, you can use arithmetic operators
> on the values directly instead of doing arithmetic on a 128-bit compound
> datatype where 64-bits are unused. This is the union above.
> 
> > -Petri
> > 

Reply via email to