On 04/25/17 21:51, Brian Brooks wrote:
> On 04/25 12:25:03, Brian Brooks wrote:
>> On 04/24 08:07:58, Savolainen, Petri (Nokia - FI/Espoo) wrote:
>>>>> diff --git a/platform/linux-generic/arch/x86/odp_cpu_arch.c
>>>> b/platform/linux-generic/arch/x86/odp_cpu_arch.c
>>>>> index c8cf27b6..9ba601a3 100644
>>>>> --- a/platform/linux-generic/arch/x86/odp_cpu_arch.c
>>>>> +++ b/platform/linux-generic/arch/x86/odp_cpu_arch.c
>>>>> @@ -3,7 +3,14 @@
>>>>> *
>>>>> * SPDX-License-Identifier: BSD-3-Clause
>>>>> */
>>>>> +
>>>>> +#include <odp_posix_extensions.h>
>>>>> +
>>>>> #include <odp/api/cpu.h>
>>>>> +#include <odp_time_internal.h>
>>>>> +#include <odp_debug_internal.h>
>>>>> +
>>>>> +#include <time.h>
>>>>>
>>>>> uint64_t odp_cpu_cycles(void)
>>>>> {
>>>>> @@ -31,3 +38,55 @@ uint64_t odp_cpu_cycles_resolution(void)
>>>>> {
>>>>> return 1;
>>>>> }
>>>>> +
>>>>> +uint64_t cpu_global_time(void)
>>>>> +{
>>>>> + return odp_cpu_cycles();
>>>>
>>>> A cycle counter cannot always be used to measure time. Even on x86,
>>>> odp_cpu_cycles() will return the value of RDTSC which is not actually
>>>> representative of the cycle count. Even if the x86 processor is set
>>>> to a fixed frequency, the Invariant TSC may run at a different fixed
>>>> frequency. Please take a look at the odp_tick_t proposal here:
>>>>
>>>> https://docs.google.com/document/d/1sY7rOxqCNu-bMqjBiT5_keAIohrX1ZW-
>>>> eL0oGLAQ4OM/edit?usp=sharing
>>>>
>>>
>>> From coverletter:
>>> "This patch set modifies time implementation to use TSC when running on a
>>> x86
>>> CPU that has invarint TSC CPU flag set. Otherwise, the same Linux system
>>> time
>>> is used as before. TSC is much more efficient both in performance and
>>> latency/jitter wise than Linux system call. This can be seen also with
>>> scheduler latency test which time stamps events with this API. All latency
>>> measurements (min, ave, max) improved significantly."
>>>
>>> This function (cpu_global_time()) is called only when we have first checked
>>> that TSC is invariant. Also we measure the TSC frequency in that case. This
>>> function is defined in the same file as cpu_cycles(), and the file is x86
>>> specific. So, we know what we are doing, and just re-using the code to read
>>> TSC.
>
> What sort of timing accuracy is expected from the app?
>
> From benchmarking the maximum single-threaded rate of these reads:
>
> x86_64:
>
> read 7 ns/op
> read_sync 22 ns/op
>
> A57:
>
> read 4 ns/op
> read_sync 26 ns/op
>
> read_sync issues a synchronizing instruction for greater timing accuracy
> but clearly takes more time to return the time value read from the core.
>
it has to be depend on cpu frequency.
Maxim.
>>>>> +}
>>>>> +
>>>>> +#define SEC_IN_NS 1000000000ULL
>>>>> +
>>>>> +/* Measure TSC frequency. Frequency information registers are defined
>>>> for x86,
>>>>> + * but those are often not enumerated. */
>>>>> +uint64_t cpu_global_time_freq(void)
>>>>> +{
>>>>
>>>> The 0x40000010 leaf is standardized in Hyper-V as returning the TSC's
>>>> frequency in kHz. It would be better to use this when running under a
>>>> hypervisor. It appears to work under VMware as well.
>>>
>>>
>>> Sure, that can be done later, if we want to add any code into odp-linux
>>> that would behave differently under VM vs. host. Today, we don't have any.
>>>
>>> Also there's leaf 0x15, which should give you TSC frequency but apparently
>>> is not enumerated often. So, I ended up to do the same as DPDK - just
>>> measure it.
>>>
>>>
>>>>> diff --git a/platform/linux-generic/include/odp/api/plat/time_types.h
>>>> b/platform/linux-generic/include/odp/api/plat/time_types.h
>>>>> index 4847f3b1..8ae7ce82 100644
>>>>> --- a/platform/linux-generic/include/odp/api/plat/time_types.h
>>>>> +++ b/platform/linux-generic/include/odp/api/plat/time_types.h
>>>>> @@ -26,11 +26,28 @@ extern "C" {
>>>>> * the linux timespec structure, which is dependent on POSIX extension
>>>> level.
>>>>> */
>>>>> typedef struct odp_time_t {
>>>>> - int64_t tv_sec; /**< @internal Seconds */
>>>>> - int64_t tv_nsec; /**< @internal Nanoseconds */
>>>>> + union {
>>>>> + /** @internal Posix timespec */
>>>>> + struct {
>>>>> + /** @internal Seconds */
>>>>> + int64_t tv_sec;
>>>>> +
>>>>> + /** @internal Nanoseconds */
>>>>> + int64_t tv_nsec;
>>>>> + } spec;
>>>>> +
>>>>> + /** @internal HW time counter */
>>>>> + struct {
>>>>> + /** @internal Counter value */
>>>>> + uint64_t count;
>>>>> +
>>>>> + /** @internal Reserved */
>>>>> + uint64_t res;
>>>>> + } hw;
>>>>> + };
>>>>
>>>> A processor's tick counter cannot be used to measure calendar time by
>>>> itself. The delta between two ticks, however, can be converted to
>>>> calendar time.
>>>>
>>>> Please see the proposal that introduces odp_tick_t which eliminates
>>>> the wasted u64 reserved field in the union above.
>>>
>>>
>>> Nothing is wasted here today.
>>
>> The 64-bit CPU time is stored in a 128-bit data structure where 64-bits
>> are wasted. You can use just a 64-bit type and then convert that to
>> a timespec (using a starting timestamp taken on global init) if needed.
>>
>>> We need to support Posix timespec anyway. Timespec is used when there's no
>>> x86 invariant TSC available - so all non-x86 and even on some x86 systems.
>>>
>>> Also "count" here starts from zero (we save cpu counter value at start).
>>> So, it takes 580 years to wrap at 1GHz. Why 64 bit counter value (that
>>> starts from zero) is not good for calendar time ?
>>>
>>>
>>>
>>>
>>>>> +/*
>>>>> + * Posix timespec based functions
>>>>> + */
>>>>>
>>>>> -static inline odp_time_t time_diff(odp_time_t t2, odp_time_t t1)
>>>>> +static inline odp_time_t time_spec_diff(odp_time_t t2, odp_time_t t1)
>>>>
>>>> Static functions in .c files will always be considered for inlining, so
>>>> the
>>>> inline qualifier is unnecessary here. You can use always_inline compiler
>>>> attribute if the disassembly and run time performance is not as expected.
>>>>
>>>> Comment applies to nearly every function in this file.
>>>
>>>
>>> This is because all our code (in this and other files) does "static inline"
>>> for functions that we want to be inlined. Static is used for functions that
>>> are static, but we don't care if those are inlined (slow path). Most time
>>> API functions are fast path.
>>
>> I am saying that adding the 'inline' qualifier here doesn't actually do
>> anything
>> because the compiler will consider static functions for inlining anyways.
>>
>>>
>>>>>
>>>>> -static inline uint64_t time_local_res(void)
>>>>> +static inline uint64_t time_hw_res(void)
>>>>
>>>> I think 'freq' is a more descriptive name than 'res'.
>>>>
>>>> It would also be good to document at either the function declaration
>>>> or the function definition (if multiple implementations differ) whether
>>>> the frequency is in Hz or kHz. For example, on ARMv8 it is just a
>>>> register read to get Hz. No estimation, and no lowering to kHz.
>>>
>>>
>>> This implements resolution API functions ...
>>>
>>> /**
>>> * Global time resolution in hertz
>>> *
>>> * @return Global time resolution in hertz
>>> */
>>> uint64_t odp_time_global_res(void);
>>>
>>>
>>> ... so, from that context you know that it's Hz. In this implementation, I
>>> lower the measured Hz to be conservative.
>>>
>>>
>>>>> +
>>>>> +static inline odp_time_t time_hw_sum(odp_time_t t1, odp_time_t t2)
>>>>> +{
>>>>> + odp_time_t time;
>>>>> +
>>>>> + time.hw.res = 0;
>>>>> + time.hw.count = t1.hw.count + t2.hw.count;
>>>>> +
>>>>> + return time;
>>>>> +}
>>>>
>>>> If a single u64 were used this code wouldn't need to exist.
>>>
>>> Which code? hw.res = 0 ? It not actually used for anything and could be
>>> removed. Although, the performance gain would not be huge. Anyway, I'll
>>> remove it for v2.
>>
>> If the CPU time was stored in a 64-bit type, you can use arithmetic operators
>> on the values directly instead of doing arithmetic on a 128-bit compound
>> datatype where 64-bits are unused. This is the union above.
>>
>>> -Petri
>>>