On 04/27 08:21:34, Savolainen, Petri (Nokia - FI/Espoo) wrote:
>
>
> > -----Original Message-----
> > From: Brian Brooks [mailto:[email protected]]
> > Sent: Wednesday, April 19, 2017 9:46 PM
> > To: Savolainen, Petri (Nokia - FI/Espoo) <[email protected]>
> > Cc: [email protected]
> > Subject: Re: [lng-odp] [API-NEXT PATCH 8/8] linux-gen: time: use hw time
> > counter when available
> >
> > On 04/26 07:30:15, Savolainen, Petri (Nokia - FI/Espoo) wrote:
> > >
> > > > > This function (cpu_global_time()) is called only when we have
> > > > > first checked that TSC is invariant. Also we measure the TSC
> > > > > frequency in that case. This function is defined in the same
> > > > > file as cpu_cycles(), and the file is x86 specific. So, we know
> > > > > what we are doing, and just re-using the code to read TSC.
> > > >
> > > > What sort of timing accuracy is expected from the app?
> > > >
> > > > From benchmarking the maximum single-threaded rate of these reads:
> > > >
> > > > x86_64:
> > > >
> > > >   read       7 ns/op
> > > >   read_sync 22 ns/op
> > > >
> > > > A57:
> > > >
> > > >   read       4 ns/op
> > > >   read_sync 26 ns/op
> > > >
> > > > read_sync issues a synchronizing instruction for greater timing
> > > > accuracy but clearly takes more time to return the time value read
> > > > from the core.
> > >
> > > Accuracy is as good as implementation can offer with reasonable
> > > overhead. We do not put any nsec figures into API spec. ODP API
> > > should offer application the most efficient way to read time anyway.
> >
> > 'reasonable' is what we need to define.
> >
> > Another reason why you're seeing a performance boost on x86 is that when
> > switching from clock_gettime() to RDTSC, you're no longer issuing a
> > synchronizing instruction (fence). As shown above, this can be a
> > significant factor depending on how often the time is being sampled.
> >
> > However, there is a loss in timing accuracy because the load of the value
> > may not happen at the time it happens in program order. This is why a
> > synchronizing instruction needs to be used, but it slows down the
> > execution of the thread on the core...
> >
> > > This patch does not take a position on which way TSC should be read.
> > > There are three options: rdtsc, rdtsc + barrier, rdtscp. I think the
> > > current code is good enough for the accuracy. Barrier adds slight
> > > overhead. Rdtscp is not as widely supported as rdtsc. This detail is
> > > a magnitude less significant compared to: use system call vs direct
> > > TSC read. It can be tuned later. This patch set helps if rdtscp
> > > should be used later on (introduces x86 cpu flags).
> >
> > So you're saying that you do not need the synchronizing instruction, and
> > the loss of timing accuracy is OK, right?
>
> What is our timing accuracy today? How much jitter does the system call
> (and everything it may launch) cause in the current implementation? How
> much accuracy are we losing compared to what we have today? I'd say we are
> better off than today anyway, just because we avoid the system call.
>
> We don't have a fence today on the rdtsc read for cycle count. This patch
> does not add it, either. If we are good without it for cycles, we are good
> for nsec also.
>
> The time scale an application typically measures is on the order of micro-
> to milliseconds: thousands to millions of nanoseconds. If out-of-order
> execution moves the sample position by e.g. 20 nsec (40 CPU cycles), it is
> not a significant error in the measurement. Already a single cache miss
> during a measurement causes the same kind of noise. Also, the CPU crystal
> might not be too accurate either, etc.

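To put rough numbers on the argument quoted above, here is a minimal sketch
of the arithmetic. It assumes a 2 GHz counter, which is only what the
"20 nsec (40 CPU cycles)" figure implies. The function names are made up for
illustration and are not part of ODP or of this patch set.

```c
#include <assert.h>
#include <stdint.h>

/* Convert a cycle count to nanoseconds for a given counter frequency.
 * tsc_hz is an assumed input here; the patch measures the real value. */
double cycles_to_ns(uint64_t cycles, double tsc_hz)
{
    assert(tsc_hz > 0.0);
    return (double)cycles * 1e9 / tsc_hz;
}

/* Relative error, in percent, that a sampling skew of skew_ns adds to a
 * measured interval of interval_ns. */
double skew_error_pct(double skew_ns, double interval_ns)
{
    assert(interval_ns > 0.0);
    return 100.0 * skew_ns / interval_ns;
}
```

Under those assumptions, cycles_to_ns(40, 2e9) is 20 ns, and
skew_error_pct(20, 1000) is 2% for a 1 usec measurement, while
skew_error_pct(20, 1e6) is 0.002% for a 1 msec measurement.
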
Thanks for confirming that from the application point of view. Can you also
comment as to whether this is the same position for Timer Pool, Timer, and
Timeout events?

> -Petri
>
>
