* David Goulet ([email protected]) wrote: > > > On 10-07-07 12:32 PM, Mathieu Desnoyers wrote: >> * David Goulet ([email protected]) wrote: >>> On 10-07-06 03:39 PM, Nils Carlson wrote: >>>> Cool, so the measurements came through... >>>> >>> >>> I've retested UST per event time with the new commit made few days ago >>> fixing the custom probes and cache line alignment. Here are the results >>> for TSC counter and clock_gettime (test made 1000 times on i7) : >>> >>> rdtsc : >>> Average : 0.000000242229708 sec, 242.22971 nsec >>> Standard Deviation : 0.000000001663147 sec , 1.66315 nsec >>> >>> clock_gettime : >>> Average : 0.000000272516616 sec, 272.51662 nsec >>> Standard Deviation : 0.000000002340784 sec , 2.34078 nsec >>> >>>> What I would like to see is the automatic detection of whether the rdtsc >>>> instruction is usable, >>>> a test for this already exists in the kernel and the question is whether >>>> this info is currently exported >>>> or whether we need to submit a patch to export it. >>>> >>> >>> From userspace, to test, this would be a syscall via prctl right? The >>> thing is that it's needed at compile time. Right now, the __i386__ and >>> __x86_64__ define is tested. Upon gcc compilation, it would be great to >>> have something like TSC_AVAILABLE define and then compile the right >>> function (either clock_gettime or rdtsc). >>> >>> However, there is some issues about consistency by using TSC for example >>> between CPUs counter... so I think we need to be very careful about that >>> even if the performance are 30ns less and much more _stable_ (see std >>> variation). >> >> We only care about having a consistent read across CPU (and speed, aka >> throughput). Having different standard deviation does not matter much. >> We cannot know if the architecture we will be deployed on has consistent >> TSCs across cores, so we have to test it at runtime. >> >> One approach might be to try using prctl at library load, but I don't >> see any information about consistent tsc in there. >> >> The other approach is to use a vDSO for trace clock (as I proposed >> earlier). You can try to create something very similar in userland for >> benchmarks: Create a function that tests a global boolean to figure out >> if we can simply read the TSC, and perform the TSC read if the check is >> ok. Make sure the function is -not- static and has the attribute >> "noinline", so the compiler generates the function call. Also make sure >> that the variable you are testing for "tsc consistency" is not marked >> static neither, but rather marked "volatile", so the compiler does not >> optimize the load away. Compile with -O2. >> > > I'm wondering why do the test function need to be "noinline" and the > bool volatile? If the test is done at library load (prctl() syscall), it > won't change for the rest of the execution so inlining should be here > more efficient and static bool also no?
I'm saying this for the specific case where you want to test this kind of function directly in a program, without doing the library already. So it would "mimic" the call to a library from within a program. You could do without the volatile if you don't expect the variable to be read concurrently with its "set" (done at library load time at process start). Thanks, Mathieu > > Note that this is not about TSC consistency has we talked the other day > but rather only check _if_ the TSC is available. > > Thanks > David > >> For the cases where we need more kernel support (due to non-consistent >> TSCs across cores), you might also want to export the linux sequence >> lock: include/linux/seqlock.h into user-space (we only nead the read >> seqlock part, with smp_mb() mapped to the urcu memory barriers) and >> figure out the overhead of this sequence lock. This will be needed to >> ensure consistency of the data structures that will be needed to support >> the vDSO when the "consistent tsc" dynamic check fails. >> >> Thanks, >> >> Mathieu >> >>> >>> David >>> >>>> Then we should probably start looking at a simple choosing mechanism, >>>> probably a function pointer? >>>> >>>> /Nils >>>> On Jul 6, 2010, at 8:12 PM, David Goulet wrote: >>>> >>>>> Hey, >>>>> >>>>> After some talks with Nils from Ericsson, there was some questions >>>>> about using the TSC counter and not clock_gettime in include/ust/clock.h >>>>> >>>>> I ran some test after the meeting and was quite surprised by the >>>>> overhead of clock_gettime. >>>>> >>>>> On an average run ... >>>>> WITH clock_gettime : ~ 266ns per events >>>>> WITH rdtsc instruction : ~ 235ns per events >>>>> >>>>> And it is systematic... I'm getting stable result with rdtsc with >>>>> standard deviation of ~2ns. >>>>> >>>>> As little as I know on TSC, one thing for sure, with SMP, it becomes >>>>> much more "fragile" to rely on it because we don't have assurance of >>>>> coherent counters between CPUs and also the CPU scaling policy >>>>> (ondemand is default on Ubuntu now). New CPUs support constant_tsc and >>>>> nonstop_tsc flags but still a small range of them. >>>>> >>>>> Right now, UST is forcing the use of clock_gettime even if i386 or >>>>> x86_64 is used. >>>>> Should a change be consider ? >>>>> >>>>> Thanks >>>>> David >>>>> >>>>> _______________________________________________ >>>>> ltt-dev mailing list >>>>> [email protected] >>>>> http://lists.casi.polymtl.ca/cgi-bin/mailman/listinfo/ltt-dev >>>> >>> >>> _______________________________________________ >>> ltt-dev mailing list >>> [email protected] >>> http://lists.casi.polymtl.ca/cgi-bin/mailman/listinfo/ltt-dev >>> >> > > -- > David Goulet > LTTng project, DORSAL Lab. > > PGP/GPG : 1024D/16BD8563 > BE3C 672B 9331 9796 291A 14C6 4AF7 C14B 16BD 8563 > > _______________________________________________ > ltt-dev mailing list > [email protected] > http://lists.casi.polymtl.ca/cgi-bin/mailman/listinfo/ltt-dev > -- Mathieu Desnoyers Operating System Efficiency R&D Consultant EfficiOS Inc. http://www.efficios.com _______________________________________________ ltt-dev mailing list [email protected] http://lists.casi.polymtl.ca/cgi-bin/mailman/listinfo/ltt-dev
