Hi all. I recently found out that GCC's internal timing measurement is rather expensive. On my desktop (x86_64-linux, Debian 8) GCC uses times(2) to query performance information (this function is the default and is used whenever available), and it adds up to 8% to the wall-clock run time. To measure the impact, I compared two invocations of cc1plus, with the -ftime-report switch and without it (both with -quiet); the numbers are below.
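For context, each timer query currently boils down to roughly the following (just an illustrative sketch; these are not GCC's actual identifiers):

  /* One times(2) syscall per timer query: the kernel round trip is
     paid on every timer push/pop.  */
  #include <sys/times.h>
  #include <unistd.h>

  struct sample { double user, sys, wall; };

  static void
  get_sample (struct sample *s)
  {
    struct tms buf;
    clock_t wall = times (&buf);                 /* syscall */
    double ticks = (double) sysconf (_SC_CLK_TCK);
    s->user = buf.tms_utime / ticks;
    s->sys  = buf.tms_stime / ticks;
    s->wall = wall / ticks;
  }

Every push/pop of a timer pays for that syscall, which is what shows up as the extra system time below.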
  TraMP3d v.4 (-fpermissive -O3)
  normal run:    19.18 user  0.18 system  0:19.34 wall
  -ftime-report: 19.59 user  1.58 system  0:21.15 wall

As you can see, the difference is rather significant; I would say that it makes the "system time" counter useless. What is even worse is that this overhead is spread non-uniformly across the code. To get an estimate, I replaced the current time measurement code with the rdtsc instruction (affinity was set using taskset; the CPU clock rate was hard-coded). Some results:

Using times(2):
  phase setup             :   0.01 ( 0%) wall
  phase parsing           :   1.33 ( 6%) wall
  phase lang. deferred    :   1.65 ( 8%) wall
  phase opt and generate  :  18.14 (86%) wall
  |name lookup            :   0.27 ( 1%) wall
  |overload resolution    :   0.83 ( 4%) wall
  garbage collection      :   0.71 ( 3%) wall
  dump files              :   0.09 ( 0%) wall
  ...
  19.59 user  1.58 system  0:21.15 wall

Using rdtsc:
  phase setup             :   0.00 ( 0%) wall
  phase parsing           :   0.94 ( 5%) wall
  phase lang. deferred    :   1.47 ( 7%) wall
  phase opt and generate  :  17.24 (88%) wall
  |name lookup            :   0.18 ( 1%) wall
  |overload resolution    :   0.73 ( 4%) wall
  garbage collection      :   0.73 ( 4%) wall
  dump files              :   0.04 ( 0%) wall
  ...
  19.46 user  0.22 system  0:19.68 wall

For example, we can see that GC time is not affected, for "opt and generate" the difference is 5.2%, and for parsing it is 41%! The majority of calls comes from the front end: the C++ FE tries to measure the time spent in rather frequently used routines like overload resolution and template instantiation. The lexer calls timer::push and timer::pop once per token.

More detailed profiling results:
https://docs.google.com/spreadsheets/d/1icH6KZo88FvXkOMW2stP7qujPLqHDOeC0WBCJ_92DJU/edit?usp=sharing

Some ideas about fixing this:

1. Do not use times(2) in some cases. That is, add a second kind of timer which does not measure user/system time separately but is relatively cheap. Use clock_gettime(CLOCK_MONOTONIC, ...) for these timers, if available (on modern Linux systems it is implemented via the vDSO, i.e. it does not require a syscall, at least on x86_64); a minimal sketch is in the P.S. at the end of this mail. This does not solve the problem completely (I tried replacing times with clock_gettime, but the result is still ~0.5 seconds slower than a normal run). Using rdtsc is, IMHO, not an option: performance is OK, but it is too hard to make it portable and reliable.

2. Reduce the total number of calls somehow (figure out which ones are not really needed). During tramp3d compilation, get_time is called 13172693 times.

Thoughts?

--
Regards,
   Mikhail Maltsev
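P.S. A minimal sketch of the "cheap" wall-clock-only timer from idea 1 (assuming clock_gettime with CLOCK_MONOTONIC is available; the function name is made up for illustration):

  #include <time.h>

  /* Wall-clock-only timer.  On modern x86_64 Linux this call is
     serviced by the vDSO, so there is no syscall.  */
  static double
  get_wall_time (void)
  {
    struct timespec ts;
    clock_gettime (CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + ts.tv_nsec * 1e-9;
  }

The idea would be to use this kind of timer for the hot, fine-grained timevars (the per-token lexer push/pop, overload resolution, template instantiation) and keep the full times(2)-based user/system split only for the coarse per-phase timers.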