Hi all. I recently found out that GCC's internal timing measurement is rather expensive. On my desktop (x86_64-linux, Debian 8) GCC uses times(2) to query performance information (this function is the default and is used whenever available), and it adds up to 8% to the wall-clock run time. To measure the impact, I compared two invocations of cc1plus, with the -ftime-report switch and without it (both with -quiet); the numbers are below.
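For context, each timer query currently boils down to roughly the following (just an illustrative sketch; these are not GCC's actual identifiers):

  /* One times(2) syscall per timer query: the kernel round trip is
     paid on every timer push/pop.  */
  #include <sys/times.h>
  #include <unistd.h>

  struct sample { double user, sys, wall; };

  static void
  get_sample (struct sample *s)
  {
    struct tms buf;
    clock_t wall = times (&buf);                 /* syscall */
    double ticks = (double) sysconf (_SC_CLK_TCK);
    s->user = buf.tms_utime / ticks;
    s->sys  = buf.tms_stime / ticks;
    s->wall = wall / ticks;
  }

Every push/pop of a timer pays for that syscall, which is what shows up as the extra system time below.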
  TraMP3d v.4 (-fpermissive -O3)
  normal run:    19.18 user  0.18 system  0:19.34 wall
  -ftime-report: 19.59 user  1.58 system  0:21.15 wall

As you can see, the difference is rather significant; I would say that it makes the "system time" counter useless. What is even worse is that this overhead is spread non-uniformly across the code. To get an estimate, I replaced the current time measurement code with the rdtsc instruction (affinity was set using taskset; the CPU clock rate was hard-coded). Some results:

Using times(2):
  phase setup             :   0.01 ( 0%) wall
  phase parsing           :   1.33 ( 6%) wall
  phase lang. deferred    :   1.65 ( 8%) wall
  phase opt and generate  :  18.14 (86%) wall
  |name lookup            :   0.27 ( 1%) wall
  |overload resolution    :   0.83 ( 4%) wall
  garbage collection      :   0.71 ( 3%) wall
  dump files              :   0.09 ( 0%) wall
  ...
  19.59 user  1.58 system  0:21.15 wall

Using rdtsc:
  phase setup             :   0.00 ( 0%) wall
  phase parsing           :   0.94 ( 5%) wall
  phase lang. deferred    :   1.47 ( 7%) wall
  phase opt and generate  :  17.24 (88%) wall
  |name lookup            :   0.18 ( 1%) wall
  |overload resolution    :   0.73 ( 4%) wall
  garbage collection      :   0.73 ( 4%) wall
  dump files              :   0.04 ( 0%) wall
  ...
  19.46 user  0.22 system  0:19.68 wall

For example, we can see that GC time is not affected, for "opt and generate" the difference is 5.2%, and for parsing it is 41%! The majority of calls comes from the front end: the C++ FE tries to measure the time spent in rather frequently used routines like overload resolution and template instantiation. The lexer calls timer::push and timer::pop once per token.

More detailed profiling results:
https://docs.google.com/spreadsheets/d/1icH6KZo88FvXkOMW2stP7qujPLqHDOeC0WBCJ_92DJU/edit?usp=sharing

Some ideas about fixing this:

1. Do not use times(2) in some cases. That is, add a second kind of timer which does not measure user/system time separately but is relatively cheap. Use clock_gettime(CLOCK_MONOTONIC, ...) for these timers, if available (on modern Linux systems it is implemented via the vDSO, i.e. it does not require a syscall, at least on x86_64); a minimal sketch is in the P.S. at the end of this mail. This does not solve the problem completely (I tried replacing times with clock_gettime, but the result is still ~0.5 seconds slower than a normal run). Using rdtsc is, IMHO, not an option: performance is OK, but it is too hard to make it portable and reliable.

2. Reduce the total number of calls somehow (figure out which ones are not really needed). During tramp3d compilation, get_time is called 13172693 times.

Thoughts?

--
Regards,
   Mikhail Maltsev
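P.S. A minimal sketch of the "cheap" wall-clock-only timer from idea 1 (assuming clock_gettime with CLOCK_MONOTONIC is available; the function name is made up for illustration):

  #include <time.h>

  /* Wall-clock-only timer.  On modern x86_64 Linux this call is
     serviced by the vDSO, so there is no syscall.  */
  static double
  get_wall_time (void)
  {
    struct timespec ts;
    clock_gettime (CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + ts.tv_nsec * 1e-9;
  }

The idea would be to use this kind of timer for the hot, fine-grained timevars (the per-token lexer push/pop, overload resolution, template instantiation) and keep the full times(2)-based user/system split only for the coarse per-phase timers.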