[Bug middle-end/44053] benchmark function attribute.
--- Comment #1 from svfuerst at gmail dot com 2010-05-10 06:36 ---
A common technique is to benchmark a function by calling it many times, i.e.

    void foo(void)
    {
        /* foo's implementation */
    }

    int main(void)
    {
        int i;

        for (i = 0; i < LARGE_NUM; i++)
            foo();
        return 0;
    }

The problem with this technique is that although the programmer would like to optimize foo, the compiler will over-optimize. In short, it would be nice if there were a flag to enforce the ABI in usage of foo to get meaningful results, i.e.

1) Prevent inlining of foo. (The current 'noinline' attribute.)
2) Prevent cloning of foo for specific argument cases.
3) Prevent deletion of calls to foo. In the example case, the calls to foo may be removed as no external side effects may be visible. However, the most important side effect, the total time taken, is then altered.
4) Enforce the existence of foo, so its disassembly may be examined. (The current 'used' attribute.)

At the moment an increasing number of work-arounds are needed to avoid these problems. It would be nice if a single function attribute could tell the compiler to avoid optimizations that cross the foo function-call interface boundary, while keeping optimizations within foo itself.

-- 
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=44053
[Bug middle-end/44053] benchmark function attribute.
--- Comment #2 from steven at gcc dot gnu dot org 2010-05-10 06:55 ---
Re. comment #1:

(1) For this, there is the noinline attribute, as you already knew.
(2) See the noclone attribute.
(3) See the regparm attribute.
(4) You could use volatile and things like that, or put the unit in a separate translation unit.
[Bug middle-end/44053] benchmark function attribute.
--- Comment #3 from rguenth at gcc dot gnu dot org 2010-05-10 09:13 ---
4) is already fine with noclone, noinline; for 3) you can add artificial side effects with an empty asm();
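The suggestion above can be sketched roughly as follows. This is a minimal illustration, assuming GCC on any target; the names `foo` and `call_foo_repeatedly` are placeholders and not part of the bug report, and the explicit `volatile` on the asm statement is a choice made here for clarity.

```c
#include <stddef.h>

/* Placeholder function under test; name and body are illustrative only. */
__attribute__((noinline, noclone))
void foo(void)
{
    /* An empty asm statement marked volatile counts as a side effect, so
       the compiler does not treat calls to foo() as removable dead code. */
    __asm__ volatile ("");
}

/* Calls foo() n times and returns the number of calls made. */
size_t call_foo_repeatedly(size_t n)
{
    size_t i;

    for (i = 0; i < n; i++)
        foo();
    return i;
}
```

Timing `call_foo_repeatedly` from `main` would then measure real calls, although the asm statement itself becomes part of what is measured.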
[Bug middle-end/44053] benchmark function attribute.
--- Comment #4 from steven at gcc dot gnu dot org 2010-05-10 11:00 ---
In other words: not an issue.

-- 
steven at gcc dot gnu dot org changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|UNCONFIRMED                 |RESOLVED
         Resolution|                            |WORKSFORME
[Bug middle-end/44053] benchmark function attribute.
--- Comment #5 from svfuerst at gmail dot com 2010-05-10 14:53 ---
The problem is that the list of these workarounds tends to grow with each release of gcc (e.g. noclone was added in gcc 4.5). It would be nice if there were a single attribute that would work with all future versions of the compiler, no matter how smart it gets.

For example, putting the function to benchmark into a separate compilation unit isn't guaranteed to work indefinitely. If LTO is ever enabled by default, it will cause problems.

Think of it as the difference between -O2 and the long list of command-line optimization flags that -O2 represents. Eventually the complexity of adding that new flag is less than the complexity of the list of things it represents.
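Until such an attribute exists, one way to keep the growing list in a single place is a project-local macro. The sketch below is an assumption about how that might look, not a GCC feature: `BENCH_ABI`, `add_one` and `run_add_one` are hypothetical names, and the version check relies only on the statement above that noclone arrived in gcc 4.5.

```c
/* BENCH_ABI is a hypothetical project-local macro, not something GCC
   provides. It bundles the attributes named in this thread so that a
   future addition only needs to be made here. */
#if defined(__GNUC__) && (__GNUC__ > 4 || (__GNUC__ == 4 && __GNUC_MINOR__ >= 5))
#define BENCH_ABI __attribute__((noinline, noclone, used))
#elif defined(__GNUC__)
#define BENCH_ABI __attribute__((noinline, used))
#else
#define BENCH_ABI
#endif

/* Placeholder function under test. */
static BENCH_ABI int add_one(int x)
{
    return x + 1;
}

/* Chains the calls so that each result feeds the next call. */
int run_add_one(int iterations)
{
    int i, v = 0;

    for (i = 0; i < iterations; i++)
        v = add_one(v);
    return v;
}
```

This only centralises the workarounds; it does not address the concern that a future compiler may need yet another attribute added to the list.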
[Bug middle-end/44053] benchmark function attribute.
--- Comment #6 from pinskia at gcc dot gnu dot org 2010-05-10 22:04 ---
Also, I think it is a bad idea to have this kind of attribute. If your benchmark can be optimized away, that is better for newer versions of the compiler.
[Bug middle-end/44053] benchmark function attribute.
--- Comment #7 from svfuerst at gmail dot com 2010-05-10 22:44 ---
Perhaps an example usage helps:

The __float128 version of isnan() is rather slow. Trying different implementations to see which is faster required some benchmarking. However, implementing the benchmark code requires an increasing number of work-arounds, as gcc will rightly optimize everything away if given the chance to do so.

This required using the result of that function/macro in some way, the simplest being to sum the results and print the total. The problem is that for the faster code, the overhead of the addition starts to perturb the results. In addition, an increasing number of function attributes are required to make sure the function isn't cloned/inlined/elided as the gcc version number increases.

However, this isn't enough. It is conceivable that eventually gcc will be smart enough to understand the benchmarked function completely, and replace the summation loop + printf with a single puts(result); This is allowable since the internal state of the abstract machine is never used, only its output. For timing purposes, this is a disaster. (This doesn't currently happen for isnan(), but does for other, simpler functions.)

In short, it would be really nice if there were a way to tell gcc that a function has a hidden side effect that is important: the total time taken by calling it. Such an attribute may only be a combination of other attributes, but given the history of the compiler, the number of component attributes will increase with time, and is already unwieldy.
Anyway, the result of much benchmarking shows that:

    #include <emmintrin.h>

    static __attribute__((noinline)) int fastisnan(__float128 x)
    {
        __m128i c1 = {0xffffffffffffffffull, 0x7fffffffffffffffull};
        __m128i c2 = {0x7fff7fff7fff7fffull, 0x00017fff7fff7fffull};
        __m128i x2 = *(__m128i *) &x;

        x2 &= c1;
        x2 = _mm_adds_epu16(c2, x2);
        return (_mm_movemask_epi8(x2) & 0xaaaa) > 0x8000;
    }

is an order of magnitude faster than the current isnan() implementation for __float128 on x86_64. Similar improvements exist for isinf() and fpclassify().
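To see why a 16-bit saturating add can classify a NaN, the following is a portable scalar model of the arithmetic, not the SSE code itself. It rests on assumptions that should be read as such: the sign-clearing mask is taken to be 0x7fffffffffffffff on the upper half, the movemask result is taken to be filtered with 0xaaaa, and the final comparison is taken to be "greater than 0x8000". The name `model_isnan128` and the two-halves interface are hypothetical.

```c
#include <stdint.h>

/* Scalar model of the saturating-add test. hi and lo are the upper and
   lower 64 bits of an IEEE binary128 value. Constants are assumptions. */
int model_isnan128(uint64_t hi, uint64_t lo)
{
    const uint64_t c2_hi = 0x00017fff7fff7fffull;
    const uint64_t c2_lo = 0x7fff7fff7fff7fffull;
    unsigned mask = 0;
    int w;

    hi &= 0x7fffffffffffffffull;        /* clear the sign bit */

    for (w = 0; w < 8; w++) {
        uint64_t xs = (w < 4) ? lo : hi;
        uint64_t cs = (w < 4) ? c2_lo : c2_hi;
        unsigned shift = 16u * (unsigned)(w & 3);
        uint32_t sum = (uint32_t)((xs >> shift) & 0xffff)
                     + (uint32_t)((cs >> shift) & 0xffff);

        if (sum > 0xffff)
            sum = 0xffff;               /* unsigned 16-bit saturation */
        /* A byte-wise movemask masked with 0xaaaa keeps the top bit of
           each 16-bit word; that is bit 2w+1 of the mask. */
        if (sum & 0x8000)
            mask |= 1u << (2 * w + 1);
    }
    /* Bit 15 set: exponent is all ones. Any lower bit set: a mantissa
       word is nonzero. Both together mean NaN. */
    return mask > 0x8000;
}
```

Adding 0x7fff to a mantissa word sets its top bit exactly when the word is nonzero, and adding 1 to the exponent word sets its top bit only when the exponent is 0x7fff, which is why infinities produce exactly 0x8000 and NaNs something larger.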
[Bug middle-end/44053] benchmark function attribute.
--- Comment #8 from pinskia at gcc dot gnu dot org 2010-05-10 22:49 ---
> Anyway, the result of much benchmarking shows that:

Is it? It definitely moves from the x87 registers to the SSE registers, which can be slow. Micro-benchmarks are not always true benchmarks.

Also, there are other ways of benchmarking something like isnan. Make a big array of random numbers (whose seed is based on the wall clock) and then test it that way. Save the seed somewhere and read it back into the program if you want consistent numbers. Add up the number of times isnan returns nonzero and that should give you a good benchmark. Better than what you have below.
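A rough sketch of the array-based approach described above follows. Several things here are assumptions made for portability and repeatability rather than anything stated in the thread: it uses double and the standard isnan() macro instead of __float128, it fills the array with random bit patterns from a small xorshift generator so that a saved seed reproduces the same data everywhere, and the names `next_random` and `count_nans` are hypothetical.

```c
#include <math.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* One xorshift64 step; the state must be nonzero. */
static uint64_t next_random(uint64_t *state)
{
    uint64_t x = *state;

    x ^= x << 13;
    x ^= x >> 7;
    x ^= x << 17;
    *state = x;
    return x;
}

/* Fills an array of n doubles with random bit patterns derived from seed,
   then returns how many of them isnan() reports as NaN.
   Returns -1 if the array cannot be allocated. */
long count_nans(uint64_t seed, size_t n)
{
    uint64_t state = seed ? seed : 1;
    double *data = malloc(n * sizeof *data);
    long count = 0;
    size_t i;

    if (data == NULL)
        return -1;
    for (i = 0; i < n; i++) {
        uint64_t bits = next_random(&state);
        memcpy(&data[i], &bits, sizeof bits);
    }
    /* This is the loop a benchmark would time. */
    for (i = 0; i < n; i++)
        count += isnan(data[i]) != 0;
    free(data);
    return count;
}
```

A driver could take the seed from time(NULL), print it alongside the count, and accept it back on the command line to repeat a run with identical input.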
[Bug middle-end/44053] benchmark function attribute.
--- Comment #9 from svfuerst at gmail dot com 2010-05-10 23:27 ---
Remember that isnan() is a weird type-dependent macro. The special case I was testing is the __float128 version. __float128 values are passed in SSE registers, so using SSE instructions to manipulate them can be a win. (No x87 involved.) Unfortunately, the SSE instruction set isn't all that orthogonal, so using the normal 64-bit registers can be faster in some cases. It also isn't obvious which SSE-based algorithm is the best without testing. Hence all the benchmarking.

In this case, the resulting function is branchless, so it doesn't matter much which particular input values you use for the timings. However, adding extra memory reads (like scanning an array for input, as you describe) or writes (via storing the output to a volatile) does change the timings.
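One workaround sometimes used to consume a result without a volatile store is an empty extended asm statement that takes the value as a register input. This technique is not proposed anywhere in the thread; the sketch below assumes GCC-style extended asm, and `consume_value`, `is_negative` and `run_consume` are hypothetical names.

```c
/* Tells the compiler the value is used, without writing it to memory.
   The asm body is empty; only the input constraint matters. */
static inline void consume_value(int value)
{
    __asm__ volatile ("" : : "r"(value));
}

/* Placeholder branchless function under test. */
__attribute__((noinline))
int is_negative(int x)
{
    return x < 0;
}

/* Calls is_negative() once per iteration and discards each result
   through consume_value(). Returns the number of calls made. */
int run_consume(int iterations)
{
    int i;

    for (i = 0; i < iterations; i++)
        consume_value(is_negative(i - iterations / 2));
    return i;
}
```

Whether this perturbs timings less than a summation or a volatile store would still need to be measured for the function in question.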