[Bug middle-end/44053] benchmark function attribute.

2010-05-10 Thread svfuerst at gmail dot com


--- Comment #1 from svfuerst at gmail dot com  2010-05-10 06:36 ---
A common technique is to benchmark a function by calling it many times, e.g.:

void foo(void)
{
  /* foo's implementation */
}

int main(void)
{
   int i;

   for (i = 0; i < LARGE_NUM; i++) foo();

   return 0;
}

The problem with this technique is that although the programmer would like to
optimize foo, the compiler will over-optimize the calls to it.

In short, it would be nice if there was a flag to enforce the ABI in usage of
foo to get meaningful results.

i.e.

1) Prevent inlining of foo.  (The current 'noinline' attribute.)
2) Prevent cloning of foo for specific argument cases.
3) Prevent deletion of calls to foo.  In the example case, the calls to foo may
be removed as no external side effects may be visible.  However, the most
important side effect, the total time taken, is then altered.
4) Enforce the existence of foo, so its disassembly may be examined.  (The
current 'used' attribute.)

At the moment an increasing number of work-arounds need to be done to avoid
these problems.  It would be nice if a single function attribute would tell the
compiler to avoid optimizations that cross the foo function-call interface
boundary, but maintain optimizations within foo itself.



-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=44053



2010-05-10 Thread steven at gcc dot gnu dot org


--- Comment #2 from steven at gcc dot gnu dot org  2010-05-10 06:55 ---
Re. comment #1:

(1) For this, there is the noinline attribute, as you already knew.
(2) See the noclone attribute
(3) See the regparm attribute
(4) You could use volatile and things like that, or put the function in a
separate translation unit.


2010-05-10 Thread rguenth at gcc dot gnu dot org


--- Comment #3 from rguenth at gcc dot gnu dot org  2010-05-10 09:13 ---
4) is already fine with noclone and noinline.

For 3) you can add artificial side effects with an empty asm("");


2010-05-10 Thread steven at gcc dot gnu dot org


--- Comment #4 from steven at gcc dot gnu dot org  2010-05-10 11:00 ---
In other words: not an issue.


-- 

steven at gcc dot gnu dot org changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|UNCONFIRMED                 |RESOLVED
         Resolution|                            |WORKSFORME


2010-05-10 Thread svfuerst at gmail dot com


--- Comment #5 from svfuerst at gmail dot com  2010-05-10 14:53 ---
The problem is that the list of these workarounds tends to grow with each
release of gcc (e.g. noclone was added in gcc 4.5).  It would be nice if there
was a single attribute to use that would work with all future versions of the
compiler, no matter how smart it gets.  For example, putting the function to
benchmark into a separate compilation unit isn't guaranteed to work
indefinitely.  If LTO is ever enabled by default, it will cause problems.

Think of it as the difference between -O2, and the long list of command line
optimization flags that -O2 represents.  Eventually the complexity of adding
that new flag is less than the complexity of the list of things it represents.


2010-05-10 Thread pinskia at gcc dot gnu dot org


--- Comment #6 from pinskia at gcc dot gnu dot org  2010-05-10 22:04 ---
Also, I think having this kind of attribute is a bad idea.  If your
benchmark can be optimized away, that is better for newer versions of the
compiler.


2010-05-10 Thread svfuerst at gmail dot com


--- Comment #7 from svfuerst at gmail dot com  2010-05-10 22:44 ---
Perhaps an example usage helps:

The __float128 version of isnan() is rather slow.  Trying different
implementations to see which is faster required some benchmarking.  However,
implementing the benchmark code requires an increasing number of work-arounds,
as gcc will rightly optimize everything away if given the chance to do so.
Preventing that required using the result of the function/macro in some way,
the simplest being to sum the results and print the total.  The problem is that
for the faster code, the overhead of the addition starts to perturb the
results.  In addition, an increasing number of function attributes are required
to make sure the function isn't cloned/inlined/elided as the gcc version number
increases.

However, this isn't enough.  It is conceivable that gcc will eventually be
smart enough to understand the benchmarked function so completely that it
replaces the summation loop + printf with a single puts(result);  This is
allowable since the internal state of the abstract machine is never used, only
its output.  For timing purposes, this is a disaster.  (This doesn't happen for
isnan() currently, but does for other simpler functions.)

In short, it would be really nice if there was a way to tell gcc that there is
a hidden side effect of a function that is important: the total time taken due
to calling it.  Such an attribute may only be a combination of other
attributes, but given the history of the compiler, the number of component
attributes will increase with time, and is already an unwieldy number.

Anyway, the result of much benchmarking shows that:
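#include <emmintrin.h>
static __attribute__((noinline)) int fastisnan(__float128 x)
{
  /* c1 clears the sign bit and keeps the exponent and mantissa. */
  __m128i c1 = {0xffffffffffffffffull, 0x7fffffffffffffffull};
  /* c2 is added with unsigned saturation, one 16-bit word at a time. */
  __m128i c2 = {0x7fff7fff7fff7fffull, 0x00017fff7fff7fffull};
  __m128i x2 = *(__m128i *) &x;

  x2 &= c1;
  /* Top bit of a mantissa word is now set iff that word was nonzero;
     top bit of the exponent word is set iff the exponent was 0x7fff. */
  x2 = _mm_adds_epu16(c2, x2);
  /* 0xaaaa keeps the top bit of each 16-bit word: NaN means the exponent
     bit (0x8000) is set together with at least one mantissa bit. */
  return (_mm_movemask_epi8(x2) & 0xaaaa) > 0x8000;
}
is an order of magnitude faster than the current isnan() implementation for
__float128 on x86_64.  Similar improvements exist for isinf() and fpclassify().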


2010-05-10 Thread pinskia at gcc dot gnu dot org


--- Comment #8 from pinskia at gcc dot gnu dot org  2010-05-10 22:49 ---
> Anyway, the result of much benchmarking shows that:

Is it?  It definitely moves from the x87 registers to the SSE registers, which
can be slow.  Micro benchmarks are not always true benchmarks.  Also, there are
other ways of benchmarking something like isnan.  Make a big array of random
numbers (whose seed is based on the wall clock) and then test it that way.
Save the seed somewhere you can read it back into the program if you want
consistent numbers.  Count the number of times isnan returns nonzero and that
should give you a good benchmark.  Better than what you have below.


2010-05-10 Thread svfuerst at gmail dot com


--- Comment #9 from svfuerst at gmail dot com  2010-05-10 23:27 ---
Remember that isnan() is a weird type-dependent macro.  The special case I was
testing is the __float128 version.  __float128's are passed in sse registers,
so using sse instructions to manipulate them can be a win.  (No x87 involved.) 
Unfortunately, the sse instruction set isn't all that orthogonal, so using the
normal 64-bit registers can be faster in some cases.  It also isn't obvious
which sse-based algorithm is the best without testing.  Hence all the
benchmarking.

In this case, the resulting function is branchless, so it doesn't matter much
which particular input values you use for the timings.  However, adding
extra memory reads (like scanning an array for input like you describe), or
writes (via storing the output to a volatile) does change the timings.

