[Bug c/46186] Clang creates code running 1600 times faster than gcc's

2010-10-29 Thread tg at mirbsd dot org
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=46186

Thorsten Glaser tg at mirbsd dot org changed:

   What|Removed |Added

 CC||tg at mirbsd dot org

--- Comment #22 from Thorsten Glaser tg at mirbsd dot org 2010-10-29 21:06:44 
UTC ---
(In reply to comment #19)
 It's a bit more complicated than that, in that you can't just compute 
 (b*(b-1)-a*(a-1)) mod 2^n, then divide by 2, as that will lose the top 
 bit.  (I haven't checked exactly what the generated code is doing here.)

Haven’t checked either, but in i386 asm, using RCR instead of SHR would work
if the carry flag is correct after the sequence calculating that (or SAR).


[Bug c/46186] Clang creates code running 1600 times faster than gcc's

2010-10-26 Thread j...@jak-linux.org
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=46186

--- Comment #1 from Julian Andres Klode j...@jak-linux.org 2010-10-26 
14:30:24 UTC ---
Created attachment 22162
  -- http://gcc.gnu.org/bugzilla/attachment.cgi?id=22162
Clang's assember

Attaching the assembler output from clang, it should help understand which
optimizations clang does (and improve gcc to do them as well).


[Bug c/46186] Clang creates code running 1600 times faster than gcc's

2010-10-26 Thread paolo.carlini at oracle dot com
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=46186

Paolo Carlini paolo.carlini at oracle dot com changed:

   What|Removed |Added

 CC||paolo.carlini at oracle dot
   ||com

--- Comment #2 from Paolo Carlini paolo.carlini at oracle dot com 2010-10-26 
14:31:47 UTC ---
;)


[Bug c/46186] Clang creates code running 1600 times faster than gcc's

2010-10-26 Thread j...@jak-linux.org
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=46186

--- Comment #3 from Julian Andres Klode j...@jak-linux.org 2010-10-26 
14:32:27 UTC ---
System information:

Using built-in specs.
Target: x86_64-linux-gnu
Configured with: ../src/configure -v --with-pkgversion='Debian 4.4.5-5'
--with-bugurl=file:///usr/share/doc/gcc-4.4/README.Bugs
--enable-languages=c,c++,fortran,objc,obj-c++ --prefix=/usr
--program-suffix=-4.4 --enable-shared --enable-multiarch
--enable-linker-build-id --with-system-zlib --libexecdir=/usr/lib
--without-included-gettext --enable-threads=posix
--with-gxx-include-dir=/usr/include/c++/4.4 --libdir=/usr/lib --enable-nls
--enable-clocale=gnu --enable-libstdcxx-debug --enable-objc-gc
--with-arch-32=i586 --with-tune=generic --enable-checking=release
--build=x86_64-linux-gnu --host=x86_64-linux-gnu --target=x86_64-linux-gnu
Thread model: posix
gcc version 4.4.5 (Debian 4.4.5-5)


[Bug c/46186] Clang creates code running 1600 times faster than gcc's

2010-10-26 Thread redi at gcc dot gnu.org
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=46186

--- Comment #4 from Jonathan Wakely redi at gcc dot gnu.org 2010-10-26 
14:47:12 UTC ---
GCC's output is significantly faster at -O3 or without the noinline attribute


[Bug c/46186] Clang creates code running 1600 times faster than gcc's

2010-10-26 Thread j...@jak-linux.org
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=46186

--- Comment #5 from Julian Andres Klode j...@jak-linux.org 2010-10-26 
14:53:24 UTC ---
(In reply to comment #4)
 GCC's output is significantly faster at -O3 or without the noinline attribute

I just tested and at -O3, gcc-4.4 creates slow code and gcc-4.5 fast code. At
-O2, both are equally fast. The clang 1.1 code at -O2 is as fast as GCC 4.5's
code at -O3.


[Bug c/46186] Clang creates code running 1600 times faster than gcc's

2010-10-26 Thread dominiq at lps dot ens.fr
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=46186

--- Comment #6 from Dominique d'Humieres dominiq at lps dot ens.fr 2010-10-26 
14:59:18 UTC ---
You get this kind of speedup if the compiler knows that the result of the loop
is

sum=(b*(b-1)-a*(a-1))/2

In which case the timing is meaningless (it is 0.000s on my laptop), so is the
ratio with the execution of the loop.

The basic question is: how much the user's ignorance should be repaired by the
optimizer? (A colleague of mine told me that he once audited a CFD code and
found that \int_a^b dx/x was evaluated numerically instead of using
log(b)-log(a)!-)


[Bug c/46186] Clang creates code running 1600 times faster than gcc's

2010-10-26 Thread j...@jak-linux.org
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=46186

--- Comment #7 from Julian Andres Klode j...@jak-linux.org 2010-10-26 
15:00:37 UTC ---
(In reply to comment #5)
 (In reply to comment #4)
  GCC's output is significantly faster at -O3 or without the noinline 
  attribute
 
 I just tested and at -O3, gcc-4.4 creates slow code and gcc-4.5 fast code. At
 -O2, both are equally fast. The clang 1.1 code at -O2 is as fast as GCC 4.5's
 code at -O3.

Using gcc 4.5 with -O3 may work for the C code, but it does not optimize the
C++ code posted at http://lwn.net/Articles/411776/:

g++-4.5 -O3: real 0m1.608s
clang++ -O2: real 0m0.003s
clang++ -Os: real 0m0.003s

But I guess the C++ problem might be a different one.


[Bug c/46186] Clang creates code running 1600 times faster than gcc's

2010-10-26 Thread j...@jak-linux.org
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=46186

--- Comment #8 from Julian Andres Klode j...@jak-linux.org 2010-10-26 
15:25:56 UTC ---
(In reply to comment #6)
 You get this kind of speedup if the compiler knows that the result of the loop
 is
 
 sum=(b*(b-1)-a*(a-1))/2
 
 In which case the timing is meaningless (it is 0.000s on my laptop), so is the
 ratio with the execution of the loop.
 
 The basic question is: how much the user's ignorance should be repaired by the
 optimizer? (A colleague of mine told me that he once audited a CFD code and
 found that \int_a^b dx/x was evaluated numerically instead of using
 log(b)-log(a)!-)

Since the optimization seems to be mostly there in -O3, it's just a matter of
enabling it in -O2. 

I just found out that it does not optimize if you call f() via a global
function pointer, it still takes 1.6 seconds despite being compiled at -O3,
whereas clang can optimize it to 0.001s.


[Bug c/46186] Clang creates code running 1600 times faster than gcc's

2010-10-26 Thread redi at gcc dot gnu.org
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=46186

--- Comment #9 from Jonathan Wakely redi at gcc dot gnu.org 2010-10-26 
15:28:51 UTC ---
(In reply to comment #8)
 
 Since the optimization seems to be mostly there in -O3, it's just a matter of
 enabling it in -O2. 

Or if you want all optimisations, it's just a matter of using -O3

Expecting all optimisations to be done below the maximum optimisation level is
... unexpected.


[Bug c/46186] Clang creates code running 1600 times faster than gcc's

2010-10-26 Thread jakub at gcc dot gnu.org
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=46186

Jakub Jelinek jakub at gcc dot gnu.org changed:

   What|Removed |Added

 CC||jakub at gcc dot gnu.org

--- Comment #10 from Jakub Jelinek jakub at gcc dot gnu.org 2010-10-26 
15:37:04 UTC ---
-O2 -fipa-cp-clone should be in theory enough, then f would be normally cloned,
assuming gcc thinks it is from a hot spot.  But this is not a real-world
testcase, the call from main where it happens just once for the process is not
considered a hot spot.


[Bug c/46186] Clang creates code running 1600 times faster than gcc's

2010-10-26 Thread paolo.carlini at oracle dot com
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=46186

--- Comment #11 from Paolo Carlini paolo.carlini at oracle dot com 2010-10-26 
15:42:58 UTC ---
Can we please stop talking about nano and giga numbers like kids? If an
optimization like complete loop unrolling is involved of course very small or
large numbers can be involved, doesn't really contribute anything to the
problem talking about the exact figure.


Re: [Bug c/46186] Clang creates code running 1600 times faster than gcc's

2010-10-26 Thread Andrew Pinski



On Oct 26, 2010, at 7:30 AM, j...@jak-linux.org gcc-bugzi...@gcc.gnu.org 
 wrote:



http://gcc.gnu.org/bugzilla/show_bug.cgi?id=46186

--- Comment #1 from Julian Andres Klode j...@jak-linux.org  
2010-10-26 14:30:24 UTC ---

Created attachment 22162
 -- http://gcc.gnu.org/bugzilla/attachment.cgi?id=22162
Clang's assember


This multiplication transformation is incorrect if the loop wraps  
(unsigned always wraps; never overflows). Gcc is correct in its speed.  
Though -O3 is faster but not because of multiplication but rather  
constant propatagtion. So this bug is invalid and some one should  
report a bug to llvm folks about this.




Attaching the assembler output from clang, it should help understand  
which

optimizations clang does (and improve gcc to do them as well).


[Bug c/46186] Clang creates code running 1600 times faster than gcc's

2010-10-26 Thread pinskia at gmail dot com
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=46186

--- Comment #12 from pinskia at gmail dot com pinskia at gmail dot com 
2010-10-26 15:56:20 UTC ---
On Oct 26, 2010, at 7:30 AM, j...@jak-linux.org gcc-bugzi...@gcc.gnu.org 
  wrote:

 http://gcc.gnu.org/bugzilla/show_bug.cgi?id=46186

 --- Comment #1 from Julian Andres Klode j...@jak-linux.org  
 2010-10-26 14:30:24 UTC ---
 Created attachment 22162
  -- http://gcc.gnu.org/bugzilla/attachment.cgi?id=22162
 Clang's assember

This multiplication transformation is incorrect if the loop wraps  
(unsigned always wraps; never overflows). Gcc is correct in its speed.  
Though -O3 is faster but not because of multiplication but rather  
constant propatagtion. So this bug is invalid and some one should  
report a bug to llvm folks about this.


 Attaching the assembler output from clang, it should help understand  
 which
 optimizations clang does (and improve gcc to do them as well).


[Bug c/46186] Clang creates code running 1600 times faster than gcc's

2010-10-26 Thread dominiq at lps dot ens.fr
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=46186

--- Comment #13 from Dominique d'Humieres dominiq at lps dot ens.fr 
2010-10-26 16:36:05 UTC ---
 This multiplication transformation is incorrect if the loop wraps  
 (unsigned always wraps; never overflows).

I think this is wrong: wrapping is nothing but a modulo 2^n operation (n=64
here) which works for additions and multiplications, so if there is wrapping,
the result is sum=(b*(b-1)-a*(a-1))/2 modulo 2^n, i.e. correctly wrapped.

On my Core2duo 2.53Ghz with -Ofast the run time is ~1.2s for elementary 2*10^9
loops or .6ns/loop or ~1.5 clock cycle per loop. For a perfect vectorization
and no loop overhead, I would expect a minimum of 0.5 clock cycle per loop. If
you get anything below this number, it means that the loop

for (; a  b; a++)
sum += a;

is replaced with sum=(b*(b-1)-a*(a-1))/2 (you can confirm it by checking that
the timing behaves as O(len) or not). Apparently clang does this (valid)
transformation while gcc don't for any options I have tried.

Note that If I write such a loop, it is because I am interested by the timing
of the loop, not by the result I know for more than 40 years!


[Bug c/46186] Clang creates code running 1600 times faster than gcc's

2010-10-26 Thread jakub at gcc dot gnu.org
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=46186

Jakub Jelinek jakub at gcc dot gnu.org changed:

   What|Removed |Added

 Status|UNCONFIRMED |NEW
   Last reconfirmed||2010.10.26 16:59:00
 CC||spop at gcc dot gnu.org
 Ever Confirmed|0   |1

--- Comment #14 from Jakub Jelinek jakub at gcc dot gnu.org 2010-10-26 
16:59:00 UTC ---
For sum += 2 or sum += b sccp handles this, so I wonder whether it couldn't
handle even the sum += a case.  Sebastian?


[Bug c/46186] Clang creates code running 1600 times faster than gcc's

2010-10-26 Thread dominiq at lps dot ens.fr
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=46186

--- Comment #15 from Dominique d'Humieres dominiq at lps dot ens.fr 
2010-10-26 17:15:31 UTC ---
 For sum += 2 or sum += b sccp handles this, so I wonder whether it couldn't
 handle even the sum += a case.

2 and b are constants while a is not. For constants you have to know that the
sum is cst*nloop, while if a is incremented you have to know that it is related
to nloop*(nloop+1)/2 (and if you use sum += a*a, nloop*(nloop+1)*(2*nloop+1)/6,
if sum += a*a*a, nloop^2*(nloop+1)^2/4 and so on).

However is it worth the work?


[Bug c/46186] Clang creates code running 1600 times faster than gcc's

2010-10-26 Thread jakub at gcc dot gnu.org
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=46186

--- Comment #16 from Jakub Jelinek jakub at gcc dot gnu.org 2010-10-26 
18:43:40 UTC ---
chrec_apply is called with
{a_4(D), +, {a_4(D) + 1, +, 1}_1}_1
chrec and ~a_4(D) + b_5(D) in x.
I wonder if this can be fixed just by recognizing such special cases in
chrec_apply (after checking e.g. that it is unsigned computation instead of
signed etc.).


[Bug c/46186] Clang creates code running 1600 times faster than gcc's

2010-10-26 Thread dominiq at lps dot ens.fr
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=46186

--- Comment #17 from Dominique d'Humieres dominiq at lps dot ens.fr 
2010-10-26 18:53:49 UTC ---
Note that clang seems to know the general result: \sum_{i=a}^b p(i)=P(b), where
p(i) is a given polynomial of degree n and P(x) a polynomial of degree n+1 such
that P(x)=P(x-1)+p(x) and P(a)=p(a). At least clang is able to remove the loop
for

sum5 += a*a*a*a*a + 2*a*a*a +5*a;


[Bug c/46186] Clang creates code running 1600 times faster than gcc's

2010-10-26 Thread jakub at gcc dot gnu.org
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=46186

--- Comment #18 from Jakub Jelinek jakub at gcc dot gnu.org 2010-10-26 
19:11:59 UTC ---
I guess you mean LLVM instead of clang, I'm pretty sure the FE doesn't perform
this optimization.

Anyway, given:

#define F(n, exp) \
unsigned long   \
f##n (unsigned long a, unsigned long b) \
{   \
  unsigned long sum = 0;\
  for (; a  b; a++)\
exp;\
  return sum;   \
}

F (1, sum += a)
F (2, sum += 2)
F (3, sum += b)
F (4, sum += a * a)
F (5, sum += a * a * a)
F (6, a * a * a * a * a + 2 * a * a * a + 5 * a)

only the f1/f2/f3 cases make it into chrec_apply (and only f2/f3 are currently
handled there).


[Bug c/46186] Clang creates code running 1600 times faster than gcc's

2010-10-26 Thread joseph at codesourcery dot com
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=46186

--- Comment #19 from joseph at codesourcery dot com joseph at codesourcery dot 
com 2010-10-26 20:29:56 UTC ---
On Tue, 26 Oct 2010, dominiq at lps dot ens.fr wrote:

 --- Comment #13 from Dominique d'Humieres dominiq at lps dot ens.fr 
 2010-10-26 16:36:05 UTC ---
  This multiplication transformation is incorrect if the loop wraps  
  (unsigned always wraps; never overflows).
 
 I think this is wrong: wrapping is nothing but a modulo 2^n operation (n=64
 here) which works for additions and multiplications, so if there is 
 wrapping,
 the result is sum=(b*(b-1)-a*(a-1))/2 modulo 2^n, i.e. correctly wrapped.

It's a bit more complicated than that, in that you can't just compute 
(b*(b-1)-a*(a-1)) mod 2^n, then divide by 2, as that will lose the top 
bit.  (I haven't checked exactly what the generated code is doing here.)


[Bug c/46186] Clang creates code running 1600 times faster than gcc's

2010-10-26 Thread jakub at gcc dot gnu.org
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=46186

--- Comment #20 from Jakub Jelinek jakub at gcc dot gnu.org 2010-10-26 
21:00:11 UTC ---
If I translate the assembly back to C, it seems it is performing part of the
arithmetics in TImode:

unsigned long f (unsigned long a, unsigned long b)
{
  if (a = b)
return 0;
  else
return (a + 1) * (b - 1 - a) + a + (unsigned long)(((unsigned __int128) (b
- 1 - a) * (b - 2 - a))  1);
}


[Bug c/46186] Clang creates code running 1600 times faster than gcc's

2010-10-26 Thread dominiq at lps dot ens.fr
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=46186

--- Comment #21 from Dominique d'Humieres dominiq at lps dot ens.fr 
2010-10-26 21:06:48 UTC ---
 I guess you mean LLVM instead of clang,

Yes, if you prefer. I was referring to the command I used.

 F (6, a * a * a * a * a + 2 * a * a * a + 5 * a)

you probably mean

F (6, sum +=a * a * a * a * a + 2 * a * a * a + 5 * a)

 It's a bit more complicated than that, in that you can't just compute 
 (b*(b-1)-a*(a-1)) mod 2^n, then divide by 2, as that will lose the top 
 bit.  (I haven't checked exactly what the generated code is doing here.)

This is right if you do the multiply before the division. However b or b-1 can
be divided exactly by 2, so you have to do (b/2)*(b-1) if b is even and
b*((b-1)/2) if b is odd. The same result applies for the sum of square and
cubes, although you may be one more trick if 2*b+1 wrap and can be divided
exactly by 3.

I have tested the following variant


[macbook] f90/bug% cat pr46186_db.c
#include stdio.h

#define len 10L

unsigned long f(unsigned long a, unsigned long b) __attribute__((noinline));

int main()
{
unsigned long res;
res = f(2, 2*len);
/*printf(%lu\n, res); */
return 0;
}

unsigned long f(unsigned long a, unsigned long b)
{
unsigned long sum = 0, sum0 = 0, sum2 = 0, sum5 = 0;
for (; a  b; a++)
  {
sum0 += 2;
sum += a;
sum2 += a*a;
sum5 += a*a*a*a*a + 2*a*a*a +5*a;
  }
printf(%lu\n, sum0);
printf(%lu\n, sum);
printf(%lu\n, sum2);
printf(%lu\n, sum5);
return sum;
}
 which gives

[macbook] f90/bug% gcc46 -Ofast -ftree-vectorizer-verbose=2 pr46186_db.c

pr46186_db.c:18: note: LOOP VECTORIZED.
pr46186_db.c:15: note: vectorized 1 loops in function.
[macbook] f90/bug% time a.out
36
189
10262176583330622975
11833798674328980984
18.114u 0.012s 0:18.15 99.8%0+0k 0+0io 0pf+0w
[macbook] f90/bug% clang -O pr46186_db.c
[macbook] f90/bug% time a.out
36
189
10262176583330622975
11833798674328980984
0.000u 0.000s 0:00.00 0.0%0+0k 0+0io 0pf+0w

So the wrapping seems properly handled for the loop replacement.

Now I have also tested the following variant


[macbook] f90/bug% cat pr46186_db_1.c
#include stdio.h

#define len 10L

unsigned long f(unsigned long a, unsigned long b) __attribute__((noinline));

int main()
{
unsigned long res;
res = f(2, 2*len);
/*printf(%lu\n, res); */
return 0;
}

unsigned long f(unsigned long a, unsigned long b)
{
unsigned long sum = 0, sum0 = 0, sum2 = 0, sum5 = 0;
unsigned long a2;
for (; a  b; a++)
  {
sum0 += 2;
sum += a;
a2 = a*a;
sum2 += a2;
sum5 += a*(a2*(a2 + 2) +5);
  }
printf(%lu\n, sum0);
printf(%lu\n, sum);
printf(%lu\n, sum2);
printf(%lu\n, sum5);
return sum;
}

and the timings are

[macbook] f90/bug% gcc46 -Ofast -ftree-vectorizer-verbose=2 pr46186_db_1.c

pr46186_db_1.c:19: note: LOOP VECTORIZED.
pr46186_db_1.c:15: note: vectorized 1 loops in function.
[macbook] f90/bug% time a.out
36
189
10262176583330622975
11833798674328980984
12.227u 0.016s 0:12.36 98.9%0+0k 0+0io 0pf+0w  == was 18.114u 0.012s
0:18.15 without
hand optimization
[macbook] f90/bug% clang -O pr46186_db_1.c
[macbook] f90/bug% time a.out
36
189
10262176583330622975
11833798674328980984
0.000u 0.000s 0:00.00 0.0%0+0k 0+0io 0pf+0w

So clearly gcc is missing some generic optimizations in products.