[Bug c/46186] Clang creates code running 1600 times faster than gcc's
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=46186 Thorsten Glaser tg at mirbsd dot org changed: What|Removed |Added CC||tg at mirbsd dot org --- Comment #22 from Thorsten Glaser tg at mirbsd dot org 2010-10-29 21:06:44 UTC --- (In reply to comment #19) It's a bit more complicated than that, in that you can't just compute (b*(b-1)-a*(a-1)) mod 2^n, then divide by 2, as that will lose the top bit. (I haven't checked exactly what the generated code is doing here.) Haven’t checked either, but in i386 asm, using RCR instead of SHR would work if the carry flag is correct after the sequence calculating that (or SAR).
[Bug c/46186] Clang creates code running 1600 times faster than gcc's
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=46186 --- Comment #1 from Julian Andres Klode j...@jak-linux.org 2010-10-26 14:30:24 UTC --- Created attachment 22162 -- http://gcc.gnu.org/bugzilla/attachment.cgi?id=22162 Clang's assember Attaching the assembler output from clang, it should help understand which optimizations clang does (and improve gcc to do them as well).
[Bug c/46186] Clang creates code running 1600 times faster than gcc's
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=46186 Paolo Carlini paolo.carlini at oracle dot com changed: What|Removed |Added CC||paolo.carlini at oracle dot ||com --- Comment #2 from Paolo Carlini paolo.carlini at oracle dot com 2010-10-26 14:31:47 UTC --- ;)
[Bug c/46186] Clang creates code running 1600 times faster than gcc's
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=46186 --- Comment #3 from Julian Andres Klode j...@jak-linux.org 2010-10-26 14:32:27 UTC --- System information: Using built-in specs. Target: x86_64-linux-gnu Configured with: ../src/configure -v --with-pkgversion='Debian 4.4.5-5' --with-bugurl=file:///usr/share/doc/gcc-4.4/README.Bugs --enable-languages=c,c++,fortran,objc,obj-c++ --prefix=/usr --program-suffix=-4.4 --enable-shared --enable-multiarch --enable-linker-build-id --with-system-zlib --libexecdir=/usr/lib --without-included-gettext --enable-threads=posix --with-gxx-include-dir=/usr/include/c++/4.4 --libdir=/usr/lib --enable-nls --enable-clocale=gnu --enable-libstdcxx-debug --enable-objc-gc --with-arch-32=i586 --with-tune=generic --enable-checking=release --build=x86_64-linux-gnu --host=x86_64-linux-gnu --target=x86_64-linux-gnu Thread model: posix gcc version 4.4.5 (Debian 4.4.5-5)
[Bug c/46186] Clang creates code running 1600 times faster than gcc's
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=46186 --- Comment #4 from Jonathan Wakely redi at gcc dot gnu.org 2010-10-26 14:47:12 UTC --- GCC's output is significantly faster at -O3 or without the noinline attribute
[Bug c/46186] Clang creates code running 1600 times faster than gcc's
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=46186 --- Comment #5 from Julian Andres Klode j...@jak-linux.org 2010-10-26 14:53:24 UTC --- (In reply to comment #4) GCC's output is significantly faster at -O3 or without the noinline attribute I just tested and at -O3, gcc-4.4 creates slow code and gcc-4.5 fast code. At -O2, both are equally fast. The clang 1.1 code at -O2 is as fast as GCC 4.5's code at -O3.
[Bug c/46186] Clang creates code running 1600 times faster than gcc's
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=46186 --- Comment #6 from Dominique d'Humieres dominiq at lps dot ens.fr 2010-10-26 14:59:18 UTC --- You get this kind of speedup if the compiler knows that the result of the loop is sum=(b*(b-1)-a*(a-1))/2 In which case the timing is meaningless (it is 0.000s on my laptop), so is the ratio with the execution of the loop. The basic question is: how much the user's ignorance should be repaired by the optimizer? (A colleague of mine told me that he once audited a CFD code and found that \int_a^b dx/x was evaluated numerically instead of using log(b)-log(a)!-)
[Bug c/46186] Clang creates code running 1600 times faster than gcc's
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=46186 --- Comment #7 from Julian Andres Klode j...@jak-linux.org 2010-10-26 15:00:37 UTC --- (In reply to comment #5) (In reply to comment #4) GCC's output is significantly faster at -O3 or without the noinline attribute I just tested and at -O3, gcc-4.4 creates slow code and gcc-4.5 fast code. At -O2, both are equally fast. The clang 1.1 code at -O2 is as fast as GCC 4.5's code at -O3. Using gcc 4.5 with -O3 may work for the C code, but it does not optimize the C++ code posted at http://lwn.net/Articles/411776/: g++-4.5 -O3: real 0m1.608s clang++ -O2: real 0m0.003s clang++ -Os: real 0m0.003s But I guess the C++ problem might be a different one.
[Bug c/46186] Clang creates code running 1600 times faster than gcc's
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=46186 --- Comment #8 from Julian Andres Klode j...@jak-linux.org 2010-10-26 15:25:56 UTC --- (In reply to comment #6) You get this kind of speedup if the compiler knows that the result of the loop is sum=(b*(b-1)-a*(a-1))/2 In which case the timing is meaningless (it is 0.000s on my laptop), so is the ratio with the execution of the loop. The basic question is: how much the user's ignorance should be repaired by the optimizer? (A colleague of mine told me that he once audited a CFD code and found that \int_a^b dx/x was evaluated numerically instead of using log(b)-log(a)!-) Since the optimization seems to be mostly there in -O3, it's just a matter of enabling it in -O2. I just found out that it does not optimize if you call f() via a global function pointer, it still takes 1.6 seconds despite being compiled at -O3, whereas clang can optimize it to 0.001s.
[Bug c/46186] Clang creates code running 1600 times faster than gcc's
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=46186 --- Comment #9 from Jonathan Wakely redi at gcc dot gnu.org 2010-10-26 15:28:51 UTC --- (In reply to comment #8) Since the optimization seems to be mostly there in -O3, it's just a matter of enabling it in -O2. Or if you want all optimisations, it's just a matter of using -O3 Expecting all optimisations to be done below the maximum optimisation level is ... unexpected.
[Bug c/46186] Clang creates code running 1600 times faster than gcc's
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=46186 Jakub Jelinek jakub at gcc dot gnu.org changed: What|Removed |Added CC||jakub at gcc dot gnu.org --- Comment #10 from Jakub Jelinek jakub at gcc dot gnu.org 2010-10-26 15:37:04 UTC --- -O2 -fipa-cp-clone should be in theory enough, then f would be normally cloned, assuming gcc thinks it is from a hot spot. But this is not a real-world testcase, the call from main where it happens just once for the process is not considered a hot spot.
[Bug c/46186] Clang creates code running 1600 times faster than gcc's
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=46186 --- Comment #11 from Paolo Carlini paolo.carlini at oracle dot com 2010-10-26 15:42:58 UTC --- Can we please stop talking about nano and giga numbers like kids? If an optimization like complete loop unrolling is involved of course very small or large numbers can be involved, doesn't really contribute anything to the problem talking about the exact figure.
Re: [Bug c/46186] Clang creates code running 1600 times faster than gcc's
On Oct 26, 2010, at 7:30 AM, j...@jak-linux.org gcc-bugzi...@gcc.gnu.org wrote: http://gcc.gnu.org/bugzilla/show_bug.cgi?id=46186 --- Comment #1 from Julian Andres Klode j...@jak-linux.org 2010-10-26 14:30:24 UTC --- Created attachment 22162 -- http://gcc.gnu.org/bugzilla/attachment.cgi?id=22162 Clang's assember This multiplication transformation is incorrect if the loop wraps (unsigned always wraps; never overflows). Gcc is correct in its speed. Though -O3 is faster but not because of multiplication but rather constant propatagtion. So this bug is invalid and some one should report a bug to llvm folks about this. Attaching the assembler output from clang, it should help understand which optimizations clang does (and improve gcc to do them as well).
[Bug c/46186] Clang creates code running 1600 times faster than gcc's
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=46186 --- Comment #12 from pinskia at gmail dot com pinskia at gmail dot com 2010-10-26 15:56:20 UTC --- On Oct 26, 2010, at 7:30 AM, j...@jak-linux.org gcc-bugzi...@gcc.gnu.org wrote: http://gcc.gnu.org/bugzilla/show_bug.cgi?id=46186 --- Comment #1 from Julian Andres Klode j...@jak-linux.org 2010-10-26 14:30:24 UTC --- Created attachment 22162 -- http://gcc.gnu.org/bugzilla/attachment.cgi?id=22162 Clang's assember This multiplication transformation is incorrect if the loop wraps (unsigned always wraps; never overflows). Gcc is correct in its speed. Though -O3 is faster but not because of multiplication but rather constant propatagtion. So this bug is invalid and some one should report a bug to llvm folks about this. Attaching the assembler output from clang, it should help understand which optimizations clang does (and improve gcc to do them as well).
[Bug c/46186] Clang creates code running 1600 times faster than gcc's
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=46186 --- Comment #13 from Dominique d'Humieres dominiq at lps dot ens.fr 2010-10-26 16:36:05 UTC --- This multiplication transformation is incorrect if the loop wraps (unsigned always wraps; never overflows). I think this is wrong: wrapping is nothing but a modulo 2^n operation (n=64 here) which works for additions and multiplications, so if there is wrapping, the result is sum=(b*(b-1)-a*(a-1))/2 modulo 2^n, i.e. correctly wrapped. On my Core2duo 2.53Ghz with -Ofast the run time is ~1.2s for elementary 2*10^9 loops or .6ns/loop or ~1.5 clock cycle per loop. For a perfect vectorization and no loop overhead, I would expect a minimum of 0.5 clock cycle per loop. If you get anything below this number, it means that the loop for (; a b; a++) sum += a; is replaced with sum=(b*(b-1)-a*(a-1))/2 (you can confirm it by checking that the timing behaves as O(len) or not). Apparently clang does this (valid) transformation while gcc don't for any options I have tried. Note that If I write such a loop, it is because I am interested by the timing of the loop, not by the result I know for more than 40 years!
[Bug c/46186] Clang creates code running 1600 times faster than gcc's
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=46186 Jakub Jelinek jakub at gcc dot gnu.org changed: What|Removed |Added Status|UNCONFIRMED |NEW Last reconfirmed||2010.10.26 16:59:00 CC||spop at gcc dot gnu.org Ever Confirmed|0 |1 --- Comment #14 from Jakub Jelinek jakub at gcc dot gnu.org 2010-10-26 16:59:00 UTC --- For sum += 2 or sum += b sccp handles this, so I wonder whether it couldn't handle even the sum += a case. Sebastian?
[Bug c/46186] Clang creates code running 1600 times faster than gcc's
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=46186 --- Comment #15 from Dominique d'Humieres dominiq at lps dot ens.fr 2010-10-26 17:15:31 UTC --- For sum += 2 or sum += b sccp handles this, so I wonder whether it couldn't handle even the sum += a case. 2 and b are constants while a is not. For constants you have to know that the sum is cst*nloop, while if a is incremented you have to know that it is related to nloop*(nloop+1)/2 (and if you use sum += a*a, nloop*(nloop+1)*(2*nloop+1)/6, if sum += a*a*a, nloop^2*(nloop+1)^2/4 and so on). However is it worth the work?
[Bug c/46186] Clang creates code running 1600 times faster than gcc's
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=46186 --- Comment #16 from Jakub Jelinek jakub at gcc dot gnu.org 2010-10-26 18:43:40 UTC --- chrec_apply is called with {a_4(D), +, {a_4(D) + 1, +, 1}_1}_1 chrec and ~a_4(D) + b_5(D) in x. I wonder if this can be fixed just by recognizing such special cases in chrec_apply (after checking e.g. that it is unsigned computation instead of signed etc.).
[Bug c/46186] Clang creates code running 1600 times faster than gcc's
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=46186 --- Comment #17 from Dominique d'Humieres dominiq at lps dot ens.fr 2010-10-26 18:53:49 UTC --- Note that clang seems to know the general result: \sum_{i=a}^b p(i)=P(b), where p(i) is a given polynomial of degree n and P(x) a polynomial of degree n+1 such that P(x)=P(x-1)+p(x) and P(a)=p(a). At least clang is able to remove the loop for sum5 += a*a*a*a*a + 2*a*a*a +5*a;
[Bug c/46186] Clang creates code running 1600 times faster than gcc's
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=46186 --- Comment #18 from Jakub Jelinek jakub at gcc dot gnu.org 2010-10-26 19:11:59 UTC --- I guess you mean LLVM instead of clang, I'm pretty sure the FE doesn't perform this optimization. Anyway, given: #define F(n, exp) \ unsigned long \ f##n (unsigned long a, unsigned long b) \ { \ unsigned long sum = 0;\ for (; a b; a++)\ exp;\ return sum; \ } F (1, sum += a) F (2, sum += 2) F (3, sum += b) F (4, sum += a * a) F (5, sum += a * a * a) F (6, a * a * a * a * a + 2 * a * a * a + 5 * a) only the f1/f2/f3 cases make it into chrec_apply (and only f2/f3 are currently handled there).
[Bug c/46186] Clang creates code running 1600 times faster than gcc's
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=46186 --- Comment #19 from joseph at codesourcery dot com joseph at codesourcery dot com 2010-10-26 20:29:56 UTC --- On Tue, 26 Oct 2010, dominiq at lps dot ens.fr wrote: --- Comment #13 from Dominique d'Humieres dominiq at lps dot ens.fr 2010-10-26 16:36:05 UTC --- This multiplication transformation is incorrect if the loop wraps (unsigned always wraps; never overflows). I think this is wrong: wrapping is nothing but a modulo 2^n operation (n=64 here) which works for additions and multiplications, so if there is wrapping, the result is sum=(b*(b-1)-a*(a-1))/2 modulo 2^n, i.e. correctly wrapped. It's a bit more complicated than that, in that you can't just compute (b*(b-1)-a*(a-1)) mod 2^n, then divide by 2, as that will lose the top bit. (I haven't checked exactly what the generated code is doing here.)
[Bug c/46186] Clang creates code running 1600 times faster than gcc's
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=46186 --- Comment #20 from Jakub Jelinek jakub at gcc dot gnu.org 2010-10-26 21:00:11 UTC --- If I translate the assembly back to C, it seems it is performing part of the arithmetics in TImode: unsigned long f (unsigned long a, unsigned long b) { if (a = b) return 0; else return (a + 1) * (b - 1 - a) + a + (unsigned long)(((unsigned __int128) (b - 1 - a) * (b - 2 - a)) 1); }
[Bug c/46186] Clang creates code running 1600 times faster than gcc's
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=46186 --- Comment #21 from Dominique d'Humieres dominiq at lps dot ens.fr 2010-10-26 21:06:48 UTC --- I guess you mean LLVM instead of clang, Yes, if you prefer. I was referring to the command I used. F (6, a * a * a * a * a + 2 * a * a * a + 5 * a) you probably mean F (6, sum +=a * a * a * a * a + 2 * a * a * a + 5 * a) It's a bit more complicated than that, in that you can't just compute (b*(b-1)-a*(a-1)) mod 2^n, then divide by 2, as that will lose the top bit. (I haven't checked exactly what the generated code is doing here.) This is right if you do the multiply before the division. However b or b-1 can be divided exactly by 2, so you have to do (b/2)*(b-1) if b is even and b*((b-1)/2) if b is odd. The same result applies for the sum of square and cubes, although you may be one more trick if 2*b+1 wrap and can be divided exactly by 3. I have tested the following variant [macbook] f90/bug% cat pr46186_db.c #include stdio.h #define len 10L unsigned long f(unsigned long a, unsigned long b) __attribute__((noinline)); int main() { unsigned long res; res = f(2, 2*len); /*printf(%lu\n, res); */ return 0; } unsigned long f(unsigned long a, unsigned long b) { unsigned long sum = 0, sum0 = 0, sum2 = 0, sum5 = 0; for (; a b; a++) { sum0 += 2; sum += a; sum2 += a*a; sum5 += a*a*a*a*a + 2*a*a*a +5*a; } printf(%lu\n, sum0); printf(%lu\n, sum); printf(%lu\n, sum2); printf(%lu\n, sum5); return sum; } which gives [macbook] f90/bug% gcc46 -Ofast -ftree-vectorizer-verbose=2 pr46186_db.c pr46186_db.c:18: note: LOOP VECTORIZED. pr46186_db.c:15: note: vectorized 1 loops in function. [macbook] f90/bug% time a.out 36 189 10262176583330622975 11833798674328980984 18.114u 0.012s 0:18.15 99.8%0+0k 0+0io 0pf+0w [macbook] f90/bug% clang -O pr46186_db.c [macbook] f90/bug% time a.out 36 189 10262176583330622975 11833798674328980984 0.000u 0.000s 0:00.00 0.0%0+0k 0+0io 0pf+0w So the wrapping seems properly handled for the loop replacement. Now I have also tested the following variant [macbook] f90/bug% cat pr46186_db_1.c #include stdio.h #define len 10L unsigned long f(unsigned long a, unsigned long b) __attribute__((noinline)); int main() { unsigned long res; res = f(2, 2*len); /*printf(%lu\n, res); */ return 0; } unsigned long f(unsigned long a, unsigned long b) { unsigned long sum = 0, sum0 = 0, sum2 = 0, sum5 = 0; unsigned long a2; for (; a b; a++) { sum0 += 2; sum += a; a2 = a*a; sum2 += a2; sum5 += a*(a2*(a2 + 2) +5); } printf(%lu\n, sum0); printf(%lu\n, sum); printf(%lu\n, sum2); printf(%lu\n, sum5); return sum; } and the timings are [macbook] f90/bug% gcc46 -Ofast -ftree-vectorizer-verbose=2 pr46186_db_1.c pr46186_db_1.c:19: note: LOOP VECTORIZED. pr46186_db_1.c:15: note: vectorized 1 loops in function. [macbook] f90/bug% time a.out 36 189 10262176583330622975 11833798674328980984 12.227u 0.016s 0:12.36 98.9%0+0k 0+0io 0pf+0w == was 18.114u 0.012s 0:18.15 without hand optimization [macbook] f90/bug% clang -O pr46186_db_1.c [macbook] f90/bug% time a.out 36 189 10262176583330622975 11833798674328980984 0.000u 0.000s 0:00.00 0.0%0+0k 0+0io 0pf+0w So clearly gcc is missing some generic optimizations in products.