[Bug tree-optimization/98544] [11 regression] Wrong code generated by tree vectorizer since r11-3917-g28290cb50c7dbf87
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98544 --- Comment #23 from rguenther at suse dot de --- On January 8, 2021 3:07:48 PM GMT+01:00, "mar...@mpa-garching.mpg.de" wrote: >https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98544 > >--- Comment #22 from Martin Reinecke --- >Brilliant, thank you very much for tracking this one down! >My FFT library now works correctly again with all optimizations >enabled, which >is a great relief. The scipy maintainers will be happy that they won't >need to >fiddle with flags depending on gcc versions :) Thanks again for testing and reporting issues before we released GCC 11!
[Bug tree-optimization/98544] [11 regression] Wrong code generated by tree vectorizer since r11-3917-g28290cb50c7dbf87
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98544 --- Comment #22 from Martin Reinecke --- Brilliant, thank you very much for tracking this one down! My FFT library now works correctly again with all optimizations enabled, which is a great relief. The scipy maintainers will be happy that they won't need to fiddle with flags depending on gcc versions :)
[Bug tree-optimization/98544] [11 regression] Wrong code generated by tree vectorizer since r11-3917-g28290cb50c7dbf87
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98544 Richard Biener changed: What|Removed |Added Resolution|--- |FIXED Status|ASSIGNED|RESOLVED --- Comment #21 from Richard Biener --- Fixed.
[Bug tree-optimization/98544] [11 regression] Wrong code generated by tree vectorizer since r11-3917-g28290cb50c7dbf87
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98544 --- Comment #20 from CVS Commits --- The master branch has been updated by Richard Biener : https://gcc.gnu.org/g:bdcde1504502719504a7a63ab10059e171694dc2 commit r11-6549-gbdcde1504502719504a7a63ab10059e171694dc2 Author: Richard Biener Date: Fri Jan 8 13:17:18 2021 +0100 tree-optimization/98544 - more permute optimization fixes Permute nodes are not transparent to the permute of their children. Instead we have to materialize child permutes always and in future may treat permute nodes as the source of arbitrary permutes as we can permute the lane permutation vector at will (as the target supports in the end). 2021-01-08 Richard Biener PR tree-optimization/98544 * tree-vect-slp.c (vect_optimize_slp): Always materialize permutes at a permute node. * gcc.dg/vect/bb-slp-pr98544.c: New testcase.
[Bug tree-optimization/98544] [11 regression] Wrong code generated by tree vectorizer since r11-3917-g28290cb50c7dbf87
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98544 --- Comment #19 from Martin Liška --- > err > > /* { dg-do run } */ > > double a[2], b[2], c[2], d[2]; > > void __attribute__((noipa)) > foo() > { > double a0 = a[0]; > double a1 = a[1]; > double b0 = b[0]; > double b1 = b[1]; > double c0 = c[0]; > double c1 = c[1]; > double tem1 = a1 - b1; > double tem2 = a0 + b0; > d[0] = tem1 * c1; > d[1] = tem2 * c0; > } > > int main() > { > a[0] = 1.; > a[1] = 2.; > b[0] = 3.; > b[1] = 4.; > c[0] = 2.; > c[1] = 3.; > foo (); > if (d[0] != -6. || d[1] != 8.) > __builtin_abort (); > return 0; > } This started to fail with -O3 since r11-3823-g126ed72b9f48f853 if it helps.
[Bug tree-optimization/98544] [11 regression] Wrong code generated by tree vectorizer since r11-3917-g28290cb50c7dbf87
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98544 --- Comment #18 from Richard Biener --- (In reply to Richard Biener from comment #17) > /* { dg-do run } */ > > double a[4], b[2]; > > void __attribute__((noipa)) > foo () > { > double a0 = a[0]; > double a1 = a[1]; > double a2 = a[2]; > double a3 = a[3]; > b[0] = a1 - a3; > b[1] = a0 + a2; > } > > int main() > { > a[0] = 1.; > a[1] = 2.; > a[2] = 3.; > a[3] = 4.; > foo (); > if (b[0] != -2 || b[1] != 4) > __builtin_abort (); > return 0; > } err /* { dg-do run } */ double a[2], b[2], c[2], d[2]; void __attribute__((noipa)) foo() { double a0 = a[0]; double a1 = a[1]; double b0 = b[0]; double b1 = b[1]; double c0 = c[0]; double c1 = c[1]; double tem1 = a1 - b1; double tem2 = a0 + b0; d[0] = tem1 * c1; d[1] = tem2 * c0; } int main() { a[0] = 1.; a[1] = 2.; b[0] = 3.; b[1] = 4.; c[0] = 2.; c[1] = 3.; foo (); if (d[0] != -6. || d[1] != 8.) __builtin_abort (); return 0; }
[Bug tree-optimization/98544] [11 regression] Wrong code generated by tree vectorizer since r11-3917-g28290cb50c7dbf87
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98544 --- Comment #17 from Richard Biener --- /* { dg-do run } */ double a[4], b[2]; void __attribute__((noipa)) foo () { double a0 = a[0]; double a1 = a[1]; double a2 = a[2]; double a3 = a[3]; b[0] = a1 - a3; b[1] = a0 + a2; } int main() { a[0] = 1.; a[1] = 2.; a[2] = 3.; a[3] = 4.; foo (); if (b[0] != -2 || b[1] != 4) __builtin_abort (); return 0; }
[Bug tree-optimization/98544] [11 regression] Wrong code generated by tree vectorizer since r11-3917-g28290cb50c7dbf87
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98544 --- Comment #16 from Richard Biener --- Interestingly trunk doesn't exhibit the issue with radb2 but radb4 and radb3 and radb5 still do ... (radb3 being the next "smallest") but a similarly stripped down testcase doesn't exhibit problems either (at length 15 as reported from the full testcase). *sigh*
[Bug tree-optimization/98544] [11 regression] Wrong code generated by tree vectorizer since r11-3917-g28290cb50c7dbf87
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98544 --- Comment #15 from Martin Reinecke --- "Problem at length N" means that the FFT of length N is computed incorrectly. Also, N==l1*ido*x. For an FFT of length N, the computation is broken down into several passes. Let's take N=15. First the prome factors of N are computed, in this case 3 and 5. The (simplified) procedure is then: l1=1; // first pass with x=3 x=3; ido=N/(x*l1); radb3(ido, l1, p1, p2, ); swap (p1,p2); l1*=x; x=5; ido=N/(x*l1); radb5(ido, l1, p1, p2, ); swap (p1,p2);
[Bug tree-optimization/98544] [11 regression] Wrong code generated by tree vectorizer since r11-3917-g28290cb50c7dbf87
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98544 --- Comment #14 from Richard Biener --- So that makes it sth like the following then - still not miscomparing. What's the "problem at length N" hinting at? Does N somehow make ido/l1 different? typedef unsigned long size_t; template inline void PM(T &a, T &b, T c, T d) { a=c+d; b=c-d; } template inline void MULPM (T1 &a, T1 &b, T2 c, T2 d, T3 e, T3 f) { a=c*e+d*f; b=c*f-d*e; } typedef double T; void __attribute__((noipa)) radb2(size_t ido, size_t l1, const T * __restrict cc, T * __restrict ch, const T * __restrict wa) { auto WA = [wa,ido](size_t x, size_t i) { return wa[i+x*(ido-1)]; }; auto CC = [cc,ido](size_t a, size_t b, size_t c) -> const T& { return cc[a+ido*(b+2*c)]; }; auto CH = [ch,ido,l1](size_t a, size_t b, size_t c) -> T& { return ch[a+ido*(b+l1*c)]; }; for (size_t k=0; k
[Bug tree-optimization/98544] [11 regression] Wrong code generated by tree vectorizer since r11-3917-g28290cb50c7dbf87
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98544 --- Comment #13 from Martin Reinecke --- > What kind of shape (w/o too much guessing) is the function expecting for its > input arrays? For radb the size of the cc and ch arrays is l1*ido*x. Size of wa is (x-1)*ido.
[Bug tree-optimization/98544] [11 regression] Wrong code generated by tree vectorizer since r11-3917-g28290cb50c7dbf87
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98544 --- Comment #12 from Richard Biener --- So at least the following doesn't produce any different answers for me comparing -O -ftree-loop-vectorize -mavx to -O0 typedef unsigned long size_t; template inline void PM(T &a, T &b, T c, T d) { a=c+d; b=c-d; } template inline void MULPM (T1 &a, T1 &b, T2 c, T2 d, T3 e, T3 f) { a=c*e+d*f; b=c*f-d*e; } typedef double T; void __attribute__((noipa)) radb2(size_t ido, size_t l1, const T * __restrict cc, T * __restrict ch, const T * __restrict wa) { auto WA = [wa,ido](size_t x, size_t i) { return wa[i+x*(ido-1)]; }; auto CC = [cc,ido](size_t a, size_t b, size_t c) -> const T& { return cc[a+ido*(b+2*c)]; }; auto CH = [ch,ido,l1](size_t a, size_t b, size_t c) -> T& { return ch[a+ido*(b+l1*c)]; }; for (size_t k=0; k
[Bug tree-optimization/98544] [11 regression] Wrong code generated by tree vectorizer since r11-3917-g28290cb50c7dbf87
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98544 --- Comment #11 from Richard Biener --- Or massaging the C++ to produce a single-function testcase (driver still missing, I think we can remove the "unrelated" loops as well). What kind of shape (w/o too much guessing) is the function expecting for its input arrays? typedef unsigned long size_t; template inline void PM(T &a, T &b, T c, T d) { a=c+d; b=c-d; } template inline void MULPM (T1 &a, T1 &b, T2 c, T2 d, T3 e, T3 f) { a=c*e+d*f; b=c*f-d*e; } typedef double T; void radb2(size_t ido, size_t l1, const T * __restrict cc, T * __restrict ch, const T * __restrict wa) { auto WA = [wa,ido](size_t x, size_t i) { return wa[i+x*(ido-1)]; }; auto CC = [cc,ido](size_t a, size_t b, size_t c) -> const T& { return cc[a+ido*(b+2*c)]; }; auto CH = [ch,ido,l1](size_t a, size_t b, size_t c) -> T& { return ch[a+ido*(b+l1*c)]; }; for (size_t k=0; k
[Bug tree-optimization/98544] [11 regression] Wrong code generated by tree vectorizer since r11-3917-g28290cb50c7dbf87
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98544 --- Comment #10 from Richard Biener --- Created attachment 49912 --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=49912&action=edit GIMPLE testcase for one of the miscompiled(?) loops It shows similar vectorized / analysis IL as for the big testcase with -O -ftree-loop-vectorize -mavx2 -fgimple
[Bug tree-optimization/98544] [11 regression] Wrong code generated by tree vectorizer since r11-3917-g28290cb50c7dbf87
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98544 --- Comment #9 from Richard Biener --- (In reply to Richard Biener from comment #8) > (In reply to Martin Liška from comment #2) > > Confirmed, one can reduce that to a single loop vectorization: > > > > $ g++ bug2.cc -std=c++17 -O1 -mavx -ftree-loop-vectorize > > -fdbg-cnt=vect_loop:10-10 && ./a.out > > > > but the loop is quite huge. > > btw, 11-11 or 12-12 or 13-13 also is enough individually to trigger a > miscompare. > The 11-11 loop looks smallest to me: > > ***dbgcnt: lower limit 11 reached for vect_loop.*** > ***dbgcnt: upper limit 11 reached for vect_loop.*** > fft1d.h:1256:23: optimized: loop vectorized using 32 byte vectors > fft1d.h:1256:23: optimized: loop versioned for vectorization because of > possible aliasing > > it also only needs a single alias check (just guessing where things may go > wrong) > > The source corresponds to > > template void radb2(size_t ido, size_t l1, > const T * DUCC0_RESTRICT cc, T * DUCC0_RESTRICT ch, > const T0 * DUCC0_RESTRICT wa) const > { > auto WA = [wa,ido](size_t x, size_t i) { return wa[i+x*(ido-1)]; }; > auto CC = [cc,ido](size_t a, size_t b, size_t c) -> const T& > { return cc[a+ido*(b+2*c)]; }; > auto CH = [ch,ido,l1](size_t a, size_t b, size_t c) -> T& > { return ch[a+ido*(b+l1*c)]; }; > > for (size_t k=0; k PM (CH(0,k,0),CH(0,k,1),CC(0,0,k),CC(ido-1,1,k)); > if ((ido&1)==0) > for (size_t k=0; k { > CH(ido-1,k,0) = T0( 2)*CC(ido-1,0,k); > CH(ido-1,k,1) = T0(-2)*CC(0,1,k); > } > if (ido<=2) return; > for (size_t k=0; k > this loop > for (size_t i=2; i { > size_t ic=ido-i; > T ti2, tr2; > PM (CH(i-1,k,0),tr2,CC(i-1,0,k),CC(ic-1,1,k)); > PM (ti2,CH(i ,k,0),CC(i ,0,k),CC(ic ,1,k)); > MULPM (CH(i,k,1),CH(i-1,k,1),WA(0,i-2),WA(0,i-1),ti2,tr2); > } > < > } Notably the access functions end up with large (negative) initial values and we have a negative step (more negative step stuff is vectorized with GCC 11 now!) Creating dr for *_123 analyze_innermost: success. - base_address: (const double &) cc_30(D) + (sizetype) long unsigned int) ido_29(D) + _135) + 18446744073709551614) * 8) + base_address: (const double &) cc_30(D) + (sizetype) (((long unsigned int) ido_29(D) + _120) * 8) offset from base address: 0 - constant offset from base address: 0 + constant offset from base address: -24(OVF) step: -16(OVF) base alignment: 8 base misalignment: 0 offset alignment: 256 step alignment: 16 - base_object: *(const double &) cc_30(D) + (sizetype) long unsigned int) ido_29(D) + _135) + 18446744073709551614) * 8) - Access function 0: {0, +, 18446744073709551600}_4 + base_object: *(const double &) cc_30(D) + (sizetype) (((long unsigned int) ido_29(D) + _120) * 8) + Access function 0: {18446744073709551592, +, 18446744073709551600}_4 we now use SLP for this loop (great!) while we previously failed to vectorize this loop at all. We can disable epilouge vectorization with no (good) effect. The function in question is ducc0::detail_fft::rfftp::radb2, it should be possible to isolate this kernel into a testcase, I will try to do this from the GIMPLE IL.
[Bug tree-optimization/98544] [11 regression] Wrong code generated by tree vectorizer since r11-3917-g28290cb50c7dbf87
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98544 --- Comment #8 from Richard Biener --- (In reply to Martin Liška from comment #2) > Confirmed, one can reduce that to a single loop vectorization: > > $ g++ bug2.cc -std=c++17 -O1 -mavx -ftree-loop-vectorize > -fdbg-cnt=vect_loop:10-10 && ./a.out > > but the loop is quite huge. btw, 11-11 or 12-12 or 13-13 also is enough individually to trigger a miscompare. The 11-11 loop looks smallest to me: ***dbgcnt: lower limit 11 reached for vect_loop.*** ***dbgcnt: upper limit 11 reached for vect_loop.*** fft1d.h:1256:23: optimized: loop vectorized using 32 byte vectors fft1d.h:1256:23: optimized: loop versioned for vectorization because of possible aliasing it also only needs a single alias check (just guessing where things may go wrong) The source corresponds to template void radb2(size_t ido, size_t l1, const T * DUCC0_RESTRICT cc, T * DUCC0_RESTRICT ch, const T0 * DUCC0_RESTRICT wa) const { auto WA = [wa,ido](size_t x, size_t i) { return wa[i+x*(ido-1)]; }; auto CC = [cc,ido](size_t a, size_t b, size_t c) -> const T& { return cc[a+ido*(b+2*c)]; }; auto CH = [ch,ido,l1](size_t a, size_t b, size_t c) -> T& { return ch[a+ido*(b+l1*c)]; }; for (size_t k=0; k this loop for (size_t i=2; i
[Bug tree-optimization/98544] [11 regression] Wrong code generated by tree vectorizer since r11-3917-g28290cb50c7dbf87
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98544 --- Comment #7 from Richard Biener --- (In reply to Richard Biener from comment #5) > (In reply to Martin Liška from comment #3) > > For g++ bug2.cc -std=c++17 -O1 -mavx -ftree-loop-vectorize, it started with > > r11-3917-g28290cb50c7dbf87. > > Are you sure about this bisection? Reverting this doesn't fix the issue > (reversion isn't quite easy since split_constant_offset has undergone quite > some refactoring meanwhile). So I can confirm the bisection. I see quite some good effect of the change with no "obvious" mistakes done though so it might just trigger some latent issue elsewhere.
[Bug tree-optimization/98544] [11 regression] Wrong code generated by tree vectorizer since r11-3917-g28290cb50c7dbf87
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98544 --- Comment #6 from Martin Liška --- (In reply to Richard Biener from comment #5) > (In reply to Martin Liška from comment #3) > > For g++ bug2.cc -std=c++17 -O1 -mavx -ftree-loop-vectorize, it started with > > r11-3917-g28290cb50c7dbf87. > > Are you sure about this bisection? Yes, it's correct. I've just verified that. > Reverting this doesn't fix the issue > (reversion isn't quite easy since split_constant_offset has undergone quite > some refactoring meanwhile). I'm gonna bisect until when the reversion worked on the trunk.
[Bug tree-optimization/98544] [11 regression] Wrong code generated by tree vectorizer since r11-3917-g28290cb50c7dbf87
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98544 --- Comment #5 from Richard Biener --- (In reply to Martin Liška from comment #3) > For g++ bug2.cc -std=c++17 -O1 -mavx -ftree-loop-vectorize, it started with > r11-3917-g28290cb50c7dbf87. Are you sure about this bisection? Reverting this doesn't fix the issue (reversion isn't quite easy since split_constant_offset has undergone quite some refactoring meanwhile).
[Bug tree-optimization/98544] [11 regression] Wrong code generated by tree vectorizer since r11-3917-g28290cb50c7dbf87
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98544 Richard Biener changed: What|Removed |Added Target Milestone|--- |11.0 CC||rsandifo at gcc dot gnu.org Priority|P3 |P1 Assignee|unassigned at gcc dot gnu.org |rguenth at gcc dot gnu.org Status|NEW |ASSIGNED --- Comment #4 from Richard Biener --- split_constant_offset? Meh.
[Bug tree-optimization/98544] [11 regression] Wrong code generated by tree vectorizer since r11-3917-g28290cb50c7dbf87
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98544 Martin Liška changed: What|Removed |Added Summary|[11 regression] Wrong code |[11 regression] Wrong code |generated by tree |generated by tree |vectorizer |vectorizer since ||r11-3917-g28290cb50c7dbf87 --- Comment #3 from Martin Liška --- For g++ bug2.cc -std=c++17 -O1 -mavx -ftree-loop-vectorize, it started with r11-3917-g28290cb50c7dbf87.
[Bug tree-optimization/98544] [11 regression] Wrong code generated by tree vectorizer
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98544 Martin Liška changed: What|Removed |Added CC||marxin at gcc dot gnu.org Status|UNCONFIRMED |NEW Last reconfirmed||2021-01-06 Ever confirmed|0 |1 --- Comment #2 from Martin Liška --- Confirmed, one can reduce that to a single loop vectorization: $ g++ bug2.cc -std=c++17 -O1 -mavx -ftree-loop-vectorize -fdbg-cnt=vect_loop:10-10 && ./a.out but the loop is quite huge.
[Bug tree-optimization/98544] [11 regression] Wrong code generated by tree vectorizer
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98544 --- Comment #1 from Martin Reinecke --- Problem seems to be related to the use of __restrict__. If I remove the DUCC0_RESTRICT from the function definitions of "radb3", "radb4" etc., the problem goes away. However I don't see where I'm violating the promise made by __restrict__ in these functions ...