https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98544
--- Comment #9 from Richard Biener <rguenth at gcc dot gnu.org> ---
(In reply to Richard Biener from comment #8)
> (In reply to Martin Liška from comment #2)
> > Confirmed, one can reduce that to a single loop vectorization:
> >
> > $ g++ bug2.cc -std=c++17 -O1 -mavx -ftree-loop-vectorize
> > -fdbg-cnt=vect_loop:10-10 && ./a.out
> >
> > but the loop is quite huge.
>
> btw, 11-11 or 12-12 or 13-13 also is enough individually to trigger a
> miscompare.
> The 11-11 loop looks smallest to me:
>
> ***dbgcnt: lower limit 11 reached for vect_loop.***
> ***dbgcnt: upper limit 11 reached for vect_loop.***
> fft1d.h:1256:23: optimized: loop vectorized using 32 byte vectors
> fft1d.h:1256:23: optimized: loop versioned for vectorization because of
> possible aliasing
>
> it also only needs a single alias check (just guessing where things may go
> wrong)
>
> The source corresponds to
>
> template<typename T> void radb2(size_t ido, size_t l1,
>   const T * DUCC0_RESTRICT cc, T * DUCC0_RESTRICT ch,
>   const T0 * DUCC0_RESTRICT wa) const
>   {
>   auto WA = [wa,ido](size_t x, size_t i) { return wa[i+x*(ido-1)]; };
>   auto CC = [cc,ido](size_t a, size_t b, size_t c) -> const T&
>     { return cc[a+ido*(b+2*c)]; };
>   auto CH = [ch,ido,l1](size_t a, size_t b, size_t c) -> T&
>     { return ch[a+ido*(b+l1*c)]; };
>
>   for (size_t k=0; k<l1; k++)
>     PM (CH(0,k,0),CH(0,k,1),CC(0,0,k),CC(ido-1,1,k));
>   if ((ido&1)==0)
>     for (size_t k=0; k<l1; k++)
>       {
>       CH(ido-1,k,0) = T0( 2)*CC(ido-1,0,k);
>       CH(ido-1,k,1) = T0(-2)*CC(0    ,1,k);
>       }
>   if (ido<=2) return;
>   for (size_t k=0; k<l1;++k)
> ====> this loop
>     for (size_t i=2; i<ido; i+=2)
>       {
>       size_t ic=ido-i;
>       T ti2, tr2;
>       PM (CH(i-1,k,0),tr2,CC(i-1,0,k),CC(ic-1,1,k));
>       PM (ti2,CH(i  ,k,0),CC(i  ,0,k),CC(ic  ,1,k));
>       MULPM (CH(i,k,1),CH(i-1,k,1),WA(0,i-2),WA(0,i-1),ti2,tr2);
>       }
> <====
>   }

Notably the access functions end up with large (negative) initial values
and we have a negative step (more negative step stuff is vectorized with
GCC 11 now!)

 Creating dr for *_123
 analyze_innermost: success.
-       base_address: (const double &) cc_30(D) + (sizetype) ((((long unsigned int) ido_29(D) + _135) + 18446744073709551614) * 8)
+       base_address: (const double &) cc_30(D) + (sizetype) (((long unsigned int) ido_29(D) + _120) * 8)
        offset from base address: 0
-       constant offset from base address: 0
+       constant offset from base address: -24(OVF)
        step: -16(OVF)
        base alignment: 8
        base misalignment: 0
        offset alignment: 256
        step alignment: 16
-       base_object: *(const double &) cc_30(D) + (sizetype) ((((long unsigned int) ido_29(D) + _135) + 18446744073709551614) * 8)
-       Access function 0: {0, +, 18446744073709551600}_4
+       base_object: *(const double &) cc_30(D) + (sizetype) (((long unsigned int) ido_29(D) + _120) * 8)
+       Access function 0: {18446744073709551592, +, 18446744073709551600}_4

We now use SLP for this loop (great!) while we previously failed to vectorize
this loop at all.  Disabling epilogue vectorization has no (good) effect.

The function in question is ducc0::detail_fft::rfftp<double>::radb2<double>.
It should be possible to isolate this kernel into a testcase; I will
try to do this from the GIMPLE IL.
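
For readers decoding the dump: the huge constants are just small negative
numbers printed as unsigned 64-bit values.  The sketch below checks that
reading and also shows why an access like CC(ic-1,1,k) walks backwards by
16 bytes per iteration.  The names as_signed, cc_index, cc_byte_step and
kElemBytes are made up for illustration, the value ido = 9 is arbitrary,
and nothing here claims to identify which source access *_123 actually is.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical helper: read a 64-bit unsigned constant from the dump as the
// signed value it wraps around to (computed arithmetically, no UB).
constexpr std::int64_t as_signed(std::uint64_t v)
{
  return v <= static_cast<std::uint64_t>(INT64_MAX)
           ? static_cast<std::int64_t>(v)
           : -static_cast<std::int64_t>(~v) - 1;
}

// Assumed element size: the dump scales indices by 8, i.e. 8-byte double.
constexpr std::int64_t kElemBytes = 8;

// Flat index computed by the CC lambda in radb2: cc[a + ido*(b + 2*c)].
constexpr std::size_t cc_index(std::size_t ido, std::size_t a,
                               std::size_t b, std::size_t c)
{
  return a + ido*(b + 2*c);
}

// Byte distance between the element CC(ic-1,1,k) read at iteration i+2 and
// the one read at iteration i, with ic = ido - i.  Requires i + 3 <= ido.
constexpr std::int64_t cc_byte_step(std::size_t ido, std::size_t k,
                                    std::size_t i)
{
  return (static_cast<std::int64_t>(cc_index(ido, ido-(i+2)-1, 1, k))
        - static_cast<std::int64_t>(cc_index(ido, ido-i-1, 1, k)))
         * kElemBytes;
}

// Constants taken from the dump.
static_assert(as_signed(18446744073709551600ULL) == -16, "step");
static_assert(as_signed(18446744073709551592ULL) == -24, "initial value");
static_assert(as_signed(18446744073709551614ULL) == -2, "old base adjustment");

// i advances by 2, so ic-1 drops by 2 elements: -16 bytes per iteration.
static_assert(cc_byte_step(9, 0, 2) == -16, "negative step of CC(ic-1,1,k)");
```

All checks are static_asserts, so compiling the snippet (for example with
g++ -std=c++17 -c) is enough to confirm them.  With that reading, the new
access function is {-24, +, -16}_4 and the old one was {0, +, -16}_4.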