https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98544

--- Comment #9 from Richard Biener <rguenth at gcc dot gnu.org> ---
(In reply to Richard Biener from comment #8)
> (In reply to Martin Liška from comment #2)
> > Confirmed, one can reduce that to a single loop vectorization:
> > 
> > $ g++ bug2.cc  -std=c++17 -O1 -mavx -ftree-loop-vectorize
> > -fdbg-cnt=vect_loop:10-10 && ./a.out
> > 
> > but the loop is quite huge.
> 
> btw, 11-11 or 12-12 or 13-13 also is enough individually to trigger a
> miscompare.
> The 11-11 loop looks smallest to me:
> 
> ***dbgcnt: lower limit 11 reached for vect_loop.***
> ***dbgcnt: upper limit 11 reached for vect_loop.***
> fft1d.h:1256:23: optimized: loop vectorized using 32 byte vectors
> fft1d.h:1256:23: optimized:  loop versioned for vectorization because of
> possible aliasing
> 
> it also only needs a single alias check (just guessing where things may go
> wrong)
> 
> The source corresponds to
> 
> template<typename T> void radb2(size_t ido, size_t l1,
>   const T * DUCC0_RESTRICT cc, T * DUCC0_RESTRICT ch,
>   const T0 * DUCC0_RESTRICT wa) const
>   {
>   auto WA = [wa,ido](size_t x, size_t i) { return wa[i+x*(ido-1)]; };
>   auto CC = [cc,ido](size_t a, size_t b, size_t c) -> const T&
>     { return cc[a+ido*(b+2*c)]; };
>   auto CH = [ch,ido,l1](size_t a, size_t b, size_t c) -> T&
>     { return ch[a+ido*(b+l1*c)]; };
> 
>   for (size_t k=0; k<l1; k++)
>     PM (CH(0,k,0),CH(0,k,1),CC(0,0,k),CC(ido-1,1,k));
>   if ((ido&1)==0)
>     for (size_t k=0; k<l1; k++)
>       {
>       CH(ido-1,k,0) = T0( 2)*CC(ido-1,0,k);
>       CH(ido-1,k,1) = T0(-2)*CC(0    ,1,k);
>       }
>   if (ido<=2) return;
>   for (size_t k=0; k<l1;++k)
> ====>  this loop
>     for (size_t i=2; i<ido; i+=2)
>       {
>       size_t ic=ido-i;
>       T ti2, tr2;
>       PM (CH(i-1,k,0),tr2,CC(i-1,0,k),CC(ic-1,1,k));
>       PM (ti2,CH(i  ,k,0),CC(i  ,0,k),CC(ic  ,1,k));
>       MULPM (CH(i,k,1),CH(i-1,k,1),WA(0,i-2),WA(0,i-1),ti2,tr2);
>       }
> <====
>   }

Notably, the access functions end up with large (negative) initial values
and we have a negative step (GCC 11 now vectorizes more negative-step
accesses!)

 Creating dr for *_123
 analyze_innermost: success.
-       base_address: (const double &) cc_30(D) + (sizetype) ((((long unsigned int) ido_29(D) + _135) + 18446744073709551614) * 8)
+       base_address: (const double &) cc_30(D) + (sizetype) (((long unsigned int) ido_29(D) + _120) * 8)
        offset from base address: 0
-       constant offset from base address: 0
+       constant offset from base address: -24(OVF)
        step: -16(OVF)
        base alignment: 8
        base misalignment: 0
        offset alignment: 256
        step alignment: 16
-       base_object: *(const double &) cc_30(D) + (sizetype) ((((long unsigned int) ido_29(D) + _135) + 18446744073709551614) * 8)
-       Access function 0: {0, +, 18446744073709551600}_4
+       base_object: *(const double &) cc_30(D) + (sizetype) (((long unsigned int) ido_29(D) + _120) * 8)
+       Access function 0: {18446744073709551592, +, 18446744073709551600}_4
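
For readers decoding the constants in the dump above: the huge unsigned
values are 64-bit two's-complement encodings of small negative numbers.
The following is only a minimal sketch (the helper name as_signed is
hypothetical, not anything from GCC) that checks the three constants at
compile time; the results agree with the "step: -16(OVF)" and
"constant offset from base address: -24(OVF)" lines.

```cpp
#include <cstdint>

// Hypothetical helper: reinterpret a 64-bit unsigned value as a
// two's-complement signed value without relying on
// implementation-defined narrowing conversions.
constexpr std::int64_t as_signed(std::uint64_t u)
{
  return u <= static_cast<std::uint64_t>(INT64_MAX)
           ? static_cast<std::int64_t>(u)
           : -static_cast<std::int64_t>(~u) - 1;
}

// Step of the access function: 2^64 - 16, i.e. -16.
static_assert(as_signed(18446744073709551600ULL) == -16,
              "step decodes to -16");
// Initial value of the new access function: 2^64 - 24, i.e. -24.
static_assert(as_signed(18446744073709551592ULL) == -24,
              "initial value decodes to -24");
// Adjustment folded into the old base address: 2^64 - 2, i.e. -2.
static_assert(as_signed(18446744073709551614ULL) == -2,
              "old base adjustment decodes to -2");
```

So the new access function reads as {-24, +, -16}_4, while the old one
was {0, +, -16}_4 with the -2 adjustment kept in the base address.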

We now use SLP for this loop (great!), whereas previously we failed to
vectorize it at all.

We can disable epilogue vectorization with no (good) effect.

The function in question is ducc0::detail_fft::rfftp<double>::radb2<double>.
It should be possible to isolate this kernel into a testcase; I will try to
do this from the GIMPLE IL.
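
As a rough source-level starting point (not the GIMPLE-derived testcase
mentioned above, and not verified to reproduce the miscompare), the marked
loop nest could be pulled out along these lines. The PM and MULPM bodies
are assumptions, since the comment does not show them; the function names
radb2_inner and run_small_case, the use of __restrict__ in place of
DUCC0_RESTRICT, and the fixed type double are likewise choices made only
for this sketch.

```cpp
#include <cstddef>
#include <vector>

// ASSUMED helper definitions; the bug comment does not show them.
template<typename T> inline void PM(T &a, T &b, T c, T d)
  { a = c + d; b = c - d; }
template<typename T> inline void MULPM(T &a, T &b, T c, T d, T e, T f)
  { a = c*e + d*f; b = c*f - d*e; }

// Hypothetical standalone copy of the loop nest marked "this loop".
// __restrict__ is a GCC/Clang extension standing in for DUCC0_RESTRICT.
void radb2_inner(std::size_t ido, std::size_t l1,
                 const double *__restrict__ cc, double *__restrict__ ch,
                 const double *__restrict__ wa)
{
  auto WA = [wa,ido](std::size_t x, std::size_t i)
    { return wa[i+x*(ido-1)]; };
  auto CC = [cc,ido](std::size_t a, std::size_t b, std::size_t c)
    -> const double& { return cc[a+ido*(b+2*c)]; };
  auto CH = [ch,ido,l1](std::size_t a, std::size_t b, std::size_t c)
    -> double& { return ch[a+ido*(b+l1*c)]; };

  for (std::size_t k=0; k<l1; ++k)
    for (std::size_t i=2; i<ido; i+=2)
      {
      std::size_t ic=ido-i;   // cc is walked backwards through ic
      double ti2, tr2;
      PM (CH(i-1,k,0),tr2,CC(i-1,0,k),CC(ic-1,1,k));
      PM (ti2,CH(i  ,k,0),CC(i  ,0,k),CC(ic  ,1,k));
      MULPM (CH(i,k,1),CH(i-1,k,1),WA(0,i-2),WA(0,i-1),ti2,tr2);
      }
}

// Tiny driver for ido=3, l1=1, so the inner loop runs exactly once.
// With the assumed PM/MULPM above, only ch[1], ch[2], ch[4] and ch[5]
// are written; the other entries stay zero.
std::vector<double> run_small_case()
{
  std::vector<double> cc{0, 1, 2, 3, 4, 5};  // ido*2*l1 elements
  std::vector<double> wa{1, 0};              // ido-1 elements
  std::vector<double> ch(6, 0.0);            // ido*l1*2 elements
  radb2_inner(3, 1, cc.data(), ch.data(), wa.data());
  return ch;
}
```

To see the vectorizer's behaviour on it, a larger ido and l1 would be
needed, comparing the output of an -O1 -mavx -ftree-loop-vectorize build
against an unvectorized build of the same code.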
