[Bug tree-optimization/98544] [11 regression] Wrong code generated by tree vectorizer since r11-3917-g28290cb50c7dbf87

2021-01-08 Thread rguenther at suse dot de via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98544

--- Comment #23 from rguenther at suse dot de  ---
On January 8, 2021 3:07:48 PM GMT+01:00, "mar...@mpa-garching.mpg.de"
 wrote:
>https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98544
>
>--- Comment #22 from Martin Reinecke  ---
>Brilliant, thank you very much for tracking this one down!
>My FFT library now works correctly again with all optimizations
>enabled, which
>is a great relief. The scipy maintainers will be happy that they won't
>need to
>fiddle with flags depending on gcc versions :)

Thanks again for testing and reporting issues before we released GCC 11!

[Bug tree-optimization/98544] [11 regression] Wrong code generated by tree vectorizer since r11-3917-g28290cb50c7dbf87

2021-01-08 Thread martin--- via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98544

--- Comment #22 from Martin Reinecke  ---
Brilliant, thank you very much for tracking this one down!
My FFT library now works correctly again with all optimizations enabled, which
is a great relief. The scipy maintainers will be happy that they won't need to
fiddle with flags depending on gcc versions :)

[Bug tree-optimization/98544] [11 regression] Wrong code generated by tree vectorizer since r11-3917-g28290cb50c7dbf87

2021-01-08 Thread rguenth at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98544

Richard Biener  changed:

   What|Removed |Added

 Resolution|--- |FIXED
 Status|ASSIGNED|RESOLVED

--- Comment #21 from Richard Biener  ---
Fixed.

[Bug tree-optimization/98544] [11 regression] Wrong code generated by tree vectorizer since r11-3917-g28290cb50c7dbf87

2021-01-08 Thread cvs-commit at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98544

--- Comment #20 from CVS Commits  ---
The master branch has been updated by Richard Biener :

https://gcc.gnu.org/g:bdcde1504502719504a7a63ab10059e171694dc2

commit r11-6549-gbdcde1504502719504a7a63ab10059e171694dc2
Author: Richard Biener 
Date:   Fri Jan 8 13:17:18 2021 +0100

tree-optimization/98544 - more permute optimization fixes

Permute nodes are not transparent to the permute of their children.
Instead we have to materialize child permutes always and in future
may treat permute nodes as the source of arbitrary permutes as
we can permute the lane permutation vector at will (as the target
supports in the end).

2021-01-08  Richard Biener  

PR tree-optimization/98544
* tree-vect-slp.c (vect_optimize_slp): Always materialize
permutes at a permute node.

* gcc.dg/vect/bb-slp-pr98544.c: New testcase.

[Bug tree-optimization/98544] [11 regression] Wrong code generated by tree vectorizer since r11-3917-g28290cb50c7dbf87

2021-01-08 Thread marxin at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98544

--- Comment #19 from Martin Liška  ---
> err
> 
> /* { dg-do run } */
> 
> double a[2], b[2], c[2], d[2];
> 
> void __attribute__((noipa))
> foo()
> {
>   double a0 = a[0];
>   double a1 = a[1];
>   double b0 = b[0];
>   double b1 = b[1];
>   double c0 = c[0];
>   double c1 = c[1];
>   double tem1 = a1 - b1;
>   double tem2 = a0 + b0;
>   d[0] = tem1 * c1;
>   d[1] = tem2 * c0;
> }
> 
> int main()
> {
>   a[0] = 1.;
>   a[1] = 2.;
>   b[0] = 3.;
>   b[1] = 4.;
>   c[0] = 2.;
>   c[1] = 3.;
>   foo ();
>   if (d[0] != -6. || d[1] != 8.)
> __builtin_abort ();
>   return 0;
> }

This started to fail with -O3 since r11-3823-g126ed72b9f48f853 if it helps.

[Bug tree-optimization/98544] [11 regression] Wrong code generated by tree vectorizer since r11-3917-g28290cb50c7dbf87

2021-01-08 Thread rguenth at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98544

--- Comment #18 from Richard Biener  ---
(In reply to Richard Biener from comment #17)
> /* { dg-do run } */
> 
> double a[4], b[2];
> 
> void __attribute__((noipa))
> foo ()
> {
>   double a0 = a[0];
>   double a1 = a[1];
>   double a2 = a[2];
>   double a3 = a[3];
>   b[0] = a1 - a3;
>   b[1] = a0 + a2;
> }
> 
> int main()
> {
>   a[0] = 1.;
>   a[1] = 2.;
>   a[2] = 3.;
>   a[3] = 4.;
>   foo ();
>   if (b[0] != -2 || b[1] != 4)
> __builtin_abort ();
>   return 0;
> }

err

/* { dg-do run } */

double a[2], b[2], c[2], d[2];

void __attribute__((noipa))
foo()
{
  double a0 = a[0];
  double a1 = a[1];
  double b0 = b[0];
  double b1 = b[1];
  double c0 = c[0];
  double c1 = c[1];
  double tem1 = a1 - b1;
  double tem2 = a0 + b0;
  d[0] = tem1 * c1;
  d[1] = tem2 * c0;
}

int main()
{
  a[0] = 1.;
  a[1] = 2.;
  b[0] = 3.;
  b[1] = 4.;
  c[0] = 2.;
  c[1] = 3.;
  foo ();
  if (d[0] != -6. || d[1] != 8.)
__builtin_abort ();
  return 0;
}

[Bug tree-optimization/98544] [11 regression] Wrong code generated by tree vectorizer since r11-3917-g28290cb50c7dbf87

2021-01-08 Thread rguenth at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98544

--- Comment #17 from Richard Biener  ---
/* { dg-do run } */

double a[4], b[2];

void __attribute__((noipa))
foo ()
{
  double a0 = a[0];
  double a1 = a[1];
  double a2 = a[2];
  double a3 = a[3];
  b[0] = a1 - a3;
  b[1] = a0 + a2;
}

int main()
{
  a[0] = 1.;
  a[1] = 2.;
  a[2] = 3.;
  a[3] = 4.;
  foo ();
  if (b[0] != -2 || b[1] != 4)
__builtin_abort ();
  return 0;
}

[Bug tree-optimization/98544] [11 regression] Wrong code generated by tree vectorizer since r11-3917-g28290cb50c7dbf87

2021-01-07 Thread rguenth at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98544

--- Comment #16 from Richard Biener  ---
Interestingly trunk doesn't exhibit the issue with radb2 but radb4 and radb3
and radb5 still do ... (radb3 being the next "smallest") but a similarly
stripped down testcase doesn't exhibit problems either (at length 15 as
reported from the full testcase).  *sigh*

[Bug tree-optimization/98544] [11 regression] Wrong code generated by tree vectorizer since r11-3917-g28290cb50c7dbf87

2021-01-07 Thread martin--- via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98544

--- Comment #15 from Martin Reinecke  ---
"Problem at length N" means that the FFT of length N is computed incorrectly.
Also, N==l1*ido*x.

For an FFT of length N, the computation is broken down into several passes.
Let's take N=15.
First the prome factors of N are computed, in this case 3 and 5.

The (simplified) procedure is then:

l1=1;

// first pass with x=3
x=3;
ido=N/(x*l1);
radb3(ido, l1, p1, p2, );
swap (p1,p2);

l1*=x;
x=5;
ido=N/(x*l1);
radb5(ido, l1, p1, p2, );
swap (p1,p2);

[Bug tree-optimization/98544] [11 regression] Wrong code generated by tree vectorizer since r11-3917-g28290cb50c7dbf87

2021-01-07 Thread rguenth at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98544

--- Comment #14 from Richard Biener  ---
So that makes it sth like the following then - still not miscomparing.  What's
the "problem at length N" hinting at?  Does N somehow make ido/l1 different?

typedef unsigned long size_t;
template inline void PM(T , T , T c, T d)
  { a=c+d; b=c-d; }
template inline void MULPM
  (T1 , T1 , T2 c, T2 d, T3 e, T3 f)
  {  a=c*e+d*f; b=c*f-d*e; }
typedef double T;
void __attribute__((noipa))
radb2(size_t ido, size_t l1,
  const T * __restrict cc, T * __restrict ch,
  const T * __restrict wa)
{
  auto WA = [wa,ido](size_t x, size_t i) { return wa[i+x*(ido-1)]; };
  auto CC = [cc,ido](size_t a, size_t b, size_t c) -> const T&
{ return cc[a+ido*(b+2*c)]; };
  auto CH = [ch,ido,l1](size_t a, size_t b, size_t c) -> T&
{ return ch[a+ido*(b+l1*c)]; };

  for (size_t k=0; k

[Bug tree-optimization/98544] [11 regression] Wrong code generated by tree vectorizer since r11-3917-g28290cb50c7dbf87

2021-01-07 Thread martin--- via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98544

--- Comment #13 from Martin Reinecke  ---
> What kind of shape (w/o too much guessing) is the function expecting for its 
> input arrays?

For radb the size of the cc and ch arrays is l1*ido*x.
Size of wa is (x-1)*ido.

[Bug tree-optimization/98544] [11 regression] Wrong code generated by tree vectorizer since r11-3917-g28290cb50c7dbf87

2021-01-07 Thread rguenth at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98544

--- Comment #12 from Richard Biener  ---
So at least the following doesn't produce any different answers for me
comparing -O -ftree-loop-vectorize -mavx to -O0

typedef unsigned long size_t;

template inline void PM(T , T , T c, T d)
  { a=c+d; b=c-d; }
template inline void MULPM
  (T1 , T1 , T2 c, T2 d, T3 e, T3 f)
  {  a=c*e+d*f; b=c*f-d*e; }

typedef double T;
void __attribute__((noipa))
radb2(size_t ido, size_t l1,
  const T * __restrict cc, T * __restrict ch,
  const T * __restrict wa)
{
  auto WA = [wa,ido](size_t x, size_t i) { return wa[i+x*(ido-1)]; };
  auto CC = [cc,ido](size_t a, size_t b, size_t c) -> const T&
{ return cc[a+ido*(b+2*c)]; };
  auto CH = [ch,ido,l1](size_t a, size_t b, size_t c) -> T&
{ return ch[a+ido*(b+l1*c)]; };

  for (size_t k=0; k

[Bug tree-optimization/98544] [11 regression] Wrong code generated by tree vectorizer since r11-3917-g28290cb50c7dbf87

2021-01-07 Thread rguenth at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98544

--- Comment #11 from Richard Biener  ---
Or massaging the C++ to produce a single-function testcase (driver still
missing,
I think we can remove the "unrelated" loops as well).  What kind of shape (w/o
too much guessing) is the function expecting for its input arrays?

typedef unsigned long size_t;

template inline void PM(T , T , T c, T d)
  { a=c+d; b=c-d; }
template inline void MULPM
  (T1 , T1 , T2 c, T2 d, T3 e, T3 f)
  {  a=c*e+d*f; b=c*f-d*e; }

typedef double T;
void radb2(size_t ido, size_t l1,
   const T * __restrict cc, T * __restrict ch,
   const T * __restrict wa)
{
  auto WA = [wa,ido](size_t x, size_t i) { return wa[i+x*(ido-1)]; };
  auto CC = [cc,ido](size_t a, size_t b, size_t c) -> const T&
{ return cc[a+ido*(b+2*c)]; };
  auto CH = [ch,ido,l1](size_t a, size_t b, size_t c) -> T&
{ return ch[a+ido*(b+l1*c)]; };

  for (size_t k=0; k

[Bug tree-optimization/98544] [11 regression] Wrong code generated by tree vectorizer since r11-3917-g28290cb50c7dbf87

2021-01-07 Thread rguenth at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98544

--- Comment #10 from Richard Biener  ---
Created attachment 49912
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=49912=edit
GIMPLE testcase for one of the miscompiled(?) loops

It shows similar vectorized / analysis IL as for the big testcase with -O
-ftree-loop-vectorize -mavx2 -fgimple

[Bug tree-optimization/98544] [11 regression] Wrong code generated by tree vectorizer since r11-3917-g28290cb50c7dbf87

2021-01-07 Thread rguenth at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98544

--- Comment #9 from Richard Biener  ---
(In reply to Richard Biener from comment #8)
> (In reply to Martin Liška from comment #2)
> > Confirmed, one can reduce that to a single loop vectorization:
> > 
> > $ g++ bug2.cc  -std=c++17 -O1 -mavx -ftree-loop-vectorize
> > -fdbg-cnt=vect_loop:10-10 && ./a.out
> > 
> > but the loop is quite huge.
> 
> btw, 11-11 or 12-12 or 13-13 also is enough individually to trigger a
> miscompare.
> The 11-11 loop looks smallest to me:
> 
> ***dbgcnt: lower limit 11 reached for vect_loop.***
> ***dbgcnt: upper limit 11 reached for vect_loop.***
> fft1d.h:1256:23: optimized: loop vectorized using 32 byte vectors
> fft1d.h:1256:23: optimized:  loop versioned for vectorization because of
> possible aliasing
> 
> it also only needs a single alias check (just guessing where things may go
> wrong)
> 
> The source corresponds to
> 
> template void radb2(size_t ido, size_t l1,
>   const T * DUCC0_RESTRICT cc, T * DUCC0_RESTRICT ch,
>   const T0 * DUCC0_RESTRICT wa) const
>   {
>   auto WA = [wa,ido](size_t x, size_t i) { return wa[i+x*(ido-1)]; };
>   auto CC = [cc,ido](size_t a, size_t b, size_t c) -> const T&
> { return cc[a+ido*(b+2*c)]; };
>   auto CH = [ch,ido,l1](size_t a, size_t b, size_t c) -> T&
> { return ch[a+ido*(b+l1*c)]; };
> 
>   for (size_t k=0; k PM (CH(0,k,0),CH(0,k,1),CC(0,0,k),CC(ido-1,1,k));
>   if ((ido&1)==0)
> for (size_t k=0; k   {
>   CH(ido-1,k,0) = T0( 2)*CC(ido-1,0,k);
>   CH(ido-1,k,1) = T0(-2)*CC(0,1,k);
>   }
>   if (ido<=2) return;
>   for (size_t k=0; k >  this loop
> for (size_t i=2; i   {
>   size_t ic=ido-i;
>   T ti2, tr2;
>   PM (CH(i-1,k,0),tr2,CC(i-1,0,k),CC(ic-1,1,k));
>   PM (ti2,CH(i  ,k,0),CC(i  ,0,k),CC(ic  ,1,k));
>   MULPM (CH(i,k,1),CH(i-1,k,1),WA(0,i-2),WA(0,i-1),ti2,tr2);
>   }
> <
>   }

Notably the access functions end up with large (negative) initial values
and we have a negative step (more negative step stuff is vectorized with
GCC 11 now!)

 Creating dr for *_123
 analyze_innermost: success.
-   base_address: (const double &) cc_30(D) + (sizetype) long unsigned
int) ido_29(D) + _135)
 + 18446744073709551614) * 8)
+   base_address: (const double &) cc_30(D) + (sizetype) (((long unsigned
int) ido_29(D) + _120) 
* 8)
offset from base address: 0
-   constant offset from base address: 0
+   constant offset from base address: -24(OVF)
step: -16(OVF)
base alignment: 8
base misalignment: 0
offset alignment: 256
step alignment: 16
-   base_object: *(const double &) cc_30(D) + (sizetype) long unsigned
int) ido_29(D) + _135)
 + 18446744073709551614) * 8)
-   Access function 0: {0, +, 18446744073709551600}_4
+   base_object: *(const double &) cc_30(D) + (sizetype) (((long unsigned
int) ido_29(D) + _120) 
* 8)
+   Access function 0: {18446744073709551592, +, 18446744073709551600}_4

we now use SLP for this loop (great!) while we previously failed to vectorize
this loop at all.

We can disable epilouge vectorization with no (good) effect.

The function in question is ducc0::detail_fft::rfftp::radb2,
it should be possible to isolate this kernel into a testcase, I will try to do
this from the GIMPLE IL.

[Bug tree-optimization/98544] [11 regression] Wrong code generated by tree vectorizer since r11-3917-g28290cb50c7dbf87

2021-01-07 Thread rguenth at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98544

--- Comment #8 from Richard Biener  ---
(In reply to Martin Liška from comment #2)
> Confirmed, one can reduce that to a single loop vectorization:
> 
> $ g++ bug2.cc  -std=c++17 -O1 -mavx -ftree-loop-vectorize
> -fdbg-cnt=vect_loop:10-10 && ./a.out
> 
> but the loop is quite huge.

btw, 11-11 or 12-12 or 13-13 also is enough individually to trigger a
miscompare.
The 11-11 loop looks smallest to me:

***dbgcnt: lower limit 11 reached for vect_loop.***
***dbgcnt: upper limit 11 reached for vect_loop.***
fft1d.h:1256:23: optimized: loop vectorized using 32 byte vectors
fft1d.h:1256:23: optimized:  loop versioned for vectorization because of
possible aliasing

it also only needs a single alias check (just guessing where things may go
wrong)

The source corresponds to

template void radb2(size_t ido, size_t l1,
  const T * DUCC0_RESTRICT cc, T * DUCC0_RESTRICT ch,
  const T0 * DUCC0_RESTRICT wa) const
  {
  auto WA = [wa,ido](size_t x, size_t i) { return wa[i+x*(ido-1)]; };
  auto CC = [cc,ido](size_t a, size_t b, size_t c) -> const T&
{ return cc[a+ido*(b+2*c)]; };
  auto CH = [ch,ido,l1](size_t a, size_t b, size_t c) -> T&
{ return ch[a+ido*(b+l1*c)]; };

  for (size_t k=0; k  this loop
for (size_t i=2; i

[Bug tree-optimization/98544] [11 regression] Wrong code generated by tree vectorizer since r11-3917-g28290cb50c7dbf87

2021-01-07 Thread rguenth at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98544

--- Comment #7 from Richard Biener  ---
(In reply to Richard Biener from comment #5)
> (In reply to Martin Liška from comment #3)
> > For g++ bug2.cc  -std=c++17 -O1 -mavx -ftree-loop-vectorize, it started with
> > r11-3917-g28290cb50c7dbf87.
> 
> Are you sure about this bisection?  Reverting this doesn't fix the issue
> (reversion isn't quite easy since split_constant_offset has undergone quite
> some refactoring meanwhile).

So I can confirm the bisection.  I see quite some good effect of the change
with no "obvious" mistakes done though so it might just trigger some latent
issue elsewhere.

[Bug tree-optimization/98544] [11 regression] Wrong code generated by tree vectorizer since r11-3917-g28290cb50c7dbf87

2021-01-07 Thread marxin at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98544

--- Comment #6 from Martin Liška  ---
(In reply to Richard Biener from comment #5)
> (In reply to Martin Liška from comment #3)
> > For g++ bug2.cc  -std=c++17 -O1 -mavx -ftree-loop-vectorize, it started with
> > r11-3917-g28290cb50c7dbf87.
> 
> Are you sure about this bisection?

Yes, it's correct. I've just verified that.

>  Reverting this doesn't fix the issue
> (reversion isn't quite easy since split_constant_offset has undergone quite
> some refactoring meanwhile).

I'm gonna bisect until when the reversion worked on the trunk.

[Bug tree-optimization/98544] [11 regression] Wrong code generated by tree vectorizer since r11-3917-g28290cb50c7dbf87

2021-01-07 Thread rguenth at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98544

--- Comment #5 from Richard Biener  ---
(In reply to Martin Liška from comment #3)
> For g++ bug2.cc  -std=c++17 -O1 -mavx -ftree-loop-vectorize, it started with
> r11-3917-g28290cb50c7dbf87.

Are you sure about this bisection?  Reverting this doesn't fix the issue
(reversion isn't quite easy since split_constant_offset has undergone quite
some refactoring meanwhile).

[Bug tree-optimization/98544] [11 regression] Wrong code generated by tree vectorizer since r11-3917-g28290cb50c7dbf87

2021-01-06 Thread rguenth at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98544

Richard Biener  changed:

   What|Removed |Added

   Target Milestone|--- |11.0
 CC||rsandifo at gcc dot gnu.org
   Priority|P3  |P1
   Assignee|unassigned at gcc dot gnu.org  |rguenth at gcc dot 
gnu.org
 Status|NEW |ASSIGNED

--- Comment #4 from Richard Biener  ---
split_constant_offset?  Meh.

[Bug tree-optimization/98544] [11 regression] Wrong code generated by tree vectorizer since r11-3917-g28290cb50c7dbf87

2021-01-05 Thread marxin at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98544

Martin Liška  changed:

   What|Removed |Added

Summary|[11 regression] Wrong code  |[11 regression] Wrong code
   |generated by tree   |generated by tree
   |vectorizer  |vectorizer since
   ||r11-3917-g28290cb50c7dbf87

--- Comment #3 from Martin Liška  ---
For g++ bug2.cc  -std=c++17 -O1 -mavx -ftree-loop-vectorize, it started with
r11-3917-g28290cb50c7dbf87.