[Bug tree-optimization/114324] [13/14 Regression] AVX2 vectorisation performance regression with gfortran 13/14

2024-03-15 Thread mjr19 at cam dot ac.uk via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114324

--- Comment #4 from mjr19 at cam dot ac.uk ---
Created attachment 57713
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=57713&action=edit
Second testcase, very similar to first

Thank you for looking into this. The real code in question has more than one
loop which suffers a slow-down with gfortran 13/14 when compared to 12, and I
suspect it is the same underlying issue in all cases.

I attach another test case, which seems very similar. The odd logic surrounding
the initialisation of ci is to replicate the fact that in the real code the
sign of ci depends on an argument which I have dropped, and so the compiler
cannot optimise it away completely.

For this case, gfortran 12 and ifort produce very similar performance, gfortran
13 is over 20% slower, and ifx slower still.

2024-03-14 Thread rguenth at gcc dot gnu.org via Gcc-bugs

--- Comment #3 from Richard Biener ---
So the missed feature is to implement swapping operands of a MINUS_EXPR
during SLP discovery by introducing a conditional negate (for example
by multiplying with { 1, -1 } or with two_operator negate, "nop" and blend).
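
To make the conditional-negate idea concrete, here is a small model in plain
Python (not GCC code). The function names and the two-lane layout are
assumptions made for illustration only. One lane computes a - b and the other
b - a; swapping the operands of the second lane makes both lanes the same
MINUS, and a lane-wise multiply with { 1, -1 } restores the original sign.
The results agree apart from the sign of a zero result when a equals b.

```python
def mixed_lanes(a, b):
    """Scalar form: lane 0 computes a - b, lane 1 computes b - a."""
    return [a[0] - b[0], b[1] - a[1]]


def uniform_with_negate(a, b):
    """Both lanes compute a - b; lane 1 is then negated via { 1, -1 }."""
    diff = [x - y for x, y in zip(a, b)]   # one uniform vector subtract
    signs = [1.0, -1.0]                    # the conditional negate
    return [d * s for d, s in zip(diff, signs)]
```

With a = [3.0, 5.0] and b = [1.0, 9.0] both functions return [2.0, 4.0].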

Note that with GCC 12 and the +- mixed op we are able to use vaddsubpd,
that's in the end likely the perfect code gen for the testcase.  I'm not
sure it's easy to get back to that with the "more optimized" scalar IL.
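
For reference, the lane behaviour of vaddsubpd can be modelled as below. This
is a plain Python sketch with a hypothetical function name, not compiler
output: even lanes subtract and odd lanes add. Assuming real parts sit in even
lanes and imaginary parts in odd lanes, a "real minus, imaginary plus" pair
for one complex element fits a single such operation.

```python
def addsubpd(a, b):
    """Model of vaddsubpd on four doubles: even lanes a - b, odd lanes a + b."""
    return [x - y if i % 2 == 0 else x + y
            for i, (x, y) in enumerate(zip(a, b))]
```

For example, addsubpd([1.0, 2.0, 3.0, 4.0], [10.0, 20.0, 30.0, 40.0]) gives
[-9.0, 22.0, -27.0, 44.0].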

I'll note the negate could be also consumed by the constant in the
multiplication.
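
A sketch of what "consumed by the constant" could look like, again as a plain
Python model with hypothetical names: instead of multiplying by { 1, -1 } and
then by the constant, the sign is folded into a per-lane constant vector
{ C, -C }, saving one multiply. C is the literal that appears in the dumps in
this thread.

```python
C = 8.660254037844385965883020617184229195117950439453125e-1


def negate_then_scale(diff):
    """Two steps: conditional negate via { 1, -1 }, then scale by C."""
    signed = [d * s for d, s in zip(diff, [1.0, -1.0])]
    return [x * C for x in signed]


def scale_with_signed_constant(diff):
    """One step: the negate is folded into the constant vector { C, -C }."""
    return [d * k for d, k in zip(diff, [C, -C])]
```

Negation is exact in floating point, so both forms produce identical values.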

2024-03-14 Thread rguenth at gcc dot gnu.org via Gcc-bugs

Richard Biener changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
   Target Milestone|12.4                        |13.3
                 CC|                            |rguenth at gcc dot gnu.org

2024-03-14 Thread rguenth at gcc dot gnu.org via Gcc-bugs

Richard Biener changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
           Priority|P3                          |P2

--- Comment #2 from Richard Biener ---
GCC 12 manages to fully SLP the loop resulting in a vectorization factor of two
while GCC 13 ends up with hybrid SLP and a vectorization factor of four.  The
IL into the vectorizer is almost the same besides

  REALPART_EXPR <(*a_28(D))[_13]> = _53;  |   REALPART_EXPR <(*a_28(D))[_12]> = _53;
  IMAGPART_EXPR <(*a_28(D))[_13]> = _54;  |   IMAGPART_EXPR <(*a_28(D))[_12]> = _54;
  _108 = d1$real_51 - _42;                |   _55 = ctmp$real_43 + d1$real_51;
  _56 = ctmp$imag_44 + d1$imag_52;            _56 = ctmp$imag_44 + d1$imag_52;
  REALPART_EXPR <(*a_28(D))[_5]> = _108;  |   REALPART_EXPR <(*a_28(D))[_6]> = _55;
  IMAGPART_EXPR <(*a_28(D))[_5]> = _56;   |   IMAGPART_EXPR <(*a_28(D))[_6]> = _56;
  _57 = _42 + d1$real_51;                 |   _57 = d1$real_51 - ctmp$real_43;
  _58 = d1$imag_52 - ctmp$imag_44;            _58 = d1$imag_52 - ctmp$imag_44;
  REALPART_EXPR <(*a_28(D))[_9]> = _57;       REALPART_EXPR <(*a_28(D))[_9]> = _57;
  IMAGPART_EXPR <(*a_28(D))[_9]> = _58;       IMAGPART_EXPR <(*a_28(D))[_9]> = _58;

where in GCC 13 we ended up destroying the nice complex pattern by
merging the negation into the defining stmts which causes SLP discovery
to fail there:

t.f90:11:7: note:   Build SLP for _35 = IMAGPART_EXPR <(*a_28(D))[_9]>;
t.f90:11:7: note:   precomputed vectype: vector(4) real(kind=8)
t.f90:11:7: note:   nunits = 4
t.f90:11:7: note:   Build SLP for _21 = REALPART_EXPR <(*a_28(D))[_6]>;
t.f90:11:7: note:   precomputed vectype: vector(4) real(kind=8)
t.f90:11:7: note:   nunits = 4
t.f90:11:7: missed:   Build SLP failed: different interleaving chains in one node _21 = REALPART_EXPR <(*a_28(D))[_6]>;

since we got there from

t.f90:11:7: note:   Build SLP for _7 = _35 - _20;
t.f90:11:7: note:   precomputed vectype: vector(4) real(kind=8)
t.f90:11:7: note:   nunits = 4
t.f90:11:7: note:   Build SLP for _40 = _21 - _36;
t.f90:11:7: note:   precomputed vectype: vector(4) real(kind=8)

there's nothing to "swap".

I'll note that complex lowering produces what GCC 13 has in the end, and
it seems that PRE is what produces the "desired" IL:

Inserted _107 = -_42;
Replaced _6 * 8.660254037844385965883020617184229195117950439453125e-1 with _107 in all uses of ctmp$real_43 = _6 * 8.660254037844385965883020617184229195117950439453125e-1;
gimple_simplified to _108 = d1$real_51 - _42;
_55 = _108;
gimple_simplified to _57 = _42 + d1$real_51;
Removing dead stmt ctmp$real_43 = _6 * 8.660254037844385965883020617184229195117950439453125e-1;

before PRE we have

  _41 = _20 - _4;
  _42 = _41 * 8.660254037844385965883020617184229195117950439453125e-1;
  _6 = _4 - _20;
  ctmp$real_43 = _6 * 8.660254037844385965883020617184229195117950439453125e-1;

GCC 13 seems to perform the same value numbering but in the end doesn't
insert.  This is because _42 is dead (also with GCC 12), so we don't
want to make it live again by expressing ctmp$real_43 as -_42, as that wouldn't
be profitable.  That was added by r13-6834-g41ade3399bd1ec on purpose.
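
The effect of the GCC 12 PRE replacement on the two uses of ctmp$real_43 can
be checked with a small Python model. This is only an illustrative sketch: the
function and variable names are hypothetical stand-ins for the SSA names in
the dump. Once ctmp$real_43 is expressed as -_42, the add and the subtract
that use it simplify to a subtract and an add on _42 directly.

```python
def uses_with_ctmp(d1_real, t42):
    """Uses written against ctmp$real_43 = -_42 (before simplification)."""
    ctmp_real = -t42
    t55 = ctmp_real + d1_real
    t57 = d1_real - ctmp_real
    return t55, t57


def uses_after_pre(d1_real, t42):
    """Same uses after gimple_simplified: _108 = d1 - _42, _57 = _42 + d1."""
    t108 = d1_real - t42
    t57 = t42 + d1_real
    return t108, t57
```

Both functions return the same pair of values for the same inputs, since
adding a negated operand is the same operation as subtracting it.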

From complex lowering we had

  _41 = _20 - _35;
  _42 = _41 * 8.660254037844385965883020617184229195117950439453125e-1;
  ctmp$real_43 = -_42;

and forwprop rightfully turned that into

  _7 = _35 - _20;
  ctmp$real_43 = _7 * 8.660254037844385965883020617184229195117950439453125e-1;

and PRE undid this in GCC 12 which the change now prohibits.
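
The forwprop step can be sanity-checked numerically with the sketch below. It
is a plain Python model, not GCC code, and the names are hypothetical
stand-ins for _20, _35 and the temporaries above. The two shapes give the same
value except for the sign of a zero result, which is why such a fold depends
on the floating-point semantics in effect.

```python
C = 8.660254037844385965883020617184229195117950439453125e-1


def lowered(v20, v35):
    """Shape from complex lowering: subtract, scale, then negate."""
    t41 = v20 - v35
    t42 = t41 * C
    return -t42


def after_forwprop(v20, v35):
    """Shape after forwprop: the negate is folded into the subtract."""
    t7 = v35 - v20
    return t7 * C
```

Note that in the second shape there is no longer a separate value
corresponding to _42.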

In this case the simplification prevents SLP vectorization, and
we can't at the moment recover from it.

2024-03-13 Thread pinskia at gcc dot gnu.org via Gcc-bugs

Andrew Pinski changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
     Ever confirmed|0                           |1
   Target Milestone|---                         |12.4
   Last reconfirmed|                            |2024-03-13
             Status|UNCONFIRMED                 |NEW
            Summary|AVX2 vectorisation          |[13/14 Regression] AVX2
                   |performance regression with |vectorisation performance
                   |gfortran 13/14              |regression with gfortran
                   |                            |13/14
             Blocks|                            |53947
          Component|target                      |tree-optimization

--- Comment #1 from Andrew Pinski ---
There are definitely some vectorization changes happening.
Confirmed.


Referenced Bugs:

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=53947
[Bug 53947] [meta-bug] vectorizer missed-optimizations