[Bug tree-optimization/114324] [13/14 Regression] AVX2 vectorisation performance regression with gfortran 13/14

rguenth at gcc dot gnu.org via Gcc-bugs Thu, 14 Mar 2024 01:32:25 -0700

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114324


Richard Biener <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
           Priority|P3                          |P2

--- Comment #2 from Richard Biener <rguenth at gcc dot gnu.org> ---
GCC 12 manages to fully SLP the loop resulting in a vectorization factor of two
while GCC 13 ends up with hybrid SLP and a vectorization factor of four.  The
IL into the vectorizer is almost the same besides

  REALPART_EXPR <(*a_28(D))[_13]> = _53;  |   REALPART_EXPR <(*a_28(D))[_12]> = _53;
  IMAGPART_EXPR <(*a_28(D))[_13]> = _54;  |   IMAGPART_EXPR <(*a_28(D))[_12]> = _54;
  _108 = d1$real_51 - _42;                |   _55 = ctmp$real_43 + d1$real_51;
  _56 = ctmp$imag_44 + d1$imag_52;            _56 = ctmp$imag_44 + d1$imag_52;
  REALPART_EXPR <(*a_28(D))[_5]> = _108;  |   REALPART_EXPR <(*a_28(D))[_6]> = _55;
  IMAGPART_EXPR <(*a_28(D))[_5]> = _56;   |   IMAGPART_EXPR <(*a_28(D))[_6]> = _56;
  _57 = _42 + d1$real_51;                 |   _57 = d1$real_51 - ctmp$real_43;
  _58 = d1$imag_52 - ctmp$imag_44;            _58 = d1$imag_52 - ctmp$imag_44;
  REALPART_EXPR <(*a_28(D))[_9]> = _57;       REALPART_EXPR <(*a_28(D))[_9]> = _57;
  IMAGPART_EXPR <(*a_28(D))[_9]> = _58;       IMAGPART_EXPR <(*a_28(D))[_9]> = _58;

where in GCC 13 we ended up destroying the nice complex pattern by
merging the negation into the defining stmts which causes SLP discovery
to fail there:

t.f90:11:7: note:   Build SLP for _35 = IMAGPART_EXPR <(*a_28(D))[_9]>;
t.f90:11:7: note:   precomputed vectype: vector(4) real(kind=8)
t.f90:11:7: note:   nunits = 4
t.f90:11:7: note:   Build SLP for _21 = REALPART_EXPR <(*a_28(D))[_6]>;
t.f90:11:7: note:   precomputed vectype: vector(4) real(kind=8)
t.f90:11:7: note:   nunits = 4
t.f90:11:7: missed:   Build SLP failed: different interleaving chains in one node _21 = REALPART_EXPR <(*a_28(D))[_6]>;

since we got there from

t.f90:11:7: note:   Build SLP for _7 = _35 - _20;
t.f90:11:7: note:   precomputed vectype: vector(4) real(kind=8)
t.f90:11:7: note:   nunits = 4
t.f90:11:7: note:   Build SLP for _40 = _21 - _36;
t.f90:11:7: note:   precomputed vectype: vector(4) real(kind=8)

there's nothing to "swap".
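
As a reading aid for the IL comparison above, here is a minimal C sketch of
the two columns.  It assumes _42 holds the negation of ctmp$real_43 (as the
PRE dump suggests); the type and function names are made up for illustration
and are not from the testcase.  Under that assumption both shapes store the
same values, only the way the real parts are formed differs.

```c
#include <assert.h>

/* Hypothetical helper type; only the arithmetic shape mirrors the IL. */
typedef struct { double re, im; } cplx;

/* Left-hand column: real parts are formed from _42 (assumed == -ctmp$real_43). */
void stores_left_column(cplx d1, double t42, double ctmp_im,
                        cplx *first, cplx *second)
{
    first->re  = d1.re - t42;       /* _108 = d1$real_51 - _42;         */
    first->im  = ctmp_im + d1.im;   /* _56 = ctmp$imag_44 + d1$imag_52; */
    second->re = t42 + d1.re;       /* _57 = _42 + d1$real_51;          */
    second->im = d1.im - ctmp_im;   /* _58 = d1$imag_52 - ctmp$imag_44; */
}

/* Right-hand column: real parts are formed from ctmp$real_43 directly. */
void stores_right_column(cplx d1, double ctmp_re, double ctmp_im,
                         cplx *first, cplx *second)
{
    first->re  = ctmp_re + d1.re;   /* _55 = ctmp$real_43 + d1$real_51; */
    first->im  = ctmp_im + d1.im;   /* _56 = ctmp$imag_44 + d1$imag_52; */
    second->re = d1.re - ctmp_re;   /* _57 = d1$real_51 - ctmp$real_43; */
    second->im = d1.im - ctmp_im;   /* _58 = d1$imag_52 - ctmp$imag_44; */
}
```

Calling stores_left_column with t42 = -ctmp_re and stores_right_column with
ctmp_re yields identical results, so the difference is purely in the
add/subtract mix that SLP discovery sees.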

I'll note that complex lowering produces what GCC 13 has in the end, and
it seems that PRE is what produces the "desired" IL:

Inserted _107 = -_42;
Replaced _6 * 8.660254037844385965883020617184229195117950439453125e-1 with _107 in all uses of ctmp$real_43 = _6 * 8.660254037844385965883020617184229195117950439453125e-1;
gimple_simplified to _108 = d1$real_51 - _42;
_55 = _108;
gimple_simplified to _57 = _42 + d1$real_51;
Removing dead stmt ctmp$real_43 = _6 * 8.660254037844385965883020617184229195117950439453125e-1;

before PRE we have

  _41 = _20 - _4;
  _42 = _41 * 8.660254037844385965883020617184229195117950439453125e-1;
  _6 = _4 - _20;
  ctmp$real_43 = _6 * 8.660254037844385965883020617184229195117950439453125e-1;

GCC 13 seems to perform the same value numbering but in the end doesn't
insert.  This is because _42 is dead (also with GCC 12), so we don't
want to make it live again by expressing ctmp$real_43 as -_42, as that
wouldn't be profitable.  That was added by r13-6834-g41ade3399bd1ec on purpose.

From complex lowering we had

  _41 = _20 - _35;
  _42 = _41 * 8.660254037844385965883020617184229195117950439453125e-1;
  ctmp$real_43 = -_42;

and forwprop rightfully turned that into

  _7 = _35 - _20;
  ctmp$real_43 = _7 * 8.660254037844385965883020617184229195117950439453125e-1;

and PRE undid this in GCC 12, which the change now prohibits.
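
To make the two shapes concrete, here is a minimal C sketch of the
"multiply then negate" form from complex lowering next to the folded form
from forwprop.  The function and variable names are hypothetical stand-ins
for the SSA names, and k stands for the 0.866... constant.  With default
IEEE rounding the two forms compare equal (they can differ only in the sign
of a zero result), so this says nothing about which flags GCC requires for
the fold, only that the shapes compute the same value.

```c
#include <assert.h>

/* Shape produced by complex lowering: subtract, multiply, then negate. */
double ctmp_real_lowered(double v20, double v35, double k)
{
    double v41 = v20 - v35;   /* _41 = _20 - _35;      */
    double v42 = v41 * k;     /* _42 = _41 * k;        */
    return -v42;              /* ctmp$real_43 = -_42;  */
}

/* Shape after forwprop: the negation is folded into the subtraction. */
double ctmp_real_folded(double v20, double v35, double k)
{
    double v7 = v35 - v20;    /* _7 = _35 - _20;           */
    return v7 * k;            /* ctmp$real_43 = _7 * k;    */
}
```

The folded form saves a negation, but it also removes the shared _42 value
that the other statements could have been expressed in terms of.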

In this case the simplification prevents SLP vectorization, and we
currently can't recover from it.
