[Bug tree-optimization/114324] [13/14 Regression] AVX2 vectorisation performance regression with gfortran 13/14
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114324

--- Comment #4 from mjr19 at cam dot ac.uk ---
Created attachment 57713
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=57713=edit
Second testcase, very similar to first

Thank you for looking into this. The real code in question has more than one loop which suffers a slow-down with gfortran 13/14 when compared to 12, and I suspect it is the same underlying issue in all cases. I attach another test case, which seems very similar.

The odd logic surrounding the initialisation of ci is to replicate the fact that in the real code the sign of ci depends on an argument which I have dropped, and so the compiler cannot optimise it away completely.

For this case, gfortran 12 and ifort produce very similar performance, gfortran 13 is over 20% slower, and ifx slower still.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114324

--- Comment #3 from Richard Biener ---
So the missed feature is to implement swapping operands of a MINUS_EXPR during SLP discovery by introducing a conditional negate (for example by multiplying with { 1, -1 } or with two_operator negate, "nop" and blend).

Note that with GCC 12 and the +- mixed op we are able to use vaddsubpd, which in the end is likely the perfect code gen for the testcase. I'm not sure it's easy to get back to that with the "more optimized" scalar IL.

I'll note the negate could also be consumed by the constant in the multiplication.
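As a rough illustration of the conditional-negate idea in comment #3, here is a lane-wise model in plain Python. It is not GCC code: the helper names (`sub_lanes`, `mul_lanes`, `mixed_minus`, and so on) and the two-lane layout are assumptions made for this sketch. It shows that a pair of subtractions whose second lane has swapped operands gives the same values as one uniform subtraction followed by a multiply with { 1, -1 }, and that the sign vector can be folded into a constant multiplier.

```python
# Illustrative lane-wise model only; none of these helpers exist in GCC.

def sub_lanes(a, b):
    """Uniform lane-wise subtraction a - b."""
    return [x - y for x, y in zip(a, b)]


def mul_lanes(a, b):
    """Uniform lane-wise multiplication a * b."""
    return [x * y for x, y in zip(a, b)]


def mixed_minus(a, b):
    """Two scalar statements; the second lane has its operands swapped."""
    return [a[0] - b[0], b[1] - a[1]]


def uniform_minus_with_negate(a, b):
    """One uniform subtraction, then a conditional negate via { 1, -1 }."""
    return mul_lanes(sub_lanes(a, b), [1.0, -1.0])


def negate_folded_into_constant(a, b, c):
    """The negate is absorbed by the constant multiplier: { c, -c }."""
    return mul_lanes(sub_lanes(a, b), [c, -c])


a = [5.0, 7.0]
b = [2.0, 3.0]
print(mixed_minus(a, b))                # [3.0, -4.0]
print(uniform_minus_with_negate(a, b))  # [3.0, -4.0]
```

The sketch only checks value equivalence on sample inputs; whether the vectorizer can perform such a rewrite profitably is the open question in the comment.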
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114324

Richard Biener changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
   Target Milestone|12.4                        |13.3
                 CC|                            |rguenth at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114324

Richard Biener changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
           Priority|P3                          |P2

--- Comment #2 from Richard Biener ---
GCC 12 manages to fully SLP the loop resulting in a vectorization factor of two while GCC 13 ends up with hybrid SLP and a vectorization factor of four. The IL into the vectorizer is almost the same besides

  REALPART_EXPR <(*a_28(D))[_13]> = _53;   | REALPART_EXPR <(*a_28(D))[_12]> = _53;
  IMAGPART_EXPR <(*a_28(D))[_13]> = _54;   | IMAGPART_EXPR <(*a_28(D))[_12]> = _54;
  _108 = d1$real_51 - _42;                 | _55 = ctmp$real_43 + d1$real_51;
  _56 = ctmp$imag_44 + d1$imag_52;           _56 = ctmp$imag_44 + d1$imag_52;
  REALPART_EXPR <(*a_28(D))[_5]> = _108;   | REALPART_EXPR <(*a_28(D))[_6]> = _55;
  IMAGPART_EXPR <(*a_28(D))[_5]> = _56;    | IMAGPART_EXPR <(*a_28(D))[_6]> = _56;
  _57 = _42 + d1$real_51;                  | _57 = d1$real_51 - ctmp$real_43;
  _58 = d1$imag_52 - ctmp$imag_44;           _58 = d1$imag_52 - ctmp$imag_44;
  REALPART_EXPR <(*a_28(D))[_9]> = _57;      REALPART_EXPR <(*a_28(D))[_9]> = _57;
  IMAGPART_EXPR <(*a_28(D))[_9]> = _58;      IMAGPART_EXPR <(*a_28(D))[_9]> = _58;

where in GCC 13 we ended up destroying the nice complex pattern by merging the negation into the defining stmts which causes SLP discovery to fail there:

t.f90:11:7: note:   Build SLP for _35 = IMAGPART_EXPR <(*a_28(D))[_9]>;
t.f90:11:7: note:   precomputed vectype: vector(4) real(kind=8)
t.f90:11:7: note:   nunits = 4
t.f90:11:7: note:   Build SLP for _21 = REALPART_EXPR <(*a_28(D))[_6]>;
t.f90:11:7: note:   precomputed vectype: vector(4) real(kind=8)
t.f90:11:7: note:   nunits = 4
t.f90:11:7: missed:   Build SLP failed: different interleaving chains in one node _21 = REALPART_EXPR <(*a_28(D))[_6]>;

since we got there from

t.f90:11:7: note:   Build SLP for _7 = _35 - _20;
t.f90:11:7: note:   precomputed vectype: vector(4) real(kind=8)
t.f90:11:7: note:   nunits = 4
t.f90:11:7: note:   Build SLP for _40 = _21 - _36;
t.f90:11:7: note:   precomputed vectype: vector(4) real(kind=8)

there's nothing to "swap".

I'll note that complex lowering produces what GCC 13 has in the end, and it seems that PRE is what produces the "desired" IL:

Inserted _107 = -_42;
Replaced _6 * 8.660254037844385965883020617184229195117950439453125e-1 with _107 in all uses of ctmp$real_43 = _6 * 8.660254037844385965883020617184229195117950439453125e-1;
gimple_simplified to _108 = d1$real_51 - _42;
_55 = _108;
gimple_simplified to _57 = _42 + d1$real_51;
Removing dead stmt ctmp$real_43 = _6 * 8.660254037844385965883020617184229195117950439453125e-1;

before PRE we have

  _41 = _20 - _4;
  _42 = _41 * 8.660254037844385965883020617184229195117950439453125e-1;
  _6 = _4 - _20;
  ctmp$real_43 = _6 * 8.660254037844385965883020617184229195117950439453125e-1;

GCC 13 seems to perform the same value numbering but in the end doesn't insert. This is because _42 is dead (also with GCC 12), so we don't want to make it live again by expressing _43 as -_42, as that wouldn't be profitable. That was added by r13-6834-g41ade3399bd1ec on purpose.

From complex lowering we had

  _41 = _20 - _35;
  _42 = _41 * 8.660254037844385965883020617184229195117950439453125e-1;
  ctmp$real_43 = -_42;

and forwprop rightfully turned that into

  _7 = _35 - _20;
  ctmp$real_43 = _7 * 8.660254037844385965883020617184229195117950439453125e-1;

and PRE undid this in GCC 12, which the change now prohibits. In this case this simplification is prohibitive to SLP vectorization and we can't at the moment recover from it.
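The two forms of ctmp$real_43 quoted in comment #2 can be compared with a small Python sketch. The function names and sample inputs are assumptions made for illustration, and the temporaries are only named after the SSA names in the dumps. It evaluates the complex-lowering form (subtract, multiply, negate) and the forwprop form (swapped subtract, multiply) and shows that they produce the same value, which is why the remaining difference is the shape of the statements that SLP discovery sees.

```python
# Illustrative scalar model of the two GIMPLE sequences; not GCC code.

# The multiplier as printed in the dumps (the double nearest sqrt(3)/2).
K = float("8.660254037844385965883020617184229195117950439453125e-1")


def complex_lowering_form(v20, v35):
    """_41 = _20 - _35; _42 = _41 * K; ctmp$real_43 = -_42;"""
    t41 = v20 - v35
    t42 = t41 * K
    return -t42


def forwprop_form(v20, v35):
    """_7 = _35 - _20; ctmp$real_43 = _7 * K;"""
    t7 = v35 - v20
    return t7 * K


# Hypothetical sample inputs.
for v20, v35 in [(1.5, 4.25), (-3.0, 0.125), (2.0, 2.0)]:
    print(complex_lowering_form(v20, v35) == forwprop_form(v20, v35))
```

Note that the two forms can differ in the sign of a zero result when both inputs are equal; Python's `==` treats -0.0 and 0.0 as equal, so the sketch does not distinguish them.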
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114324

Andrew Pinski changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
     Ever confirmed|0                           |1
   Target Milestone|---                         |12.4
   Last reconfirmed|                            |2024-03-13
             Status|UNCONFIRMED                 |NEW
            Summary|AVX2 vectorisation          |[13/14 Regression] AVX2
                   |performance regression with |vectorisation performance
                   |gfortran 13/14              |regression with gfortran
                   |                            |13/14
             Blocks|                            |53947
          Component|target                      |tree-optimization

--- Comment #1 from Andrew Pinski ---
Definitely there are some vectorization changes happening. Confirmed.

Referenced Bugs:

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=53947
[Bug 53947] [meta-bug] vectorizer missed-optimizations