Re: [PATCH] tree-optimization/123190 - allow VF == 1 epilog vectorization

Torbjorn SVENSSON Tue, 03 Mar 2026 04:25:55 -0800



On 2026-03-03 13:19, Richard Biener wrote:

On Tue, 3 Mar 2026, Torbjorn SVENSSON wrote:

Hi Richard,

This patch causes a regression for arm-none-eabi (log snippet is from a more
recent build of GCC):

Testing complex/fast-math-complex-add-pattern-half-float.c
doing compile
Executing on host: /build/r16-7849-g1f9879e17466f5/bin/arm-none-eabi-gcc
/build/gcc_src/gcc/testsuite/gcc.dg/vect/complex/fast-math-complex-add-pattern-half-float.c
-mthumb -march=armv6s-m -mtune=cortex-m0 -mfloat-abi=soft -mfpu=auto
-fdiagnostics-plain-output   -mfloat-abi=softfp -mcpu=unset
-march=armv7-a+simd -mfpu=auto -ffast-math -ftree-vectorize
-fno-tree-loop-distribute-patterns -fno-vect-cost-model -fno-common -O2
-fdump-tree-vect-details -ffast-math -mfloat-abi=softfp -mfpu=auto -mcpu=unset
-march=armv8.3-a+fp16+simd -S     -o
fast-math-complex-add-pattern-half-float.s    (timeout = 800)
spawn -ignore SIGHUP /build/r16-7849-g1f9879e17466f5/bin/arm-none-eabi-gcc
/build/gcc_src/gcc/testsuite/gcc.dg/vect/complex/fast-math-complex-add-pattern-half-float.c
-mthumb -march=armv6s-m -mtune=cortex-m0 -mfloat-abi=soft -mfpu=auto
-fdiagnostics-plain-output -mfloat-abi=softfp -mcpu=unset -march=armv7-a+simd
-mfpu=auto -ffast-math -ftree-vectorize -fno-tree-loop-distribute-patterns
-fno-vect-cost-model -fno-common -O2 -fdump-tree-vect-details -ffast-math
-mfloat-abi=softfp -mfpu=auto -mcpu=unset -march=armv8.3-a+fp16+simd -S -o
fast-math-complex-add-pattern-half-float.s
pid is 2955587 -2955587
pid is -1
output is  status 0
PASS: gcc.dg/vect/complex/fast-math-complex-add-pattern-half-float.c (test for
excess errors)
gcc.dg/vect/complex/fast-math-complex-add-pattern-half-float.c: pattern found
4 times
FAIL: gcc.dg/vect/complex/fast-math-complex-add-pattern-half-float.c
scan-tree-dump-times vect "add new stmt: [^\n\r]*COMPLEX_ADD_ROT90" 3
PASS: gcc.dg/vect/complex/fast-math-complex-add-pattern-half-float.c
scan-tree-dump-times vect "add new stmt: [^\n\r]*COMPLEX_ADD_ROT270" 1
PASS: gcc.dg/vect/complex/fast-math-complex-add-pattern-half-float.c
scan-tree-dump vect "Found COMPLEX_ADD_ROT270"
PASS: gcc.dg/vect/complex/fast-math-complex-add-pattern-half-float.c
scan-tree-dump vect "Found COMPLEX_ADD_ROT90"

Should the test be updated to accept 4 pattern matches or is it wrong to have
4 matches instead of 3?


We see epilogue vectorization here.  The best way forward is probably
to add --param vect-epilogues-nomask=0 to dg-additional-options in that
testcase.  I verified this fixes the issue on aarch64 with
-march=armv8.3-a.

Can you test arm-none-eabi?  arm and aarch64 are the only targets
enabled by vect_complex_add_half.


Adding "--param vect-epilogues-nomask=0" to dg-additional-options works fine 
for the targets that I test on arm-none-eabi.
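
For reference, a sketch of the directive being discussed, mirroring the line that the original patch added to slp-28.c. Where exactly it goes among the existing directives of fast-math-complex-add-pattern-half-float.c is an assumption; that file's current header is not shown in this thread.

```c
/* Assumed placement: near the top of the testcase, next to its other
   dg- directives.  The option disables vectorization of unmasked
   epilogue loops, which is what produced the extra pattern match.  */
/* { dg-additional-options "--param vect-epilogues-nomask=0" } */
```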

Do you want me to send a patch with this or will you handle it?

Kind regards,
Torbjörn


Richard.

Kind regards,
Torbjörn

On 2026-01-14 12:52, Richard Biener wrote:

The following adjusts the condition where we reject vectorization
because the scalar loop runs only for a single iteration (or two,
in case we need to peel for gaps).  This rejection is over-eager
for VF == 1, where the cost model should instead decide whether
vectorization is worthwhile or not.  I'm being conservative here
and still exclude the case of two iterations, as I do not have
benchmark evidence.
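
The adjusted condition can be modelled as a standalone predicate. This is only an illustrative sketch: the names `reject_for_few_iterations` and `check_examples` and the plain `unsigned` parameters are hypothetical, while the real check lives in vect_analyze_loop_costing and works on a loop_vec_info.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of the rejection test from the patch.  Returns true
   when vectorization would be rejected because the scalar loop runs for
   too few iterations.  */
static bool
reject_for_few_iterations (unsigned scalar_niters, unsigned peeling_gap,
                           unsigned assumed_vf)
{
  return scalar_niters <= peeling_gap + 1
         && (assumed_vf > 1 || peeling_gap != 0);
}

/* A few cases spelled out.  */
static void
check_examples (void)
{
  /* Single iteration, VF == 1, no peeling for gaps: now allowed.  */
  assert (!reject_for_few_iterations (1, 0, 1));
  /* Single iteration with VF > 1: still rejected.  */
  assert (reject_for_few_iterations (1, 0, 4));
  /* Two iterations with peeling for gaps, VF == 1: still rejected.  */
  assert (reject_for_few_iterations (2, 1, 1));
  /* Enough iterations: this check does not reject.  */
  assert (!reject_for_few_iterations (3, 0, 4));
}
```

Only the first case changes behaviour relative to the old condition, which rejected every loop with `scalar_niters <= peeling_gap + 1`.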

This helps fix a regression observed with improved SLP handling,
though not exactly for the options used in the PR; with the more
common -O3 -march=x86-64-v3 it speeds up 433.milc by 6%.

Bootstrapped and tested on x86_64-unknown-linux-gnu, will push later.

  PR tree-optimization/123190
  * tree-vect-loop.cc (vect_analyze_loop_costing): Allow
  vectorizing loops with a single scalar iteration iff the
  vectorization factor is 1.

  * gcc.dg/vect/costmodel/x86_64/costmodel-pr123190-1.c: New testcase.
  * gcc.dg/vect/slp-28.c: Avoid epilogue vectorization for
  simplicity.
---
   .../costmodel/x86_64/costmodel-pr123190-1.c   | 38 +++++++++++++++++++
   gcc/testsuite/gcc.dg/vect/slp-28.c            |  1 +
   gcc/tree-vect-loop.cc                         |  8 +++-
   3 files changed, 45 insertions(+), 2 deletions(-)
   create mode 100644
   gcc/testsuite/gcc.dg/vect/costmodel/x86_64/costmodel-pr123190-1.c

diff --git a/gcc/testsuite/gcc.dg/vect/costmodel/x86_64/costmodel-pr123190-1.c b/gcc/testsuite/gcc.dg/vect/costmodel/x86_64/costmodel-pr123190-1.c
new file mode 100644
index 00000000000..4265ac80a43
--- /dev/null
+++ b/gcc/testsuite/gcc.dg/vect/costmodel/x86_64/costmodel-pr123190-1.c
@@ -0,0 +1,38 @@
+/* { dg-do compile } */
+/* { dg-additional-options "-O3 -mavx2 -mno-avx512f -mtune=generic" } */
+
+typedef struct {
+   double real;
+   double imag;
+} complex;
+
+typedef struct { complex e[3][3]; } su3_matrix;
+
+void mult_su3_na( su3_matrix *a, su3_matrix *b, su3_matrix *c ){
+int i,j;
+register double t,ar,ai,br,bi,cr,ci;
+    for(i=0;i<3;i++)
+      for(j=0;j<3;j++){
+
+        ar=a->e[i][0].real; ai=a->e[i][0].imag;
+        br=b->e[j][0].real; bi=b->e[j][0].imag;
+        cr=ar*br; t=ai*bi; cr += t;
+        ci=ai*br; t=ar*bi; ci -= t;
+
+        ar=a->e[i][1].real; ai=a->e[i][1].imag;
+        br=b->e[j][1].real; bi=b->e[j][1].imag;
+        t=ar*br; cr += t; t=ai*bi; cr += t;
+        t=ar*bi; ci -= t; t=ai*br; ci += t;
+
+        ar=a->e[i][2].real; ai=a->e[i][2].imag;
+        br=b->e[j][2].real; bi=b->e[j][2].imag;
+        t=ar*br; cr += t; t=ai*bi; cr += t;
+        t=ar*bi; ci -= t; t=ai*br; ci += t;
+
+        c->e[i][j].real=cr;
+        c->e[i][j].imag=ci;
+    }
+}
+
+/* { dg-final { scan-tree-dump "optimized: loop vectorized using 32" "vect" } } */
+/* { dg-final { scan-tree-dump "optimized: epilogue loop vectorized using 16 byte vectors and unroll factor 1" "vect" } } */
diff --git a/gcc/testsuite/gcc.dg/vect/slp-28.c b/gcc/testsuite/gcc.dg/vect/slp-28.c
index 1f987874f0d..bf6271eed25 100644
--- a/gcc/testsuite/gcc.dg/vect/slp-28.c
+++ b/gcc/testsuite/gcc.dg/vect/slp-28.c
@@ -1,4 +1,5 @@
   /* { dg-require-effective-target vect_int } */
+/* { dg-additional-options "--param vect-epilogues-nomask=0" } */

#include <stdarg.h>

   #include "tree-vect.h"
diff --git a/gcc/tree-vect-loop.cc b/gcc/tree-vect-loop.cc
index 74eecb832e6..fdf544fa47b 100644
--- a/gcc/tree-vect-loop.cc
+++ b/gcc/tree-vect-loop.cc
@@ -1792,9 +1792,13 @@ vect_analyze_loop_costing (loop_vec_info loop_vinfo,
        }
    }
         /* Reject vectorizing for a single scalar iteration, even if
-        we could in principle implement that using partial vectors.  */
+        we could in principle implement that using partial vectors.
+        But allow such vectorization if VF == 1 in case we do not
+        need to peel for gaps (if we need, avoid vectorization for
+        reasons of code footprint).  */
         unsigned peeling_gap = LOOP_VINFO_PEELING_FOR_GAPS (loop_vinfo);
-      if (scalar_niters <= peeling_gap + 1)
+      if (scalar_niters <= peeling_gap + 1
+         && (assumed_vf > 1 || peeling_gap != 0))
    {
      if (dump_enabled_p ())
        dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
