r16-3394-g28ab83367e8710a78fffa2513e6e008ebdfbee3e added a cost model adjustment
to detect invariant load-and-replicate cases when doing outer-loop vectorization
where the inner loop uses a value defined in the outer loop.  In other words,
it tries to detect the cases where the inner loop would need to do an ld1r and
all inputs then work on replicated values.  The argument is that in this case
the vector loop is effectively just the scalar loop, since each lane works on
the same duplicated values.
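
As an illustration, here is a minimal sketch of the shape being detected
(hypothetical code, not taken from the PR), assuming the i loop is the one
being vectorized:

  void
  f (double *__restrict__ a, double *__restrict__ b, double *__restrict__ c,
     int n, int m)
  {
    for (int i = 0; i < n; i++)
      {
        double s = 0.0;
        for (int j = 0; j < m; j++)
          /* b[j] and c[j] do not depend on i, so when the i loop is
             vectorized each becomes a scalar load broadcast to all lanes
             (ld1r) and every lane ends up doing the same work.  */
          s += b[j] * c[j];
        a[i] = s;
      }
  }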
But it had two shortcomings.
1. It's an all-or-nothing check.  The load-and-replicate statements may only be
a small percentage of the data being processed.  This patch therefore requires
the load-and-replicate statements to make up at least 50% of the leaves of the
SLP tree.  Ideally we'd only increase the body cost by VF * invariant leaves,
but we can't, since the middle-end cost model applies a rather large penalty to
the scalar code (* 50) and as such the base cost ends up being too high and we
just never vectorize.  The 50% is an attempt to strike a balance in this
awkward situation.  Experiments show it works reasonably well and we get the
right codegen in all the test cases.
2. It does not take into account that a load + replicate whose vector value is
used in a by-index operation will result in the load being decomposed back to
scalar.  e.g.

  ld1r {v0.4s}, [x0]
  mul v1.4s, v2.4s, v0.4s

is transformed into

  ldr s0, [x0]
  mul v1.4s, v2.4s, v0.s[0]

and as such this case may actually be profitable, because we're only doing a
scalar load of a single element, similar to the scalar loop.
This patch tries to detect such cases (loosely) and doesn't apply the penalty
to them.  It's hard to tell this early whether we'll end up with a by-index
operation, since the vectorizer itself is not aware of them, so the patch does
not do an exhaustive check but only handles the most obvious case; a
source-level sketch of the shape in question follows below.
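
As a source-level sketch of case 2 (again hypothetical, not one of the
included tests), consider an inner-loop invariant value that only feeds a
multiply:

  void
  g (float *__restrict__ out, float *__restrict__ in,
     float *__restrict__ scale, int n, int m)
  {
    /* When the i loop is vectorized, in[j * n + i] and out[j * n + i] are
       contiguous vector accesses across the lanes, while scale[j] is the
       same value for every lane.  Because the broadcast only feeds a
       multiply, it can be folded into a by-lane fmul, leaving a plain
       scalar ldr of the single element instead of an ld1r.  */
    for (int i = 0; i < n; i++)
      for (int j = 0; j < m; j++)
        out[j * n + i] = in[j * n + i] * scale[j];
  }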
Bootstrapped and regtested on aarch64-none-linux-gnu with no issues.
Ok for master?  Will commit after a week if there are no comments.
Thanks,
Tamar
gcc/ChangeLog:
PR target/121290
* config/aarch64/aarch64.cc (aarch64_possible_by_lane_insn_p): New.
(aarch64_vector_costs): Add m_num_dup_stmts and m_num_total_stmts.
(aarch64_vector_costs::add_stmt_cost): Use them.
(adjust_body_cost): Likewise.
gcc/testsuite/ChangeLog:
PR target/121290
* gcc.target/aarch64/pr121290.c: Move to...
* gcc.target/aarch64/pr121290_1.c: ...here.
* g++.target/aarch64/pr121290_1.C: New test.
* gcc.target/aarch64/pr121290_2.c: New test.
---
diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc
index 8b7253b11fef1b7e0cd647251436da19ee177f0d..a6f75206f4d837cb8ec8a55641b0645a78d90d6a 100644
--- a/gcc/config/aarch64/aarch64.cc
+++ b/gcc/config/aarch64/aarch64.cc
@@ -17378,6 +17378,15 @@ private:
support unrolling of the inner loop independently from the outerloop during
outer-loop vectorization which tends to lead to pipeline bubbles. */
bool m_loop_fully_scalar_dup = false;
+
+ /* If m_loop_fully_scalar_dup is true then this variable contains the number
+ of statements we estimate to be duplicates. We only increase the cost of
+ the seeds because we don't want to overly pessimize the loops. */
+ uint64_t m_num_dup_stmts = 0;
+
+ /* If m_loop_fully_scalar_dup is true then this contains the total number of
+ leaf stmts found in the SLP tree. */
+ uint64_t m_num_total_stmts = 0;
};
aarch64_vector_costs::aarch64_vector_costs (vec_info *vinfo,
@@ -18367,6 +18376,43 @@ aarch64_stp_sequence_cost (unsigned int count, vect_cost_for_stmt kind,
}
}
+/* Determine probabilistically whether STMT is one that could possibly be
+ made into a by-lane operation later on. We can't be sure, but certain
+ operations have a higher chance. */
+
+static bool
+aarch64_possible_by_lane_insn_p (vec_info *m_vinfo, gimple *stmt)
+{
+ if (!gimple_has_lhs (stmt))
+ return false;
+
+ use_operand_p use_p;
+ imm_use_iterator iter;
+ tree var = gimple_get_lhs (stmt);
+ FOR_EACH_IMM_USE_FAST (use_p, iter, var)
+ {
+ gimple *new_stmt = USE_STMT (use_p);
+ auto stmt_info = vect_stmt_to_vectorize (m_vinfo->lookup_stmt (new_stmt));
+ auto rep_stmt = STMT_VINFO_STMT (stmt_info);
+ /* Re-association is a problem here, since the lane operand is only
+ supported as the last operand, so we put duplicate operands
+ last. We could check the other operand for invariance, but it may not
+ be an outer-loop defined invariant. For now just checking the last
+ operand catches all the cases and we can expand if needed. */
+ if (is_gimple_assign (rep_stmt))
+ switch (gimple_assign_rhs_code (rep_stmt))
+ {
+ case MULT_EXPR:
+ return operand_equal_p (gimple_assign_rhs2 (new_stmt), var, 0);
+ case DOT_PROD_EXPR:
+ return operand_equal_p (gimple_assign_rhs3 (new_stmt), var, 0);
+ default:
+ continue;
+ }
+ }
+ return false;
+}
+
unsigned
aarch64_vector_costs::add_stmt_cost (int count, vect_cost_for_stmt kind,
stmt_vec_info stmt_info, slp_tree node,
@@ -18399,7 +18445,10 @@ aarch64_vector_costs::add_stmt_cost (int count, vect_cost_for_stmt kind,
analyze_loop_vinfo (loop_vinfo);
m_analyzed_vinfo = true;
- if (in_inner_loop_p)
+
+ /* We should only apply the heuristic for invariant values to the
+ innermost loop of a nested loop nest. */
+ if (in_inner_loop_p && loop_vinfo)
m_loop_fully_scalar_dup = true;
}
@@ -18408,17 +18457,21 @@ aarch64_vector_costs::add_stmt_cost (int count, vect_cost_for_stmt kind,
try to vectorize. */
if (in_inner_loop_p
&& node
- && m_loop_fully_scalar_dup
&& SLP_TREE_LANES (node) == 1
&& !SLP_TREE_CHILDREN (node).exists ())
{
- /* Check if load is a duplicate. */
- if (gimple_vuse (stmt_info->stmt)
- && SLP_TREE_MEMORY_ACCESS_TYPE (node) == VMAT_INVARIANT)
- ;
- else if (SLP_TREE_DEF_TYPE (node) == vect_constant_def
- || SLP_TREE_DEF_TYPE (node) == vect_external_def)
- ;
+ m_num_total_stmts++;
+ gimple *stmt = STMT_VINFO_STMT (stmt_info);
+ /* Check if the load is invariant and will be duplicated inside the
+ current loop. */
+ if (gimple_vuse (stmt)
+ && SLP_TREE_MEMORY_ACCESS_TYPE (node) == VMAT_INVARIANT
+ && !aarch64_possible_by_lane_insn_p (m_vinfo, stmt))
+ m_num_dup_stmts++;
+ else if ((SLP_TREE_DEF_TYPE (node) == vect_constant_def
+ || SLP_TREE_DEF_TYPE (node) == vect_external_def)
+ && !aarch64_possible_by_lane_insn_p (m_vinfo, stmt))
+ m_num_dup_stmts++;
else
m_loop_fully_scalar_dup = false;
}
@@ -18788,8 +18841,16 @@ adjust_body_cost (loop_vec_info loop_vinfo,
threshold = CEIL (threshold, aarch64_estimated_sve_vq ());
/* Increase the cost of the vector code if it looks like the vector code has
- limited throughput due to outer-loop vectorization. */
- if (m_loop_fully_scalar_dup)
+ limited throughput due to outer-loop vectorization. As a rough estimate we
+ require at least half the operations to be duplicate expressions. This is
+ an attempt to strike a balance between scalar and vector costing with
+ respect to outer-loop vectorization. The vectorizer applies a rather large
+ penalty to scalar costing when doing outer-loop vectorization (see
+ LOOP_VINFO_INNER_LOOP_COST_FACTOR) and because of this accurate costing
+ becomes rather hard. The 50% here allows us to amortize the cost on longer
+ loop bodies where the majority of the inputs are not a broadcast. */
+ if (m_loop_fully_scalar_dup
+ && (m_num_dup_stmts * 2 >= m_num_total_stmts))
{
body_cost *= estimated_vf;
if (dump_enabled_p ())
diff --git a/gcc/testsuite/g++.target/aarch64/pr121290_1.C b/gcc/testsuite/g++.target/aarch64/pr121290_1.C
new file mode 100644
index 0000000000000000000000000000000000000000..e16d773a7840e969883a3df35da89e1eea89a9fc
--- /dev/null
+++ b/gcc/testsuite/g++.target/aarch64/pr121290_1.C
@@ -0,0 +1,63 @@
+/* { dg-do compile } */
+/* { dg-additional-options "-O3 -mcpu=neoverse-v2 -fdump-tree-vect-all -w" } */
+
+#include <cstddef>
+#include <cstdint>
+
+std::ptrdiff_t getRunReps ();
+
+double * get_rp ();
+std::ptrdiff_t get_it ();
+
+void
+runSeqVariant ()
+{
+ const std::ptrdiff_t run_reps = getRunReps ();
+
+ double *__restrict__ B = get_rp ();
+ double *__restrict__ D = get_rp ();
+ double *__restrict__ M = get_rp ();
+ std::ptrdiff_t NE = get_it ();
+
+ for (volatile int irep = 0; irep < run_reps; ++irep)
+ {
+ for (int e = 0; e < NE; ++e)
+ {
+ double s_B[5][4];
+
+ for (int d = 0; d < 4; d++)
+ for (int q = 0; q < 5; q++)
+ s_B[q][d] = B[q + 5 * d];
+
+ double s_D[5][5][5];
+
+ for (int k1 = 0; k1 < 5; k1++)
+ for (int k2 = 0; k2 < 5; k2++)
+ for (int k3 = 0; k3 < 5; k3++)
+ s_D[k1][k2][k3] = D[k1 + 5 * k2 + 5 * 5 * k3 + 5 * 5 * 5 * e];
+
+ for (int i1 = 0; i1 < 4; i1++)
+ for (int i2 = 0; i2 < 4; i2++)
+ for (int i3 = 0; i3 < 4; i3++)
+ for (int j1 = 0; j1 < 4; ++j1)
+ for (int j2 = 0; j2 < 4; ++j2)
+ for (int j3 = 0; j3 < 4; ++j3)
+ {
+ double val = 0.0;
+ for (int k1 = 0; k1 < 5; ++k1)
+ for (int k2 = 0; k2 < 5; ++k2)
+ for (int k3 = 0; k3 < 5; ++k3)
+ val += s_B[k1][i1] * s_B[k1][j1] * s_B[k2][i2]
+ * s_B[k2][j2] * s_B[k3][i3] * s_B[k3][j3]
+ * s_D[k1][k2][k3];
+ // clang-format off
+ M[i1 + 4 * (i2 + 4 * (i3 + 4 * (j1 + 4 * (j2 + 4 * (j3 + 4 * e)))))] = val;
+ //clang-format on
+ }
+ }
+ }
+}
+
+/* { dg-final { scan-tree-dump "LOOP VECTORIZED" "vect" } } */
+/* { dg-final { scan-tree-dump "OUTER LOOP VECTORIZED" "vect" } } */
+/* { dg-final { scan-tree-dump-not "low throughput of per iteration due to splats" "vect" } } */
diff --git a/gcc/testsuite/gcc.target/aarch64/pr121290.c b/gcc/testsuite/gcc.target/aarch64/pr121290_1.c
similarity index 100%
rename from gcc/testsuite/gcc.target/aarch64/pr121290.c
rename to gcc/testsuite/gcc.target/aarch64/pr121290_1.c
diff --git a/gcc/testsuite/gcc.target/aarch64/pr121290_2.c b/gcc/testsuite/gcc.target/aarch64/pr121290_2.c
new file mode 100644
index 0000000000000000000000000000000000000000..c39a01540e7c885840b92f6475fd2d8495fe5c4c
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/pr121290_2.c
@@ -0,0 +1,26 @@
+/* { dg-do compile } */
+/* { dg-additional-options "-O3 -mcpu=neoverse-v2 -fdump-tree-vect-all -std=c99" } */
+
+#define iterations 100000
+#define LEN_1D 32000
+
+float a[LEN_1D];
+
+int main()
+{
+ float x;
+ for (int nl = 0; nl < iterations; nl++) {
+ x = a[0];
+ for (int i = 0; i < LEN_1D; ++i) {
+ if (a[i] > x) {
+ x = a[i];
+ }
+ }
+ }
+
+ return x > 1;
+}
+
+/* { dg-final { scan-tree-dump-not "LOOP VECTORIZED" "vect" } } */
+/* { dg-final { scan-tree-dump-not "OUTER LOOP VECTORIZED" "vect" } } */
+/* { dg-final { scan-tree-dump "low throughput of per iteration due to splats" "vect" } } */