[PATCH 1/2]AArch64: Fix costing of scalar throughput based calculation for inner loops [PR121290]

Tamar Christina Wed, 20 Aug 2025 09:38:39 -0700

To avoid double counting scalar instructions when doing inner-loop costing the
vectorizer uses vect_prologue as the kind instead of vect_body.


However doing this results in our throughput based costing to think the scalar
loop issues in 0 cycles (rounded up to 1.0).  The latency costs however are fine

note:  operating on full vectors.
note:  Original vector body cost = 10
note:  Vector loop iterates at most 25000 times
note:  Scalar issue estimate:
note:    load operations = 0
note:    store operations = 0
note:    general operations = 0
note:    reduction latency = 0
note:    estimated min cycles per iteration = 1.000000
note:    estimated cycles per vector iteration (for VF 4) = 4.000000
note:  Vector issue estimate:
note:    load operations = 1
note:    store operations = 0
note:    general operations = 2
note:    reduction latency = 0
note:    estimated min cycles per iteration = 1.000000
note:  Cost model analysis:
  Vector inside of loop cost: 10
  Vector prologue cost: 4
  Vector epilogue cost: 0
  Scalar iteration cost: 6
  Scalar outside cost: 0
  Vector outside cost: 4
  prologue iterations: 0
  epilogue iterations: 0
  Calculated minimum iters for profitability: 2
note:    Runtime profitability threshold = 4
note:    Static estimate profitability threshold = 4

This patch, changes it so that when doing inner-loop costing consider all
statements for throughput costing.  Typically scalar shouldn't have an epilogue
cost but this also covers any future tweaks to the vectorizer costing where it
might be used as a similar trick for costing.  After this patch the throughputs
for scalar innner-loop vect are now correct:

note:  operating on full vectors.
note:  Original vector body cost = 500
note:  Vector loop iterates at most 25000 times
note:  Scalar issue estimate:
note:    load operations = 50
note:    store operations = 0
note:    general operations = 50
note:    reduction latency = 0
note:    estimated min cycles per iteration = 16.666667
note:    estimated cycles per vector iteration (for VF 4) = 66.666667
note:  Vector issue estimate:
note:    load operations = 1
note:    store operations = 0
note:    general operations = 2
note:    reduction latency = 0
note:    estimated min cycles per iteration = 1.000000

The cost multiplier of 50 is due to a generic multiplier the vectorizer applies
to inner loop costing.

Bootstrapped Regtested on aarch64-none-linux-gnu and no issues.

Ok for master?

Thanks,
Tamar

gcc/ChangeLog:

        PR target/121290
        * config/aarch64/aarch64.cc (aarch64_vector_costs::add_stmt_cost):
        Include prologue costing when costing inner loop scalar stmts.

gcc/testsuite/ChangeLog:

        PR target/121290
        * gcc.target/aarch64/pr121290.c: New test.

---
diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc
index 
fb8311b655d7300ce22215789437da777b3779ed..1278d563d5efa08b6a9871f69b63f0786877bf99
 100644
--- a/gcc/config/aarch64/aarch64.cc
+++ b/gcc/config/aarch64/aarch64.cc
@@ -18060,7 +18060,14 @@ aarch64_vector_costs::add_stmt_cost (int count, 
vect_cost_for_stmt kind,
   fractional_cost stmt_cost
     = aarch64_builtin_vectorization_cost (kind, vectype, misalign);
 
-  bool in_inner_loop_p = (where == vect_body
+  /* When costing for scalars the vectorizer passes them as vect_prologue cost
+     instead of vect_body to avoid double counting of scaled costs.  But 
because
+     they are passed as vect_prologue our inner loop detection thinks that all
+     the loop statements are outside the inner loop.  As such the issue rate
+     information for inner loops become zero for scalars.  Therefore accept any
+     scalar statement, relardless of the where as part of the innerloop
+     body.  */
+  bool in_inner_loop_p = ((where == vect_body || m_costing_for_scalar)
                          && stmt_info
                          && stmt_in_inner_loop_p (m_vinfo, stmt_info));
 
diff --git a/gcc/testsuite/gcc.target/aarch64/pr121290.c 
b/gcc/testsuite/gcc.target/aarch64/pr121290.c
new file mode 100644
index 
0000000000000000000000000000000000000000..c88cb7e6979ef43ebaf14c3d3f3c4c561bff3e76
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/pr121290.c
@@ -0,0 +1,17 @@
+/* { dg-do compile } */
+/* { dg-additional-options "-O3 -mcpu=neoverse-v2 -fdump-tree-vect-all 
-std=c99" } */
+
+void
+f (int *restrict x, int *restrict y, int *restrict z, int n)
+{
+  for (int i = 0; i < 4; ++i)
+    {
+      int res = 0;
+      for (int j = 0; j < 100; ++j)
+        res += y[j] * z[i];
+      x[i] = res;
+    }
+}
+
+/* { dg-final { scan-tree-dump "LOOP VECTORIZED" "vect" } } */
+/* { dg-final { scan-tree-dump-not "load operations = 0" "vect" } } */


--

diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc
index fb8311b655d7300ce22215789437da777b3779ed..1278d563d5efa08b6a9871f69b63f0786877bf99 100644
--- a/gcc/config/aarch64/aarch64.cc
+++ b/gcc/config/aarch64/aarch64.cc
@@ -18060,7 +18060,14 @@ aarch64_vector_costs::add_stmt_cost (int count, vect_cost_for_stmt kind,
   fractional_cost stmt_cost
     = aarch64_builtin_vectorization_cost (kind, vectype, misalign);
 
-  bool in_inner_loop_p = (where == vect_body
+  /* When costing for scalars the vectorizer passes them as vect_prologue cost
+     instead of vect_body to avoid double counting of scaled costs.  But because
+     they are passed as vect_prologue our inner loop detection thinks that all
+     the loop statements are outside the inner loop.  As such the issue rate
+     information for inner loops become zero for scalars.  Therefore accept any
+     scalar statement, relardless of the where as part of the innerloop
+     body.  */
+  bool in_inner_loop_p = ((where == vect_body || m_costing_for_scalar)
 			  && stmt_info
 			  && stmt_in_inner_loop_p (m_vinfo, stmt_info));
 
diff --git a/gcc/testsuite/gcc.target/aarch64/pr121290.c b/gcc/testsuite/gcc.target/aarch64/pr121290.c
new file mode 100644
index 0000000000000000000000000000000000000000..c88cb7e6979ef43ebaf14c3d3f3c4c561bff3e76
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/pr121290.c
@@ -0,0 +1,17 @@
+/* { dg-do compile } */
+/* { dg-additional-options "-O3 -mcpu=neoverse-v2 -fdump-tree-vect-all -std=c99" } */
+
+void
+f (int *restrict x, int *restrict y, int *restrict z, int n)
+{
+  for (int i = 0; i < 4; ++i)
+    {
+      int res = 0;
+      for (int j = 0; j < 100; ++j)
+        res += y[j] * z[i];
+      x[i] = res;
+    }
+}
+
+/* { dg-final { scan-tree-dump "LOOP VECTORIZED" "vect" } } */
+/* { dg-final { scan-tree-dump-not "load operations = 0" "vect" } } */

[PATCH 1/2]AArch64: Fix costing of scalar throughput based calculation for inner loops [PR121290]

Reply via email to