[PATCH]AArch64: account for load_lanes with gaps in costing [PR124866]

Tamar Christina Wed, 06 May 2026 06:02:34 -0700

The vectorizer when costing LOAD_LANES with gaps breaks the costing into
multiple calls where the first call wants to cost GAPS loads and second
GROUP_SIZE - GAPS loads. However:


1. Both VMAT types are LOAD_LANES, we we cost effectively
   GROUP_SIZE * LOAD_LANES.  This increases the latency costs immensely.

2. The GROUP_SIZE costing also increases the throughput costs * GROUP_SIZE,
   which has the knock on effect of increasing the costs for the latency again
   as the compiler applies a latency penalty due to the scalar code issuing
   quicker.

What the vectorizer is trying to cost is simply 1x LOAD_LANES but it's asking to
cost the invidual gaps as well.  However for an LDn operation this costs is
always fixed and is the cost of the LDn itself.

when there isn't any gaps we only cost 1x LOAD_LANES which is the expected
amount.  Unfortunately we can't really tell in the backend easily what chain of
costing belongs to any one operation and so treat them as individual ones.

This patch keeps track of the LOAD_LANES we've seen already when there is a gap
and

1. marks the cost of the first GAPS costing to 0.
2. costs the cost of the second GROUP_SIZE - GAPS loads as a single LOAD_LANES.

This is the best we can do until Richi's changes to cost operations like these
as a group.

Bootstrapped Regtested on aarch64-none-linux-gnu and no issues.

Any objections?

Thanks,
Tamar

gcc/ChangeLog:

        PR target/124866
        * config/aarch64/aarch64.cc
        (class aarch64_vector_costs): Add m_load_lanes.
        (aarch64_vector_costs::add_stmt_cost): Use it.

gcc/testsuite/ChangeLog:

        PR target/124866
        * gcc.target/aarch64/sve/cost_model_20.c: New test.
        * gcc.target/aarch64/sve/cost_model_21.c: New test.
        * gcc.target/aarch64/sve/cost_model_22.c: New test.
        * gcc.target/aarch64/sve/cost_model_23.c: New test.
        * gcc.target/aarch64/sve/pr124866.c: New test.

---
diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc
index 
37c28c8f2f8f58f6bb16774e8d5a788cdc800154..aeeef4aa47eb81e771b0bab5f5eb03b19800d16e
 100644
--- a/gcc/config/aarch64/aarch64.cc
+++ b/gcc/config/aarch64/aarch64.cc
@@ -17582,6 +17582,11 @@ private:
   /* If m_loop_fully_scalar_dup this contains the total number of leaf stmts
      found in the SLP tree.  */
   uint64_t m_num_total_stmts = 0;
+
+  /* Used when costing load lanes with gaps as we need to know whether we are
+     costing the first or the second occurrence.  (i.e.  the gap or the used
+     bits).  */
+  hash_set<stmt_vec_info> m_load_lanes;
 };
 
 aarch64_vector_costs::aarch64_vector_costs (vec_info *vinfo,
@@ -18779,6 +18784,36 @@ aarch64_vector_costs::add_stmt_cost (int count, 
vect_cost_for_stmt kind,
        }
     }
 
+  /* When a store with gaps happens the middle-end will cost these as series
+     of x loads (where x is the number of unused vectors) and set count to x
+     which causes the cost model to go off the rails because we cost e.g.
+     an LD4 as doing 4 LDRs + permutes.  Detect this and cost the load as a
+     single LDn.  The middle-end costs this as GAPS + 1 loads so the costing
+     for load lanes goes completely wrong.  On AARCH64 the gaps don't matter
+     and LOAD_LANES has the same cost so just ignore the GAP costing if we're
+     doing throughput based costing.  */
+  stmt_vec_info first_stmt_info;
+  if (node
+      && stmt_info
+      && aarch64_tune_params.vec_costs->issue_info
+      && SLP_TREE_MEMORY_ACCESS_TYPE (node) == VMAT_LOAD_STORE_LANES
+      && STMT_VINFO_GROUPED_ACCESS (stmt_info)
+      && (first_stmt_info = DR_GROUP_FIRST_ELEMENT (stmt_info))
+      && first_stmt_info == stmt_info)
+    {
+      /* Figure out of we're a group load with gaps.  */
+      unsigned int gaps = DR_GROUP_SIZE (first_stmt_info);
+      stmt_vec_info next_stmt_info = first_stmt_info;
+      do
+       {
+         gaps -= 1;
+       }
+      while ((next_stmt_info = DR_GROUP_NEXT_ELEMENT (next_stmt_info)));
+
+      if (gaps)
+       count = m_load_lanes.add (stmt_info) ? 1 : 0;
+    }
+
   /* If we're recording a nonzero vector loop body cost for the
      innermost loop, also estimate the operations that would need
      to be issued by all relevant implementations of the loop.  */
diff --git a/gcc/testsuite/gcc.target/aarch64/sve/cost_model_20.c 
b/gcc/testsuite/gcc.target/aarch64/sve/cost_model_20.c
new file mode 100644
index 
0000000000000000000000000000000000000000..ffb51b5a1a0c59ddf4ebcb71d0036b0f0f785e3f
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/sve/cost_model_20.c
@@ -0,0 +1,20 @@
+/* { dg-do compile } */
+/* { dg-additional-options "-O3 -march=armv9-a -fdump-tree-vect-details" } */
+
+int reduce(int *a, int n) 
+{
+    int sum = 0;
+    for (int i = 0; i < n; i=i+2) {
+        sum += a[i];
+    }
+    return sum;
+}
+
+/* { dg-final { scan-tree-dump "\\*_3 1 times vector_load costs 0 in body" 
"vect" } } */
+/* Sadly we can't get this fully right because the middle-end will re-cost 
based on ncopies, but we don't
+   know ncopies and recalculating it in the backend would replicate a lot of 
code.  */
+/* { dg-final { scan-tree-dump-not "\\*_3 1 times vector_load costs 6 in body" 
"vect" { xfail *-*-* } } } */
+/* { dg-final { scan-tree-dump "\\*_3 1 times vector_load costs 6 in body" 
"vect" } } */
+
+/* { dg-final { scan-tree-dump "LOOP VECTORIZED" "vect" } } */
+/* { dg-final { scan-tree-dump "\.(MASKED_LOAD)?_LANES" "vect" } } */
diff --git a/gcc/testsuite/gcc.target/aarch64/sve/cost_model_21.c 
b/gcc/testsuite/gcc.target/aarch64/sve/cost_model_21.c
new file mode 100644
index 
0000000000000000000000000000000000000000..7157796b11c32314c7a0d44e34135173fab82ba1
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/sve/cost_model_21.c
@@ -0,0 +1,21 @@
+/* { dg-do compile } */
+/* { dg-additional-options "-O3 -march=armv9-a -fdump-tree-vect-details" } */
+
+int reduce(int *a, int n) 
+{
+    int sum = 0;
+    int sub = 0;
+    for (int i = 0; i < n; i=i+4) {
+      {
+        sum += a[i];
+        sub += a[i+3];
+      }
+    }
+    return sum - sub;
+}
+
+/* { dg-final { scan-tree-dump "\\*_3 2 times vector_load costs 0 in body" 
"vect" } } */
+/* { dg-final { scan-tree-dump "\\*_3 1 times vector_load costs 7 in body" 
"vect" } } */
+
+/* { dg-final { scan-tree-dump "LOOP VECTORIZED" "vect" } } */
+/* { dg-final { scan-tree-dump-not "\.(MASKED_LOAD)?_LANES" "vect" } } */
diff --git a/gcc/testsuite/gcc.target/aarch64/sve/cost_model_22.c 
b/gcc/testsuite/gcc.target/aarch64/sve/cost_model_22.c
new file mode 100644
index 
0000000000000000000000000000000000000000..68478b55dd40c825e3d4e2f0a5351263e9696665
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/sve/cost_model_22.c
@@ -0,0 +1,17 @@
+/* { dg-do compile } */
+/* { dg-additional-options "-O3 -march=armv9-a -fdump-tree-vect-details" } */
+
+int reduce(int *a, int n) 
+{
+    int sum = 0;
+    for (int i = 0; i < n; i=i+4) {
+        sum += a[i];
+    }
+    return sum;
+}
+
+/* { dg-final { scan-tree-dump "LOOP VECTORIZED" "vect" } } */
+/* { dg-final { scan-tree-dump "\.(MASKED_LOAD)?_LANES" "vect" } } */
+
+/* { dg-final { scan-tree-dump "\\*_3 3 times vector_load costs 0 in body" 
"vect" } } */
+/* { dg-final { scan-tree-dump "\\*_3 1 times vector_load costs 7 in body" 
"vect" } } */
diff --git a/gcc/testsuite/gcc.target/aarch64/sve/cost_model_23.c 
b/gcc/testsuite/gcc.target/aarch64/sve/cost_model_23.c
new file mode 100644
index 
0000000000000000000000000000000000000000..cab1073453d097ddd3cca4fe5577202d1720030b
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/sve/cost_model_23.c
@@ -0,0 +1,28 @@
+/* { dg-do compile } */
+/* { dg-additional-options "-O3 -march=armv9-a -fdump-tree-vect-details" } */
+
+typedef struct _argb {
+  char r;
+  char g;
+  char b;
+  char a;
+} rgb_t;
+
+int r, g, b, a;
+
+void reduce(rgb_t *x, int n) 
+{
+    r = g = b = a = 0;
+    for (int i = 0; i < n; i++) {
+        r += x[i].r;
+        g += x[i].g;
+        b += x[i].b;
+        a += x[i].a;
+    }
+    return;
+}
+
+/* { dg-final { scan-tree-dump "_3->r 1 times vector_load costs 7 in body" 
"vect" } } */
+
+/* { dg-final { scan-tree-dump "LOOP VECTORIZED" "vect" } } */
+/* { dg-final { scan-tree-dump "\.(MASKED_LOAD)?_LANES" "vect" } } */
diff --git a/gcc/testsuite/gcc.target/aarch64/sve/pr124866.c 
b/gcc/testsuite/gcc.target/aarch64/sve/pr124866.c
new file mode 100644
index 
0000000000000000000000000000000000000000..d3e48f61c6f4714f92dc225d001c11a647c129f1
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/sve/pr124866.c
@@ -0,0 +1,14 @@
+/* { dg-do compile } */
+/* { dg-additional-options "-O3 -march=armv9-a -fdump-tree-vect-details" } */
+
+int reduce(int *a, int n) 
+{
+    int sum = 0;
+    for (int i = 0; i < n; i=i+4) {
+        sum += a[i];
+    }
+    return sum;
+}
+
+/* { dg-final { scan-tree-dump "LOOP VECTORIZED" "vect" } } */
+/* { dg-final { scan-tree-dump "\.(MASKED_LOAD)?_LANES" "vect" } } */


--

diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc
index 37c28c8f2f8f58f6bb16774e8d5a788cdc800154..aeeef4aa47eb81e771b0bab5f5eb03b19800d16e 100644
--- a/gcc/config/aarch64/aarch64.cc
+++ b/gcc/config/aarch64/aarch64.cc
@@ -17582,6 +17582,11 @@ private:
   /* If m_loop_fully_scalar_dup this contains the total number of leaf stmts
      found in the SLP tree.  */
   uint64_t m_num_total_stmts = 0;
+
+  /* Used when costing load lanes with gaps as we need to know whether we are
+     costing the first or the second occurrence.  (i.e.  the gap or the used
+     bits).  */
+  hash_set<stmt_vec_info> m_load_lanes;
 };
 
 aarch64_vector_costs::aarch64_vector_costs (vec_info *vinfo,
@@ -18779,6 +18784,36 @@ aarch64_vector_costs::add_stmt_cost (int count, vect_cost_for_stmt kind,
 	}
     }
 
+  /* When a store with gaps happens the middle-end will cost these as series
+     of x loads (where x is the number of unused vectors) and set count to x
+     which causes the cost model to go off the rails because we cost e.g.
+     an LD4 as doing 4 LDRs + permutes.  Detect this and cost the load as a
+     single LDn.  The middle-end costs this as GAPS + 1 loads so the costing
+     for load lanes goes completely wrong.  On AARCH64 the gaps don't matter
+     and LOAD_LANES has the same cost so just ignore the GAP costing if we're
+     doing throughput based costing.  */
+  stmt_vec_info first_stmt_info;
+  if (node
+      && stmt_info
+      && aarch64_tune_params.vec_costs->issue_info
+      && SLP_TREE_MEMORY_ACCESS_TYPE (node) == VMAT_LOAD_STORE_LANES
+      && STMT_VINFO_GROUPED_ACCESS (stmt_info)
+      && (first_stmt_info = DR_GROUP_FIRST_ELEMENT (stmt_info))
+      && first_stmt_info == stmt_info)
+    {
+      /* Figure out of we're a group load with gaps.  */
+      unsigned int gaps = DR_GROUP_SIZE (first_stmt_info);
+      stmt_vec_info next_stmt_info = first_stmt_info;
+      do
+	{
+	  gaps -= 1;
+	}
+      while ((next_stmt_info = DR_GROUP_NEXT_ELEMENT (next_stmt_info)));
+
+      if (gaps)
+	count = m_load_lanes.add (stmt_info) ? 1 : 0;
+    }
+
   /* If we're recording a nonzero vector loop body cost for the
      innermost loop, also estimate the operations that would need
      to be issued by all relevant implementations of the loop.  */
diff --git a/gcc/testsuite/gcc.target/aarch64/sve/cost_model_20.c b/gcc/testsuite/gcc.target/aarch64/sve/cost_model_20.c
new file mode 100644
index 0000000000000000000000000000000000000000..ffb51b5a1a0c59ddf4ebcb71d0036b0f0f785e3f
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/sve/cost_model_20.c
@@ -0,0 +1,20 @@
+/* { dg-do compile } */
+/* { dg-additional-options "-O3 -march=armv9-a -fdump-tree-vect-details" } */
+
+int reduce(int *a, int n) 
+{
+    int sum = 0;
+    for (int i = 0; i < n; i=i+2) {
+        sum += a[i];
+    }
+    return sum;
+}
+
+/* { dg-final { scan-tree-dump "\\*_3 1 times vector_load costs 0 in body" "vect" } } */
+/* Sadly we can't get this fully right because the middle-end will re-cost based on ncopies, but we don't
+   know ncopies and recalculating it in the backend would replicate a lot of code.  */
+/* { dg-final { scan-tree-dump-not "\\*_3 1 times vector_load costs 6 in body" "vect" { xfail *-*-* } } } */
+/* { dg-final { scan-tree-dump "\\*_3 1 times vector_load costs 6 in body" "vect" } } */
+
+/* { dg-final { scan-tree-dump "LOOP VECTORIZED" "vect" } } */
+/* { dg-final { scan-tree-dump "\.(MASKED_LOAD)?_LANES" "vect" } } */
diff --git a/gcc/testsuite/gcc.target/aarch64/sve/cost_model_21.c b/gcc/testsuite/gcc.target/aarch64/sve/cost_model_21.c
new file mode 100644
index 0000000000000000000000000000000000000000..7157796b11c32314c7a0d44e34135173fab82ba1
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/sve/cost_model_21.c
@@ -0,0 +1,21 @@
+/* { dg-do compile } */
+/* { dg-additional-options "-O3 -march=armv9-a -fdump-tree-vect-details" } */
+
+int reduce(int *a, int n) 
+{
+    int sum = 0;
+    int sub = 0;
+    for (int i = 0; i < n; i=i+4) {
+      {
+        sum += a[i];
+        sub += a[i+3];
+      }
+    }
+    return sum - sub;
+}
+
+/* { dg-final { scan-tree-dump "\\*_3 2 times vector_load costs 0 in body" "vect" } } */
+/* { dg-final { scan-tree-dump "\\*_3 1 times vector_load costs 7 in body" "vect" } } */
+
+/* { dg-final { scan-tree-dump "LOOP VECTORIZED" "vect" } } */
+/* { dg-final { scan-tree-dump-not "\.(MASKED_LOAD)?_LANES" "vect" } } */
diff --git a/gcc/testsuite/gcc.target/aarch64/sve/cost_model_22.c b/gcc/testsuite/gcc.target/aarch64/sve/cost_model_22.c
new file mode 100644
index 0000000000000000000000000000000000000000..68478b55dd40c825e3d4e2f0a5351263e9696665
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/sve/cost_model_22.c
@@ -0,0 +1,17 @@
+/* { dg-do compile } */
+/* { dg-additional-options "-O3 -march=armv9-a -fdump-tree-vect-details" } */
+
+int reduce(int *a, int n) 
+{
+    int sum = 0;
+    for (int i = 0; i < n; i=i+4) {
+        sum += a[i];
+    }
+    return sum;
+}
+
+/* { dg-final { scan-tree-dump "LOOP VECTORIZED" "vect" } } */
+/* { dg-final { scan-tree-dump "\.(MASKED_LOAD)?_LANES" "vect" } } */
+
+/* { dg-final { scan-tree-dump "\\*_3 3 times vector_load costs 0 in body" "vect" } } */
+/* { dg-final { scan-tree-dump "\\*_3 1 times vector_load costs 7 in body" "vect" } } */
diff --git a/gcc/testsuite/gcc.target/aarch64/sve/cost_model_23.c b/gcc/testsuite/gcc.target/aarch64/sve/cost_model_23.c
new file mode 100644
index 0000000000000000000000000000000000000000..cab1073453d097ddd3cca4fe5577202d1720030b
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/sve/cost_model_23.c
@@ -0,0 +1,28 @@
+/* { dg-do compile } */
+/* { dg-additional-options "-O3 -march=armv9-a -fdump-tree-vect-details" } */
+
+typedef struct _argb {
+  char r;
+  char g;
+  char b;
+  char a;
+} rgb_t;
+
+int r, g, b, a;
+
+void reduce(rgb_t *x, int n) 
+{
+    r = g = b = a = 0;
+    for (int i = 0; i < n; i++) {
+        r += x[i].r;
+        g += x[i].g;
+        b += x[i].b;
+        a += x[i].a;
+    }
+    return;
+}
+
+/* { dg-final { scan-tree-dump "_3->r 1 times vector_load costs 7 in body" "vect" } } */
+
+/* { dg-final { scan-tree-dump "LOOP VECTORIZED" "vect" } } */
+/* { dg-final { scan-tree-dump "\.(MASKED_LOAD)?_LANES" "vect" } } */
diff --git a/gcc/testsuite/gcc.target/aarch64/sve/pr124866.c b/gcc/testsuite/gcc.target/aarch64/sve/pr124866.c
new file mode 100644
index 0000000000000000000000000000000000000000..d3e48f61c6f4714f92dc225d001c11a647c129f1
--- /dev/null
+++ b/gcc/testsuite/gcc.target/aarch64/sve/pr124866.c
@@ -0,0 +1,14 @@
+/* { dg-do compile } */
+/* { dg-additional-options "-O3 -march=armv9-a -fdump-tree-vect-details" } */
+
+int reduce(int *a, int n) 
+{
+    int sum = 0;
+    for (int i = 0; i < n; i=i+4) {
+        sum += a[i];
+    }
+    return sum;
+}
+
+/* { dg-final { scan-tree-dump "LOOP VECTORIZED" "vect" } } */
+/* { dg-final { scan-tree-dump "\.(MASKED_LOAD)?_LANES" "vect" } } */

[PATCH]AArch64: account for load_lanes with gaps in costing [PR124866]

Reply via email to