> -----Original Message-----
> From: Richard Sandiford <[email protected]>
> Sent: 06 May 2026 23:12
> To: Tamar Christina <[email protected]>
> Cc: [email protected]; nd <[email protected]>; Richard Earnshaw
> <[email protected]>; [email protected]; Alex Coplan
> <[email protected]>; [email protected]; Wilco Dijkstra
> <[email protected]>; Alice Carlotti <[email protected]>
> Subject: Re: [PATCH]AArch64: account for load_lanes with gaps in costing
> [PR124866]
> 
> Tamar Christina <[email protected]> writes:
> > The vectorizer when costing LOAD_LANES with gaps breaks the costing into
> > multiple calls where the first call wants to cost GAPS loads and second
> > GROUP_SIZE - GAPS loads. However:
> >
> > 1. Both VMAT types are LOAD_LANES, we we cost effectively
> >    GROUP_SIZE * LOAD_LANES.  This increases the latency costs immensely.
> >
> > 2. The GROUP_SIZE costing also increases the throughput costs *
> GROUP_SIZE,
> >    which has the knock on effect of increasing the costs for the latency 
> > again
> >    as the compiler applies a latency penalty due to the scalar code issuing
> >    quicker.
> >
> > What the vectorizer is trying to cost is simply 1x LOAD_LANES but it's 
> > asking
> to
> > cost the invidual gaps as well.  However for an LDn operation this costs is
> > always fixed and is the cost of the LDn itself.
> >
> > when there isn't any gaps we only cost 1x LOAD_LANES which is the
> expected
> > amount.  Unfortunately we can't really tell in the backend easily what chain
> of
> > costing belongs to any one operation and so treat them as individual ones.
> >
> > This patch keeps track of the LOAD_LANES we've seen already when there is
> a gap
> > and
> >
> > 1. marks the cost of the first GAPS costing to 0.
> > 2. costs the cost of the second GROUP_SIZE - GAPS loads as a single
> LOAD_LANES.
> >
> > This is the best we can do until Richi's changes to cost operations like 
> > these
> > as a group.
> >
> > Bootstrapped Regtested on aarch64-none-linux-gnu and no issues.
> 
> I think we should just remove the:
> 
>             /* An IFN_LOAD_LANES will load all its vector results,
>                regardless of which ones we actually need.  Account
>                for the cost of unused results.  */
>             if (first_stmt_info == stmt_info)
>               {
>                 unsigned int gaps = DR_GROUP_SIZE (first_stmt_info);
>                 stmt_vec_info next_stmt_info = first_stmt_info;
>                 do
>                   {
>                     gaps -= 1;
>                     next_stmt_info = DR_GROUP_NEXT_ELEMENT
> (next_stmt_info);
>                   }
>                 while (next_stmt_info);
>                 if (gaps)
>                   {
>                     if (dump_enabled_p ())
>                       dump_printf_loc (MSG_NOTE, vect_location,
>                                        "vect_model_load_cost: %d "
>                                        "unused vectors.\n",
>                                        gaps);
>                     vect_get_load_cost (vinfo, stmt_info, slp_node, gaps,
>                                         alignment_support_scheme,
>                                         misalignment, false, &inside_cost,
>                                         &prologue_cost, cost_vec, cost_vec,
>                                         true);
>                   }
>               }
> 
> that you pointed to in the PR.  This predates the "all SLP" model.
> It was originally added in r11-6662-ge45c41988bfd65, at a time when
> vectorizable_load (and thus vect_model_load_cost) was called for each
> individual *scalar* stmt.  As a result, we generally costed one vector
> load for each scalar load.  One-off costs that applied to the whole
> group were added only when costing the first stmt_info in the group.
> 
> This meant that for load-lanes, we'd cost one vector_load for each
> member of the LDn that was used.  That was ok (in the sense of being
> self-consistent) when all of the results were used.  But it meant
> that the cost of a fully-utilised LD4 was 4 times greater than
> the cost of an LD4 in which only one result was used.  That didn't
> make any sense, since the instruction loaded the same amount of
> data regardless.
> 
> There's no need for that now that we only call vectorizable_load once
> per SLP node, rather than once per scalar stmt, and now that we cost
> the instructions in the same way that we generate them.

Sure, but this patch intentionally avoided changing behavior for older
cost models that was written with this behavior in mind.

In particular almost all cost models without issue information have
the extra permute costs for ldn as 0, including generic tuning.

This is why the change avoided aarch64_tune_params.vec_costs->issue_info.

Now I can fix generic tuning, but for every other core I'll have to add
a very arbitrary value (probably the same as generic tuning).

So If we're ok with that I can do that (and hopefully other targets don't 
object)

Thanks,
Tamar

> 
> Thanks,
> Richard
> 
> >
> > Any objections?
> >
> > Thanks,
> > Tamar
> >
> > gcc/ChangeLog:
> >
> >     PR target/124866
> >     * config/aarch64/aarch64.cc
> >     (class aarch64_vector_costs): Add m_load_lanes.
> >     (aarch64_vector_costs::add_stmt_cost): Use it.
> >
> > gcc/testsuite/ChangeLog:
> >
> >     PR target/124866
> >     * gcc.target/aarch64/sve/cost_model_20.c: New test.
> >     * gcc.target/aarch64/sve/cost_model_21.c: New test.
> >     * gcc.target/aarch64/sve/cost_model_22.c: New test.
> >     * gcc.target/aarch64/sve/cost_model_23.c: New test.
> >     * gcc.target/aarch64/sve/pr124866.c: New test.
> >
> > ---
> >
> > diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc
> > index
> 37c28c8f2f8f58f6bb16774e8d5a788cdc800154..aeeef4aa47eb81e771b0ba
> b5f5eb03b19800d16e 100644
> > --- a/gcc/config/aarch64/aarch64.cc
> > +++ b/gcc/config/aarch64/aarch64.cc
> > @@ -17582,6 +17582,11 @@ private:
> >    /* If m_loop_fully_scalar_dup this contains the total number of leaf 
> > stmts
> >       found in the SLP tree.  */
> >    uint64_t m_num_total_stmts = 0;
> > +
> > +  /* Used when costing load lanes with gaps as we need to know whether
> we are
> > +     costing the first or the second occurrence.  (i.e.  the gap or the 
> > used
> > +     bits).  */
> > +  hash_set<stmt_vec_info> m_load_lanes;
> >  };
> >
> >  aarch64_vector_costs::aarch64_vector_costs (vec_info *vinfo,
> > @@ -18779,6 +18784,36 @@ aarch64_vector_costs::add_stmt_cost (int
> count, vect_cost_for_stmt kind,
> >     }
> >      }
> >
> > +  /* When a store with gaps happens the middle-end will cost these as 
> > series
> > +     of x loads (where x is the number of unused vectors) and set count to 
> > x
> > +     which causes the cost model to go off the rails because we cost e.g.
> > +     an LD4 as doing 4 LDRs + permutes.  Detect this and cost the load as a
> > +     single LDn.  The middle-end costs this as GAPS + 1 loads so the 
> > costing
> > +     for load lanes goes completely wrong.  On AARCH64 the gaps don't
> matter
> > +     and LOAD_LANES has the same cost so just ignore the GAP costing if
> we're
> > +     doing throughput based costing.  */
> > +  stmt_vec_info first_stmt_info;
> > +  if (node
> > +      && stmt_info
> > +      && aarch64_tune_params.vec_costs->issue_info
> > +      && SLP_TREE_MEMORY_ACCESS_TYPE (node) ==
> VMAT_LOAD_STORE_LANES
> > +      && STMT_VINFO_GROUPED_ACCESS (stmt_info)
> > +      && (first_stmt_info = DR_GROUP_FIRST_ELEMENT (stmt_info))
> > +      && first_stmt_info == stmt_info)
> > +    {
> > +      /* Figure out of we're a group load with gaps.  */
> > +      unsigned int gaps = DR_GROUP_SIZE (first_stmt_info);
> > +      stmt_vec_info next_stmt_info = first_stmt_info;
> > +      do
> > +   {
> > +     gaps -= 1;
> > +   }
> > +      while ((next_stmt_info = DR_GROUP_NEXT_ELEMENT
> (next_stmt_info)));
> > +
> > +      if (gaps)
> > +   count = m_load_lanes.add (stmt_info) ? 1 : 0;
> > +    }
> > +
> >    /* If we're recording a nonzero vector loop body cost for the
> >       innermost loop, also estimate the operations that would need
> >       to be issued by all relevant implementations of the loop.  */
> > diff --git a/gcc/testsuite/gcc.target/aarch64/sve/cost_model_20.c
> b/gcc/testsuite/gcc.target/aarch64/sve/cost_model_20.c
> > new file mode 100644
> > index
> 0000000000000000000000000000000000000000..ffb51b5a1a0c59ddf4e
> bcb71d0036b0f0f785e3f
> > --- /dev/null
> > +++ b/gcc/testsuite/gcc.target/aarch64/sve/cost_model_20.c
> > @@ -0,0 +1,20 @@
> > +/* { dg-do compile } */
> > +/* { dg-additional-options "-O3 -march=armv9-a -fdump-tree-vect-details"
> } */
> > +
> > +int reduce(int *a, int n)
> > +{
> > +    int sum = 0;
> > +    for (int i = 0; i < n; i=i+2) {
> > +        sum += a[i];
> > +    }
> > +    return sum;
> > +}
> > +
> > +/* { dg-final { scan-tree-dump "\\*_3 1 times vector_load costs 0 in body"
> "vect" } } */
> > +/* Sadly we can't get this fully right because the middle-end will re-cost
> based on ncopies, but we don't
> > +   know ncopies and recalculating it in the backend would replicate a lot 
> > of
> code.  */
> > +/* { dg-final { scan-tree-dump-not "\\*_3 1 times vector_load costs 6 in
> body" "vect" { xfail *-*-* } } } */
> > +/* { dg-final { scan-tree-dump "\\*_3 1 times vector_load costs 6 in body"
> "vect" } } */
> > +
> > +/* { dg-final { scan-tree-dump "LOOP VECTORIZED" "vect" } } */
> > +/* { dg-final { scan-tree-dump "\.(MASKED_LOAD)?_LANES" "vect" } } */
> > diff --git a/gcc/testsuite/gcc.target/aarch64/sve/cost_model_21.c
> b/gcc/testsuite/gcc.target/aarch64/sve/cost_model_21.c
> > new file mode 100644
> > index
> 0000000000000000000000000000000000000000..7157796b11c32314c7
> a0d44e34135173fab82ba1
> > --- /dev/null
> > +++ b/gcc/testsuite/gcc.target/aarch64/sve/cost_model_21.c
> > @@ -0,0 +1,21 @@
> > +/* { dg-do compile } */
> > +/* { dg-additional-options "-O3 -march=armv9-a -fdump-tree-vect-details"
> } */
> > +
> > +int reduce(int *a, int n)
> > +{
> > +    int sum = 0;
> > +    int sub = 0;
> > +    for (int i = 0; i < n; i=i+4) {
> > +      {
> > +        sum += a[i];
> > +        sub += a[i+3];
> > +      }
> > +    }
> > +    return sum - sub;
> > +}
> > +
> > +/* { dg-final { scan-tree-dump "\\*_3 2 times vector_load costs 0 in body"
> "vect" } } */
> > +/* { dg-final { scan-tree-dump "\\*_3 1 times vector_load costs 7 in body"
> "vect" } } */
> > +
> > +/* { dg-final { scan-tree-dump "LOOP VECTORIZED" "vect" } } */
> > +/* { dg-final { scan-tree-dump-not "\.(MASKED_LOAD)?_LANES" "vect" } }
> */
> > diff --git a/gcc/testsuite/gcc.target/aarch64/sve/cost_model_22.c
> b/gcc/testsuite/gcc.target/aarch64/sve/cost_model_22.c
> > new file mode 100644
> > index
> 0000000000000000000000000000000000000000..68478b55dd40c825e3
> d4e2f0a5351263e9696665
> > --- /dev/null
> > +++ b/gcc/testsuite/gcc.target/aarch64/sve/cost_model_22.c
> > @@ -0,0 +1,17 @@
> > +/* { dg-do compile } */
> > +/* { dg-additional-options "-O3 -march=armv9-a -fdump-tree-vect-details"
> } */
> > +
> > +int reduce(int *a, int n)
> > +{
> > +    int sum = 0;
> > +    for (int i = 0; i < n; i=i+4) {
> > +        sum += a[i];
> > +    }
> > +    return sum;
> > +}
> > +
> > +/* { dg-final { scan-tree-dump "LOOP VECTORIZED" "vect" } } */
> > +/* { dg-final { scan-tree-dump "\.(MASKED_LOAD)?_LANES" "vect" } } */
> > +
> > +/* { dg-final { scan-tree-dump "\\*_3 3 times vector_load costs 0 in body"
> "vect" } } */
> > +/* { dg-final { scan-tree-dump "\\*_3 1 times vector_load costs 7 in body"
> "vect" } } */
> > diff --git a/gcc/testsuite/gcc.target/aarch64/sve/cost_model_23.c
> b/gcc/testsuite/gcc.target/aarch64/sve/cost_model_23.c
> > new file mode 100644
> > index
> 0000000000000000000000000000000000000000..cab1073453d097ddd3
> cca4fe5577202d1720030b
> > --- /dev/null
> > +++ b/gcc/testsuite/gcc.target/aarch64/sve/cost_model_23.c
> > @@ -0,0 +1,28 @@
> > +/* { dg-do compile } */
> > +/* { dg-additional-options "-O3 -march=armv9-a -fdump-tree-vect-details"
> } */
> > +
> > +typedef struct _argb {
> > +  char r;
> > +  char g;
> > +  char b;
> > +  char a;
> > +} rgb_t;
> > +
> > +int r, g, b, a;
> > +
> > +void reduce(rgb_t *x, int n)
> > +{
> > +    r = g = b = a = 0;
> > +    for (int i = 0; i < n; i++) {
> > +        r += x[i].r;
> > +        g += x[i].g;
> > +        b += x[i].b;
> > +        a += x[i].a;
> > +    }
> > +    return;
> > +}
> > +
> > +/* { dg-final { scan-tree-dump "_3->r 1 times vector_load costs 7 in body"
> "vect" } } */
> > +
> > +/* { dg-final { scan-tree-dump "LOOP VECTORIZED" "vect" } } */
> > +/* { dg-final { scan-tree-dump "\.(MASKED_LOAD)?_LANES" "vect" } } */
> > diff --git a/gcc/testsuite/gcc.target/aarch64/sve/pr124866.c
> b/gcc/testsuite/gcc.target/aarch64/sve/pr124866.c
> > new file mode 100644
> > index
> 0000000000000000000000000000000000000000..d3e48f61c6f4714f92d
> c225d001c11a647c129f1
> > --- /dev/null
> > +++ b/gcc/testsuite/gcc.target/aarch64/sve/pr124866.c
> > @@ -0,0 +1,14 @@
> > +/* { dg-do compile } */
> > +/* { dg-additional-options "-O3 -march=armv9-a -fdump-tree-vect-details"
> } */
> > +
> > +int reduce(int *a, int n)
> > +{
> > +    int sum = 0;
> > +    for (int i = 0; i < n; i=i+4) {
> > +        sum += a[i];
> > +    }
> > +    return sum;
> > +}
> > +
> > +/* { dg-final { scan-tree-dump "LOOP VECTORIZED" "vect" } } */
> > +/* { dg-final { scan-tree-dump "\.(MASKED_LOAD)?_LANES" "vect" } } */

Reply via email to