RE: [PATCH]AArch64: account for load_lanes with gaps in costing [PR124866]

Tamar Christina Tue, 12 May 2026 07:00:33 -0700

> -----Original Message-----
> From: Richard Sandiford <[email protected]>
> Sent: 12 May 2026 14:09
> To: Tamar Christina <[email protected]>
> Cc: [email protected]; nd <[email protected]>; Richard Earnshaw
> <[email protected]>; [email protected]; Alex Coplan
> <[email protected]>; [email protected]; Wilco Dijkstra
> <[email protected]>; Alice Carlotti <[email protected]>
> Subject: Re: [PATCH]AArch64: account for load_lanes with gaps in costing
> [PR124866]
> 
> Tamar Christina <[email protected]> writes:
> >>      if (costing_p)
> >>        {
> >>          /* An IFN_LOAD_LANES will load all its vector results,
> >>             regardless of which ones we actually need.  Account
> >>             for the cost of unused results.  */
> >>          if (first_stmt_info == stmt_info)
> >>            {
> >>              unsigned int gaps = DR_GROUP_SIZE (first_stmt_info);
> >>              stmt_vec_info next_stmt_info = first_stmt_info;
> >>              do
> >>                {
> >>                  gaps -= 1;
> >>                  next_stmt_info = DR_GROUP_NEXT_ELEMENT (next_stmt_info);
> >>                }
> >>              while (next_stmt_info);
> >>              if (gaps)
> >>                {
> >>                  if (dump_enabled_p ())
> >>                    dump_printf_loc (MSG_NOTE, vect_location,
> >>                                     "vect_model_load_cost: %d "
> >>                                     "unused vectors.\n",
> >>                                     gaps);
> >>                  vect_get_load_cost (vinfo, stmt_info, slp_node, gaps,
> >>                                      alignment_support_scheme,
> >>                                      misalignment, false, &inside_cost,
> >>                                      &prologue_cost, cost_vec, cost_vec,
> >>                                      true);
> >>                }
> >>            }
> >>          n_adjacent_loads++;
> >>          continue;
> >>        }
> >>
> >> counts one load for the LDN itself plus one load for each unused result.
> >> That doesn't make conceptual sense.  The first_stmt_info == stmt_info
> >> block is counting one vector_load *per vector*, but n_adjacent_loads++
> >> is counting one vector_load *per LDN*.  In other words, we're mixing units.
> >>
> >
> > Correct, but even with the GAPs costing code it never made sense.  So even
> > with the old code costing the gaps made absolutely no sense because that's
> > not how LDn's work. And that's not how it was costed when there is no
> > gaps.  Whether there's a gap or not is really irrelevant for an LDn.
> > It's not like the instruction does less or more work if you magically
> > don't use some vectors.
> 
> Do you mean GCC 11 never made sense, or some intermediate state?
>


I meant the state of the costing for SLP.  For non-SLP it does make sense,
since the loads aren't grouped.

> I tried to explain how things worked in GCC 11 before.  There, normal
> non-SLP loop vectorisation would call vectorizable_load for each scalar
> statement.  Thus for:
> 
>    for (...)
>      {
>        a = x[i * 3];
>        b = x[i * 3 + 2];
>        ...
>      }
> 
> it would call vectorizable_load for both a and b.  Each vectorizable_load
> would cost one vector_load.
> 
> But on its own, that would only count 2 vector loads, even though LD3
> loads 3 vectors.  Which I think we agree is wrong.  We should count
> the same number of loads regardless of how many load results are used.
> 

Agreed.

> So r11-6662-ge45c41988bfd65 added code to count the gaps.  This was
> applied once per LDN, rather than once per scalar load.  At the time,
> the way to do that was to protect the code with:
> 
>   first_stmt_info == stmt_info
> 
> i.e. attach the extra cost to the first scalar load in the group.
> 
> With that in place, we counted 3 vector loads for the example above:
> 1 vector load for each scalar load and 1 vector load for the gap.
> 
> If the example had been:
> 
>    for (...)
>      {
>        a = x[i * 3];
>        ...
>      }
> 
> we would have counted 1 vector load for the scalar load and 2
> vector loads for the gaps.  For:
> 
>    for (...)
>      {
>        a = x[i * 3];
>        b = x[i * 3 + 1];
>        c = x[i * 3 + 2];
>      }
> 
> we would have counted 1 vector load for each of the 3 scalar loads,
> with no gap.
> 
> But if I'm not explaining it well, or you don't believe me, it might be
> easier to check out GCC 11 and play around with the code.

So far I'm with you wrt the non-SLP costing.
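
To spell out the GCC 11 rule described above, here is a minimal sketch of the
counting.  It is not vectoriser code: gcc11_vector_loads and its parameters
are names made up for illustration.

```c
#include <assert.h>

/* Hypothetical helper, not a GCC function.  Sketch of the GCC 11
   (non-SLP) counting for one LDn group.
   group_size: number of vectors the LDn loads (3 for LD3).
   used: number of scalar loads in the group that are actually used.  */
static unsigned int
gcc11_vector_loads (unsigned int group_size, unsigned int used)
{
  assert (used >= 1 && used <= group_size);
  /* vectorizable_load ran once per scalar load: one vector_load each.  */
  unsigned int per_stmt = used;
  /* The first_stmt_info == stmt_info block added one vector_load per
     gap, once per group.  */
  unsigned int gaps = group_size - used;
  return per_stmt + gaps;
}
```

For the three LD3 examples above (2, 1 and 3 used results) this always comes
out as 3 vector loads, i.e. the group size.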

> 
> Everything changed when we moved to SLP generation of LDN, and moved
> to calling vectorizable_load once per SLP node rather than once per
> scalar statement.  At that point, counting gaps made no sense.
> The thing that it was compensating for -- i.e. the lack of scalar
> stmts for the unused vectors -- is no longer relevant, because
> we no longer count things on a per-scalar-stmt basis.
> 
> >> So in table form, the handling of LD4 seems to be:
> >>
> >> Number of used results     GCC 11          Now
> >> ----------------------     ------          ----
> >> 1                          4 vector_loads  4 vector_loads
> >> 2                          4 vector_loads  3 vector_loads
> >> 3                          4 vector_loads  2 vector_loads
> >> 4                          4 vector_loads  1 vector_load
> >>
> >> If we remove the first_stmt_info == stmt_info block above, we'll get
> >> one vector_load for all cases.  AArch64 can then multiply that by
> >> aarch64_ld234_st234_vectors to get the number of vectors.
> >
> > That only makes sense to model throughput, for latency it makes zero
> > sense because the cost of an LD4 is not 4x vector loads. Never was
> > otherwise the instruction would be pointless to begin with.
> 
> But remember that "latency" (i.e. the normal vector cost used on all
> targets) is really "serial latency".  We just add up the individual
> cost of each operation, regardless of what can execute in parallel.
> The cost of a load from x[] is added to the cost of a load from y[].
> 

Agreed, but that's my point.  Let me try to explain.

This is where things become inconsistent (at least in the SLP
world we are in now).
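
The "Now" column of the LD4 table quoted above follows from the SLP-era rule:
one vector_load per LDn (the n_adjacent_loads++) plus one per unused result
(the gap block).  A minimal sketch of that rule, with slp_vector_loads being
a made-up name rather than anything in the vectoriser:

```c
#include <assert.h>

/* Hypothetical helper, not a GCC function.  Sketch of the counting
   once vectorizable_load runs once per SLP node.
   group_size: number of vectors the LDn loads (4 for LD4).
   used: number of results that are actually used.  */
static unsigned int
slp_vector_loads (unsigned int group_size, unsigned int used)
{
  assert (used >= 1 && used <= group_size);
  /* One vector_load for the LDn itself...  */
  unsigned int per_ldn = 1;
  /* ...plus one vector_load for every unused result.  */
  unsigned int gaps = group_size - used;
  return per_ldn + gaps;
}
```

For LD4 with 1, 2, 3 and 4 used results this gives 4, 3, 2 and 1, so the
count depends on how many results are used, which is the mixing of units
being discussed.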

I wrote the testcase gcc/testsuite/gcc.target/aarch64/sve/cost_model_23.c
to check specifically what happens in the different cases where we
use LDn.

So let's break them down:

1. Loads with gaps.  We already covered this, and I agree that the *
     (multiplying by the number of vectors) is there so that when the
     weighted cycles per iteration are computed we still have the actual
     cost of 1 LDn per cycle, or at least roughly account for it.

2. Without gaps.  On trunk today cost_model_23.c gives:

a) armv8-a (which has no issue information)

_3->r 1 times vector_load costs 1 in body

Which is OK, though I'm not really sure what to do with that.

b) armv9-a (which has issue information)

_3->r 1 times vector_load costs 7 in body

OK, makes more sense, but then

note:  Original vector body cost = 15
 note:  Scalar issue estimate:
 note:    load operations = 4
 note:    store operations = 0
 note:    general operations = 4
 note:    reduction latency = 1
 note:    estimated min cycles per iteration = 1.333333
 note:    estimated cycles per vector iteration (for VF 16) = 21.333333
 note:  SVE issue estimate:
 note:    load operations = 1
 note:    store operations = 0
 note:    general operations = 7
 note:    predicate operations = 0
 note:    reduction latency = 2
 note:    estimated min cycles per iteration = 3.500000

with the summary

 Vector inside of loop cost: 15
 Vector prologue cost: 16
 Vector epilogue cost: 197
 Scalar iteration cost: 20
 Scalar outside cost: 2
 Vector outside cost: 213
 prologue iterations: 0
 epilogue iterations: 8
 Calculated minimum iters for profitability: 11

So it looks like for this case we cost a LD4 as just 1 load.

c) Low iteration counts with gaps

For low known iteration counts we switch to latency-only costing.
That costing has no weighted cycles, so the 4 * LD4 cost adjustment
seems quite off.

So

int reduce(int *a, int n) 
{
    int sum = 0;
    for (int i = 0; i < n; i=i+4) {
        sum += a[i];
    }
    return sum;
}

gets costed as

note:  Original vector body cost = 30
 note:  Scalar issue estimate:
 note:    load operations = 1
 note:    store operations = 0
 note:    general operations = 1
 note:    reduction latency = 1
 note:    estimated min cycles per iteration = 1.000000
 note:    estimated cycles per vector iteration (for VF 4) = 4.000000
 note:  SVE issue estimate:
 note:    load operations = 4
 note:    store operations = 0
 note:    general operations = 13
 note:    predicate operations = 2
 note:    reduction latency = 2
 note:    estimated min cycles per iteration without predication = 6.500000
 note:    estimated min cycles per iteration for predication = 1.000000
 note:    estimated min cycles per iteration = 6.500000
 note:  Increasing body cost to 49 because scalar code would issue more quickly
 note:  Cost model analysis: 
  Vector inside of loop cost: 49
  Vector prologue cost: 6
  Vector epilogue cost: 14
  Scalar iteration cost: 5
  Scalar outside cost: 2
  Vector outside cost: 20
  prologue iterations: 0
  epilogue iterations: 1

but

int reduce(int *a, int n) 
{
    int sum = 0;
#pragma GCC unroll 0
    for (int i = 0; i < 65; i=i+4) {
        sum += a[i];
    }
    return sum;
}

is costed as

note:  Original vector body cost = 30
 note:  Vector loop iterates at most 5 times
 note:  Scalar issue estimate:
 note:    load operations = 1
 note:    store operations = 0
 note:    general operations = 1
 note:    reduction latency = 1
 note:    estimated min cycles per iteration = 1.000000
 note:    estimated cycles per vector iteration (for VF 4) = 4.000000
 note:  SVE issue estimate:
 note:    load operations = 4
 note:    store operations = 0
 note:    general operations = 13
 note:    predicate operations = 2
 note:    reduction latency = 2
 note:    estimated min cycles per iteration without predication = 6.500000
 note:    estimated min cycles per iteration for predication = 1.000000
 note:    estimated min cycles per iteration = 6.500000
 note:  Low iteration count, so using pure latency costs
 note:  Cost model analysis: 
  Vector inside of loop cost: 30
  Vector prologue cost: 6
  Vector epilogue cost: 14
  Scalar iteration cost: 5
  Scalar outside cost: 0
  Vector outside cost: 20
  prologue iterations: 0
  epilogue iterations: 1

This is the one I find weird.  In both cases, 49 and 30 are way too high
in terms of serial latency cost for the loop.  It makes the instruction
really expensive.

And if you compare it with the case above where we have no gaps, we
cost it as 15.
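
As far as I can tell, the 49 in the first dump is just the 30 scaled by the
ratio of the two issue estimates (6.5 vector cycles against 4.0 scalar
cycles per vector iteration), rounded up.  A minimal sketch of that
arithmetic, assuming that is what the adjustment does; adjusted_body_cost is
a made-up name and this is not the backend code:

```c
#include <assert.h>

/* Hypothetical helper, not the backend code.  Sketch of how the final
   body cost appears to follow from the dump figures:
   - with latency-only costing the body cost is left alone;
   - otherwise, if the vector loop needs more cycles per iteration than
     the scalar code does for the same work, scale the cost up by that
     ratio and round up.  */
static unsigned int
adjusted_body_cost (unsigned int body_cost,
                    double scalar_cycles_per_vector_iter,
                    double vector_cycles_per_iter,
                    int latency_only)
{
  assert (scalar_cycles_per_vector_iter > 0.0);
  if (latency_only
      || vector_cycles_per_iter <= scalar_cycles_per_vector_iter)
    return body_cost;

  double scaled = body_cost * vector_cycles_per_iter
                  / scalar_cycles_per_vector_iter;
  unsigned int result = (unsigned int) scaled;
  /* Round up without needing libm.  */
  if ((double) result < scaled)
    result++;
  return result;
}
```

With the figures from the dumps, adjusted_body_cost (30, 4.0, 6.5, 0) gives
49, the latency-only case keeps the 30, and the no-gap case (15, with 3.5
vector cycles against 21.33 scalar) is left at 15.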

So while the * 4 makes sense for determining the weighted cycles for
cross-mode testing, the final body costs are inconsistent.

Now granted, removing the gaps costing code will align the costing
between 1 and 2, but by doing * 4 you now increase the costing for 2.b,
making the loop unprofitable, while on every single uArch it's profitable.

But the cost of 30 and 49 for the loop is unrealistic, which is why we
don't vectorize it today (and really what the PR is trying to report).

This is why I don't think * 4 (at least blindly) is correct.

And when I said I'd have to modify the cost models with a random factor,
this is why: the discrepancies in costing above make armv8-a almost always
use LDn, whereas the newer cost models need a large loop body before they
use them.

But as benchmarked that's not correct.

Hopefully that explains what I'm trying to say.

> So from that POV, adding 4 vector loads as the latency cost for LD4
> does make sense, since LD4 does do the same number of loads as
> 4 loads from separate arrays.
> 
> In terms of the costing model, and I think in reality, the thing
> the instruction saves on is permutation cost.  That's especially
> true for LD3.

Mostly agreed, yes.

> 
> That is, LD3 is "slower" than 3 LD1s, but is "faster" than the
> combination of 3 LD1s followed by permutations that simulate
> the shuffling that an LD3 does.
> 

Agree.
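
To put numbers on the gap between the serial-latency model and the
instruction itself, here is a minimal sketch using the Neoverse V3 figures
mentioned in this thread (LDR 6 cycles, LD4 8 cycles).  serial_load_cost is
a made-up name, and no permute cost is modelled since no figures for that
were given:

```c
#include <assert.h>

/* Hypothetical helper, not a GCC function.  Serial-latency style
   cost: the per-vector load cost added up once per vector loaded.  */
static unsigned int
serial_load_cost (unsigned int nvectors, unsigned int per_vector_cost)
{
  assert (nvectors >= 1);
  return nvectors * per_vector_cost;
}
```

serial_load_cost (4, 6) is 24, against 8 cycles for the LD4 itself, which is
the discrepancy I'm worried about for latency costing.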

Thanks,
Tamar

> > For throughput it makes sense as a rough estimate. Take Neoverse V3
> > for instance. An LD4 costs 8 cycles whereas an LDR 6. Costing 4x 6
> > just doesn't make sense.
> >
> > So I don't think multiplying by number of vectors should be done
> > for latency and we don't do that today when there is no gap.
> >
> >>
> >> Alternatively, if we want generic code to present the same interface
> >> as before, we could both remove the first_stmt_info == stmt_info block
> >> and change:
> >>
> >>          n_adjacent_loads++;
> >>
> >> to:
> >>
> >>          n_adjacent_loads += group_size;
> >>
> >> But I can see the argument that counting 1 per LDN makes more sense
> >> for generic code, so personally I prefer the first approach
> >> (i.e. multiplying by aarch64_ld234_st234_vectors before applying
> >> the per-vector costs).
> >>
> >
> > n_adjacent_loads is also pointless, the variable is shadowed inside the
> > if (memory_access_type == VMAT_LOAD_STORE_LANES)
> >
> > and literally is just counting ncopies. So it's just ncopies.
> >
> > So I've already removed it.
> 
> Yeah, agreed.
> 
> Thanks,
> Richard
