> -----Original Message-----
> From: Richard Sandiford <[email protected]>
> Sent: 12 May 2026 12:31
> To: Tamar Christina <[email protected]>
> Cc: [email protected]; nd <[email protected]>; Richard Earnshaw
> <[email protected]>; [email protected]; Alex Coplan
> <[email protected]>; [email protected]; Wilco Dijkstra
> <[email protected]>; Alice Carlotti <[email protected]>
> Subject: Re: [PATCH]AArch64: account for load_lanes with gaps in costing
> [PR124866]
>
> Tamar Christina <[email protected]> writes:
> >> -----Original Message-----
> >> From: Richard Sandiford <[email protected]>
> >> Sent: 06 May 2026 23:12
> >> To: Tamar Christina <[email protected]>
> >> Cc: [email protected]; nd <[email protected]>; Richard Earnshaw
> >> <[email protected]>; [email protected]; Alex Coplan
> >> <[email protected]>; [email protected]; Wilco
> Dijkstra
> >> <[email protected]>; Alice Carlotti <[email protected]>
> >> Subject: Re: [PATCH]AArch64: account for load_lanes with gaps in costing
> >> [PR124866]
> >>
> >> Tamar Christina <[email protected]> writes:
> >> > The vectorizer when costing LOAD_LANES with gaps breaks the costing
> into
> >> > multiple calls where the first call wants to cost GAPS loads and second
> >> > GROUP_SIZE - GAPS loads. However:
> >> >
> >> > 1. Both VMAT types are LOAD_LANES, we we cost effectively
> >> > GROUP_SIZE * LOAD_LANES. This increases the latency costs
> immensely.
> >> >
> >> > 2. The GROUP_SIZE costing also increases the throughput costs *
> >> GROUP_SIZE,
> >> > which has the knock on effect of increasing the costs for the
> > latency again
> >> > as the compiler applies a latency penalty due to the scalar code
> > issuing
> >> > quicker.
> >> >
> >> > What the vectorizer is trying to cost is simply 1x LOAD_LANES but
> > it's asking
> >> to
> >> > cost the invidual gaps as well. However for an LDn operation this
> > costs is
> >> > always fixed and is the cost of the LDn itself.
> >> >
> >> > when there isn't any gaps we only cost 1x LOAD_LANES which is the
> >> expected
> >> > amount. Unfortunately we can't really tell in the backend easily
> > what chain
> >> of
> >> > costing belongs to any one operation and so treat them as individual
> ones.
> >> >
> >> > This patch keeps track of the LOAD_LANES we've seen already when there
> is
> >> a gap
> >> > and
> >> >
> >> > 1. marks the cost of the first GAPS costing to 0.
> >> > 2. costs the cost of the second GROUP_SIZE - GAPS loads as a single
> >> LOAD_LANES.
> >> >
> >> > This is the best we can do until Richi's changes to cost operations
> > like these
> >> > as a group.
> >> >
> >> > Bootstrapped Regtested on aarch64-none-linux-gnu and no issues.
> >>
> >> I think we should just remove the:
> >>
> >> /* An IFN_LOAD_LANES will load all its vector results,
> >> regardless of which ones we actually need. Account
> >> for the cost of unused results. */
> >> if (first_stmt_info == stmt_info)
> >> {
> >> unsigned int gaps = DR_GROUP_SIZE (first_stmt_info);
> >> stmt_vec_info next_stmt_info = first_stmt_info;
> >> do
> >> {
> >> gaps -= 1;
> >> next_stmt_info = DR_GROUP_NEXT_ELEMENT
> >> (next_stmt_info);
> >> }
> >> while (next_stmt_info);
> >> if (gaps)
> >> {
> >> if (dump_enabled_p ())
> >> dump_printf_loc (MSG_NOTE, vect_location,
> >> "vect_model_load_cost: %d "
> >> "unused vectors.\n",
> >> gaps);
> >> vect_get_load_cost (vinfo, stmt_info, slp_node, gaps,
> >> alignment_support_scheme,
> >> misalignment, false, &inside_cost,
> >> &prologue_cost, cost_vec, cost_vec,
> >> true);
> >> }
> >> }
> >>
> >> that you pointed to in the PR. This predates the "all SLP" model.
> >> It was originally added in r11-6662-ge45c41988bfd65, at a time when
> >> vectorizable_load (and thus vect_model_load_cost) was called for each
> >> individual *scalar* stmt. As a result, we generally costed one vector
> >> load for each scalar load. One-off costs that applied to the whole
> >> group were added only when costing the first stmt_info in the group.
> >>
> >> This meant that for load-lanes, we'd cost one vector_load for each
> >> member of the LDn that was used. That was ok (in the sense of being
> >> self-consistent) when all of the results were used. But it meant
> >> that the cost of a fully-utilised LD4 was 4 times greater than
> >> the cost of an LD4 in which only one result was used. That didn't
> >> make any sense, since the instruction loaded the same amount of
> >> data regardless.
> >>
> >> There's no need for that now that we only call vectorizable_load once
> >> per SLP node, rather than once per scalar stmt, and now that we cost
> >> the instructions in the same way that we generate them.
> >
> > Sure, but this patch intentionally avoided changing behavior for older
> > cost models that was written with this behavior in mind.
> >
> > In particular almost all cost models without issue information have
> > the extra permute costs for ldn as 0, including generic tuning.
> >
> > This is why the change avoided aarch64_tune_params.vec_costs-
> >issue_info.
> >
> > Now I can fix generic tuning, but for every other core I'll have to add
> > a very arbitrary value (probably the same as generic tuning).
> >
> > So If we're ok with that I can do that (and hopefully other targets
> > don't object)
>
> But my point is that the existing AArch64 costs weren't written for
> what the generic code currently does. The AArch64 costs were written
> for the assumption that, in aggregate, N vector_loads would be costed
> for each LDN, regardless of how many results are used. That's what used
> to happen in GCC 11.
>
> This is why, for example:
>
> /* Per-vector cost of permuting vectors after an LD2, LD3 or LD4,
> as well as the per-vector cost of permuting vectors before
> an ST2, ST3 or ST4. */
> int ld2_st2_permute_cost;
> int ld3_st3_permute_cost;
> int ld4_st4_permute_cost;
>
> is per vector rather than per LDN.
>
> But if I'm following correctly, we now have the worst of both worlds.
> This code:
>
Yes, and that's why the original patch attempted to ignore it in the backend.
But I digress.
> if (costing_p)
> {
> /* An IFN_LOAD_LANES will load all its vector results,
> regardless of which ones we actually need. Account
> for the cost of unused results. */
> if (first_stmt_info == stmt_info)
> {
> unsigned int gaps = DR_GROUP_SIZE (first_stmt_info);
> stmt_vec_info next_stmt_info = first_stmt_info;
> do
> {
> gaps -= 1;
> next_stmt_info = DR_GROUP_NEXT_ELEMENT
> (next_stmt_info);
> }
> while (next_stmt_info);
> if (gaps)
> {
> if (dump_enabled_p ())
> dump_printf_loc (MSG_NOTE, vect_location,
> "vect_model_load_cost: %d "
> "unused vectors.\n",
> gaps);
> vect_get_load_cost (vinfo, stmt_info, slp_node, gaps,
> alignment_support_scheme,
> misalignment, false, &inside_cost,
> &prologue_cost, cost_vec, cost_vec,
> true);
> }
> }
> n_adjacent_loads++;
> continue;
> }
>
> counts one load for the LDN itself plus one load for each unused result.
> That doesn't make conceptual sense. The first_stmt_info == stmt_info
> block is counting one vector_load *per vector*, but n_adjacent_loads++
> is counting one vector_load *per LDN*. In other words, we're mixing units.
>
Correct, but even with the GAPs costing code it never made sense. So even
with the old code costing the gaps made absolutely no sense because that's
not how LDn's work. And that's not how it was costed when there is no gaps.
Whether there's a gap or not is really irrelevant for an LDn. It's not like the
Instruction does less or more work if you magically don't use some vectors.
> So in table form, the handling of LD4 seems to be:
>
> Number of used results GCC 11 Now
> ---------------------- ------ ----
> 1 4 vector_loads 4 vector_loads
> 2 4 vector_loads 3 vector_loads
> 3 4 vector_loads 2 vector_loads
> 4 4 vector_loads 1 vector_load
>
> If we remove the first_stmt_info == stmt_info block above, we'll get
> one vector_load for all cases. AArch64 can then multiply that by
> aarch64_ld234_st234_vectors to get the number of vectors.
That only makes sense to model throughput, for latency it makes zero
sense because the cost of an LD4 is not 4x vector loads. Never was
otherwise the instruction would be pointless to begin with.
For throughput it makes sense as a rough estimate. Take Neoverse V3
for instance. An LD4 costs 8 cycles whereas an LDR 6. Costing 4x 6
just doesn't make sense.
So I don't think multiplying by number of vectors should be done
for latency and we don't do that today when there is no gap.
>
> Alternatively, if we want generic code to present the same interface
> as before, we could both remove the first_stmt_info == stmt_info block
> and change:
>
> n_adjacent_loads++;
>
> to:
>
> n_adjacent_loads += group_size;
>
> But I can see the argument that counting 1 per LDN makes more sense
> for generic code, so personally I prefer the first approach
> (i.e. multiplying by aarch64_ld234_st234_vectors before applying
> the per-vector costs).
>
n_adjacent_loads is also pointless, the variable is shadowed inside the
if (memory_access_type == VMAT_LOAD_STORE_LANES)
and literally is just counting ncopies. So it's just ncopies.
So I've already removed it.
Thanks,
Tamar
> This is one of the cases where the "new" tree-vect*.cc approach to
> costing avoids a lot of the complexity associated with the old approach,
> in the sense that we can now perform the same calculation in a much
> simpler way. But some of the code associated with the old approach
> has been kept, even though it no longer applies. And that's messing
> up the calculation.
>
> Thanks,
> Richard