Re: [PATCH]AArch64: account for load_lanes with gaps in costing [PR124866]

Richard Sandiford Tue, 12 May 2026 04:31:43 -0700

Tamar Christina <[email protected]> writes:
>> -----Original Message-----
>> From: Richard Sandiford <[email protected]>
>> Sent: 06 May 2026 23:12
>> To: Tamar Christina <[email protected]>
>> Cc: [email protected]; nd <[email protected]>; Richard Earnshaw
>> <[email protected]>; [email protected]; Alex Coplan
>> <[email protected]>; [email protected]; Wilco Dijkstra
>> <[email protected]>; Alice Carlotti <[email protected]>
>> Subject: Re: [PATCH]AArch64: account for load_lanes with gaps in costing
>> [PR124866]
>> 
>> Tamar Christina <[email protected]> writes:
>> > The vectorizer when costing LOAD_LANES with gaps breaks the costing into
>> > multiple calls where the first call wants to cost GAPS loads and second
>> > GROUP_SIZE - GAPS loads. However:
>> >
>> > 1. Both VMAT types are LOAD_LANES, we we cost effectively
>> >    GROUP_SIZE * LOAD_LANES.  This increases the latency costs immensely.
>> >
>> > 2. The GROUP_SIZE costing also increases the throughput costs *
>> GROUP_SIZE,
>> > which has the knock on effect of increasing the costs for the
> latency again
>> > as the compiler applies a latency penalty due to the scalar code
> issuing
>> >    quicker.
>> >
>> > What the vectorizer is trying to cost is simply 1x LOAD_LANES but
> it's asking
>> to
>> > cost the invidual gaps as well.  However for an LDn operation this
> costs is
>> > always fixed and is the cost of the LDn itself.
>> >
>> > when there isn't any gaps we only cost 1x LOAD_LANES which is the
>> expected
>> > amount.  Unfortunately we can't really tell in the backend easily
> what chain
>> of
>> > costing belongs to any one operation and so treat them as individual ones.
>> >
>> > This patch keeps track of the LOAD_LANES we've seen already when there is
>> a gap
>> > and
>> >
>> > 1. marks the cost of the first GAPS costing to 0.
>> > 2. costs the cost of the second GROUP_SIZE - GAPS loads as a single
>> LOAD_LANES.
>> >
>> > This is the best we can do until Richi's changes to cost operations
> like these
>> > as a group.
>> >
>> > Bootstrapped Regtested on aarch64-none-linux-gnu and no issues.
>> 
>> I think we should just remove the:
>> 
>>            /* An IFN_LOAD_LANES will load all its vector results,
>>               regardless of which ones we actually need.  Account
>>               for the cost of unused results.  */
>>            if (first_stmt_info == stmt_info)
>>              {
>>                unsigned int gaps = DR_GROUP_SIZE (first_stmt_info);
>>                stmt_vec_info next_stmt_info = first_stmt_info;
>>                do
>>                  {
>>                    gaps -= 1;
>>                    next_stmt_info = DR_GROUP_NEXT_ELEMENT
>> (next_stmt_info);
>>                  }
>>                while (next_stmt_info);
>>                if (gaps)
>>                  {
>>                    if (dump_enabled_p ())
>>                      dump_printf_loc (MSG_NOTE, vect_location,
>>                                       "vect_model_load_cost: %d "
>>                                       "unused vectors.\n",
>>                                       gaps);
>>                    vect_get_load_cost (vinfo, stmt_info, slp_node, gaps,
>>                                        alignment_support_scheme,
>>                                        misalignment, false, &inside_cost,
>>                                        &prologue_cost, cost_vec, cost_vec,
>>                                        true);
>>                  }
>>              }
>> 
>> that you pointed to in the PR.  This predates the "all SLP" model.
>> It was originally added in r11-6662-ge45c41988bfd65, at a time when
>> vectorizable_load (and thus vect_model_load_cost) was called for each
>> individual *scalar* stmt.  As a result, we generally costed one vector
>> load for each scalar load.  One-off costs that applied to the whole
>> group were added only when costing the first stmt_info in the group.
>> 
>> This meant that for load-lanes, we'd cost one vector_load for each
>> member of the LDn that was used.  That was ok (in the sense of being
>> self-consistent) when all of the results were used.  But it meant
>> that the cost of a fully-utilised LD4 was 4 times greater than
>> the cost of an LD4 in which only one result was used.  That didn't
>> make any sense, since the instruction loaded the same amount of
>> data regardless.
>> 
>> There's no need for that now that we only call vectorizable_load once
>> per SLP node, rather than once per scalar stmt, and now that we cost
>> the instructions in the same way that we generate them.
>
> Sure, but this patch intentionally avoided changing behavior for older
> cost models that was written with this behavior in mind.
>
> In particular almost all cost models without issue information have
> the extra permute costs for ldn as 0, including generic tuning.
>
> This is why the change avoided aarch64_tune_params.vec_costs->issue_info.
>
> Now I can fix generic tuning, but for every other core I'll have to add
> a very arbitrary value (probably the same as generic tuning).
>
> So If we're ok with that I can do that (and hopefully other targets
> don't object)


But my point is that the existing AArch64 costs weren't written for
what the generic code currently does.  The AArch64 costs were written
for the assumption that, in aggregate, N vector_loads would be costed
for each LDN, regardless of how many results are used.  That's what used
to happen in GCC 11.

This is why, for example:

  /* Per-vector cost of permuting vectors after an LD2, LD3 or LD4,
     as well as the per-vector cost of permuting vectors before
     an ST2, ST3 or ST4.  */
  int ld2_st2_permute_cost;
  int ld3_st3_permute_cost;
  int ld4_st4_permute_cost;

is per vector rather than per LDN.

But if I'm following correctly, we now have the worst of both worlds.
This code:

          if (costing_p)
            {
              /* An IFN_LOAD_LANES will load all its vector results,
                 regardless of which ones we actually need.  Account
                 for the cost of unused results.  */
              if (first_stmt_info == stmt_info)
                {
                  unsigned int gaps = DR_GROUP_SIZE (first_stmt_info);
                  stmt_vec_info next_stmt_info = first_stmt_info;
                  do
                    {
                      gaps -= 1;
                      next_stmt_info = DR_GROUP_NEXT_ELEMENT (next_stmt_info);
                    }
                  while (next_stmt_info);
                  if (gaps)
                    {
                      if (dump_enabled_p ())
                        dump_printf_loc (MSG_NOTE, vect_location,
                                         "vect_model_load_cost: %d "
                                         "unused vectors.\n",
                                         gaps);
                      vect_get_load_cost (vinfo, stmt_info, slp_node, gaps,
                                          alignment_support_scheme,
                                          misalignment, false, &inside_cost,
                                          &prologue_cost, cost_vec, cost_vec,
                                          true);
                    }
                }
              n_adjacent_loads++;
              continue;
            }

counts one load for the LDN itself plus one load for each unused result.
That doesn't make conceptual sense.  The first_stmt_info == stmt_info
block is counting one vector_load *per vector*, but n_adjacent_loads++
is counting one vector_load *per LDN*.  In other words, we're mixing units.

So in table form, the handling of LD4 seems to be:

Number of used results     GCC 11          Now
----------------------     ------          ----
1                          4 vector_loads  4 vector_loads
2                          4 vector_loads  3 vector_loads
3                          4 vector_loads  2 vector_loads
4                          4 vector_loads  1 vector_load

If we remove the first_stmt_info == stmt_info block above, we'll get
one vector_load for all cases.  AArch64 can then multiply that by
aarch64_ld234_st234_vectors to get the number of vectors.

Alternatively, if we want generic code to present the same interface
as before, we could both remove the first_stmt_info == stmt_info block
and change:

              n_adjacent_loads++;

to:

              n_adjacent_loads += group_size;

But I can see the argument that counting 1 per LDN makes more sense
for generic code, so personally I prefer the first approach
(i.e. multiplying by aarch64_ld234_st234_vectors before applying
the per-vector costs).

This is one of the cases where the "new" tree-vect*.cc approach to
costing avoids a lot of the complexity associated with the old approach,
in the sense that we can now perform the same calculation in a much
simpler way.  But some of the code associated with the old approach
has been kept, even though it no longer applies.  And that's messing
up the calculation.

Thanks,
Richard

Re: [PATCH]AArch64: account for load_lanes with gaps in costing [PR124866]

Reply via email to