On Tue, May 12, 2026 at 1:32 PM Richard Sandiford
<[email protected]> wrote:
>
> Tamar Christina <[email protected]> writes:
> >> -----Original Message-----
> >> From: Richard Sandiford <[email protected]>
> >> Sent: 06 May 2026 23:12
> >> To: Tamar Christina <[email protected]>
> >> Cc: [email protected]; nd <[email protected]>; Richard Earnshaw
> >> <[email protected]>; [email protected]; Alex Coplan
> >> <[email protected]>; [email protected]; Wilco Dijkstra
> >> <[email protected]>; Alice Carlotti <[email protected]>
> >> Subject: Re: [PATCH]AArch64: account for load_lanes with gaps in costing
> >> [PR124866]
> >>
> >> Tamar Christina <[email protected]> writes:
> >> > The vectorizer when costing LOAD_LANES with gaps breaks the costing into
> >> > multiple calls where the first call wants to cost GAPS loads and second
> >> > GROUP_SIZE - GAPS loads. However:
> >> >
> >> > 1. Both VMAT types are LOAD_LANES, we we cost effectively
> >> >    GROUP_SIZE * LOAD_LANES.  This increases the latency costs immensely.
> >> >
> >> > 2. The GROUP_SIZE costing also increases the throughput costs *
> >> GROUP_SIZE,
> >> > which has the knock on effect of increasing the costs for the
> > latency again
> >> > as the compiler applies a latency penalty due to the scalar code
> > issuing
> >> >    quicker.
> >> >
> >> > What the vectorizer is trying to cost is simply 1x LOAD_LANES but
> > it's asking
> >> to
> >> > cost the invidual gaps as well.  However for an LDn operation this
> > costs is
> >> > always fixed and is the cost of the LDn itself.
> >> >
> >> > when there isn't any gaps we only cost 1x LOAD_LANES which is the
> >> expected
> >> > amount.  Unfortunately we can't really tell in the backend easily
> > what chain
> >> of
> >> > costing belongs to any one operation and so treat them as individual 
> >> > ones.
> >> >
> >> > This patch keeps track of the LOAD_LANES we've seen already when there is
> >> a gap
> >> > and
> >> >
> >> > 1. marks the cost of the first GAPS costing to 0.
> >> > 2. costs the cost of the second GROUP_SIZE - GAPS loads as a single
> >> LOAD_LANES.
> >> >
> >> > This is the best we can do until Richi's changes to cost operations
> > like these
> >> > as a group.
> >> >
> >> > Bootstrapped Regtested on aarch64-none-linux-gnu and no issues.
> >>
> >> I think we should just remove the:
> >>
> >>            /* An IFN_LOAD_LANES will load all its vector results,
> >>               regardless of which ones we actually need.  Account
> >>               for the cost of unused results.  */
> >>            if (first_stmt_info == stmt_info)
> >>              {
> >>                unsigned int gaps = DR_GROUP_SIZE (first_stmt_info);
> >>                stmt_vec_info next_stmt_info = first_stmt_info;
> >>                do
> >>                  {
> >>                    gaps -= 1;
> >>                    next_stmt_info = DR_GROUP_NEXT_ELEMENT
> >> (next_stmt_info);
> >>                  }
> >>                while (next_stmt_info);
> >>                if (gaps)
> >>                  {
> >>                    if (dump_enabled_p ())
> >>                      dump_printf_loc (MSG_NOTE, vect_location,
> >>                                       "vect_model_load_cost: %d "
> >>                                       "unused vectors.\n",
> >>                                       gaps);
> >>                    vect_get_load_cost (vinfo, stmt_info, slp_node, gaps,
> >>                                        alignment_support_scheme,
> >>                                        misalignment, false, &inside_cost,
> >>                                        &prologue_cost, cost_vec, cost_vec,
> >>                                        true);
> >>                  }
> >>              }
> >>
> >> that you pointed to in the PR.  This predates the "all SLP" model.
> >> It was originally added in r11-6662-ge45c41988bfd65, at a time when
> >> vectorizable_load (and thus vect_model_load_cost) was called for each
> >> individual *scalar* stmt.  As a result, we generally costed one vector
> >> load for each scalar load.  One-off costs that applied to the whole
> >> group were added only when costing the first stmt_info in the group.
> >>
> >> This meant that for load-lanes, we'd cost one vector_load for each
> >> member of the LDn that was used.  That was ok (in the sense of being
> >> self-consistent) when all of the results were used.  But it meant
> >> that the cost of a fully-utilised LD4 was 4 times greater than
> >> the cost of an LD4 in which only one result was used.  That didn't
> >> make any sense, since the instruction loaded the same amount of
> >> data regardless.
> >>
> >> There's no need for that now that we only call vectorizable_load once
> >> per SLP node, rather than once per scalar stmt, and now that we cost
> >> the instructions in the same way that we generate them.
> >
> > Sure, but this patch intentionally avoided changing behavior for older
> > cost models that was written with this behavior in mind.
> >
> > In particular almost all cost models without issue information have
> > the extra permute costs for ldn as 0, including generic tuning.
> >
> > This is why the change avoided aarch64_tune_params.vec_costs->issue_info.
> >
> > Now I can fix generic tuning, but for every other core I'll have to add
> > a very arbitrary value (probably the same as generic tuning).
> >
> > So If we're ok with that I can do that (and hopefully other targets
> > don't object)
>
> But my point is that the existing AArch64 costs weren't written for
> what the generic code currently does.  The AArch64 costs were written
> for the assumption that, in aggregate, N vector_loads would be costed
> for each LDN, regardless of how many results are used.  That's what used
> to happen in GCC 11.
>
> This is why, for example:
>
>   /* Per-vector cost of permuting vectors after an LD2, LD3 or LD4,
>      as well as the per-vector cost of permuting vectors before
>      an ST2, ST3 or ST4.  */
>   int ld2_st2_permute_cost;
>   int ld3_st3_permute_cost;
>   int ld4_st4_permute_cost;
>
> is per vector rather than per LDN.
>
> But if I'm following correctly, we now have the worst of both worlds.
> This code:
>
>           if (costing_p)
>             {
>               /* An IFN_LOAD_LANES will load all its vector results,
>                  regardless of which ones we actually need.  Account
>                  for the cost of unused results.  */
>               if (first_stmt_info == stmt_info)
>                 {
>                   unsigned int gaps = DR_GROUP_SIZE (first_stmt_info);
>                   stmt_vec_info next_stmt_info = first_stmt_info;
>                   do
>                     {
>                       gaps -= 1;
>                       next_stmt_info = DR_GROUP_NEXT_ELEMENT (next_stmt_info);
>                     }
>                   while (next_stmt_info);
>                   if (gaps)
>                     {
>                       if (dump_enabled_p ())
>                         dump_printf_loc (MSG_NOTE, vect_location,
>                                          "vect_model_load_cost: %d "
>                                          "unused vectors.\n",
>                                          gaps);
>                       vect_get_load_cost (vinfo, stmt_info, slp_node, gaps,
>                                           alignment_support_scheme,
>                                           misalignment, false, &inside_cost,
>                                           &prologue_cost, cost_vec, cost_vec,
>                                           true);
>                     }
>                 }
>               n_adjacent_loads++;
>               continue;
>             }
>
> counts one load for the LDN itself plus one load for each unused result.
> That doesn't make conceptual sense.  The first_stmt_info == stmt_info
> block is counting one vector_load *per vector*, but n_adjacent_loads++
> is counting one vector_load *per LDN*.  In other words, we're mixing units.
>
> So in table form, the handling of LD4 seems to be:
>
> Number of used results     GCC 11          Now
> ----------------------     ------          ----
> 1                          4 vector_loads  4 vector_loads
> 2                          4 vector_loads  3 vector_loads
> 3                          4 vector_loads  2 vector_loads
> 4                          4 vector_loads  1 vector_load
>
> If we remove the first_stmt_info == stmt_info block above, we'll get
> one vector_load for all cases.  AArch64 can then multiply that by
> aarch64_ld234_st234_vectors to get the number of vectors.
>
> Alternatively, if we want generic code to present the same interface
> as before, we could both remove the first_stmt_info == stmt_info block
> and change:
>
>               n_adjacent_loads++;
>
> to:
>
>               n_adjacent_loads += group_size;
>
> But I can see the argument that counting 1 per LDN makes more sense
> for generic code, so personally I prefer the first approach
> (i.e. multiplying by aarch64_ld234_st234_vectors before applying
> the per-vector costs).
>
> This is one of the cases where the "new" tree-vect*.cc approach to
> costing avoids a lot of the complexity associated with the old approach,
> in the sense that we can now perform the same calculation in a much
> simpler way.  But some of the code associated with the old approach
> has been kept, even though it no longer applies.  And that's messing
> up the calculation.

Btw, I agree with removing the block.  What we have now makes no
sense.

Richard.

>
> Thanks,
> Richard

Reply via email to