Tamar Christina <[email protected]> writes:
>> -----Original Message-----
>> From: Richard Sandiford <[email protected]>
>> Sent: 06 May 2026 23:12
>> To: Tamar Christina <[email protected]>
>> Cc: [email protected]; nd <[email protected]>; Richard Earnshaw
>> <[email protected]>; [email protected]; Alex Coplan
>> <[email protected]>; [email protected]; Wilco Dijkstra
>> <[email protected]>; Alice Carlotti <[email protected]>
>> Subject: Re: [PATCH]AArch64: account for load_lanes with gaps in costing
>> [PR124866]
>>
>> Tamar Christina <[email protected]> writes:
>> > The vectorizer when costing LOAD_LANES with gaps breaks the costing into
>> > multiple calls where the first call wants to cost GAPS loads and second
>> > GROUP_SIZE - GAPS loads. However:
>> >
>> > 1. Both VMAT types are LOAD_LANES, we we cost effectively
>> > GROUP_SIZE * LOAD_LANES. This increases the latency costs immensely.
>> >
>> > 2. The GROUP_SIZE costing also increases the throughput costs *
>> GROUP_SIZE,
>> > which has the knock on effect of increasing the costs for the
> latency again
>> > as the compiler applies a latency penalty due to the scalar code
> issuing
>> > quicker.
>> >
>> > What the vectorizer is trying to cost is simply 1x LOAD_LANES but
> it's asking
>> to
>> > cost the invidual gaps as well. However for an LDn operation this
> costs is
>> > always fixed and is the cost of the LDn itself.
>> >
>> > when there isn't any gaps we only cost 1x LOAD_LANES which is the
>> expected
>> > amount. Unfortunately we can't really tell in the backend easily
> what chain
>> of
>> > costing belongs to any one operation and so treat them as individual ones.
>> >
>> > This patch keeps track of the LOAD_LANES we've seen already when there is
>> a gap
>> > and
>> >
>> > 1. marks the cost of the first GAPS costing to 0.
>> > 2. costs the cost of the second GROUP_SIZE - GAPS loads as a single
>> LOAD_LANES.
>> >
>> > This is the best we can do until Richi's changes to cost operations
> like these
>> > as a group.
>> >
>> > Bootstrapped Regtested on aarch64-none-linux-gnu and no issues.
>>
>> I think we should just remove the:
>>
>> /* An IFN_LOAD_LANES will load all its vector results,
>> regardless of which ones we actually need. Account
>> for the cost of unused results. */
>> if (first_stmt_info == stmt_info)
>> {
>> unsigned int gaps = DR_GROUP_SIZE (first_stmt_info);
>> stmt_vec_info next_stmt_info = first_stmt_info;
>> do
>> {
>> gaps -= 1;
>> next_stmt_info = DR_GROUP_NEXT_ELEMENT
>> (next_stmt_info);
>> }
>> while (next_stmt_info);
>> if (gaps)
>> {
>> if (dump_enabled_p ())
>> dump_printf_loc (MSG_NOTE, vect_location,
>> "vect_model_load_cost: %d "
>> "unused vectors.\n",
>> gaps);
>> vect_get_load_cost (vinfo, stmt_info, slp_node, gaps,
>> alignment_support_scheme,
>> misalignment, false, &inside_cost,
>> &prologue_cost, cost_vec, cost_vec,
>> true);
>> }
>> }
>>
>> that you pointed to in the PR. This predates the "all SLP" model.
>> It was originally added in r11-6662-ge45c41988bfd65, at a time when
>> vectorizable_load (and thus vect_model_load_cost) was called for each
>> individual *scalar* stmt. As a result, we generally costed one vector
>> load for each scalar load. One-off costs that applied to the whole
>> group were added only when costing the first stmt_info in the group.
>>
>> This meant that for load-lanes, we'd cost one vector_load for each
>> member of the LDn that was used. That was ok (in the sense of being
>> self-consistent) when all of the results were used. But it meant
>> that the cost of a fully-utilised LD4 was 4 times greater than
>> the cost of an LD4 in which only one result was used. That didn't
>> make any sense, since the instruction loaded the same amount of
>> data regardless.
>>
>> There's no need for that now that we only call vectorizable_load once
>> per SLP node, rather than once per scalar stmt, and now that we cost
>> the instructions in the same way that we generate them.
>
> Sure, but this patch intentionally avoided changing behavior for older
> cost models that was written with this behavior in mind.
>
> In particular almost all cost models without issue information have
> the extra permute costs for ldn as 0, including generic tuning.
>
> This is why the change avoided aarch64_tune_params.vec_costs->issue_info.
>
> Now I can fix generic tuning, but for every other core I'll have to add
> a very arbitrary value (probably the same as generic tuning).
>
> So If we're ok with that I can do that (and hopefully other targets
> don't object)
But my point is that the existing AArch64 costs weren't written for
what the generic code currently does. The AArch64 costs were written
for the assumption that, in aggregate, N vector_loads would be costed
for each LDN, regardless of how many results are used. That's what used
to happen in GCC 11.
This is why, for example:
/* Per-vector cost of permuting vectors after an LD2, LD3 or LD4,
as well as the per-vector cost of permuting vectors before
an ST2, ST3 or ST4. */
int ld2_st2_permute_cost;
int ld3_st3_permute_cost;
int ld4_st4_permute_cost;
is per vector rather than per LDN.
But if I'm following correctly, we now have the worst of both worlds.
This code:
if (costing_p)
{
/* An IFN_LOAD_LANES will load all its vector results,
regardless of which ones we actually need. Account
for the cost of unused results. */
if (first_stmt_info == stmt_info)
{
unsigned int gaps = DR_GROUP_SIZE (first_stmt_info);
stmt_vec_info next_stmt_info = first_stmt_info;
do
{
gaps -= 1;
next_stmt_info = DR_GROUP_NEXT_ELEMENT (next_stmt_info);
}
while (next_stmt_info);
if (gaps)
{
if (dump_enabled_p ())
dump_printf_loc (MSG_NOTE, vect_location,
"vect_model_load_cost: %d "
"unused vectors.\n",
gaps);
vect_get_load_cost (vinfo, stmt_info, slp_node, gaps,
alignment_support_scheme,
misalignment, false, &inside_cost,
&prologue_cost, cost_vec, cost_vec,
true);
}
}
n_adjacent_loads++;
continue;
}
counts one load for the LDN itself plus one load for each unused result.
That doesn't make conceptual sense. The first_stmt_info == stmt_info
block is counting one vector_load *per vector*, but n_adjacent_loads++
is counting one vector_load *per LDN*. In other words, we're mixing units.
So in table form, the handling of LD4 seems to be:
Number of used results GCC 11 Now
---------------------- ------ ----
1 4 vector_loads 4 vector_loads
2 4 vector_loads 3 vector_loads
3 4 vector_loads 2 vector_loads
4 4 vector_loads 1 vector_load
If we remove the first_stmt_info == stmt_info block above, we'll get
one vector_load for all cases. AArch64 can then multiply that by
aarch64_ld234_st234_vectors to get the number of vectors.
Alternatively, if we want generic code to present the same interface
as before, we could both remove the first_stmt_info == stmt_info block
and change:
n_adjacent_loads++;
to:
n_adjacent_loads += group_size;
But I can see the argument that counting 1 per LDN makes more sense
for generic code, so personally I prefer the first approach
(i.e. multiplying by aarch64_ld234_st234_vectors before applying
the per-vector costs).
This is one of the cases where the "new" tree-vect*.cc approach to
costing avoids a lot of the complexity associated with the old approach,
in the sense that we can now perform the same calculation in a much
simpler way. But some of the code associated with the old approach
has been kept, even though it no longer applies. And that's messing
up the calculation.
Thanks,
Richard