Tamar Christina <[email protected]> writes:
>> if (costing_p)
>>   {
>>     /* An IFN_LOAD_LANES will load all its vector results,
>>        regardless of which ones we actually need. Account
>>        for the cost of unused results. */
>>     if (first_stmt_info == stmt_info)
>>       {
>>         unsigned int gaps = DR_GROUP_SIZE (first_stmt_info);
>>         stmt_vec_info next_stmt_info = first_stmt_info;
>>         do
>>           {
>>             gaps -= 1;
>>             next_stmt_info = DR_GROUP_NEXT_ELEMENT
>>               (next_stmt_info);
>>           }
>>         while (next_stmt_info);
>>         if (gaps)
>>           {
>>             if (dump_enabled_p ())
>>               dump_printf_loc (MSG_NOTE, vect_location,
>>                                "vect_model_load_cost: %d "
>>                                "unused vectors.\n",
>>                                gaps);
>>             vect_get_load_cost (vinfo, stmt_info, slp_node, gaps,
>>                                 alignment_support_scheme,
>>                                 misalignment, false, &inside_cost,
>>                                 &prologue_cost, cost_vec, cost_vec,
>>                                 true);
>>           }
>>       }
>>     n_adjacent_loads++;
>>     continue;
>>   }
>>
>> counts one load for the LDN itself plus one load for each unused result.
>> That doesn't make conceptual sense. The first_stmt_info == stmt_info
>> block is counting one vector_load *per vector*, but n_adjacent_loads++
>> is counting one vector_load *per LDN*. In other words, we're mixing units.
>>
>
> Correct, but even with the GAPs costing code it never made sense. So even
> with the old code, costing the gaps made absolutely no sense because that's
> not how LDn's work. And that's not how it was costed when there are no gaps.
> Whether there's a gap or not is really irrelevant for an LDn. It's not
> like the instruction does less or more work if you magically don't use
> some vectors.

Do you mean GCC 11 never made sense, or some intermediate state?

I tried to explain how things worked in GCC 11 before. There, normal
non-SLP loop vectorisation would call vectorizable_load for each scalar
statement. Thus for:

  for (...)
    {
      a = x[i * 3];
      b = x[i * 3 + 2];
      ...
    }

it would call vectorizable_load for both a and b. Each vectorizable_load
would cost one vector_load.

But on its own, that would only count 2 vector loads, even though LD3
loads 3 vectors. Which I think we agree is wrong. We should count
the same number of loads regardless of how many load results are used.

So r11-6662-ge45c41988bfd65 added code to count the gaps. This was
applied once per LDN, rather than once per scalar load. At the time,
the way to do that was to protect the code with:

  first_stmt_info == stmt_info

i.e. attach the extra cost to the first scalar load in the group.

With that in place, we counted 3 vector loads for the example above:
1 vector load for each scalar load and 1 vector load for the gap.

If the example had been:

  for (...)
    {
      a = x[i * 3];
      ...
    }

we would have counted 1 vector load for the scalar load and 2
vector loads for the gaps. For:

  for (...)
    {
      a = x[i * 3];
      b = x[i * 3 + 1];
      c = x[i * 3 + 2];
    }

we would have counted 1 vector load for each of the 3 scalar loads,
with no gap.

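To make the GCC 11 arithmetic above concrete, here is a toy model of it.
This is a standalone sketch, not GCC code: the function name and both
parameter names are made up for illustration, and it only mirrors the
counting described in the three examples.

```c
#include <assert.h>

/* Toy model of the GCC 11 accounting for one LDN group.
   group_size: number of vectors the LDN loads (3 for LD3).
   used: number of scalar loads in the group that actually exist.
   Hypothetical names; this is not GCC code.  */
static unsigned int
gcc11_vector_loads (unsigned int group_size, unsigned int used)
{
  assert (used >= 1 && used <= group_size);

  /* One vector_load per vectorizable_load call, i.e. per scalar load.  */
  unsigned int per_stmt = used;

  /* Gap cost, attached once to the first scalar load in the group.  */
  unsigned int gaps = group_size - used;

  return per_stmt + gaps;
}
```

Under these assumptions the result is always group_size, so LD3 counts
3 vector loads whether 1, 2 or 3 of its results are used.
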
But if I'm not explaining it well, or you don't believe me, it might be
easier to check out GCC 11 and play around with the code.

Everything changed when we moved to SLP generation of LDN, and moved
to calling vectorizable_load once per SLP node rather than once per
scalar statement. At that point, counting gaps made no sense.
The thing that it was compensating for -- i.e. the lack of scalar
stmts for the unused vectors -- is no longer relevant, because
we no longer count things on a per-scalar-stmt basis.

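As a rough sketch of what the quoted costing code counts now that
vectorizable_load is called once per SLP node: one load for the LDN
(the n_adjacent_loads++) plus one per unused result. Again this is a
standalone toy model with hypothetical names, not GCC code.

```c
#include <assert.h>

/* Toy model of the current accounting for one LDN group.
   group_size: number of vectors the LDN loads (4 for LD4).
   used: number of results that are actually used.
   Hypothetical names; this is not GCC code.  */
static unsigned int
current_vector_loads (unsigned int group_size, unsigned int used)
{
  assert (used >= 1 && used <= group_size);

  /* n_adjacent_loads++: one vector_load per LDN.  */
  unsigned int per_ldn = 1;

  /* first_stmt_info == stmt_info block: one vector_load per gap.  */
  unsigned int gaps = group_size - used;

  return per_ldn + gaps;
}
```

For LD4 this gives 4, 3, 2 and 1 vector_loads for 1, 2, 3 and 4 used
results, so the count now depends on how many results are used, which
is the mixing of units complained about above.
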
>> So in table form, the handling of LD4 seems to be:
>>
>>   Number of used results   GCC 11           Now
>>   ----------------------   --------------   --------------
>>   1                        4 vector_loads   4 vector_loads
>>   2                        4 vector_loads   3 vector_loads
>>   3                        4 vector_loads   2 vector_loads
>>   4                        4 vector_loads   1 vector_load
>>
>> If we remove the first_stmt_info == stmt_info block above, we'll get
>> one vector_load for all cases. AArch64 can then multiply that by
>> aarch64_ld234_st234_vectors to get the number of vectors.
>
> That only makes sense to model throughput. For latency it makes zero
> sense, because the cost of an LD4 is not 4x vector loads. It never was,
> otherwise the instruction would be pointless to begin with.

But remember that "latency" (i.e. the normal vector cost used on all
targets) is really "serial latency". We just add up the individual
cost of each operation, regardless of what can execute in parallel.
The cost of a load from x[] is added to the cost of a load from y[].
So from that POV, adding 4 vector loads as the latency cost for LD4
does make sense, since LD4 does do the same number of loads as
4 loads from separate arrays.

In terms of the costing model, and I think in reality, the thing
the instruction saves on is permutation cost. That's especially
true for LD3.

That is, LD3 is "slower" than 3 LD1s, but is "faster" than the
combination of 3 LD1s followed by permutations that simulate
the shuffling that an LD3 does.

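A minimal sketch of that "serial latency" view, purely for illustration.
Everything here is an assumption: the function names, the idea of
modelling the LDN's built-in shuffle as a single extra term (ldn_extra),
and the permute count are all hypothetical, and no real tuning values
are implied.

```c
#include <assert.h>

/* Serial-latency toy model: costs are simply added up, with no credit
   for operations that could execute in parallel.  All parameters are
   placeholders supplied by the caller.  */

/* N independent LD1s: N vector loads.  */
static unsigned int
cost_ld1s (unsigned int nvectors, unsigned int load_cost)
{
  return nvectors * load_cost;
}

/* One LDN: N vector loads plus an assumed extra for the built-in
   de-interleaving (hypothetical parameter ldn_extra).  */
static unsigned int
cost_ldn (unsigned int nvectors, unsigned int load_cost,
          unsigned int ldn_extra)
{
  assert (nvectors >= 2 && nvectors <= 4);
  return nvectors * load_cost + ldn_extra;
}

/* N LD1s followed by explicit permutes that simulate the LDN shuffle.  */
static unsigned int
cost_ld1s_plus_permutes (unsigned int nvectors, unsigned int load_cost,
                         unsigned int npermutes, unsigned int permute_cost)
{
  return cost_ld1s (nvectors, load_cost) + npermutes * permute_cost;
}
```

In this model the ordering described above (3 LD1s, then LD3, then
3 LD1s plus permutes) holds whenever ldn_extra is nonzero and smaller
than npermutes * permute_cost.
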
> For throughput it makes sense as a rough estimate. Take Neoverse V3
> for instance. An LD4 costs 8 cycles whereas an LDR costs 6. Costing
> 4x 6 just doesn't make sense.
>
> So I don't think multiplying by number of vectors should be done
> for latency and we don't do that today when there is no gap.
>
>>
>> Alternatively, if we want generic code to present the same interface
>> as before, we could both remove the first_stmt_info == stmt_info block
>> and change:
>>
>> n_adjacent_loads++;
>>
>> to:
>>
>> n_adjacent_loads += group_size;
>>
>> But I can see the argument that counting 1 per LDN makes more sense
>> for generic code, so personally I prefer the first approach
>> (i.e. multiplying by aarch64_ld234_st234_vectors before applying
>> the per-vector costs).
>>
>
> n_adjacent_loads is also pointless, the variable is shadowed inside the
> if (memory_access_type == VMAT_LOAD_STORE_LANES)
>
> and literally is just counting ncopies. So it's just ncopies.
>
> So I've already removed it.

Yeah, agreed.

Thanks,
Richard