Tamar Christina writes:
> I wrote testcase gcc/testsuite/gcc.target/aarch64/sve/cost_model_23.c
> to check specifically what happens with the different cases where we
> use LDn.
For the avoidance of doubt, to compare with your results, I used:
diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc
index 8754e648b89..89559f948b3 100644
--- a/gcc/config/aarch64/aarch64.cc
+++ b/gcc/config/aarch64/aarch64.cc
@@ -18631,6 +18626,10 @@ aarch64_vector_costs::add_stmt_cost (int count, vect_cost_for_stmt kind,
         vectype = TREE_TYPE (lhs);
     }
+  if (stmt_info)
+    if (auto nv = aarch64_ld234_st234_vectors (kind, stmt_info, node))
+      count *= nv;
+
   fractional_cost stmt_cost
     = aarch64_builtin_vectorization_cost (kind, vectype, misalign);
diff --git a/gcc/tree-vect-stmts.cc b/gcc/tree-vect-stmts.cc
index a74e03cc0f6..bab32c85514 100644
--- a/gcc/tree-vect-stmts.cc
+++ b/gcc/tree-vect-stmts.cc
@@ -10705,33 +10705,6 @@ vectorizable_load (vec_info *vinfo,
             {
               if (costing_p)
                 {
-                  /* An IFN_LOAD_LANES will load all its vector results,
-                     regardless of which ones we actually need. Account
-                     for the cost of unused results. */
-                  if (first_stmt_info == stmt_info)
-                    {
-                      unsigned int gaps = DR_GROUP_SIZE (first_stmt_info);
-                      stmt_vec_info next_stmt_info = first_stmt_info;
-                      do
-                        {
-                          gaps -= 1;
-                          next_stmt_info = DR_GROUP_NEXT_ELEMENT (next_stmt_info);
-                        }
-                      while (next_stmt_info);
-                      if (gaps)
-                        {
-                          if (dump_enabled_p ())
-                            dump_printf_loc (MSG_NOTE, vect_location,
-                                             "vect_model_load_cost: %d "
-                                             "unused vectors.\n",
-                                             gaps);
-                          vect_get_load_cost (vinfo, stmt_info, slp_node, gaps,
-                                              alignment_support_scheme,
-                                              misalignment, false, &inside_cost,
-                                              &prologue_cost, cost_vec, cost_vec,
-                                              true);
-                        }
-                    }
                   n_adjacent_loads++;
                   continue;
                 }
(without the simplification you mentioned to get rid of n_adjacent_loads).
That seems to match your numbers for (c) below.
> So let's break them down:
>
> 1. Loads with gaps: we already covered this. And I agree that the * is
> there so that, when the weighted cycles per iteration are taken, we still
> have the actual cost of 1 LDn per cycle, or at least roughly account for
> it.
>
> 2. Without gaps, on trunk today cost_model_23.c says for
>
> a) armv8-a (which has no issue information)
>
> _3->r 1 times vector_load costs 1 in body
>
> Which ok, not really sure what to do with that.
>
> b) armv9-a (which has issue information)
>
> _3->r 1 times vector_load costs 7 in body
>
> OK, makes more sense, but then
>
> note: Original vector body cost = 15
> note: Scalar issue estimate:
> note: load operations = 4
> note: store operations = 0
> note: general operations = 4
> note: reduction latency = 1
> note: estimated min cycles per iteration = 1.333333
> note: estimated cycles per vector iteration (for VF 16) = 21.333333
> note: SVE issue estimate:
> note: load operations = 1
> note: store operations = 0
> note: general operations = 7
> note: predicate operations = 0
> note: reduction latency = 2
> note: estimated min cycles per iteration = 3.500000
>
> with the summary
>
> Vector inside of loop cost: 15
> Vector prologue cost: 16
> Vector epilogue cost: 197
> Scalar iteration cost: 20
> Scalar outside cost: 2
> Vector outside cost: 213
> prologue iterations: 0
> epilogue iterations: 8
> Calculated minimum iters for profitability: 11
>
> So it looks like for this case we cost a LD4 as just 1 load.
>
> c) low iteration counts with gaps
>
> For low known iteration counts we switch to latency-only costing.
> This costing has no weighted cycles, so the 4 * LD4 cost adjustment
> seems quite off.
>
> So
>
> int reduce(int *a, int n)
> {
> int sum = 0;
> for (int i = 0; i < n; i=i+4) {
> sum += a[i];
> }
> return sum;
> }
>
> Gets costed as
>
> note: Original vector body cost = 30
> note: Scalar issue estimate:
> note: load operations = 1
> note: store operations = 0
> note: general operations = 1
> note: reduction latency = 1
> note: estimated min cycles per iteration = 1.000000
> note: estimated cycles per vector iteration (for VF 4) = 4.000000
> note: SVE issue estimate:
> note: load operations = 4
> note: store operations = 0
> note: general operations = 13
> note: predicate operations = 2
> note: reduction latency = 2
> note: estimated min cycles per iteration without predication = 6.500000
> note: estimated min cycles per iteration for predication = 1.000000
> note: estimated min cycles per iteration = 6.500000
> note: Increasing body cost to 49 because scalar code would issue more
> quickly
> note: Cost model analysis:
> Vector inside of loop cost: 49
> Vector prologue cost: 6
> Vector epilogue cost: 14
> Scalar iteration cost: 5
> Scalar outside cost: 2
> Vector outside cost: 20
> prologue iterations: 0
> epilogue iterations: 1
>
> but
>
> int reduce(int *a, int n)
> {
> int sum = 0;
> #pragma GCC unroll 0
> for (int i = 0; i < 65; i=i+4) {
> sum += a[i];
> }
> return sum;
> }
>
> As
>
> note: Original vector body cost = 30
> note: Vector loop iterates at most 5 times
> note: Scalar issue estimate:
> note: load operations = 1
> note: store operations = 0
> note: general operations = 1
> note: reduction latency = 1
> note: estimated min cycles per iteration = 1.000000
> note: estimated cycles per vector iteration (for VF 4) = 4.000000
> note: SVE issue estimate:
> note: load operations = 4
> note: store operations = 0
> note: general operations = 13
> note: predicate operations = 2
> note: reduction latency = 2
> note: estimated min cycles per iteration without predication = 6.500000
> note: estimated min cycles per iteration for predication = 1.000000
> note: estimated min cycles per iteration = 6.500000
> note: Low iteration count, so using pure latency costs
> note: Cost model analysis:
> Vector inside of loop cost: 30
> Vector prologue cost: 6
> Vector epilogue cost: 14
> Scalar iteration cost: 5
> Scalar outside cost: 0
> Vector outside cost: 20
> prologue iterations: 0
> epilogue iterations: 1
>
> This is the one I find weird. For both cases 49 and 30 are way too high
> in terms of serial latency cost for the loop. It makes the instruction really
> really expensive.
The 49 one is just an artificial cost created by:

  note: Increasing body cost to 49 because scalar code would issue more quickly

with the intention of forcing the scalar code to be chosen, so it's
deliberately excessive.
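
The 49 is what you get if the unfiltered 30 is scaled by the ratio of the two issue estimates (6.5 vector cycles against 4 scalar cycles per vector iteration) and rounded up: 30 * 6.5 / 4 = 48.75. A minimal sketch of that arithmetic, assuming that is the form the scaling takes. The name scale_body_cost is made up for illustration and is not a GCC function:

```c
/* Hypothetical helper, not a GCC function.  Scale a latency-based body
   cost by (vector cycles / scalar cycles), rounding up.  The vector
   cycle count is passed as a fraction VEC_NUM / VEC_DEN so that the
   whole calculation stays in integers.  The round-up is an assumption
   chosen because it reproduces the dump.  */
unsigned int
scale_body_cost (unsigned int body_cost, unsigned int vec_num,
                 unsigned int vec_den, unsigned int scalar_cycles)
{
  unsigned int num = body_cost * vec_num;
  unsigned int den = vec_den * scalar_cycles;
  return (num + den - 1) / den;
}
```

Calling scale_body_cost (30, 13, 2, 4) models 30 * (13/2) / 4 and gives the 49 from the dump.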
30 is the unfiltered cost. It comes from:

  *_3 1 times vector_load costs 28 in body
  _4 + sum_13 1 times vector_stmt costs 2 in body

And that 28 comes from:

  align_load_cost: 4
  ld4_st4_permute_cost: 3

so 7 * 4. That is (to me) the expected behaviour for the current cost
tables in this situation.
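
Spelled out as code, the latency cost of the load is the per-vector load cost plus the per-vector permute cost, times the number of vectors in the LD4. This is only a sketch of the arithmetic; ldn_latency_cost is a hypothetical name, not something in aarch64.cc:

```c
/* Hypothetical helper, not a GCC function.  Latency cost of one LDn
   when both table entries are per-vector: each of the NVECTORS result
   vectors pays the load cost and the permute cost.  */
unsigned int
ldn_latency_cost (unsigned int align_load_cost, unsigned int permute_cost,
                  unsigned int nvectors)
{
  return (align_load_cost + permute_cost) * nvectors;
}
```

With the values above, ldn_latency_cost (4, 3, 4) is 28, and adding the 2 for the vector addition gives the body cost of 30.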
> And if you compare it with the case above where we have no gaps, we
> cost it as 15.
I get 36 with the patch above, from:

  _3->r 1 times vector_load costs 28 in body
  _16 w+ a_lsm.13_34 1 times vector_stmt costs 2 in body
  _12 w+ b_lsm.12_35 1 times vector_stmt costs 2 in body
  _8 w+ g_lsm.11_20 1 times vector_stmt costs 2 in body
  _4 w+ r_lsm.10_21 1 times vector_stmt costs 2 in body

That too is what I'd expect, so...
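
To make the before/after comparison concrete, here is a small model of the no-gap LD4 loop. The only thing that changes between trunk and the patch is the multiplier applied to the load's count. The types and function names are hypothetical simplifications, not the vectoriser's real data structures:

```c
#include <stddef.h>

/* Hypothetical, simplified stand-in for one costed statement.  */
struct stmt_cost
{
  unsigned int count;
  unsigned int unit_cost;
};

/* Sum count * unit_cost over all statements in the loop body.  */
unsigned int
sum_body_cost (const struct stmt_cost *stmts, size_t n)
{
  unsigned int total = 0;
  for (size_t i = 0; i < n; i++)
    total += stmts[i].count * stmts[i].unit_cost;
  return total;
}

/* Body cost of the no-gap LD4 loop.  NV is the multiplier applied to
   the load's count: 1 models trunk, 4 models "count *= nv" for LD4.  */
unsigned int
ld4_no_gap_body_cost (unsigned int nv)
{
  struct stmt_cost stmts[] = {
    { 1 * nv, 7 }, /* vector_load: 4 (load) + 3 (permute) per vector.  */
    { 4, 2 },      /* The four vector_stmt additions, 2 each.  */
  };
  return sum_body_cost (stmts, sizeof stmts / sizeof stmts[0]);
}
```

ld4_no_gap_body_cost (1) gives the 15 that trunk reports, and ld4_no_gap_body_cost (4) gives the 36 above.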
> So while the * 4 makes sense for determining the weighted cycles for
> cross mode testing, the final body costs are inconsistent.
...I think they are consistent with the change above, but maybe not in
the way you want :)
> Now granted, removing the gaps costing code will align the costing
> between 1 and 2, but by doing * 4 you now increase the costing for 2.b,
> making the loop unprofitable, while on every single uArch it's profitable.
>
> But the cost of 30 and 45 for the loop is unrealistic which is why we
> don't vectorize it today (and really what the PR is trying to report).
I'm surprised that vectorisation is profitable for cost_model_22.c.
But obviously you can't share your results with me these days :)
and I can no longer play around with simulators and traces.

The costs above show that, according to the current cost tables,
the vector code requires "6.5" cycles per iteration while the
scalar code (correctly) requires 4 cycles. That 6.5 of course
comes from 13 general operations / 2 general_ops_per_cycle.
And the 13 general operations come from:

  3 ld4_st4_general_ops * 4
  1 addition

That too is how I'd expect the cost tables to be used, even if that
gives the wrong end result.
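
As a sketch of that issue-rate arithmetic (the function names are hypothetical and only mirror the numbers in the dump, they are not GCC's implementation):

```c
/* Hypothetical helper.  General operations issued per vector iteration:
   the per-vector LDn general ops for each of NVECTORS vectors, plus the
   other general operations in the loop body.  */
unsigned int
vector_general_ops (unsigned int ldn_general_ops_per_vector,
                    unsigned int nvectors, unsigned int other_ops)
{
  return ldn_general_ops_per_vector * nvectors + other_ops;
}

/* Hypothetical helper.  Minimum cycles needed to issue GENERAL_OPS
   operations at GENERAL_OPS_PER_CYCLE per cycle.  */
double
general_cycles_per_iter (unsigned int general_ops,
                         unsigned int general_ops_per_cycle)
{
  return (double) general_ops / general_ops_per_cycle;
}
```

Here vector_general_ops (3, 4, 1) is 13 and general_cycles_per_iter (13, 2) is 6.5, against 1.0 * 4 = 4 cycles for four scalar iterations.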
So my suggestion would be to start with the equivalent of the patch above,
both to "restore consistency" and to use the aarch64_simd_vec_issue_info
information in the documented way, then work from there.

For example, with the new improved costing flow, there's no real need to
make ld4_st4_general_ops & co, or ld4_st4_permute_cost & co, per-vector.
They could be per-LDn instead. That would be a scriptable change with
no behavioural effect. It would then be easier to reduce generic_armv9a's
per-LD4 general op count from 12, if 12 is too much.
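
To illustrate why that conversion would be mechanical: each per-LDn entry is just the per-vector entry multiplied by the number of vectors in the instruction. The struct and function below are hypothetical simplifications, not the real aarch64 cost-table types:

```c
/* Hypothetical, simplified stand-in for the LD2/LD3/LD4 (and ST2/ST3/ST4)
   entries of one cost-table field.  */
struct ldn_ops
{
  unsigned int ld2_st2;
  unsigned int ld3_st3;
  unsigned int ld4_st4;
};

/* Convert per-vector entries into per-LDn entries by multiplying each
   one by the number of vectors that the instruction handles.  */
struct ldn_ops
per_vector_to_per_ldn (struct ldn_ops per_vector)
{
  struct ldn_ops per_ldn = {
    per_vector.ld2_st2 * 2,
    per_vector.ld3_st3 * 3,
    per_vector.ld4_st4 * 4,
  };
  return per_ldn;
}
```

A per-vector LD4 value of 3 becomes a per-LD4 value of 12. The code that consumes the table would stop multiplying by the number of vectors at the same time, so the computed costs stay the same.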
Thanks,
Richard