Tamar Christina <[email protected]> writes:
> The vectorizer when costing LOAD_LANES with gaps breaks the costing into
> multiple calls where the first call wants to cost GAPS loads and second
> GROUP_SIZE - GAPS loads. However:
>
> 1. Both VMAT types are LOAD_LANES, we we cost effectively
> GROUP_SIZE * LOAD_LANES. This increases the latency costs immensely.
>
> 2. The GROUP_SIZE costing also increases the throughput costs * GROUP_SIZE,
> which has the knock on effect of increasing the costs for the latency again
> as the compiler applies a latency penalty due to the scalar code issuing
> quicker.
>
> What the vectorizer is trying to cost is simply 1x LOAD_LANES but it's asking
> to
> cost the invidual gaps as well. However for an LDn operation this costs is
> always fixed and is the cost of the LDn itself.
>
> when there isn't any gaps we only cost 1x LOAD_LANES which is the expected
> amount. Unfortunately we can't really tell in the backend easily what chain
> of
> costing belongs to any one operation and so treat them as individual ones.
>
> This patch keeps track of the LOAD_LANES we've seen already when there is a
> gap
> and
>
> 1. marks the cost of the first GAPS costing to 0.
> 2. costs the cost of the second GROUP_SIZE - GAPS loads as a single
> LOAD_LANES.
>
> This is the best we can do until Richi's changes to cost operations like these
> as a group.
>
> Bootstrapped Regtested on aarch64-none-linux-gnu and no issues.
I think we should just remove the:
/* An IFN_LOAD_LANES will load all its vector results,
regardless of which ones we actually need. Account
for the cost of unused results. */
if (first_stmt_info == stmt_info)
{
unsigned int gaps = DR_GROUP_SIZE (first_stmt_info);
stmt_vec_info next_stmt_info = first_stmt_info;
do
{
gaps -= 1;
next_stmt_info = DR_GROUP_NEXT_ELEMENT (next_stmt_info);
}
while (next_stmt_info);
if (gaps)
{
if (dump_enabled_p ())
dump_printf_loc (MSG_NOTE, vect_location,
"vect_model_load_cost: %d "
"unused vectors.\n",
gaps);
vect_get_load_cost (vinfo, stmt_info, slp_node, gaps,
alignment_support_scheme,
misalignment, false, &inside_cost,
&prologue_cost, cost_vec, cost_vec,
true);
}
}
that you pointed to in the PR. This predates the "all SLP" model.
It was originally added in r11-6662-ge45c41988bfd65, at a time when
vectorizable_load (and thus vect_model_load_cost) was called for each
individual *scalar* stmt. As a result, we generally costed one vector
load for each scalar load. One-off costs that applied to the whole
group were added only when costing the first stmt_info in the group.
This meant that for load-lanes, we'd cost one vector_load for each
member of the LDn that was used. That was ok (in the sense of being
self-consistent) when all of the results were used. But it meant
that the cost of a fully-utilised LD4 was 4 times greater than
the cost of an LD4 in which only one result was used. That didn't
make any sense, since the instruction loaded the same amount of
data regardless.
There's no need for that now that we only call vectorizable_load once
per SLP node, rather than once per scalar stmt, and now that we cost
the instructions in the same way that we generate them.
Thanks,
Richard
>
> Any objections?
>
> Thanks,
> Tamar
>
> gcc/ChangeLog:
>
> PR target/124866
> * config/aarch64/aarch64.cc
> (class aarch64_vector_costs): Add m_load_lanes.
> (aarch64_vector_costs::add_stmt_cost): Use it.
>
> gcc/testsuite/ChangeLog:
>
> PR target/124866
> * gcc.target/aarch64/sve/cost_model_20.c: New test.
> * gcc.target/aarch64/sve/cost_model_21.c: New test.
> * gcc.target/aarch64/sve/cost_model_22.c: New test.
> * gcc.target/aarch64/sve/cost_model_23.c: New test.
> * gcc.target/aarch64/sve/pr124866.c: New test.
>
> ---
>
> diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc
> index
> 37c28c8f2f8f58f6bb16774e8d5a788cdc800154..aeeef4aa47eb81e771b0bab5f5eb03b19800d16e
> 100644
> --- a/gcc/config/aarch64/aarch64.cc
> +++ b/gcc/config/aarch64/aarch64.cc
> @@ -17582,6 +17582,11 @@ private:
> /* If m_loop_fully_scalar_dup this contains the total number of leaf stmts
> found in the SLP tree. */
> uint64_t m_num_total_stmts = 0;
> +
> + /* Used when costing load lanes with gaps as we need to know whether we are
> + costing the first or the second occurrence. (i.e. the gap or the used
> + bits). */
> + hash_set<stmt_vec_info> m_load_lanes;
> };
>
> aarch64_vector_costs::aarch64_vector_costs (vec_info *vinfo,
> @@ -18779,6 +18784,36 @@ aarch64_vector_costs::add_stmt_cost (int count,
> vect_cost_for_stmt kind,
> }
> }
>
> + /* When a store with gaps happens the middle-end will cost these as series
> + of x loads (where x is the number of unused vectors) and set count to x
> + which causes the cost model to go off the rails because we cost e.g.
> + an LD4 as doing 4 LDRs + permutes. Detect this and cost the load as a
> + single LDn. The middle-end costs this as GAPS + 1 loads so the costing
> + for load lanes goes completely wrong. On AARCH64 the gaps don't matter
> + and LOAD_LANES has the same cost so just ignore the GAP costing if we're
> + doing throughput based costing. */
> + stmt_vec_info first_stmt_info;
> + if (node
> + && stmt_info
> + && aarch64_tune_params.vec_costs->issue_info
> + && SLP_TREE_MEMORY_ACCESS_TYPE (node) == VMAT_LOAD_STORE_LANES
> + && STMT_VINFO_GROUPED_ACCESS (stmt_info)
> + && (first_stmt_info = DR_GROUP_FIRST_ELEMENT (stmt_info))
> + && first_stmt_info == stmt_info)
> + {
> + /* Figure out of we're a group load with gaps. */
> + unsigned int gaps = DR_GROUP_SIZE (first_stmt_info);
> + stmt_vec_info next_stmt_info = first_stmt_info;
> + do
> + {
> + gaps -= 1;
> + }
> + while ((next_stmt_info = DR_GROUP_NEXT_ELEMENT (next_stmt_info)));
> +
> + if (gaps)
> + count = m_load_lanes.add (stmt_info) ? 1 : 0;
> + }
> +
> /* If we're recording a nonzero vector loop body cost for the
> innermost loop, also estimate the operations that would need
> to be issued by all relevant implementations of the loop. */
> diff --git a/gcc/testsuite/gcc.target/aarch64/sve/cost_model_20.c
> b/gcc/testsuite/gcc.target/aarch64/sve/cost_model_20.c
> new file mode 100644
> index
> 0000000000000000000000000000000000000000..ffb51b5a1a0c59ddf4ebcb71d0036b0f0f785e3f
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/aarch64/sve/cost_model_20.c
> @@ -0,0 +1,20 @@
> +/* { dg-do compile } */
> +/* { dg-additional-options "-O3 -march=armv9-a -fdump-tree-vect-details" } */
> +
> +int reduce(int *a, int n)
> +{
> + int sum = 0;
> + for (int i = 0; i < n; i=i+2) {
> + sum += a[i];
> + }
> + return sum;
> +}
> +
> +/* { dg-final { scan-tree-dump "\\*_3 1 times vector_load costs 0 in body"
> "vect" } } */
> +/* Sadly we can't get this fully right because the middle-end will re-cost
> based on ncopies, but we don't
> + know ncopies and recalculating it in the backend would replicate a lot of
> code. */
> +/* { dg-final { scan-tree-dump-not "\\*_3 1 times vector_load costs 6 in
> body" "vect" { xfail *-*-* } } } */
> +/* { dg-final { scan-tree-dump "\\*_3 1 times vector_load costs 6 in body"
> "vect" } } */
> +
> +/* { dg-final { scan-tree-dump "LOOP VECTORIZED" "vect" } } */
> +/* { dg-final { scan-tree-dump "\.(MASKED_LOAD)?_LANES" "vect" } } */
> diff --git a/gcc/testsuite/gcc.target/aarch64/sve/cost_model_21.c
> b/gcc/testsuite/gcc.target/aarch64/sve/cost_model_21.c
> new file mode 100644
> index
> 0000000000000000000000000000000000000000..7157796b11c32314c7a0d44e34135173fab82ba1
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/aarch64/sve/cost_model_21.c
> @@ -0,0 +1,21 @@
> +/* { dg-do compile } */
> +/* { dg-additional-options "-O3 -march=armv9-a -fdump-tree-vect-details" } */
> +
> +int reduce(int *a, int n)
> +{
> + int sum = 0;
> + int sub = 0;
> + for (int i = 0; i < n; i=i+4) {
> + {
> + sum += a[i];
> + sub += a[i+3];
> + }
> + }
> + return sum - sub;
> +}
> +
> +/* { dg-final { scan-tree-dump "\\*_3 2 times vector_load costs 0 in body"
> "vect" } } */
> +/* { dg-final { scan-tree-dump "\\*_3 1 times vector_load costs 7 in body"
> "vect" } } */
> +
> +/* { dg-final { scan-tree-dump "LOOP VECTORIZED" "vect" } } */
> +/* { dg-final { scan-tree-dump-not "\.(MASKED_LOAD)?_LANES" "vect" } } */
> diff --git a/gcc/testsuite/gcc.target/aarch64/sve/cost_model_22.c
> b/gcc/testsuite/gcc.target/aarch64/sve/cost_model_22.c
> new file mode 100644
> index
> 0000000000000000000000000000000000000000..68478b55dd40c825e3d4e2f0a5351263e9696665
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/aarch64/sve/cost_model_22.c
> @@ -0,0 +1,17 @@
> +/* { dg-do compile } */
> +/* { dg-additional-options "-O3 -march=armv9-a -fdump-tree-vect-details" } */
> +
> +int reduce(int *a, int n)
> +{
> + int sum = 0;
> + for (int i = 0; i < n; i=i+4) {
> + sum += a[i];
> + }
> + return sum;
> +}
> +
> +/* { dg-final { scan-tree-dump "LOOP VECTORIZED" "vect" } } */
> +/* { dg-final { scan-tree-dump "\.(MASKED_LOAD)?_LANES" "vect" } } */
> +
> +/* { dg-final { scan-tree-dump "\\*_3 3 times vector_load costs 0 in body"
> "vect" } } */
> +/* { dg-final { scan-tree-dump "\\*_3 1 times vector_load costs 7 in body"
> "vect" } } */
> diff --git a/gcc/testsuite/gcc.target/aarch64/sve/cost_model_23.c
> b/gcc/testsuite/gcc.target/aarch64/sve/cost_model_23.c
> new file mode 100644
> index
> 0000000000000000000000000000000000000000..cab1073453d097ddd3cca4fe5577202d1720030b
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/aarch64/sve/cost_model_23.c
> @@ -0,0 +1,28 @@
> +/* { dg-do compile } */
> +/* { dg-additional-options "-O3 -march=armv9-a -fdump-tree-vect-details" } */
> +
> +typedef struct _argb {
> + char r;
> + char g;
> + char b;
> + char a;
> +} rgb_t;
> +
> +int r, g, b, a;
> +
> +void reduce(rgb_t *x, int n)
> +{
> + r = g = b = a = 0;
> + for (int i = 0; i < n; i++) {
> + r += x[i].r;
> + g += x[i].g;
> + b += x[i].b;
> + a += x[i].a;
> + }
> + return;
> +}
> +
> +/* { dg-final { scan-tree-dump "_3->r 1 times vector_load costs 7 in body"
> "vect" } } */
> +
> +/* { dg-final { scan-tree-dump "LOOP VECTORIZED" "vect" } } */
> +/* { dg-final { scan-tree-dump "\.(MASKED_LOAD)?_LANES" "vect" } } */
> diff --git a/gcc/testsuite/gcc.target/aarch64/sve/pr124866.c
> b/gcc/testsuite/gcc.target/aarch64/sve/pr124866.c
> new file mode 100644
> index
> 0000000000000000000000000000000000000000..d3e48f61c6f4714f92dc225d001c11a647c129f1
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/aarch64/sve/pr124866.c
> @@ -0,0 +1,14 @@
> +/* { dg-do compile } */
> +/* { dg-additional-options "-O3 -march=armv9-a -fdump-tree-vect-details" } */
> +
> +int reduce(int *a, int n)
> +{
> + int sum = 0;
> + for (int i = 0; i < n; i=i+4) {
> + sum += a[i];
> + }
> + return sum;
> +}
> +
> +/* { dg-final { scan-tree-dump "LOOP VECTORIZED" "vect" } } */
> +/* { dg-final { scan-tree-dump "\.(MASKED_LOAD)?_LANES" "vect" } } */