GCC already supports fully-predicated vectorisation for loops, both
using "traditional" loop vectorisation and loop-aware SLP
(superword-level parallelism). For example, GCC can vectorise:
void
foo (char *x)
{
  for (int i = 0; i < 6; i += 2)
    {
      x[i] += 1;
      x[i + 1] += 2;
    }
}
from which it generates the following assembly code (with -O2
-ftree-vectorize -march=armv9-a+sve -msve-vector-bits=scalable):
foo:
        ptrue   p7.b, vl6
        mov     w1, 513
        ld1b    z31.b, p7/z, [x0]
        mov     z30.h, w1
        add     z30.b, z31.b, z30.b
        st1b    z30.b, p7, [x0]
        ret
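As a reading aid, the sequence above can be modelled in scalar C. This
is only a sketch: foo_model, VL_BYTES and ACTIVE are hypothetical names,
and VL_BYTES assumes a 128-bit vector, whereas real hardware may use a
longer one. It shows that 513 (0x0201) broadcast to halfword lanes
gives the byte pattern 1, 2, 1, 2, ... and that only the six lanes
enabled by "ptrue p7.b, vl6" are loaded and stored.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical model parameters: VL_BYTES assumes a 128-bit vector,
   ACTIVE mirrors the "vl6" predicate pattern.  */
enum { VL_BYTES = 16, ACTIVE = 6 };

/* Scalar model of the predicated sequence: only the first ACTIVE byte
   lanes are loaded from and stored to memory.  */
void
foo_model (unsigned char *x)
{
  unsigned char z31[VL_BYTES] = { 0 };	/* "/z" zeroes inactive lanes.  */
  unsigned char z30[VL_BYTES];

  assert (ACTIVE <= VL_BYTES);

  /* ptrue p7.b, vl6 and ld1b: load the active lanes only.  */
  memcpy (z31, x, ACTIVE);

  /* mov w1, 513 and mov z30.h, w1: every halfword lane holds 0x0201,
     so the byte lanes read 1, 2, 1, 2, ... (low byte first).  */
  for (int i = 0; i < VL_BYTES; i += 2)
    {
      z30[i] = 513 & 0xff;
      z30[i + 1] = 513 >> 8;
    }

  /* add z30.b, z31.b, z30.b: unpredicated, wraps modulo 256.  */
  for (int i = 0; i < VL_BYTES; i++)
    z30[i] = (unsigned char) (z31[i] + z30[i]);

  /* st1b under p7: store the active lanes only.  */
  memcpy (x, z30, ACTIVE);
}
```

Bytes beyond the sixth are left untouched in memory, which is what
makes the predicated form safe for a group of six elements.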
However, GCC cannot yet vectorise the unrolled form of the same
function, even though it is semantically equivalent:
void
foo (char *x)
{
  x[0] += 1;
  x[1] += 2;
  x[2] += 1;
  x[3] += 2;
  x[4] += 1;
  x[5] += 2;
}
These patches add support for vectorising the unrolled form of the
above function by allowing basic block SLP to use a predicate mask or
length limit. The unrolled function can now be vectorised to the
following assembly code (using the same options as above):
foo:
        ptrue   p7.b, vl6
        ptrue   p6.b, all
        ld1b    z31.b, p7/z, [x0]
        adrp    x1, .LC0
        add     x1, x1, :lo12:.LC0
        ld1rqb  z30.b, p6/z, [x1]
        add     z30.b, z31.b, z30.b
        st1b    z30.b, p7, [x0]
        ret
Predication is only used for groups whose size is not neatly divisible
into vectors of lengths that can be supported directly by the target.
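The rule above can be illustrated with a deliberately simplified
sketch. group_needs_predication is a hypothetical helper, not a
function in GCC, and the list of lane counts stands in for whatever
vector lengths the target supports directly; the real decision in the
vectoriser takes more into account than this.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper (not part of GCC): return true if a group of
   GROUP_SIZE elements cannot be split into whole vectors of any of the
   N lane counts in LANE_COUNTS, so a predicated tail would be needed.  */
bool
group_needs_predication (unsigned group_size,
			 const unsigned *lane_counts, size_t n)
{
  for (size_t i = 0; i < n; i++)
    if (lane_counts[i] != 0 && group_size % lane_counts[i] == 0)
      return false;	/* Divides evenly: no predication needed.  */
  return true;
}
```

With byte elements and assumed lane counts of 8 and 16, a group of 6
(as in foo above) needs predication, whereas a group of 16 does not.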
Bootstrapped and tested on aarch64-linux-gnu.
The following test regressions still need resolving:
gcc.target/aarch64/sve/slp_6.c
gcc.target/aarch64/sve/slp_7_costly.c
This patch series changes the compiled output of the above two tests,
causing them to fail because the reduction and epilogue handling now
use SVE masked loads and stores. Unfortunately, the overall effect on
code generation in these tests is quite bad. This will be addressed in
a future version.
Changes in v2:
- Updated regexes used by the vect-over-widen-*.c tests so that they
do not accidentally match text that is now part of the dump but
was not previously.
- Updated regexes used by the aarch64/popcnt-sve.c test so that it
expects 'all' as the operand of 'ptrue', instead of some specific
number of active elements.
- Removed a dump_printf from vect_get_num_copies because the
gcc.dg/vect/vect-shift-5.c test relies on a fixed dump layout
spanning multiple lines.
- Fixed a bug in vect_get_vector_types_for_stmt, which had
accidentally been modified to unconditionally set group_size to
zero (even for basic block vectorization).
- Relaxed an overzealous new check in
vect_maybe_update_slp_op_vectype, which now considers the
vectorization factor when checking the number of lanes of external
definitions during loop vectorization. For example, lanes=11 is not
divisible by subparts=8 with vf=1, but vf*lanes=88 is divisible by
subparts=8 with vf=8.
- Removed the stmts vector ownership changes relating to mishandling
of failure of the vect_analyze_slp_instance function (to be fixed
separately).
- A check in get_vectype_for_scalar_type for whether the natural
choice of vector type satisfies the group size was too simplistic.
Previously, a narrower vector type was chosen if the natural vector
type might be long enough but was not known to be (used as a proxy
for variable length). Now get_len_load_store_mode is used to query
explicitly whether the target supports mask- or length-limited loads
and stores. With the previous version, GCC preferred the natural
vector type if it was known to be long enough; sometimes that
resulted in better output than a narrower type, but it also caused
some bb-slp-* tests to fail.
- Shuffled dump_printfs around into separate patches.
- An assertion now protects my change in gimple_build_vector to use
the lower bound of the number of subparts.
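The relaxed lane-count check mentioned for
vect_maybe_update_slp_op_vectype in the list above comes down to simple
arithmetic, sketched here. lanes_fill_whole_vectors is a hypothetical
name and this is not the code in the patch; it only shows why scaling
the lane count by the vectorization factor changes the outcome.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical helper (not part of GCC): return true if LANES lanes,
   repeated VF times, fill a whole number of vectors that each have
   SUBPARTS elements.  */
bool
lanes_fill_whole_vectors (unsigned lanes, unsigned vf, unsigned subparts)
{
  assert (subparts != 0);
  /* Widen before multiplying so the product cannot overflow.  */
  return ((unsigned long long) lanes * vf) % subparts == 0;
}
```

For lanes=11 and subparts=8, the check fails with vf=1 (11 is not a
multiple of 8) but passes with vf=8 (88 is a multiple of 8).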
Link to v1:
https://inbox.sourceware.org/gcc-patches/[email protected]/
Christopher Bazley (9):
Track the minimum and maximum number of lanes for BB SLP
Preparation to support predicated vector tails for BB SLP
Implement recording/getting of mask/length for BB SLP
Update constant creation for BB SLP with predicated tails
Refactor check_load_store_for_partial_vectors
New parameter for vect_maybe_update_slp_op_vectype
Extend BB SLP vectorization to use predicated tails
AArch64/SVE: Tests for use of predicated vector tails for BB SLP
Add extra conditional dump output to the vectorizer
gcc/gimple-fold.cc | 6 +-
.../gcc.dg/vect/vect-over-widen-10.c | 2 +-
.../gcc.dg/vect/vect-over-widen-13.c | 2 +-
.../gcc.dg/vect/vect-over-widen-14.c | 2 +-
.../gcc.dg/vect/vect-over-widen-17.c | 2 +-
.../gcc.dg/vect/vect-over-widen-18.c | 2 +-
gcc/testsuite/gcc.dg/vect/vect-over-widen-5.c | 2 +-
gcc/testsuite/gcc.dg/vect/vect-over-widen-6.c | 2 +-
gcc/testsuite/gcc.dg/vect/vect-over-widen-7.c | 2 +-
gcc/testsuite/gcc.dg/vect/vect-over-widen-8.c | 2 +-
gcc/testsuite/gcc.dg/vect/vect-over-widen-9.c | 2 +-
gcc/testsuite/gcc.target/aarch64/popcnt-sve.c | 10 +-
.../gcc.target/aarch64/sve/slp_pred_1.c | 33 +
.../gcc.target/aarch64/sve/slp_pred_1_run.c | 6 +
.../gcc.target/aarch64/sve/slp_pred_2.c | 33 +
.../gcc.target/aarch64/sve/slp_pred_3.c | 33 +
.../gcc.target/aarch64/sve/slp_pred_3_run.c | 6 +
.../gcc.target/aarch64/sve/slp_pred_4.c | 33 +
.../gcc.target/aarch64/sve/slp_pred_5.c | 36 +
.../gcc.target/aarch64/sve/slp_pred_6.c | 39 +
.../gcc.target/aarch64/sve/slp_pred_6_run.c | 6 +
.../gcc.target/aarch64/sve/slp_pred_7.c | 38 +
.../gcc.target/aarch64/sve/slp_pred_harness.h | 28 +
gcc/tree-vect-loop.cc | 28 +-
gcc/tree-vect-slp.cc | 377 ++++++---
gcc/tree-vect-stmts.cc | 741 +++++++++++-------
gcc/tree-vectorizer.h | 142 +++-
27 files changed, 1204 insertions(+), 411 deletions(-)
create mode 100644 gcc/testsuite/gcc.target/aarch64/sve/slp_pred_1.c
create mode 100644 gcc/testsuite/gcc.target/aarch64/sve/slp_pred_1_run.c
create mode 100644 gcc/testsuite/gcc.target/aarch64/sve/slp_pred_2.c
create mode 100644 gcc/testsuite/gcc.target/aarch64/sve/slp_pred_3.c
create mode 100644 gcc/testsuite/gcc.target/aarch64/sve/slp_pred_3_run.c
create mode 100644 gcc/testsuite/gcc.target/aarch64/sve/slp_pred_4.c
create mode 100644 gcc/testsuite/gcc.target/aarch64/sve/slp_pred_5.c
create mode 100644 gcc/testsuite/gcc.target/aarch64/sve/slp_pred_6.c
create mode 100644 gcc/testsuite/gcc.target/aarch64/sve/slp_pred_6_run.c
create mode 100644 gcc/testsuite/gcc.target/aarch64/sve/slp_pred_7.c
create mode 100644 gcc/testsuite/gcc.target/aarch64/sve/slp_pred_harness.h
--
2.43.0