GCC already supports fully-predicated vectorisation for loops, both
using "traditional" loop vectorisation and loop-aware SLP
(superword-level parallelism). For example, GCC can vectorise:
void
foo (char *x)
{
  for (int i = 0; i < 6; i += 2)
    {
      x[i] += 1;
      x[i + 1] += 2;
    }
}
from which it generates the following assembly code (with -O2
-ftree-vectorize -march=armv9-a+sve -msve-vector-bits=scalable):
foo:
        ptrue   p7.b, vl6
        mov     w1, 513
        ld1b    z31.b, p7/z, [x0]
        mov     z30.h, w1
        add     z30.b, z31.b, z30.b
        st1b    z30.b, p7, [x0]
        ret
However, GCC cannot yet vectorise the unrolled form of the same
function, even though it is semantically equivalent:
void
foo (char *x)
{
  x[0] += 1;
  x[1] += 2;
  x[2] += 1;
  x[3] += 2;
  x[4] += 1;
  x[5] += 2;
}
These patches implement support for vectorising the unrolled form of
the above function by enabling use of a predicate mask or length
limit for basic block SLP. For example, it can now be vectorised to
the following assembly code (using the same options as above):
foo:
        ptrue   p7.b, vl6
        ptrue   p6.b, all
        ld1b    z31.b, p7/z, [x0]
        adrp    x1, .LC0
        add     x1, x1, :lo12:.LC0
        ld1rqb  z30.b, p6/z, [x1]
        add     z30.b, z31.b, z30.b
        st1b    z30.b, p7, [x0]
        ret
Predication is only used for groups whose size cannot be divided
evenly into vectors of lengths that the target supports directly.
Bootstrapped and tested on aarch64-linux-gnu.
A list of test regressions that need resolving is as follows:
gcc.target/aarch64/sve/slp_6.c
gcc.target/aarch64/sve/slp_7_costly.c
This patch series changes the compiled output of the above two tests,
causing them to fail because the reduction and epilogue handling now
use SVE masked loads and stores. Unfortunately, the overall effect on
code generation is noticeably worse. This will be addressed in a
future version.
Changes in v2:
- Updated regexes used by the vect-over-widen-*.c tests so that they
do not accidentally match text that is now part of the dump but
was not previously.
- Updated regexes used by the aarch64/popcnt-sve.c test so that it
expects 'all' as the operand of 'ptrue', instead of some specific
number of active elements.
- Removed a dump_printf from vect_get_num_copies because the
gcc.dg/vect/vect-shift-5.c test relies on a fixed dump layout
spanning multiple lines.
- Fixed a bug in vect_get_vector_types_for_stmt, which had
accidentally been modified to unconditionally set group_size to
zero (even for basic block vectorization).
- Relaxed an overzealous new check in
vect_maybe_update_slp_op_vectype, which now considers the
vectorization factor when checking the number of lanes of external
  definitions during loop vectorization. E.g., lanes=11 is not
  divisible by subparts=8 with vf=1, but vf*lanes=88 is divisible by
  subparts=8 with vf=8.
- Removed the stmts vector ownership changes relating to mishandling
of failure of the vect_analyze_slp_instance function (to be fixed
separately).
- A check in get_vectype_for_scalar_type for whether the natural
choice of vector type satisfies the group size was too simplistic.
Instead of choosing a narrower vector type if the natural vector
type could be long enough but not definitely (variable length, by
proxy), get_len_load_store_mode is now used to explicitly query
whether the target supports mask- or length-limited loads and
stores. With the previous version, GCC preferred the natural vector
type if it was known to be long enough; sometimes that resulted in
better output than a narrower type, but it also caused some
bb-slp-* tests to fail.
- Shuffled dump_printfs around into separate patches.
- An assertion is now used to protect my change to use lower bound
of number of subparts in gimple_build_vector.
Changes in v3:
- Wrote changelog entries.
- Created the gimple_build_vector_with_zero_padding function and
  used it in place of the gimple_build_vector function.
- Reverted my change to use constant_lower_bound of subparts in the
gimple_build_vector function.
- Fixed a check for 'partial vectors required' in vect_analyze_stmt
to include cases in which the minimum bound of the length of a
variable-length vector type equals the number of active lanes in an
SLP tree node. (Maybe less than, instead of known to be less than.)
- Reverted my change to regexes used by the aarch64/popcnt-sve.c test
because it is not needed after the above fix.
- Replaced SLP_TREE_CAN_USE_MASK_P and SLP_TREE_CAN_USE_LEN_P with
SLP_TREE_PARTIAL_VECTORS_STYLE.
- Titivated the documentation of vect_record_nunits.
- Renamed local variable max_nunits to nunits in the
vect_analyze_slp_reduc_chain function.
- Added an assertion in vect_create_constant_vectors to verify a
claimed relationship between group size and number of subparts.
- Fixed a mistake in the description of vect_slp_get_bb_len.
- Clarified the descriptions of vect_record_mask and vect_record_len
(from 'would be required' to 'could be used ... if required').
- Renamed the vect_can_use_mask_p function as vect_fully_masked_p
and vect_can_use_len_p as vect_fully_with_length_p.
- Tightened assertions in vect_record_mask and vect_record_len:
instead of just 'the other function must not have been called', we
now assert that no partial vectors style has already been set.
- Updated the documentation of check_load_store_for_partial_vectors
to describe its role in BB SLP vectorization.
- Renamed masked_loop_p and len_loop_p as masked_p and len_p in
contexts where they may now be used for BB SLP vectorization.
- Guard against a potential null pointer dereference in
LOOP_VINFO_PARTIAL_LOAD_STORE_BIAS when used by the
vectorizable_operation and vectorizable_call functions for BB SLP
vectorization (instead of assuming !vect_fully_with_length_p).
- Guard against calling dump_printf_loc with null instead of
a vector type in vectorizable_comparison_1.
- Clear any partial vectors style that might have been set by callees
of vect_analyze_stmt if it finds that partial vectors aren't needed.
- Revert a change to make the vect_is_simple_use function more robust
when the requested SLP tree child node is not an internal def and
has no scalar operand to return.
- Revert a change to reserve slots for the definitions to be pushed
for narrow FLOAT_EXPR vectorization before using quick_push in the
vectorizable_conversion function.
Link to v1:
https://inbox.sourceware.org/gcc-patches/[email protected]/
Link to v2:
https://inbox.sourceware.org/gcc-patches/[email protected]/
Christopher Bazley (9):
Track the minimum and maximum number of lanes for BB SLP
Preparation to support predicated vector tails for BB SLP
Implement recording/getting of mask/length for BB SLP
Update constant creation for BB SLP with predicated tails
Refactor check_load_store_for_partial_vectors
New parameter for vect_maybe_update_slp_op_vectype
Extend BB SLP vectorization to use predicated tails
AArch64/SVE: Tests for use of predicated vector tails for BB SLP
Add extra conditional dump output to the vectorizer
gcc/gimple-fold.cc | 86 ++
gcc/gimple-fold.h | 15 +
.../gcc.dg/vect/vect-over-widen-10.c | 2 +-
.../gcc.dg/vect/vect-over-widen-13.c | 2 +-
.../gcc.dg/vect/vect-over-widen-14.c | 2 +-
.../gcc.dg/vect/vect-over-widen-17.c | 2 +-
.../gcc.dg/vect/vect-over-widen-18.c | 2 +-
gcc/testsuite/gcc.dg/vect/vect-over-widen-5.c | 2 +-
gcc/testsuite/gcc.dg/vect/vect-over-widen-6.c | 2 +-
gcc/testsuite/gcc.dg/vect/vect-over-widen-7.c | 2 +-
gcc/testsuite/gcc.dg/vect/vect-over-widen-8.c | 2 +-
gcc/testsuite/gcc.dg/vect/vect-over-widen-9.c | 2 +-
.../gcc.target/aarch64/sve/slp_pred_1.c | 33 +
.../gcc.target/aarch64/sve/slp_pred_1_run.c | 6 +
.../gcc.target/aarch64/sve/slp_pred_2.c | 33 +
.../gcc.target/aarch64/sve/slp_pred_3.c | 33 +
.../gcc.target/aarch64/sve/slp_pred_3_run.c | 6 +
.../gcc.target/aarch64/sve/slp_pred_4.c | 33 +
.../gcc.target/aarch64/sve/slp_pred_5.c | 36 +
.../gcc.target/aarch64/sve/slp_pred_6.c | 39 +
.../gcc.target/aarch64/sve/slp_pred_6_run.c | 6 +
.../gcc.target/aarch64/sve/slp_pred_7.c | 38 +
.../gcc.target/aarch64/sve/slp_pred_harness.h | 28 +
gcc/tree-vect-loop.cc | 30 +-
gcc/tree-vect-slp.cc | 388 ++++++---
gcc/tree-vect-stmts.cc | 796 +++++++++++-------
gcc/tree-vectorizer.h | 153 +++-
27 files changed, 1349 insertions(+), 430 deletions(-)
create mode 100644 gcc/testsuite/gcc.target/aarch64/sve/slp_pred_1.c
create mode 100644 gcc/testsuite/gcc.target/aarch64/sve/slp_pred_1_run.c
create mode 100644 gcc/testsuite/gcc.target/aarch64/sve/slp_pred_2.c
create mode 100644 gcc/testsuite/gcc.target/aarch64/sve/slp_pred_3.c
create mode 100644 gcc/testsuite/gcc.target/aarch64/sve/slp_pred_3_run.c
create mode 100644 gcc/testsuite/gcc.target/aarch64/sve/slp_pred_4.c
create mode 100644 gcc/testsuite/gcc.target/aarch64/sve/slp_pred_5.c
create mode 100644 gcc/testsuite/gcc.target/aarch64/sve/slp_pred_6.c
create mode 100644 gcc/testsuite/gcc.target/aarch64/sve/slp_pred_6_run.c
create mode 100644 gcc/testsuite/gcc.target/aarch64/sve/slp_pred_7.c
create mode 100644 gcc/testsuite/gcc.target/aarch64/sve/slp_pred_harness.h
--
2.43.0