GCC already supports fully-predicated vectorisation for loops, both
using "traditional" loop vectorisation and loop-aware SLP
(superword-level parallelism). For example, GCC can vectorise:
void
foo (char *x)
{
  for (int i = 0; i < 6; i += 2)
    {
      x[i] += 1;
      x[i + 1] += 2;
    }
}
from which it generates the following assembly code (with -O2
-ftree-vectorize -march=armv9-a+sve -msve-vector-bits=scalable):
foo:
        ptrue   p7.b, vl6
        mov     w1, 513
        ld1b    z31.b, p7/z, [x0]
        mov     z30.h, w1
        add     z30.b, z31.b, z30.b
        st1b    z30.b, p7, [x0]
        ret
However, GCC cannot yet vectorise the unrolled form of the same
function, even though it is semantically equivalent:
void
foo (char *x)
{
  x[0] += 1;
  x[1] += 2;
  x[2] += 1;
  x[3] += 2;
  x[4] += 1;
  x[5] += 2;
}
These patches implement support for vectorising the unrolled form of
the above function by enabling use of a predicate mask or length
limit for basic block SLP. For example, it can now be vectorised to
the following assembly code (using the same options as above):
foo:
        ptrue   p7.b, vl6
        ptrue   p6.b, all
        ld1b    z31.b, p7/z, [x0]
        adrp    x1, .LC0
        add     x1, x1, :lo12:.LC0
        ld1rqb  z30.b, p6/z, [x1]
        add     z30.b, z31.b, z30.b
        st1b    z30.b, p7, [x0]
        ret
Predication is only used for groups whose size is not neatly divisible
into vectors of lengths that can be supported directly by the target.
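As a rough illustration of what the predicated sequence above computes,
the following scalar C model may help. It is only a sketch: SKETCH_VL,
sketch_masked_add and sketch_foo are hypothetical names invented for
this example, the vector length is fixed at 16 bytes for simplicity
(real SVE vectors are sized at run time), and none of this is code from
the patches.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical fixed vector length for this sketch; real SVE vector
   lengths are only known at run time.  */
#define SKETCH_VL 16

/* Scalar model of a predicated load/add/store.  Only the first ACTIVE
   lanes of X are read or written, in the manner of "ptrue p7.b, vl6"
   governing ld1b and st1b.  */
static void
sketch_masked_add (uint8_t *x, const uint8_t *addend, size_t active)
{
  uint8_t vec[SKETCH_VL] = { 0 };  /* Inactive lanes load as zero.  */

  /* Predicated load: inactive lanes never touch memory.  */
  for (size_t lane = 0; lane < SKETCH_VL; lane++)
    if (lane < active)
      vec[lane] = x[lane];

  /* Unpredicated add across the whole vector.  */
  for (size_t lane = 0; lane < SKETCH_VL; lane++)
    vec[lane] = (uint8_t) (vec[lane] + addend[lane]);

  /* Predicated store: inactive lanes are left unchanged in memory.  */
  for (size_t lane = 0; lane < SKETCH_VL; lane++)
    if (lane < active)
      x[lane] = vec[lane];
}

/* The unrolled foo from above, expressed through the model.  */
void
sketch_foo (uint8_t *x)
{
  /* Repeating { 1, 2 } pattern, standing in for the constant .LC0.  */
  static const uint8_t addend[SKETCH_VL]
    = { 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2 };

  sketch_masked_add (x, addend, 6);
}
```

In this model the six active lanes play the role of the group size, so
bytes beyond x[5] are neither read nor written even though the modelled
vector is wider than the group.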
Bootstrapped and tested on aarch64-linux-gnu.
Based on commit e97550a7d0e1a8b31a76b0877c0e90a0163da7ee.
OK for trunk?
Changes in v2:
- Updated regexes used by the vect-over-widen-*.c tests so that they
do not accidentally match text that is now part of the dump but
was not previously.
- Updated regexes used by the aarch64/popcnt-sve.c test so that it
expects 'all' as the operand of 'ptrue', instead of some specific
number of active elements.
- Removed a dump_printf from vect_get_num_copies because the
gcc.dg/vect/vect-shift-5.c test relies on a fixed dump layout
spanning multiple lines.
- Fixed a bug in vect_get_vector_types_for_stmt, which had
accidentally been modified to unconditionally set group_size to
zero (even for basic block vectorization).
- Relaxed an overzealous new check in
vect_maybe_update_slp_op_vectype, which now considers the
vectorization factor when checking the number of lanes of external
definitions during loop vectorization. For example, lanes=11 is not
divisible by subparts=8 with vf=1, but vf*lanes=88 is divisible by
subparts=8 with vf=8.
- Removed the stmts vector ownership changes relating to mishandling
of failure of the vect_analyze_slp_instance function (to be fixed
separately).
- A check in get_vectype_for_scalar_type for whether the natural
choice of vector type satisfies the group size was too simplistic.
Instead of choosing a narrower vector type if the natural vector
type could be long enough but not definitely (variable length, by
proxy), get_len_load_store_mode is now used to explicitly query
whether the target supports mask- or length-limited loads and
stores. With the previous version, GCC preferred the natural vector
type if it was known to be long enough; sometimes that resulted in
better output than a narrower type, but it also caused some
bb-slp-* tests to fail.
- Shuffled dump_printfs around into separate patches.
- An assertion is now used to protect my change to use the lower bound
of the number of subparts in gimple_build_vector.
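The arithmetic behind the relaxed lane-count check in
vect_maybe_update_slp_op_vectype (see the v2 item above) can be sketched
as follows. This is an illustration only: sketch_lanes_fill_vectors is a
hypothetical helper written for this message, not a function in GCC, and
it assumes constant (not variable-length) subparts.

```c
#include <stdbool.h>

/* Hypothetical helper, not a GCC function: return true if VF copies of
   a group of LANES scalars fill a whole number of vectors that each
   hold SUBPARTS elements.  */
bool
sketch_lanes_fill_vectors (unsigned lanes, unsigned subparts, unsigned vf)
{
  return subparts != 0 && (vf * lanes) % subparts == 0;
}
```

With lanes=11 and subparts=8 the helper returns false for vf=1 (11 is
not a multiple of 8) and true for vf=8 (88 is a multiple of 8).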
Changes in v3:
- Wrote changelog entries.
- Created the gimple_build_vector_with_zero_padding function and
used it in place of the gimple_build_vector function.
- Reverted my change to use constant_lower_bound of subparts in the
gimple_build_vector function.
- Fixed a check for 'partial vectors required' in vect_analyze_stmt
to include cases in which the minimum bound of the length of a
variable-length vector type equals the number of active lanes in an
SLP tree node. (Maybe less than, instead of known to be less than.)
- Reverted my change to regexes used by the aarch64/popcnt-sve.c test
because it is not needed after the above fix.
- Replaced SLP_TREE_CAN_USE_MASK_P and SLP_TREE_CAN_USE_LEN_P with
SLP_TREE_PARTIAL_VECTORS_STYLE.
- Titivated the documentation of vect_record_nunits.
- Renamed local variable max_nunits to nunits in the
vect_analyze_slp_reduc_chain function.
- Added an assertion in vect_create_constant_vectors to verify a
claimed relationship between group size and number of subparts.
- Fixed a mistake in the description of vect_slp_get_bb_len.
- Clarified the descriptions of vect_record_mask and vect_record_len
(from 'would be required' to 'could be used ... if required').
- Renamed the vect_can_use_mask_p function as vect_fully_masked_p
and vect_can_use_len_p as vect_fully_with_length_p.
- Tightened assertions in vect_record_mask and vect_record_len:
instead of just 'the other function must not have been called', we
now assert that no partial vectors style has already been set.
- Updated the documentation of check_load_store_for_partial_vectors
to describe its role in BB SLP vectorization.
- Renamed masked_loop_p and len_loop_p as masked_p and len_p in
contexts where they may now be used for BB SLP vectorization.
- Guarded against a potential null pointer dereference in
LOOP_VINFO_PARTIAL_LOAD_STORE_BIAS when used by the
vectorizable_operation and vectorizable_call functions for BB SLP
vectorization (instead of assuming !vect_fully_with_length_p).
- Guarded against calling dump_printf_loc with null instead of
a vector type in vectorizable_comparison_1.
- Cleared any partial vectors style that might have been set by callees
of vect_analyze_stmt if it finds that partial vectors aren't needed.
- Reverted a change to make the vect_is_simple_use function more robust
when the requested SLP tree child node is not an internal def and
has no scalar operand to return.
- Reverted a change to reserve slots for the definitions to be pushed
for narrow FLOAT_EXPR vectorization before using quick_push in the
vectorizable_conversion function.
Changes in v4:
- Resolved code generation regressions in
gcc.target/aarch64/sve/slp_6.c and
gcc.target/aarch64/sve/slp_7_costly.c by adding code to
handle variable-length vector types in store_constructor.
(Still had to update expectations for slp_6.c though.)
- Removed a default constructor from the definition of
struct slp_tree_nunits in order to fix a compilation error in
the vect_update_nunits function.
- Renamed the gimple_build_vector_with_zero_padding function as
gimple_build_vector_from_elems.
- Changed gimple_build_vector_from_elems to take a vector type
and list of element values instead of a tree_vector_builder.
- Rewrote the description of gimple_build_vector_from_elems.
- Removed redundant masking from gimple_build_vector_from_elems. It's
now almost equivalent to gimple_build_vector in use cases where
some element value is not constant.
- Asserted that the constant lower bound of the number of subparts
in a vector type is a power of 2 instead of merely a multiple of 2
in vect_build_slp_tree_1.
Changes in v5:
- Removed the patch to track the minimum and maximum number of lanes for
BB SLP.
- Added a new function, vect_slp_tree_min_nunits, which is invoked by
vect_analyze_slp_instance to compute the minimum number of subparts for
all of the vector types used in an SLP tree for which an SLP instance is
about to be created.
Link to v1:
https://inbox.sourceware.org/gcc-patches/[email protected]/
Link to v2:
https://inbox.sourceware.org/gcc-patches/[email protected]/
Link to v3:
https://inbox.sourceware.org/gcc-patches/[email protected]/
Link to v4:
https://inbox.sourceware.org/gcc-patches/[email protected]/
Christopher Bazley (10):
Preparation to support predicated vector tails for BB SLP
Implement recording/getting of mask/length for BB SLP
Update constant creation for BB SLP with predicated tails
Refactor check_load_store_for_partial_vectors
New parameter for vect_maybe_update_slp_op_vectype
Handle variable-length vector types in store_constructor
AArch64/SVE: Relax the expectations of the popcnt-sve test
Extend BB SLP vectorization to use predicated tails
AArch64/SVE: Tests for use of predicated vector tails for BB SLP
Add extra conditional dump output to the vectorizer
gcc/expr.cc | 7 +-
gcc/gimple-fold.cc | 54 ++
gcc/gimple-fold.h | 14 +
.../gcc.dg/vect/vect-over-widen-10.c | 2 +-
.../gcc.dg/vect/vect-over-widen-13.c | 2 +-
.../gcc.dg/vect/vect-over-widen-14.c | 2 +-
.../gcc.dg/vect/vect-over-widen-17.c | 2 +-
.../gcc.dg/vect/vect-over-widen-18.c | 2 +-
gcc/testsuite/gcc.dg/vect/vect-over-widen-5.c | 2 +-
gcc/testsuite/gcc.dg/vect/vect-over-widen-6.c | 2 +-
gcc/testsuite/gcc.dg/vect/vect-over-widen-7.c | 2 +-
gcc/testsuite/gcc.dg/vect/vect-over-widen-8.c | 2 +-
gcc/testsuite/gcc.dg/vect/vect-over-widen-9.c | 2 +-
gcc/testsuite/gcc.target/aarch64/popcnt-sve.c | 10 +-
gcc/testsuite/gcc.target/aarch64/sve/slp_6.c | 3 -
.../gcc.target/aarch64/sve/slp_pred_1.c | 33 +
.../gcc.target/aarch64/sve/slp_pred_1_run.c | 6 +
.../gcc.target/aarch64/sve/slp_pred_2.c | 33 +
.../gcc.target/aarch64/sve/slp_pred_3.c | 33 +
.../gcc.target/aarch64/sve/slp_pred_3_run.c | 6 +
.../gcc.target/aarch64/sve/slp_pred_4.c | 33 +
.../gcc.target/aarch64/sve/slp_pred_5.c | 36 +
.../gcc.target/aarch64/sve/slp_pred_6.c | 39 +
.../gcc.target/aarch64/sve/slp_pred_6_run.c | 6 +
.../gcc.target/aarch64/sve/slp_pred_7.c | 38 +
.../gcc.target/aarch64/sve/slp_pred_harness.h | 28 +
gcc/tree-vect-loop.cc | 30 +-
gcc/tree-vect-slp.cc | 278 +++++-
gcc/tree-vect-stmts.cc | 795 +++++++++++-------
gcc/tree-vectorizer.h | 107 ++-
30 files changed, 1251 insertions(+), 358 deletions(-)
create mode 100644 gcc/testsuite/gcc.target/aarch64/sve/slp_pred_1.c
create mode 100644 gcc/testsuite/gcc.target/aarch64/sve/slp_pred_1_run.c
create mode 100644 gcc/testsuite/gcc.target/aarch64/sve/slp_pred_2.c
create mode 100644 gcc/testsuite/gcc.target/aarch64/sve/slp_pred_3.c
create mode 100644 gcc/testsuite/gcc.target/aarch64/sve/slp_pred_3_run.c
create mode 100644 gcc/testsuite/gcc.target/aarch64/sve/slp_pred_4.c
create mode 100644 gcc/testsuite/gcc.target/aarch64/sve/slp_pred_5.c
create mode 100644 gcc/testsuite/gcc.target/aarch64/sve/slp_pred_6.c
create mode 100644 gcc/testsuite/gcc.target/aarch64/sve/slp_pred_6_run.c
create mode 100644 gcc/testsuite/gcc.target/aarch64/sve/slp_pred_7.c
create mode 100644 gcc/testsuite/gcc.target/aarch64/sve/slp_pred_harness.h
--
2.43.0