GCC already supports fully-predicated vectorisation for loops, both
using "traditional" loop vectorisation and loop-aware SLP
(superword-level parallelism). For example, GCC can vectorise:
void
foo (char *x)
{
  for (int i = 0; i < 6; i += 2)
    {
      x[i] += 1;
      x[i + 1] += 2;
    }
}
from which it generates the following assembly code (with -O2
-ftree-vectorize -march=armv9-a+sve -msve-vector-bits=scalable):
foo:
        ptrue   p7.b, vl6
        mov     w1, 513
        ld1b    z31.b, p7/z, [x0]
        mov     z30.h, w1
        add     z30.b, z31.b, z30.b
        st1b    z30.b, p7, [x0]
        ret
However, GCC cannot yet vectorise the unrolled form of the same
function, even though it is semantically equivalent:
void
foo (char *x)
{
  x[0] += 1;
  x[1] += 2;
  x[2] += 1;
  x[3] += 2;
  x[4] += 1;
  x[5] += 2;
}
These patches implement support for vectorising the unrolled form of
the above function by enabling use of a predicate mask or length
limit for basic block SLP. For example, it can now be vectorised to
the following assembly code (using the same options as above):
foo:
        ptrue   p7.b, vl6
        ptrue   p6.b, all
        ld1b    z31.b, p7/z, [x0]
        adrp    x1, .LC0
        add     x1, x1, :lo12:.LC0
        ld1rqb  z30.b, p6/z, [x1]
        add     z30.b, z31.b, z30.b
        st1b    z30.b, p7, [x0]
        ret
The initial vector mode for an SLP region is "autodetected" by calling
aarch64_preferred_simd_mode, which prefers SVE modes (e.g. VNx4SI for
int) when they are supported and not disabled by configuration. If at
least one profitable subgraph can be scheduled then GCC does not try
to vectorise the region using any other modes, even though their
estimated costs might otherwise have been lower.
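For illustration (a hypothetical int analogue of the earlier example,
not one of the new tests), the initial mode autodetected for the SLP
region of the following function would be VNx4SI, and once a
profitable subgraph had been scheduled GCC would not retry with an
Advanced SIMD mode such as V4SI even if that mode happened to be
estimated as cheaper:
void
foo_int (int *x)
{
  x[0] += 1;
  x[1] += 2;
  x[2] += 1;
  x[3] += 2;
}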
That is mitigated by the fact that a sequence of GIMPLE stmts such as:
  vectp.14_86 = x_50(D) + 16;
  slp_mask_87 = .WHILE_ULT (0, 8, { 0, ... });
  .MASK_STORE (vectp.14_86, 8B, slp_mask_87, vect__34.12_85);
is lowered to a fixed-length vector store (e.g., str d30, [x0, 16]) if
possible, instead of a more literal interpretation such as:
        add     x0, x0, 16
        ptrue   p7.b, vl8
        st1b    z30.b, p7, [x0]
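For example, the following (hypothetical) function, in which the loop
is fully unrolled to give straight-line code, could produce such a
sequence: one full vector covering the first 16 bytes, then an 8-byte
tail at offset 16 whose .MASK_STORE is lowered to a plain 64-bit
store:
void
bar (char *x)
{
  /* Forced unrolling to obtain straight-line code for BB SLP.  */
#pragma GCC unroll 24
  for (int i = 0; i < 24; i++)
    x[i] += 1;
}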
Bootstrapped and tested on aarch64-linux-gnu.
Based on commit a28bb06b3e20a26579e06dc1b5bd6344ce4f88f0.
Changes in v2:
- Updated regexes used by the vect-over-widen-*.c tests so that they
do not accidentally match text that is now part of the dump but
was not previously.
- Updated regexes used by the aarch64/popcnt-sve.c test so that it
expects 'all' as the operand of 'ptrue', instead of some specific
number of active elements.
- Removed a dump_printf from vect_get_num_copies because the
gcc.dg/vect/vect-shift-5.c test relies on a fixed dump layout
spanning multiple lines.
- Fixed a bug in vect_get_vector_types_for_stmt, which had
accidentally been modified to unconditionally set group_size to
zero (even for basic block vectorization).
- Relaxed an overzealous new check in
vect_maybe_update_slp_op_vectype, which now considers the
vectorization factor when checking the number of lanes of external
definitions during loop vectorization. E.g., lanes=11 is not
divisible by subparts=8 with vf=1, but vf*lanes=88 is divisible by
subparts=8 with vf=8.
- Removed the stmts vector ownership changes relating to mishandling
of failure of the vect_analyze_slp_instance function (to be fixed
separately).
- A check in get_vectype_for_scalar_type for whether the natural
choice of vector type satisfies the group size was too simplistic.
Instead of choosing a narrower vector type whenever the natural
vector type could be long enough but was not known to be (using
variable length as a proxy for that), get_len_load_store_mode is now
used to query explicitly whether the target supports mask- or
length-limited loads and stores. With the previous version, GCC
preferred the natural vector type if it was known to be long enough;
sometimes that resulted in better output than a narrower type, but
it also caused some bb-slp-* tests to fail.
- Shuffled dump_printfs around into separate patches.
- An assertion is now used to protect my change to use the lower
bound of the number of subparts in gimple_build_vector.
Changes in v3:
- Wrote changelog entries.
- Created the gimple_build_vector_with_zero_padding function and
used it in place of the gimple_build_vector function.
- Reverted my change to use constant_lower_bound of subparts in the
gimple_build_vector function.
- Fixed a check for 'partial vectors required' in vect_analyze_stmt
to include cases in which the minimum bound of the length of a
variable-length vector type equals the number of active lanes in an
SLP tree node. (Maybe less than, instead of known to be less than;
see the sketch after this list.)
- Reverted my change to regexes used by the aarch64/popcnt-sve.c test
because it is not needed after the above fix.
- Replaced SLP_TREE_CAN_USE_MASK_P and SLP_TREE_CAN_USE_LEN_P with
SLP_TREE_PARTIAL_VECTORS_STYLE.
- Titivated the documentation of vect_record_nunits.
- Renamed local variable max_nunits to nunits in the
vect_analyze_slp_reduc_chain function.
- Added an assertion in vect_create_constant_vectors to verify a
claimed relationship between group size and number of subparts.
- Fixed a mistake in the description of vect_slp_get_bb_len.
- Clarified the descriptions of vect_record_mask and vect_record_len
(from 'would be required' to 'could be used ... if required').
- Renamed the vect_can_use_mask_p function as vect_fully_masked_p
and vect_can_use_len_p as vect_fully_with_length_p.
- Tightened assertions in vect_record_mask and vect_record_len:
instead of just 'the other function must not have been called', we
now assert that no partial vectors style has already been set.
- Updated the documentation of check_load_store_for_partial_vectors
to describe its role in BB SLP vectorization.
- Renamed masked_loop_p and len_loop_p as masked_p and len_p in
contexts where they may now be used for BB SLP vectorization.
- Guard against a potential null pointer dereference in
LOOP_VINFO_PARTIAL_LOAD_STORE_BIAS when used by the
vectorizable_operation and vectorizable_call functions for BB SLP
vectorization (instead of assuming !vect_fully_with_length_p).
- Guard against calling dump_printf_loc with null instead of
a vector type in vectorizable_comparison_1.
- Clear any partial vectors style that might have been set by callees
of vect_analyze_stmt if it finds that partial vectors aren't needed.
- Reverted a change to make the vect_is_simple_use function more
robust when the requested SLP tree child node is not an internal def
and has no scalar operand to return.
- Reverted a change to reserve slots for the definitions to be pushed
for narrow FLOAT_EXPR vectorization before using quick_push in the
vectorizable_conversion function.
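As a standalone illustration of the 'maybe less than' distinction
behind the vect_analyze_stmt fix above (a simplified analogue of
GCC's poly_int comparisons, not the patch itself): the number of
subparts of a variable-length type such as VNx16QI has the form
16 + 16*x for some runtime x >= 0, so 16 active lanes are not known
to be fewer than the vector length, but maybe are:
#include <assert.h>
#include <stdbool.h>

/* Simplified analogue of a poly_int length: coeff0 + coeff1 * x for
   a runtime-determined x >= 0 (coefficients assumed non-negative).  */
struct poly_len { long coeff0, coeff1; };

/* True iff A < B for every possible value of x.  */
static bool
known_lt (long a, struct poly_len b)
{
  return a < b.coeff0;
}

/* True iff A < B for at least one possible value of x.  */
static bool
maybe_lt (long a, struct poly_len b)
{
  return a < b.coeff0 || b.coeff1 > 0;
}

int
main (void)
{
  struct poly_len vnx16qi = { 16, 16 };  /* 16 + 16x lanes.  */
  assert (!known_lt (16, vnx16qi));      /* x == 0 gives 16 == 16.  */
  assert (maybe_lt (16, vnx16qi));       /* x >= 1 gives 16 < 32.  */
  return 0;
}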
Changes in v4:
- Resolved code generation regressions in
gcc.target/aarch64/sve/slp_6.c and
gcc.target/aarch64/sve/slp_7_costly.c by adding code to
handle variable-length vector types in store_constructor.
(Still had to update expectations for slp_6.c though.)
- Removed a default constructor from the definition of
struct slp_tree_nunits in order to fix a compilation error in
the vect_update_nunits function.
- Renamed the gimple_build_vector_with_zero_padding function as
gimple_build_vector_from_elems.
- Changed gimple_build_vector_from_elems to take a vector type and
a list of element values instead of a tree_vector_builder (see the
sketch after this list).
- Rewrote the description of gimple_build_vector_from_elems.
- Removed redundant masking from gimple_build_vector_from_elems. It's
now almost equivalent to gimple_build_vector in use cases where
some element value is not constant.
- Asserted that the constant lower bound of the number of subparts
in a vector type is a power of 2, instead of merely a multiple of 2,
in vect_build_slp_tree_1.
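The intended behaviour of gimple_build_vector_from_elems, modelled on
a scalar analogue (hypothetical; the real function operates on trees
and gimple sequences, and is assumed here to retain the zero padding
of its predecessor):
/* Fill NUNITS lanes from N supplied elements, zero-padding the
   remainder (hypothetical scalar model, not the GCC function).  */
static void
build_from_elems (int *lanes, unsigned nunits,
                  const int *elems, unsigned n)
{
  for (unsigned i = 0; i < nunits; i++)
    lanes[i] = i < n ? elems[i] : 0;
}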
Changes in v5:
- Removed the patch to track the minimum and maximum number of lanes
for BB SLP.
- Added a new function, vect_slp_tree_min_nunits, which is invoked by
vect_analyze_slp_instance to compute the minimum number of subparts
for all of the vector types used in an SLP tree for which an SLP
instance is about to be created.
Changes in v6:
- Moved the BB SLP implementations of vect_record_mask and
vect_record_len into tree-vect-slp.cc.
- Relaxed (failing) assertions: the style of an SLP node is not
always vect_partial_vectors_none on entry to vect_record_mask and
vect_record_len, and the same style may now be set multiple times.
- Updated an existing pattern named 'Simplify vector extracts' to
guard against invalid invocations of tree_to_uhwi when a
BIT_FIELD_REF has an unsuitable polynomial offset or size.
- Rebased on 43afcb3a83c3648141285d80cd3d8a562047fb43.
Changes in v7:
- Fixed the vectorizable_live_operation function so that it no
longer generates invalid offsets such as BIT_FIELD_REF
<_251, 32, POLY_INT_CST [96, 128]> for BB SLP with a
variable-length vector type.
- Reverted changes to the existing pattern named 'Simplify vector
extracts' because those changes, which made the pattern more robust,
are now redundant.
- Rebased on 630c1bfbb5bc3ba9fafca5e97096263ab8f0a04b.
Changes in v8:
- Fixed a regression in vectorizable_simd_clone_call that was
caused by an error when rebasing. (Silent conflict with
commit cea34ac07e3bd7.)
Changes in v9:
- Optimised vec_init for partial SVE vector modes and added a
test to guard against any return to the pathological stack usage
previously observed.
- Removed -march=armv9-a+sve from the dg-options of new tests
because it was redundant.
- More arguments are now passed through to vect_slp_record_bb_mask
and vect_slp_record_bb_len (from vect_slp_record_mask or
vect_slp_record_len) in anticipation of a future implementation
that shares masks, at least within an SLP sub-tree.
- A new per-SLP-node variable, SLP_TREE_NUM_PARTIAL_VECTORS, is
now used to calculate a worst-case estimate of the cost of
creating masks or lengths for partial vector tails. This aligns
BB SLP with existing costing of vectorised loops.
- Replaced early returns from vect_slp_get_bb_mask and
vect_slp_get_bb_len (taken when there is more than one vector, the
requested vector is not the last, or it is not partial) with code to
construct an appropriate mask or length; see the sketch after this
list. The value of NULL_TREE previously returned in such edge cases
is not a valid mask, so this change makes the code more robust.
vect_slp_record_bb_mask has been seen to be called with nvectors=2
when running the asrdiv_2 test (msve-vector-bits=128, group_size=8,
nunits=4); the vector type was the 128-bit vector(4) int.
- Added a defensive check for BB SLP as part of the refactoring of
check_load_store_for_partial_vectors that is done by the first patch,
although we don't expect that function to be called for BB SLP until
vect_can_use_partial_vectors_p has been updated to sometimes return
true for BB SLP (which is done by the 'Extend BB SLP vectorization
to use predicated tails' patch).
- Rebased on commit a28bb06b3e20a26579e06dc1b5bd6344ce4f88f0.
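To illustrate the vect_slp_get_bb_mask change above (a sketch of the
lane arithmetic only, not the patch itself): the mask or length
constructed for vector I of a group must cover the following number
of active lanes, where the helper name is hypothetical:
/* Lanes of a GROUP_SIZE-lane group covered by vector I, given NUNITS
   lanes per vector (hypothetical helper; assumes i*nunits is less
   than group_size).  */
static unsigned
active_lanes (unsigned i, unsigned nunits, unsigned group_size)
{
  unsigned start = i * nunits;
  return group_size - start >= nunits ? nunits : group_size - start;
}
In the asrdiv_2 case above (group_size=8, nunits=4, nvectors=2) this
gives 4 active lanes for each of the two vectors, i.e. neither is
partial, yet a valid all-true mask must still be produced rather than
the old NULL_TREE.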
Link to v1:
https://inbox.sourceware.org/gcc-patches/[email protected]/
Link to v2:
https://inbox.sourceware.org/gcc-patches/[email protected]/
Link to v3:
https://inbox.sourceware.org/gcc-patches/[email protected]/
Link to v4:
https://inbox.sourceware.org/gcc-patches/[email protected]/
Link to v5:
https://inbox.sourceware.org/gcc-patches/[email protected]/
Link to v6:
https://inbox.sourceware.org/gcc-patches/[email protected]/
Link to v7:
https://inbox.sourceware.org/gcc-patches/[email protected]/
Link to v8:
https://inbox.sourceware.org/gcc-patches/[email protected]/
Christopher Bazley (11):
Preparation to support predicated vector tails for BB SLP
Implement recording/getting of mask/length for BB SLP
Update constant creation for BB SLP with predicated tails
Refactor check_load_store_for_partial_vectors
New parameter for vect_maybe_update_slp_op_vectype
Handle variable-length vector types in store_constructor
AArch64/SVE: Relax the expectations of the popcnt-sve test
AArch64/SVE: Optimize vec_init for partial SVE vector modes
Extend BB SLP vectorization to use predicated tails
AArch64/SVE: Tests for use of predicated vector tails for BB SLP
Add extra conditional dump output to the vectorizer
gcc/config/aarch64/aarch64-sve.md | 16 +-
gcc/expr.cc | 7 +-
gcc/gimple-fold.cc | 54 ++
gcc/gimple-fold.h | 14 +
.../gcc.dg/vect/vect-over-widen-10.c | 2 +-
.../gcc.dg/vect/vect-over-widen-13.c | 2 +-
.../gcc.dg/vect/vect-over-widen-14.c | 2 +-
.../gcc.dg/vect/vect-over-widen-17.c | 2 +-
.../gcc.dg/vect/vect-over-widen-18.c | 2 +-
gcc/testsuite/gcc.dg/vect/vect-over-widen-5.c | 2 +-
gcc/testsuite/gcc.dg/vect/vect-over-widen-6.c | 2 +-
gcc/testsuite/gcc.dg/vect/vect-over-widen-7.c | 2 +-
gcc/testsuite/gcc.dg/vect/vect-over-widen-8.c | 2 +-
gcc/testsuite/gcc.dg/vect/vect-over-widen-9.c | 2 +-
gcc/testsuite/gcc.target/aarch64/popcnt-sve.c | 10 +-
gcc/testsuite/gcc.target/aarch64/sve/slp_6.c | 3 -
.../gcc.target/aarch64/sve/slp_pred_1.c | 33 +
.../gcc.target/aarch64/sve/slp_pred_1_run.c | 6 +
.../gcc.target/aarch64/sve/slp_pred_2.c | 33 +
.../gcc.target/aarch64/sve/slp_pred_3.c | 33 +
.../gcc.target/aarch64/sve/slp_pred_3_run.c | 6 +
.../gcc.target/aarch64/sve/slp_pred_4.c | 33 +
.../gcc.target/aarch64/sve/slp_pred_5.c | 36 +
.../gcc.target/aarch64/sve/slp_pred_6.c | 39 +
.../gcc.target/aarch64/sve/slp_pred_6_run.c | 6 +
.../gcc.target/aarch64/sve/slp_pred_7.c | 38 +
.../gcc.target/aarch64/sve/slp_pred_harness.h | 28 +
.../gcc.target/aarch64/sve/slp_stack.c | 27 +
gcc/tree-vect-loop.cc | 44 +-
gcc/tree-vect-slp.cc | 372 +++++++-
gcc/tree-vect-stmts.cc | 804 +++++++++++-------
gcc/tree-vectorizer.h | 116 ++-
32 files changed, 1397 insertions(+), 381 deletions(-)
create mode 100644 gcc/testsuite/gcc.target/aarch64/sve/slp_pred_1.c
create mode 100644 gcc/testsuite/gcc.target/aarch64/sve/slp_pred_1_run.c
create mode 100644 gcc/testsuite/gcc.target/aarch64/sve/slp_pred_2.c
create mode 100644 gcc/testsuite/gcc.target/aarch64/sve/slp_pred_3.c
create mode 100644 gcc/testsuite/gcc.target/aarch64/sve/slp_pred_3_run.c
create mode 100644 gcc/testsuite/gcc.target/aarch64/sve/slp_pred_4.c
create mode 100644 gcc/testsuite/gcc.target/aarch64/sve/slp_pred_5.c
create mode 100644 gcc/testsuite/gcc.target/aarch64/sve/slp_pred_6.c
create mode 100644 gcc/testsuite/gcc.target/aarch64/sve/slp_pred_6_run.c
create mode 100644 gcc/testsuite/gcc.target/aarch64/sve/slp_pred_7.c
create mode 100644 gcc/testsuite/gcc.target/aarch64/sve/slp_pred_harness.h
create mode 100644 gcc/testsuite/gcc.target/aarch64/sve/slp_stack.c
--
2.43.0