GCC already supports fully-predicated vectorisation of loops, using
both "traditional" loop vectorisation and loop-aware SLP
(superword-level parallelism). For example, GCC can vectorise:

void
foo (char *x)
{
  for (int i = 0; i < 6; i += 2)
    {
      x[i] += 1;
      x[i + 1] += 2;
    }
}

from which it generates the following assembly code (with -O2
-ftree-vectorize -march=armv9-a+sve -msve-vector-bits=scalable):

foo:
    ptrue   p7.b, vl6
    mov     w1, 513
    ld1b    z31.b, p7/z, [x0]
    mov     z30.h, w1
    add     z30.b, z31.b, z30.b
    st1b    z30.b, p7, [x0]
    ret
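
(The immediate 513 is 0x0201, so the halfword splat of w1 into z30.h
yields the repeating byte pattern {1, 2, 1, 2, ...} that the loop
adds, while the vl6 predicate limits the load and store to the six
bytes that the loop touches.)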

However, GCC cannot yet vectorise the unrolled form of the same
function, even though it is semantically equivalent:

void
foo (char *x)
{
  x[0] += 1;
  x[1] += 2;
  x[2] += 1;
  x[3] += 2;
  x[4] += 1;
  x[5] += 2;
}

These patches implement support for vectorising the unrolled form of
the above function by enabling use of a predicate mask or length
limit for basic block SLP. For example, it can now be vectorised to
the following assembly code (using the same options as above):

foo:
    ptrue   p7.b, vl6
    ptrue   p6.b, all
    ld1b    z31.b, p7/z, [x0]
    adrp    x1, .LC0
    add     x1, x1, :lo12:.LC0
    ld1rqb  z30.b, p6/z, [x1]
    add     z30.b, z31.b, z30.b
    st1b    z30.b, p7, [x0]
    ret
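
(Here the repeating {1, 2, ...} constant is instead materialised from
a literal pool: ld1rqb replicates the 128-bit quadword at .LC0 across
the whole vector, and the vl6 predicate again limits the memory
accesses to six bytes.)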

The initial vector mode for an SLP region is "autodetected" by calling
aarch64_preferred_simd_mode, which prefers SVE modes (e.g. VNx4SI for
int) when SVE is supported, unless configured otherwise. If at least
one profitable subgraph can be scheduled, GCC does not try to
vectorise the region using any other modes, even though their
estimated costs might have been lower.

That is mitigated by the fact that a sequence of GIMPLE stmts such as:

  vectp.14_86 = x_50(D) + 16;
  slp_mask_87 = .WHILE_ULT (0, 8, { 0, ... });
  .MASK_STORE (vectp.14_86, 8B, slp_mask_87, vect__34.12_85);

is lowered to a fixed-length vector store (e.g., str d30, [x0, 16]) if
possible, instead of a more literal interpretation such as:

    add     x0, x0, 16
    ptrue   p7.b, vl8
    st1b    z30.b, p7, [x0]
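
For reference, .WHILE_ULT produces a predicate whose lane I is active
iff START + I < END (the third operand just carries the mask type).
The following standalone C program models those semantics; the
16-lane mask width is purely an assumption for illustration (a byte
predicate for a 128-bit vector):

#include <stdbool.h>
#include <stdio.h>

/* Scalar model of the .WHILE_ULT internal function: lane i of the
   result is active iff start + i < end.  LANES is an assumed width,
   chosen only for illustration.  */
#define LANES 16

static void
while_ult (unsigned start, unsigned end, bool mask[LANES])
{
  for (unsigned i = 0; i < LANES; i++)
    mask[i] = start + i < end;
}

int
main (void)
{
  bool mask[LANES];
  while_ult (0, 8, mask);	/* as in the GIMPLE above */
  for (unsigned i = 0; i < LANES; i++)
    putchar (mask[i] ? '1' : '0');	/* prints 1111111100000000 */
  putchar ('\n');
  return 0;
}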

Bootstrapped and tested on aarch64-linux-gnu.
Based on commit a28bb06b3e20a26579e06dc1b5bd6344ce4f88f0.

Changes in v2:
 - Updated regexes used by the vect-over-widen-*.c tests so that they
   do not accidentally match text that is now part of the dump but
   was not previously.
 - Updated regexes used by the aarch64/popcnt-sve.c test so that they
   expect 'all' as the operand of 'ptrue', instead of some specific
   number of active elements.
 - Removed a dump_printf from vect_get_num_copies because the
   gcc.dg/vect/vect-shift-5.c test relies on a fixed dump layout
   spanning multiple lines.
 - Fixed a bug in vect_get_vector_types_for_stmt, which had
   accidentally been modified to unconditionally set group_size to
   zero (even for basic block vectorization).
 - Relaxed an overzealous new check in
   vect_maybe_update_slp_op_vectype, which now considers the
   vectorization factor when checking the number of lanes of external
   definitions during loop vectorization: e.g., lanes=11 is not
   divisible by subparts=8 with vf=1, but vf*lanes=88 is divisible by
   subparts=8 with vf=8 (see the standalone sketch after this list).
 - Removed the stmts vector ownership changes relating to mishandling
   of failure of the vect_analyze_slp_instance function (to be fixed
   separately).
 - A check in get_vectype_for_scalar_type for whether the natural
   choice of vector type satisfies the group size was too simplistic.
   Instead of falling back to a narrower vector type whenever the
   natural vector type could be long enough but was not known to be
   (in practice, whenever it was variable-length),
   get_len_load_store_mode is now used to query explicitly whether
   the target supports mask- or length-limited loads and stores. With
   the previous version, GCC preferred the natural vector type if it
   was known to be long enough; sometimes that resulted in better
   output than a narrower type, but it also caused some bb-slp-*
   tests to fail.
 - Shuffled dump_printfs around into separate patches.
 - An assertion now protects my change to use the lower bound of the
   number of subparts in gimple_build_vector.
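
As a standalone sketch of the divisibility arithmetic mentioned in the
vect_maybe_update_slp_op_vectype item above (lanes_fit_p and the plain
modular test are my illustrations; the vectorizer itself works with
poly-int quantities):

#include <stdbool.h>
#include <stdio.h>

/* Illustrative model: an external definition with LANES lanes can be
   handled with vectors of SUBPARTS elements during loop vectorization
   if VF * LANES is a multiple of SUBPARTS.  */
static bool
lanes_fit_p (unsigned lanes, unsigned subparts, unsigned vf)
{
  return (vf * lanes) % subparts == 0;
}

int
main (void)
{
  /* lanes=11, subparts=8: rejected with vf=1, accepted with vf=8.  */
  printf ("vf=1: %s\n", lanes_fit_p (11, 8, 1) ? "ok" : "not ok");
  printf ("vf=8: %s\n", lanes_fit_p (11, 8, 8) ? "ok" : "not ok");
  return 0;
}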

Changes in v3:
 - Wrote changelog entries.
 - Created the gimple_build_vector_with_zero_padding function and
   used it in place of the gimple_build_vector function.
 - Reverted my change to use constant_lower_bound of subparts in the
   gimple_build_vector function.
 - Fixed a check for 'partial vectors required' in vect_analyze_stmt
   to include cases in which the minimum bound of the length of a
   variable-length vector type equals the number of active lanes in
   an SLP tree node (i.e., the number of active lanes may be less
   than the vector length, rather than being known to be less).
 - Reverted my change to regexes used by the aarch64/popcnt-sve.c test
   because it is not needed after the above fix.
 - Replaced SLP_TREE_CAN_USE_MASK_P and SLP_TREE_CAN_USE_LEN_P with
   SLP_TREE_PARTIAL_VECTORS_STYLE.
 - Titivated the documentation of vect_record_nunits.
 - Renamed local variable max_nunits to nunits in the
   vect_analyze_slp_reduc_chain function.
 - Added an assertion in vect_create_constant_vectors to verify a
   claimed relationship between group size and number of subparts.
 - Fixed a mistake in the description of vect_slp_get_bb_len.
 - Clarified the descriptions of vect_record_mask and vect_record_len
   (from 'would be required' to 'could be used ... if required').
 - Renamed the vect_can_use_mask_p function as vect_fully_masked_p
   and vect_can_use_len_p as vect_fully_with_length_p.
 - Tightened assertions in vect_record_mask and vect_record_len:
   instead of just 'the other function must not have been called', we
   now assert that no partial vectors style has already been set.
 - Updated the documentation of check_load_store_for_partial_vectors
   to describe its role in BB SLP vectorization.
 - Renamed masked_loop_p and len_loop_p as masked_p and len_p in
   contexts where they may now be used for BB SLP vectorization.
 - Guarded against a potential null pointer dereference in
   LOOP_VINFO_PARTIAL_LOAD_STORE_BIAS when used by the
   vectorizable_operation and vectorizable_call functions for BB SLP
   vectorization (instead of assuming !vect_fully_with_length_p).
 - Guarded against calling dump_printf_loc with null instead of
   a vector type in vectorizable_comparison_1.
 - Cleared any partial vectors style that might have been set by
   callees of vect_analyze_stmt when it finds that partial vectors
   aren't needed.
 - Reverted a change that made the vect_is_simple_use function more
   robust when the requested SLP tree child node is not an internal
   def and has no scalar operand to return.
 - Reverted a change that reserved slots for the definitions to be
   pushed for narrow FLOAT_EXPR vectorization before using quick_push
   in the vectorizable_conversion function.

Changes in v4:
 - Resolved code generation regressions in
   gcc.target/aarch64/sve/slp_6.c and
   gcc.target/aarch64/sve/slp_7_costly.c by adding code to
   handle variable-length vector types in store_constructor.
   (Still had to update expectations for slp_6.c though.)
 - Removed a default constructor from the definition of
   struct slp_tree_nunits in order to fix a compilation error in
   the vect_update_nunits function.
 - Renamed the gimple_build_vector_with_zero_padding function as
   gimple_build_vector_from_elems.
 - Changed gimple_build_vector_from_elems to take a vector type
   and list of element values instead of a tree_vector_builder.
 - Rewrote the description of gimple_build_vector_from_elems.
 - Removed redundant masking from gimple_build_vector_from_elems. It's
   now almost equivalent to gimple_build_vector in use cases where
   some element value is not constant.
 - Assert in vect_build_slp_tree_1 that the constant lower bound of
   the number of subparts in a vector type is a power of 2, instead
   of merely a multiple of 2.

Changes in v5:
 - Removed the patch to track the minimum and maximum number of lanes
   for BB SLP.
 - Added a new function, vect_slp_tree_min_nunits, which is invoked by
   vect_analyze_slp_instance to compute the minimum number of subparts
   for all of the vector types used in an SLP tree for which an SLP
   instance is about to be created.

Changes in v6:
 - Moved the BB SLP implementations of vect_record_mask and
   vect_record_len into tree-vect-slp.cc.
 - Relaxed (failing) assertions: the style of an SLP node is not
   always vect_partial_vectors_none on entry to vect_record_mask and
   vect_record_len, so the same style may now be set multiple times.
 - Updated an existing pattern named 'Simplify vector extracts' to
   guard against invalid invocations of tree_to_uhwi when a
   BIT_FIELD_REF has an unsuitable polynomial offset or size.
 - Rebased on 43afcb3a83c3648141285d80cd3d8a562047fb43.

Changes in v7:
 - Fixed the vectorizable_live_operation function so that it no
   longer generates invalid offsets such as BIT_FIELD_REF
   <_251, 32, POLY_INT_CST [96, 128]> for BB SLP with a
   variable-length vector type.
 - Reverted my changes to the existing pattern named 'Simplify vector
   extracts'; the extra robustness they added is now redundant.
 - Rebased on 630c1bfbb5bc3ba9fafca5e97096263ab8f0a04b.

Changes in v8:
 - Fixed a regression in vectorizable_simd_clone_call that was
   caused by an error when rebasing. (Silent conflict with
   commit cea34ac07e3bd7.)

Changes in v9:
 - Optimised vec_init for partial SVE vector modes and added a
   test to guard against any return to the pathological stack usage
   previously observed.
 - Removed -march=armv9-a+sve from the dg-options of new tests
   because it was redundant.
 - More arguments are now passed through to vect_slp_record_bb_mask
   and vect_slp_record_bb_len (from vect_slp_record_mask or
   vect_slp_record_len) in anticipation of a future implementation
   that shares masks, at least within an SLP sub-tree.
 - A new per-SLP-node variable, SLP_TREE_NUM_PARTIAL_VECTORS, is
   now used to calculate a worst-case estimate of the cost of
   creating masks or lengths for partial vector tails. This aligns
   BB SLP with existing costing of vectorised loops.
 - Replaced early returns from vect_slp_get_bb_mask and
   vect_slp_get_bb_len (taken when there is more than one vector, the
   requested vector is not the last, or the last vector is not
   partial) with code to construct an appropriate mask or length. The
   NULL_TREE previously returned in such edge cases is not a valid
   mask, so this change makes the code more robust.
   vect_slp_record_bb_mask has been seen to be called with nvectors=2
   when running the asrdiv_2 test (msve-vector-bits=128, group_size=8,
   nunits=4, i.e., 8/4 = 2 vectors); the vector type was the 128-bit
   vector(4) int.
 - Added a defensive check for BB SLP as part of the refactoring of
   check_load_store_for_partial_vectors that is done by the first patch,
   although we don't expect that function to be called for BB SLP until
   vect_can_use_partial_vectors_p has been updated to sometimes return
   true for BB SLP (which is done by the 'Extend BB SLP vectorization
   to use predicated tails' patch).
 - Rebased on commit a28bb06b3e20a26579e06dc1b5bd6344ce4f88f0.

Link to v1:
https://inbox.sourceware.org/gcc-patches/[email protected]/

Link to v2:
https://inbox.sourceware.org/gcc-patches/[email protected]/

Link to v3:
https://inbox.sourceware.org/gcc-patches/[email protected]/

Link to v4:
https://inbox.sourceware.org/gcc-patches/[email protected]/

Link to v5:
https://inbox.sourceware.org/gcc-patches/[email protected]/

Link to v6:
https://inbox.sourceware.org/gcc-patches/[email protected]/

Link to v7:
https://inbox.sourceware.org/gcc-patches/[email protected]/

Link to v8:
https://inbox.sourceware.org/gcc-patches/[email protected]/

Christopher Bazley (11):
  Preparation to support predicated vector tails for BB SLP
  Implement recording/getting of mask/length for BB SLP
  Update constant creation for BB SLP with predicated tails
  Refactor check_load_store_for_partial_vectors
  New parameter for vect_maybe_update_slp_op_vectype
  Handle variable-length vector types in store_constructor
  AArch64/SVE: Relax the expectations of the popcnt-sve test
  AArch64/SVE: Optimize vec_init for partial SVE vector modes
  Extend BB SLP vectorization to use predicated tails
  AArch64/SVE: Tests for use of predicated vector tails for BB SLP
  Add extra conditional dump output to the vectorizer

 gcc/config/aarch64/aarch64-sve.md             |  16 +-
 gcc/expr.cc                                   |   7 +-
 gcc/gimple-fold.cc                            |  54 ++
 gcc/gimple-fold.h                             |  14 +
 .../gcc.dg/vect/vect-over-widen-10.c          |   2 +-
 .../gcc.dg/vect/vect-over-widen-13.c          |   2 +-
 .../gcc.dg/vect/vect-over-widen-14.c          |   2 +-
 .../gcc.dg/vect/vect-over-widen-17.c          |   2 +-
 .../gcc.dg/vect/vect-over-widen-18.c          |   2 +-
 gcc/testsuite/gcc.dg/vect/vect-over-widen-5.c |   2 +-
 gcc/testsuite/gcc.dg/vect/vect-over-widen-6.c |   2 +-
 gcc/testsuite/gcc.dg/vect/vect-over-widen-7.c |   2 +-
 gcc/testsuite/gcc.dg/vect/vect-over-widen-8.c |   2 +-
 gcc/testsuite/gcc.dg/vect/vect-over-widen-9.c |   2 +-
 gcc/testsuite/gcc.target/aarch64/popcnt-sve.c |  10 +-
 gcc/testsuite/gcc.target/aarch64/sve/slp_6.c  |   3 -
 .../gcc.target/aarch64/sve/slp_pred_1.c       |  33 +
 .../gcc.target/aarch64/sve/slp_pred_1_run.c   |   6 +
 .../gcc.target/aarch64/sve/slp_pred_2.c       |  33 +
 .../gcc.target/aarch64/sve/slp_pred_3.c       |  33 +
 .../gcc.target/aarch64/sve/slp_pred_3_run.c   |   6 +
 .../gcc.target/aarch64/sve/slp_pred_4.c       |  33 +
 .../gcc.target/aarch64/sve/slp_pred_5.c       |  36 +
 .../gcc.target/aarch64/sve/slp_pred_6.c       |  39 +
 .../gcc.target/aarch64/sve/slp_pred_6_run.c   |   6 +
 .../gcc.target/aarch64/sve/slp_pred_7.c       |  38 +
 .../gcc.target/aarch64/sve/slp_pred_harness.h |  28 +
 .../gcc.target/aarch64/sve/slp_stack.c        |  27 +
 gcc/tree-vect-loop.cc                         |  44 +-
 gcc/tree-vect-slp.cc                          | 372 +++++++-
 gcc/tree-vect-stmts.cc                        | 804 +++++++++++-------
 gcc/tree-vectorizer.h                         | 116 ++-
 32 files changed, 1397 insertions(+), 381 deletions(-)
 create mode 100644 gcc/testsuite/gcc.target/aarch64/sve/slp_pred_1.c
 create mode 100644 gcc/testsuite/gcc.target/aarch64/sve/slp_pred_1_run.c
 create mode 100644 gcc/testsuite/gcc.target/aarch64/sve/slp_pred_2.c
 create mode 100644 gcc/testsuite/gcc.target/aarch64/sve/slp_pred_3.c
 create mode 100644 gcc/testsuite/gcc.target/aarch64/sve/slp_pred_3_run.c
 create mode 100644 gcc/testsuite/gcc.target/aarch64/sve/slp_pred_4.c
 create mode 100644 gcc/testsuite/gcc.target/aarch64/sve/slp_pred_5.c
 create mode 100644 gcc/testsuite/gcc.target/aarch64/sve/slp_pred_6.c
 create mode 100644 gcc/testsuite/gcc.target/aarch64/sve/slp_pred_6_run.c
 create mode 100644 gcc/testsuite/gcc.target/aarch64/sve/slp_pred_7.c
 create mode 100644 gcc/testsuite/gcc.target/aarch64/sve/slp_pred_harness.h
 create mode 100644 gcc/testsuite/gcc.target/aarch64/sve/slp_stack.c

-- 
2.43.0
