[PATCH v5 00/10] Extend BB SLP vectorization to use predicated tails

Christopher Bazley Tue, 02 Dec 2025 10:14:25 -0800

GCC already supports fully-predicated vectorisation for loops, both
using "traditional" loop vectorisation and loop-aware SLP
(superword-level parallelism). For example, GCC can vectorise:


void
foo (char *x)
{
  for (int i = 0; i < 6; i += 2)
    {
      x[i] += 1;
      x[i + 1] += 2;
    }
}

from which it generates the following assembly code (with -O2
-ftree-vectorize -march=armv9-a+sve -msve-vector-bits=scalable):

foo:
        ptrue   p7.b, vl6
        mov     w1, 513
        ld1b    z31.b, p7/z, [x0]
        mov     z30.h, w1
        add     z30.b, z31.b, z30.b
        st1b    z30.b, p7, [x0]
        ret

However, GCC cannot yet vectorise the unrolled form of the same
function, even though it is semantically equivalent:

void
foo (char *x)
{
  x[0] += 1;
  x[1] += 2;
  x[2] += 1;
  x[3] += 2;
  x[4] += 1;
  x[5] += 2;
}
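
The equivalence of the two forms can be checked with a small
harness. This is only an illustrative sketch and not part of the
patch series: the names foo_loop, foo_unrolled and same_result are
invented here so that both versions can live in one file.

```c
#include <assert.h>
#include <string.h>

/* Loop form, as in the first example.  */
void
foo_loop (char *x)
{
  for (int i = 0; i < 6; i += 2)
    {
      x[i] += 1;
      x[i + 1] += 2;
    }
}

/* Unrolled form, as in the second example.  */
void
foo_unrolled (char *x)
{
  x[0] += 1;
  x[1] += 2;
  x[2] += 1;
  x[3] += 2;
  x[4] += 1;
  x[5] += 2;
}

/* Return 1 if both forms produce the same six bytes when given the
   same six-byte INPUT, else 0.  */
int
same_result (const char *input)
{
  char a[6], b[6];

  assert (input != NULL);
  memcpy (a, input, 6);
  memcpy (b, input, 6);
  foo_loop (a);
  foo_unrolled (b);
  return memcmp (a, b, 6) == 0;
}
```

Calling same_result ("abcdef") exercises both functions on identical
data and compares the resulting bytes.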

These patches implement support for vectorising the unrolled form of
the above function by enabling use of a predicate mask or length
limit for basic block SLP. For example, it can now be vectorised to
the following assembly code (using the same options as above):

foo:
        ptrue   p7.b, vl6
        ptrue   p6.b, all
        ld1b    z31.b, p7/z, [x0]
        adrp    x1, .LC0
        add     x1, x1, :lo12:.LC0
        ld1rqb  z30.b, p6/z, [x1]
        add     z30.b, z31.b, z30.b
        st1b    z30.b, p7, [x0]
        ret

Predication is only used for groups whose size is not neatly divisible
into vectors of lengths that can be supported directly by the target.
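
As a rough model of that condition (a simplification for
illustration, not the logic used by the patches), a group can be
treated as needing a predicated tail when none of the vector lengths
the target supports directly divides the group size. The function
names and the idea of passing the supported lengths as an array of
lane counts are assumptions made for this sketch.

```c
#include <assert.h>
#include <stddef.h>

/* Return 1 if GROUP_SIZE lanes can be covered exactly by whole
   vectors of at least one of the N_LENGTHS lane counts in LENGTHS,
   else 0.  */
int
divides_neatly (unsigned group_size, const unsigned *lengths,
                size_t n_lengths)
{
  for (size_t i = 0; i < n_lengths; i++)
    {
      assert (lengths[i] > 0);
      if (group_size % lengths[i] == 0)
        return 1;
    }
  return 0;
}

/* Return 1 if, under this simplified model, the group would need a
   predicated (mask- or length-limited) tail.  */
int
needs_predicated_tail (unsigned group_size, const unsigned *lengths,
                       size_t n_lengths)
{
  return !divides_neatly (group_size, lengths, n_lengths);
}
```

With supported lengths of 8 and 16 byte lanes, the six-element group
in the example above is not neatly divisible, whereas a group of 16
would be.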

Bootstrapped and tested on aarch64-linux-gnu.
Based on commit e97550a7d0e1a8b31a76b0877c0e90a0163da7ee.

OK for trunk?

Changes in v2:
 - Updated regexes used by the vect-over-widen-*.c tests so that they
   do not accidentally match text that is now part of the dump but
   was not previously.
 - Updated regexes used by the aarch64/popcnt-sve.c test so that it
   expects 'all' as the operand of 'ptrue', instead of some specific
   number of active elements.
 - Removed a dump_printf from vect_get_num_copies because the
   gcc.dg/vect/vect-shift-5.c test relies on a fixed dump layout
   spanning multiple lines.
 - Fixed a bug in vect_get_vector_types_for_stmt, which had
   accidentally been modified to unconditionally set group_size to
   zero (even for basic block vectorization).
 - Relaxed an overzealous new check in
   vect_maybe_update_slp_op_vectype, which now considers the
   vectorization factor when checking the number of lanes of external
   definitions during loop vectorization. For example, lanes=11 is not
   divisible by subparts=8 with vf=1, but vf*lanes=88 is divisible by
   subparts=8 with vf=8.
 - Removed the stmts vector ownership changes relating to mishandling
   of failure of the vect_analyze_slp_instance function (to be fixed
   separately).
 - A check in get_vectype_for_scalar_type for whether the natural
   choice of vector type satisfies the group size was too simplistic.
   Instead of choosing a narrower vector type if the natural vector
   type could be long enough but not definitely (variable length, by
   proxy), get_len_load_store_mode is now used to explicitly query
   whether the target supports mask- or length-limited loads and
   stores. With the previous version, GCC preferred the natural vector
   type if it was known to be long enough; sometimes that resulted in
   better output than a narrower type, but it also caused some
   bb-slp-* tests to fail.
 - Shuffled dump_printfs around into separate patches.
 - An assertion is now used to protect my change to use the lower
   bound of the number of subparts in gimple_build_vector.
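
The arithmetic behind the relaxed lane-count check for external
definitions can be sketched as follows. This is not the code in
vect_maybe_update_slp_op_vectype; the function name and its plain
integer parameters are hypothetical and only model the divisibility
test.

```c
#include <assert.h>

/* Return 1 if LANES lanes, replicated VF times, fill a whole number
   of vectors that each hold SUBPARTS elements, else 0.  */
int
lanes_fill_whole_vectors (unsigned lanes, unsigned vf, unsigned subparts)
{
  assert (subparts > 0);
  return (lanes * vf) % subparts == 0;
}
```

With lanes=11 and subparts=8, a vectorization factor of 1 leaves 11
lanes, which is not a multiple of 8, whereas a factor of 8 gives 88
lanes, which is.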

Changes in v3:
 - Wrote changelog entries.
 - Created the gimple_build_vector_with_zero_padding function and
   used it in place of the gimple_build_vector function.
 - Reverted my change to use constant_lower_bound of subparts in the
   gimple_build_vector function.
 - Fixed a check for 'partial vectors required' in vect_analyze_stmt
   to include cases in which the minimum bound of the length of a
   variable-length vector type equals the number of active lanes in an
   SLP tree node. (The check is now 'maybe less than' instead of
   'known to be less than'.)
 - Reverted my change to regexes used by the aarch64/popcnt-sve.c test
   because it is not needed after the above fix.
 - Replaced SLP_TREE_CAN_USE_MASK_P and SLP_TREE_CAN_USE_LEN_P with
   SLP_TREE_PARTIAL_VECTORS_STYLE.
 - Titivated the documentation of vect_record_nunits.
 - Renamed local variable max_nunits to nunits in the
   vect_analyze_slp_reduc_chain function.
 - Added an assertion in vect_create_constant_vectors to verify a
   claimed relationship between group size and number of subparts.
 - Fixed a mistake in the description of vect_slp_get_bb_len.
 - Clarified the descriptions of vect_record_mask and vect_record_len
   (from 'would be required' to 'could be used ... if required').
 - Renamed the vect_can_use_mask_p function as vect_fully_masked_p
   and vect_can_use_len_p as vect_fully_with_length_p.
 - Tightened assertions in vect_record_mask and vect_record_len:
   instead of just 'the other function must not have been called', we
   now assert that no partial vectors style has already been set.
 - Updated the documentation of check_load_store_for_partial_vectors
   to describe its role in BB SLP vectorization.
 - Renamed masked_loop_p and len_loop_p as masked_p and len_p in
   contexts where they may now be used for BB SLP vectorization.
 - Guarded against a potential null pointer dereference in
   LOOP_VINFO_PARTIAL_LOAD_STORE_BIAS when used by the
   vectorizable_operation and vectorizable_call functions for BB SLP
   vectorization (instead of assuming !vect_fully_with_length_p).
 - Guarded against calling dump_printf_loc with null instead of
   a vector type in vectorizable_comparison_1.
 - Cleared any partial vectors style that might have been set by
   callees of vect_analyze_stmt if it finds that partial vectors
   aren't needed.
 - Reverted a change to make the vect_is_simple_use function more
   robust when the requested SLP tree child node is not an internal
   def and has no scalar operand to return.
 - Reverted a change to reserve slots for the definitions to be pushed
   for narrow FLOAT_EXPR vectorization before using quick_push in the
   vectorizable_conversion function.
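
The difference between 'known to be less than' and 'maybe less than'
in the vect_analyze_stmt fix above can be illustrated with a
simplified model of a variable-length vector. GCC expresses such
comparisons with known_lt and maybe_lt on poly_int values; the struct
and functions below are a hypothetical stand-in that models a length
as min + step * x lanes for an unknown runtime x >= 0, and are not
GCC's implementation.

```c
/* Simplified model of a vector length in lanes: MIN + STEP * x for
   some runtime value x >= 0.  STEP == 0 means a fixed length.  */
struct vec_len
{
  unsigned min;
  unsigned step;
};

/* Return 1 if LANES is less than LEN for every possible x.  */
int
lanes_known_lt (unsigned lanes, struct vec_len len)
{
  return lanes < len.min;
}

/* Return 1 if LANES is less than LEN for at least one possible x.  */
int
lanes_maybe_lt (unsigned lanes, struct vec_len len)
{
  return lanes < len.min || len.step > 0;
}
```

For a node with 16 active lanes and a vector type of 16 + 16x lanes,
lanes_known_lt is false but lanes_maybe_lt is true, because the
vector may turn out to be longer than the group at run time. That is
the case in which the minimum bound equals the number of active
lanes.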

Changes in v4:
 - Resolved code generation regressions in
   gcc.target/aarch64/sve/slp_6.c and
   gcc.target/aarch64/sve/slp_7_costly.c by adding code to
   handle variable-length vector types in store_constructor.
   (Still had to update expectations for slp_6.c though.)
 - Removed a default constructor from the definition of
   struct slp_tree_nunits in order to fix a compilation error in
   the vect_update_nunits function.
 - Renamed the gimple_build_vector_with_zero_padding function as
   gimple_build_vector_from_elems.
 - Changed gimple_build_vector_from_elems to take a vector type
   and list of element values instead of a tree_vector_builder.
 - Rewrote the description of gimple_build_vector_from_elems.
 - Removed redundant masking from gimple_build_vector_from_elems. It's
   now almost equivalent to gimple_build_vector in use cases where
   some element value is not constant.
 - Assert that the constant lower bound of the number of subparts
   in a vector type is a power of 2 instead of merely a multiple of 2
   in vect_build_slp_tree_1.
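
To show why the tightened assertion in vect_build_slp_tree_1 is
stricter, the two predicates can be compared directly. The helper
names are invented for this sketch; the patch itself may express the
check differently.

```c
/* Return 1 if N is a power of 2 (1, 2, 4, 8, ...), else 0.  */
int
is_power_of_2 (unsigned n)
{
  return n != 0 && (n & (n - 1)) == 0;
}

/* Return 1 if N is a multiple of 2, else 0.  This is the weaker
   condition: 6 and 12 pass it without being powers of 2.  */
int
is_multiple_of_2 (unsigned n)
{
  return n % 2 == 0;
}
```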

Changes in v5:
 - Removed the patch to track the minimum and maximum number of lanes for
   BB SLP.
 - Added a new function, vect_slp_tree_min_nunits, which is invoked by
   vect_analyze_slp_instance to compute the minimum number of subparts for
   all of the vector types used in an SLP tree for which an SLP instance is
   about to be created.
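
The kind of computation described for vect_slp_tree_min_nunits can be
pictured as a minimum taken over a tree walk. The sketch below is a
guess at the general shape only: struct node, MAX_CHILDREN and
tree_min_nunits are hypothetical, real SLP trees use GCC's own data
structures, and nodes shared between parents are simply visited more
than once here.

```c
#include <stddef.h>

#define MAX_CHILDREN 4

/* Hypothetical stand-in for an SLP tree node.  NUNITS is the minimum
   number of subparts of the node's vector type, or 0 if the node has
   no vector type.  */
struct node
{
  unsigned nunits;
  size_t n_children;
  const struct node *children[MAX_CHILDREN];
};

/* Return the smallest nonzero NUNITS found in the tree rooted at
   ROOT, or 0 if no node has a vector type.  */
unsigned
tree_min_nunits (const struct node *root)
{
  if (root == NULL)
    return 0;

  unsigned result = root->nunits;
  for (size_t i = 0; i < root->n_children; i++)
    {
      unsigned child = tree_min_nunits (root->children[i]);
      if (child != 0 && (result == 0 || child < result))
        result = child;
    }
  return result;
}
```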

Link to v1:
https://inbox.sourceware.org/gcc-patches/[email protected]/
 
Link to v2:
https://inbox.sourceware.org/gcc-patches/[email protected]/
 
Link to v3:
https://inbox.sourceware.org/gcc-patches/[email protected]/
 
Link to v4:
https://inbox.sourceware.org/gcc-patches/[email protected]/

Christopher Bazley (10):
  Preparation to support predicated vector tails for BB SLP
  Implement recording/getting of mask/length for BB SLP
  Update constant creation for BB SLP with predicated tails
  Refactor check_load_store_for_partial_vectors
  New parameter for vect_maybe_update_slp_op_vectype
  Handle variable-length vector types in store_constructor
  AArch64/SVE: Relax the expectations of the popcnt-sve test
  Extend BB SLP vectorization to use predicated tails
  AArch64/SVE: Tests for use of predicated vector tails for BB SLP
  Add extra conditional dump output to the vectorizer

 gcc/expr.cc                                   |   7 +-
 gcc/gimple-fold.cc                            |  54 ++
 gcc/gimple-fold.h                             |  14 +
 .../gcc.dg/vect/vect-over-widen-10.c          |   2 +-
 .../gcc.dg/vect/vect-over-widen-13.c          |   2 +-
 .../gcc.dg/vect/vect-over-widen-14.c          |   2 +-
 .../gcc.dg/vect/vect-over-widen-17.c          |   2 +-
 .../gcc.dg/vect/vect-over-widen-18.c          |   2 +-
 gcc/testsuite/gcc.dg/vect/vect-over-widen-5.c |   2 +-
 gcc/testsuite/gcc.dg/vect/vect-over-widen-6.c |   2 +-
 gcc/testsuite/gcc.dg/vect/vect-over-widen-7.c |   2 +-
 gcc/testsuite/gcc.dg/vect/vect-over-widen-8.c |   2 +-
 gcc/testsuite/gcc.dg/vect/vect-over-widen-9.c |   2 +-
 gcc/testsuite/gcc.target/aarch64/popcnt-sve.c |  10 +-
 gcc/testsuite/gcc.target/aarch64/sve/slp_6.c  |   3 -
 .../gcc.target/aarch64/sve/slp_pred_1.c       |  33 +
 .../gcc.target/aarch64/sve/slp_pred_1_run.c   |   6 +
 .../gcc.target/aarch64/sve/slp_pred_2.c       |  33 +
 .../gcc.target/aarch64/sve/slp_pred_3.c       |  33 +
 .../gcc.target/aarch64/sve/slp_pred_3_run.c   |   6 +
 .../gcc.target/aarch64/sve/slp_pred_4.c       |  33 +
 .../gcc.target/aarch64/sve/slp_pred_5.c       |  36 +
 .../gcc.target/aarch64/sve/slp_pred_6.c       |  39 +
 .../gcc.target/aarch64/sve/slp_pred_6_run.c   |   6 +
 .../gcc.target/aarch64/sve/slp_pred_7.c       |  38 +
 .../gcc.target/aarch64/sve/slp_pred_harness.h |  28 +
 gcc/tree-vect-loop.cc                         |  30 +-
 gcc/tree-vect-slp.cc                          | 278 +++++-
 gcc/tree-vect-stmts.cc                        | 795 +++++++++++-------
 gcc/tree-vectorizer.h                         | 107 ++-
 30 files changed, 1251 insertions(+), 358 deletions(-)
 create mode 100644 gcc/testsuite/gcc.target/aarch64/sve/slp_pred_1.c
 create mode 100644 gcc/testsuite/gcc.target/aarch64/sve/slp_pred_1_run.c
 create mode 100644 gcc/testsuite/gcc.target/aarch64/sve/slp_pred_2.c
 create mode 100644 gcc/testsuite/gcc.target/aarch64/sve/slp_pred_3.c
 create mode 100644 gcc/testsuite/gcc.target/aarch64/sve/slp_pred_3_run.c
 create mode 100644 gcc/testsuite/gcc.target/aarch64/sve/slp_pred_4.c
 create mode 100644 gcc/testsuite/gcc.target/aarch64/sve/slp_pred_5.c
 create mode 100644 gcc/testsuite/gcc.target/aarch64/sve/slp_pred_6.c
 create mode 100644 gcc/testsuite/gcc.target/aarch64/sve/slp_pred_6_run.c
 create mode 100644 gcc/testsuite/gcc.target/aarch64/sve/slp_pred_7.c
 create mode 100644 gcc/testsuite/gcc.target/aarch64/sve/slp_pred_harness.h

-- 
2.43.0
