GCC already supports fully-predicated vectorisation for loops, both
using "traditional" loop vectorisation and loop-aware SLP
(superword-level parallelism). For example, GCC can vectorise:
void
foo (char *x)
{
  for (int i = 0; i < 6; i += 2)
    {
      x[i] += 1;
      x[i + 1] += 2;
    }
}
from which it generates the following assembly code (with -O2
-ftree-vectorize -march=armv9-a+sve -msve-vector-bits=scalable):
foo:
        ptrue   p7.b, vl6
        mov     w1, 513
        ld1b    z31.b, p7/z, [x0]
        mov     z30.h, w1
        add     z30.b, z31.b, z30.b
        st1b    z30.b, p7, [x0]
        ret
However, GCC cannot yet vectorise the unrolled form of the same
function, even though it is semantically equivalent:
void
foo (char *x)
{
  x[0] += 1;
  x[1] += 2;
  x[2] += 1;
  x[3] += 2;
  x[4] += 1;
  x[5] += 2;
}
These patches implement support for vectorising the unrolled form of
the above function by enabling use of a predicate mask or length
limit for basic block SLP. For example, it can now be vectorised to
the following assembly code (using the same options as above):
foo:
        ptrue   p7.b, vl6
        ptrue   p6.b, all
        ld1b    z31.b, p7/z, [x0]
        adrp    x1, .LC0
        add     x1, x1, :lo12:.LC0
        ld1rqb  z30.b, p6/z, [x1]
        add     z30.b, z31.b, z30.b
        st1b    z30.b, p7, [x0]
        ret
Predication is only used for groups whose size cannot be divided
evenly into vectors of lengths that the target supports directly.
Bootstrapped and tested on aarch64-linux-gnu.
A list of test regressions that need resolving is as follows:
gcc.target/aarch64/sve/slp_6.c
gcc.target/aarch64/sve/slp_7_costly.c
This patch series changes the compiled output of the above two tests,
causing them to fail because the reduction and epilogue handling now
use SVE masked loads and stores. Unfortunately, the overall effect on
code generation is noticeably worse. This will be addressed in a
future version.
Changes in v2:
- Updated regexes used by the vect-over-widen-*.c tests so that they
do not accidentally match text that is now part of the dump but
was not previously.
- Updated regexes used by the aarch64/popcnt-sve.c test so that it
expects 'all' as the operand of 'ptrue', instead of some specific
number of active elements.
- Removed a dump_printf from vect_get_num_copies because the
gcc.dg/vect/vect-shift-5.c test relies on a fixed dump layout
spanning multiple lines.
- Fixed a bug in vect_get_vector_types_for_stmt, which had
accidentally been modified to unconditionally set group_size to
zero (even for basic block vectorization).
- Relaxed an overzealous new check in
vect_maybe_update_slp_op_vectype, which now considers the
vectorization factor when checking the number of lanes of external
  definitions during loop vectorization. E.g., lanes=11 is not
  divisible by subparts=8 with vf=1, but vf*lanes=88 is divisible by
  subparts=8 with vf=8.
- Removed the stmts vector ownership changes relating to mishandling
of failure of the vect_analyze_slp_instance function (to be fixed
separately).
- A check in get_vectype_for_scalar_type for whether the natural
choice of vector type satisfies the group size was too simplistic.
Instead of choosing a narrower vector type if the natural vector
type could be long enough but not definitely (variable length, by
proxy), get_len_load_store_mode is now used to explicitly query
whether the target supports mask- or length-limited loads and
stores. With the previous version, GCC preferred the natural vector
type if it was known to be long enough; sometimes that resulted in
better output than a narrower type, but it also caused some
bb-slp-* tests to fail.
- Shuffled dump_printfs around into separate patches.
- An assertion is now used to protect my change to use lower bound
of number of subparts in gimple_build_vector.
Changes in v3:
- Wrote changelog entries.
- Created the gimple_build_vector_with_zero_padding function and
  used it in place of the gimple_build_vector function.
- Reverted my change to use constant_lower_bound of subparts in the
gimple_build_vector function.
- Fixed a check for 'partial vectors required' in vect_analyze_stmt
to include cases in which the minimum bound of the length of a
variable-length vector type equals the number of active lanes in an
SLP tree node. (Maybe less than, instead of known to be less than.)
- Reverted my change to regexes used by the aarch64/popcnt-sve.c test
because it is not needed after the above fix.
- Replaced SLP_TREE_CAN_USE_MASK_P and SLP_TREE_CAN_USE_LEN_P with
SLP_TREE_PARTIAL_VECTORS_STYLE.
- Titivated the documentation of vect_record_nunits.
- Renamed local variable max_nunits to nunits in the
vect_analyze_slp_reduc_chain function.
- Added an assertion in vect_create_constant_vectors to verify a
claimed relationship between group size and number of subparts.
- Fixed a mistake in the description of vect_slp_get_bb_len.
- Clarified the descriptions of vect_record_mask and vect_record_len
(from 'would be required' to 'could be used ... if required').
- Renamed the vect_can_use_mask_p function as vect_fully_masked_p
and vect_can_use_len_p as vect_fully_with_length_p.
- Tightened assertions in vect_record_mask and vect_record_len:
instead of just 'the other function must not have been called', we
now assert that no partial vectors style has already been set.
- Updated the documentation of check_load_store_for_partial_vectors
to describe its role in BB SLP vectorization.
- Renamed masked_loop_p and len_loop_p as masked_p and len_p in
contexts where they may now be used for BB SLP vectorization.
- Guard against a potential null pointer dereference in
LOOP_VINFO_PARTIAL_LOAD_STORE_BIAS when used by the
vectorizable_operation and vectorizable_call functions for BB SLP
vectorization (instead of assuming !vect_fully_with_length_p).
- Guard against calling dump_printf_loc with null instead of
a vector type in vectorizable_comparison_1.
- Clear any partial vectors style that might have been set by callees
of vect_analyze_stmt if it finds that partial vectors aren't needed.
- Revert a change to make the vect_is_simple_use function more robust
when the requested SLP tree child node is not an internal def and
has no scalar operand to return.
- Revert a change to reserve slots for the definitions to be pushed
for narrow FLOAT_EXPR vectorization before using quick_push in the
vectorizable_conversion function.
Link to v1:
https://inbox.sourceware.org/gcc-patches/[email protected]/
Link to v2:
https://inbox.sourceware.org/gcc-patches/[email protected]/
Christopher Bazley (9):
Track the minimum and maximum number of lanes for BB SLP
Preparation to support predicated vector tails for BB SLP
Implement recording/getting of mask/length for BB SLP
Update constant creation for BB SLP with predicated tails
Refactor check_load_store_for_partial_vectors
New parameter for vect_maybe_update_slp_op_vectype
Extend BB SLP vectorization to use predicated tails
AArch64/SVE: Tests for use of predicated vector tails for BB SLP
Add extra conditional dump output to the vectorizer
gcc/gimple-fold.cc | 86 ++
gcc/gimple-fold.h | 15 +
.../gcc.dg/vect/vect-over-widen-10.c | 2 +-
.../gcc.dg/vect/vect-over-widen-13.c | 2 +-
.../gcc.dg/vect/vect-over-widen-14.c | 2 +-
.../gcc.dg/vect/vect-over-widen-17.c | 2 +-
.../gcc.dg/vect/vect-over-widen-18.c | 2 +-
gcc/testsuite/gcc.dg/vect/vect-over-widen-5.c | 2 +-
gcc/testsuite/gcc.dg/vect/vect-over-widen-6.c | 2 +-
gcc/testsuite/gcc.dg/vect/vect-over-widen-7.c | 2 +-
gcc/testsuite/gcc.dg/vect/vect-over-widen-8.c | 2 +-
gcc/testsuite/gcc.dg/vect/vect-over-widen-9.c | 2 +-
.../gcc.target/aarch64/sve/slp_pred_1.c | 33 +
.../gcc.target/aarch64/sve/slp_pred_1_run.c | 6 +
.../gcc.target/aarch64/sve/slp_pred_2.c | 33 +
.../gcc.target/aarch64/sve/slp_pred_3.c | 33 +
.../gcc.target/aarch64/sve/slp_pred_3_run.c | 6 +
.../gcc.target/aarch64/sve/slp_pred_4.c | 33 +
.../gcc.target/aarch64/sve/slp_pred_5.c | 36 +
.../gcc.target/aarch64/sve/slp_pred_6.c | 39 +
.../gcc.target/aarch64/sve/slp_pred_6_run.c | 6 +
.../gcc.target/aarch64/sve/slp_pred_7.c | 38 +
.../gcc.target/aarch64/sve/slp_pred_harness.h | 28 +
gcc/tree-vect-loop.cc | 30 +-
gcc/tree-vect-slp.cc | 388 ++++++---
gcc/tree-vect-stmts.cc | 796 +++++++++++-------
gcc/tree-vectorizer.h | 153 +++-
27 files changed, 1349 insertions(+), 430 deletions(-)
create mode 100644 gcc/testsuite/gcc.target/aarch64/sve/slp_pred_1.c
create mode 100644 gcc/testsuite/gcc.target/aarch64/sve/slp_pred_1_run.c
create mode 100644 gcc/testsuite/gcc.target/aarch64/sve/slp_pred_2.c
create mode 100644 gcc/testsuite/gcc.target/aarch64/sve/slp_pred_3.c
create mode 100644 gcc/testsuite/gcc.target/aarch64/sve/slp_pred_3_run.c
create mode 100644 gcc/testsuite/gcc.target/aarch64/sve/slp_pred_4.c
create mode 100644 gcc/testsuite/gcc.target/aarch64/sve/slp_pred_5.c
create mode 100644 gcc/testsuite/gcc.target/aarch64/sve/slp_pred_6.c
create mode 100644 gcc/testsuite/gcc.target/aarch64/sve/slp_pred_6_run.c
create mode 100644 gcc/testsuite/gcc.target/aarch64/sve/slp_pred_7.c
create mode 100644 gcc/testsuite/gcc.target/aarch64/sve/slp_pred_harness.h
--
2.43.0