GCC already supports fully-predicated vectorisation for loops, both
using "traditional" loop vectorisation and loop-aware SLP
(superword-level parallelism). For example, GCC can vectorise:

void
foo (char *x)
{
  for (int i = 0; i < 6; i += 2)
    {
      x[i] += 1;
      x[i + 1] += 2;
    }
}

from which it generates the following assembly code (with -O2
-ftree-vectorize -march=armv9-a+sve -msve-vector-bits=scalable):

foo:
        ptrue   p7.b, vl6
        mov     w1, 513
        ld1b    z31.b, p7/z, [x0]
        mov     z30.h, w1
        add     z30.b, z31.b, z30.b
        st1b    z30.b, p7, [x0]
        ret

However, GCC cannot yet vectorise the unrolled form of the same
function, even though it is semantically equivalent:

void
foo (char *x)
{
  x[0] += 1;
  x[1] += 2;
  x[2] += 1;
  x[3] += 2;
  x[4] += 1;
  x[5] += 2;
}

These patches implement support for vectorising the unrolled form of
the above function by enabling use of a predicate mask or length
limit for basic block SLP. For example, it can now be vectorised to
the following assembly code (using the same options as above):

foo:
        ptrue   p7.b, vl6
        ptrue   p6.b, all
        ld1b    z31.b, p7/z, [x0]
        adrp    x1, .LC0
        add     x1, x1, :lo12:.LC0
        ld1rqb  z30.b, p6/z, [x1]
        add     z30.b, z31.b, z30.b
        st1b    z30.b, p7, [x0]
        ret

Predication is used only for groups whose size cannot be divided
evenly into vectors of a length that the target supports directly.
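
As an illustration only, the effect of the predicated sequence above
can be modelled in scalar C. This is a sketch, not code from the
patches: MODEL_VL, model_predicated_add and foo_model are hypothetical
names, the vector length is fixed at 16 bytes purely for the model,
and the contents of .LC0 are assumed to be the repeating pattern
{1, 2}. It uses unsigned char to keep the wrap-around arithmetic well
defined.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical fixed vector length for this model.  A real SVE vector
   length is only known at run time.  */
#define MODEL_VL 16

/* Scalar model of a predicated load, add and store: only the first
   ACTIVE lanes touch memory, in the manner of "ptrue p7.b, vl6".  */
void
model_predicated_add (unsigned char *x, const unsigned char *addend,
                      size_t active)
{
  unsigned char vec[MODEL_VL] = { 0 };  /* Inactive lanes load as zero.  */

  assert (active <= MODEL_VL);

  for (size_t lane = 0; lane < active; lane++)
    vec[lane] = x[lane];                /* Predicated load.  */

  for (size_t lane = 0; lane < MODEL_VL; lane++)
    vec[lane] = (unsigned char) (vec[lane] + addend[lane]);

  for (size_t lane = 0; lane < active; lane++)
    x[lane] = vec[lane];                /* Predicated store.  */
}

/* Model of the unrolled foo: six active lanes, with an assumed
   constant vector that alternates 1 and 2.  */
void
foo_model (unsigned char *x)
{
  static const unsigned char addend[MODEL_VL]
    = { 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2 };

  model_predicated_add (x, addend, 6);
}
```

In this model, bytes beyond the sixth element are never read or
written, which is what makes it safe to use a full-width vector for a
six-element group.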

Bootstrapped and tested on aarch64-linux-gnu.

The following test regressions still need to be resolved:

gcc.target/aarch64/sve/slp_6.c
gcc.target/aarch64/sve/slp_7_costly.c

This patch series changes the compiled output of the above two tests,
causing them to fail because the reduction and epilogue handling now
use SVE masked loads and stores. Unfortunately, the code generated
for these tests is noticeably worse overall. This will be addressed
in a future version.

Changes in v2:
 - Updated regexes used by the vect-over-widen-*.c tests so that they
   do not accidentally match text that is now part of the dump but
   was not previously.
 - Updated regexes used by the aarch64/popcnt-sve.c test so that it
   expects 'all' as the operand of 'ptrue', instead of some specific
   number of active elements.
 - Removed a dump_printf from vect_get_num_copies because the
   gcc.dg/vect/vect-shift-5.c test relies on a fixed dump layout
   spanning multiple lines.
 - Fixed a bug in vect_get_vector_types_for_stmt, which had
   accidentally been modified to unconditionally set group_size to
   zero (even for basic block vectorization).
 - Relaxed an overzealous new check in
   vect_maybe_update_slp_op_vectype, which now considers the
   vectorization factor when checking the number of lanes of external
   definitions during loop vectorization. For example, lanes=11 is
   not divisible by subparts=8 with vf=1, but vf*lanes=88 is divisible
   by subparts=8 with vf=8.
 - Removed the stmts vector ownership changes relating to mishandling
   of failure of the vect_analyze_slp_instance function (to be fixed
   separately).
 - A check in get_vectype_for_scalar_type for whether the natural
   choice of vector type satisfies the group size was too simplistic.
   Instead of choosing a narrower vector type if the natural vector
   type could be long enough but not definitely (variable length, by
   proxy), get_len_load_store_mode is now used to explicitly query
   whether the target supports mask- or length-limited loads and
   stores. With the previous version, GCC preferred the natural vector
   type if it was known to be long enough; sometimes that resulted in
   better output than a narrower type, but it also caused some
   bb-slp-* tests to fail.
 - Moved the dump_printf calls into separate patches.
 - An assertion now guards my change to use the lower bound of the
   number of subparts in gimple_build_vector.

Link to v1:
https://inbox.sourceware.org/gcc-patches/[email protected]/

Christopher Bazley (9):
  Track the minimum and maximum number of lanes for BB SLP
  Preparation to support predicated vector tails for BB SLP
  Implement recording/getting of mask/length for BB SLP
  Update constant creation for BB SLP with predicated tails
  Refactor check_load_store_for_partial_vectors
  New parameter for vect_maybe_update_slp_op_vectype
  Extend BB SLP vectorization to use predicated tails
  AArch64/SVE: Tests for use of predicated vector tails for BB SLP
  Add extra conditional dump output to the vectorizer

 gcc/gimple-fold.cc                            |   6 +-
 .../gcc.dg/vect/vect-over-widen-10.c          |   2 +-
 .../gcc.dg/vect/vect-over-widen-13.c          |   2 +-
 .../gcc.dg/vect/vect-over-widen-14.c          |   2 +-
 .../gcc.dg/vect/vect-over-widen-17.c          |   2 +-
 .../gcc.dg/vect/vect-over-widen-18.c          |   2 +-
 gcc/testsuite/gcc.dg/vect/vect-over-widen-5.c |   2 +-
 gcc/testsuite/gcc.dg/vect/vect-over-widen-6.c |   2 +-
 gcc/testsuite/gcc.dg/vect/vect-over-widen-7.c |   2 +-
 gcc/testsuite/gcc.dg/vect/vect-over-widen-8.c |   2 +-
 gcc/testsuite/gcc.dg/vect/vect-over-widen-9.c |   2 +-
 gcc/testsuite/gcc.target/aarch64/popcnt-sve.c |  10 +-
 .../gcc.target/aarch64/sve/slp_pred_1.c       |  33 +
 .../gcc.target/aarch64/sve/slp_pred_1_run.c   |   6 +
 .../gcc.target/aarch64/sve/slp_pred_2.c       |  33 +
 .../gcc.target/aarch64/sve/slp_pred_3.c       |  33 +
 .../gcc.target/aarch64/sve/slp_pred_3_run.c   |   6 +
 .../gcc.target/aarch64/sve/slp_pred_4.c       |  33 +
 .../gcc.target/aarch64/sve/slp_pred_5.c       |  36 +
 .../gcc.target/aarch64/sve/slp_pred_6.c       |  39 +
 .../gcc.target/aarch64/sve/slp_pred_6_run.c   |   6 +
 .../gcc.target/aarch64/sve/slp_pred_7.c       |  38 +
 .../gcc.target/aarch64/sve/slp_pred_harness.h |  28 +
 gcc/tree-vect-loop.cc                         |  28 +-
 gcc/tree-vect-slp.cc                          | 377 ++++++---
 gcc/tree-vect-stmts.cc                        | 741 +++++++++++-------
 gcc/tree-vectorizer.h                         | 142 +++-
 27 files changed, 1204 insertions(+), 411 deletions(-)
 create mode 100644 gcc/testsuite/gcc.target/aarch64/sve/slp_pred_1.c
 create mode 100644 gcc/testsuite/gcc.target/aarch64/sve/slp_pred_1_run.c
 create mode 100644 gcc/testsuite/gcc.target/aarch64/sve/slp_pred_2.c
 create mode 100644 gcc/testsuite/gcc.target/aarch64/sve/slp_pred_3.c
 create mode 100644 gcc/testsuite/gcc.target/aarch64/sve/slp_pred_3_run.c
 create mode 100644 gcc/testsuite/gcc.target/aarch64/sve/slp_pred_4.c
 create mode 100644 gcc/testsuite/gcc.target/aarch64/sve/slp_pred_5.c
 create mode 100644 gcc/testsuite/gcc.target/aarch64/sve/slp_pred_6.c
 create mode 100644 gcc/testsuite/gcc.target/aarch64/sve/slp_pred_6_run.c
 create mode 100644 gcc/testsuite/gcc.target/aarch64/sve/slp_pred_7.c
 create mode 100644 gcc/testsuite/gcc.target/aarch64/sve/slp_pred_harness.h

-- 
2.43.0
