Re: [PATCH v11 09/12] Extend BB SLP vectorization to use predicated tails

Richard Biener Wed, 24 Jun 2026 03:43:29 -0700

On Wed, Jun 3, 2026 at 5:21 PM Christopher Bazley <[email protected]> wrote:
>
> This enables use of a predicate mask or length limit for
> vectorization of basic blocks in cases where previously only the
> equivalent rolled (i.e. loop) form of some source code would have
> been vectorized. Predication is used for groups whose size
> is not neatly divisible into vectors of lengths that can be
> supported directly by the target.
>
> The initial vector mode for an SLP region is "autodetected" by calling
> aarch64_preferred_simd_mode, which prefers SVE modes if supported and
> unless configured otherwise (e.g. VNx4SI for int). If at least one
> profitable subgraph can be scheduled then GCC does not try to vectorise
> the region using any other modes, even though their estimated costs
> might otherwise have been lower.
>
> For example, if analysis of a 24-byte group succeeds with vector mode
> V16QI (using types vector(16) and vector(8) char) then the estimated
> cost of the vectorised code is 11+11=22. If analysis of the same group
> succeeds with vector mode VNx16QI (using type vector([16,16]) char for
> both subtrees) then the estimated cost is 15+15=30. In both cases, the
> estimated vectorised cost would beat the estimated scalar cost of
> 96+48=144, so vector([16,16]) wins because VNx16QI is tried first.
>
> This is mitigated by the fact that a sequence of GIMPLE stmts such as:
>
>   vectp.14_86 = x_50(D) + 16;
>   slp_mask_87 = .WHILE_ULT (0, 8, { 0, ... });
>   .MASK_STORE (vectp.14_86, 8B, slp_mask_87, vect__34.12_85);
>
> is lowered to a fixed-length vector store (e.g., str d30, [x0, 16]) if
> possible, instead of a more literal interpretation such as:
>
>   add   x0, x0, 16
>   ptrue p7.b, vl7
>   st1b  z30.b, p7, [x0]
>
> The vect_record_nunits function used during building of an SLP
> tree is updated to prevent it returning failure for BB SLP if the
> group size is not an integral multiple of the number of lanes in the
> vector type; it now allows such cases if the vector type might be more
> than long enough.
>
> Instead of giving up if vect_get_vector_types_for_stmt
> fails for the specified group size, vect_build_slp_tree_1
> now calls vect_get_vector_types_for_stmt again without
> a group size (which defaults to 0) as a fallback.
> If this succeeds then the initial failure is treated as a
> 'soft' failure that results in the group being split.
> Consequently, assertions that "For BB vectorization, we
> should always have a group size once we've constructed the
> SLP tree" were deleted in get_vectype_for_scalar_type and
> vect_get_vector_types_for_stmt.
>
> For BB SLP, vect_analyze_slp_instance previously gave up after
> building an SLP tree if it could not prove that the group size was
> at least the maximum lane count across all of the vector types in
> the SLP tree (which is unprovable for scalable vector types), or
> attempted to split the group if it could prove that the group size
> was greater than this maximum but not exactly divisible by it
> (which is also unprovable for scalable vector types).
>
> This function will now provisionally create a new SLP instance if the
> group size definitely does not exceed the minimum number of lanes,
> even if the group size otherwise satisfies conditions that would
> require a loop to be unrolled (e.g., a group of size 3 that uses a
> mixture of V4SI and V8HI types). If the group size lies between the
> minimum and maximum number of lanes then vectorization is still
> abandoned (e.g., a group of size 3 that uses a mixture of
> V2DI and V4SI types).
>
> With BB SLP, there is no need for agreement between different SLP
> nodes about whether to use masks or lengths to support partial vectors.
> Instead, that decision is made early and per individual SLP node, by
> vect_analyze_stmt. If a partial vector is required (i.e. if the number
> of subparts in the vector type may be greater than the number of active
> lanes for the node) then vect_analyze_stmt now requires
> SLP_TREE_CAN_USE_PARTIAL_VECTORS_P to be true; otherwise it clears any
> SLP_TREE_PARTIAL_VECTORS_STYLE that could have been set.
>
> The vect_get_num_copies function used during statement analysis
> is updated to return early with 1 if a vector type is long enough for
> the specified SLP tree node. This avoids an ICE in vect_get_num_vectors,
> which cannot cope with SVE vector types.
>
> When checking whether a value that is used outside the vectorized
> region can be supported, the vectorizable_live_operation function
> calculates which vector contains the result, and which lane of that
> vector we need. Previously, this calculation gave the wrong answer
> for BB SLP with a variable-length vector type (eventually generating
> invalid offsets such as BIT_FIELD_REF <_251, 32, POLY_INT_CST
> [96, 128]> to access the third element of a group using type VNx4SI)
> because it reused logic intended for loop vectorization, which selects
> the 'last' occurrence of a scalar index relative to the group size
> (which is a multiple of the vector length). For BB SLP with a
> predicate mask, only the first SLP_TREE_LANES elements are well
> defined.
>
> vect_create_vectorized_promotion_stmts no longer pushes
> more stmts than implied by vect_get_num_copies because it could
> previously overrun the number of slots allocated for an SLP node
> (based on its number of lanes and type). For example, four defs were
> pushed for a promotion of V8HI to V2DI (8/2=4) even if only two
> lanes of the V8HI were active. Allowing that later caused an ICE in
> vectorizable_operation for a parent node, because binary ops
> require both operands to be the same length.
>
> Since promotion no longer produces redundant definitions,
> vectorizable_conversion also had to be modified so that demotion no
> longer relies on an even number of defs being produced. If
> necessary, it now pushes a single constant zero def.
>
> The whole change is enabled by wiring the wrapper function
> vect_can_use_partial_vectors_p to SLP_TREE_CAN_USE_PARTIAL_VECTORS_P
> when invoked for BB SLP vectorization.
>
> Update test expectations for gcc.dg/vect/vect-over-widen-*.c,
> gcc.target/aarch64/sve/slp_6.c and
> gcc.target/aarch64/sve/vec_construct_*.c.
>
> The vec_construct_*.c tests previously expected their output
> to use Advanced SIMD instead of SVE despite their use of
> vector length agnostic types such as svint16_t and despite
> the fact that they are in the aarch64/sve directory. Since
> BB SLP can now vectorize these tests using VLA types such
> as 'vector([8,8]) char', and because (with one exception) the
> resultant code is deemed profitable relative to scalar code,
> GCC no longer considers vectorizing using non-VLA types such
> as 'vector(8) char' (although the estimated cost with non-VLA
> types might have been lower, had it been calculated).
> Instruction selection is not the focus of these tests; therefore
> I updated them to expect SVE instead (e.g. st1b instead of str)
> and added --param=aarch64-autovec-preference=sve-only to reduce
> future churn.
>
> Because the cost model takes into account predicate mask
> generation for BB SLP with VLA types, the threshold at which
> vectorized code wins against scalar code is higher than
> before. The number of elements stored by vec_construct_3.c was
> increased just enough to allow for that.

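For reference, the mode-selection behaviour described in the commit message above can be modelled with the costs quoted there. This is only a sketch: the ModeCost type and the first_profitable function are hypothetical names, not GCC internals, and "profitable" is assumed to mean simply that the estimated vector cost is below the estimated scalar cost.

```cpp
#include <cassert>
#include <optional>
#include <string>
#include <vector>

// Hypothetical record: one candidate vector mode and its estimated cost.
struct ModeCost {
  std::string mode;
  int vector_cost;
};

// Return the first candidate, in the order the modes are tried, whose
// estimated cost beats the scalar cost.  Later candidates are never
// examined, even if they would have been cheaper.
std::optional<ModeCost>
first_profitable (const std::vector<ModeCost> &tried_in_order, int scalar_cost)
{
  assert (scalar_cost >= 0);
  for (const ModeCost &candidate : tried_in_order)
    if (candidate.vector_cost < scalar_cost)
      return candidate;
  return std::nullopt;
}
```

With VNx16QI (15+15=30) tried before V16QI (11+11=22) and a scalar cost of 96+48=144, this model picks VNx16QI even though V16QI has the lower estimate.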

Some (possibly incomplete) review pieces below.

> gcc/ChangeLog:
>
>         * tree-vect-loop.cc (vectorizable_live_operation): Simplify the
>         calculation of the index of the final result to avoid
>         generating invalid polynomial offsets relative to the end of
>         variable-length vector types, which is what happens if the code
>         for loop vectorization is reused for basic block SLP.
>         * tree-vect-slp.cc (vect_record_nunits): Allow group sizes that
>         are indivisible by the vector length.
>         (vect_build_slp_tree_1): In case of failure of
>         vect_get_vector_types_for_stmt, try to get fallback vector
>         types and continue analysis to allow splitting of groups.
>         (vect_build_slp_tree_2): Don't call
>         can_duplicate_and_interleave_p when doing basic block SLP
>         vectorization.
>         (vect_update_slp_min_nunits_for_node): New recursive function.
>         Update min_nunits to reflect the minimum number of subparts for
>         all of the vector types used by an SLP subgraph.
>         (vect_slp_tree_min_nunits): New function. Initialize min_nunits
>         then call vect_update_slp_min_nunits_for_node.
>         (vect_analyze_slp_instance): For BB SLP vectorization, create
>         a new SLP instance if the group size definitely does not exceed
>         the minimum number of subparts for all of the vector types used
>         in the SLP tree, even if the group size otherwise satisfies
>         conditions that would require a loop to be unrolled.
>         (vectorizable_slp_permutation_1): Instead of asserting that an
>         SLP tree node's number of lanes is compatible with the chosen
>         vector width, return a failure indication if incompatible.
>         * tree-vect-stmts.cc (check_load_store_for_partial_vectors):
>         When calculating the number of vectors, get the group size from
>         SLP_TREE_LANES instead of a parameter (e.g., DR_GROUP_SIZE) if
>         doing BB SLP vectorization. Don't assume it can be divided by
>         the number of subparts in the vector type to get a compile-time
>         constant.
>         (vect_get_data_ptr_increment): Require a parameter of type
>         loop_vec_info instead of vec_info *.
>         (vect_create_vectorized_promotion_stmts): Require an SLP tree
>         node to be passed by the caller, for use by
>         vect_get_num_copies.
>         Stop pushing more stmts than implied by vect_get_num_copies.
>         (vectorizable_conversion): Pass SLP tree node to
>         vect_create_vectorized_promotion_stmts.
>         Demotion no longer relies on an even number of definitions
>         being produced by promotion. If necessary, push a single constant
>         zero definition.
>         (vectorizable_load): Pass loop_vec_info instead of vec_info *
>         when calling vect_get_data_ptr_increment.
>         (vect_analyze_stmt): For BB SLP vectorization, check whether
>         the group needs partial vectors. If it does then return a
>         failure indication if SLP_TREE_CAN_USE_PARTIAL_VECTORS_P was
>         cleared by a callee of this function; if it doesn't need
>         partial vectors then clear any partial vectors style that might
>         have been chosen by callees of this function.
>         (get_vectype_for_scalar_type): For BB SLP vectorization, allow
>         invocation of this function with a group size of zero even if
>         one or more SLP instances have been created.
>         If the number of subparts in the natural choice of vector type
>         could be greater than the group size then pick a shorter vector
>         type only if the target does not support partial vectors.
>         (vect_maybe_update_slp_op_vectype): Reject external definitions
>         that have a number of lanes not divisible by the number of
>         subparts in a vector type naively inferred from the scalar
>         type.
>         (vect_get_vector_types_for_stmt): Add a new output parameter of
>         Boolean type. Set it to true if the statement can't be
>         vectorized because it uses a data type that the target doesn't
>         support in vector form for a group of the given size, otherwise
>         false.
>         * tree-vectorizer.h (vect_get_num_copies): Return early with 1
>         if a vector type is long enough for the specified SLP tree
>         node to avoid an ICE in vect_get_num_vectors.
>         (vect_get_vector_types_for_stmt): Update function declaration.
>         (vect_can_use_partial_vectors_p): Handle the BB SLP use-case by
>         returning the result of SLP_TREE_CAN_USE_PARTIAL_VECTORS_P.
>
> gcc/testsuite/ChangeLog:
>
>         * gcc.dg/vect/vect-over-widen-10.c: Update test expectations to
>         avoid spurious matching of scan-tree-dump-not pattern.
>         * gcc.dg/vect/vect-over-widen-13.c: As above.
>         * gcc.dg/vect/vect-over-widen-14.c: As above.
>         * gcc.dg/vect/vect-over-widen-17.c: As above.
>         * gcc.dg/vect/vect-over-widen-18.c: As above.
>         * gcc.dg/vect/vect-over-widen-5.c: As above.
>         * gcc.dg/vect/vect-over-widen-6.c: As above.
>         * gcc.dg/vect/vect-over-widen-7.c: As above.
>         * gcc.dg/vect/vect-over-widen-8.c: As above.
>         * gcc.dg/vect/vect-over-widen-9.c: As above.
>         * gcc.target/aarch64/sve/vec_construct_1.c:
>           Expect SVE instead of ASIMD instructions and add
>           --param=aarch64-autovec-preference=sve-only to stop
>           flip-flopping.
>         * gcc.target/aarch64/sve/vec_construct_1.c: As above.
>         * gcc.target/aarch64/sve/vec_construct_2.c: As above.
>         * gcc.target/aarch64/sve/vec_construct_3.c:
>           Expect SVE instead of ASIMD instructions and add
>           --param=aarch64-autovec-preference=sve-only to avoid
>           flip-flopping. Increase the number of elements
>           stored to ensure vectorization using SVE is deemed
>           profitable despite predicate mask costs.
>         * gcc.target/aarch64/sve/vec_construct_4.c:
>           Expect SVE instead of ASIMD instructions and add
>           --param=aarch64-autovec-preference=sve-only to stop
>           flip-flopping.
>         * gcc.target/aarch64/sve/vec_construct_5.c: As above.
> ---
>  .../gcc.dg/vect/vect-over-widen-10.c          |   2 +-
>  .../gcc.dg/vect/vect-over-widen-13.c          |   2 +-
>  .../gcc.dg/vect/vect-over-widen-14.c          |   2 +-
>  .../gcc.dg/vect/vect-over-widen-17.c          |   2 +-
>  .../gcc.dg/vect/vect-over-widen-18.c          |   2 +-
>  gcc/testsuite/gcc.dg/vect/vect-over-widen-5.c |   2 +-
>  gcc/testsuite/gcc.dg/vect/vect-over-widen-6.c |   2 +-
>  gcc/testsuite/gcc.dg/vect/vect-over-widen-7.c |   2 +-
>  gcc/testsuite/gcc.dg/vect/vect-over-widen-8.c |   2 +-
>  gcc/testsuite/gcc.dg/vect/vect-over-widen-9.c |   2 +-
>  gcc/testsuite/gcc.target/aarch64/sve/slp_6.c  |   3 -
>  .../gcc.target/aarch64/sve/vec_construct_1.c  |   6 +-
>  .../gcc.target/aarch64/sve/vec_construct_2.c  |   4 +-
>  .../gcc.target/aarch64/sve/vec_construct_3.c  |  20 +-
>  .../gcc.target/aarch64/sve/vec_construct_4.c  |   5 +-
>  .../gcc.target/aarch64/sve/vec_construct_5.c  |   6 +-
>  gcc/tree-vect-loop.cc                         |  14 +-
>  gcc/tree-vect-slp.cc                          | 141 ++++++++++--
>  gcc/tree-vect-stmts.cc                        | 202 +++++++++++++-----
>  gcc/tree-vectorizer.h                         |  13 +-
>  20 files changed, 317 insertions(+), 117 deletions(-)
>
> diff --git a/gcc/testsuite/gcc.dg/vect/vect-over-widen-10.c b/gcc/testsuite/gcc.dg/vect/vect-over-widen-10.c
> index f0140e4ef6d..6efcf739db9 100644
> --- a/gcc/testsuite/gcc.dg/vect/vect-over-widen-10.c
> +++ b/gcc/testsuite/gcc.dg/vect/vect-over-widen-10.c
> @@ -16,5 +16,5 @@
>  /* { dg-final { scan-tree-dump {vect_recog_over_widening_pattern: detected:[^\n]* >> 1} "vect" } } */
>  /* { dg-final { scan-tree-dump {vect_recog_over_widening_pattern: detected:[^\n]* >> 2} "vect" } } */
>  /* { dg-final { scan-tree-dump {vect_recog_cast_forwprop_pattern: detected:[^\n]* \(unsigned char\)} "vect" } } */
> -/* { dg-final { scan-tree-dump-not {vector[^ ]* int} "vect" } } */
> +/* { dg-final { scan-tree-dump-not {vector[^ ]* int vect__} "vect" } } */
>  /* { dg-final { scan-tree-dump-times "vectorized 1 loop" 1 "vect" } } */
> diff --git a/gcc/testsuite/gcc.dg/vect/vect-over-widen-13.c b/gcc/testsuite/gcc.dg/vect/vect-over-widen-13.c
> index 08a65ea5518..720353716cf 100644
> --- a/gcc/testsuite/gcc.dg/vect/vect-over-widen-13.c
> +++ b/gcc/testsuite/gcc.dg/vect/vect-over-widen-13.c
> @@ -48,5 +48,5 @@ main (void)
>  /* { dg-final { scan-tree-dump {vect_recog_over_widening_pattern: detected:[^\n]* \+} "vect" } } */
>  /* { dg-final { scan-tree-dump {vect_recog_over_widening_pattern: detected:[^\n]* / 2} "vect" } } */
>  /* { dg-final { scan-tree-dump {vect_recog_cast_forwprop_pattern: detected:[^\n]* = \(signed char\)} "vect" } } */
> -/* { dg-final { scan-tree-dump-not {vector[^ ]* int} "vect" } } */
> +/* { dg-final { scan-tree-dump-not {vector[^ ]* int vect__} "vect" } } */
>  /* { dg-final { scan-tree-dump-times "vectorized 1 loop" 1 "vect" } } */
> diff --git a/gcc/testsuite/gcc.dg/vect/vect-over-widen-14.c b/gcc/testsuite/gcc.dg/vect/vect-over-widen-14.c
> index dfa09f5d2ca..f1d5f95c543 100644
> --- a/gcc/testsuite/gcc.dg/vect/vect-over-widen-14.c
> +++ b/gcc/testsuite/gcc.dg/vect/vect-over-widen-14.c
> @@ -15,5 +15,5 @@
>  /* { dg-final { scan-tree-dump {vect_recog_over_widening_pattern: detected:[^\n]* \+} "vect" } } */
>  /* { dg-final { scan-tree-dump {vect_recog_over_widening_pattern: detected:[^\n]* >> 1} "vect" } } */
>  /* { dg-final { scan-tree-dump {vect_recog_cast_forwprop_pattern: detected:[^\n]* = \(unsigned char\)} "vect" } } */
> -/* { dg-final { scan-tree-dump-not {vector[^ ]* int} "vect" } } */
> +/* { dg-final { scan-tree-dump-not {vector[^ ]* int vect__} "vect" } } */
>  /* { dg-final { scan-tree-dump-times "vectorized 1 loop" 1 "vect" } } */
> diff --git a/gcc/testsuite/gcc.dg/vect/vect-over-widen-17.c b/gcc/testsuite/gcc.dg/vect/vect-over-widen-17.c
> index 53fcfd0c06c..ac1a0f86727 100644
> --- a/gcc/testsuite/gcc.dg/vect/vect-over-widen-17.c
> +++ b/gcc/testsuite/gcc.dg/vect/vect-over-widen-17.c
> @@ -46,5 +46,5 @@ main (void)
>     adopts realign_load scheme.  It requires rs6000_builtin_mask_for_load to
>     generate mask whose return type is vector char.  */
>  /* { dg-final { scan-tree-dump-not {vector[^\n]*char} "vect" { target vect_hw_misalign } } } */
> -/* { dg-final { scan-tree-dump-not {vector[^ ]* int} "vect" } } */
> +/* { dg-final { scan-tree-dump-not {vector[^ ]* int vect__} "vect" } } */
>  /* { dg-final { scan-tree-dump-times "vectorized 1 loop" 1 "vect" } } */
> diff --git a/gcc/testsuite/gcc.dg/vect/vect-over-widen-18.c b/gcc/testsuite/gcc.dg/vect/vect-over-widen-18.c
> index aa58cd1c957..3ebfaa78270 100644
> --- a/gcc/testsuite/gcc.dg/vect/vect-over-widen-18.c
> +++ b/gcc/testsuite/gcc.dg/vect/vect-over-widen-18.c
> @@ -47,5 +47,5 @@ main (void)
>  /* { dg-final { scan-tree-dump {vect_recog_over_widening_pattern: detected:[^\n]* |} "vect" } } */
>  /* { dg-final { scan-tree-dump {vect_recog_over_widening_pattern: detected:[^\n]* <<} "vect" } } */
>  /* { dg-final { scan-tree-dump {vector[^\n]*char} "vect" } } */
> -/* { dg-final { scan-tree-dump-not {vector[^ ]* int} "vect" } } */
> +/* { dg-final { scan-tree-dump-not {vector[^ ]* int vect__} "vect" } } */
>  /* { dg-final { scan-tree-dump-times "vectorized 1 loop" 1 "vect" } } */
> diff --git a/gcc/testsuite/gcc.dg/vect/vect-over-widen-5.c b/gcc/testsuite/gcc.dg/vect/vect-over-widen-5.c
> index c2ab11a9d32..1d89789a86d 100644
> --- a/gcc/testsuite/gcc.dg/vect/vect-over-widen-5.c
> +++ b/gcc/testsuite/gcc.dg/vect/vect-over-widen-5.c
> @@ -49,5 +49,5 @@ main (void)
>  /* { dg-final { scan-tree-dump {vect_recog_over_widening_pattern: detected:[^\n]* \+ } "vect" } } */
>  /* { dg-final { scan-tree-dump {vect_recog_over_widening_pattern: detected:[^\n]* >> 1} "vect" } } */
>  /* { dg-final { scan-tree-dump {vect_recog_cast_forwprop_pattern: detected:[^\n]* \(signed char\)} "vect" } } */
> -/* { dg-final { scan-tree-dump-not {vector[^ ]* int} "vect" } } */
> +/* { dg-final { scan-tree-dump-not {vector[^ ]* int vect__} "vect" } } */
>  /* { dg-final { scan-tree-dump-times "vectorized 1 loop" 1 "vect" } } */
> diff --git a/gcc/testsuite/gcc.dg/vect/vect-over-widen-6.c b/gcc/testsuite/gcc.dg/vect/vect-over-widen-6.c
> index bda92c965e0..62d5a52587e 100644
> --- a/gcc/testsuite/gcc.dg/vect/vect-over-widen-6.c
> +++ b/gcc/testsuite/gcc.dg/vect/vect-over-widen-6.c
> @@ -13,5 +13,5 @@
>  /* { dg-final { scan-tree-dump {vect_recog_over_widening_pattern: detected:[^\n]* \+ } "vect" } } */
>  /* { dg-final { scan-tree-dump {vect_recog_over_widening_pattern: detected:[^\n]* >> 1} "vect" } } */
>  /* { dg-final { scan-tree-dump {vect_recog_cast_forwprop_pattern: detected:[^\n]* \(unsigned char\)} "vect" } } */
> -/* { dg-final { scan-tree-dump-not {vector[^ ]* int} "vect" } } */
> +/* { dg-final { scan-tree-dump-not {vector[^ ]* int vect__} "vect" } } */
>  /* { dg-final { scan-tree-dump-times "vectorized 1 loop" 1 "vect" } } */
> diff --git a/gcc/testsuite/gcc.dg/vect/vect-over-widen-7.c b/gcc/testsuite/gcc.dg/vect/vect-over-widen-7.c
> index 1d55e13fb1f..6e09631009a 100644
> --- a/gcc/testsuite/gcc.dg/vect/vect-over-widen-7.c
> +++ b/gcc/testsuite/gcc.dg/vect/vect-over-widen-7.c
> @@ -51,5 +51,5 @@ main (void)
>  /* { dg-final { scan-tree-dump {vect_recog_over_widening_pattern: detected:[^\n]* \+ } "vect" } } */
>  /* { dg-final { scan-tree-dump {vect_recog_over_widening_pattern: detected:[^\n]* >> 2} "vect" } } */
>  /* { dg-final { scan-tree-dump {vect_recog_cast_forwprop_pattern: detected:[^\n]* \(signed char\)} "vect" } } */
> -/* { dg-final { scan-tree-dump-not {vector[^ ]* int} "vect" } } */
> +/* { dg-final { scan-tree-dump-not {vector[^ ]* int vect__} "vect" } } */
>  /* { dg-final { scan-tree-dump-times "vectorized 1 loop" 1 "vect" } } */
> diff --git a/gcc/testsuite/gcc.dg/vect/vect-over-widen-8.c b/gcc/testsuite/gcc.dg/vect/vect-over-widen-8.c
> index 553c0712a79..b6d650beab4 100644
> --- a/gcc/testsuite/gcc.dg/vect/vect-over-widen-8.c
> +++ b/gcc/testsuite/gcc.dg/vect/vect-over-widen-8.c
> @@ -16,5 +16,5 @@
>  /* { dg-final { scan-tree-dump {vect_recog_over_widening_pattern: detected:[^\n]* \+ } "vect" } } */
>  /* { dg-final { scan-tree-dump {vect_recog_over_widening_pattern: detected:[^\n]* >> 2} "vect" } } */
>  /* { dg-final { scan-tree-dump {vect_recog_cast_forwprop_pattern: detected:[^\n]* \(unsigned char\)} "vect" } } */
> -/* { dg-final { scan-tree-dump-not {vector[^ ]* int} "vect" } } */
> +/* { dg-final { scan-tree-dump-not {vector[^ ]* int vect__} "vect" } } */
>  /* { dg-final { scan-tree-dump-times "vectorized 1 loop" 1 "vect" } } */
> diff --git a/gcc/testsuite/gcc.dg/vect/vect-over-widen-9.c b/gcc/testsuite/gcc.dg/vect/vect-over-widen-9.c
> index 36bfc68e053..e82f8a571da 100644
> --- a/gcc/testsuite/gcc.dg/vect/vect-over-widen-9.c
> +++ b/gcc/testsuite/gcc.dg/vect/vect-over-widen-9.c
> @@ -56,5 +56,5 @@ main (void)
>  /* { dg-final { scan-tree-dump {vect_recog_over_widening_pattern: detected:[^\n]* >> 1} "vect" } } */
>  /* { dg-final { scan-tree-dump {vect_recog_over_widening_pattern: detected:[^\n]* >> 2} "vect" } } */
>  /* { dg-final { scan-tree-dump {vect_recog_cast_forwprop_pattern: detected:[^\n]* \(signed char\)} "vect" } } */
> -/* { dg-final { scan-tree-dump-not {vector[^ ]* int} "vect" } } */
> +/* { dg-final { scan-tree-dump-not {vector[^ ]* int vect__} "vect" } } */
>  /* { dg-final { scan-tree-dump-times "vectorized 1 loop" 1 "vect" } } */
> diff --git a/gcc/testsuite/gcc.target/aarch64/sve/slp_6.c b/gcc/testsuite/gcc.target/aarch64/sve/slp_6.c
> index 44d128477d2..1c9ac15a699 100644
> --- a/gcc/testsuite/gcc.target/aarch64/sve/slp_6.c
> +++ b/gcc/testsuite/gcc.target/aarch64/sve/slp_6.c
> @@ -37,9 +37,6 @@ vec_slp_##TYPE (TYPE *restrict a, TYPE *restrict b, int n)  \
>  TEST_ALL (VEC_PERM)
>
>  /* These loops can't use SLP.  */
> -/* { dg-final { scan-assembler-not {\tld1b\t} } } */
> -/* { dg-final { scan-assembler-not {\tld1h\t} } } */
> -/* { dg-final { scan-assembler-not {\tld1w\t} } } */
>  /* { dg-final { scan-assembler-not {\tld1d\t} } } */
>  /* { dg-final { scan-assembler {\tld3b\t} } } */
>  /* { dg-final { scan-assembler {\tld3h\t} } } */
> diff --git a/gcc/testsuite/gcc.target/aarch64/sve/vec_construct_1.c b/gcc/testsuite/gcc.target/aarch64/sve/vec_construct_1.c
> index 2f8ce6808a9..eea13c28e49 100644
> --- a/gcc/testsuite/gcc.target/aarch64/sve/vec_construct_1.c
> +++ b/gcc/testsuite/gcc.target/aarch64/sve/vec_construct_1.c
> @@ -1,5 +1,5 @@
>  /* { dg-do compile } */
> -/* { dg-options "-O2 -ftree-slp-vectorize" } */
> +/* { dg-options "-O2 -ftree-slp-vectorize --param=aarch64-autovec-preference=sve-only" } */
>
>  /* Test that a group of stores of 8 elements derived from a horizontal
>     reduction is vectorized by constructing a vector and storing it.
> @@ -30,8 +30,8 @@ foo (svint8_t src0, svint8_t src1, svint8_t src2, svint8_t src3, svint8_t src4,
>    s.h = svaddv_s8 (all, src7);
>  }
>
> -/* { dg-final { scan-assembler-times {\tins\tv[0-9]+\.b\[[0-9]+\], v[0-9]+\.b\[[0-9]+\]\n} 7 } } */
> -/* { dg-final { scan-assembler-times {\tstr\td[0-9]+, } 1 } } */
> +/* { dg-final { scan-assembler-times {\tinsr\tz[0-9]+\.h, h[0-9]+\n} 7 } } */
> +/* { dg-final { scan-assembler-times {\tst1b\tz[0-9]+\.h, p[0-9]+, \[x[0-9]+\]\n} 1 } } */
>
>  /* { dg-final { scan-assembler-not {\tstr\tb[0-9]+, } } } */
>  /* { dg-final { scan-assembler-not {\tstrb\tw[0-9]+, } } } */
> diff --git a/gcc/testsuite/gcc.target/aarch64/sve/vec_construct_2.c b/gcc/testsuite/gcc.target/aarch64/sve/vec_construct_2.c
> index 6715118d7b0..2bf537e13e2 100644
> --- a/gcc/testsuite/gcc.target/aarch64/sve/vec_construct_2.c
> +++ b/gcc/testsuite/gcc.target/aarch64/sve/vec_construct_2.c
> @@ -1,5 +1,5 @@
>  /* { dg-do compile } */
> -/* { dg-options "-O2 -ftree-slp-vectorize" } */
> +/* { dg-options "-O2 -ftree-slp-vectorize --param=aarch64-autovec-preference=sve-only" } */
>
>  /* Test that a group of stores of 8 elements derived from the results of calls
>     to a function that has only vector parameters and returns a scalar result is
> @@ -40,3 +40,5 @@ foo (svint8_t src0, svint8_t src1, svint8_t src2, svint8_t src3, svint8_t src4,
>
>  /* { dg-final { scan-assembler-not {\tins\tv[0-9]+\.b\[[0-9]+\], w[0-9]+\n} } } */
>  /* { dg-final { scan-assembler-not {\tstr\td[0-9]+, } } } */
> +/* { dg-final { scan-assembler-not {\tfmov\th[0-9]+, h[0-9]+\n} } } */
> +/* { dg-final { scan-assembler-not {\tinsr\tz[0-9]+\.b, w[0-9]+\n} } } */
> diff --git a/gcc/testsuite/gcc.target/aarch64/sve/vec_construct_3.c b/gcc/testsuite/gcc.target/aarch64/sve/vec_construct_3.c
> index 8143d0050ad..ccadaccbcb4 100644
> --- a/gcc/testsuite/gcc.target/aarch64/sve/vec_construct_3.c
> +++ b/gcc/testsuite/gcc.target/aarch64/sve/vec_construct_3.c
> @@ -1,7 +1,7 @@
>  /* { dg-do compile } */
> -/* { dg-options "-O2 -ftree-slp-vectorize" } */
> +/* { dg-options "-O2 -ftree-slp-vectorize --param=aarch64-autovec-preference=sve-only" } */
>
> -/* Test that a group of stores of 8 elements derived from a horizontal
> +/* Test that a group of stores of 14 elements derived from a horizontal
>     reduction is vectorized by constructing a vector and storing it
>     even if the results of the reductions are narrowed.
>     Since there are no GPR-to-SIMD register transfers, there is no
> @@ -13,12 +13,14 @@
>
>  struct S
>  {
> -  char a, b, c, d, e, f, g, h;
> +  char a, b, c, d, e, f, g, h, i, j, k, l, m, n;
>  } s;
>
>  void
>  foo (svint16_t src0, svint32_t src1, svint16_t src2, svint32_t src3,
> -     svint32_t src4, svint16_t src5, svint32_t src6, svint16_t src7)
> +     svint32_t src4, svint16_t src5, svint32_t src6, svint16_t src7,
> +     svint16_t src8, svint32_t src9, svint16_t src10, svint32_t src11,
> +     svint32_t src12, svint16_t src13)
>  {
>    svbool_t all16 = svptrue_b16 ();
>    svbool_t all32 = svptrue_b32 ();
> @@ -30,10 +32,16 @@ foo (svint16_t src0, svint32_t src1, svint16_t src2, svint32_t src3,
>    s.f = svminv_s16 (all16, src5);
>    s.g = svlastb_s32 (svptrue_pat_b32 (SV_VL1), src6);
>    s.h = svaddv_s16 (all16, src7);
> +  s.i = svmaxv_s16 (all16, src8);
> +  s.j = svminv_s32 (all32, src9);
> +  s.k = svlastb_s16 (svptrue_pat_b16 (SV_VL1), src10);
> +  s.l = svaddv_s32 (all32, src11);
> +  s.m = svmaxv_s32 (all32, src12);
> +  s.n = svminv_s16 (all16, src13);
>  }
>
> -/* { dg-final { scan-assembler-times {\tins\tv[0-9]+\.b\[[0-9]+\], v[0-9]+\.b\[[0-9]+\]\n} 7 } } */
> -/* { dg-final { scan-assembler-times {\tstr\td[0-9]+, } 1 } } */
> +/* { dg-final { scan-assembler-times {\tinsr\tz[0-9]+\.b, b[0-9]+\n} 13 } } */
> +/* { dg-final { scan-assembler-times {\tst1b\tz[0-9]+\.b, p[0-9]+, \[x[0-9]\]\n} 1 } } */
>
>  /* { dg-final { scan-assembler-not {\tstr\tb[0-9]+, } } } */
>  /* { dg-final { scan-assembler-not {\tstrb\tw[0-9]+, } } } */
> diff --git a/gcc/testsuite/gcc.target/aarch64/sve/vec_construct_4.c b/gcc/testsuite/gcc.target/aarch64/sve/vec_construct_4.c
> index 49f8114b64c..3d41af684a3 100644
> --- a/gcc/testsuite/gcc.target/aarch64/sve/vec_construct_4.c
> +++ b/gcc/testsuite/gcc.target/aarch64/sve/vec_construct_4.c
> @@ -1,5 +1,5 @@
>  /* { dg-do compile } */
> -/* { dg-options "-O2 -ftree-slp-vectorize" } */
> +/* { dg-options "-O2 -ftree-slp-vectorize --param=aarch64-autovec-preference=sve-only" } */
>
>  /* Test that a group of stores of 8 elements derived from a horizontal
>     reduction is not vectorized by constructing a vector and storing it
> @@ -33,5 +33,6 @@ foo (svint16_t src0, svint8_t src1, svint16_t src2, svint8_t src3,
>  /* { dg-final { scan-assembler-times {\tstp\tw[0-9]+, w[0-9]+,} 4 } } */
>
>  /* { dg-final { scan-assembler-not {\tins\tv[0-9]+\.s\[[0-9]+\], w[0-9]+\n} } } */
> -/* { dg-final { scan-assembler-not {\tfmov\ts[0-9]+, w[0-9]+\n} } }
> +/* { dg-final { scan-assembler-not {\tfmov\ts[0-9]+, w[0-9]+\n} } } */
>  /* { dg-final { scan-assembler-not {\tstp\tq[0-9]+, q[0-9]+,} } } */
> +/* { dg-final { scan-assembler-not {\tinsr\tz[0-9]+.s, w[0-9]+\n} } } */
> diff --git a/gcc/testsuite/gcc.target/aarch64/sve/vec_construct_5.c b/gcc/testsuite/gcc.target/aarch64/sve/vec_construct_5.c
> index 983d6c69ebc..89e57406c0e 100644
> --- a/gcc/testsuite/gcc.target/aarch64/sve/vec_construct_5.c
> +++ b/gcc/testsuite/gcc.target/aarch64/sve/vec_construct_5.c
> @@ -1,5 +1,5 @@
>  /* { dg-do compile } */
> -/* { dg-options "-O2 -ftree-slp-vectorize" } */
> +/* { dg-options "-O2 -ftree-slp-vectorize --param=aarch64-autovec-preference=sve-only" } */
>
>  /* Test that a group of stores of 8 elements derived from lane extractions is
>     vectorized by constructing a vector and storing it.  Since there are no
> @@ -30,8 +30,8 @@ foo (svint8_t src0, svint8_t src1, svint8_t src2, svint8_t src3, svint8_t src4,
>    s.h = svlastb_s8 (p, src7);
>  }
>
> -/* { dg-final { scan-assembler-times {\tins\tv[0-9]+\.b\[[0-9]+\], v[0-9]+\.b\[[0-9]+\]\n} 7 } } */
> -/* { dg-final { scan-assembler-times {\tstr\td[0-9]+, } 1 } } */
> +/* { dg-final { scan-assembler-times {\tinsr\tz[0-9]+\.h, h[0-9]+\n} 7 } } */
> +/* { dg-final { scan-assembler-times {\tst1b\tz[0-9]+\.h, p[0-9]+, \[x[0-9]+\]\n} 1 } } */
>
>  /* { dg-final { scan-assembler-not {\tstr\tb[0-9]+, } } } */
>  /* { dg-final { scan-assembler-not {\tstrb\tw[0-9]+, } } } */
> diff --git a/gcc/tree-vect-loop.cc b/gcc/tree-vect-loop.cc
> index 6d602c67108..7503fd084cf 100644
> --- a/gcc/tree-vect-loop.cc
> +++ b/gcc/tree-vect-loop.cc
> @@ -10227,12 +10227,16 @@ vectorizable_live_operation (vec_info *vinfo, stmt_vec_info stmt_info,
>
>    gcc_assert (slp_index >= 0);
>
> -  /* Get the last occurrence of the scalar index from the concatenation of
> -     all the slp vectors. Calculate which slp vector it is and the index
> -     within.  */
> -  int num_scalar = SLP_TREE_LANES (slp_node);
>    int num_vec = vect_get_num_copies (vinfo, slp_node);
> -  poly_uint64 pos = (num_vec * nunits) - num_scalar + slp_index;
> +  poly_uint64 pos = slp_index;
> +  if (loop_vinfo)
> +    {
> +      /* Get the last occurrence of the scalar index from the concatenation of
> +        all the slp vectors. Calculate which slp vector it is and the index
> +        within.  */
> +      int num_scalar = SLP_TREE_LANES (slp_node);
> +      pos += (num_vec * nunits) - num_scalar;
> +    }

This looks independent enough and is OK to push separately.

>    /* Calculate which vector contains the result, and which lane of
>       that vector we need.  */
> diff --git a/gcc/tree-vect-slp.cc b/gcc/tree-vect-slp.cc
> index 1850af4e753..6af13e65e19 100644
> --- a/gcc/tree-vect-slp.cc
> +++ b/gcc/tree-vect-slp.cc
> @@ -1117,8 +1117,12 @@ vect_record_max_nunits (vec_info *vinfo, stmt_vec_info stmt_info,
>      }
>
>    /* If populating the vector type requires unrolling then fail
> -     before adjusting *max_nunits for basic-block vectorization.  */
> +     before adjusting *max_nunits for basic-block vectorization.
> +     Allow group sizes that are indivisible by the vector length only if they
> +     are known not to exceed the vector length.  We may be able to support such
> +     cases by generating constant masks.  */
>    if (is_a <bb_vec_info> (vinfo)
> +      && maybe_gt (group_size, TYPE_VECTOR_SUBPARTS (vectype))
>        && !multiple_p (group_size, TYPE_VECTOR_SUBPARTS (vectype)))
>      {
>        if (dump_enabled_p ())

Is this hunk still necessary?

> @@ -1170,12 +1174,29 @@ vect_build_slp_tree_1 (vec_info *vinfo, unsigned char *swap,
>    tree soft_fail_nunits_vectype = NULL_TREE;
>
>    tree vectype, nunits_vectype;
> +  bool unsupported_datatype = false;
>    if (!vect_get_vector_types_for_stmt (vinfo, first_stmt_info, &vectype,
> -                                      &nunits_vectype, group_size))
> +                                      &nunits_vectype, &unsupported_datatype,
> +                                      group_size))
>      {
> -      /* Fatal mismatch.  */
> -      matches[0] = false;
> -      return false;
> +      /* Try to get fallback vector types and continue analysis, producing
> +        matches[] as if vectype was not an issue.  This allows splitting of
> +        groups to happen.  */
> +      if (unsupported_datatype
> +         && vect_get_vector_types_for_stmt (vinfo, first_stmt_info, &vectype,
> +                                            &nunits_vectype,
> +                                            &unsupported_datatype))

Can we instead add an "allow_predicated_tail" flag to
vect_get_vector_types_for_stmt?
Btw, I don't remember that we fail here for, say, a group_size of 7, do
we?  We fail when there's no vector type with nunits < group_size?

> +       {
> +         gcc_assert (is_a<bb_vec_info> (vinfo));
> +         maybe_soft_fail = true;
> +         soft_fail_nunits_vectype = nunits_vectype;
> +       }
> +      else
> +       {
> +         /* Fatal mismatch.  */
> +         matches[0] = false;
> +         return false;
> +       }
>      }
>    if (is_a <bb_vec_info> (vinfo)
>        && known_le (TYPE_VECTOR_SUBPARTS (vectype), 1U))
> @@ -1705,16 +1726,22 @@ vect_build_slp_tree_1 (vec_info *vinfo, unsigned char *swap,
>
>    if (maybe_soft_fail)
>      {
> -      unsigned HOST_WIDE_INT const_nunits;
> -      if (!TYPE_VECTOR_SUBPARTS
> -           (soft_fail_nunits_vectype).is_constant (&const_nunits)
> -         || const_nunits > group_size)
> +      /* Use the known minimum number of subparts for VLA because we still need
> +        to choose a splitting point although the choice is more arbitrary.  */
> +      unsigned HOST_WIDE_INT const_nunits = constant_lower_bound (
> +         TYPE_VECTOR_SUBPARTS (soft_fail_nunits_vectype));
> +
> +      if (const_nunits > group_size)
>         matches[0] = false;
>        else
>         {
>           /* With constant vector elements simulate a mismatch at the
>              point we need to split.  */
> +         gcc_assert ((const_nunits & (const_nunits - 1)) == 0);
>           unsigned tail = group_size & (const_nunits - 1);
> +         if (tail == 0)
> +           tail = const_nunits;
> +         gcc_assert (group_size >= tail);
>           memset (&matches[group_size - tail], 0, sizeof (bool) * tail);
>         }
>        return false;
> @@ -2446,13 +2473,21 @@ vect_build_slp_tree_2 (vec_info *vinfo, slp_tree node,
>                   /* Check whether we can build the invariant.  If we can't
>                      we never will be able to.  */
>                   tree type = TREE_TYPE (chains[0][n].op);
> -                 if (!GET_MODE_SIZE (vinfo->vector_mode).is_constant ()
> -                     && (TREE_CODE (type) == BOOLEAN_TYPE
> -                         || !can_duplicate_and_interleave_p (vinfo, group_size,
> -                                                             type)))
> +                 if (!GET_MODE_SIZE (vinfo->vector_mode).is_constant ())
>                     {
> -                     matches[0] = false;
> -                     goto out;
> +                     if (TREE_CODE (type) == BOOLEAN_TYPE)
> +                       {
> +                         matches[0] = false;
> +                         goto out;
> +                       }
> +
> +                     if (!is_a<bb_vec_info> (vinfo)
> +                         && !can_duplicate_and_interleave_p (vinfo, group_size,
> +                                                             type))
> +                       {
> +                         matches[0] = false;
> +                         goto out;
> +                       }

Hmm, but this would be target specific, right?  It relies on the vec_init
CTOR expansion?  How about constants?  Don't we eventually want to
at least check s/group_size/TYPE_VECTOR_SUBPARTS/ for that
case so we can generate a zero-padded constant?

>                     }
>                 }
>               else if (dt != vect_internal_def)
> @@ -2881,7 +2916,7 @@ out:
>                     uniform_val = NULL_TREE;
>                     break;
>                   }
> -             if (!uniform_val
> +             if (!uniform_val && !is_a<bb_vec_info> (vinfo)
>                   && !can_duplicate_and_interleave_p (vinfo,
>                                                       oprnd_info->ops.length (),
>                                                       TREE_TYPE (op0)))
> @@ -4993,6 +5028,53 @@ vect_analyze_slp_reductions (loop_vec_info loop_vinfo,
>    return true;
>  }
>
> +/* Update MIN_NUNITS to reflect the minimum number of subparts for all of the
> +   vector types used by the SLP subgraph rooted at NODE.  VISITED is used to
> +   avoid reevaluating any node in the subgraph; it thereby prevents infinite
> +   recursion should a cycle be encountered. The value of MIN_NUNITS will only be
> +   updated if any node in the subgraph has a vector type with a number of
> +   subparts that is smaller than the passed-in value of MIN_NUNITS. Before
> +   calling this function for the first time, initialize MIN_NUNITS to
> +   UINT64_MAX.  */
> +
> +static void
> +vect_update_slp_min_nunits_for_node (slp_tree node, poly_uint64 &min_nunits,
> +                                    hash_set<slp_tree> &visited)
> +{
> +  if (!node || SLP_TREE_DEF_TYPE (node) != vect_internal_def)
> +    return;
> +
> +  if (visited.add (node))
> +    return;
> +
> +  for (slp_tree child : SLP_TREE_CHILDREN (node))
> +    vect_update_slp_min_nunits_for_node (child, min_nunits, visited);
> +
> +  tree vectype = SLP_TREE_VECTYPE (node);
> +  if (!vectype)
> +    return;
> +
> +  /* All unit counts have the form vec_info::vector_size * X for some
> +     rational X, therefore we know the values are ordered.  */
> +  poly_uint64 nunits = TYPE_VECTOR_SUBPARTS (vectype);
> +  min_nunits = known_eq (min_nunits, UINT64_MAX)
> +                ? nunits
> +                : ordered_min (min_nunits, nunits);
> +}
> +
> +/* For NODE, return the minimum number of subparts for all of the vector
> +   types used in the given SLP graph.  */
> +
> +static poly_uint64
> +vect_slp_tree_min_nunits (slp_tree node)
> +{
> +  poly_uint64 min_nunits = UINT64_MAX;
> +  hash_set<slp_tree> visited;
> +  vect_update_slp_min_nunits_for_node (node, min_nunits, visited);
> +  gcc_checking_assert (known_ne (min_nunits, UINT64_MAX));
> +  return min_nunits;
> +}
> +
>  /* Analyze an SLP instance starting from a group of grouped stores.  Call
>     vect_build_slp_tree to build a tree of packed stmts if possible.
>     Return FALSE if it's impossible to SLP any stmt in the group.  */
> @@ -5062,8 +5144,8 @@ vect_analyze_slp_instance (vec_info *vinfo,
>        poly_uint64 unrolling_factor
>         = calculate_unrolling_factor (max_nunits, group_size);
>
> -      if (maybe_ne (unrolling_factor, 1U)
> -         && is_a <bb_vec_info> (vinfo))
> +      if (maybe_ne (unrolling_factor, 1U) && is_a<bb_vec_info> (vinfo)
> +         && !known_ge (vect_slp_tree_min_nunits (node), group_size))

Please line up the &&s on new lines.

Is this really good enough?  'group_size' might not be uniform across
the SLP graph so I'd have expected the check to be done for each SLP
node (as part of the vect_slp_tree_min_nunits walk).
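
To illustrate what I mean by a per-node check, here is a rough standalone
sketch.  It is not vectorizer code: toy_node, its fields and all_nodes_fit
are hypothetical stand-ins (for SLP_TREE_LANES, a constant
TYPE_VECTOR_SUBPARTS and the min_nunits walk respectively), and it assumes
constant unit counts rather than poly_uint64.

```cpp
#include <cassert>
#include <cstdint>
#include <unordered_set>
#include <vector>

// Hypothetical stand-in for an SLP node.
struct toy_node
{
  unsigned lanes;                    // stands in for SLP_TREE_LANES (node)
  uint64_t nunits;                   // constant subparts; 0 means "no vectype"
  std::vector<toy_node *> children;  // stands in for SLP_TREE_CHILDREN (node)
};

// Return true if every node with a vector type can hold all of its own
// lanes in a single vector.  VISITED avoids re-walking shared nodes and
// terminates on cycles.
bool
all_nodes_fit (toy_node *node, std::unordered_set<toy_node *> &visited)
{
  if (!node || !visited.insert (node).second)
    return true;

  // Compare against this node's lane count, not the root's group size.
  if (node->nunits != 0 && node->nunits < node->lanes)
    return false;

  for (toy_node *child : node->children)
    if (!all_nodes_fit (child, visited))
      return false;

  return true;
}
```

The point of the sketch is only that the comparison uses each node's own
lane count, so a child with more lanes than the root is still caught.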

>         {
>           unsigned HOST_WIDE_INT const_max_nunits;
>           if (!max_nunits.is_constant (&const_max_nunits)
> @@ -5148,9 +5230,10 @@ vect_analyze_slp_instance (vec_info *vinfo,
>             = TREE_TYPE (DR_REF (STMT_VINFO_DATA_REF (stmt_info)));
>           tree vectype = get_vectype_for_scalar_type (vinfo, scalar_type,
>                                                       1 << floor_log2 (i));
> -         unsigned HOST_WIDE_INT const_nunits;
> -         if (vectype
> -             && TYPE_VECTOR_SUBPARTS (vectype).is_constant (&const_nunits))
> +         unsigned HOST_WIDE_INT const_nunits
> +           = vectype ? constant_lower_bound (TYPE_VECTOR_SUBPARTS (vectype))
> +                     : 0;
> +         if (const_nunits > 1 && (i % const_nunits) == 0)

I prefer

            if (vectype
                && (const_nunits = constant_lower_bound (...)) > 1
                &&

instead of the conditional assignment.
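
For reference, a standalone sketch of the shape I have in mind.  The
toy_vectype struct and toy_lower_bound are hypothetical stand-ins for the
vector type and for constant_lower_bound (TYPE_VECTOR_SUBPARTS (vectype));
the third condition is the (i % const_nunits) == 0 test from the patch.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for a vector type with a known minimum lane count.
struct toy_vectype
{
  uint64_t min_subparts;
};

// Hypothetical stand-in for constant_lower_bound (TYPE_VECTOR_SUBPARTS (...)).
uint64_t
toy_lower_bound (const toy_vectype *vectype)
{
  return vectype->min_subparts;
}

// Return true if a group of I lanes can be split at a vector boundary.
// const_nunits is only assigned (and only read) once vectype is known to
// be non-null, so no conditional initializer is needed.
bool
can_split_at_vector_boundary (const toy_vectype *vectype, uint64_t i)
{
  uint64_t const_nunits;
  if (vectype
      && (const_nunits = toy_lower_bound (vectype)) > 1
      && (i % const_nunits) == 0)
    return true;
  return false;
}
```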

>             {
>               /* Split into two groups at the first vector boundary.  */
>               gcc_assert ((const_nunits & (const_nunits - 1)) == 0);
> @@ -11596,7 +11679,21 @@ vectorizable_slp_permutation_1 (vec_info *vinfo, gimple_stmt_iterator *gsi,
>        unpack_factor = 1;
>      }
>    unsigned olanes = unpack_factor * ncopies * SLP_TREE_LANES (node);
> -  gcc_assert (repeating_p || multiple_p (olanes, nunits));
> +
> +  /* With fully-predicated BB-SLP, an external node's number of lanes can be
> +     incompatible with the chosen vector width (e.g., lane packs of 3 with a
> +     natural 2-lane vector type).  */
> +  if (!repeating_p && !multiple_p (olanes, nunits))
> +    {
> +      if (dump_p)
> +       dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
> +                        "unsupported permutation %p: vector type %T,"
> +                        " nunits=" HOST_WIDE_INT_PRINT_UNSIGNED
> +                        " ncopies=%" PRIu64 ", lanes=%u and unpack=%u\n",
> +                        (void *) node, vectype, estimated_poly_value (nunits),
> +                        ncopies, SLP_TREE_LANES (node), unpack_factor);
> +      return -1;
> +    }
>
>    /* Compute the { { SLP operand, vector index}, lane } permutation sequence
>       from the { SLP operand, scalar lane } permutation as recorded in the
> diff --git a/gcc/tree-vect-stmts.cc b/gcc/tree-vect-stmts.cc
> index cd3ba6fa1cb..367a9c63ea4 100644
> --- a/gcc/tree-vect-stmts.cc
> +++ b/gcc/tree-vect-stmts.cc
> @@ -1671,23 +1671,27 @@ check_load_store_for_partial_vectors (vec_info *vinfo, tree vectype,
>      unsigned int nvectors;
>      if (can_div_away_from_zero_p (size, nunits, &nvectors))
>        return nvectors;
> -    gcc_unreachable ();
> +
> +    gcc_assert (known_le (size, nunits));
> +    return 1u;
>    };
>
>    poly_uint64 nunits = TYPE_VECTOR_SUBPARTS (vectype);
> -  poly_uint64 vf = loop_vinfo ? LOOP_VINFO_VECT_FACTOR (loop_vinfo) : 1;
> +  poly_uint64 size = loop_vinfo
> +                      ? group_size * LOOP_VINFO_VECT_FACTOR (loop_vinfo)
> +                      : SLP_TREE_LANES (slp_node);

I'm not sure this is correct if we have a load permutation on
slp_node.  It at least warrants a comment (or an assert?).

>    unsigned factor;
>    vect_partial_vector_style partial_vector_style
>      = vect_get_partial_vector_style (vectype, is_load, &factor, elsvals);
>
>    if (partial_vector_style == vect_partial_vectors_len)
>      {
> -      nvectors = group_memory_nvectors (group_size * vf, nunits);
> +      nvectors = group_memory_nvectors (size, nunits);
>        vect_record_len (vinfo, slp_node, nvectors, vectype, factor);
>      }
>    else if (partial_vector_style == vect_partial_vectors_while_ult)
>      {
> -      nvectors = group_memory_nvectors (group_size * vf, nunits);
> +      nvectors = group_memory_nvectors (size, nunits);
>        vect_record_mask (vinfo, slp_node, nvectors, vectype, scalar_mask);
>      }
>    else
> @@ -3362,12 +3366,11 @@ vect_get_strided_load_store_ops (stmt_vec_info stmt_info, slp_tree node,
>
>  static tree
>  vect_get_loop_variant_data_ptr_increment (
> -  vec_info *vinfo, tree aggr_type, gimple_stmt_iterator *gsi,
> +  loop_vec_info loop_vinfo, tree aggr_type, gimple_stmt_iterator *gsi,
>    vec_loop_lens *loop_lens, dr_vec_info *dr_info,
>    vect_memory_access_type memory_access_type)
>  {
> -  loop_vec_info loop_vinfo = dyn_cast<loop_vec_info> (vinfo);
> -  tree step = vect_dr_behavior (vinfo, dr_info)->step;
> +  tree step = vect_dr_behavior (loop_vinfo, dr_info)->step;
>
>    /* gather/scatter never reach here.  */
>    gcc_assert (!mat_gather_scatter_p (memory_access_type));
> @@ -3411,7 +3414,7 @@ vect_get_data_ptr_increment (vec_info *vinfo, gimple_stmt_iterator *gsi,
>
>    loop_vec_info loop_vinfo = dyn_cast<loop_vec_info> (vinfo);
>    if (loop_vinfo && LOOP_VINFO_USING_SELECT_VL_P (loop_vinfo))
> -    return vect_get_loop_variant_data_ptr_increment (vinfo, aggr_type, gsi,
> +    return vect_get_loop_variant_data_ptr_increment (loop_vinfo, aggr_type, gsi,
>                                                      loop_lens, dr_info,
>                                                      memory_access_type);

The two hunks look independent.  They are OK if pushed separately.

>
> @@ -5297,7 +5300,7 @@ vect_create_vectorized_demotion_stmts (vec_info *vinfo, vec<tree> *vec_oprnds,
>     call the function recursively.  */
>
>  static void
> -vect_create_vectorized_promotion_stmts (vec_info *vinfo,
> +vect_create_vectorized_promotion_stmts (vec_info *vinfo, slp_tree slp_node,
>                                         vec<tree> *vec_oprnds0,
>                                         vec<tree> *vec_oprnds1,
>                                         stmt_vec_info stmt_info, tree vec_dest,
> @@ -5310,37 +5313,39 @@ vect_create_vectorized_promotion_stmts (vec_info *vinfo,
>    gimple *new_stmt1, *new_stmt2;
>    vec<tree> vec_tmp = vNULL;
>
> -  vec_tmp.create (vec_oprnds0->length () * 2);
> +  const unsigned ncopies = vect_get_num_copies (vinfo, slp_node);
> +  vec_tmp.create (ncopies);
> +  gcc_assert (vec_oprnds0->length () <= ncopies);
>    FOR_EACH_VEC_ELT (*vec_oprnds0, i, vop0)
>      {
> +      if (vec_tmp.length () >= ncopies)
> +       break;
> +
>        if (op_type == binary_op)
>         vop1 = (*vec_oprnds1)[i];
>        else
>         vop1 = NULL_TREE;
>
>        /* Generate the two halves of promotion operation.  */
> -      new_stmt1 = vect_gen_widened_results_half (vinfo, ch1, vop0, vop1,
> -                                                op_type, vec_dest, gsi,
> -                                                stmt_info);
> -      new_stmt2 = vect_gen_widened_results_half (vinfo, ch2, vop0, vop1,
> -                                                op_type, vec_dest, gsi,
> -                                                stmt_info);
> -      if (is_gimple_call (new_stmt1))
> -       {
> -         new_tmp1 = gimple_call_lhs (new_stmt1);
> -         new_tmp2 = gimple_call_lhs (new_stmt2);
> -       }
> -      else
> +      new_stmt1
> +       = vect_gen_widened_results_half (vinfo, ch1, vop0, vop1, op_type,
> +                                        vec_dest, gsi, stmt_info);
> +      new_tmp1 = is_gimple_call (new_stmt1) ? gimple_call_lhs (new_stmt1)
> +                                           : gimple_assign_lhs (new_stmt1);
> +      vec_tmp.quick_push (new_tmp1);
> +
> +      if (vec_tmp.length () < ncopies)
>         {
> -         new_tmp1 = gimple_assign_lhs (new_stmt1);
> -         new_tmp2 = gimple_assign_lhs (new_stmt2);
> +         new_stmt2
> +           = vect_gen_widened_results_half (vinfo, ch2, vop0, vop1, op_type,
> +                                            vec_dest, gsi, stmt_info);
> +         new_tmp2 = is_gimple_call (new_stmt2) ? gimple_call_lhs (new_stmt2)
> +                                               : gimple_assign_lhs (new_stmt2);
> +         vec_tmp.quick_push (new_tmp2);
>         }
> -
> -      /* Store the results for the next step.  */
> -      vec_tmp.quick_push (new_tmp1);
> -      vec_tmp.quick_push (new_tmp2);
>      }
>
> +  gcc_assert (vec_tmp.length () <= ncopies);
>    vec_oprnds0->release ();
>    *vec_oprnds0 = vec_tmp;
>  }
> @@ -5553,6 +5558,7 @@ vectorizable_conversion (vec_info *vinfo,
>       from the scalar type.  */
>    if (!vectype_in)
>      vectype_in = get_vectype_for_scalar_type (vinfo, rhs_type, slp_node);
> +
>    if (!cost_vec)
>      gcc_assert (vectype_in);
>    if (!vectype_in)
> @@ -5961,12 +5967,15 @@ vectorizable_conversion (vec_info *vinfo,
>                                              stmt_info, this_dest, gsi, c1,
>                                              op_type);
>           else
> -           vect_create_vectorized_promotion_stmts (vinfo, &vec_oprnds0,
> -                                                   &vec_oprnds1, stmt_info,
> -                                                   this_dest, gsi,
> +           vect_create_vectorized_promotion_stmts (vinfo, slp_node,
> +                                                   &vec_oprnds0, &vec_oprnds1,
> +                                                   stmt_info, this_dest, gsi,
>                                                     c1, c2, op_type);
>         }
>
> +      gcc_assert (vec_oprnds0.length ()
> +                 == vect_get_num_copies (vinfo, slp_node));
> +
>        FOR_EACH_VEC_ELT (vec_oprnds0, i, vop0)
>         {
>           gimple *new_stmt;
> @@ -5990,6 +5999,16 @@ vectorizable_conversion (vec_info *vinfo,
>          generate more than one vector stmt - i.e - we need to "unroll"
>          the vector stmt by a factor VF/nunits.  */
>        vect_get_vec_defs (vinfo, slp_node, op0, &vec_oprnds0);
> +
> +      /* Promotion no longer produces redundant defs (since support was
> +       added for length/mask-predicated BB SLP of awkward-sized groups),
> +       therefore demotion now has to handle that case too.  */
> +      if (vec_oprnds0.length () % 2 != 0)
> +       {
> +         tree vectype = TREE_TYPE (vec_oprnds0[0]);
> +         vec_oprnds0.safe_push (build_zero_cst (vectype));
> +       }
> +

Please split out the required promotion/demotion-related changes.  They are
useful elsewhere (getting rid of max_nunits) and we can iterate on those
separately.

>        /* Arguments are ready.  Create the new vector stmts.  */
>        if (cvt_type && modifier == NARROW_DST)
>         FOR_EACH_VEC_ELT (vec_oprnds0, i, vop0)
> @@ -10803,7 +10822,7 @@ vectorizable_load (vec_info *vinfo,
>
>        aggr_type = build_array_type_nelts (elem_type, group_size * nunits);
>        if (!costing_p)
> -       bump = vect_get_data_ptr_increment (vinfo, gsi, dr_info, aggr_type,
> +       bump = vect_get_data_ptr_increment (loop_vinfo, gsi, dr_info, aggr_type,
>                                             memory_access_type, loop_lens);

That hunk seems unrelated, OK to push as a separate commit.

>
>        unsigned int inside_cost = 0, prologue_cost = 0;
> @@ -13460,6 +13479,38 @@ vect_analyze_stmt (vec_info *vinfo,
>                                    " live stmt not supported: %G",
>                                    stmt_info->stmt);
>
> +  if (bb_vinfo)
> +    {
> +      unsigned int group_size = SLP_TREE_LANES (node);
> +      tree vectype = SLP_TREE_VECTYPE (node);
> +      poly_uint64 nunits = TYPE_VECTOR_SUBPARTS (vectype);
> +      bool needs_partial = maybe_lt (group_size, nunits);
> +      if (needs_partial)
> +       {
> +         /* If partial vectors are required then they must be supported by the
> +            target; however, don't assume that a partial vectors style has
> +            been set because a mask or length may not be required for the
> +            statement.  */
> +         if (!SLP_TREE_CAN_USE_PARTIAL_VECTORS_P (node))
> +           return opt_result::failure_at (stmt_info->stmt,
> +                                          "not vectorized: SLP node needs but "
> +                                          "cannot use partial vectors: %G",
> +                                          stmt_info->stmt);

That's the job of vectorizable_* and it's already done for the loop
vect case.  We should do the same for predicated tails.

> +       }
> +      else
> +       {
> +         /* If we don't need partial vectors then we don't care about whether
> +            they are supported or not; however, we need to clear any partial
> +            vectors style that might have been chosen because it will be used
> +            to control generation of lengths or masks.  */
> +         SLP_TREE_PARTIAL_VECTORS_STYLE (node) = vect_partial_vectors_none;
> +         SLP_TREE_NUM_PARTIAL_VECTORS (node) = 0;

Likewise.

> +       }
> +
> +      if (maybe_gt (group_size, nunits))
> +       gcc_assert (multiple_p (group_size, nunits));
> +    }
> +
>    return opt_result::success ();
>  }
>
> @@ -13767,13 +13818,7 @@ tree
>  get_vectype_for_scalar_type (vec_info *vinfo, tree scalar_type,
>                              unsigned int group_size)
>  {
> -  /* For BB vectorization, we should always have a group size once we've
> -     constructed the SLP tree; the only valid uses of zero GROUP_SIZEs
> -     are tentative requests during things like early data reference
> -     analysis and pattern recognition.  */
> -  if (is_a <bb_vec_info> (vinfo))
> -    gcc_assert (vinfo->slp_instances.is_empty () || group_size != 0);
> -  else
> +  if (!is_a <bb_vec_info> (vinfo))

But we really want to override the preferred SIMD mode, so any
"auto-detection" for BB vect looks wrong to me.  Can we instead
pass group-size rounded down to the next power of two?
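
By "rounded down" I mean the equivalent of the 1 << floor_log2 (i) that
vect_analyze_slp_instance already uses when splitting.  A standalone sketch,
where round_down_pow2 is a hypothetical helper name and not an existing
function:

```cpp
#include <cassert>

// Hypothetical helper: return the largest power of two that does not
// exceed GROUP_SIZE, or 0 if GROUP_SIZE is 0 (the "no group size" case).
unsigned
round_down_pow2 (unsigned group_size)
{
  if (group_size == 0)
    return 0;

  unsigned result = 1;
  // Comparing against group_size / 2 avoids overflow when doubling.
  while (result <= group_size / 2)
    result *= 2;
  return result;
}
```

With that, a group of 7 would ask for a 4-lane type and a group of 8 would
keep asking for an 8-lane type.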

>      group_size = 0;
>
>    tree vectype = get_related_vectype_for_scalar_type (vinfo->vector_mode,
> @@ -13787,10 +13832,18 @@ get_vectype_for_scalar_type (vec_info *vinfo, tree scalar_type,
>      vinfo->used_vector_modes.add (TYPE_MODE (vectype));
>
>    /* If the natural choice of vector type doesn't satisfy GROUP_SIZE,
> -     try again with an explicit number of elements.  */
> -  if (vectype
> -      && group_size
> -      && maybe_ge (TYPE_VECTOR_SUBPARTS (vectype), group_size))
> +     try again with an explicit number of elements.  A vector type satisfies
> +     GROUP_SIZE if it is definitely not too long to store the whole group,
> +     or we are able to generate masks to handle the unknown number of excess
> +     lanes that might exist.  Otherwise, we must substitute a vector type that
> +     can be used to carve up the group.
> +   */
> +  if (vectype && group_size
> +      && maybe_gt (TYPE_VECTOR_SUBPARTS (vectype), group_size)
> +      && (vect_get_partial_vector_style (vectype, true)
> +           == vect_partial_vectors_none
> +         || vect_get_partial_vector_style (vectype, false)
> +              == vect_partial_vectors_none))

This changes _ge to _gt - why?  I don't like a partial vector style query here.
I know we're setting the actual type up only late, but shouldn't we get
proper failures if we always analyze this as using partial vectors but
cannot support them later?

>      {
>        /* Start with the biggest number of units that fits within
>          GROUP_SIZE and halve it until we find a valid vector type.
> @@ -14106,7 +14159,36 @@ vect_maybe_update_slp_op_vectype (vec_info *vinfo, slp_tree op, tree vectype)
>        && SLP_TREE_DEF_TYPE (op) == vect_external_def
>        && SLP_TREE_LANES (op) > 1)
>      return false;
> -  (void) vinfo; /* FORNOW */
> +
> +  /* When the vectorizer falls back to building vector operands from scalars,
> +     it can create SLP trees with external defs that have a number of lanes not
> +     divisible by the number of subparts in a vector type naively inferred from
> +     the scalar type.  Reject such types to avoid ICE when later computing the
> +     prologue cost for invariant operands.  */
> +  if (SLP_TREE_DEF_TYPE (op) == vect_external_def)
> +    {
> +      poly_uint64 vf = 1;
> +
> +      if (loop_vec_info loop_vinfo = dyn_cast<loop_vec_info> (vinfo))
> +       vf = LOOP_VINFO_VECT_FACTOR (loop_vinfo);
> +
> +      vf *= SLP_TREE_LANES (op);
> +
> +      if (maybe_lt (TYPE_VECTOR_SUBPARTS (vectype), vf)
> +         && !multiple_p (vf, TYPE_VECTOR_SUBPARTS (vectype)))

So this seems to be a guard for vect_get_num_copies?  It seems to me
that if this is only a problem for costing we should fix it up there, not here.

As I understand it, for SVE you always have a single vector, given that
you want to use AdvSIMD for the full-vector parts: with 7 int lanes
you force a split during SLP discovery to get 4 lanes with AdvSIMD and
3 lanes with a predicated tail and SVE?  On x86 we can share the
vector type so technically we do not need to force a split there
(but I think there's no harm done in doing so).  So -- why does

  if (known_ge (TYPE_VECTOR_SUBPARTS (vectype), vf))
    return 1;

in vect_get_num_vectors not trigger and solve the issue?
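
To spell out the behaviour I would expect, here is a toy model with
constant unit counts.  toy_num_vectors is a hypothetical name and this is
not the real implementation, which works on poly_uint64 values:

```cpp
#include <cassert>
#include <cstdint>

// Toy model: number of vectors of NUNITS lanes needed to hold LANES lanes.
// A group that fits in one vector (possibly with a predicated tail) needs
// exactly one; anything larger is assumed to divide exactly.
uint64_t
toy_num_vectors (uint64_t nunits, uint64_t lanes)
{
  if (nunits >= lanes)
    return 1;

  assert (lanes % nunits == 0);
  return lanes / nunits;
}
```

In this model 3 lanes in a 4-lane type gives 1, so the early return should
already cover the single-vector case and only the larger non-multiple case
could reach the exact division.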

> +       {
> +         if (dump_enabled_p ())
> +           dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
> +                            "lanes=" HOST_WIDE_INT_PRINT_UNSIGNED
> +                            " is not divisible by "
> +                            "subparts=" HOST_WIDE_INT_PRINT_UNSIGNED ".\n",
> +                            estimated_poly_value (vf),
> +                            estimated_poly_value (
> +                              TYPE_VECTOR_SUBPARTS (vectype)));
> +         return false;
> +       }
> +    }
> +
>    SLP_TREE_VECTYPE (op) = vectype;
>    return true;
>  }
> @@ -14814,27 +14896,32 @@ vect_gen_while_not (gimple_seq *seq, tree mask_type, tree start_index,
>
>     - Set *NUNITS_VECTYPE_OUT to the vector type that contains the maximum
>       number of units needed to vectorize STMT_INFO, or NULL_TREE if the
> -     statement does not help to determine the overall number of units.  */
> +     statement does not help to determine the overall number of units.
> +
> +   - Set *UNSUPPORTED_DATATYPE to false.
> +
> +   On failure:
> +
> +   - Set *UNSUPPORTED_DATATYPE to true if the statement can't be vectorized
> +     because it uses a data type that the target doesn't support in vector form
> +     for a group of the given GROUP_SIZE.
> + */
>
>  opt_result
>  vect_get_vector_types_for_stmt (vec_info *vinfo, stmt_vec_info stmt_info,
>                                 tree *stmt_vectype_out,
>                                 tree *nunits_vectype_out,
> +                               bool *unsupported_datatype,
>                                 unsigned int group_size)
>  {
>    gimple *stmt = stmt_info->stmt;
>
> -  /* For BB vectorization, we should always have a group size once we've
> -     constructed the SLP tree; the only valid uses of zero GROUP_SIZEs
> -     are tentative requests during things like early data reference
> -     analysis and pattern recognition.  */
> -  if (is_a <bb_vec_info> (vinfo))
> -    gcc_assert (vinfo->slp_instances.is_empty () || group_size != 0);
> -  else
> +  if (!is_a<bb_vec_info> (vinfo))
>      group_size = 0;
>
>    *stmt_vectype_out = NULL_TREE;
>    *nunits_vectype_out = NULL_TREE;
> +  *unsupported_datatype = false;
>
>    if (gimple_get_lhs (stmt) == NULL_TREE
>        /* Allow vector conditionals through here.  */
> @@ -14907,10 +14994,13 @@ vect_get_vector_types_for_stmt (vec_info *vinfo, stmt_vec_info stmt_info,
>         }
>        vectype = get_vectype_for_scalar_type (vinfo, scalar_type, group_size);
>        if (!vectype)
> -       return opt_result::failure_at (stmt,
> -                                      "not vectorized:"
> -                                      " unsupported data-type %T\n",
> -                                      scalar_type);
> +       {
> +         *unsupported_datatype = true;
> +         return opt_result::failure_at (stmt,
> +                                        "not vectorized:"
> +                                        " unsupported data-type %T\n",
> +                                        scalar_type);
> +       }
>
>        if (dump_enabled_p ())
>         dump_printf_loc (MSG_NOTE, vect_location, "vectype: %T\n", vectype);
> diff --git a/gcc/tree-vectorizer.h b/gcc/tree-vectorizer.h
> index 62f6ad320f0..1b9103b6f5f 100644
> --- a/gcc/tree-vectorizer.h
> +++ b/gcc/tree-vectorizer.h
> @@ -2353,6 +2353,8 @@ vect_get_num_copies (vec_info *vinfo, slp_tree node)
>
>    vf *= SLP_TREE_LANES (node);
>    tree vectype = SLP_TREE_VECTYPE (node);
> +  if (known_ge (TYPE_VECTOR_SUBPARTS (vectype), vf))
> +    return 1;
>
>    return vect_get_num_vectors (vf, vectype);
>  }
> @@ -2621,9 +2623,9 @@ extern tree vect_gen_while (gimple_seq *, tree, tree, tree,
>                             const char * = nullptr);
>  extern void vect_gen_while_ssa_name (gimple_seq *, tree, tree, tree, tree);
>  extern tree vect_gen_while_not (gimple_seq *, tree, tree, tree);
> -extern opt_result vect_get_vector_types_for_stmt (vec_info *,
> -                                                 stmt_vec_info, tree *,
> -                                                 tree *, unsigned int = 0);
> +extern opt_result vect_get_vector_types_for_stmt (vec_info *, stmt_vec_info,
> +                                                 tree *, tree *,
> +                                                 bool *, unsigned int = 0);
>  extern opt_tree vect_get_mask_type_for_stmt (stmt_vec_info, unsigned int = 0);
>
>  /* In tree-if-conv.cc.  */
> @@ -2956,9 +2958,8 @@ vect_can_use_partial_vectors_p (vec_info *vinfo, slp_tree slp_node)
>    loop_vec_info loop_vinfo = dyn_cast<loop_vec_info> (vinfo);
>    if (loop_vinfo)
>      return LOOP_VINFO_CAN_USE_PARTIAL_VECTORS_P (loop_vinfo);
> -
> -  (void) slp_node; /* FORNOW */
> -  return false;
> +  else
> +    return SLP_TREE_CAN_USE_PARTIAL_VECTORS_P (slp_node);
>  }
>
>  /* If VINFO is vectorizer state for loop vectorization then record that we no
> --
> 2.43.0
>
