On Wed, Jun 3, 2026 at 5:21 PM Christopher Bazley <[email protected]> wrote:
>
> This enables use of a predicate mask or length limit for
> vectorization of basic blocks in cases where previously only the
> equivalent rolled (i.e. loop) form of some source code would have
> been vectorized. Predication is used for groups whose size
> is not neatly divisible into vectors of lengths that can be
> supported directly by the target.
>
> The initial vector mode for an SLP region is "autodetected" by calling
> aarch64_preferred_simd_mode, which prefers SVE modes (e.g. VNx4SI for
> int) if supported, unless configured otherwise. If at least one
> profitable subgraph can be scheduled then GCC does not try to vectorise
> the region using any other modes, even though their estimated costs
> might otherwise have been lower.
>
> For example, if analysis of a 24-byte group succeeds with vector mode
> V16QI (using types vector(16) and vector(8) char) then the estimated
> cost of the vectorised code is 11+11=22. If analysis of the same group
> succeeds with vector mode VNx16QI (using type vector([16,16]) char for
> both subtrees) then the estimated cost is 15+15=30. In both cases, the
> estimated vectorised cost would beat the estimated scalar cost of
> 96+48=144, so vector([16,16]) wins because VNx16QI is tried first.
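
To make the cost arithmetic above concrete, here is a minimal sketch of "first profitable mode wins" versus "cheapest profitable mode wins". It is not GCC code: Candidate, first_profitable and cheapest_profitable are hypothetical names, and the only inputs are the costs quoted in the cover letter.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical record: one candidate vector mode and its estimated cost.
struct Candidate
{
  std::string mode;
  unsigned cost;
};

// Index of the first candidate (in try order) that beats the scalar cost,
// or -1 if none does.  This models stopping at the first profitable mode.
inline int
first_profitable (const std::vector<Candidate> &in_try_order,
                  unsigned scalar_cost)
{
  for (unsigned i = 0; i < in_try_order.size (); ++i)
    if (in_try_order[i].cost < scalar_cost)
      return static_cast<int> (i);
  return -1;
}

// Index of the cheapest candidate that beats the scalar cost, or -1.
// Shown only for comparison; the cover letter says GCC does not do this.
inline int
cheapest_profitable (const std::vector<Candidate> &candidates,
                     unsigned scalar_cost)
{
  int best = -1;
  for (unsigned i = 0; i < candidates.size (); ++i)
    if (candidates[i].cost < scalar_cost
        && (best < 0 || candidates[i].cost < candidates[best].cost))
      best = static_cast<int> (i);
  return best;
}

// The 24-byte example: VNx16QI is tried first and costs 15+15,
// V16QI costs 11+11, and the scalar code costs 96+48.
inline void
check_cover_letter_example ()
{
  const std::vector<Candidate> tried
    = { { "VNx16QI", 15 + 15 }, { "V16QI", 11 + 11 } };
  const unsigned scalar_cost = 96 + 48;
  assert (tried[first_profitable (tried, scalar_cost)].mode == "VNx16QI");
  assert (tried[cheapest_profitable (tried, scalar_cost)].mode == "V16QI");
}
```

With these numbers both modes are profitable, so the try order alone decides the outcome.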
>
> This is mitigated by the fact that a sequence of GIMPLE stmts such as:
>
> vectp.14_86 = x_50(D) + 16;
> slp_mask_87 = .WHILE_ULT (0, 8, { 0, ... });
> .MASK_STORE (vectp.14_86, 8B, slp_mask_87, vect__34.12_85);
>
> is lowered to a fixed-length vector store (e.g., str d30, [x0, 16]) if
> possible, instead of a more literal interpretation such as:
>
> add x0, x0, 16
> ptrue p7.b, vl7
> st1b z30.b, p7, [x0]
>
> The vect_record_nunits function used during building of an SLP
> tree is updated to prevent it returning failure for BB SLP if the
> group size is not an integral multiple of the number of lanes in the
> vector type; it now allows such cases if the vector type might be more
> than long enough.
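
For reference while reading the corresponding hunk, this is how I understand the changed condition when restricted to fixed-length vectors. It is only a sketch under that assumption: old_bb_reject and new_bb_reject are hypothetical names, and the real code uses maybe_gt and multiple_p on poly_uint64 values rather than plain integers.

```cpp
#include <cassert>

// Pre-patch test (fixed-length only): reject any group size that is not
// a multiple of the number of lanes in the vector type.
constexpr bool
old_bb_reject (unsigned group_size, unsigned nunits)
{
  return group_size % nunits != 0;
}

// Post-patch test (fixed-length only): reject only if the group is also
// larger than the vector, so a short group can use a partial vector.
constexpr bool
new_bb_reject (unsigned group_size, unsigned nunits)
{
  return group_size > nunits && group_size % nunits != 0;
}

// A group of 3 in a 4-lane vector was rejected and is now allowed.
static_assert (old_bb_reject (3, 4) && !new_bb_reject (3, 4), "");
// A group of 7 with 4-lane vectors is rejected by both tests.
static_assert (old_bb_reject (7, 4) && new_bb_reject (7, 4), "");
// An exact multiple is accepted by both tests.
static_assert (!old_bb_reject (8, 4) && !new_bb_reject (8, 4), "");
```

The 24-byte group from the earlier example with a 16-lane vector is still rejected under the new test, which is consistent with it being handled as two subtrees.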
>
> Instead of giving up if vect_get_vector_types_for_stmt
> fails for the specified group size, vect_build_slp_tree_1
> now calls vect_get_vector_types_for_stmt again without
> a group size (which defaults to 0) as a fallback.
> If this succeeds then the initial failure is treated as a
> 'soft' failure that results in the group being split.
> Consequently, assertions that "For BB vectorization, we
> should always have a group size once we've constructed the
> SLP tree" were deleted in get_vectype_for_scalar_type and
> vect_get_vector_types_for_stmt.
>
> For BB SLP, vect_analyze_slp_instance previously gave up after
> building an SLP tree if it could not prove that the group size was
> at least the maximum lane count across all of the vector types in
> the SLP tree (which is unprovable for scalable vector types), or
> attempted to split the group if it could prove that the group size
> was greater than this maximum but not exactly divisible by it
> (which is also unprovable for scalable vector types).
>
> This function will now provisionally create a new SLP instance if the
> group size definitely does not exceed the minimum number of lanes,
> even if the group size otherwise satisfies conditions that would
> require a loop to be unrolled (e.g., a group of size 3 that uses a
> mixture of V4SI and V8HI types). If the group size lies between the
> minimum and maximum number of lanes then vectorization is still
> abandoned (e.g., a group of size 3 that uses a mixture of
> V2DI and V4SI types).
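
A small sketch of the decision as I read the two paragraphs above, for fixed-length lane counts only. BBDecision and classify_group are hypothetical names, treating "between" as strictly between is my assumption, and every case the description does not cover is lumped into ExistingRules.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

enum class BBDecision
{
  CreateInstance, // group fits in the shortest vector type
  Abandon,        // group lies between the shortest and longest type
  ExistingRules   // anything else is left to the pre-existing logic
};

// lanes_per_type holds the lane count of each vector type used in the
// SLP tree and must not be empty.
inline BBDecision
classify_group (unsigned group_size,
                const std::vector<unsigned> &lanes_per_type)
{
  const unsigned min_lanes
    = *std::min_element (lanes_per_type.begin (), lanes_per_type.end ());
  const unsigned max_lanes
    = *std::max_element (lanes_per_type.begin (), lanes_per_type.end ());
  if (group_size <= min_lanes)
    return BBDecision::CreateInstance;
  if (group_size < max_lanes)
    return BBDecision::Abandon;
  return BBDecision::ExistingRules;
}

inline void
check_described_examples ()
{
  // Group of 3 using V4SI (4 lanes) and V8HI (8 lanes).
  assert (classify_group (3, { 4, 8 }) == BBDecision::CreateInstance);
  // Group of 3 using V2DI (2 lanes) and V4SI (4 lanes).
  assert (classify_group (3, { 2, 4 }) == BBDecision::Abandon);
}
```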
>
> With BB SLP, there is no need for agreement between different SLP
> nodes about whether to use masks or lengths to support partial vectors.
> Instead, that decision is made early and per individual SLP node, by
> vect_analyze_stmt. If a partial vector is required (i.e. if the number
> of subparts in the vector type may be greater than the number of active
> lanes for the node) then vect_analyze_stmt now requires
> SLP_TREE_CAN_USE_PARTIAL_VECTORS_P to be true; otherwise it clears any
> SLP_TREE_PARTIAL_VECTORS_STYLE that could have been set.
>
> The vect_get_num_copies function used during statement analysis
> is updated to return early with 1 if a vector type is long enough for
> the specified SLP tree node. This avoids an ICE in vect_get_num_vectors,
> which cannot cope with SVE vector types.
>
> When checking whether a value that is used outside the vectorized
> region can be supported, the vectorizable_live_operation function
> calculates which vector contains the result, and which lane of that
> vector we need. Previously, this calculation gave the wrong answer
> for BB SLP with a variable-length vector type (eventually generating
> invalid offsets such as BIT_FIELD_REF <_251, 32, POLY_INT_CST
> [96, 128]> to access the third element of a group using type VNx4SI)
> because it reused logic intended for loop vectorization, which selects
> the 'last' occurrence of a scalar index relative to the group size
> (which is a multiple of the vector length). For BB SLP with a
> predicate mask, only the first SLP_TREE_LANES elements are well
> defined.
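
As a sanity check of the invalid offset quoted above, the following sketch redoes the arithmetic with a toy two-coefficient polynomial. Poly, loop_style_pos and bit_offset are hypothetical stand-ins for poly_uint64 and the real calculation, and the group size of three is my assumption (it is the value that reproduces the quoted offset).

```cpp
#include <cassert>

// Toy stand-in for a poly_uint64: value = c0 + c1 * x, where x >= 0 is
// only known at run time.
struct Poly
{
  long c0;
  long c1;
};

// The loop-vectorization formula: (num_vec * nunits) - num_scalar + slp_index.
constexpr Poly
loop_style_pos (Poly nunits, long num_vec, long num_scalar, long slp_index)
{
  return Poly{ num_vec * nunits.c0 - num_scalar + slp_index,
               num_vec * nunits.c1 };
}

// Convert a lane position into a bit offset for a given element width.
constexpr Poly
bit_offset (Poly pos, long element_bits)
{
  return Poly{ pos.c0 * element_bits, pos.c1 * element_bits };
}

// VNx4SI has [4, 4] lanes of 32 bits.  Third element (slp_index 2) of a
// three-lane group held in one vector.
constexpr Poly vnx4si_lanes = { 4, 4 };
constexpr Poly old_pos = loop_style_pos (vnx4si_lanes, 1, 3, 2);
constexpr Poly old_bits = bit_offset (old_pos, 32);
static_assert (old_bits.c0 == 96 && old_bits.c1 == 128, "");

// With the BB SLP path the position is just slp_index, so the offset is
// a compile-time constant within the first SLP_TREE_LANES elements.
constexpr Poly new_bits = bit_offset (Poly{ 2, 0 }, 32);
static_assert (new_bits.c0 == 64 && new_bits.c1 == 0, "");
```

The old formula yields [96, 128], matching the POLY_INT_CST in the description, whereas using slp_index directly gives a constant 64-bit offset.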
>
> vect_create_vectorized_promotion_stmts no longer pushes
> more stmts than implied by vect_get_num_copies because it could
> previously overrun the number of slots allocated for an SLP node
> (based on its number of lanes and type). For example, four defs were
> pushed for a promotion of V8HI to V2DI (8/2=4) even if only two
> lanes of the V8HI were active. Allowing it later caused an ICE in
> vectorizable_operation for a parent node, because binary ops
> require both operands to be the same length.
>
> Since promotion no longer produces redundant definitions,
> vectorizable_conversion also had to be modified so that demotion no
> longer relies on an even number of defs being produced. If
> necessary, it now pushes a single constant zero def.
>
> The whole change is enabled by wiring the wrapper function
> vect_can_use_partial_vectors_p to SLP_TREE_CAN_USE_PARTIAL_VECTORS_P
> when invoked for BB SLP vectorization.
>
> Update test expectations for gcc.dg/vect/vect-over-widen-*.c,
> gcc.target/aarch64/sve/slp_6.c and
> gcc.target/aarch64/sve/vec_construct_*.c.
>
> The vec_construct_*.c tests previously expected their output
> to use Advanced SIMD instead of SVE despite their use of
> vector length agnostic types such as svint16_t and despite
> the fact that they are in the aarch64/sve directory. Since
> BB SLP can now vectorize these tests using VLA types such
> as 'vector([8,8]) char', and because (with one exception) the
> resultant code is deemed profitable relative to scalar code,
> GCC no longer considers vectorizing using non-VLA types such
> as 'vector(8) char' (although the estimated cost with non-VLA
> types might have been lower, had it been calculated).
> Instruction selection is not the focus of these tests, so
> I updated them to expect SVE instead (e.g. st1b instead of str)
> and added --param=aarch64-autovec-preference=sve-only to reduce
> future churn.
>
> Because the cost model takes into account predicate mask
> generation for BB SLP with VLA types, the threshold at which
> vectorized code wins against scalar code is higher than
> before. The number of elements stored by vec_construct_3.c was
> increased just enough to allow for that.
Some (possibly incomplete) review pieces below.
> gcc/ChangeLog:
>
> * tree-vect-loop.cc (vectorizable_live_operation): Simplify the
> calculation of the index of the final result to avoid
> generating invalid polynomial offsets relative to the end of
> variable-length vector types, which is what happens if the code
> for loop vectorization is reused for basic block SLP.
> * tree-vect-slp.cc (vect_record_nunits): Allow group sizes that
> are indivisible by the vector length.
> (vect_build_slp_tree_1): In case of failure of
> vect_get_vector_types_for_stmt, try to get fallback vector
> types and continue analysis to allow splitting of groups.
> (vect_build_slp_tree_2): Don't call
> can_duplicate_and_interleave_p when doing basic block SLP
> vectorization.
> (vect_update_slp_min_nunits_for_node): New recursive function.
> Update min_nunits to reflect the minimum number of subparts for
> all of the vector types used by an SLP subgraph.
> (vect_slp_tree_min_nunits): New function. Initialize min_nunits
> then call vect_update_slp_min_nunits_for_node.
> (vect_analyze_slp_instance): For BB SLP vectorization, create
> a new SLP instance if the group size definitely does not exceed
> the minimum number of subparts for all of the vector types used
> in the SLP tree, even if the group size otherwise satisfies
> conditions that would require a loop to be unrolled.
> (vectorizable_slp_permutation_1): Instead of asserting that an
> SLP tree node's number of lanes is compatible with the chosen
> vector width, return a failure indication if incompatible.
> * tree-vect-stmts.cc (check_load_store_for_partial_vectors):
> When calculating the number of vectors, get the group size from
> SLP_TREE_LANES instead of a parameter (e.g., DR_GROUP_SIZE) if
> doing BB SLP vectorization. Don't assume it can be divided by
> the number of subparts in the vector type to get a compile-time
> constant.
> (vect_get_data_ptr_increment): Require a parameter of type
> loop_vec_info instead of vec_info *.
> (vect_create_vectorized_promotion_stmts): Require an SLP tree
> node to be passed by the caller, for use by
> vect_get_num_copies.
> Stop pushing more stmts than implied by vect_get_num_copies.
> (vectorizable_conversion): Pass SLP tree node to
> vect_create_vectorized_promotion_stmts.
> Demotion no longer relies on an even number of definitions
> being produced by promotion. If necessary, push a single constant
> zero definition.
> (vectorizable_load): Pass loop_vec_info instead of vec_info *
> when calling vect_get_data_ptr_increment.
> (vect_analyze_stmt): For BB SLP vectorization, check whether
> the group needs partial vectors. If it does then return a
> failure indication if SLP_TREE_CAN_USE_PARTIAL_VECTORS_P was
> cleared by a callee of this function; if it doesn't need
> partial vectors then clear any partial vectors style that might
> have been chosen by callees of this function.
> (get_vectype_for_scalar_type): For BB SLP vectorization, allow
> invocation of this function with a group size of zero even if
> one or more SLP instances have been created.
> If the number of subparts in the natural choice of vector type
> could be greater than the group size then pick a shorter vector
> type only if the target does not support partial vectors.
> (vect_maybe_update_slp_op_vectype): Reject external definitions
> that have a number of lanes not divisible by the number of
> subparts in a vector type naively inferred from the scalar
> type.
> (vect_get_vector_types_for_stmt): Add a new output parameter of
> Boolean type. Set it to true if the statement can't be
> vectorized because it uses a data type that the target doesn't
> support in vector form for a group of the given size, otherwise
> false.
> * tree-vectorizer.h (vect_get_num_copies): Return early with 1
> if a vector type is long enough for the specified SLP tree
> node to avoid an ICE in vect_get_num_vectors.
> (vect_get_vector_types_for_stmt): Update function declaration.
> (vect_can_use_partial_vectors_p): Handle the BB SLP use-case by
> returning the result of SLP_TREE_CAN_USE_PARTIAL_VECTORS_P.
>
> gcc/testsuite/ChangeLog:
>
> * gcc.dg/vect/vect-over-widen-10.c: Update test expectations to
> avoid spurious matching of scan-tree-dump-not pattern.
> * gcc.dg/vect/vect-over-widen-13.c: As above.
> * gcc.dg/vect/vect-over-widen-14.c: As above.
> * gcc.dg/vect/vect-over-widen-17.c: As above.
> * gcc.dg/vect/vect-over-widen-18.c: As above.
> * gcc.dg/vect/vect-over-widen-5.c: As above.
> * gcc.dg/vect/vect-over-widen-6.c: As above.
> * gcc.dg/vect/vect-over-widen-7.c: As above.
> * gcc.dg/vect/vect-over-widen-8.c: As above.
> * gcc.dg/vect/vect-over-widen-9.c: As above.
> * gcc.target/aarch64/sve/vec_construct_1.c:
> Expect SVE instead of ASIMD instructions and add
> --param=aarch64-autovec-preference=sve-only to stop
> flip-flopping.
> * gcc.target/aarch64/sve/vec_construct_1.c: As above.
> * gcc.target/aarch64/sve/vec_construct_2.c: As above.
> * gcc.target/aarch64/sve/vec_construct_3.c:
> Expect SVE instead of ASIMD instructions and add
> --param=aarch64-autovec-preference=sve-only to avoid
> flip-flopping. Increase the number of elements
> stored to ensure vectorization using SVE is deemed
> profitable despite predicate mask costs.
> * gcc.target/aarch64/sve/vec_construct_4.c:
> Expect SVE instead of ASIMD instructions and add
> --param=aarch64-autovec-preference=sve-only to stop
> flip-flopping.
> * gcc.target/aarch64/sve/vec_construct_5.c: As above.
> ---
> .../gcc.dg/vect/vect-over-widen-10.c | 2 +-
> .../gcc.dg/vect/vect-over-widen-13.c | 2 +-
> .../gcc.dg/vect/vect-over-widen-14.c | 2 +-
> .../gcc.dg/vect/vect-over-widen-17.c | 2 +-
> .../gcc.dg/vect/vect-over-widen-18.c | 2 +-
> gcc/testsuite/gcc.dg/vect/vect-over-widen-5.c | 2 +-
> gcc/testsuite/gcc.dg/vect/vect-over-widen-6.c | 2 +-
> gcc/testsuite/gcc.dg/vect/vect-over-widen-7.c | 2 +-
> gcc/testsuite/gcc.dg/vect/vect-over-widen-8.c | 2 +-
> gcc/testsuite/gcc.dg/vect/vect-over-widen-9.c | 2 +-
> gcc/testsuite/gcc.target/aarch64/sve/slp_6.c | 3 -
> .../gcc.target/aarch64/sve/vec_construct_1.c | 6 +-
> .../gcc.target/aarch64/sve/vec_construct_2.c | 4 +-
> .../gcc.target/aarch64/sve/vec_construct_3.c | 20 +-
> .../gcc.target/aarch64/sve/vec_construct_4.c | 5 +-
> .../gcc.target/aarch64/sve/vec_construct_5.c | 6 +-
> gcc/tree-vect-loop.cc | 14 +-
> gcc/tree-vect-slp.cc | 141 ++++++++++--
> gcc/tree-vect-stmts.cc | 202 +++++++++++++-----
> gcc/tree-vectorizer.h | 13 +-
> 20 files changed, 317 insertions(+), 117 deletions(-)
>
> diff --git a/gcc/testsuite/gcc.dg/vect/vect-over-widen-10.c
> b/gcc/testsuite/gcc.dg/vect/vect-over-widen-10.c
> index f0140e4ef6d..6efcf739db9 100644
> --- a/gcc/testsuite/gcc.dg/vect/vect-over-widen-10.c
> +++ b/gcc/testsuite/gcc.dg/vect/vect-over-widen-10.c
> @@ -16,5 +16,5 @@
> /* { dg-final { scan-tree-dump {vect_recog_over_widening_pattern:
> detected:[^\n]* >> 1} "vect" } } */
> /* { dg-final { scan-tree-dump {vect_recog_over_widening_pattern:
> detected:[^\n]* >> 2} "vect" } } */
> /* { dg-final { scan-tree-dump {vect_recog_cast_forwprop_pattern:
> detected:[^\n]* \(unsigned char\)} "vect" } } */
> -/* { dg-final { scan-tree-dump-not {vector[^ ]* int} "vect" } } */
> +/* { dg-final { scan-tree-dump-not {vector[^ ]* int vect__} "vect" } } */
> /* { dg-final { scan-tree-dump-times "vectorized 1 loop" 1 "vect" } } */
> diff --git a/gcc/testsuite/gcc.dg/vect/vect-over-widen-13.c
> b/gcc/testsuite/gcc.dg/vect/vect-over-widen-13.c
> index 08a65ea5518..720353716cf 100644
> --- a/gcc/testsuite/gcc.dg/vect/vect-over-widen-13.c
> +++ b/gcc/testsuite/gcc.dg/vect/vect-over-widen-13.c
> @@ -48,5 +48,5 @@ main (void)
> /* { dg-final { scan-tree-dump {vect_recog_over_widening_pattern:
> detected:[^\n]* \+} "vect" } } */
> /* { dg-final { scan-tree-dump {vect_recog_over_widening_pattern:
> detected:[^\n]* / 2} "vect" } } */
> /* { dg-final { scan-tree-dump {vect_recog_cast_forwprop_pattern:
> detected:[^\n]* = \(signed char\)} "vect" } } */
> -/* { dg-final { scan-tree-dump-not {vector[^ ]* int} "vect" } } */
> +/* { dg-final { scan-tree-dump-not {vector[^ ]* int vect__} "vect" } } */
> /* { dg-final { scan-tree-dump-times "vectorized 1 loop" 1 "vect" } } */
> diff --git a/gcc/testsuite/gcc.dg/vect/vect-over-widen-14.c
> b/gcc/testsuite/gcc.dg/vect/vect-over-widen-14.c
> index dfa09f5d2ca..f1d5f95c543 100644
> --- a/gcc/testsuite/gcc.dg/vect/vect-over-widen-14.c
> +++ b/gcc/testsuite/gcc.dg/vect/vect-over-widen-14.c
> @@ -15,5 +15,5 @@
> /* { dg-final { scan-tree-dump {vect_recog_over_widening_pattern:
> detected:[^\n]* \+} "vect" } } */
> /* { dg-final { scan-tree-dump {vect_recog_over_widening_pattern:
> detected:[^\n]* >> 1} "vect" } } */
> /* { dg-final { scan-tree-dump {vect_recog_cast_forwprop_pattern:
> detected:[^\n]* = \(unsigned char\)} "vect" } } */
> -/* { dg-final { scan-tree-dump-not {vector[^ ]* int} "vect" } } */
> +/* { dg-final { scan-tree-dump-not {vector[^ ]* int vect__} "vect" } } */
> /* { dg-final { scan-tree-dump-times "vectorized 1 loop" 1 "vect" } } */
> diff --git a/gcc/testsuite/gcc.dg/vect/vect-over-widen-17.c
> b/gcc/testsuite/gcc.dg/vect/vect-over-widen-17.c
> index 53fcfd0c06c..ac1a0f86727 100644
> --- a/gcc/testsuite/gcc.dg/vect/vect-over-widen-17.c
> +++ b/gcc/testsuite/gcc.dg/vect/vect-over-widen-17.c
> @@ -46,5 +46,5 @@ main (void)
> adopts realign_load scheme. It requires rs6000_builtin_mask_for_load to
> generate mask whose return type is vector char. */
> /* { dg-final { scan-tree-dump-not {vector[^\n]*char} "vect" { target
> vect_hw_misalign } } } */
> -/* { dg-final { scan-tree-dump-not {vector[^ ]* int} "vect" } } */
> +/* { dg-final { scan-tree-dump-not {vector[^ ]* int vect__} "vect" } } */
> /* { dg-final { scan-tree-dump-times "vectorized 1 loop" 1 "vect" } } */
> diff --git a/gcc/testsuite/gcc.dg/vect/vect-over-widen-18.c
> b/gcc/testsuite/gcc.dg/vect/vect-over-widen-18.c
> index aa58cd1c957..3ebfaa78270 100644
> --- a/gcc/testsuite/gcc.dg/vect/vect-over-widen-18.c
> +++ b/gcc/testsuite/gcc.dg/vect/vect-over-widen-18.c
> @@ -47,5 +47,5 @@ main (void)
> /* { dg-final { scan-tree-dump {vect_recog_over_widening_pattern:
> detected:[^\n]* |} "vect" } } */
> /* { dg-final { scan-tree-dump {vect_recog_over_widening_pattern:
> detected:[^\n]* <<} "vect" } } */
> /* { dg-final { scan-tree-dump {vector[^\n]*char} "vect" } } */
> -/* { dg-final { scan-tree-dump-not {vector[^ ]* int} "vect" } } */
> +/* { dg-final { scan-tree-dump-not {vector[^ ]* int vect__} "vect" } } */
> /* { dg-final { scan-tree-dump-times "vectorized 1 loop" 1 "vect" } } */
> diff --git a/gcc/testsuite/gcc.dg/vect/vect-over-widen-5.c
> b/gcc/testsuite/gcc.dg/vect/vect-over-widen-5.c
> index c2ab11a9d32..1d89789a86d 100644
> --- a/gcc/testsuite/gcc.dg/vect/vect-over-widen-5.c
> +++ b/gcc/testsuite/gcc.dg/vect/vect-over-widen-5.c
> @@ -49,5 +49,5 @@ main (void)
> /* { dg-final { scan-tree-dump {vect_recog_over_widening_pattern:
> detected:[^\n]* \+ } "vect" } } */
> /* { dg-final { scan-tree-dump {vect_recog_over_widening_pattern:
> detected:[^\n]* >> 1} "vect" } } */
> /* { dg-final { scan-tree-dump {vect_recog_cast_forwprop_pattern:
> detected:[^\n]* \(signed char\)} "vect" } } */
> -/* { dg-final { scan-tree-dump-not {vector[^ ]* int} "vect" } } */
> +/* { dg-final { scan-tree-dump-not {vector[^ ]* int vect__} "vect" } } */
> /* { dg-final { scan-tree-dump-times "vectorized 1 loop" 1 "vect" } } */
> diff --git a/gcc/testsuite/gcc.dg/vect/vect-over-widen-6.c
> b/gcc/testsuite/gcc.dg/vect/vect-over-widen-6.c
> index bda92c965e0..62d5a52587e 100644
> --- a/gcc/testsuite/gcc.dg/vect/vect-over-widen-6.c
> +++ b/gcc/testsuite/gcc.dg/vect/vect-over-widen-6.c
> @@ -13,5 +13,5 @@
> /* { dg-final { scan-tree-dump {vect_recog_over_widening_pattern:
> detected:[^\n]* \+ } "vect" } } */
> /* { dg-final { scan-tree-dump {vect_recog_over_widening_pattern:
> detected:[^\n]* >> 1} "vect" } } */
> /* { dg-final { scan-tree-dump {vect_recog_cast_forwprop_pattern:
> detected:[^\n]* \(unsigned char\)} "vect" } } */
> -/* { dg-final { scan-tree-dump-not {vector[^ ]* int} "vect" } } */
> +/* { dg-final { scan-tree-dump-not {vector[^ ]* int vect__} "vect" } } */
> /* { dg-final { scan-tree-dump-times "vectorized 1 loop" 1 "vect" } } */
> diff --git a/gcc/testsuite/gcc.dg/vect/vect-over-widen-7.c
> b/gcc/testsuite/gcc.dg/vect/vect-over-widen-7.c
> index 1d55e13fb1f..6e09631009a 100644
> --- a/gcc/testsuite/gcc.dg/vect/vect-over-widen-7.c
> +++ b/gcc/testsuite/gcc.dg/vect/vect-over-widen-7.c
> @@ -51,5 +51,5 @@ main (void)
> /* { dg-final { scan-tree-dump {vect_recog_over_widening_pattern:
> detected:[^\n]* \+ } "vect" } } */
> /* { dg-final { scan-tree-dump {vect_recog_over_widening_pattern:
> detected:[^\n]* >> 2} "vect" } } */
> /* { dg-final { scan-tree-dump {vect_recog_cast_forwprop_pattern:
> detected:[^\n]* \(signed char\)} "vect" } } */
> -/* { dg-final { scan-tree-dump-not {vector[^ ]* int} "vect" } } */
> +/* { dg-final { scan-tree-dump-not {vector[^ ]* int vect__} "vect" } } */
> /* { dg-final { scan-tree-dump-times "vectorized 1 loop" 1 "vect" } } */
> diff --git a/gcc/testsuite/gcc.dg/vect/vect-over-widen-8.c
> b/gcc/testsuite/gcc.dg/vect/vect-over-widen-8.c
> index 553c0712a79..b6d650beab4 100644
> --- a/gcc/testsuite/gcc.dg/vect/vect-over-widen-8.c
> +++ b/gcc/testsuite/gcc.dg/vect/vect-over-widen-8.c
> @@ -16,5 +16,5 @@
> /* { dg-final { scan-tree-dump {vect_recog_over_widening_pattern:
> detected:[^\n]* \+ } "vect" } } */
> /* { dg-final { scan-tree-dump {vect_recog_over_widening_pattern:
> detected:[^\n]* >> 2} "vect" } } */
> /* { dg-final { scan-tree-dump {vect_recog_cast_forwprop_pattern:
> detected:[^\n]* \(unsigned char\)} "vect" } } */
> -/* { dg-final { scan-tree-dump-not {vector[^ ]* int} "vect" } } */
> +/* { dg-final { scan-tree-dump-not {vector[^ ]* int vect__} "vect" } } */
> /* { dg-final { scan-tree-dump-times "vectorized 1 loop" 1 "vect" } } */
> diff --git a/gcc/testsuite/gcc.dg/vect/vect-over-widen-9.c
> b/gcc/testsuite/gcc.dg/vect/vect-over-widen-9.c
> index 36bfc68e053..e82f8a571da 100644
> --- a/gcc/testsuite/gcc.dg/vect/vect-over-widen-9.c
> +++ b/gcc/testsuite/gcc.dg/vect/vect-over-widen-9.c
> @@ -56,5 +56,5 @@ main (void)
> /* { dg-final { scan-tree-dump {vect_recog_over_widening_pattern:
> detected:[^\n]* >> 1} "vect" } } */
> /* { dg-final { scan-tree-dump {vect_recog_over_widening_pattern:
> detected:[^\n]* >> 2} "vect" } } */
> /* { dg-final { scan-tree-dump {vect_recog_cast_forwprop_pattern:
> detected:[^\n]* \(signed char\)} "vect" } } */
> -/* { dg-final { scan-tree-dump-not {vector[^ ]* int} "vect" } } */
> +/* { dg-final { scan-tree-dump-not {vector[^ ]* int vect__} "vect" } } */
> /* { dg-final { scan-tree-dump-times "vectorized 1 loop" 1 "vect" } } */
> diff --git a/gcc/testsuite/gcc.target/aarch64/sve/slp_6.c
> b/gcc/testsuite/gcc.target/aarch64/sve/slp_6.c
> index 44d128477d2..1c9ac15a699 100644
> --- a/gcc/testsuite/gcc.target/aarch64/sve/slp_6.c
> +++ b/gcc/testsuite/gcc.target/aarch64/sve/slp_6.c
> @@ -37,9 +37,6 @@ vec_slp_##TYPE (TYPE *restrict a, TYPE *restrict b, int n)
> \
> TEST_ALL (VEC_PERM)
>
> /* These loops can't use SLP. */
> -/* { dg-final { scan-assembler-not {\tld1b\t} } } */
> -/* { dg-final { scan-assembler-not {\tld1h\t} } } */
> -/* { dg-final { scan-assembler-not {\tld1w\t} } } */
> /* { dg-final { scan-assembler-not {\tld1d\t} } } */
> /* { dg-final { scan-assembler {\tld3b\t} } } */
> /* { dg-final { scan-assembler {\tld3h\t} } } */
> diff --git a/gcc/testsuite/gcc.target/aarch64/sve/vec_construct_1.c
> b/gcc/testsuite/gcc.target/aarch64/sve/vec_construct_1.c
> index 2f8ce6808a9..eea13c28e49 100644
> --- a/gcc/testsuite/gcc.target/aarch64/sve/vec_construct_1.c
> +++ b/gcc/testsuite/gcc.target/aarch64/sve/vec_construct_1.c
> @@ -1,5 +1,5 @@
> /* { dg-do compile } */
> -/* { dg-options "-O2 -ftree-slp-vectorize" } */
> +/* { dg-options "-O2 -ftree-slp-vectorize
> --param=aarch64-autovec-preference=sve-only" } */
>
> /* Test that a group of stores of 8 elements derived from a horizontal
> reduction is vectorized by constructing a vector and storing it.
> @@ -30,8 +30,8 @@ foo (svint8_t src0, svint8_t src1, svint8_t src2, svint8_t
> src3, svint8_t src4,
> s.h = svaddv_s8 (all, src7);
> }
>
> -/* { dg-final { scan-assembler-times {\tins\tv[0-9]+\.b\[[0-9]+\],
> v[0-9]+\.b\[[0-9]+\]\n} 7 } } */
> -/* { dg-final { scan-assembler-times {\tstr\td[0-9]+, } 1 } } */
> +/* { dg-final { scan-assembler-times {\tinsr\tz[0-9]+\.h, h[0-9]+\n} 7 } } */
> +/* { dg-final { scan-assembler-times {\tst1b\tz[0-9]+\.h, p[0-9]+,
> \[x[0-9]+\]\n} 1 } } */
>
> /* { dg-final { scan-assembler-not {\tstr\tb[0-9]+, } } } */
> /* { dg-final { scan-assembler-not {\tstrb\tw[0-9]+, } } } */
> diff --git a/gcc/testsuite/gcc.target/aarch64/sve/vec_construct_2.c
> b/gcc/testsuite/gcc.target/aarch64/sve/vec_construct_2.c
> index 6715118d7b0..2bf537e13e2 100644
> --- a/gcc/testsuite/gcc.target/aarch64/sve/vec_construct_2.c
> +++ b/gcc/testsuite/gcc.target/aarch64/sve/vec_construct_2.c
> @@ -1,5 +1,5 @@
> /* { dg-do compile } */
> -/* { dg-options "-O2 -ftree-slp-vectorize" } */
> +/* { dg-options "-O2 -ftree-slp-vectorize
> --param=aarch64-autovec-preference=sve-only" } */
>
> /* Test that a group of stores of 8 elements derived from the results of
> calls
> to a function that has only vector parameters and returns a scalar result
> is
> @@ -40,3 +40,5 @@ foo (svint8_t src0, svint8_t src1, svint8_t src2, svint8_t
> src3, svint8_t src4,
>
> /* { dg-final { scan-assembler-not {\tins\tv[0-9]+\.b\[[0-9]+\], w[0-9]+\n}
> } } */
> /* { dg-final { scan-assembler-not {\tstr\td[0-9]+, } } } */
> +/* { dg-final { scan-assembler-not {\tfmov\th[0-9]+, h[0-9]+\n} } } */
> +/* { dg-final { scan-assembler-not {\tinsr\tz[0-9]+\.b, w[0-9]+\n} } } */
> diff --git a/gcc/testsuite/gcc.target/aarch64/sve/vec_construct_3.c
> b/gcc/testsuite/gcc.target/aarch64/sve/vec_construct_3.c
> index 8143d0050ad..ccadaccbcb4 100644
> --- a/gcc/testsuite/gcc.target/aarch64/sve/vec_construct_3.c
> +++ b/gcc/testsuite/gcc.target/aarch64/sve/vec_construct_3.c
> @@ -1,7 +1,7 @@
> /* { dg-do compile } */
> -/* { dg-options "-O2 -ftree-slp-vectorize" } */
> +/* { dg-options "-O2 -ftree-slp-vectorize
> --param=aarch64-autovec-preference=sve-only" } */
>
> -/* Test that a group of stores of 8 elements derived from a horizontal
> +/* Test that a group of stores of 14 elements derived from a horizontal
> reduction is vectorized by constructing a vector and storing it
> even if the results of the reductions are narrowed.
> Since there are no GPR-to-SIMD register transfers, there is no
> @@ -13,12 +13,14 @@
>
> struct S
> {
> - char a, b, c, d, e, f, g, h;
> + char a, b, c, d, e, f, g, h, i, j, k, l, m, n;
> } s;
>
> void
> foo (svint16_t src0, svint32_t src1, svint16_t src2, svint32_t src3,
> - svint32_t src4, svint16_t src5, svint32_t src6, svint16_t src7)
> + svint32_t src4, svint16_t src5, svint32_t src6, svint16_t src7,
> + svint16_t src8, svint32_t src9, svint16_t src10, svint32_t src11,
> + svint32_t src12, svint16_t src13)
> {
> svbool_t all16 = svptrue_b16 ();
> svbool_t all32 = svptrue_b32 ();
> @@ -30,10 +32,16 @@ foo (svint16_t src0, svint32_t src1, svint16_t src2,
> svint32_t src3,
> s.f = svminv_s16 (all16, src5);
> s.g = svlastb_s32 (svptrue_pat_b32 (SV_VL1), src6);
> s.h = svaddv_s16 (all16, src7);
> + s.i = svmaxv_s16 (all16, src8);
> + s.j = svminv_s32 (all32, src9);
> + s.k = svlastb_s16 (svptrue_pat_b16 (SV_VL1), src10);
> + s.l = svaddv_s32 (all32, src11);
> + s.m = svmaxv_s32 (all32, src12);
> + s.n = svminv_s16 (all16, src13);
> }
>
> -/* { dg-final { scan-assembler-times {\tins\tv[0-9]+\.b\[[0-9]+\],
> v[0-9]+\.b\[[0-9]+\]\n} 7 } } */
> -/* { dg-final { scan-assembler-times {\tstr\td[0-9]+, } 1 } } */
> +/* { dg-final { scan-assembler-times {\tinsr\tz[0-9]+\.b, b[0-9]+\n} 13 } }
> */
> +/* { dg-final { scan-assembler-times {\tst1b\tz[0-9]+\.b, p[0-9]+,
> \[x[0-9]\]\n} 1 } } */
>
> /* { dg-final { scan-assembler-not {\tstr\tb[0-9]+, } } } */
> /* { dg-final { scan-assembler-not {\tstrb\tw[0-9]+, } } } */
> diff --git a/gcc/testsuite/gcc.target/aarch64/sve/vec_construct_4.c
> b/gcc/testsuite/gcc.target/aarch64/sve/vec_construct_4.c
> index 49f8114b64c..3d41af684a3 100644
> --- a/gcc/testsuite/gcc.target/aarch64/sve/vec_construct_4.c
> +++ b/gcc/testsuite/gcc.target/aarch64/sve/vec_construct_4.c
> @@ -1,5 +1,5 @@
> /* { dg-do compile } */
> -/* { dg-options "-O2 -ftree-slp-vectorize" } */
> +/* { dg-options "-O2 -ftree-slp-vectorize
> --param=aarch64-autovec-preference=sve-only" } */
>
> /* Test that a group of stores of 8 elements derived from a horizontal
> reduction is not vectorized by constructing a vector and storing it
> @@ -33,5 +33,6 @@ foo (svint16_t src0, svint8_t src1, svint16_t src2,
> svint8_t src3,
> /* { dg-final { scan-assembler-times {\tstp\tw[0-9]+, w[0-9]+,} 4 } } */
>
> /* { dg-final { scan-assembler-not {\tins\tv[0-9]+\.s\[[0-9]+\], w[0-9]+\n}
> } } */
> -/* { dg-final { scan-assembler-not {\tfmov\ts[0-9]+, w[0-9]+\n} } }
> +/* { dg-final { scan-assembler-not {\tfmov\ts[0-9]+, w[0-9]+\n} } } */
> /* { dg-final { scan-assembler-not {\tstp\tq[0-9]+, q[0-9]+,} } } */
> +/* { dg-final { scan-assembler-not {\tinsr\tz[0-9]+.s, w[0-9]+\n} } } */
> diff --git a/gcc/testsuite/gcc.target/aarch64/sve/vec_construct_5.c
> b/gcc/testsuite/gcc.target/aarch64/sve/vec_construct_5.c
> index 983d6c69ebc..89e57406c0e 100644
> --- a/gcc/testsuite/gcc.target/aarch64/sve/vec_construct_5.c
> +++ b/gcc/testsuite/gcc.target/aarch64/sve/vec_construct_5.c
> @@ -1,5 +1,5 @@
> /* { dg-do compile } */
> -/* { dg-options "-O2 -ftree-slp-vectorize" } */
> +/* { dg-options "-O2 -ftree-slp-vectorize
> --param=aarch64-autovec-preference=sve-only" } */
>
> /* Test that a group of stores of 8 elements derived from lane extractions is
> vectorized by constructing a vector and storing it. Since there are no
> @@ -30,8 +30,8 @@ foo (svint8_t src0, svint8_t src1, svint8_t src2, svint8_t
> src3, svint8_t src4,
> s.h = svlastb_s8 (p, src7);
> }
>
> -/* { dg-final { scan-assembler-times {\tins\tv[0-9]+\.b\[[0-9]+\],
> v[0-9]+\.b\[[0-9]+\]\n} 7 } } */
> -/* { dg-final { scan-assembler-times {\tstr\td[0-9]+, } 1 } } */
> +/* { dg-final { scan-assembler-times {\tinsr\tz[0-9]+\.h, h[0-9]+\n} 7 } } */
> +/* { dg-final { scan-assembler-times {\tst1b\tz[0-9]+\.h, p[0-9]+,
> \[x[0-9]+\]\n} 1 } } */
>
> /* { dg-final { scan-assembler-not {\tstr\tb[0-9]+, } } } */
> /* { dg-final { scan-assembler-not {\tstrb\tw[0-9]+, } } } */
> diff --git a/gcc/tree-vect-loop.cc b/gcc/tree-vect-loop.cc
> index 6d602c67108..7503fd084cf 100644
> --- a/gcc/tree-vect-loop.cc
> +++ b/gcc/tree-vect-loop.cc
> @@ -10227,12 +10227,16 @@ vectorizable_live_operation (vec_info *vinfo,
> stmt_vec_info stmt_info,
>
> gcc_assert (slp_index >= 0);
>
> - /* Get the last occurrence of the scalar index from the concatenation of
> - all the slp vectors. Calculate which slp vector it is and the index
> - within. */
> - int num_scalar = SLP_TREE_LANES (slp_node);
> int num_vec = vect_get_num_copies (vinfo, slp_node);
> - poly_uint64 pos = (num_vec * nunits) - num_scalar + slp_index;
> + poly_uint64 pos = slp_index;
> + if (loop_vinfo)
> + {
> + /* Get the last occurrence of the scalar index from the concatenation
> of
> + all the slp vectors. Calculate which slp vector it is and the index
> + within. */
> + int num_scalar = SLP_TREE_LANES (slp_node);
> + pos += (num_vec * nunits) - num_scalar;
> + }
This looks independent enough and is OK to push separately.
> /* Calculate which vector contains the result, and which lane of
> that vector we need. */
> diff --git a/gcc/tree-vect-slp.cc b/gcc/tree-vect-slp.cc
> index 1850af4e753..6af13e65e19 100644
> --- a/gcc/tree-vect-slp.cc
> +++ b/gcc/tree-vect-slp.cc
> @@ -1117,8 +1117,12 @@ vect_record_max_nunits (vec_info *vinfo, stmt_vec_info
> stmt_info,
> }
>
> /* If populating the vector type requires unrolling then fail
> - before adjusting *max_nunits for basic-block vectorization. */
> + before adjusting *max_nunits for basic-block vectorization.
> + Allow group sizes that are indivisible by the vector length only if they
> + are known not to exceed the vector length. We may be able to support
> such
> + cases by generating constant masks. */
> if (is_a <bb_vec_info> (vinfo)
> + && maybe_gt (group_size, TYPE_VECTOR_SUBPARTS (vectype))
> && !multiple_p (group_size, TYPE_VECTOR_SUBPARTS (vectype)))
> {
> if (dump_enabled_p ())
Is this hunk still necessary?
> @@ -1170,12 +1174,29 @@ vect_build_slp_tree_1 (vec_info *vinfo, unsigned char
> *swap,
> tree soft_fail_nunits_vectype = NULL_TREE;
>
> tree vectype, nunits_vectype;
> + bool unsupported_datatype = false;
> if (!vect_get_vector_types_for_stmt (vinfo, first_stmt_info, &vectype,
> - &nunits_vectype, group_size))
> + &nunits_vectype, &unsupported_datatype,
> + group_size))
> {
> - /* Fatal mismatch. */
> - matches[0] = false;
> - return false;
> + /* Try to get fallback vector types and continue analysis, producing
> + matches[] as if vectype was not an issue. This allows splitting of
> + groups to happen. */
> + if (unsupported_datatype
> + && vect_get_vector_types_for_stmt (vinfo, first_stmt_info, &vectype,
> + &nunits_vectype,
> + &unsupported_datatype))
Can we instead add an "allow_predicated_tail" flag to
vect_get_vector_types_for_stmt?
Btw, I don't remember that we fail here for, say, a group_size of 7, do
we? Don't we fail only when there's no vector type with
nunits < group_size?
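To make the group_size of 7 case concrete, here is a standalone sketch
(not GCC code) of the split-point arithmetic used on the maybe_soft_fail
path in this patch. The name simulate_split and the plain unsigned lane
counts are assumptions for illustration; the real code works on matches[]
in place and takes const_nunits from constant_lower_bound.

```cpp
#include <cassert>
#include <vector>

// Toy model only: returns a matches[] array in which the lanes that
// should be split off are marked as mismatching (false).
std::vector<bool> simulate_split(unsigned group_size, unsigned const_nunits)
{
  assert(group_size > 0 && const_nunits > 0);
  std::vector<bool> matches(group_size, true);

  // No vector type fits inside the group: treat as a fatal mismatch.
  if (const_nunits > group_size)
    {
      matches[0] = false;
      return matches;
    }

  // The masking below only works for power-of-two lane counts.
  assert((const_nunits & (const_nunits - 1)) == 0);
  unsigned tail = group_size & (const_nunits - 1);
  if (tail == 0)
    tail = const_nunits;

  // Simulate a mismatch for the last TAIL lanes.
  for (unsigned i = group_size - tail; i < group_size; ++i)
    matches[i] = false;
  return matches;
}
```

With const_nunits of 4, a group of 7 keeps lanes 0-3 and marks lanes 4-6
as mismatching, so in this model the group is split rather than rejected.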
> + {
> + gcc_assert (is_a<bb_vec_info> (vinfo));
> + maybe_soft_fail = true;
> + soft_fail_nunits_vectype = nunits_vectype;
> + }
> + else
> + {
> + /* Fatal mismatch. */
> + matches[0] = false;
> + return false;
> + }
> }
> if (is_a <bb_vec_info> (vinfo)
> && known_le (TYPE_VECTOR_SUBPARTS (vectype), 1U))
> @@ -1705,16 +1726,22 @@ vect_build_slp_tree_1 (vec_info *vinfo, unsigned char
> *swap,
>
> if (maybe_soft_fail)
> {
> - unsigned HOST_WIDE_INT const_nunits;
> - if (!TYPE_VECTOR_SUBPARTS
> - (soft_fail_nunits_vectype).is_constant (&const_nunits)
> - || const_nunits > group_size)
> + /* Use the known minimum number of subparts for VLA because we still
> need
> + to choose a splitting point although the choice is more arbitrary.
> */
> + unsigned HOST_WIDE_INT const_nunits = constant_lower_bound (
> + TYPE_VECTOR_SUBPARTS (soft_fail_nunits_vectype));
> +
> + if (const_nunits > group_size)
> matches[0] = false;
> else
> {
> /* With constant vector elements simulate a mismatch at the
> point we need to split. */
> + gcc_assert ((const_nunits & (const_nunits - 1)) == 0);
> unsigned tail = group_size & (const_nunits - 1);
> + if (tail == 0)
> + tail = const_nunits;
> + gcc_assert (group_size >= tail);
> memset (&matches[group_size - tail], 0, sizeof (bool) * tail);
> }
> return false;
> @@ -2446,13 +2473,21 @@ vect_build_slp_tree_2 (vec_info *vinfo, slp_tree node,
> /* Check whether we can build the invariant. If we can't
> we never will be able to. */
> tree type = TREE_TYPE (chains[0][n].op);
> - if (!GET_MODE_SIZE (vinfo->vector_mode).is_constant ()
> - && (TREE_CODE (type) == BOOLEAN_TYPE
> - || !can_duplicate_and_interleave_p (vinfo,
> group_size,
> - type)))
> + if (!GET_MODE_SIZE (vinfo->vector_mode).is_constant ())
> {
> - matches[0] = false;
> - goto out;
> + if (TREE_CODE (type) == BOOLEAN_TYPE)
> + {
> + matches[0] = false;
> + goto out;
> + }
> +
> + if (!is_a<bb_vec_info> (vinfo)
> + && !can_duplicate_and_interleave_p (vinfo,
> group_size,
> + type))
> + {
> + matches[0] = false;
> + goto out;
> + }
Hmm, but this would be target-specific, right? It relies on the vec_init
CTOR expansion? How about constants? Don't we eventually want to
at least check s/group_size/TYPE_VECTOR_SUBPARTS/ for that
case so we can generate a zero-padded constant?
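As a rough illustration of the zero-padded constant I have in mind,
assuming a fixed number of lanes (pad_constant_to_lanes is a hypothetical
helper, not an existing GCC function, and VLA types would need more care):

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: extend the group's constant elements with zeros
// until they fill every lane of the vector type.
std::vector<long> pad_constant_to_lanes(const std::vector<long> &elts,
                                        std::size_t nunits)
{
  // The group must not be longer than the vector it is placed in.
  assert(elts.size() <= nunits);
  std::vector<long> padded(elts);
  padded.resize(nunits, 0);   // excess lanes become zero
  return padded;
}
```

For a 3-element group { 1, 2, 3 } and a 4-lane vector this gives
{ 1, 2, 3, 0 }; the padding lane is never stored if the store is masked.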
> }
> }
> else if (dt != vect_internal_def)
> @@ -2881,7 +2916,7 @@ out:
> uniform_val = NULL_TREE;
> break;
> }
> - if (!uniform_val
> + if (!uniform_val && !is_a<bb_vec_info> (vinfo)
> && !can_duplicate_and_interleave_p (vinfo,
> oprnd_info->ops.length
> (),
> TREE_TYPE (op0)))
> @@ -4993,6 +5028,53 @@ vect_analyze_slp_reductions (loop_vec_info loop_vinfo,
> return true;
> }
>
> +/* Update MIN_NUNITS to reflect the minimum number of subparts for all of the
> + vector types used by the SLP subgraph rooted at NODE. VISITED is used to
> + avoid reevaluating any node in the subgraph; it thereby prevents infinite
> + recursion should a cycle be encountered. The value of MIN_NUNITS will
> only be
> + updated if any node in the subgraph has a vector type with a number of
> + subparts that is smaller than the passed-in value of MIN_NUNITS. Before
> + calling this function for the first time, initialize MIN_NUNITS to
> + UINT64_MAX. */
> +
> +static void
> +vect_update_slp_min_nunits_for_node (slp_tree node, poly_uint64 &min_nunits,
> + hash_set<slp_tree> &visited)
> +{
> + if (!node || SLP_TREE_DEF_TYPE (node) != vect_internal_def)
> + return;
> +
> + if (visited.add (node))
> + return;
> +
> + for (slp_tree child : SLP_TREE_CHILDREN (node))
> + vect_update_slp_min_nunits_for_node (child, min_nunits, visited);
> +
> + tree vectype = SLP_TREE_VECTYPE (node);
> + if (!vectype)
> + return;
> +
> + /* All unit counts have the form vec_info::vector_size * X for some
> + rational X, therefore we know the values are ordered. */
> + poly_uint64 nunits = TYPE_VECTOR_SUBPARTS (vectype);
> + min_nunits = known_eq (min_nunits, UINT64_MAX)
> + ? nunits
> + : ordered_min (min_nunits, nunits);
> +}
> +
> +/* For NODE, return the minimum number of subparts for all of the vector
> + types used in the given SLP graph. */
> +
> +static poly_uint64
> +vect_slp_tree_min_nunits (slp_tree node)
> +{
> + poly_uint64 min_nunits = UINT64_MAX;
> + hash_set<slp_tree> visited;
> + vect_update_slp_min_nunits_for_node (node, min_nunits, visited);
> + gcc_checking_assert (known_ne (min_nunits, UINT64_MAX));
> + return min_nunits;
> +}
> +
> /* Analyze an SLP instance starting from a group of grouped stores. Call
> vect_build_slp_tree to build a tree of packed stmts if possible.
> Return FALSE if it's impossible to SLP any stmt in the group. */
> @@ -5062,8 +5144,8 @@ vect_analyze_slp_instance (vec_info *vinfo,
> poly_uint64 unrolling_factor
> = calculate_unrolling_factor (max_nunits, group_size);
>
> - if (maybe_ne (unrolling_factor, 1U)
> - && is_a <bb_vec_info> (vinfo))
> + if (maybe_ne (unrolling_factor, 1U) && is_a<bb_vec_info> (vinfo)
> + && !known_ge (vect_slp_tree_min_nunits (node), group_size))
Please put each && at the start of its own line and line them up.
Is this really good enough? 'group_size' might not be uniform across
the SLP graph, so I'd have expected the check to be done for each SLP
node (as part of the vect_slp_tree_min_nunits walk).
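A toy model of the per-node check I would have expected, with made-up
types (ToyNode is not a GCC structure, and constant lane counts stand in
for poly_uint64):

```cpp
#include <unordered_set>
#include <vector>

// Toy stand-in for an SLP node.
struct ToyNode {
  unsigned lanes;    // stands in for SLP_TREE_LANES
  unsigned nunits;   // stands in for a constant TYPE_VECTOR_SUBPARTS
  std::vector<ToyNode *> children;
};

// Walk the graph once per node; fail as soon as any node has more lanes
// than its own vector type can hold.
static bool fits_walk(const ToyNode *node,
                      std::unordered_set<const ToyNode *> &visited)
{
  if (!node || !visited.insert(node).second)
    return true;   // nothing to check, or already checked
  if (node->nunits < node->lanes)
    return false;
  for (const ToyNode *child : node->children)
    if (!fits_walk(child, visited))
      return false;
  return true;
}

bool every_node_fits_one_vector(const ToyNode *root)
{
  std::unordered_set<const ToyNode *> visited;
  return fits_walk(root, visited);
}
```

A graph whose root has 3 lanes but which contains an 8-lane node with a
4-lane vector type is rejected by this walk, whereas comparing only the
minimum nunits (4) with the root's group size (3) would accept it.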
> {
> unsigned HOST_WIDE_INT const_max_nunits;
> if (!max_nunits.is_constant (&const_max_nunits)
> @@ -5148,9 +5230,10 @@ vect_analyze_slp_instance (vec_info *vinfo,
> = TREE_TYPE (DR_REF (STMT_VINFO_DATA_REF (stmt_info)));
> tree vectype = get_vectype_for_scalar_type (vinfo, scalar_type,
> 1 << floor_log2 (i));
> - unsigned HOST_WIDE_INT const_nunits;
> - if (vectype
> - && TYPE_VECTOR_SUBPARTS (vectype).is_constant (&const_nunits))
> + unsigned HOST_WIDE_INT const_nunits
> + = vectype ? constant_lower_bound (TYPE_VECTOR_SUBPARTS (vectype))
> + : 0;
> + if (const_nunits > 1 && (i % const_nunits) == 0)
I prefer
if (vectype
&& (const_nunits = constant_lower_bound (...)) > 1
&&
instead of the conditional assignment.
> {
> /* Split into two groups at the first vector boundary. */
> gcc_assert ((const_nunits & (const_nunits - 1)) == 0);
> @@ -11596,7 +11679,21 @@ vectorizable_slp_permutation_1 (vec_info *vinfo,
> gimple_stmt_iterator *gsi,
> unpack_factor = 1;
> }
> unsigned olanes = unpack_factor * ncopies * SLP_TREE_LANES (node);
> - gcc_assert (repeating_p || multiple_p (olanes, nunits));
> +
> + /* With fully-predicated BB-SLP, an external node's number of lanes can be
> + incompatible with the chosen vector width (e.g., lane packs of 3 with a
> + natural 2-lane vector type). */
> + if (!repeating_p && !multiple_p (olanes, nunits))
> + {
> + if (dump_p)
> + dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
> + "unsupported permutation %p: vector type %T,"
> + " nunits=" HOST_WIDE_INT_PRINT_UNSIGNED
> + " ncopies=%" PRIu64 ", lanes=%u and unpack=%u\n",
> + (void *) node, vectype, estimated_poly_value
> (nunits),
> + ncopies, SLP_TREE_LANES (node), unpack_factor);
> + return -1;
> + }
>
> /* Compute the { { SLP operand, vector index}, lane } permutation sequence
> from the { SLP operand, scalar lane } permutation as recorded in the
> diff --git a/gcc/tree-vect-stmts.cc b/gcc/tree-vect-stmts.cc
> index cd3ba6fa1cb..367a9c63ea4 100644
> --- a/gcc/tree-vect-stmts.cc
> +++ b/gcc/tree-vect-stmts.cc
> @@ -1671,23 +1671,27 @@ check_load_store_for_partial_vectors (vec_info
> *vinfo, tree vectype,
> unsigned int nvectors;
> if (can_div_away_from_zero_p (size, nunits, &nvectors))
> return nvectors;
> - gcc_unreachable ();
> +
> + gcc_assert (known_le (size, nunits));
> + return 1u;
> };
>
> poly_uint64 nunits = TYPE_VECTOR_SUBPARTS (vectype);
> - poly_uint64 vf = loop_vinfo ? LOOP_VINFO_VECT_FACTOR (loop_vinfo) : 1;
> + poly_uint64 size = loop_vinfo
> + ? group_size * LOOP_VINFO_VECT_FACTOR (loop_vinfo)
> + : SLP_TREE_LANES (slp_node);
I'm not sure this is correct if we have a load permutation on slp_node.
It at least warrants a comment (or an assert?).
> unsigned factor;
> vect_partial_vector_style partial_vector_style
> = vect_get_partial_vector_style (vectype, is_load, &factor, elsvals);
>
> if (partial_vector_style == vect_partial_vectors_len)
> {
> - nvectors = group_memory_nvectors (group_size * vf, nunits);
> + nvectors = group_memory_nvectors (size, nunits);
> vect_record_len (vinfo, slp_node, nvectors, vectype, factor);
> }
> else if (partial_vector_style == vect_partial_vectors_while_ult)
> {
> - nvectors = group_memory_nvectors (group_size * vf, nunits);
> + nvectors = group_memory_nvectors (size, nunits);
> vect_record_mask (vinfo, slp_node, nvectors, vectype, scalar_mask);
> }
> else
> @@ -3362,12 +3366,11 @@ vect_get_strided_load_store_ops (stmt_vec_info
> stmt_info, slp_tree node,
>
> static tree
> vect_get_loop_variant_data_ptr_increment (
> - vec_info *vinfo, tree aggr_type, gimple_stmt_iterator *gsi,
> + loop_vec_info loop_vinfo, tree aggr_type, gimple_stmt_iterator *gsi,
> vec_loop_lens *loop_lens, dr_vec_info *dr_info,
> vect_memory_access_type memory_access_type)
> {
> - loop_vec_info loop_vinfo = dyn_cast<loop_vec_info> (vinfo);
> - tree step = vect_dr_behavior (vinfo, dr_info)->step;
> + tree step = vect_dr_behavior (loop_vinfo, dr_info)->step;
>
> /* gather/scatter never reach here. */
> gcc_assert (!mat_gather_scatter_p (memory_access_type));
> @@ -3411,7 +3414,7 @@ vect_get_data_ptr_increment (vec_info *vinfo,
> gimple_stmt_iterator *gsi,
>
> loop_vec_info loop_vinfo = dyn_cast<loop_vec_info> (vinfo);
> if (loop_vinfo && LOOP_VINFO_USING_SELECT_VL_P (loop_vinfo))
> - return vect_get_loop_variant_data_ptr_increment (vinfo, aggr_type, gsi,
> + return vect_get_loop_variant_data_ptr_increment (loop_vinfo, aggr_type,
> gsi,
> loop_lens, dr_info,
> memory_access_type);
The two hunks look independent. They are OK if pushed separately.
>
> @@ -5297,7 +5300,7 @@ vect_create_vectorized_demotion_stmts (vec_info *vinfo,
> vec<tree> *vec_oprnds,
> call the function recursively. */
>
> static void
> -vect_create_vectorized_promotion_stmts (vec_info *vinfo,
> +vect_create_vectorized_promotion_stmts (vec_info *vinfo, slp_tree slp_node,
> vec<tree> *vec_oprnds0,
> vec<tree> *vec_oprnds1,
> stmt_vec_info stmt_info, tree
> vec_dest,
> @@ -5310,37 +5313,39 @@ vect_create_vectorized_promotion_stmts (vec_info
> *vinfo,
> gimple *new_stmt1, *new_stmt2;
> vec<tree> vec_tmp = vNULL;
>
> - vec_tmp.create (vec_oprnds0->length () * 2);
> + const unsigned ncopies = vect_get_num_copies (vinfo, slp_node);
> + vec_tmp.create (ncopies);
> + gcc_assert (vec_oprnds0->length () <= ncopies);
> FOR_EACH_VEC_ELT (*vec_oprnds0, i, vop0)
> {
> + if (vec_tmp.length () >= ncopies)
> + break;
> +
> if (op_type == binary_op)
> vop1 = (*vec_oprnds1)[i];
> else
> vop1 = NULL_TREE;
>
> /* Generate the two halves of promotion operation. */
> - new_stmt1 = vect_gen_widened_results_half (vinfo, ch1, vop0, vop1,
> - op_type, vec_dest, gsi,
> - stmt_info);
> - new_stmt2 = vect_gen_widened_results_half (vinfo, ch2, vop0, vop1,
> - op_type, vec_dest, gsi,
> - stmt_info);
> - if (is_gimple_call (new_stmt1))
> - {
> - new_tmp1 = gimple_call_lhs (new_stmt1);
> - new_tmp2 = gimple_call_lhs (new_stmt2);
> - }
> - else
> + new_stmt1
> + = vect_gen_widened_results_half (vinfo, ch1, vop0, vop1, op_type,
> + vec_dest, gsi, stmt_info);
> + new_tmp1 = is_gimple_call (new_stmt1) ? gimple_call_lhs (new_stmt1)
> + : gimple_assign_lhs (new_stmt1);
> + vec_tmp.quick_push (new_tmp1);
> +
> + if (vec_tmp.length () < ncopies)
> {
> - new_tmp1 = gimple_assign_lhs (new_stmt1);
> - new_tmp2 = gimple_assign_lhs (new_stmt2);
> + new_stmt2
> + = vect_gen_widened_results_half (vinfo, ch2, vop0, vop1, op_type,
> + vec_dest, gsi, stmt_info);
> + new_tmp2 = is_gimple_call (new_stmt2) ? gimple_call_lhs (new_stmt2)
> + : gimple_assign_lhs
> (new_stmt2);
> + vec_tmp.quick_push (new_tmp2);
> }
> -
> - /* Store the results for the next step. */
> - vec_tmp.quick_push (new_tmp1);
> - vec_tmp.quick_push (new_tmp2);
> }
>
> + gcc_assert (vec_tmp.length () <= ncopies);
> vec_oprnds0->release ();
> *vec_oprnds0 = vec_tmp;
> }
> @@ -5553,6 +5558,7 @@ vectorizable_conversion (vec_info *vinfo,
> from the scalar type. */
> if (!vectype_in)
> vectype_in = get_vectype_for_scalar_type (vinfo, rhs_type, slp_node);
> +
> if (!cost_vec)
> gcc_assert (vectype_in);
> if (!vectype_in)
> @@ -5961,12 +5967,15 @@ vectorizable_conversion (vec_info *vinfo,
> stmt_info, this_dest, gsi, c1,
> op_type);
> else
> - vect_create_vectorized_promotion_stmts (vinfo, &vec_oprnds0,
> - &vec_oprnds1, stmt_info,
> - this_dest, gsi,
> + vect_create_vectorized_promotion_stmts (vinfo, slp_node,
> + &vec_oprnds0,
> &vec_oprnds1,
> + stmt_info, this_dest, gsi,
> c1, c2, op_type);
> }
>
> + gcc_assert (vec_oprnds0.length ()
> + == vect_get_num_copies (vinfo, slp_node));
> +
> FOR_EACH_VEC_ELT (vec_oprnds0, i, vop0)
> {
> gimple *new_stmt;
> @@ -5990,6 +5999,16 @@ vectorizable_conversion (vec_info *vinfo,
> generate more than one vector stmt - i.e - we need to "unroll"
> the vector stmt by a factor VF/nunits. */
> vect_get_vec_defs (vinfo, slp_node, op0, &vec_oprnds0);
> +
> + /* Promotion no longer produces redundant defs (since support was
> + added for length/mask-predicated BB SLP of awkward-sized groups),
> + therefore demotion now has to handle that case too. */
> + if (vec_oprnds0.length () % 2 != 0)
> + {
> + tree vectype = TREE_TYPE (vec_oprnds0[0]);
> + vec_oprnds0.safe_push (build_zero_cst (vectype));
> + }
> +
Please split out the changes required for promotion/demotion. They are
useful elsewhere (getting rid of max_nunits) and we can iterate on those
separately.
> /* Arguments are ready. Create the new vector stmts. */
> if (cvt_type && modifier == NARROW_DST)
> FOR_EACH_VEC_ELT (vec_oprnds0, i, vop0)
> @@ -10803,7 +10822,7 @@ vectorizable_load (vec_info *vinfo,
>
> aggr_type = build_array_type_nelts (elem_type, group_size * nunits);
> if (!costing_p)
> - bump = vect_get_data_ptr_increment (vinfo, gsi, dr_info, aggr_type,
> + bump = vect_get_data_ptr_increment (loop_vinfo, gsi, dr_info,
> aggr_type,
> memory_access_type, loop_lens);
That hunk seems unrelated; OK to push as a separate commit.
>
> unsigned int inside_cost = 0, prologue_cost = 0;
> @@ -13460,6 +13479,38 @@ vect_analyze_stmt (vec_info *vinfo,
> " live stmt not supported: %G",
> stmt_info->stmt);
>
> + if (bb_vinfo)
> + {
> + unsigned int group_size = SLP_TREE_LANES (node);
> + tree vectype = SLP_TREE_VECTYPE (node);
> + poly_uint64 nunits = TYPE_VECTOR_SUBPARTS (vectype);
> + bool needs_partial = maybe_lt (group_size, nunits);
> + if (needs_partial)
> + {
> + /* If partial vectors are required then they must be supported by
> the
> + target; however, don't assume that a partial vectors style has
> + been set because a mask or length may not be required for the
> + statement. */
> + if (!SLP_TREE_CAN_USE_PARTIAL_VECTORS_P (node))
> + return opt_result::failure_at (stmt_info->stmt,
> + "not vectorized: SLP node needs
> but "
> + "cannot use partial vectors: %G",
> + stmt_info->stmt);
That's the job of vectorizable_*, and it's already done for the loop
vect case. We should do the same for predicated tails.
> + }
> + else
> + {
> + /* If we don't need partial vectors then we don't care about whether
> + they are supported or not; however, we need to clear any partial
> + vectors style that might have been chosen because it will be used
> + to control generation of lengths or masks. */
> + SLP_TREE_PARTIAL_VECTORS_STYLE (node) = vect_partial_vectors_none;
> + SLP_TREE_NUM_PARTIAL_VECTORS (node) = 0;
Likewise.
> + }
> +
> + if (maybe_gt (group_size, nunits))
> + gcc_assert (multiple_p (group_size, nunits));
> + }
> +
> return opt_result::success ();
> }
>
> @@ -13767,13 +13818,7 @@ tree
> get_vectype_for_scalar_type (vec_info *vinfo, tree scalar_type,
> unsigned int group_size)
> {
> - /* For BB vectorization, we should always have a group size once we've
> - constructed the SLP tree; the only valid uses of zero GROUP_SIZEs
> - are tentative requests during things like early data reference
> - analysis and pattern recognition. */
> - if (is_a <bb_vec_info> (vinfo))
> - gcc_assert (vinfo->slp_instances.is_empty () || group_size != 0);
> - else
> + if (!is_a <bb_vec_info> (vinfo))
But we really want to override the preferred SIMD mode, so any
"auto-detection" for BB vect looks wrong to me. Can we instead
pass group_size rounded down to a power of two?
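What I mean by rounding down, as a standalone sketch (round_down_to_pow2
is a made-up name; in GCC this would presumably be the same idiom as the
1 << floor_log2 (i) already used in vect_analyze_slp_instance):

```cpp
#include <cassert>

// Largest power of two that does not exceed GROUP_SIZE.
unsigned round_down_to_pow2(unsigned group_size)
{
  assert(group_size != 0);
  unsigned result = 1;
  // Double while the next power of two still fits; comparing against
  // group_size / 2 avoids overflow for very large inputs.
  while (result <= group_size / 2)
    result *= 2;
  return result;
}
```

So a group of 7 would ask for a 4-lane type and a group of 24 for a
16-lane type, while exact powers of two are passed through unchanged.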
> group_size = 0;
>
> tree vectype = get_related_vectype_for_scalar_type (vinfo->vector_mode,
> @@ -13787,10 +13832,18 @@ get_vectype_for_scalar_type (vec_info *vinfo, tree
> scalar_type,
> vinfo->used_vector_modes.add (TYPE_MODE (vectype));
>
> /* If the natural choice of vector type doesn't satisfy GROUP_SIZE,
> - try again with an explicit number of elements. */
> - if (vectype
> - && group_size
> - && maybe_ge (TYPE_VECTOR_SUBPARTS (vectype), group_size))
> + try again with an explicit number of elements. A vector type satisfies
> + GROUP_SIZE if it is definitely not too long to store the whole group,
> + or we are able to generate masks to handle the unknown number of excess
> + lanes that might exist. Otherwise, we must substitute a vector type
> that
> + can be used to carve up the group.
> + */
> + if (vectype && group_size
> + && maybe_gt (TYPE_VECTOR_SUBPARTS (vectype), group_size)
> + && (vect_get_partial_vector_style (vectype, true)
> + == vect_partial_vectors_none
> + || vect_get_partial_vector_style (vectype, false)
> + == vect_partial_vectors_none))
This changes maybe_ge to maybe_gt - why? I don't like a partial vector
style query here. I know we're setting the actual type up only late, but
shouldn't we get proper failures if we always analyze this as using
partial vectors and then find later that we cannot support them?
> {
> /* Start with the biggest number of units that fits within
> GROUP_SIZE and halve it until we find a valid vector type.
> @@ -14106,7 +14159,36 @@ vect_maybe_update_slp_op_vectype (vec_info *vinfo,
> slp_tree op, tree vectype)
> && SLP_TREE_DEF_TYPE (op) == vect_external_def
> && SLP_TREE_LANES (op) > 1)
> return false;
> - (void) vinfo; /* FORNOW */
> +
> + /* When the vectorizer falls back to building vector operands from scalars,
> + it can create SLP trees with external defs that have a number of lanes
> not
> + divisible by the number of subparts in a vector type naively inferred
> from
> + the scalar type. Reject such types to avoid ICE when later computing
> the
> + prologue cost for invariant operands. */
> + if (SLP_TREE_DEF_TYPE (op) == vect_external_def)
> + {
> + poly_uint64 vf = 1;
> +
> + if (loop_vec_info loop_vinfo = dyn_cast<loop_vec_info> (vinfo))
> + vf = LOOP_VINFO_VECT_FACTOR (loop_vinfo);
> +
> + vf *= SLP_TREE_LANES (op);
> +
> + if (maybe_lt (TYPE_VECTOR_SUBPARTS (vectype), vf)
> + && !multiple_p (vf, TYPE_VECTOR_SUBPARTS (vectype)))
So this seems to be a guard for vect_get_num_copies? It seems to me
that if this is only a problem for costing we should fix it up there,
not here. As I understand it, for SVE you always have a single vector,
given that you want to use AdvSIMD for the full-vector parts. E.g. with
7 int lanes you force a split during SLP discovery to get 4 lanes with
AdvSIMD and 3 lanes with a predicated tail and SVE? On x86 we can share
the vector type, so technically we do not need to force a split there
(but I think there's no harm done in doing so). So -- why does
  if (known_ge (TYPE_VECTOR_SUBPARTS (vectype), vf))
    return 1;
in vect_get_num_vectors not trigger and solve the issue?
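To spell out the cases, a toy model with constant lane counts
(toy_num_vectors is a made-up function, not the GCC implementation, and
returning 0 for "no exact count" is just a convention for this sketch):

```cpp
// Number of vectors needed for LANES scalar lanes with NUNITS lanes per
// vector, or 0 if LANES is larger than, and not a multiple of, NUNITS.
unsigned toy_num_vectors(unsigned lanes, unsigned nunits)
{
  if (nunits == 0)
    return 0;
  // Models the early return quoted above: one (possibly partial) vector.
  if (nunits >= lanes)
    return 1;
  // Otherwise an exact division is required.
  if (lanes % nunits != 0)
    return 0;
  return lanes / nunits;
}
```

In this model 3 lanes in a 4-lane vector need one vector and 8 lanes need
two; only something like 7 lanes with a 4-lane vector has no answer, and
as far as I understand the forced split is meant to rule that case out.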
> + {
> + if (dump_enabled_p ())
> + dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
> + "lanes=" HOST_WIDE_INT_PRINT_UNSIGNED
> + " is not divisible by "
> + "subparts=" HOST_WIDE_INT_PRINT_UNSIGNED ".\n",
> + estimated_poly_value (vf),
> + estimated_poly_value (
> + TYPE_VECTOR_SUBPARTS (vectype)));
> + return false;
> + }
> + }
> +
> SLP_TREE_VECTYPE (op) = vectype;
> return true;
> }
> @@ -14814,27 +14896,32 @@ vect_gen_while_not (gimple_seq *seq, tree
> mask_type, tree start_index,
>
> - Set *NUNITS_VECTYPE_OUT to the vector type that contains the maximum
> number of units needed to vectorize STMT_INFO, or NULL_TREE if the
> - statement does not help to determine the overall number of units. */
> + statement does not help to determine the overall number of units.
> +
> + - Set *UNSUPPORTED_DATATYPE to false.
> +
> + On failure:
> +
> + - Set *UNSUPPORTED_DATATYPE to true if the statement can't be vectorized
> + because it uses a data type that the target doesn't support in vector
> form
> + for a group of the given GROUP_SIZE.
> + */
>
> opt_result
> vect_get_vector_types_for_stmt (vec_info *vinfo, stmt_vec_info stmt_info,
> tree *stmt_vectype_out,
> tree *nunits_vectype_out,
> + bool *unsupported_datatype,
> unsigned int group_size)
> {
> gimple *stmt = stmt_info->stmt;
>
> - /* For BB vectorization, we should always have a group size once we've
> - constructed the SLP tree; the only valid uses of zero GROUP_SIZEs
> - are tentative requests during things like early data reference
> - analysis and pattern recognition. */
> - if (is_a <bb_vec_info> (vinfo))
> - gcc_assert (vinfo->slp_instances.is_empty () || group_size != 0);
> - else
> + if (!is_a<bb_vec_info> (vinfo))
> group_size = 0;
>
> *stmt_vectype_out = NULL_TREE;
> *nunits_vectype_out = NULL_TREE;
> + *unsupported_datatype = false;
>
> if (gimple_get_lhs (stmt) == NULL_TREE
> /* Allow vector conditionals through here. */
> @@ -14907,10 +14994,13 @@ vect_get_vector_types_for_stmt (vec_info *vinfo,
> stmt_vec_info stmt_info,
> }
> vectype = get_vectype_for_scalar_type (vinfo, scalar_type, group_size);
> if (!vectype)
> - return opt_result::failure_at (stmt,
> - "not vectorized:"
> - " unsupported data-type %T\n",
> - scalar_type);
> + {
> + *unsupported_datatype = true;
> + return opt_result::failure_at (stmt,
> + "not vectorized:"
> + " unsupported data-type %T\n",
> + scalar_type);
> + }
>
> if (dump_enabled_p ())
> dump_printf_loc (MSG_NOTE, vect_location, "vectype: %T\n", vectype);
> diff --git a/gcc/tree-vectorizer.h b/gcc/tree-vectorizer.h
> index 62f6ad320f0..1b9103b6f5f 100644
> --- a/gcc/tree-vectorizer.h
> +++ b/gcc/tree-vectorizer.h
> @@ -2353,6 +2353,8 @@ vect_get_num_copies (vec_info *vinfo, slp_tree node)
>
> vf *= SLP_TREE_LANES (node);
> tree vectype = SLP_TREE_VECTYPE (node);
> + if (known_ge (TYPE_VECTOR_SUBPARTS (vectype), vf))
> + return 1;
>
> return vect_get_num_vectors (vf, vectype);
> }
> @@ -2621,9 +2623,9 @@ extern tree vect_gen_while (gimple_seq *, tree, tree,
> tree,
> const char * = nullptr);
> extern void vect_gen_while_ssa_name (gimple_seq *, tree, tree, tree, tree);
> extern tree vect_gen_while_not (gimple_seq *, tree, tree, tree);
> -extern opt_result vect_get_vector_types_for_stmt (vec_info *,
> - stmt_vec_info, tree *,
> - tree *, unsigned int = 0);
> +extern opt_result vect_get_vector_types_for_stmt (vec_info *, stmt_vec_info,
> + tree *, tree *,
> + bool *, unsigned int = 0);
> extern opt_tree vect_get_mask_type_for_stmt (stmt_vec_info, unsigned int =
> 0);
>
> /* In tree-if-conv.cc. */
> @@ -2956,9 +2958,8 @@ vect_can_use_partial_vectors_p (vec_info *vinfo,
> slp_tree slp_node)
> loop_vec_info loop_vinfo = dyn_cast<loop_vec_info> (vinfo);
> if (loop_vinfo)
> return LOOP_VINFO_CAN_USE_PARTIAL_VECTORS_P (loop_vinfo);
> -
> - (void) slp_node; /* FORNOW */
> - return false;
> + else
> + return SLP_TREE_CAN_USE_PARTIAL_VECTORS_P (slp_node);
> }
>
> /* If VINFO is vectorizer state for loop vectorization then record that we no
> --
> 2.43.0
>