> -----Original Message-----
> From: Christopher Bazley <[email protected]>
> Sent: 03 June 2026 16:19
> To: [email protected]
> Cc: Tamar Christina <[email protected]>;
> [email protected]; [email protected]; Chris Bazley
> <[email protected]>
> Subject: [PATCH v11 08/12] AArch64/SVE: Optimize vec_init for partial SVE
> vector modes
>
> When basic block vectorization is extended to support predicated
> vector tails, it attempts to vectorize more stores. This is only
> done if the cost model deems it profitable, but the cost model
> assumes that vec_init is cheap; in practice, that was not always
> true.
>
> For example,
>
> uint8_t * vectp.2689;
> vector([4,4]) unsigned char _846;
> vector([4,4]) <signed-boolean:4> slp_mask_848;
> ...
> _846 = {_20, _30, _40, _50};
> vectp.2689_847 = src_61(D) + 64;
> slp_mask_848 = .WHILE_ULT (0, 4, { 0, ... });
> .MASK_STORE (vectp.2689_847, 8B, slp_mask_848, _846);
>
> was expected to have a vector cost of 4 (the same as the scalar
> cost) but the code actually generated for
>
> _846 = {_20, _30, _40, _50};
>
> was a repetitive series of write-modify-read operations
> using four stack locations for temporary storage:
>
> (set (reg:VNx16BI Y)
> (const_vector:VNx16BI repeat [
> (const_int 1 [0x1])
> ]))
>
> (set (mem/c:VNx4QI (plus:DI (reg/f:DI 96 virtual-stack-vars)
> (const_poly_int:DI [-O, -O])) [0 S[O, O] A8])
> (unspec:VNx4QI [
> (subreg:VNx4BI (reg:VNx16BI Y) 0)
> (reg:VNx4QI 205 [ _845 ])
> ] UNSPEC_PRED_X))
>
> (set (reg:QI Z)
> (subreg:QI (reg:SI X [ _W ]) 0))
>
> (set (mem/c:QI (plus:DI (reg/f:DI 96 virtual-stack-vars)
> (const_poly_int:DI [-O, -O])) [0 S1 A8])
> (reg:QI Z))
>
> (set (reg:VNx4QI 205 [ _845 ])
> (unspec:VNx4QI [
> (subreg:VNx4BI (reg:VNx16BI Y) 0)
> (mem/c:VNx4QI (plus:DI (reg/f:DI 96 virtual-stack-vars)
> (const_poly_int:DI [-O, -O])) [0 S[O, O] A8])
> ] UNSPEC_PRED_X))
>
> (repeated four times)
>
> which compiled to something like:
>
> addpl x5, sp, #6
> st1b {z27.s}, p7, [sp, #3, mul vl]
> strb w10, [x5]
> ld1b {z28.s}, p7/z, [sp, #3, mul vl]
>
> (repeated four times)
>
> With these changes, the compiled code is instead:
>
> mov z31.b, w0
> insr z31.s, s28
> insr z31.s, s29
> insr z31.s, s30
>
> which is not yet optimal but is a great improvement.
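To make the shape of the new sequence easier to follow, here is a minimal scalar sketch of it in C. It assumes that the expansion broadcasts the last element and then inserts the remaining elements in reverse order, which is consistent with the one mov plus three insr above but is not taken from the patch itself. LANES, model_dup, model_insr and model_vec_init are hypothetical names used only for illustration, and a fixed lane count stands in for a scalable vector.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical fixed lane count; real SVE vectors are scalable.  */
#define LANES 4

/* Model of a broadcast (mov z.b, w): every lane receives x.  */
static void model_dup(uint8_t v[LANES], uint8_t x)
{
    memset(v, x, LANES);
}

/* Model of INSR: every lane moves up by one and x lands in lane 0.
   The old top lane is discarded.  */
static void model_insr(uint8_t v[LANES], uint8_t x)
{
    memmove(v + 1, v, (LANES - 1) * sizeof v[0]);
    v[0] = x;
}

/* Assumed build order: broadcast the last element, then insert the
   others from second-to-last down to first.  */
static void model_vec_init(uint8_t v[LANES], const uint8_t elts[LANES])
{
    model_dup(v, elts[LANES - 1]);
    for (int i = LANES - 2; i >= 0; i--)
        model_insr(v, elts[i]);
}
```

With four elements this is one broadcast followed by three inserts, with no stack traffic, which matches the instruction count shown above.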
>
> To achieve that, "vec_init<mode><Vel>" was modified to
> accept all SVE vector modes, which means that the
> associated function aarch64_sve_expand_vector_init
> must now handle all partial modes (namely, VNx8QI, VNx4QI,
> VNx2QI, VNx4HI, VNx2HI, VNx2SI, VNx2HF, VNx4HF, VNx2SF,
> VNx2BF, VNx4BF).
>
> I verified that the following dependencies already
> handle partial vector modes:
> - "@aarch64_sve_<perm_insn><mode>" (for ZIP1)
> - "*vec_duplicate<mode>_reg"
> - maybe_code_for_aarch64_sve_rev
>
> I did not verify that emit_move_insn (which is a dependency
> of aarch64_sve_expand_vector_init_handle_trailing_constants)
> handles partial vector modes, but it seems highly likely.
>
> "vec_shl_insert_<mode>" has been modified to accept SVE_ALL
> instead of only SVE_FULL and to operate on container types
> rather than element types.
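A rough byte-level sketch in C of why the container width matters for a partial mode such as VNx4QI. It assumes that each 8-bit element sits in the low byte of a 32-bit container (little-endian layout) and uses a fixed 16-byte register for illustration. model_insr_bytes and model_partial_elt are hypothetical helpers, not GCC functions, and the inserted container's upper bytes are zeroed here purely for simplicity.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical fixed-size model: 4 containers of 4 bytes each.  */
#define CONTAINERS 4
#define CONTAINER_BYTES 4
#define VEC_BYTES (CONTAINERS * CONTAINER_BYTES)

/* Model of INSR at a given granularity: shift the whole register up by
   `width` bytes and write x into the lowest byte.  The rest of the new
   low slot is zeroed (an assumption made for simplicity).  */
static void model_insr_bytes(uint8_t reg[VEC_BYTES], uint8_t x, size_t width)
{
    memmove(reg + width, reg, VEC_BYTES - width);
    memset(reg, 0, width);
    reg[0] = x;
}

/* Element i of the partial vector: the low byte of container i.  */
static uint8_t model_partial_elt(const uint8_t reg[VEC_BYTES], size_t i)
{
    return reg[i * CONTAINER_BYTES];
}
```

Under these assumptions, inserting at container width (4 bytes, like insr on .s) moves each existing element into the next container, whereas inserting at element width (1 byte, like insr on .b) would move it into the padding of its own container.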
>
> gcc/ChangeLog:
>
> * config/aarch64/aarch64-sve.md: Update
> vec_init<mode><Vel> and vec_shl_insert_<mode> to
> accept all SVE vector modes.
>
> gcc/testsuite/ChangeLog:
>
> * gcc.target/aarch64/sve/slp_stack.c: New test.
OK.
Thanks,
Tamar
> ---
> gcc/config/aarch64/aarch64-sve.md | 16 +++++------
> .../gcc.target/aarch64/sve/slp_stack.c | 27 +++++++++++++++++++
> 2 files changed, 35 insertions(+), 8 deletions(-)
> create mode 100644 gcc/testsuite/gcc.target/aarch64/sve/slp_stack.c
>
> diff --git a/gcc/config/aarch64/aarch64-sve.md b/gcc/config/aarch64/aarch64-sve.md
> index 585a587d8cf..6750428255c 100644
> --- a/gcc/config/aarch64/aarch64-sve.md
> +++ b/gcc/config/aarch64/aarch64-sve.md
> @@ -2959,7 +2959,7 @@ (define_insn "@aarch64_sve_ld1ro<mode>"
> ;; -------------------------------------------------------------------------
>
> (define_expand "vec_init<mode><Vel>"
> - [(match_operand:SVE_FULL 0 "register_operand")
> + [(match_operand:SVE_ALL 0 "register_operand")
> (match_operand 1 "")]
> "TARGET_SVE"
> {
> @@ -3003,17 +3003,17 @@ (define_expand "vec_initvnx16qivnx2qi"
>
> ;; Shift an SVE vector left and insert a scalar into element 0.
> (define_insn "vec_shl_insert_<mode>"
> - [(set (match_operand:SVE_FULL 0 "register_operand")
> - (unspec:SVE_FULL
> - [(match_operand:SVE_FULL 1 "register_operand")
> + [(set (match_operand:SVE_ALL 0 "register_operand")
> + (unspec:SVE_ALL
> + [(match_operand:SVE_ALL 1 "register_operand")
> (match_operand:<VEL> 2 "aarch64_reg_or_zero")]
> UNSPEC_INSR))]
> "TARGET_SVE"
> {@ [ cons: =0 , 1 , 2 ; attrs: movprfx ]
> - [ ?w , 0 , rZ ; * ] insr\t%0.<Vetype>, %<vwcore>2
> - [ w , 0 , w ; * ] insr\t%0.<Vetype>, %<Vetype>2
> - [ ??&w , w , rZ ; yes ] movprfx\t%0, %1\;insr\t%0.<Vetype>, %<vwcore>2
> - [ ?&w , w , w ; yes ] movprfx\t%0, %1\;insr\t%0.<Vetype>, %<Vetype>2
> + [ ?w , 0 , rZ ; * ] insr\t%0.<Vctype>, %<vccore>2
> + [ w , 0 , w ; * ] insr\t%0.<Vctype>, %<Vctype>2
> + [ ??&w , w , rZ ; yes ] movprfx\t%0, %1\;insr\t%0.<Vctype>, %<vccore>2
> + [ ?&w , w , w ; yes ] movprfx\t%0, %1\;insr\t%0.<Vctype>, %<Vctype>2
> }
> [(set_attr "sve_type" "sve_int_general")]
> )
> diff --git a/gcc/testsuite/gcc.target/aarch64/sve/slp_stack.c b/gcc/testsuite/gcc.target/aarch64/sve/slp_stack.c
> new file mode 100644
> index 00000000000..76be816e0d6
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/aarch64/sve/slp_stack.c
> @@ -0,0 +1,27 @@
> +/* { dg-do compile } */
> +/* { dg-options "-O2 -ftree-vectorize -mcpu=neoverse-n2 --param=aarch64-autovec-preference=sve-only -msve-vector-bits=scalable" } */
> +
> +#include <stdint.h>
> +
> +/* Without an efficient implementation of vec_init for partial SVE types, a
> + decision to vectorize a group in the basic block vectorizer can result in
> + code that repeatedly stores a whole vector on the stack, overwrites one
> + element, reloads the whole vector, stores it to another location,
> + overwrites another element, etc. This is a fairly minimal reproducer. */
> +void
> +vec_slp_pathological_stack (uint8_t *src)
> +{
> + int lt = src[-33];
> + int l0 = src[-1];
> + int l1 = src[31];
> + int t0 = src[-32];
> + int t1 = src[-31];
> + int t2 = src[-30];
> + src[64] = (l1 + (2 * l0) + lt + 2) >> 2;
> + src[65] = (lt + t0 + 1) >> 1;
> + src[66] = (t0 + t1 + 1) >> 1;
> + src[67] = (t1 + t2 + 1) >> 1;
> +}
> +
> +/* { dg-final { scan-assembler-not {sp} } }
> + */
> --
> 2.43.0