On Wed, May 13, 2026 at 09:01:58AM +0100, Tamar Christina wrote:
> Hi Art,
>
> > -----Original Message-----
> > From: Artemiy Volkov <[email protected]>
> > Sent: 27 April 2026 09:06
> > To: [email protected]
> > Cc: Tamar Christina <[email protected]>; Wilco Dijkstra
> > <[email protected]>; [email protected]; Richard
> > Earnshaw <[email protected]>; [email protected]; Alice
> > Carlotti <[email protected]>; Alex Coplan <[email protected]>;
> > Artemiy Volkov <[email protected]>
> > Subject: [PATCH 2/4] aarch64: initialize vectors from starting subsequence
> >
> > Now that we have 2- and 4-element vector modes for all the sub-word scalar
> > modes, we can emit more efficient code when the elements of a vector
> > constructor can be generated from a common starting subsequence of length
> > power of two. To do this, first detect the shortest possible starting
> > subsequence by repeatedly folding the initial constructor element array
> > in half, as long as the left and the right halves are equal. Afterwards,
> > after emitting the subsequence, duplicate it by generating a
> > vec_duplicate with the correct source mode.
> >
> > On the MD side, this requires implementing the vec_duplicate optab to
> > duplicate an arbitrary sub-128-bit value into a full 64- or a 128-bit
> > AdvSIMD register, as well as the vec_set insn for the VSUB64 modes (needed
> > as fallback for the divide-and-conquer approach). The latter uses a
> > properly scaled and shifted "bfi" for integer values, and a properly
> > indexed "ins" for FP elements.
> >
> > This change allows us to get rid of long chains of inserts and compile
> > things like:
> >
> > int16x8_t f (int16_t x, int16_t y, int16_t z, int16_t w)
> > {
> > return (int16x8_t) {x, y, z, w, x, y, z, w};
> > }
> >
> > into:
> > bfi w0, w2, 16, 16
> > bfi w1, w3, 16, 16
> > dup v31.2s, w0
> > dup v0.2s, w1
> > zip1 v0.8h, v31.8h, v0.8h
> > ret
>
> Curious, while and improvement why didn't this
> continue and merge w0 and w1 into x0 and do a single dup?
>
> so another bfi but of an x registers or an orr of x registers.
>
> That removes the other transfer and the need for the zip.
This is a problem of constructing a vector of N>2 arbitrary elements,
which we still handle by either (a) using dup+zip, or (b) inserting
elements one by one using ins, whichever is cheaper. Perhaps achieving
your codegen is just a matter of doing something like:
diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc
index b58c4572299..b4ccb0e1352 100644
--- a/gcc/config/aarch64/aarch64.cc
+++ b/gcc/config/aarch64/aarch64.cc
@@ -25625,6 +25625,25 @@ aarch64_expand_vector_init_fallback (rtx target, rtx
vals)
}
}
+ if (pow2p_hwi (n_elts))
+ {
+ machine_mode subv_mode = mode_for_vector (inner_mode, n_elts /
2).require ();
+ rtx new_target0 = gen_reg_rtx (subv_mode), new_target1 = gen_reg_rtx
(subv_mode);
+ rtvec new_vals0 = rtvec_alloc (n_elts / 2), new_vals1 = rtvec_alloc
(n_elts / 2);
+ for (int i = 0; i < n_elts / 2; i++) {
+ RTVEC_ELT (new_vals0, i) = XVECEXP (vals, 0, i);
+ RTVEC_ELT (new_vals1, i) = XVECEXP (vals, 0, n_elts / 2 + i);
+ }
+ aarch64_expand_vector_init (new_target0, gen_rtx_PARALLEL (subv_mode,
new_vals0));
+ aarch64_expand_vector_init (new_target1, gen_rtx_PARALLEL (subv_mode,
new_vals1));
+
+ rtvec new_vals = rtvec_alloc (2);
+ RTVEC_ELT (new_vals, 0) = new_target0;
+ RTVEC_ELT (new_vals, 1) = new_target1;
+ aarch64_expand_vector_init (target, gen_rtx_PARALLEL (mode, new_vals));
+ return;
+ }
+
I guess this could always be used at least as a last-resort option instead
of a sequence of vec_sets (and in this case it will also be cheaper than
dup+zip), but it still needs a full assessment so I suggest we leave that
for a later patch.
> >
> > rather than:
> >
> > dup v31.4h, w0
> > dup v0.4h, w1
> > ins v31.h[1], w2
> > ins v0.h[1], w3
> > ins v31.h[3], w2
> > ins v0.h[3], w3
> > zip1 v0.8h, v31.8h, v0.8h
> > ret
> >
> > This patch also includes an extensive new test, which includes the above
> > case, as well as adjustments to existing codegen tests as necessary.
> >
> > gcc/ChangeLog:
> >
> > * config/aarch64/aarch64-simd.md
> > (*aarch64_simd_dup_subvector<VQ:mode><VDUP:mode>):
> > New insn pattern.
> > (*aarch64_simd_dup_subvector<VD:mode><VDUP:mode>): Likewise.
> > (@aarch64_simd_vec_set<mode>): Likewise.
> > (vec_set<mode>): Handle 16- and 32-bit vector modes in the
> > expander.
> > * config/aarch64/aarch64.cc (aarch64_choose_vector_init_constant):
> > Handle 16- and 32-bit vector modes.
> > (aarch64_expand_vector_init_fallback): Add logic to initialize vector
> > from starting subsequence. Make static.
> > (scalar_move_insn_p): Consider sub-64-bit vector moves scalar.
> > * config/aarch64/iterators.md (VDUP): New iterator.
> > (elem_bits): Define attribute for sub-64-bit vector modes.
> > (Vetype): Likewise.
> > (VEL): Likewise.
> > (single_wx): Define attribute for sub-64-bit vector and scalar modes.
> > (single_type): Likewise.
> > (Vqduptype): New mode attribute.
> > (Vdduptype): Likewise.
> > (vstype): Define attribute for 64-bit vector and sub-128-bit scalar
> > modes.
> >
> > gcc/testsuite/ChangeLog:
> >
> > * gcc.target/aarch64/ldp_stp_16.c: Adjust testcase.
> > * gcc.target/aarch64/sve/slp_1.c: Likewise.
> > * gcc.target/aarch64/vec-init-18.c: Likewise.
> > * gcc.target/aarch64/vec-init-23.c: New test.
> > ---
> > gcc/config/aarch64/aarch64-simd.md | 48 +-
> > gcc/config/aarch64/aarch64.cc | 46 +-
> > gcc/config/aarch64/iterators.md | 62 ++-
> > gcc/testsuite/gcc.target/aarch64/ldp_stp_16.c | 5 +-
> > gcc/testsuite/gcc.target/aarch64/sve/slp_1.c | 7 +-
> > .../gcc.target/aarch64/vec-init-18.c | 7 +-
> > .../gcc.target/aarch64/vec-init-23.c | 435 ++++++++++++++++++
> > 7 files changed, 587 insertions(+), 23 deletions(-)
> > create mode 100644 gcc/testsuite/gcc.target/aarch64/vec-init-23.c
> >
> > diff --git a/gcc/config/aarch64/aarch64-simd.md
> > b/gcc/config/aarch64/aarch64-simd.md
> > index 855b1ba353c..4bb26621efc 100644
> > --- a/gcc/config/aarch64/aarch64-simd.md
> > +++ b/gcc/config/aarch64/aarch64-simd.md
> > @@ -136,6 +136,28 @@
> > }
> > )
> >
> > +(define_insn "*aarch64_simd_dup_subvector<VQ:mode><VDUP:mode>"
> > + [(set (match_operand:VQ 0 "register_operand")
> > + (vec_duplicate:VQ
> > + (match_operand:VDUP 1 "register_operand")))]
> > + "TARGET_SIMD"
> > + {@ [ cons: =0 , 1 ; attrs: type ]
> > + [ w , w ; neon_dup<VQ:q> ] dup\t%0.<VDUP:Vqduptype>,
> > %1.<VDUP:vstype>[0]
> > + [ w , r ; neon_from_gp<VQ:q> ] dup\t%0.<VDUP:Vqduptype>,
> > %<VDUP:single_wx>1
> > + }
> > +)
> > +
> > +(define_insn "*aarch64_simd_dup_subvector<VD:mode><VDUP:mode>"
> > + [(set (match_operand:VD 0 "register_operand")
> > + (vec_duplicate:VD
> > + (match_operand:VDUP 1 "register_operand")))]
> > + "TARGET_SIMD"
> > + {@ [ cons: =0 , 1 ; attrs: type ]
> > + [ w , w ; neon_dup<VD:q> ] dup\t%0.<VDUP:Vdduptype>,
> > %1.<VDUP:vstype>[0]
> > + [ w , r ; neon_from_gp<VD:q> ] dup\t%0.<VDUP:Vdduptype>,
> > %<VDUP:single_wx>1
> > + }
> > +)
>
> Since VDUP also contains 64-bit elements doesn't this create odd combinations
> Like v8qiv8qi?
>
> Indeed and in those cases Vdduptype returns "" so the assembly is broken
>
> (define_insn ("*aarch64_simd_dup_subvectorv8qiv8qi")
> [
> (set (match_operand:V8QI 0 ("register_operand") ("=w,w"))
> (vec_duplicate:V8QI (match_operand:V8QI 1 ("register_operand")
> ("w,r"))))
> ] ("TARGET_SIMD") ("@
> dup\t%0., %1.d[0]
> dup\t%0., %x1")
> [
> (set_attr ("type") ("neon_dup,neon_from_gp"))
> ])
>
> I think also for the RTL these shouldn't be just a `vec_duplicate` because
> some of
> them change the interpretation of the elements too. Like
>
> aarch64_simd_dup_subvectorv8bfsf is
>
> (set (match_operand:V8BF 0 ("register_operand") ("=w,w"))
> (vec_duplicate:V8BF (match_operand:SF 1 ("register_operand")
> ("w,r"))))
>
> Which is invalid for a vec_duplicate as the submodes aren't the same.
>
> So I think these patterns should be split into not just 64 and 128 bit
> duplicates
> but into also into duplicates of the same element type and not.
>
> For the mismatch elements you'll need to use a subreg to convert the type to
> make it valid RTL.
Thank you for this observation; the intent was indeed to only support
vec_duplicates of smaller types into larger ones, and only where inner
modes are the same. I have added a couple more necessary attributes and
modified the patterns to be driven only by the mode of the inner operand.
>
> > +
> > (define_insn "@aarch64_dup_lane<mode>"
> > [(set (match_operand:VALL_F16 0 "register_operand" "=w")
> > (vec_duplicate:VALL_F16
> > @@ -1282,6 +1304,30 @@
> > [(set_attr "type" "neon_ins<q>, neon_from_gp<q>,
> > neon_load1_one_lane<q>")]
> > )
> >
> > +(define_insn "@aarch64_simd_vec_set<mode>"
> > + [(set (match_operand:VSUB64 0 "register_operand" "=r, w")
> > + (vec_merge:VSUB64
> > + (vec_duplicate:VSUB64
> > + (match_operand:<VEL> 1
> > "aarch64_simd_nonimmediate_operand" "r, w"))
> > + (match_operand:VSUB64 3 "register_operand" "0, 0")
> > + (match_operand:SI 2 "immediate_operand" "i, i")))]
> > + "TARGET_SIMD && exact_log2 (INTVAL (operands[2])) >= 0"
> > + {
> > + int elt = exact_log2 (INTVAL (operands[2]));
> > + switch (which_alternative)
> > + {
> > + case 0:
> > + operands[2] = GEN_INT (elt * <elem_bits>);
> > + return "bfi\t%w0, %w1, %2, <elem_bits>";
> > + case 1:
> > + return "ins\t%0.<Vetype>[%p2], %1.<Vetype>[0]";
> > + default:
> > + gcc_unreachable ();
> > + }
> > + }
> > + [(set_attr "type" "bfm, neon_ins")]
> > +)
> > +
>
> aarch64_simd_nonimmediate_operand allows memory as well so
> in the above your predicate is wider than your constraint and would
> force a reload which is probably not what we want.
>
> Any reason for not allowing Utv here like the other patterns do?
> But if we really don't want the load the predicate should be register_operand.
No reason to speak of; will add a Utv alternative.
>
> > ;; Inserting from the zero register into a vector lane is treated as an
> > ;; expensive GP->FP move on all CPUs. Avoid it when optimizing for speed.
> > (define_insn "aarch64_simd_vec_set_zero<mode>"
> > @@ -1711,7 +1757,7 @@
> > )
> >
> > (define_expand "vec_set<mode>"
> > - [(match_operand:VALL_F16 0 "register_operand")
> > + [(match_operand:VALL_F16_SUB64 0 "register_operand")
> > (match_operand:<VEL> 1 "aarch64_simd_nonimmediate_operand")
> > (match_operand:SI 2 "immediate_operand")]
> > "TARGET_SIMD"
> > diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc
> > index 257c193fa64..5b1afa50ff8 100644
> > --- a/gcc/config/aarch64/aarch64.cc
> > +++ b/gcc/config/aarch64/aarch64.cc
> > @@ -25446,7 +25446,9 @@ aarch64_choose_vector_init_constant
> > (machine_mode mode, rtx vals)
> > }
> >
> > rtx c = gen_rtx_CONST_VECTOR (mode, copy);
> > - if (aarch64_simd_valid_mov_imm (c))
> > + if (aarch64_simd_valid_mov_imm (c)
> > + || (INTEGRAL_MODE_P (mode)
> > + && aarch64_advsimd_partial_mode_p (mode)))
> > return c;
>
> I'm having trouble figuring out why this particular change is needed.
> when I remove it I see test_int8_9 failing because the initial constant
> is different.
>
> This code seems to want to preserve the repeating pattern as a constant
>
> (const_vector:V4QI [
> (const_int 0 [0]) repeated x2
> (const_int 1 [0x1]) repeated x2
> ])
>
> For an insertion
>
> (parallel:V4QI [
> (subreg/s/u:QI (reg/v:SI 102 [ x0 ]) 0)
> (const_int 0 [0])
> (subreg/s/u:QI (reg/v:SI 103 [ x1 ]) 0)
> (const_int 1 [0x1])
> ])
>
> Now the first one isn't a valid mov immediate so you don't return early but
> the loop below knows that you're going to overwrite index 0 and 2 so it
> instead
> tries the constant
>
> (const_vector:V4QI [
> (const_int 0 [0]) repeated x3
> (const_int 1 [0x1])
> ])
>
> Which is correct and valid and if both these fail you get c anyway.
>
> So it seems like the check isn't needed?
Yes, this is another leftover from previous revisions when for some reason
I wasn't sure that small modes will be handled correctly here. I will
drop this for v2.
Thank you for the review,
Artemiy
>
> Thanks,
> Tamar
> >
> > /* Try generating a stepped sequence. */
> > @@ -25487,7 +25489,7 @@ aarch64_choose_vector_init_constant
> > (machine_mode mode, rtx vals)
> > The caller has already tried a divide-and-conquer approach, so do
> > not consider that case here. */
> >
> > -void
> > +static void
> > aarch64_expand_vector_init_fallback (rtx target, rtx vals)
> > {
> > machine_mode mode = GET_MODE (target);
> > @@ -25545,6 +25547,43 @@ aarch64_expand_vector_init_fallback (rtx
> > target, rtx vals)
> > return;
> > }
> >
> > + /* Check if the vector can be represented as a duplicate of a
> > + subvector starting at index 0. */
> > + if (pow2p_hwi (n_elts))
> > + {
> > + bool halves_equal = true;
> > + int n_seq = n_elts;
> > + while (n_seq > 2)
> > + {
> > + for (int i = 0; i < n_seq / 2; i++)
> > + if (!rtx_equal_p (XVECEXP (vals, 0, i),
> > + XVECEXP (vals, 0, i + n_seq / 2)))
> > + {
> > + halves_equal = false;
> > + break;
> > + }
> > +
> > + if (!halves_equal)
> > + break;
> > +
> > + n_seq /= 2;
> > + }
> > +
> > + if (n_seq != n_elts)
> > + {
> > + machine_mode subv_mode = mode_for_vector (inner_mode,
> > + n_seq).require ();
> > + rtx new_target = gen_reg_rtx (subv_mode);
> > + rtvec new_vals = rtvec_alloc (n_seq);
> > + for (int i = 0; i < n_seq; i++)
> > + RTVEC_ELT (new_vals, i) = XVECEXP (vals, 0, i);
> > + aarch64_expand_vector_init (new_target,
> > + gen_rtx_PARALLEL (subv_mode,
> > new_vals));
> > + aarch64_emit_move (target, gen_vec_duplicate (mode,
> > new_target));
> > + return;
> > + }
> > + }
> > +
> > enum insn_code icode = optab_handler (vec_set_optab, mode);
> > gcc_assert (icode != CODE_FOR_nothing);
> >
> > @@ -25704,7 +25743,8 @@ scalar_move_insn_p (rtx set)
> > rtx src = SET_SRC (set);
> > rtx dest = SET_DEST (set);
> > return (is_a<scalar_mode> (GET_MODE (dest))
> > - && aarch64_mov_operand (src, GET_MODE (dest)));
> > + && aarch64_mov_operand (src, GET_MODE (dest)))
> > + || aarch64_advsimd_partial_mode_p (GET_MODE (dest));
> > }
> >
> > /* Similar to seq_cost, but ignore cost for scalar moves. */
> > diff --git a/gcc/config/aarch64/iterators.md
> > b/gcc/config/aarch64/iterators.md
> > index dfca3327f1f..1fc67d95bd4 100644
> > --- a/gcc/config/aarch64/iterators.md
> > +++ b/gcc/config/aarch64/iterators.md
> > @@ -139,6 +139,10 @@
> > ;; VQMOV without 2-element modes.
> > (define_mode_iterator VQMOV_NO2E [V16QI V8HI V4SI V8HF V8BF V4SF])
> >
> > +;; Modes that can be duplicated into a 64/128-bit register.
> > +(define_mode_iterator VDUP [V8QI V4QI V2QI QI V4HI V2HI HI V2SI SI DI
> > + V4BF V2BF BF V4HF V2HF HF V2SF SF DF])
> > +
> > ;; Double integer vector modes.
> > (define_mode_iterator VD_I [V8QI V4HI V2SI DI])
> >
> > @@ -1488,7 +1492,9 @@
> >
> > ;; The number of bits in a vector element, or controlled by a predicate
> > ;; element.
> > -(define_mode_attr elem_bits [(VNx16BI "8") (VNx8BI "16")
> > +(define_mode_attr elem_bits [(V2QI "8") (V4QI "8") (V2HF "16") (V2HI
> > "16")
> > + (V2BF "16")
> > + (VNx16BI "8") (VNx8BI "16")
> > (VNx4BI "32") (VNx2BI "64")
> > (VNx16QI "8") (VNx32QI "8") (VNx64QI "8")
> > (VNx8HI "16") (VNx16HI "16") (VNx32HI "16")
> > @@ -1593,11 +1599,12 @@
> >
> > ;; Mode-to-individual element type mapping.
> > (define_mode_attr Vetype [(V8QI "b") (V16QI "b")
> > - (V4HI "h") (V8HI "h")
> > + (V2QI "b") (V4QI "b")
> > + (V4HI "h") (V8HI "h") (V2HI "h")
> > (V2SI "s") (V4SI "s")
> > (V2DI "d") (V1DI "d")
> > - (V4HF "h") (V8HF "h")
> > - (V2SF "s") (V4SF "s")
> > + (V4HF "h") (V8HF "h") (V2HF "h")
> > + (V2SF "s") (V4SF "s") (V2BF "h")
> > (V2DF "d") (V1DF "d")
> > (V2x8QI "b") (V2x4HI "h")
> > (V2x2SI "s") (V2x1DI "d")
> > @@ -1772,8 +1779,10 @@
> > (V4x2DF "v2df") (V4x8BF "v8bf")])
> >
> > ;; Define element mode for each vector mode.
> > -(define_mode_attr VEL [(V8QI "QI") (V16QI "QI")
> > +(define_mode_attr VEL [(V8QI "QI") (V16QI "QI")
> > + (V2QI "QI") (V4QI "QI")
> > (V4HI "HI") (V8HI "HI")
> > + (V2HI "HI") (V2HF "HF")
> > (V2SI "SI") (V4SI "SI")
> > (DI "DI") (V1DI "DI")
> > (V2DI "DI")
> > @@ -1784,6 +1793,7 @@
> > (SI "SI") (HI "HI")
> > (QI "QI")
> > (V4BF "BF") (V8BF "BF")
> > + (V2BF "BF")
> > (V2x8QI "QI") (V2x4HI "HI")
> > (V2x2SI "SI") (V2x1DI "DI")
> > (V2x4HF "HF") (V2x2SF "SF")
> > @@ -2037,6 +2047,26 @@
> > (define_mode_attr V2ntype [(V8HI "16b") (V4SI "8h")
> > (V2DI "4s")])
> >
> > +;; Register suffix used when duplicating a value of a certain mode
> > +;; into a full 128-bit AdvSIMD register.
> > +(define_mode_attr Vqduptype [(QI "16b") (V2QI "8h") (V4QI "4s") (V8QI
> > "2d")
> > + (HI "8h") (V2HI "4s") (V4HI "2d")
> > + (HF "8h") (V2HF "4s") (V4HF "2d")
> > + (BF "8h") (V2BF "4s") (V4BF "2d")
> > + (SI "4s") (V2SI "2d")
> > + (SF "4s") (V2SF "2d")
> > + (DI "2d") (DF "2d")])
> > +
> > +;; Register suffix used when duplicating a value of a certain mode
> > +;; into a partial 64-bit AdvSIMD register.
> > +(define_mode_attr Vdduptype [(QI "8b") (V2QI "4h") (V4QI "2s") (V8QI "")
> > + (HI "4h") (V2HI "2s") (V4HI "")
> > + (HF "4h") (V2HF "2s") (V4HF "")
> > + (BF "4h") (V2BF "2s") (V4BF "")
> > + (SI "2s") (V2SI "")
> > + (SF "2s") (V2SF "")
> > + (DI "") (DF "")])
> > +
> > ;; The result of FCVTN on two vectors of the given mode. The result has
> > ;; twice as many QI elements as the input.
> > (define_mode_attr VPACKB [(V4HF "V8QI") (V8HF "V16QI") (V4SF "V8QI")])
> > @@ -2161,8 +2191,13 @@
> > ;; Whether a mode fits in W or X registers (i.e. "w" for 32-bit modes
> > ;; and "x" for 64-bit modes).
> > (define_mode_attr single_wx [(SI "w") (SF "w")
> > + (V2QI "w") (V4QI "w")
> > (V8QI "x") (V4HI "x")
> > (V4HF "x") (V4BF "x")
> > + (V2HI "w") (V2HF "w")
> > + (HF "w") (QI "w")
> > + (V2BF "w") (BF "w")
> > + (HI "w")
> > (V2SI "x") (V2SF "x")
> > (DI "x") (DF "x")])
> >
> > @@ -2172,7 +2207,12 @@
> > (V8QI "d") (V4HI "d")
> > (V4HF "d") (V4BF "d")
> > (V2SI "d") (V2SF "d")
> > - (DI "d") (DF "d")])
> > + (DI "d") (DF "d")
> > + (QI "b") (BF "h")
> > + (V2HF "s") (HI "h")
> > + (V4QI "s") (V2QI "h")
> > + (V2HI "s") (V2BF "s")
> > + (HF "h")])
> >
> > ;; Whether a double-width mode fits in D or Q registers (i.e. "d" for
> > ;; 32-bit modes and "q" for 64-bit modes).
> > @@ -2182,9 +2222,13 @@
> > (V2SI "q") (V2SF "q")
> > (DI "q") (DF "q")])
> >
> > -;; Scalar size of a sub-64-bit vector mode.
> > -(define_mode_attr vstype [(V4QI "s") (V2QI "h")
> > - (V2HI "s") (V2BF "s") (V2HF "s")])
> > +;; Scalar size of a sub-128-bit vector or scalar mode.
> > +(define_mode_attr vstype [(V8QI "d") (V4QI "s") (V2QI "h") (QI "b")
> > + (V4HI "d") (V2HI "s") (HI "h")
> > + (V2SI "d") (SI "s") (DI "d")
> > + (V4BF "d") (V2BF "s") (BF "h")
> > + (V4HF "d") (V2HF "s") (HF "h")
> > + (V2SF "d") (SF "s") (DF "d")])
> >
> > ;; Define corresponding core/FP element mode for each vector mode.
> > (define_mode_attr vw [(V8QI "w") (V16QI "w")
> > diff --git a/gcc/testsuite/gcc.target/aarch64/ldp_stp_16.c
> > b/gcc/testsuite/gcc.target/aarch64/ldp_stp_16.c
> > index 95835aa2eb4..a6b4d50f34f 100644
> > --- a/gcc/testsuite/gcc.target/aarch64/ldp_stp_16.c
> > +++ b/gcc/testsuite/gcc.target/aarch64/ldp_stp_16.c
> > @@ -96,9 +96,8 @@ CONS2_FN (4, float);
> >
> > /*
> > ** cons2_8_float:
> > -** dup v[0-9]+\.2s, v[0-9]+\.s\[0\]
> > -** dup v[0-9]+\.2s, v[0-9]+\.s\[0\]
> > -** zip1 v([0-9]+)\.4s, v[0-9]+\.4s, v[0-9]+\.4s
> > +** uzp1 v1\.2s, v0\.2s, v1\.2s
> > +** dup v([0-9]+)\.2d, v1\.d\[0\]
> > ** stp q\1, q\1, \[x0\]
> > ** stp q\1, q\1, \[x0, #?32\]
> > ** ret
> > diff --git a/gcc/testsuite/gcc.target/aarch64/sve/slp_1.c
> > b/gcc/testsuite/gcc.target/aarch64/sve/slp_1.c
> > index 98e8ac3c785..2bb2c04fa20 100644
> > --- a/gcc/testsuite/gcc.target/aarch64/sve/slp_1.c
> > +++ b/gcc/testsuite/gcc.target/aarch64/sve/slp_1.c
> > @@ -30,14 +30,13 @@ vec_slp_##TYPE (TYPE *restrict a, TYPE b, TYPE c, int
> > n) \
> > TEST_ALL (VEC_PERM)
> >
> > /* We should use one DUP for each of the 8-, 16- and 32-bit types,
> > - (for now, insert both elements with umov + ins for _Float16). We should
> > use two
> > + (for now, insert both elements with ins for _Float16). We should use
> > two
> > DUPs for each of the three 64-bit types. */
> > /* { dg-final { scan-assembler-times {\tmov\tz[0-9]+\.h, [hw]} 2 } } */
> > /* { dg-final { scan-assembler-times {\tmov\tz[0-9]+\.s, [sw]} 3 } } */
> > /* { dg-final { scan-assembler-times {\tmov\tz[0-9]+\.d, [dx]} 9 } } */
> > -/* { dg-final { scan-assembler-times {\tumov\tw[0-9]+, v[0-9]+\.h} 2 } } */
> > -/* { dg-final { scan-assembler-times {\tins\tv[0-9]+\.h\[0\], w[0-9]+} 1 }
> > } */
> > -/* { dg-final { scan-assembler-times {\tins\tv[0-9]+\.h\[1\], w[0-9]+} 1 }
> > } */
> > +/* { dg-final { scan-assembler-times {\tins\tv[0-9]+\.h\[0\],
> > v[0-9]+\.h\[0\]}
> > 1 } } */
> > +/* { dg-final { scan-assembler-times {\tins\tv[0-9]+\.h\[1\],
> > v[0-9]+\.h\[0\]}
> > 1 } } */
> > /* { dg-final { scan-assembler-times {\tzip1\tz[0-9]+\.d, z[0-9]+\.d, z[0-
> > 9]+\.d\n} 3 } } */
> > /* { dg-final { scan-assembler-not {\tzip2\t} } } */
> >
> > diff --git a/gcc/testsuite/gcc.target/aarch64/vec-init-18.c
> > b/gcc/testsuite/gcc.target/aarch64/vec-init-18.c
> > index ecb59fe510b..feeb181a0b5 100644
> > --- a/gcc/testsuite/gcc.target/aarch64/vec-init-18.c
> > +++ b/gcc/testsuite/gcc.target/aarch64/vec-init-18.c
> > @@ -15,6 +15,7 @@ int16x8_t foo2(int16_t x)
> > return v;
> > }
> >
> > -/* { dg-final { scan-assembler-times {\tdup\tv[0-9]+\.4h, w[0-9]+} 3 } } */
> > -/* { dg-final { scan-assembler {\tmovi\tv[0-9]+\.4h, 0x1} } } */
> > -/* { dg-final { scan-assembler-times {\tzip1\tv[0-9]+\.8h, v[0-9]+\.8h,
> > v[0-
> > 9]+\.8h} 2 } } */
> > +/* { dg-final { scan-assembler-times {\tdup\tv[0-9]+\.4s, w[0-9]+} 2 } } */
> > +/* { dg-final { scan-assembler-times {\tmov\tw[0-9]+, 65537} 1 } } */
> > +/* { dg-final { scan-assembler-times {\tbfi\tw[0-9]+, w[0-9]+, 0, 16} 1 }
> > } */
> > +/* { dg-final { scan-assembler-times {\tbfi\tw[0-9]+, w[0-9]+, 16, 16} 1 }
> > } */
> > diff --git a/gcc/testsuite/gcc.target/aarch64/vec-init-23.c
> > b/gcc/testsuite/gcc.target/aarch64/vec-init-23.c
> > new file mode 100644
> > index 00000000000..595470b29fb
> > --- /dev/null
> > +++ b/gcc/testsuite/gcc.target/aarch64/vec-init-23.c
> > @@ -0,0 +1,435 @@
> > +/* { dg-do compile } */
> > +/* { dg-options "-O2 -march=armv8.2-a+fp16" } */
> > +/* { dg-final { check-function-bodies "**" "" "" } } */
> > +
> > +#include <arm_neon.h>
> > +
> > +/* Check vector initialization with a repeating sequence of elements. */
> > +
> > +#ifndef TESTCASE
> > +#define TESTCASE(TYPE, ETYPE, T, SZ, NUM, MULT, ...)\
> > + TYPE##SZ##MULT##_t test_##TYPE##SZ##_##NUM (ETYPE x0, ETYPE x1,
> > ETYPE x2, ETYPE x3,\
> > + ETYPE x4, ETYPE x5, ETYPE x6, ETYPE
> > x7)\
> > + {\
> > + return (TYPE##SZ##MULT##_t) {__VA_ARGS__};\
> > + }
> > +#endif
> > +
> > +#define TEST_8(TYPE, ETYPE, T)\
> > + TESTCASE (TYPE, ETYPE, T, 8, 1, x16, x0, x0, x0, x0, x0, x0, x0, x0,\
> > + x0, x0, x0, x0, x0, x0, x0, x0)\
> > + TESTCASE (TYPE, ETYPE, T, 8, 2, x16, x0, x1, x0, x1, x0, x1, x0, x1,\
> > + x0, x1, x0, x1, x0, x1, x0, x1)\
> > + TESTCASE (TYPE, ETYPE, T, 8, 3, x16, x0, x1, x2, x3, x0, x1, x2, x3,\
> > + x0, x1, x2, x3, x0, x1, x2, x3)\
> > + TESTCASE (TYPE, ETYPE, T, 8, 4, x16, x0, x1, x2, x3, x4, x5, x6, x7,\
> > + x0, x1, x2, x3, x4, x5, x6, x7)\
> > + TESTCASE (TYPE, ETYPE, T, 8, 5, x16, x0, 0, x0, 0, x0, 0, x0, 0,\
> > + x0, 0, x0, 0, x0, 0, x0, 0)\
> > + TESTCASE (TYPE, ETYPE, T, 8, 6, x16, 0, x0, 0, x0, 0, x0, 0, x0,\
> > + 0, x0, 0, x0, 0, x0, 0, x0)\
> > + TESTCASE (TYPE, ETYPE, T, 8, 7, x16, x0, x1, 0, 1, x0, x1, 0, 1,\
> > + x0, x1, 0, 1, x0, x1, 0, 1)\
> > + TESTCASE (TYPE, ETYPE, T, 8, 8, x16, 0, 1, x0, x1, 0, 1, x0, x1,\
> > + 0, 1, x0, x1, 0, 1, x0, x1)\
> > + TESTCASE (TYPE, ETYPE, T, 8, 9, x16, x0, 0, x1, 1, x0, 0, x1, 1,\
> > + x0, 0, x1, 1, x0, 0, x1, 1)\
> > + TESTCASE (TYPE, ETYPE, T, 8, 10, x16, x0, 0, x1, 1, x2, 2, x3, 3,\
> > + x0, 0, x1, 1, x2, 2, x3, 3)\
> > + TESTCASE (TYPE, ETYPE, T, 8, 11, x16, 0, x0, 1, x1, 2, x2, 3, x3,\
> > + 0, x0, 1, x1, 2, x2, 3, x3)\
> > + TESTCASE (TYPE, ETYPE, T, 8, 12, x16, x0, x1, 0, 1, x2, x3, 2, 3,\
> > + x0, x1, 0, 1, x2, x3, 2, 3)\
> > + TESTCASE (TYPE, ETYPE, T, 8, 13, x16, 0, 1, x0, x1, 2, 3, x2, x3,\
> > + 0, 1, x0, x1, 2, 3, x2, x3)
> > +
> > +#define TEST_16(TYPE, ETYPE, T)\
> > + TESTCASE (TYPE, ETYPE, T, 16, 1, x8, x0, x0, x0, x0, x0, x0, x0, x0)\
> > + TESTCASE (TYPE, ETYPE, T, 16, 2, x8, x0, x1, x0, x1, x0, x1, x0, x1)\
> > + TESTCASE (TYPE, ETYPE, T, 16, 3, x8, x0, x1, x2, x3, x0, x1, x2, x3)\
> > + TESTCASE (TYPE, ETYPE, T, 16, 4, x8, x0, 0, x0, 0, x0, 0, x0, 0)\
> > + TESTCASE (TYPE, ETYPE, T, 16, 5, x8, 0, x0, 0, x0, 0, x0, 0, x0)\
> > + TESTCASE (TYPE, ETYPE, T, 16, 6, x8, x0, x1, 0, 1, x0, x1, 0, 1)\
> > + TESTCASE (TYPE, ETYPE, T, 16, 7, x8, 0, 1, x0, x1, 0, 1, x0, x1)\
> > + TESTCASE (TYPE, ETYPE, T, 16, 8, x8, 0, x0, 1, x1, 0, x0, 1, x1)\
> > +
> > +#define TEST_32(TYPE, ETYPE, T)\
> > + TESTCASE (TYPE, ETYPE, T, 32, 1, x4, x0, x0, x0, x0)\
> > + TESTCASE (TYPE, ETYPE, T, 32, 2, x4, x0, x1, x0, x1)\
> > + TESTCASE (TYPE, ETYPE, T, 32, 3, x4, x0, 0, x0, 0)\
> > + TESTCASE (TYPE, ETYPE, T, 32, 4, x4, 0, x0, 0, x0)
> > +
> > +#define TEST_64(TYPE, ETYPE, T)\
> > + TESTCASE (TYPE, ETYPE, T, 64, 1, x2, x0, x0)
> > +
> > +TEST_8(int, int8_t, s)
> > +
> > +TEST_16(float, float, f)
> > +TEST_16(int, int16_t, s)
> > +
> > +TEST_32(float, float, f)
> > +TEST_32(int, int32_t, s)
> > +
> > +TEST_64(float, double, f)
> > +TEST_64(int, int64_t, s)
> > +
> > +/*
> > +** test_int8_1:
> > +** dup v0\.16b, w0
> > +** ret
> > +*/
> > +
> > +/*
> > +** test_int8_2:
> > +** bfi w0, w1, 8, 8
> > +** dup v0\.8h, w0
> > +** ret
> > +*/
> > +
> > +/*
> > +** test_int8_3:
> > +** bfi w0, w1, 8, 8
> > +** bfi w0, w2, 16, 8
> > +** bfi w0, w3, 24, 8
> > +** dup v0\.4s, w0
> > +** ret
> > +*/
> > +
> > +/*
> > +** test_int8_4:
> > +** bfi w0, w2, 8, 8
> > +** bfi w1, w3, 8, 8
> > +** bfi w0, w4, 16, 8
> > +** bfi w1, w5, 16, 8
> > +** bfi w0, w6, 24, 8
> > +** bfi w1, w7, 24, 8
> > +** dup v31\.2s, w0
> > +** dup v0\.2s, w1
> > +** zip1 v0\.16b, v31\.16b, v0\.16b
> > +** ret
> > +*/
> > +
> > +/*
> > +** test_int8_5:
> > +** mov w1, 0
> > +** bfi w1, w0, 0, 8
> > +** dup v0\.8h, w1
> > +** ret
> > +*/
> > +
> > +/*
> > +** test_int8_6:
> > +** mov w1, 0
> > +** bfi w1, w0, 8, 8
> > +** dup v0\.8h, w1
> > +** ret
> > +*/
> > +
> > +/*
> > +** test_int8_7:
> > +** mov w2, 16777472
> > +** bfi w2, w0, 0, 8
> > +** bfi w2, w1, 8, 8
> > +** dup v0\.4s, w2
> > +** ret
> > +*/
> > +
> > +/*
> > +** test_int8_8:
> > +** mov w2, 16777472
> > +** bfi w2, w0, 16, 8
> > +** bfi w2, w1, 24, 8
> > +** dup v0\.4s, w2
> > +** ret
> > +*/
> > +
> > +/*
> > +** test_int8_9:
> > +** mov w2, 16842752
> > +** bfi w2, w0, 0, 8
> > +** bfi w2, w1, 16, 8
> > +** dup v0\.4s, w2
> > +** ret
> > +*/
> > +
> > +/*
> > +** test_int8_10:
> > +** bfi w0, w1, 8, 8
> > +** bfi w0, w2, 16, 8
> > +** bfi w0, w3, 24, 8
> > +** dup v31\.2s, w0
> > +** adrp x0, .LANCHOR[0-9]+
> > +** ldr d0, \[x0, #:lo12:.LANCHOR[0-9]+\]
> > +** zip1 v0\.16b, v31\.16b, v0\.16b
> > +** ret
> > +*/
> > +
> > +/*
> > +** test_int8_11:
> > +** bfi w0, w1, 8, 8
> > +** adrp x4, .LANCHOR[0-9]+
> > +** bfi w0, w2, 16, 8
> > +** ldr d0, \[x4, #:lo12:\.LANCHOR[0-9]+\]
> > +** bfi w0, w3, 24, 8
> > +** dup v31\.2s, w0
> > +** zip1 v0\.16b, v0\.16b, v31\.16b
> > +** ret
> > +*/
> > +
> > +/*
> > +** test_int8_12:
> > +** mov w4, 33685504
> > +** bfi w4, w0, 0, 8
> > +** mov w0, 257
> > +** movk w0, 0x303, lsl 16
> > +** bfi w0, w1, 0, 8
> > +** bfi w4, w2, 16, 8
> > +** bfi w0, w3, 16, 8
> > +** dup v31\.2s, w4
> > +** dup v0\.2s, w0
> > +** zip1 v0\.16b, v31\.16b, v0\.16b
> > +** ret
> > +*/
> > +
> > +/*
> > +** test_int8_13:
> > +** mov w4, 33685504
> > +** bfi w4, w0, 8, 8
> > +** mov w0, 257
> > +** movk w0, 0x303, lsl 16
> > +** bfi w0, w1, 8, 8
> > +** bfi w4, w2, 24, 8
> > +** bfi w0, w3, 24, 8
> > +** dup v31\.2s, w4
> > +** dup v0\.2s, w0
> > +** zip1 v0\.16b, v31\.16b, v0\.16b
> > +** ret
> > +*/
> > +
> > +/*
> > +** test_float16_1:
> > +** fcvt h0, s0
> > +** dup v0\.8h, v0\.h\[0\]
> > +** ret
> > +*/
> > +
> > +/*
> > +** test_float16_2:
> > +** fcvt h1, s1
> > +** fcvt h0, s0
> > +** ins v0\.h\[1\], v1\.h\[0\]
> > +** dup v0\.4s, v0\.s\[0\]
> > +** ret
> > +*/
> > +
> > +/*
> > +** test_float16_3:
> > +** uzp1 v2\.2s, v0\.2s, v2\.2s
> > +** uzp1 v3\.2s, v1\.2s, v3\.2s
> > +** zip1 v3\.4s, v2\.4s, v3\.4s
> > +** fcvtn v0\.4h, v3\.4s
> > +** uzp1 v0\.2d, v0\.2d, v0\.2d
> > +** ret
> > +*/
> > +
> > +/*
> > +** test_float16_4:
> > +** fcvt h0, s0
> > +** movi v31\.2d, #0
> > +** ins v31\.h\[0\], v0\.h\[0\]
> > +** dup v0\.4s, v31\.s\[0\]
> > +** ret
> > +*/
> > +
> > +/*
> > +** test_float16_5:
> > +** fcvt h0, s0
> > +** movi v31\.2d, #0
> > +** ins v31\.h\[1\], v0\.h\[0\]
> > +** dup v0\.4s, v31\.s\[0\]
> > +** ret
> > +*/
> > +
> > +/*
> > +** test_float16_6:
> > +** fcvt h1, s1
> > +** fcvt h0, s0
> > +** movi v31\.2d, #0
> > +** mov w0, 1006648320
> > +** umov w1, v1\.h\[0\]
> > +** ins v31\.h\[0\], v0\.h\[0\]
> > +** bfi w0, w1, 0, 16
> > +** dup v31\.2s, v31\.s\[0\]
> > +** dup v0\.2s, w0
> > +** zip1 v0\.8h, v31\.8h, v0\.8h
> > +** ret
> > +*/
> > +
> > +/*
> > +** test_float16_7:
> > +** fcvt h1, s1
> > +** fcvt h0, s0
> > +** movi v31\.2d, #0
> > +** mov w0, 1006648320
> > +** umov w1, v1\.h\[0\]
> > +** ins v31\.h\[1\], v0\.h\[0\]
> > +** bfi w0, w1, 16, 16
> > +** dup v31\.2s, v31\.s\[0\]
> > +** dup v0\.2s, w0
> > +** zip1 v0\.8h, v31\.8h, v0\.8h
> > +** ret
> > +*/
> > +
> > +/*
> > +** test_float16_8:
> > +** fcvt h1, s1
> > +** fcvt h0, s0
> > +** movi v31\.2s, 0x3c, lsl 24
> > +** ins v0\.h\[1\], v1\.h\[0\]
> > +** dup v0\.2s, v0\.s\[0\]
> > +** zip1 v0\.8h, v31\.8h, v0\.8h
> > +** ret
> > +*/
> > +
> > +/*
> > +** test_int16_1:
> > +** dup v0\.8h, w0
> > +** ret
> > +*/
> > +
> > +/*
> > +** test_int16_2:
> > +** bfi w0, w1, 16, 16
> > +** dup v0\.4s, w0
> > +** ret
> > +*/
> > +
> > +/*
> > +** test_int16_3:
> > +** bfi w0, w2, 16, 16
> > +** bfi w1, w3, 16, 16
> > +** dup v31\.2s, w0
> > +** dup v0\.2s, w1
> > +** zip1 v0\.8h, v31\.8h, v0\.8h
> > +** ret
> > +*/
> > +
> > +/*
> > +** test_int16_4:
> > +** mov w1, 0
> > +** bfi w1, w0, 0, 16
> > +** dup v0\.4s, w1
> > +** ret
> > +*/
> > +
> > +/*
> > +** test_int16_5:
> > +** mov w1, 0
> > +** bfi w1, w0, 16, 16
> > +** dup v0\.4s, w1
> > +** ret
> > +*/
> > +
> > +/*
> > +** test_int16_6:
> > +** mov w2, 0
> > +** bfi w2, w0, 0, 16
> > +** mov w0, 65537
> > +** bfi w0, w1, 0, 16
> > +** dup v31\.2s, w2
> > +** dup v0\.2s, w0
> > +** zip1 v0\.8h, v31\.8h, v0\.8h
> > +** ret
> > +*/
> > +
> > +/*
> > +** test_int16_7:
> > +** mov w2, 0
> > +** bfi w2, w0, 16, 16
> > +** mov w0, 65537
> > +** bfi w0, w1, 16, 16
> > +** dup v31\.2s, w2
> > +** dup v0\.2s, w0
> > +** zip1 v0\.8h, v31\.8h, v0\.8h
> > +** ret
> > +*/
> > +
> > +/*
> > +** test_int16_8:
> > +** bfi w0, w1, 16, 16
> > +** movi v0\.2s, 0x1, lsl 16
> > +** dup v31\.2s, w0
> > +** zip1 v0\.8h, v0\.8h, v31\.8h
> > +** ret
> > +*/
> > +
> > +/*
> > +** test_float32_1:
> > +** dup v0\.4s, v0\.s\[0\]
> > +** ret
> > +*/
> > +
> > +/*
> > +** test_float32_2:
> > +** uzp1 v0\.2s, v0\.2s, v1\.2s
> > +** dup v0\.2d, v0\.d\[0\]
> > +** ret
> > +*/
> > +
> > +/*
> > +** test_float32_3:
> > +** movi v31\.2s, 0
> > +** dup v0\.2s, v0\.s\[0\]
> > +** zip1 v0\.4s, v0\.4s, v31\.4s
> > +** ret
> > +*/
> > +
> > +/*
> > +** test_float32_4:
> > +** movi v31\.2s, 0
> > +** dup v0\.2s, v0\.s\[0\]
> > +** zip1 v0\.4s, v31\.4s, v0\.4s
> > +** ret
> > +*/
> > +
> > +/*
> > +** test_int32_1:
> > +** dup v0\.4s, w0
> > +** ret
> > +*/
> > +
> > +/*
> > +** test_int32_2:
> > +** fmov s0, w0
> > +** ins v0\.s\[1\], w1
> > +** dup v0\.2d, v0\.d\[0\]
> > +** ret
> > +*/
> > +
> > +/*
> > +** test_int32_3:
> > +** dup v31\.2s, w0
> > +** movi v0\.2s, 0
> > +** zip1 v0\.4s, v31\.4s, v0\.4s
> > +** ret
> > +*/
> > +
> > +/*
> > +** test_int32_4:
> > +** dup v31\.2s, w0
> > +** movi v0\.2s, 0
> > +** zip1 v0\.4s, v0\.4s, v31\.4s
> > +** ret
> > +*/
> > +
> > +/*
> > +** test_float64_1:
> > +** dup v0\.2d, v0\.d\[0\]
> > +** ret
> > +*/
> > +
> > +/*
> > +** test_int64_1:
> > +** dup v0\.2d, x0
> > +** ret
> > +*/
> > --
> > 2.43.0
>