On Wed, May 13, 2026 at 07:41:28AM +0100, Tamar Christina wrote:
> > -----Original Message-----
> > From: Artemiy Volkov <[email protected]>
> > Sent: 30 April 2026 18:10
> > To: Tamar Christina <[email protected]>
> > Cc: [email protected]; Wilco Dijkstra <[email protected]>;
> > [email protected]; Richard Earnshaw
> > <[email protected]>; [email protected]; Alice Carlotti
> > <[email protected]>; Alex Coplan <[email protected]>
> > Subject: Re: [PATCH 1/4] aarch64: introduce partial AdvSIMD vector modes
> >
> > On Tue, Apr 28, 2026 at 08:26:07AM +0100, Tamar Christina wrote:
> > > Hi Artemiy,
> > >
> > > > -----Original Message-----
> > > > From: Artemiy Volkov <[email protected]>
> > > > Sent: 27 April 2026 09:06
> > > > To: [email protected]
> > > > Cc: Tamar Christina <[email protected]>; Wilco Dijkstra
> > > > <[email protected]>; [email protected]; Richard
> > > > Earnshaw <[email protected]>; [email protected]; Alice
> > > > Carlotti <[email protected]>; Alex Coplan <[email protected]>;
> > > > Artemiy Volkov <[email protected]>
> > > > Subject: [PATCH 1/4] aarch64: introduce partial AdvSIMD vector modes
> > > >
> > > > In addition to V2HF that already exists, this patch adds 4 more partial
> > > > (16- and 32-bit) AdvSIMD vector modes: V4QI, V2QI, V2HI, and V2BF. For
> > > > now, these are intended only for duplication into full-sized (32-, 64-,
> > > > and 128-bit) registers. As a minimal closure required to bootstrap the
> > > > compiler, this also implements the "mov" expand and the "aarch64_simd_mov"
> > > > insn_and_split for the new modes (gathered under the VSUB64 iterator).
> > > >
> > > > These modes are also added to aarch64_classify_vector_mode (), and are
> > > > classified as VEC_ADVSIMD | VEC_PARTIAL, a yet-untaken value that seems
> > > > to fit the bill. This is then used in
> > >
> > > I haven't reviewed the whole thing yet, however I don't think we want to
> > > use VEC_PARTIAL here as the context for which it's used for SVE is quite
> > > different, different enough that I don't think we should mix them.
> > >
> > > In SVE VEC_PARTIAL means the vector is also using an unpacked bits
> > > representation, whereas your use here uses a packed one. They differentiate
> > > between a container and a data type, whereas here the container and data
> > > type must be the same, i.e. V2QI must use .b for both data and container.
> > >
> > > And lastly there's a mismatch where VN2xSI is considered a partial vector
> > > but here V2SI isn't.
> > >
> > > So instead how about just using in the helper
> > > VECTOR_MODE_P && known_lt (GET_MODE_BITSIZE (mode), 64)
> > > as you do in the constraint, and rename it to something like
> > > aarch64_advsimd_sub_dword_mode_p, since I don't think you actually
> > > need the flag, and it's best not to have it for a separate concept between
> > > SVE and Adv. SIMD.
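For reference, the size-based check suggested above can be modelled outside
of GCC roughly as follows. This is only a sketch: struct toy_mode and every
toy_* name are made up for illustration, the plain "<" stands in for
known_lt on a poly_int, and the extra aarch64_classify_vector_mode () call
needed to filter out VLS SVE modes is not modelled.

```c
#include <stdbool.h>

/* Hypothetical stand-in for a GCC machine mode; it carries only the two
   properties that the suggested check inspects.  */
struct toy_mode
{
  bool is_vector;       /* Models VECTOR_MODE_P (mode).  */
  unsigned int bitsize; /* Models GET_MODE_BITSIZE (mode).  */
};

/* Models the suggested aarch64_advsimd_sub_dword_mode_p: true for a
   vector mode narrower than 64 bits.  */
static bool
toy_sub_dword_mode_p (struct toy_mode mode)
{
  return mode.is_vector && mode.bitsize < 64;
}

/* A few sample modes (illustrative values only).  */
static const struct toy_mode toy_v2qi = { true, 16 };
static const struct toy_mode toy_v4qi = { true, 32 };
static const struct toy_mode toy_v2si = { true, 64 };
static const struct toy_mode toy_si = { false, 32 };
```

With this model, toy_v2qi and toy_v4qi are accepted, while toy_v2si (a full
64-bit vector) and toy_si (a scalar) are rejected.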
> >
> > Hi Tamar,
>
> Hi Artemiy,
>
> >
> > This sounds fair, I'll create the helper and use it everywhere. (I think I
> > still need to call aarch64_classify_vector_mode () to filter out VLS SVE
> > modes though.)
>
> Yeah that's fine.
>
> >
> > >
> > > > aarch64_ira_change_pseudo_allocno_class () to instruct regalloc to
> > > > prefer GENERAL_REGS to FP_REGS for the integer modes, i.e. V4QI, V2QI,
> > > > and V2HI.
> > >
> > > Why GENERAL_REGS over FP_REGS?
> > > It seems more useful to prefer FP_REGS. I see later on in the patches you
> > > want construction using BFM? But BFM typically has the same latency but
> > > lower throughput than INS.
> >
> > So what I tried here is to make 32-bit and smaller modes (V2{Q,H}I, V4QI)
> > behave like integers, as far as regalloc is concerned. IOW, I want to
> > avoid unnecessary regfile transfers when combining 8-bit and 16-bit
> > quantities residing in GPRs. Without this change to the hook, some tuning
> > models would move those into FPRs before doing the combination, and I'm
> > not sure we want that, but this is a rare scenario anyway as it requires an
> > exact tie between GP and FP register classes...
>
> Yeah, I think it makes sense to treat them as FPR and just look at the costing
> if it's required. I have no strong feelings here, but it seems like a more
> natural fit.
>
> Do you have an example I can look at for when this happens?

I recall seeing a difference when trying out vec-init-23.c with
-mtune=generic-armv9-a, but it looks like it's gone when using the latest
version of this patch series. I will therefore drop this hunk from v2.

Thanks,
Artemiy
>
> Thanks,
> Tamar
>
> >
> > >
> > > >
> > > > Some existing testcases were adjusted where needed. (The _Float16
> > > > testcase in sve/slp_1.c temporarily expects GPRs to be used for V2HF,
> > > > which is corrected to FPRs by the succeeding patch; and the half-float
> > > > complex tests now recognize some of the patterns, but check that V2BF
> > > > still can't be used for vectorization.)
> > > >
> > > > gcc/ChangeLog:
> > > >
> > > > * config/aarch64/aarch64-modes.def (VECTOR_MODE): Remove
> > > > V2HF.
> > > > (VECTOR_MODES): Define V2QI, V4QI, V2HI, V2HF, V2BF.
> > > > * config/aarch64/aarch64-simd.md (*aarch64_simd_mov<mode>):
> > > > New
> > > > define_insn_and_split pattern.
> > > > (mov<mode>): Add sub-64-bit vector modes to the VALL_F16
> > > > expander.
> > > > Forego const vector expansion for those modes.
> > > > * config/aarch64/aarch64.cc
> > > > (aarch64_ira_change_pseudo_allocno_class):
> > > > Prefer GPRs for 16- and 32-bit integral vector modes.
> > > > (aarch64_classify_vector_mode): Handle 16- and 32-bit vector
> > > > modes.
> > > > (aarch64_advsimd_partial_mode_p): New predicate.
> > > > (aarch64_vectorize_vec_perm_const): Refuse for partial vector
> > > > modes.
> > > > * config/aarch64/constraints.md (Da): New constraint.
> > > > * config/aarch64/iterators.md (VSUB64): New iterator.
> > > > (VALL_F16_SUB64): Likewise.
> > > > (size): Define attribute for sub-64-bit vector modes.
> > > > (VSC): New mode attribute.
> > > > (vstype): Likewise.
> > > >
> > > > gcc/testsuite/ChangeLog:
> > > >
> > > > * gcc.dg/vect/complex/bb-slp-complex-add-half-float.c: Adjust
> > > > testcase.
> > > > * gcc.dg/vect/complex/bb-slp-complex-mla-half-float.c: Likewise.
> > > > * gcc.dg/vect/complex/bb-slp-complex-mul-half-float.c: Likewise.
> > > > * gcc.target/aarch64/sve/slp_1.c: Likewise.
> > > > ---
> > > > gcc/config/aarch64/aarch64-modes.def | 4 +-
> > > > gcc/config/aarch64/aarch64-simd.md | 64 ++++++++++++-
> > > > gcc/config/aarch64/aarch64.cc | 89 ++++++++++++-------
> > > > gcc/config/aarch64/constraints.md | 5 ++
> > > > gcc/config/aarch64/iterators.md | 19 +++-
> > > > .../complex/bb-slp-complex-add-half-float.c | 2 +
> > > > .../complex/bb-slp-complex-mla-half-float.c | 4 +-
> > > > .../complex/bb-slp-complex-mul-half-float.c | 6 +-
> > > > gcc/testsuite/gcc.target/aarch64/sve/slp_1.c | 11 +--
> > > > 9 files changed, 157 insertions(+), 47 deletions(-)
> > > >
> > > > diff --git a/gcc/config/aarch64/aarch64-modes.def
> > > > b/gcc/config/aarch64/aarch64-modes.def
> > > > index d9bff61adec..d5a54689f7a 100644
> > > > --- a/gcc/config/aarch64/aarch64-modes.def
> > > > +++ b/gcc/config/aarch64/aarch64-modes.def
> > > > @@ -79,8 +79,10 @@ VECTOR_MODES (FLOAT, 8); /*
> > > > V2SF. */
> > > > VECTOR_MODES (FLOAT, 16); /* V4SF V2DF. */
> > > > VECTOR_MODE (INT, DI, 1); /* V1DI. */
> > > > VECTOR_MODE (FLOAT, DF, 1); /* V1DF. */
> > > > -VECTOR_MODE (FLOAT, HF, 2); /* V2HF. */
> > > >
> > > > +VECTOR_MODES (INT, 2); /* V2QI. */
> > > > +VECTOR_MODES (INT, 4); /* V4QI V2HI. */
> > > > +VECTOR_MODES (FLOAT, 4); /* V2BF V2HF. */
> > > >
> > > > /* Integer vector modes used to represent intermediate widened values
> > > > in
> > > > some
> > > > instructions. Not intended to be moved to and from registers or
> > memory.
> > > > */
> > > > diff --git a/gcc/config/aarch64/aarch64-simd.md
> > > > b/gcc/config/aarch64/aarch64-simd.md
> > > > index c314e85927d..855b1ba353c 100644
> > > > --- a/gcc/config/aarch64/aarch64-simd.md
> > > > +++ b/gcc/config/aarch64/aarch64-simd.md
> > > > @@ -49,8 +49,8 @@
> > > > (define_subst_attr "vczbe" "add_vec_concat_subst_be" ""
> > > > "_vec_concatz_be")
> > > >
> > > > (define_expand "mov<mode>"
> > > > - [(set (match_operand:VALL_F16 0 "nonimmediate_operand")
> > > > - (match_operand:VALL_F16 1 "general_operand"))]
> > > > + [(set (match_operand:VALL_F16_SUB64 0 "nonimmediate_operand")
> > > > + (match_operand:VALL_F16_SUB64 1 "general_operand"))]
> > > > "TARGET_FLOAT"
> > > > "
> > > > /* Force the operand into a register if it is not an
> > > > @@ -77,7 +77,8 @@
> > > > aarch64_expand_vector_init (operands[0], operands[1]);
> > > > DONE;
> > > > }
> > > > - else if (!aarch64_simd_imm_zero (operands[1], <MODE>mode)
> > > > + else if (known_ge (GET_MODE_SIZE (<MODE>mode), 8)
> > >
> > > Use the helper?
> > >
> > > > + && !aarch64_simd_imm_zero (operands[1], <MODE>mode)
> > > > && !aarch64_simd_special_constant_p (operands[1],
> > > > <MODE>mode)
> > > > && !aarch64_simd_valid_mov_imm (operands[1]))
> > > > {
> > > > @@ -241,6 +242,63 @@
> > > > }
> > > > )
> > > >
> > > > +(define_insn_and_split "*aarch64_simd_mov<mode>"
> > > > + [(set (match_operand:VSUB64 0 "nonimmediate_operand")
> > > > + (match_operand:VSUB64 1 "general_operand"))]
> > > > + "TARGET_FLOAT
> > > > + && (register_operand (operands[0], <MODE>mode)
> > > > + || aarch64_simd_reg_or_zero (operands[1], <MODE>mode)
> > > > + || CONST_VECTOR_P (operands[1]))"
> > > > + {@ [cons: =0, 1; attrs: type, arch]
> > > > + [r , Dz ; mov_imm , * ] mov\t%w0, 0
> > > > + [r , rZ ; mov_reg , * ] mov\t%w0, %w1
> > > > + [r , Da ; mov_imm , * ] #
> > > > + [r , w ; mov_reg , simd ] #
> > > > + [r , m ; load_4 , * ] ldr<size>\t%w0, %1
> > > > + [w , w ; neon_logic , simd ] mov\t%0.8b, %1.8b
> > > > + [w , m ; neon_load1_1reg , simd ] ldr\t%<vstype>0, %1
> > > > + [w , Dz ; f_mcr , * ] fmov\t%<vstype>0, xzr
> > > > + [m , rZ ; store_4 , * ] str<size>\t%w1, %0
> > > > + [m , w ; neon_store1_1reg , simd ] str\t%<vstype>1, %0
> > > > + }
> > > > + "&& reload_completed
> > > > + && REG_P (operands[0])"
> > > > + [(const_int 0)]
> > > > + {
> > > > + if (CONST_VECTOR_P (operands[1]))
> > > > + {
> > > > + int elt_bitsize
> > > > + = GET_MODE_BITSIZE (GET_MODE_INNER (GET_MODE
> > > > (operands[1])));
> > > > + int n_elts = CONST_VECTOR_NUNITS (operands[1]).to_constant ();
> > > > + int val = 0;
> > > > + bool int_vector_p = CONST_INT_P (CONST_VECTOR_ELT
> > (operands[1],
> > > > 0));
> > > > + unsigned HOST_WIDE_INT eltval;
> > > > + rtx elt;
> > > > + for (int i = 0; i < n_elts; i++)
> > > > + {
> > > > + elt = CONST_VECTOR_ELT (operands[1], BYTES_BIG_ENDIAN
> > > > + ? i
> > > > + : n_elts - 1 - i);
> > > > + if (int_vector_p)
> > > > + eltval = INTVAL (elt);
> > > > + else
> > > > + {
> > > > + bool res = aarch64_reinterpret_float_as_int (elt,
> > > > &eltval);
> > > > + gcc_assert (res);
> > > > + }
> > > > +
> > > > + val = (val << elt_bitsize) + (eltval & ((1 << elt_bitsize)
> > > > - 1));
> > > > + }
> > > > + emit_move_insn (gen_rtx_REG (SImode, REGNO (operands[0])),
> > > > + GEN_INT (val));
> > > > + }
> > > > + else if (REG_P (operands[1]))
> > > > + aarch64_simd_emit_reg_reg_move (operands, <VSC>mode, 1);
> > > > + DONE;
> > > > + }
> > > > + [(set_attr "type" "mov_reg")]
> > > > +)
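To make the constant-packing loop in the split above easier to follow, here
is a standalone sketch of the same arithmetic. It is an illustration rather
than the patch's code: toy_pack_subvector is a made-up name, the elements
are passed in as plain integers instead of RTL constants, and an unsigned
accumulator is used where the patch uses "int val".

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Pack N_ELTS elements of ELT_BITS bits each into one 32-bit value,
   walking the elements in the same order as the split above: last to
   first on little-endian, first to last on big-endian.
   Assumes ELT_BITS is 8 or 16 and N_ELTS * ELT_BITS <= 32.  */
static uint32_t
toy_pack_subvector (const uint32_t *elts, size_t n_elts,
                    unsigned int elt_bits, bool big_endian)
{
  uint32_t mask = (UINT32_C (1) << elt_bits) - 1;
  uint32_t val = 0;

  for (size_t i = 0; i < n_elts; i++)
    {
      uint32_t elt = elts[big_endian ? i : n_elts - 1 - i];
      /* Shift the elements packed so far up and append the next one.  */
      val = (val << elt_bits) + (elt & mask);
    }
  return val;
}
```

On little-endian, a V4QI-like input { 0x11, 0x22, 0x33, 0x44 } packs to
0x44332211, i.e. element 0 ends up in the least significant byte; with
big_endian set, the same input packs to 0x11223344.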
> > > > +
> > > > ;; When storing lane zero we can use the normal STR and its more
> > permissive
> > > > ;; addressing modes.
> > > >
> > > > diff --git a/gcc/config/aarch64/aarch64.cc
> > b/gcc/config/aarch64/aarch64.cc
> > > > index 37c28c8f2f8..257c193fa64 100644
> > > > --- a/gcc/config/aarch64/aarch64.cc
> > > > +++ b/gcc/config/aarch64/aarch64.cc
> > > > @@ -1479,40 +1479,6 @@ pr_or_ffr_regnum_p (unsigned int regno)
> > > > return PR_REGNUM_P (regno) || regno == FFR_REGNUM || regno ==
> > > > FFRT_REGNUM;
> > > > }
> > > >
> > > > -/* Implement TARGET_IRA_CHANGE_PSEUDO_ALLOCNO_CLASS.
> > > > - The register allocator chooses POINTER_AND_FP_REGS if FP_REGS and
> > > > - GENERAL_REGS have the same cost - even if POINTER_AND_FP_REGS
> > has a
> > > > much
> > > > - higher cost. POINTER_AND_FP_REGS is also used if the cost of both
> > > > FP_REGS
> > > > - and GENERAL_REGS is lower than the memory cost (in this case the
> > > > best
> > > > class
> > > > - is the lowest cost one). Using POINTER_AND_FP_REGS irrespectively
> > > > of
> > its
> > > > - cost results in bad allocations with many redundant int<->FP moves
> > which
> > > > - are expensive on various cores.
> > > > - To avoid this we don't allow POINTER_AND_FP_REGS as the allocno
> > class,
> > > > but
> > > > - force a decision between FP_REGS and GENERAL_REGS. We use the
> > allocno
> > > > class
> > > > - if it isn't POINTER_AND_FP_REGS. Similarly, use the best class if
> > > > it isn't
> > > > - POINTER_AND_FP_REGS. Otherwise set the allocno class depending on
> > the
> > > > mode.
> > > > - The result of this is that it is no longer inefficient to have a
> > > > higher
> > > > - memory move cost than the register move cost.
> > > > -*/
> > > > -
> > > > -static reg_class_t
> > > > -aarch64_ira_change_pseudo_allocno_class (int regno, reg_class_t
> > > > allocno_class,
> > > > - reg_class_t best_class)
> > > > -{
> > > > - machine_mode mode;
> > > > -
> > > > - if (!reg_class_subset_p (GENERAL_REGS, allocno_class)
> > > > - || !reg_class_subset_p (FP_REGS, allocno_class))
> > > > - return allocno_class;
> > > > -
> > > > - if (!reg_class_subset_p (GENERAL_REGS, best_class)
> > > > - || !reg_class_subset_p (FP_REGS, best_class))
> > > > - return best_class;
> > > > -
> > > > - mode = PSEUDO_REGNO_MODE (regno);
> > > > - return FLOAT_MODE_P (mode) || VECTOR_MODE_P (mode) ? FP_REGS :
> > > > GENERAL_REGS;
> > > > -}
> > > > -
> > > > static unsigned int
> > > > aarch64_min_divisions_for_recip_mul (machine_mode mode)
> > > > {
> > > > @@ -1777,6 +1743,14 @@ aarch64_classify_vector_mode
> > (machine_mode
> > > > mode, bool any_target_p = false)
> > > > case E_V4x2DFmode:
> > > > return (TARGET_FLOAT || any_target_p) ? VEC_ADVSIMD |
> > VEC_STRUCT :
> > > > 0;
> > > >
> > > > + /* 16-bit Advanced SIMD vectors. */
> > > > + case E_V2QImode:
> > > > + /* 32-bit Advanced SIMD vectors. */
> > > > + case E_V2HFmode:
> > > > + case E_V2BFmode:
> > > > + case E_V2HImode:
> > > > + case E_V4QImode:
> > > > + return (TARGET_FLOAT || any_target_p) ? VEC_ADVSIMD |
> > VEC_PARTIAL
> > > > : 0;
> > > > /* 64-bit Advanced SIMD vectors. */
> > > > case E_V8QImode:
> > > > case E_V4HImode:
> > > > @@ -1855,6 +1829,13 @@ aarch64_advsimd_full_struct_mode_p
> > > > (machine_mode mode)
> > > > return (aarch64_classify_vector_mode (mode) == (VEC_ADVSIMD |
> > > > VEC_STRUCT));
> > > > }
> > > >
> > > > +/* Return true if MODE is a partial (sub-64-bit) Advanced SIMD mode.
> > > > */
> > > > +static bool
> > > > +aarch64_advsimd_partial_mode_p (machine_mode mode)
> > > > +{
> > > > + return (aarch64_classify_vector_mode (mode) == (VEC_ADVSIMD |
> > > > VEC_PARTIAL));
> > > > +}
> > > > +
> > > > /* Return true if MODE is any of the data vector modes, including
> > > > structure modes. */
> > > > static bool
> > > > @@ -2126,6 +2107,43 @@ aarch64_coalesce_units (machine_mode
> > > > vec_mode, unsigned int factor)
> > > > return {};
> > > > }
> > > >
> > > > +/* Implement TARGET_IRA_CHANGE_PSEUDO_ALLOCNO_CLASS.
> > > > + The register allocator chooses POINTER_AND_FP_REGS if FP_REGS and
> > > > + GENERAL_REGS have the same cost - even if POINTER_AND_FP_REGS
> > has a
> > > > much
> > > > + higher cost. POINTER_AND_FP_REGS is also used if the cost of both
> > > > FP_REGS
> > > > + and GENERAL_REGS is lower than the memory cost (in this case the
> > > > best
> > > > class
> > > > + is the lowest cost one). Using POINTER_AND_FP_REGS irrespectively
> > > > of
> > its
> > > > + cost results in bad allocations with many redundant int<->FP moves
> > which
> > > > + are expensive on various cores.
> > > > + To avoid this we don't allow POINTER_AND_FP_REGS as the allocno
> > class,
> > > > but
> > > > + force a decision between FP_REGS and GENERAL_REGS. We use the
> > allocno
> > > > class
> > > > + if it isn't POINTER_AND_FP_REGS. Similarly, use the best class if
> > > > it isn't
> > > > + POINTER_AND_FP_REGS. Otherwise set the allocno class depending on
> > the
> > > > mode.
> > > > + The result of this is that it is no longer inefficient to have a
> > > > higher
> > > > + memory move cost than the register move cost.
> > > > +*/
> > > > +
> > > > +static reg_class_t
> > > > +aarch64_ira_change_pseudo_allocno_class (int regno, reg_class_t
> > > > allocno_class,
> > > > + reg_class_t best_class)
> > > > +{
> > > > + machine_mode mode;
> > > > +
> > > > + if (!reg_class_subset_p (GENERAL_REGS, allocno_class)
> > > > + || !reg_class_subset_p (FP_REGS, allocno_class))
> > > > + return allocno_class;
> > > > +
> > > > + if (!reg_class_subset_p (GENERAL_REGS, best_class)
> > > > + || !reg_class_subset_p (FP_REGS, best_class))
> > > > + return best_class;
> > > > +
> > > > + mode = PSEUDO_REGNO_MODE (regno);
> > > > + return FLOAT_MODE_P (mode) || (VECTOR_MODE_P (mode)
> > > > + && (!INTEGRAL_MODE_P (mode)
> > > > + || !aarch64_advsimd_partial_mode_p
> > > > (mode)))
> > > > + ? FP_REGS : GENERAL_REGS;
> > > > +}
> > > > +
> > >
> > > The condition seems a bit messy, aren't you effectively adding
> > >
> > > if (INTEGRAL_MODE_P (mode) && aarch64_advsimd_partial_mode_p (mode))
> > >   return GENERAL_REGS;
> >
> > ... so let me know if you think we can keep this hunk, with the fix you're
> > suggesting above.
> >
> > Thanks for your review so far and looking forward to the rest of it.
> >
> > Kind regards,
> > Artemiy
> >
> > >
> > > Presumably so V2HF and V2BF are still FP_REGS.
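As a sanity check that the early-return form matches the condition in the
hunk, the two can be compared exhaustively over plain boolean flags. This is
a sketch only: the toy_* names are invented, the four flags stand in for
FLOAT_MODE_P, VECTOR_MODE_P, INTEGRAL_MODE_P and
aarch64_advsimd_partial_mode_p, and it assumes that no mode is both a float
mode and an integral mode.

```c
#include <stdbool.h>

enum toy_class { TOY_GENERAL_REGS, TOY_FP_REGS };

/* The condition as written in the hunk.  */
static enum toy_class
toy_patch_form (bool float_p, bool vector_p, bool integral_p, bool partial_p)
{
  return (float_p || (vector_p && (!integral_p || !partial_p)))
         ? TOY_FP_REGS : TOY_GENERAL_REGS;
}

/* The early-return form suggested in the review.  */
static enum toy_class
toy_review_form (bool float_p, bool vector_p, bool integral_p, bool partial_p)
{
  if (integral_p && partial_p)
    return TOY_GENERAL_REGS;
  return (float_p || vector_p) ? TOY_FP_REGS : TOY_GENERAL_REGS;
}

/* Count the flag combinations on which the two forms disagree, skipping
   combinations where a mode would be both float and integral (assumed
   impossible).  */
static int
toy_count_mismatches (void)
{
  int mismatches = 0;

  for (int bits = 0; bits < 16; bits++)
    {
      bool float_p = (bits & 1) != 0;
      bool vector_p = (bits & 2) != 0;
      bool integral_p = (bits & 4) != 0;
      bool partial_p = (bits & 8) != 0;

      if (float_p && integral_p)
        continue;
      if (toy_patch_form (float_p, vector_p, integral_p, partial_p)
          != toy_review_form (float_p, vector_p, integral_p, partial_p))
        mismatches++;
    }
  return mismatches;
}
```

Under that assumption toy_count_mismatches () returns 0. A V4QI-like
combination (vector, integral, partial) maps to TOY_GENERAL_REGS in both
forms, and a V2HF-like one (float, vector, partial) maps to TOY_FP_REGS.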
> > >
> > > > /* Implement TARGET_VECTORIZE_RELATED_MODE. */
> > > >
> > > > static opt_machine_mode
> > > > @@ -28202,6 +28220,9 @@ aarch64_vectorize_vec_perm_const
> > > > (machine_mode vmode, machine_mode op_mode,
> > > > {
> > > > struct expand_vec_perm_d d;
> > > >
> > > > + if (aarch64_advsimd_partial_mode_p (op_mode))
> > > > + return false;
> > > > +
> > > > /* Check whether the mask can be applied to a single vector. */
> > > > if (sel.ninputs () == 1
> > > > || (op0 && rtx_equal_p (op0, op1)))
> > > > diff --git a/gcc/config/aarch64/constraints.md
> > > > b/gcc/config/aarch64/constraints.md
> > > > index 3d166fe3a17..77eadc89819 100644
> > > > --- a/gcc/config/aarch64/constraints.md
> > > > +++ b/gcc/config/aarch64/constraints.md
> > > > @@ -524,6 +524,11 @@
> > > > (and (match_code "const_int")
> > > > (match_test "aarch64_simd_scalar_immediate_valid_for_move (op,
> > > > QImode)")))
> > > > +(define_constraint "Da"
> > > > + "@internal
> > > > + A constraint that matches all sub-64-bit vectors."
> > > > + (and (match_code "const_vector")
> > > > + (match_test "known_lt (GET_MODE_BITSIZE (mode), 64)")))
> > > >
> > >
> > > Use the helper.
> > >
> > > Thanks,
> > > Tamar
> > >
> > > > (define_constraint "Dt"
> > > > "@internal
> > > > diff --git a/gcc/config/aarch64/iterators.md
> > > > b/gcc/config/aarch64/iterators.md
> > > > index 39b1e84edcc..dfca3327f1f 100644
> > > > --- a/gcc/config/aarch64/iterators.md
> > > > +++ b/gcc/config/aarch64/iterators.md
> > > > @@ -227,10 +227,17 @@
> > > > ;; All Advanced SIMD integer modes
> > > > (define_mode_iterator VALLI [VDQ_BHSI V2DI])
> > > >
> > > > +;; All sub-64-bit vector modes.
> > > > +(define_mode_iterator VSUB64 [V2QI V4QI V2HI V2HF V2BF])
> > > > +
> > > > ;; All Advanced SIMD modes suitable for moving, loading, and storing.
> > > > (define_mode_iterator VALL_F16 [V8QI V16QI V4HI V8HI V2SI V4SI V2DI
> > > > V4HF V8HF V4BF V8BF V2SF V4SF V2DF])
> > > >
> > > > +;; All Advanced SIMD modes suitable for moving, loading, and storing,
> > > > +;; plus all sub-64-bit vector modes.
> > > > +(define_mode_iterator VALL_F16_SUB64 [VALL_F16 VSUB64])
> > > > +
> > > > ;; The VALL_F16 modes except the 128-bit 2-element ones.
> > > > (define_mode_iterator VALL_F16_NO_V2Q [V8QI V16QI V4HI V8HI V2SI
> > > > V4SI
> > > > V4HF V8HF V2SF V4SF])
> > > > @@ -1466,7 +1473,9 @@
> > > > (define_mode_attr s [(HF "h") (SF "s") (DF "d") (SI "s") (DI "d")])
> > > >
> > > > ;; Give the length suffix letter for a sign- or zero-extension.
> > > > -(define_mode_attr size [(QI "b") (HI "h") (SI "w")])
> > > > +(define_mode_attr size [(QI "b") (HI "h") (SI "w") (HF "") (BF "") (SF
> > > > "")
> > > > + (V2QI "h") (V4QI "") (V2HI "")
> > > > + (V2HF "") (V2BF "")])
> > > >
> > > > ;; Give the number of bits in the mode
> > > > (define_mode_attr sizen [(QI "8") (HI "16") (SI "32") (DI "64")])
> > > > @@ -1883,6 +1892,10 @@
> > > > (VNx4SI "v2si") (VNx4SF "v2sf")
> > > > (VNx2DI "di") (VNx2DF "df")])
> > > >
> > > > +;; Sub-64-bit vector mode to equivalent scalar mode.
> > > > +(define_mode_attr VSC [(V4QI "SI") (V2QI "HI")
> > > > + (V2HI "SI") (V2HF "SF") (V2BF "SF")])
> > > > +
> > > > (define_mode_attr vnx [(V4SI "vnx4si") (V2DI "vnx2di")])
> > > >
> > > > ;; 64-bit container modes the inner or scalar source mode.
> > > > @@ -2169,6 +2182,10 @@
> > > > (V2SI "q") (V2SF "q")
> > > > (DI "q") (DF "q")])
> > > >
> > > > +;; Scalar size of a sub-64-bit vector mode.
> > > > +(define_mode_attr vstype [(V4QI "s") (V2QI "h")
> > > > + (V2HI "s") (V2BF "s") (V2HF "s")])
> > > > +
> > > > ;; Define corresponding core/FP element mode for each vector mode.
> > > > (define_mode_attr vw [(V8QI "w") (V16QI "w")
> > > > (V4HI "w") (V8HI "w")
> > > > diff --git a/gcc/testsuite/gcc.dg/vect/complex/bb-slp-complex-add-half-
> > > > float.c b/gcc/testsuite/gcc.dg/vect/complex/bb-slp-complex-add-half-
> > float.c
> > > > index 3f1cce56955..6234f8646fe 100644
> > > > --- a/gcc/testsuite/gcc.dg/vect/complex/bb-slp-complex-add-half-float.c
> > > > +++ b/gcc/testsuite/gcc.dg/vect/complex/bb-slp-complex-add-half-float.c
> > > > @@ -12,3 +12,5 @@
> > > >
> > > > /* { dg-final { scan-tree-dump "add new stmt:
> > > > \[^\n\r]*COMPLEX_ADD_ROT270" "slp1" { xfail *-*-* } } } */
> > > > /* { dg-final { scan-tree-dump "add new stmt:
> > > > \[^\n\r]*COMPLEX_ADD_ROT90" "slp1" { xfail *-*-* } } } */
> > > > +/* { dg-final { scan-tree-dump "Found COMPLEX_ADD_ROT90" "slp1" } }
> > */
> > > > +/* { dg-final { scan-tree-dump "Found COMPLEX_ADD_ROT270" "slp1" } }
> > */
> > > > diff --git a/gcc/testsuite/gcc.dg/vect/complex/bb-slp-complex-mla-half-
> > float.c
> > > > b/gcc/testsuite/gcc.dg/vect/complex/bb-slp-complex-mla-half-float.c
> > > > index 33e500f3f4c..831f84bc1c8 100644
> > > > --- a/gcc/testsuite/gcc.dg/vect/complex/bb-slp-complex-mla-half-float.c
> > > > +++ b/gcc/testsuite/gcc.dg/vect/complex/bb-slp-complex-mla-half-float.c
> > > > @@ -9,4 +9,6 @@
> > > > #include "complex-mla-template.c"
> > > >
> > > > /* { dg-final { scan-tree-dump "Found COMPLEX_FMA_CONJ" "slp1" { xfail
> > *-
> > > > *-* } } } */
> > > > -/* { dg-final { scan-tree-dump "Found COMPLEX_FMA" "slp1" { xfail
> > > > *-*-*
> > } }
> > > > } */
> > > > +
> > > > +/* { dg-final { scan-tree-dump-times "add new
> > > > stmt:\[^\n\r]*COMPLEX_FMA" 1 "slp1" { xfail *-*-* } } } */
> > > > +/* { dg-final { scan-tree-dump "Found COMPLEX_FMA" "slp1" } } */
> > > > diff --git a/gcc/testsuite/gcc.dg/vect/complex/bb-slp-complex-mul-half-
> > > > float.c b/gcc/testsuite/gcc.dg/vect/complex/bb-slp-complex-mul-half-
> > float.c
> > > > index 259dd6b2e06..f74274ad034 100644
> > > > --- a/gcc/testsuite/gcc.dg/vect/complex/bb-slp-complex-mul-half-float.c
> > > > +++ b/gcc/testsuite/gcc.dg/vect/complex/bb-slp-complex-mul-half-float.c
> > > > @@ -8,5 +8,7 @@
> > > > #define N 16
> > > > #include "complex-mul-template.c"
> > > >
> > > > -/* { dg-final { scan-tree-dump "Found COMPLEX_MUL_CONJ" "slp1" {
> > xfail *-
> > > > *-* } } } */
> > > > -/* { dg-final { scan-tree-dump "Found COMPLEX_MUL" "slp1" { xfail
> > > > *-*-*
> > } }
> > > > } */
> > > > +/* { dg-final { scan-tree-dump-times "add new
> > > > stmt:\[^\n\r]*COMPLEX_MUL_CONJ" 1 "slp1" { xfail *-*-* } } } */
> > > > +/* { dg-final { scan-tree-dump "Found COMPLEX_MUL_CONJ" "slp1" } } */
> > > > +/* { dg-final { scan-tree-dump-times "add new
> > > > stmt:\[^\n\r]*COMPLEX_MUL" 1 "slp1" { xfail *-*-* } } } */
> > > > +/* { dg-final { scan-tree-dump "Found COMPLEX_MUL" "slp1" } } */
> > > > diff --git a/gcc/testsuite/gcc.target/aarch64/sve/slp_1.c
> > > > b/gcc/testsuite/gcc.target/aarch64/sve/slp_1.c
> > > > index 07d71a63414..98e8ac3c785 100644
> > > > --- a/gcc/testsuite/gcc.target/aarch64/sve/slp_1.c
> > > > +++ b/gcc/testsuite/gcc.target/aarch64/sve/slp_1.c
> > > > @@ -30,12 +30,14 @@ vec_slp_##TYPE (TYPE *restrict a, TYPE b, TYPE c,
> > int
> > > > n) \
> > > > TEST_ALL (VEC_PERM)
> > > >
> > > > /* We should use one DUP for each of the 8-, 16- and 32-bit types,
> > > > - although we currently use LD1RW for _Float16. We should use two
> > > > + (for now, insert both elements with umov + ins for _Float16). We
> > > > should
> > > > use two
> > > > DUPs for each of the three 64-bit types. */
> > > > /* { dg-final { scan-assembler-times {\tmov\tz[0-9]+\.h, [hw]} 2 } } */
> > > > -/* { dg-final { scan-assembler-times {\tmov\tz[0-9]+\.s, [sw]} 2 } } */
> > > > -/* { dg-final { scan-assembler-times {\tld1rw\tz[0-9]+\.s, } 1 } } */
> > > > +/* { dg-final { scan-assembler-times {\tmov\tz[0-9]+\.s, [sw]} 3 } } */
> > > > /* { dg-final { scan-assembler-times {\tmov\tz[0-9]+\.d, [dx]} 9 } } */
> > > > +/* { dg-final { scan-assembler-times {\tumov\tw[0-9]+, v[0-9]+\.h} 2 }
> > > > }
> > */
> > > > +/* { dg-final { scan-assembler-times {\tins\tv[0-9]+\.h\[0\], w[0-9]+}
> > > > 1 } }
> > */
> > > > +/* { dg-final { scan-assembler-times {\tins\tv[0-9]+\.h\[1\], w[0-9]+}
> > > > 1 } }
> > */
> > > > /* { dg-final { scan-assembler-times {\tzip1\tz[0-9]+\.d, z[0-9]+\.d,
> > > > z[0-
> > > > 9]+\.d\n} 3 } } */
> > > > /* { dg-final { scan-assembler-not {\tzip2\t} } } */
> > > >
> > > > @@ -53,7 +55,6 @@ TEST_ALL (VEC_PERM)
> > > > /* { dg-final { scan-assembler-times {\twhilelo\tp[0-7]\.s} 6 } } */
> > > > /* { dg-final { scan-assembler-times {\twhilelo\tp[0-7]\.d} 6 } } */
> > > > /* { dg-final { scan-assembler-not {\tldr} } } */
> > > > -/* { dg-final { scan-assembler-times {\tstr} 2 } } */
> > > > -/* { dg-final { scan-assembler-times {\tstr\th[0-9]+} 2 } } */
> > > > +/* { dg-final { scan-assembler-not {\tstr} } } */
> > > >
> > > > /* { dg-final { scan-assembler-not {\tuqdec} } } */
> > > > --
> > > > 2.43.0
> > >