Re: [PATCH 2/4] aarch64: initialize vectors from starting subsequence

Artemiy Volkov Mon, 18 May 2026 07:52:35 -0700

On Wed, May 13, 2026 at 09:01:58AM +0100, Tamar Christina wrote:
> Hi Art,
> 
> > -----Original Message-----
> > From: Artemiy Volkov <[email protected]>
> > Sent: 27 April 2026 09:06
> > To: [email protected]
> > Cc: Tamar Christina <[email protected]>; Wilco Dijkstra
> > <[email protected]>; [email protected]; Richard
> > Earnshaw <[email protected]>; [email protected]; Alice
> > Carlotti <[email protected]>; Alex Coplan <[email protected]>;
> > Artemiy Volkov <[email protected]>
> > Subject: [PATCH 2/4] aarch64: initialize vectors from starting subsequence
> > 
> > Now that we have 2- and 4-element vector modes for all the sub-word scalar
> > modes, we can emit more efficient code when the elements of a vector
> > constructor can be generated from a common starting subsequence of length
> > power of two.  To do this, first detect the shortest possible starting
> > subsequence by repeatedly folding the initial constructor element array
> > in half, as long as the left and the right halves are equal.  Afterwards,
> > after emitting the subsequence, duplicate it by generating a
> > vec_duplicate with the correct source mode.
> > 
> > On the MD side, this requires implementing the vec_duplicate optab to
> > duplicate an arbitrary sub-128-bit value into a full 64- or a 128-bit
> > AdvSIMD register, as well as the vec_set insn for the VSUB64 modes (needed
> > as fallback for the divide-and-conquer approach).  The latter uses a
> > properly scaled and shifted "bfi" for integer values, and a properly
> > indexed "ins" for FP elements.
> > 
> > This change allows us to get rid of long chains of inserts and compile
> > things like:
> > 
> > int16x8_t f (int16_t x, int16_t y, int16_t z, int16_t w)
> > {
> >     return (int16x8_t) {x, y, z, w, x, y, z, w};
> > }
> > 
> > into:
> >     bfi     w0, w2, 16, 16
> >     bfi     w1, w3, 16, 16
> >     dup     v31.2s, w0
> >     dup     v0.2s, w1
> >     zip1    v0.8h, v31.8h, v0.8h
> >     ret
> 
> Curious, while and improvement why didn't this
> continue and merge w0 and w1 into x0 and do a single dup?
> 
> so another bfi but of an x registers or an orr of x registers.
> 
> That removes the other transfer and the need for the zip.


This is a problem of constructing a vector of N>2 arbitrary elements,
which we still handle by either (a) using dup+zip, or (b) inserting
elements one by one using ins, whichever is cheaper.  Perhaps achieving
your codegen is just a matter of doing something like:

diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc
index b58c4572299..b4ccb0e1352 100644
--- a/gcc/config/aarch64/aarch64.cc
+++ b/gcc/config/aarch64/aarch64.cc
@@ -25625,6 +25625,25 @@ aarch64_expand_vector_init_fallback (rtx target, rtx 
vals)
          }
     }
 
+  if (pow2p_hwi (n_elts))
+    {
+      machine_mode subv_mode = mode_for_vector (inner_mode, n_elts / 
2).require ();
+      rtx new_target0 = gen_reg_rtx (subv_mode), new_target1 = gen_reg_rtx 
(subv_mode);
+      rtvec new_vals0 = rtvec_alloc (n_elts / 2), new_vals1 = rtvec_alloc 
(n_elts / 2);
+      for (int i = 0; i < n_elts / 2; i++) {
+       RTVEC_ELT (new_vals0, i) = XVECEXP (vals, 0, i);
+       RTVEC_ELT (new_vals1, i) = XVECEXP (vals, 0, n_elts / 2 + i);
+      }
+      aarch64_expand_vector_init (new_target0, gen_rtx_PARALLEL (subv_mode, 
new_vals0));
+      aarch64_expand_vector_init (new_target1, gen_rtx_PARALLEL (subv_mode, 
new_vals1));
+
+      rtvec new_vals = rtvec_alloc (2);
+      RTVEC_ELT (new_vals, 0) = new_target0;
+      RTVEC_ELT (new_vals, 1) = new_target1;
+      aarch64_expand_vector_init (target, gen_rtx_PARALLEL (mode, new_vals));
+      return;
+    }
+

I guess this could always be used at least as a last-resort option instead
of a sequence of vec_sets (and in this case it will also be cheaper than
dup+zip), but it still needs a full assessment so I suggest we leave that
for a later patch.

> > 
> > rather than:
> > 
> >     dup     v31.4h, w0
> >     dup     v0.4h, w1
> >     ins     v31.h[1], w2
> >     ins     v0.h[1], w3
> >     ins     v31.h[3], w2
> >     ins     v0.h[3], w3
> >     zip1    v0.8h, v31.8h, v0.8h
> >     ret
> > 
> > This patch also includes an extensive new test, which includes the above
> > case, as well as adjustments to existing codegen tests as necessary.
> > 
> > gcc/ChangeLog:
> > 
> >     * config/aarch64/aarch64-simd.md
> > (*aarch64_simd_dup_subvector<VQ:mode><VDUP:mode>):
> >     New insn pattern.
> >     (*aarch64_simd_dup_subvector<VD:mode><VDUP:mode>): Likewise.
> >     (@aarch64_simd_vec_set<mode>): Likewise.
> >     (vec_set<mode>): Handle 16- and 32-bit vector modes in the
> > expander.
> >     * config/aarch64/aarch64.cc (aarch64_choose_vector_init_constant):
> >     Handle 16- and 32-bit vector modes.
> >     (aarch64_expand_vector_init_fallback): Add logic to initialize vector
> >     from starting subsequence.  Make static.
> >     (scalar_move_insn_p): Consider sub-64-bit vector moves scalar.
> >     * config/aarch64/iterators.md (VDUP): New iterator.
> >     (elem_bits): Define attribute for sub-64-bit vector modes.
> >     (Vetype): Likewise.
> >     (VEL): Likewise.
> >     (single_wx): Define attribute for sub-64-bit vector and scalar modes.
> >     (single_type): Likewise.
> >     (Vqduptype): New mode attribute.
> >     (Vdduptype): Likewise.
> >     (vstype): Define attribute for 64-bit vector and sub-128-bit scalar
> >     modes.
> > 
> > gcc/testsuite/ChangeLog:
> > 
> >     * gcc.target/aarch64/ldp_stp_16.c: Adjust testcase.
> >     * gcc.target/aarch64/sve/slp_1.c: Likewise.
> >     * gcc.target/aarch64/vec-init-18.c: Likewise.
> >     * gcc.target/aarch64/vec-init-23.c: New test.
> > ---
> >  gcc/config/aarch64/aarch64-simd.md            |  48 +-
> >  gcc/config/aarch64/aarch64.cc                 |  46 +-
> >  gcc/config/aarch64/iterators.md               |  62 ++-
> >  gcc/testsuite/gcc.target/aarch64/ldp_stp_16.c |   5 +-
> >  gcc/testsuite/gcc.target/aarch64/sve/slp_1.c  |   7 +-
> >  .../gcc.target/aarch64/vec-init-18.c          |   7 +-
> >  .../gcc.target/aarch64/vec-init-23.c          | 435 ++++++++++++++++++
> >  7 files changed, 587 insertions(+), 23 deletions(-)
> >  create mode 100644 gcc/testsuite/gcc.target/aarch64/vec-init-23.c
> > 
> > diff --git a/gcc/config/aarch64/aarch64-simd.md
> > b/gcc/config/aarch64/aarch64-simd.md
> > index 855b1ba353c..4bb26621efc 100644
> > --- a/gcc/config/aarch64/aarch64-simd.md
> > +++ b/gcc/config/aarch64/aarch64-simd.md
> > @@ -136,6 +136,28 @@
> >    }
> >  )
> > 
> > +(define_insn "*aarch64_simd_dup_subvector<VQ:mode><VDUP:mode>"
> > +  [(set (match_operand:VQ 0 "register_operand")
> > +   (vec_duplicate:VQ
> > +     (match_operand:VDUP 1 "register_operand")))]
> > +  "TARGET_SIMD"
> > +  {@ [ cons: =0 , 1 ; attrs: type        ]
> > +     [ w        , w ; neon_dup<VQ:q>     ] dup\t%0.<VDUP:Vqduptype>,
> > %1.<VDUP:vstype>[0]
> > +     [ w        , r ; neon_from_gp<VQ:q> ] dup\t%0.<VDUP:Vqduptype>,
> > %<VDUP:single_wx>1
> > +  }
> > +)
> > +
> > +(define_insn "*aarch64_simd_dup_subvector<VD:mode><VDUP:mode>"
> > +  [(set (match_operand:VD 0 "register_operand")
> > +   (vec_duplicate:VD
> > +     (match_operand:VDUP 1 "register_operand")))]
> > +  "TARGET_SIMD"
> > +  {@ [ cons: =0 , 1 ; attrs: type        ]
> > +     [ w        , w ; neon_dup<VD:q>     ] dup\t%0.<VDUP:Vdduptype>,
> > %1.<VDUP:vstype>[0]
> > +     [ w        , r ; neon_from_gp<VD:q> ] dup\t%0.<VDUP:Vdduptype>,
> > %<VDUP:single_wx>1
> > +  }
> > +)
> 
> Since VDUP also contains 64-bit elements doesn't this create odd combinations
> Like v8qiv8qi?
> 
> Indeed and in those cases Vdduptype returns "" so the assembly is broken
> 
> (define_insn ("*aarch64_simd_dup_subvectorv8qiv8qi")
>      [
>         (set (match_operand:V8QI 0 ("register_operand") ("=w,w"))
>             (vec_duplicate:V8QI (match_operand:V8QI 1 ("register_operand") 
> ("w,r"))))
>     ] ("TARGET_SIMD") ("@
> dup\t%0., %1.d[0]
> dup\t%0., %x1")
>      [
>         (set_attr ("type") ("neon_dup,neon_from_gp"))
>     ])
> 
> I think also for the RTL these shouldn't be just a `vec_duplicate` because 
> some of
> them change the interpretation of the elements too. Like
> 
> aarch64_simd_dup_subvectorv8bfsf is
> 
> (set (match_operand:V8BF 0 ("register_operand") ("=w,w"))
>             (vec_duplicate:V8BF (match_operand:SF 1 ("register_operand") 
> ("w,r"))))
> 
> Which is invalid for a vec_duplicate as the submodes aren't the same.
> 
> So I think these patterns should be split into not just 64 and 128 bit 
> duplicates
> but into also into duplicates of the same element type and not.
> 
> For the mismatch elements you'll need to use a subreg to convert the type to
> make it valid RTL.

Thank you for this observation; the intent was indeed to only support
vec_duplicates of smaller types into larger ones, and only where inner
modes are the same.  I have added a couple more necessary attributes and
modified the patterns to be driven only by the mode of the inner operand.

> 
> > +
> >  (define_insn "@aarch64_dup_lane<mode>"
> >    [(set (match_operand:VALL_F16 0 "register_operand" "=w")
> >     (vec_duplicate:VALL_F16
> > @@ -1282,6 +1304,30 @@
> >    [(set_attr "type" "neon_ins<q>, neon_from_gp<q>,
> > neon_load1_one_lane<q>")]
> >  )
> > 
> > +(define_insn "@aarch64_simd_vec_set<mode>"
> > +  [(set (match_operand:VSUB64 0 "register_operand" "=r, w")
> > +   (vec_merge:VSUB64
> > +       (vec_duplicate:VSUB64
> > +           (match_operand:<VEL> 1
> > "aarch64_simd_nonimmediate_operand" "r, w"))
> > +       (match_operand:VSUB64 3 "register_operand" "0, 0")
> > +       (match_operand:SI 2 "immediate_operand" "i, i")))]
> > +  "TARGET_SIMD && exact_log2 (INTVAL (operands[2])) >= 0"
> > +  {
> > +    int elt = exact_log2 (INTVAL (operands[2]));
> > +    switch (which_alternative)
> > +      {
> > +      case 0:
> > +   operands[2] = GEN_INT (elt * <elem_bits>);
> > +   return "bfi\t%w0, %w1, %2, <elem_bits>";
> > +      case 1:
> > +   return "ins\t%0.<Vetype>[%p2], %1.<Vetype>[0]";
> > +      default:
> > +   gcc_unreachable ();
> > +      }
> > +  }
> > +  [(set_attr "type" "bfm, neon_ins")]
> > +)
> > +
> 
> aarch64_simd_nonimmediate_operand allows memory as well so
> in the above your predicate is wider than your constraint and would
> force a reload which is probably not what we want.
> 
> Any reason for not allowing Utv here like the other patterns do?
> But if we really don't want the load the predicate should be register_operand.

No reason to speak of; will add a Utv alternative.

> 
> >  ;; Inserting from the zero register into a vector lane is treated as an
> >  ;; expensive GP->FP move on all CPUs.  Avoid it when optimizing for speed.
> >  (define_insn "aarch64_simd_vec_set_zero<mode>"
> > @@ -1711,7 +1757,7 @@
> >  )
> > 
> >  (define_expand "vec_set<mode>"
> > -  [(match_operand:VALL_F16 0 "register_operand")
> > +  [(match_operand:VALL_F16_SUB64 0 "register_operand")
> >     (match_operand:<VEL> 1 "aarch64_simd_nonimmediate_operand")
> >     (match_operand:SI 2 "immediate_operand")]
> >    "TARGET_SIMD"
> > diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc
> > index 257c193fa64..5b1afa50ff8 100644
> > --- a/gcc/config/aarch64/aarch64.cc
> > +++ b/gcc/config/aarch64/aarch64.cc
> > @@ -25446,7 +25446,9 @@ aarch64_choose_vector_init_constant
> > (machine_mode mode, rtx vals)
> >      }
> > 
> >    rtx c = gen_rtx_CONST_VECTOR (mode, copy);
> > -  if (aarch64_simd_valid_mov_imm (c))
> > +  if (aarch64_simd_valid_mov_imm (c)
> > +      || (INTEGRAL_MODE_P (mode)
> > +     && aarch64_advsimd_partial_mode_p (mode)))
> >      return c;
> 
> I'm having trouble figuring out why this particular change is needed.
> when I remove it I see test_int8_9 failing because the initial constant
> is different.
> 
> This code seems to want to preserve the repeating pattern as a constant
> 
> (const_vector:V4QI [
>         (const_int 0 [0]) repeated x2
>         (const_int 1 [0x1]) repeated x2
>     ])
> 
> For an insertion
> 
> (parallel:V4QI [
>         (subreg/s/u:QI (reg/v:SI 102 [ x0 ]) 0)
>         (const_int 0 [0])
>         (subreg/s/u:QI (reg/v:SI 103 [ x1 ]) 0)
>         (const_int 1 [0x1])
>     ])
> 
> Now the first one isn't a valid mov immediate so you don't return early but
> the loop below knows that you're going to overwrite index 0 and 2 so it 
> instead
> tries the constant
> 
> (const_vector:V4QI [
>         (const_int 0 [0]) repeated x3
>         (const_int 1 [0x1])
>     ])
> 
> Which is correct and valid and if both these fail you get c anyway.
> 
> So it seems like the check isn't needed?

Yes, this is another leftover from previous revisions when for some reason
I wasn't sure that small modes will be handled correctly here.  I will
drop this for v2.

Thank you for the review,
Artemiy

> 
> Thanks,
> Tamar
> > 
> >    /* Try generating a stepped sequence.  */
> > @@ -25487,7 +25489,7 @@ aarch64_choose_vector_init_constant
> > (machine_mode mode, rtx vals)
> >     The caller has already tried a divide-and-conquer approach, so do
> >     not consider that case here.  */
> > 
> > -void
> > +static void
> >  aarch64_expand_vector_init_fallback (rtx target, rtx vals)
> >  {
> >    machine_mode mode = GET_MODE (target);
> > @@ -25545,6 +25547,43 @@ aarch64_expand_vector_init_fallback (rtx
> > target, rtx vals)
> >        return;
> >      }
> > 
> > +  /* Check if the vector can be represented as a duplicate of a
> > +     subvector starting at index 0.  */
> > +  if (pow2p_hwi (n_elts))
> > +    {
> > +   bool halves_equal = true;
> > +   int n_seq = n_elts;
> > +   while (n_seq > 2)
> > +     {
> > +       for (int i = 0; i < n_seq / 2; i++)
> > +         if (!rtx_equal_p (XVECEXP (vals, 0, i),
> > +                           XVECEXP (vals, 0, i + n_seq / 2)))
> > +           {
> > +             halves_equal = false;
> > +             break;
> > +           }
> > +
> > +       if (!halves_equal)
> > +         break;
> > +
> > +       n_seq /= 2;
> > +     }
> > +
> > +   if (n_seq != n_elts)
> > +     {
> > +       machine_mode subv_mode = mode_for_vector (inner_mode,
> > +                                                 n_seq).require ();
> > +       rtx new_target = gen_reg_rtx (subv_mode);
> > +       rtvec new_vals = rtvec_alloc (n_seq);
> > +       for (int i = 0; i < n_seq; i++)
> > +         RTVEC_ELT (new_vals, i) = XVECEXP (vals, 0, i);
> > +       aarch64_expand_vector_init (new_target,
> > +                                   gen_rtx_PARALLEL (subv_mode,
> > new_vals));
> > +       aarch64_emit_move (target, gen_vec_duplicate (mode,
> > new_target));
> > +       return;
> > +     }
> > +    }
> > +
> >    enum insn_code icode = optab_handler (vec_set_optab, mode);
> >    gcc_assert (icode != CODE_FOR_nothing);
> > 
> > @@ -25704,7 +25743,8 @@ scalar_move_insn_p (rtx set)
> >    rtx src = SET_SRC (set);
> >    rtx dest = SET_DEST (set);
> >    return (is_a<scalar_mode> (GET_MODE (dest))
> > -     && aarch64_mov_operand (src, GET_MODE (dest)));
> > +     && aarch64_mov_operand (src, GET_MODE (dest)))
> > +    || aarch64_advsimd_partial_mode_p (GET_MODE (dest));
> >  }
> > 
> >  /* Similar to seq_cost, but ignore cost for scalar moves.  */
> > diff --git a/gcc/config/aarch64/iterators.md
> > b/gcc/config/aarch64/iterators.md
> > index dfca3327f1f..1fc67d95bd4 100644
> > --- a/gcc/config/aarch64/iterators.md
> > +++ b/gcc/config/aarch64/iterators.md
> > @@ -139,6 +139,10 @@
> >  ;; VQMOV without 2-element modes.
> >  (define_mode_iterator VQMOV_NO2E [V16QI V8HI V4SI V8HF V8BF V4SF])
> > 
> > +;; Modes that can be duplicated into a 64/128-bit register.
> > +(define_mode_iterator VDUP [V8QI V4QI V2QI QI V4HI V2HI HI V2SI SI DI
> > +                            V4BF V2BF BF V4HF V2HF HF V2SF SF DF])
> > +
> >  ;; Double integer vector modes.
> >  (define_mode_iterator VD_I [V8QI V4HI V2SI DI])
> > 
> > @@ -1488,7 +1492,9 @@
> > 
> >  ;; The number of bits in a vector element, or controlled by a predicate
> >  ;; element.
> > -(define_mode_attr elem_bits [(VNx16BI "8") (VNx8BI "16")
> > +(define_mode_attr elem_bits [(V2QI "8") (V4QI "8") (V2HF "16") (V2HI
> > "16")
> > +                        (V2BF "16")
> > +                        (VNx16BI "8") (VNx8BI "16")
> >                          (VNx4BI "32") (VNx2BI "64")
> >                          (VNx16QI "8") (VNx32QI "8") (VNx64QI "8")
> >                          (VNx8HI "16") (VNx16HI "16") (VNx32HI "16")
> > @@ -1593,11 +1599,12 @@
> > 
> >  ;; Mode-to-individual element type mapping.
> >  (define_mode_attr Vetype [(V8QI "b") (V16QI "b")
> > -                     (V4HI "h") (V8HI  "h")
> > +                     (V2QI "b") (V4QI "b")
> > +                     (V4HI "h") (V8HI  "h") (V2HI "h")
> >                       (V2SI "s") (V4SI  "s")
> >                       (V2DI "d") (V1DI  "d")
> > -                     (V4HF "h") (V8HF  "h")
> > -                     (V2SF "s") (V4SF  "s")
> > +                     (V4HF "h") (V8HF  "h") (V2HF "h")
> > +                     (V2SF "s") (V4SF  "s") (V2BF "h")
> >                       (V2DF "d") (V1DF  "d")
> >                       (V2x8QI "b") (V2x4HI "h")
> >                       (V2x2SI "s") (V2x1DI "d")
> > @@ -1772,8 +1779,10 @@
> >                            (V4x2DF "v2df") (V4x8BF "v8bf")])
> > 
> >  ;; Define element mode for each vector mode.
> > -(define_mode_attr VEL [(V8QI  "QI") (V16QI "QI")
> > +(define_mode_attr VEL [(V8QI "QI") (V16QI "QI")
> > +                  (V2QI "QI") (V4QI  "QI")
> >                    (V4HI "HI") (V8HI  "HI")
> > +                  (V2HI "HI") (V2HF  "HF")
> >                    (V2SI "SI") (V4SI  "SI")
> >                    (DI   "DI") (V1DI  "DI")
> >                    (V2DI "DI")
> > @@ -1784,6 +1793,7 @@
> >                    (SI   "SI") (HI    "HI")
> >                    (QI   "QI")
> >                    (V4BF "BF") (V8BF "BF")
> > +                  (V2BF "BF")
> >                    (V2x8QI "QI") (V2x4HI "HI")
> >                    (V2x2SI "SI") (V2x1DI "DI")
> >                    (V2x4HF "HF") (V2x2SF "SF")
> > @@ -2037,6 +2047,26 @@
> >  (define_mode_attr V2ntype [(V8HI "16b") (V4SI "8h")
> >                        (V2DI "4s")])
> > 
> > +;; Register suffix used when duplicating a value of a certain mode
> > +;; into a full 128-bit AdvSIMD register.
> > +(define_mode_attr Vqduptype [(QI "16b") (V2QI "8h") (V4QI "4s") (V8QI
> > "2d")
> > +                        (HI "8h") (V2HI "4s") (V4HI "2d")
> > +                        (HF "8h") (V2HF "4s") (V4HF "2d")
> > +                        (BF "8h") (V2BF "4s") (V4BF "2d")
> > +                        (SI "4s") (V2SI "2d")
> > +                        (SF "4s") (V2SF "2d")
> > +                        (DI "2d") (DF "2d")])
> > +
> > +;; Register suffix used when duplicating a value of a certain mode
> > +;; into a partial 64-bit AdvSIMD register.
> > +(define_mode_attr Vdduptype [(QI "8b") (V2QI "4h") (V4QI "2s") (V8QI "")
> > +                        (HI "4h") (V2HI "2s") (V4HI "")
> > +                        (HF "4h") (V2HF "2s") (V4HF "")
> > +                        (BF "4h") (V2BF "2s") (V4BF "")
> > +                        (SI "2s") (V2SI "")
> > +                        (SF "2s") (V2SF "")
> > +                        (DI "") (DF "")])
> > +
> >  ;; The result of FCVTN on two vectors of the given mode.  The result has
> >  ;; twice as many QI elements as the input.
> >  (define_mode_attr VPACKB [(V4HF "V8QI") (V8HF "V16QI") (V4SF "V8QI")])
> > @@ -2161,8 +2191,13 @@
> >  ;; Whether a mode fits in W or X registers (i.e. "w" for 32-bit modes
> >  ;; and "x" for 64-bit modes).
> >  (define_mode_attr single_wx [(SI   "w") (SF   "w")
> > +                        (V2QI "w") (V4QI "w")
> >                          (V8QI "x") (V4HI "x")
> >                          (V4HF "x") (V4BF "x")
> > +                        (V2HI "w") (V2HF "w")
> > +                        (HF   "w") (QI   "w")
> > +                        (V2BF "w") (BF   "w")
> > +                        (HI   "w")
> >                          (V2SI "x") (V2SF "x")
> >                          (DI   "x") (DF   "x")])
> > 
> > @@ -2172,7 +2207,12 @@
> >                            (V8QI "d") (V4HI "d")
> >                            (V4HF "d") (V4BF "d")
> >                            (V2SI "d") (V2SF "d")
> > -                          (DI   "d") (DF   "d")])
> > +                          (DI   "d") (DF   "d")
> > +                          (QI   "b") (BF   "h")
> > +                          (V2HF "s") (HI   "h")
> > +                          (V4QI "s") (V2QI "h")
> > +                          (V2HI "s") (V2BF "s")
> > +                          (HF   "h")])
> > 
> >  ;; Whether a double-width mode fits in D or Q registers (i.e. "d" for
> >  ;; 32-bit modes and "q" for 64-bit modes).
> > @@ -2182,9 +2222,13 @@
> >                             (V2SI "q") (V2SF "q")
> >                             (DI   "q") (DF   "q")])
> > 
> > -;; Scalar size of a sub-64-bit vector mode.
> > -(define_mode_attr vstype [(V4QI "s") (V2QI "h")
> > -                     (V2HI "s") (V2BF "s") (V2HF "s")])
> > +;; Scalar size of a sub-128-bit vector or scalar mode.
> > +(define_mode_attr vstype [(V8QI "d") (V4QI "s") (V2QI "h") (QI "b")
> > +                     (V4HI "d") (V2HI "s") (HI "h")
> > +                     (V2SI "d") (SI "s") (DI "d")
> > +                     (V4BF "d") (V2BF "s") (BF "h")
> > +                     (V4HF "d") (V2HF "s") (HF "h")
> > +                     (V2SF "d") (SF "s") (DF "d")])
> > 
> >  ;; Define corresponding core/FP element mode for each vector mode.
> >  (define_mode_attr vw [(V8QI "w") (V16QI "w")
> > diff --git a/gcc/testsuite/gcc.target/aarch64/ldp_stp_16.c
> > b/gcc/testsuite/gcc.target/aarch64/ldp_stp_16.c
> > index 95835aa2eb4..a6b4d50f34f 100644
> > --- a/gcc/testsuite/gcc.target/aarch64/ldp_stp_16.c
> > +++ b/gcc/testsuite/gcc.target/aarch64/ldp_stp_16.c
> > @@ -96,9 +96,8 @@ CONS2_FN (4, float);
> > 
> >  /*
> >  ** cons2_8_float:
> > -** dup     v[0-9]+\.2s, v[0-9]+\.s\[0\]
> > -** dup     v[0-9]+\.2s, v[0-9]+\.s\[0\]
> > -** zip1    v([0-9]+)\.4s, v[0-9]+\.4s, v[0-9]+\.4s
> > +** uzp1    v1\.2s, v0\.2s, v1\.2s
> > +** dup     v([0-9]+)\.2d, v1\.d\[0\]
> >  ** stp     q\1, q\1, \[x0\]
> >  ** stp     q\1, q\1, \[x0, #?32\]
> >  ** ret
> > diff --git a/gcc/testsuite/gcc.target/aarch64/sve/slp_1.c
> > b/gcc/testsuite/gcc.target/aarch64/sve/slp_1.c
> > index 98e8ac3c785..2bb2c04fa20 100644
> > --- a/gcc/testsuite/gcc.target/aarch64/sve/slp_1.c
> > +++ b/gcc/testsuite/gcc.target/aarch64/sve/slp_1.c
> > @@ -30,14 +30,13 @@ vec_slp_##TYPE (TYPE *restrict a, TYPE b, TYPE c, int
> > n)  \
> >  TEST_ALL (VEC_PERM)
> > 
> >  /* We should use one DUP for each of the 8-, 16- and 32-bit types,
> > -   (for now, insert both elements with umov + ins for _Float16).  We should
> > use two
> > +   (for now, insert both elements with ins for _Float16).  We should use 
> > two
> >     DUPs for each of the three 64-bit types.  */
> >  /* { dg-final { scan-assembler-times {\tmov\tz[0-9]+\.h, [hw]} 2 } } */
> >  /* { dg-final { scan-assembler-times {\tmov\tz[0-9]+\.s, [sw]} 3 } } */
> >  /* { dg-final { scan-assembler-times {\tmov\tz[0-9]+\.d, [dx]} 9 } } */
> > -/* { dg-final { scan-assembler-times {\tumov\tw[0-9]+, v[0-9]+\.h} 2 } } */
> > -/* { dg-final { scan-assembler-times {\tins\tv[0-9]+\.h\[0\], w[0-9]+} 1 } 
> > } */
> > -/* { dg-final { scan-assembler-times {\tins\tv[0-9]+\.h\[1\], w[0-9]+} 1 } 
> > } */
> > +/* { dg-final { scan-assembler-times {\tins\tv[0-9]+\.h\[0\], 
> > v[0-9]+\.h\[0\]}
> > 1 } } */
> > +/* { dg-final { scan-assembler-times {\tins\tv[0-9]+\.h\[1\], 
> > v[0-9]+\.h\[0\]}
> > 1 } } */
> >  /* { dg-final { scan-assembler-times {\tzip1\tz[0-9]+\.d, z[0-9]+\.d, z[0-
> > 9]+\.d\n} 3 } } */
> >  /* { dg-final { scan-assembler-not {\tzip2\t} } } */
> > 
> > diff --git a/gcc/testsuite/gcc.target/aarch64/vec-init-18.c
> > b/gcc/testsuite/gcc.target/aarch64/vec-init-18.c
> > index ecb59fe510b..feeb181a0b5 100644
> > --- a/gcc/testsuite/gcc.target/aarch64/vec-init-18.c
> > +++ b/gcc/testsuite/gcc.target/aarch64/vec-init-18.c
> > @@ -15,6 +15,7 @@ int16x8_t foo2(int16_t x)
> >    return v;
> >  }
> > 
> > -/* { dg-final { scan-assembler-times {\tdup\tv[0-9]+\.4h, w[0-9]+} 3 } } */
> > -/* { dg-final { scan-assembler {\tmovi\tv[0-9]+\.4h, 0x1} } } */
> > -/* { dg-final { scan-assembler-times {\tzip1\tv[0-9]+\.8h, v[0-9]+\.8h, 
> > v[0-
> > 9]+\.8h} 2 } } */
> > +/* { dg-final { scan-assembler-times {\tdup\tv[0-9]+\.4s, w[0-9]+} 2 } } */
> > +/* { dg-final { scan-assembler-times {\tmov\tw[0-9]+, 65537} 1 } } */
> > +/* { dg-final { scan-assembler-times {\tbfi\tw[0-9]+, w[0-9]+, 0, 16} 1 } 
> > } */
> > +/* { dg-final { scan-assembler-times {\tbfi\tw[0-9]+, w[0-9]+, 16, 16} 1 } 
> > } */
> > diff --git a/gcc/testsuite/gcc.target/aarch64/vec-init-23.c
> > b/gcc/testsuite/gcc.target/aarch64/vec-init-23.c
> > new file mode 100644
> > index 00000000000..595470b29fb
> > --- /dev/null
> > +++ b/gcc/testsuite/gcc.target/aarch64/vec-init-23.c
> > @@ -0,0 +1,435 @@
> > +/* { dg-do compile } */
> > +/* { dg-options "-O2 -march=armv8.2-a+fp16" } */
> > +/* { dg-final { check-function-bodies "**" "" "" } } */
> > +
> > +#include <arm_neon.h>
> > +
> > +/* Check vector initialization with a repeating sequence of elements.  */
> > +
> > +#ifndef TESTCASE
> > +#define TESTCASE(TYPE, ETYPE, T, SZ, NUM, MULT, ...)\
> > +  TYPE##SZ##MULT##_t test_##TYPE##SZ##_##NUM (ETYPE x0, ETYPE x1,
> > ETYPE x2, ETYPE x3,\
> > +                                      ETYPE x4, ETYPE x5, ETYPE x6, ETYPE
> > x7)\
> > +  {\
> > +    return (TYPE##SZ##MULT##_t) {__VA_ARGS__};\
> > +  }
> > +#endif
> > +
> > +#define TEST_8(TYPE, ETYPE, T)\
> > +    TESTCASE (TYPE, ETYPE, T, 8, 1, x16, x0, x0, x0, x0, x0, x0, x0, x0,\
> > +                          x0, x0, x0, x0, x0, x0, x0, x0)\
> > +    TESTCASE (TYPE, ETYPE, T, 8, 2, x16, x0, x1, x0, x1, x0, x1, x0, x1,\
> > +                          x0, x1, x0, x1, x0, x1, x0, x1)\
> > +    TESTCASE (TYPE, ETYPE, T, 8, 3, x16, x0, x1, x2, x3, x0, x1, x2, x3,\
> > +                          x0, x1, x2, x3, x0, x1, x2, x3)\
> > +    TESTCASE (TYPE, ETYPE, T, 8, 4, x16, x0, x1, x2, x3, x4, x5, x6, x7,\
> > +                          x0, x1, x2, x3, x4, x5, x6, x7)\
> > +    TESTCASE (TYPE, ETYPE, T, 8, 5, x16, x0, 0, x0, 0, x0, 0, x0, 0,\
> > +                          x0, 0, x0, 0, x0, 0, x0, 0)\
> > +    TESTCASE (TYPE, ETYPE, T, 8, 6, x16, 0, x0, 0, x0, 0, x0, 0, x0,\
> > +                          0, x0, 0, x0, 0, x0, 0, x0)\
> > +    TESTCASE (TYPE, ETYPE, T, 8, 7, x16, x0, x1, 0, 1, x0, x1, 0, 1,\
> > +                          x0, x1, 0, 1, x0, x1, 0, 1)\
> > +    TESTCASE (TYPE, ETYPE, T, 8, 8, x16, 0, 1, x0, x1, 0, 1, x0, x1,\
> > +                      0, 1, x0, x1, 0, 1, x0, x1)\
> > +    TESTCASE (TYPE, ETYPE, T, 8, 9, x16, x0, 0, x1, 1, x0, 0, x1, 1,\
> > +                          x0, 0, x1, 1, x0, 0, x1, 1)\
> > +    TESTCASE (TYPE, ETYPE, T, 8, 10, x16, x0, 0, x1, 1, x2, 2, x3, 3,\
> > +                          x0, 0, x1, 1, x2, 2, x3, 3)\
> > +    TESTCASE (TYPE, ETYPE, T, 8, 11, x16, 0, x0, 1, x1, 2, x2, 3, x3,\
> > +                          0, x0, 1, x1, 2, x2, 3, x3)\
> > +    TESTCASE (TYPE, ETYPE, T, 8, 12, x16, x0, x1, 0, 1, x2, x3, 2, 3,\
> > +                          x0, x1, 0, 1, x2, x3, 2, 3)\
> > +    TESTCASE (TYPE, ETYPE, T, 8, 13, x16, 0, 1, x0, x1, 2, 3, x2, x3,\
> > +                          0, 1, x0, x1, 2, 3, x2, x3)
> > +
> > +#define TEST_16(TYPE, ETYPE, T)\
> > +    TESTCASE (TYPE, ETYPE, T, 16, 1, x8, x0, x0, x0, x0, x0, x0, x0, x0)\
> > +    TESTCASE (TYPE, ETYPE, T, 16, 2, x8, x0, x1, x0, x1, x0, x1, x0, x1)\
> > +    TESTCASE (TYPE, ETYPE, T, 16, 3, x8, x0, x1, x2, x3, x0, x1, x2, x3)\
> > +    TESTCASE (TYPE, ETYPE, T, 16, 4, x8, x0, 0, x0, 0, x0, 0, x0, 0)\
> > +    TESTCASE (TYPE, ETYPE, T, 16, 5, x8, 0, x0, 0, x0, 0, x0, 0, x0)\
> > +    TESTCASE (TYPE, ETYPE, T, 16, 6, x8, x0, x1, 0, 1, x0, x1, 0, 1)\
> > +    TESTCASE (TYPE, ETYPE, T, 16, 7, x8, 0, 1, x0, x1, 0, 1, x0, x1)\
> > +    TESTCASE (TYPE, ETYPE, T, 16, 8, x8, 0, x0, 1, x1, 0, x0, 1, x1)\
> > +
> > +#define TEST_32(TYPE, ETYPE, T)\
> > +    TESTCASE (TYPE, ETYPE, T, 32, 1, x4, x0, x0, x0, x0)\
> > +    TESTCASE (TYPE, ETYPE, T, 32, 2, x4, x0, x1, x0, x1)\
> > +    TESTCASE (TYPE, ETYPE, T, 32, 3, x4, x0, 0, x0, 0)\
> > +    TESTCASE (TYPE, ETYPE, T, 32, 4, x4, 0, x0, 0, x0)
> > +
> > +#define TEST_64(TYPE, ETYPE, T)\
> > +    TESTCASE (TYPE, ETYPE, T, 64, 1, x2, x0, x0)
> > +
> > +TEST_8(int, int8_t, s)
> > +
> > +TEST_16(float, float, f)
> > +TEST_16(int, int16_t, s)
> > +
> > +TEST_32(float, float, f)
> > +TEST_32(int, int32_t, s)
> > +
> > +TEST_64(float, double, f)
> > +TEST_64(int, int64_t, s)
> > +
> > +/*
> > +** test_int8_1:
> > +** dup     v0\.16b, w0
> > +** ret
> > +*/
> > +
> > +/*
> > +** test_int8_2:
> > +** bfi     w0, w1, 8, 8
> > +** dup     v0\.8h, w0
> > +** ret
> > +*/
> > +
> > +/*
> > +** test_int8_3:
> > +** bfi     w0, w1, 8, 8
> > +** bfi     w0, w2, 16, 8
> > +** bfi     w0, w3, 24, 8
> > +** dup     v0\.4s, w0
> > +** ret
> > +*/
> > +
> > +/*
> > +** test_int8_4:
> > +** bfi     w0, w2, 8, 8
> > +** bfi     w1, w3, 8, 8
> > +** bfi     w0, w4, 16, 8
> > +** bfi     w1, w5, 16, 8
> > +** bfi     w0, w6, 24, 8
> > +** bfi     w1, w7, 24, 8
> > +** dup     v31\.2s, w0
> > +** dup     v0\.2s, w1
> > +** zip1    v0\.16b, v31\.16b, v0\.16b
> > +** ret
> > +*/
> > +
> > +/*
> > +** test_int8_5:
> > +** mov     w1, 0
> > +** bfi     w1, w0, 0, 8
> > +** dup     v0\.8h, w1
> > +** ret
> > +*/
> > +
> > +/*
> > +** test_int8_6:
> > +** mov     w1, 0
> > +** bfi     w1, w0, 8, 8
> > +** dup     v0\.8h, w1
> > +** ret
> > +*/
> > +
> > +/*
> > +** test_int8_7:
> > +** mov     w2, 16777472
> > +** bfi     w2, w0, 0, 8
> > +** bfi     w2, w1, 8, 8
> > +** dup     v0\.4s, w2
> > +** ret
> > +*/
> > +
> > +/*
> > +** test_int8_8:
> > +** mov     w2, 16777472
> > +** bfi     w2, w0, 16, 8
> > +** bfi     w2, w1, 24, 8
> > +** dup     v0\.4s, w2
> > +** ret
> > +*/
> > +
> > +/*
> > +** test_int8_9:
> > +** mov     w2, 16842752
> > +** bfi     w2, w0, 0, 8
> > +** bfi     w2, w1, 16, 8
> > +** dup     v0\.4s, w2
> > +** ret
> > +*/
> > +
> > +/*
> > +** test_int8_10:
> > +** bfi     w0, w1, 8, 8
> > +** bfi     w0, w2, 16, 8
> > +** bfi     w0, w3, 24, 8
> > +** dup     v31\.2s, w0
> > +** adrp    x0, .LANCHOR[0-9]+
> > +** ldr     d0, \[x0, #:lo12:.LANCHOR[0-9]+\]
> > +** zip1    v0\.16b, v31\.16b, v0\.16b
> > +** ret
> > +*/
> > +
> > +/*
> > +** test_int8_11:
> > +** bfi     w0, w1, 8, 8
> > +** adrp    x4, .LANCHOR[0-9]+
> > +** bfi     w0, w2, 16, 8
> > +** ldr     d0, \[x4, #:lo12:\.LANCHOR[0-9]+\]
> > +** bfi     w0, w3, 24, 8
> > +** dup     v31\.2s, w0
> > +** zip1    v0\.16b, v0\.16b, v31\.16b
> > +** ret
> > +*/
> > +
> > +/*
> > +** test_int8_12:
> > +** mov     w4, 33685504
> > +** bfi     w4, w0, 0, 8
> > +** mov     w0, 257
> > +** movk    w0, 0x303, lsl 16
> > +** bfi     w0, w1, 0, 8
> > +** bfi     w4, w2, 16, 8
> > +** bfi     w0, w3, 16, 8
> > +** dup     v31\.2s, w4
> > +** dup     v0\.2s, w0
> > +** zip1    v0\.16b, v31\.16b, v0\.16b
> > +** ret
> > +*/
> > +
> > +/*
> > +** test_int8_13:
> > +** mov     w4, 33685504
> > +** bfi     w4, w0, 8, 8
> > +** mov     w0, 257
> > +** movk    w0, 0x303, lsl 16
> > +** bfi     w0, w1, 8, 8
> > +** bfi     w4, w2, 24, 8
> > +** bfi     w0, w3, 24, 8
> > +** dup     v31\.2s, w4
> > +** dup     v0\.2s, w0
> > +** zip1    v0\.16b, v31\.16b, v0\.16b
> > +** ret
> > +*/
> > +
> > +/*
> > +** test_float16_1:
> > +** fcvt    h0, s0
> > +** dup     v0\.8h, v0\.h\[0\]
> > +** ret
> > +*/
> > +
> > +/*
> > +** test_float16_2:
> > +** fcvt    h1, s1
> > +** fcvt    h0, s0
> > +** ins     v0\.h\[1\], v1\.h\[0\]
> > +** dup     v0\.4s, v0\.s\[0\]
> > +** ret
> > +*/
> > +
> > +/*
> > +** test_float16_3:
> > +** uzp1    v2\.2s, v0\.2s, v2\.2s
> > +** uzp1    v3\.2s, v1\.2s, v3\.2s
> > +** zip1    v3\.4s, v2\.4s, v3\.4s
> > +** fcvtn   v0\.4h, v3\.4s
> > +** uzp1    v0\.2d, v0\.2d, v0\.2d
> > +** ret
> > +*/
> > +
> > +/*
> > +** test_float16_4:
> > +** fcvt    h0, s0
> > +** movi    v31\.2d, #0
> > +** ins     v31\.h\[0\], v0\.h\[0\]
> > +** dup     v0\.4s, v31\.s\[0\]
> > +** ret
> > +*/
> > +
> > +/*
> > +** test_float16_5:
> > +** fcvt    h0, s0
> > +** movi    v31\.2d, #0
> > +** ins     v31\.h\[1\], v0\.h\[0\]
> > +** dup     v0\.4s, v31\.s\[0\]
> > +** ret
> > +*/
> > +
> > +/*
> > +** test_float16_6:
> > +** fcvt    h1, s1
> > +** fcvt    h0, s0
> > +** movi    v31\.2d, #0
> > +** mov     w0, 1006648320
> > +** umov    w1, v1\.h\[0\]
> > +** ins     v31\.h\[0\], v0\.h\[0\]
> > +** bfi     w0, w1, 0, 16
> > +** dup     v31\.2s, v31\.s\[0\]
> > +** dup     v0\.2s, w0
> > +** zip1    v0\.8h, v31\.8h, v0\.8h
> > +** ret
> > +*/
> > +
> > +/*
> > +** test_float16_7:
> > +** fcvt    h1, s1
> > +** fcvt    h0, s0
> > +** movi    v31\.2d, #0
> > +** mov     w0, 1006648320
> > +** umov    w1, v1\.h\[0\]
> > +** ins     v31\.h\[1\], v0\.h\[0\]
> > +** bfi     w0, w1, 16, 16
> > +** dup     v31\.2s, v31\.s\[0\]
> > +** dup     v0\.2s, w0
> > +** zip1    v0\.8h, v31\.8h, v0\.8h
> > +** ret
> > +*/
> > +
> > +/*
> > +** test_float16_8:
> > +** fcvt    h1, s1
> > +** fcvt    h0, s0
> > +** movi    v31\.2s, 0x3c, lsl 24
> > +** ins     v0\.h\[1\], v1\.h\[0\]
> > +** dup     v0\.2s, v0\.s\[0\]
> > +** zip1    v0\.8h, v31\.8h, v0\.8h
> > +** ret
> > +*/
> > +
> > +/*
> > +** test_int16_1:
> > +** dup     v0\.8h, w0
> > +** ret
> > +*/
> > +
> > +/*
> > +** test_int16_2:
> > +** bfi     w0, w1, 16, 16
> > +** dup     v0\.4s, w0
> > +** ret
> > +*/
> > +
> > +/*
> > +** test_int16_3:
> > +** bfi     w0, w2, 16, 16
> > +** bfi     w1, w3, 16, 16
> > +** dup     v31\.2s, w0
> > +** dup     v0\.2s, w1
> > +** zip1    v0\.8h, v31\.8h, v0\.8h
> > +** ret
> > +*/
> > +
> > +/*
> > +** test_int16_4:
> > +** mov     w1, 0
> > +** bfi     w1, w0, 0, 16
> > +** dup     v0\.4s, w1
> > +** ret
> > +*/
> > +
> > +/*
> > +** test_int16_5:
> > +** mov     w1, 0
> > +** bfi     w1, w0, 16, 16
> > +** dup     v0\.4s, w1
> > +** ret
> > +*/
> > +
> > +/*
> > +** test_int16_6:
> > +** mov     w2, 0
> > +** bfi     w2, w0, 0, 16
> > +** mov     w0, 65537
> > +** bfi     w0, w1, 0, 16
> > +** dup     v31\.2s, w2
> > +** dup     v0\.2s, w0
> > +** zip1    v0\.8h, v31\.8h, v0\.8h
> > +** ret
> > +*/
> > +
> > +/*
> > +** test_int16_7:
> > +** mov     w2, 0
> > +** bfi     w2, w0, 16, 16
> > +** mov     w0, 65537
> > +** bfi     w0, w1, 16, 16
> > +** dup     v31\.2s, w2
> > +** dup     v0\.2s, w0
> > +** zip1    v0\.8h, v31\.8h, v0\.8h
> > +** ret
> > +*/
> > +
> > +/*
> > +** test_int16_8:
> > +** bfi     w0, w1, 16, 16
> > +** movi    v0\.2s, 0x1, lsl 16
> > +** dup     v31\.2s, w0
> > +** zip1    v0\.8h, v0\.8h, v31\.8h
> > +** ret
> > +*/
> > +
> > +/*
> > +** test_float32_1:
> > +** dup     v0\.4s, v0\.s\[0\]
> > +** ret
> > +*/
> > +
> > +/*
> > +** test_float32_2:
> > +** uzp1    v0\.2s, v0\.2s, v1\.2s
> > +** dup     v0\.2d, v0\.d\[0\]
> > +** ret
> > +*/
> > +
> > +/*
> > +** test_float32_3:
> > +** movi    v31\.2s, 0
> > +** dup     v0\.2s, v0\.s\[0\]
> > +** zip1    v0\.4s, v0\.4s, v31\.4s
> > +** ret
> > +*/
> > +
> > +/*
> > +** test_float32_4:
> > +** movi    v31\.2s, 0
> > +** dup     v0\.2s, v0\.s\[0\]
> > +** zip1    v0\.4s, v31\.4s, v0\.4s
> > +** ret
> > +*/
> > +
> > +/*
> > +** test_int32_1:
> > +** dup     v0\.4s, w0
> > +** ret
> > +*/
> > +
> > +/*
> > +** test_int32_2:
> > +** fmov    s0, w0
> > +** ins     v0\.s\[1\], w1
> > +** dup     v0\.2d, v0\.d\[0\]
> > +** ret
> > +*/
> > +
> > +/*
> > +** test_int32_3:
> > +** dup     v31\.2s, w0
> > +** movi    v0\.2s, 0
> > +** zip1    v0\.4s, v31\.4s, v0\.4s
> > +** ret
> > +*/
> > +
> > +/*
> > +** test_int32_4:
> > +** dup     v31\.2s, w0
> > +** movi    v0\.2s, 0
> > +** zip1    v0\.4s, v0\.4s, v31\.4s
> > +** ret
> > +*/
> > +
> > +/*
> > +** test_float64_1:
> > +** dup     v0\.2d, v0\.d\[0\]
> > +** ret
> > +*/
> > +
> > +/*
> > +** test_int64_1:
> > +** dup     v0\.2d, x0
> > +** ret
> > +*/
> > --
> > 2.43.0
>

Re: [PATCH 2/4] aarch64: initialize vectors from starting subsequence

Reply via email to