Re: [PATCH][GCC][AArch64]: Break apart paradoxical subregs for VSTRUCT writes (PR target/94052)

Tamar Christina Thu, 12 Mar 2020 06:43:25 -0700

Hi Richard,

I have updated the patch, changelog is the same.


bootstrapped and regtested on aarch64-none-linux-gnu and no issues.

OK for gcc 9 and 8?

Thanks,
Tamar

The 03/10/2020 11:49, Richard Sandiford wrote:
> Tamar Christina <tamar.christ...@arm.com> writes:
> > Hi All,
> >
> > This works around an ICE in reload where from expand we get the following 
> > RTL
> > generated for VSTRUCT mode writes:
> >
> > (insn 446 354 445 2 (set (reg:CI 383)
> >  (subreg:CI (reg:V4SI 291) 0)) "small.i":146:22 3408 {*aarch64_movci}
> >  (nil))
> >
> > This sequence is trying to say two things:
> >
> > 1) liveliness: It's trying to say that eventually the whole CI reg will be
> >            written to. It does this by generating the paradoxical subreg.
> > 2) write data: It's trying to in the same instruction also write the V4SI 
> > mode
> >            component at offset 0 in the CI reg.
> >
> > Reload is unable to understand this concept and so it attempts to handle 
> > this
> > instruction by breaking apart the instruction, first writing the data and 
> > then
> > tries to reload the paradoxical part.  This gets it to the same instruction
> > again and eventually we ICE since we reach the limit of no. reloads.
> 
> reload/LRA does understand the concept.  It just isn't handling this
> particular case very well :-)
> 
> The pre-RA insn is:
> 
> (insn 210 218 209 6 (set (reg/v:OI 182 [ vres ])
>         (subreg:OI (reg:V4SI 289) 0)) "pr94052.C":157:31 3518 {*aarch64_movoi}
>      (expr_list:REG_DEAD (reg:V4SI 289)
>         (nil)))
> 
> IRA allocates a hard register to r182 but not r289:
> 
>   126:r182 l0    32  ...
>   ...
>   129:r289 l0   mem  ...
> 
> This is because memory appears to be cheaper:
> 
>   a129(r289,l2) costs: FP_LO8_REGS:136,136 FP_LO_REGS:136,136 FP_REGS:136,136 
> MEM:110,110
> 
> That's probably a bug in itself (possibly in the target costs), but it
> shouldn't be a correctness issue.
> 
> LRA then handles insn 210 like this:
> 
>       Creating newreg=317, assigning class ALL_REGS to slow/invalid mem r317
>       Creating newreg=318, assigning class ALL_REGS to slow/invalid mem r318
>   210: r182:OI=r318:OI
>       REG_DEAD r289:V4SI
>     Inserting slow/invalid mem reload before:
>   355: r317:V4SI=[sfp:DI+0x60]
>   356: r318:OI=r317:V4SI#0
> 
>             1 Non pseudo reload: reject++
>           alt=0,overall=1,losers=0,rld_nregs=0
>          Choosing alt 0 in insn 210:  (0) =w  (1) w {*aarch64_movoi}
>       Change to class FP_REGS for r318
> 
> That looks OK so far, given the allocation.  But later we have:
> 
> ********** Assignment #2: **********
> 
>          Assigning to 318 (cl=FP_REGS, orig=318, freq=44, tfirst=318, 
> tfreq=44)...
>            Assign 32 to subreg reload r318 (freq=44)
>          Assigning to 317 (cl=ALL_REGS, orig=317, freq=44, tfirst=317, 
> tfreq=44)...
>            Assign 0 to subreg reload r317 (freq=44)
> 
> So LRA is assigning a GPR (x0) to the new V4SI register r317.  It's this
> allocation that induces the cycling, because we get:
> 
>            Cycle danger: overall += LRA_MAX_REJECT
>           alt=0,overall=606,losers=1,rld_nregs=2
>             0 Spill pseudo into memory: reject+=3
>             Using memory insn operand 0: reject+=3
>             0 Non input pseudo reload: reject++
>             Cycle danger: overall += LRA_MAX_REJECT
>           alt=1,overall=619,losers=2,rld_nregs=2
>             1 Spill pseudo into memory: reject+=3
>             Using memory insn operand 1: reject+=3
>           alt=2,overall=12,losers=1,rld_nregs=0
>          Choosing alt 2 in insn 356:  (0) w  (1) Utv {*aarch64_movoi}
>       Creating newreg=319, assigning class NO_REGS to r319
>   356: r318:OI=r319:OI
>       REG_DEAD r317:V4SI
>     Inserting insn reload before:
>   357: r319:OI=r317:V4SI#0
> 
>             0 Non input pseudo reload: reject++
>           alt=0,overall=13,losers=2,rld_nregs=4
>             0 Non pseudo reload: reject++
>           alt=1,overall=7,losers=1,rld_nregs=2
>             0 Non input pseudo reload: reject++
>             1 Spill pseudo into memory: reject+=3
>             Using memory insn operand 1: reject+=3
>             alt=2,overall=19,losers=2 -- refuse
>          Choosing alt 1 in insn 357:  (0) Utv  (1) w {*aarch64_movoi}
>       Creating newreg=320, assigning class FP_REGS to r320
>   357: r319:OI=r320:OI
>     Inserting insn reload before:
>   358: r320:OI=r317:V4SI#0
> 
> and we keep oscillating between those two choices (alt 2 vs alt 1).
> This wouldn't have happened if we'd allocated an FPR instead.
> 
> I think the problem here is that we're always trying to reload the
> subreg rather than the inner register, even though the allocation for
> the inner register isn't valid for the subreg.  There is code in
> simplify_operand_subreg to detect this kind of situation, but it
> seems to be missing a check for hard_regno_mode_ok.  The first patch
> below seems to fix that.
> 
> > This patch fixes it by in the backend when we see such a paradoxical
> > construction breaking it apart and issuing a clobber to correct the 
> > liveliness
> > information and then emitting a normal subreg write for the component that 
> > the
> > paradoxical subreg was trying to write to.
> >
> > Concretely we generate this:
> >
> > (insn 42 41 43 (clobber (reg/v:CI 122 [ diD.5226 ])) "small.i":121:23 -1
> >      (nil))
> >
> > (insn 43 42 44 (set (subreg:V4SI (reg/v:CI 122 [ diD.5226 ]) 0)
> >         (reg:V4SI 136)) "small.i":121:23 -1
> >      (nil))
> >
> > Bootstrapped Regtested on aarch64-none-linux-gnu and no issues.
> >
> > Ok for master and back-port to GCC 9 and GCC 8 after some stew?
> >
> > I will look into seeing if we can not generate these at all, but I'm not 
> > sure
> > this is possible since the mid-end would need both the Mode and the Class to
> > know that a pseudo will be assigned to multiple hardregs.
> 
> Subreg-based register liveness divides the register into
> REGMODE_NATURAL_SIZE chunks (traditionally word-sized chunks).
> E.g. read_modify_subreg_p checks whether a particular subreg
> preserves part of the old value or not.  I think it's these rules
> that matter here.
> 
> The second patch below adds that check to emit_group_store and generates
> slightly better code than the paradoxical version did.
> 
> Both patches fix the bug independently, although the second is more
> of a workaround than a fix.  Unfortunately I messed up overnight testing
> so they've only been tested on aarch64-linux-gnu so far.  I'll try again
> tonight...
> 
> I think we should aim to fix this in target-independent code for master,
> but I agree that's probably too invasive to backport and so the AArch64
> workaround might be more appropriate for branches.
> 
> > +  /* If we have a paradoxical subreg trying to write to <MODE> from and the
> > +     registers don't overlap then we need to break it apart.  What it's 
> > trying
> > +     to do is give two kind of information at the same time.  It's trying 
> > to
> > +     convey liveness information by saying that the entire register will be
> > +     written to eventually, but it also only wants to write a single part 
> > of the
> > +     register.  Hence the paradoxical subreg.
> > +
> > +     However reload doesn't understand this concept and it will ultimately 
> > ICE.
> > +     Instead of allowing this we will split the two concerns.  The liveness
> > +     information will be conveyed using a clobber and then we break apart 
> > the
> > +     paradoxical subreg into just a normal write of the part that it 
> > wanted to
> > +     write originally.  */
> 
> As above, we should avoid saying that reload doesn't understand
> the concept.  We're just working around a specific bug.
> 
> > +  if (paradoxical_subreg_p (operands[1]))
> > +    {
> > +      if (!reg_overlap_mentioned_p (operands[0], operands[1]))
> > +   emit_clobber (operands[0]);
> > +      poly_uint64 offset = SUBREG_BYTE (operands[1]);
> > +      operands[1] = SUBREG_REG (operands[1]);
> > +      operands[0] = simplify_gen_subreg (GET_MODE (operands[1]), 
> > operands[0],
> > +                                    <MODE>mode, offset);
> > +    }
> 
> SUBREG_BYTE is always 0 for paradoxical subregs, so this will do the
> wrong thing for big-endian.  It's probably better to use gen_lowpart
> to get the destination.
> 
> We should probably also check that operands[0] is a REG.  In practice
> it probably always will be at this point if the source is a SUBREG,
> but it doesn't follow automatically, and given that this is supposed
> to be a safe backport...
> 
> FWIW, another option would be to make aarch64_preferred_reload_class
> return FP_REGS for 128-bit vector modes, if the given class is a
> superset of FP_REGS.  On the one hand that feels cleaner: there's a
> good argument that we should be doing that anyway, since it means
> we can reload in Q registers rather than pairs of X registers.
> But on the other hand, it feels less safe for backporting.  E.g. maybe
> we'll find another corner case that fails if a GPR *isn't* chosen.
> 
> Thanks,
> Richard
> 

> From f75d209ab3c27fcfea915fe77f37fa4f973f9dd0 Mon Sep 17 00:00:00 2001
> From: Richard Sandiford <richard.sandif...@arm.com>
> Date: Mon, 9 Mar 2020 19:42:57 +0000
> Subject: [PATCH 1/2] lra: Tighten check for reloading paradoxical subregs
>  [PR94052]
> 
> simplify_operand_subreg tries to detect whether the allocation for
> a pseudo in a paradoxical subreg is also valid for the outer mode.
> The condition it used to check for an invalid combination was:
> 
>   else if (REG_P (reg)
>          && REGNO (reg) >= FIRST_PSEUDO_REGISTER
>          && (hard_regno = lra_get_regno_hard_regno (REGNO (reg))) >= 0
>          && (hard_regno_nregs (hard_regno, innermode)
>              < hard_regno_nregs (hard_regno, mode))
>          && (regclass = lra_get_allocno_class (REGNO (reg)))
>          && (type != OP_IN
>              || !in_hard_reg_set_p (reg_class_contents[regclass],
>                                     mode, hard_regno)
>              || overlaps_hard_reg_set_p (lra_no_alloc_regs,
>                                          mode, hard_regno)))
> 
> I think there are two problems with this:
> 
> (1) It never actually checks whether the hard register is valid for the
>     outer mode (in the hard_regno_mode_ok sense).  If it isn't, any attempt
>     to reload in the outer mode is likely to cycle, because the implied
>     regno/mode combination will be just as invalid next time
>     curr_insn_transform sees the subreg.
> 
> (2) The check is valid for little-endian only.  For big-endian we need
>     to move hard_regno backwards.
> 
> Using simplify_subreg_regno should avoid both problems.
> 
> As the existing comment says, IRA should always take subreg references
> into account when allocating hard registers, so this fix-up should only
> really be needed for pseudos allocated by LRA itself.
> 
> 2020-03-10  Richard Sandiford  <richard.sandif...@arm.com>
> 
> gcc/
>       PR rtl-optimization/94052
>       * lra-constraints.c (simplify_operand_subreg): Reload the inner
>       register of a paradoxical subreg if simplify_subreg_regno fails
>       to give a valid hard register for the outer mode.
> ---
>  gcc/lra-constraints.c | 24 ++++++++++++++----------
>  1 file changed, 14 insertions(+), 10 deletions(-)
> 
> diff --git a/gcc/lra-constraints.c b/gcc/lra-constraints.c
> index f71e0c9ff8d..bf6d4a2fd4b 100644
> --- a/gcc/lra-constraints.c
> +++ b/gcc/lra-constraints.c
> @@ -1489,7 +1489,7 @@ static bool process_address (int, bool, rtx_insn **, 
> rtx_insn **);
>  static bool
>  simplify_operand_subreg (int nop, machine_mode reg_mode)
>  {
> -  int hard_regno;
> +  int hard_regno, inner_hard_regno;
>    rtx_insn *before, *after;
>    machine_mode mode, innermode;
>    rtx reg, new_reg;
> @@ -1735,15 +1735,19 @@ simplify_operand_subreg (int nop, machine_mode 
> reg_mode)
>       for the new uses.  */
>    else if (REG_P (reg)
>          && REGNO (reg) >= FIRST_PSEUDO_REGISTER
> -        && (hard_regno = lra_get_regno_hard_regno (REGNO (reg))) >= 0
> -        && (hard_regno_nregs (hard_regno, innermode)
> -            < hard_regno_nregs (hard_regno, mode))
> -        && (regclass = lra_get_allocno_class (REGNO (reg)))
> -        && (type != OP_IN
> -            || !in_hard_reg_set_p (reg_class_contents[regclass],
> -                                   mode, hard_regno)
> -            || overlaps_hard_reg_set_p (lra_no_alloc_regs,
> -                                        mode, hard_regno)))
> +        && paradoxical_subreg_p (operand)
> +        && (inner_hard_regno = lra_get_regno_hard_regno (REGNO (reg))) >= 0
> +        && ((hard_regno
> +             = simplify_subreg_regno (inner_hard_regno, innermode,
> +                                      SUBREG_BYTE (operand), mode)) < 0
> +            || ((hard_regno_nregs (inner_hard_regno, innermode)
> +                 < hard_regno_nregs (hard_regno, mode))
> +                && (regclass = lra_get_allocno_class (REGNO (reg)))
> +                && (type != OP_IN
> +                    || !in_hard_reg_set_p (reg_class_contents[regclass],
> +                                           mode, hard_regno)
> +                    || overlaps_hard_reg_set_p (lra_no_alloc_regs,
> +                                                mode, hard_regno)))))
>      {
>        /* The class will be defined later in curr_insn_transform.  */
>        enum reg_class rclass
> -- 
> 2.17.1
> 

> From 19ae0b272c7a191634249cdab90e5d9512b1474b Mon Sep 17 00:00:00 2001
> From: Richard Sandiford <richard.sandif...@arm.com>
> Date: Tue, 10 Mar 2020 09:17:17 +0000
> Subject: [PATCH 2/2] expand: Tweak paradoxical subreg handling in
>  emit_group_store [PR94052]
> 
> emit_group_store normally tries to start the sequence by storing
> a paradoxical subreg of one source value into the whole destination.
> That seems reasonable when the destination is effectively a single
> register, but as explained in the comment, it seems likely to make
> things worse then the destination is bigger than that.
> 
> 2020-03-10  Richard Sandiford  <richard.sandif...@arm.com>
> 
> gcc/
>       PR middle-end/94052
>       * expr.c (emit_group_store): Clobber a multi-register destination
>       before storing to it.
> ---
>  gcc/expr.c | 31 +++++++++++++++++++++++++++++--
>  1 file changed, 29 insertions(+), 2 deletions(-)
> 
> diff --git a/gcc/expr.c b/gcc/expr.c
> index b97c217e86d..57529f25769 100644
> --- a/gcc/expr.c
> +++ b/gcc/expr.c
> @@ -2489,7 +2489,9 @@ emit_group_store (rtx orig_dst, rtx src, tree type 
> ATTRIBUTE_UNUSED,
>    for (i = start; i < finish; i++)
>      {
>        rtx reg = XEXP (XVECEXP (src, 0, i), 0);
> -      if (!REG_P (reg) || REGNO (reg) < FIRST_PSEUDO_REGISTER)
> +      /* Checking for reg == dst (which is very unlikely to be true) ensures
> +      that we can emit a clobber of dst when dst is a pseudo register.  */
> +      if (!REG_P (reg) || REGNO (reg) < FIRST_PSEUDO_REGISTER || reg == dst)
>       {
>         tmps[i] = gen_reg_rtx (GET_MODE (reg));
>         emit_move_insn (tmps[i], reg);
> @@ -2530,11 +2532,36 @@ emit_group_store (rtx orig_dst, rtx src, tree type 
> ATTRIBUTE_UNUSED,
>        if (!REG_P (dst) || REGNO (dst) < FIRST_PSEUDO_REGISTER)
>       dst = gen_reg_rtx (outer);
>  
> +      /* Clobber the destination if pieces of it can be modified
> +      independently.  This ensures that we can assign to the individual
> +      pieces of dst without leaving dst upwards exposed.
> +
> +      The alternatives below are less likely to be useful for this case.
> +      Creating a paradoxical subreg leads to moves like:
> +
> +          (set (reg:TI X) (subreg:TI (reg:DI Y) 0))
> +
> +      On a 64-bit target, this would normally be split into two DI moves
> +      after register allocation.  We'd then expect one of those DI moves
> +      to be removed as dead due to later assignments to parts of X.
> +      However, the instruction must still be reloaded as a TImode move
> +      and the allocation of Y must be valid for TImode as well as DImode,
> +      even though we only really want a single DImode move.
> +
> +      If we set dst to zero instead, there's a risk that the
> +      redundancy will never be eliminated, particularly if the
> +      destination occupies more than two registers.  */
> +      if (known_gt (GET_MODE_SIZE (outer), REGMODE_NATURAL_SIZE (outer)))
> +     {
> +       emit_clobber (dst);
> +       done = 1;
> +     }
> +
>        /* Make life a bit easier for combine.  */
>        /* If the first element of the vector is the low part
>        of the destination mode, use a paradoxical subreg to
>        initialize the destination.  */
> -      if (start < finish)
> +      if (!done && start < finish)
>       {
>         inner = GET_MODE (tmps[start]);
>         bytepos = subreg_lowpart_offset (inner, outer);
> -- 
> 2.17.1
> 


--

diff --git a/gcc/config/aarch64/aarch64-simd.md b/gcc/config/aarch64/aarch64-simd.md
index 89aaf8c018e3340dd2d53fc2a6538d3d1220b103..16abe70be0ad90d3befdc5c4dfdb1471f2c98c58 100644
--- a/gcc/config/aarch64/aarch64-simd.md
+++ b/gcc/config/aarch64/aarch64-simd.md
@@ -5330,6 +5330,26 @@ (define_expand "mov<mode>"
       if (GET_CODE (operands[0]) != REG)
 	operands[1] = force_reg (<MODE>mode, operands[1]);
     }
+
+  /* If we have a paradoxical subreg trying to write to <MODE> from and the
+     registers don't overlap then we need to break it apart.  What it's trying
+     to do is give two kind of information at the same time.  It's trying to
+     convey liveness information by saying that the entire register will be
+     written to eventually, but it also only wants to write a single part of the
+     register.  Hence the paradoxical subreg.
+
+     Instead of allowing this we will split the two concerns.  The liveness
+     information will be conveyed using a clobber and then we break apart the
+     paradoxical subreg into just a normal write of the part that it wanted to
+     write originally.  */
+
+  if (REG_P(operands[0]) && paradoxical_subreg_p (operands[1]))
+    {
+      if (!reg_overlap_mentioned_p (operands[0], operands[1]))
+	emit_clobber (operands[0]);
+      operands[1] = SUBREG_REG (operands[1]);
+      operands[0] = gen_lowpart (GET_MODE (operands[1]), operands[0]);
+    }
 })
 
 
diff --git a/gcc/testsuite/g++.target/aarch64/pr94052.C b/gcc/testsuite/g++.target/aarch64/pr94052.C
new file mode 100644
index 0000000000000000000000000000000000000000..d36c9bdc1588533db35eb3cbd2502034edd25452
--- /dev/null
+++ b/gcc/testsuite/g++.target/aarch64/pr94052.C
@@ -0,0 +1,174 @@
+/* { dg-do compile } */
+/* { dg-additional-options "-O2 -std=gnu++11 -w" } */
+
+namespace c {
+typedef int d;
+template <typename e> struct f { typedef e g; };
+template <bool, typename> struct h;
+template <typename e> e aa(typename f<e>::g i) { return i; }
+template <typename, typename> struct j {};
+template <d, typename> struct k;
+template <class l, class m> struct k<1, j<l, m>> { typedef m g; };
+template <d n, class l, class m> typename k<n, j<l, m>>::g ab(j<l, m>);
+} // namespace c
+typedef long d;
+typedef char o;
+typedef int p;
+typedef char q;
+typedef int r;
+namespace {
+struct s;
+constexpr d t = 6;
+template <typename> class ad {
+public:
+  static constexpr d u = t;
+  d v();
+  d x();
+  d y();
+};
+class z : ad<int> {};
+struct ae {
+  p af;
+};
+class ag {
+public:
+  ae ah();
+};
+} // namespace
+typedef __Int32x4_t ai;
+typedef struct {
+  ai aj[2];
+} ak;
+typedef int al;
+void am(p *a, ai b) { __builtin_aarch64_st1v4si(a, b); }
+namespace an {
+class ao {
+public:
+  bool operator==(ao);
+  d v();
+  d x();
+};
+class ap : public ad<r> {};
+class aq {
+public:
+  c::j<int, int> ar();
+  int as();
+  int at();
+};
+class au {
+public:
+  virtual d av(d);
+  virtual ap aw();
+  virtual ag ax();
+};
+class ay {};
+class az {
+  virtual void ba(const ay &, const s &);
+};
+using bb = az;
+class bc;
+class bd : bb {
+  void ba(const ay &, const s &);
+  bc *be;
+  bc *bf;
+  bc *bg;
+  aq bh;
+  int bi;
+  int bj;
+  ao bk;
+};
+namespace bl {
+namespace bm {
+namespace bn {
+class bo;
+}
+} // namespace bm
+} // namespace bl
+namespace bn {
+template <typename ac = c::h<0, bl::bm ::bn::bo>>
+ai bp(ac *, ac *, ac *, al, al, al, d, p);
+template <typename ac = c::h<0, bl::bm ::bn::bo>>
+ak bq(ac *br, ac *bs, ac *bt, al bu, al bv, al bw, d bx, int, int by) {
+  ak{bp(br, bs, bt, bu, bv, bw, bx, by), bp(br, bs, bt, bu, bv, bw, bx, by)};
+}
+template <typename ac = c::h<0, bl::bm ::bn::bo>>
+ak bz(ac *, ac *, ac *, al, al, al &, int, p);
+template <int> void ca(p *, const ak &);
+template <> void ca<1>(p *buffer, const ak &cb) {
+  am(buffer, cb.aj[0]);
+  am(buffer + 4, cb.aj[1]);
+}
+int cc(int, int);
+} // namespace bn
+class bc {
+public:
+  virtual au *cd();
+};
+class ce {
+public:
+  q *cf();
+};
+template <d> struct cg {
+  template <typename ch> static void ci(ay, z cj, ch ck) { ck(cj); }
+};
+template <typename ch> void cl(ay w, ch ck) {
+  z cj;
+  cg<z::u>::ci(w, cj, c::aa<ch>(ck));
+}
+namespace {
+template <typename T1, typename cm, int cn> class co {
+public:
+  static void convolve(ay, int cs, bc *cp, bc *cq, bc *cr, aq cw, int, ao ct) {
+    int by = cp->cd()->ax().ah().af;
+    int cu = cq->cd()->ax().ah().af;
+    cp->cd()->aw().v();
+    int cv = cp->cd()->aw().x();
+    cp->cd()->aw().y();
+    cp->cd()->aw();
+    int da = cr->cd()->aw().x();
+    int cx = cq->cd()->aw().x();
+    cq->cd()->aw().y();
+    int cy = cr->cd()->av(0);
+    int cz = cr->cd()->av(1);
+    bn::cc(cs, cn);
+    int de = c::ab<1>(cw.ar());
+    cw.as();
+    cw.at();
+    ay db;
+    ce dc;
+    ce dd;
+    ce w;
+    q *di = w.cf();
+    cl(db, [&](z) {
+      int df;
+      dc;
+      di;
+      cx;
+      auto dg(cu);
+      auto dh(cu);
+      auto dl(cu);
+      for (; cz; df += de) {
+        auto br = reinterpret_cast<T1 *>(cv);
+        auto bs = reinterpret_cast<T1 *>(cv);
+        auto bt = reinterpret_cast<T1 *>(df * ct.x());
+        auto dj = reinterpret_cast<cm *>(dd.cf() + da);
+        for (int dk; dk < cy; dk += cs, dj += cs)
+          if (ct == ao()) {
+            auto vres = bn::bz(br, bs, bt, dg, dh, dl, cn, by);
+            bn::ca<cn>(dj, vres);
+          } else
+            bn::bq(br, bs, bt, dg, dh, dl, ct.v(), cn, by);
+      }
+    });
+  }
+};
+template <typename T1, typename cm>
+void bz(ay dm, int cs, bc *cp, bc *cq, bc *cr, aq cw, int dn, ao ct) {
+  co<T1, cm, 1>::convolve(dm, cs, cp, cq, cr, cw, dn, ct);
+  co<T1, cm, 2>::convolve(dm, cs, cp, cq, cr, cw, dn, ct);
+}
+} // namespace
+void bd::ba(const ay &dm, const s &) {
+  bz<o, p>(dm, bi, be, bg, bf, bh, bj, bk);
+}
+} // namespace an

Re: [PATCH][GCC][AArch64]: Break apart paradoxical subregs for VSTRUCT writes (PR target/94052)

Reply via email to