Hi Richard, I have updated the patch, changelog is the same.
bootstrapped and regtested on aarch64-none-linux-gnu and no issues. OK for gcc 9 and 8? Thanks, Tamar The 03/10/2020 11:49, Richard Sandiford wrote: > Tamar Christina <tamar.christ...@arm.com> writes: > > Hi All, > > > > This works around an ICE in reload where from expand we get the following > > RTL > > generated for VSTRUCT mode writes: > > > > (insn 446 354 445 2 (set (reg:CI 383) > > (subreg:CI (reg:V4SI 291) 0)) "small.i":146:22 3408 {*aarch64_movci} > > (nil)) > > > > This sequence is trying to say two things: > > > > 1) liveliness: It's trying to say that eventually the whole CI reg will be > > written to. It does this by generating the paradoxical subreg. > > 2) write data: It's trying to in the same instruction also write the V4SI > > mode > > component at offset 0 in the CI reg. > > > > Reload is unable to understand this concept and so it attempts to handle > > this > > instruction by breaking apart the instruction, first writing the data and > > then > > tries to reload the paradoxical part. This gets it to the same instruction > > again and eventually we ICE since we reach the limit of no. reloads. > > reload/LRA does understand the concept. It just isn't handling this > particular case very well :-) > > The pre-RA insn is: > > (insn 210 218 209 6 (set (reg/v:OI 182 [ vres ]) > (subreg:OI (reg:V4SI 289) 0)) "pr94052.C":157:31 3518 {*aarch64_movoi} > (expr_list:REG_DEAD (reg:V4SI 289) > (nil))) > > IRA allocates a hard register to r182 but not r289: > > 126:r182 l0 32 ... > ... > 129:r289 l0 mem ... > > This is because memory appears to be cheaper: > > a129(r289,l2) costs: FP_LO8_REGS:136,136 FP_LO_REGS:136,136 FP_REGS:136,136 > MEM:110,110 > > That's probably a bug in itself (possibly in the target costs), but it > shouldn't be a correctness issue. > > LRA then handles insn 210 like this: > > Creating newreg=317, assigning class ALL_REGS to slow/invalid mem r317 > Creating newreg=318, assigning class ALL_REGS to slow/invalid mem r318 > 210: r182:OI=r318:OI > REG_DEAD r289:V4SI > Inserting slow/invalid mem reload before: > 355: r317:V4SI=[sfp:DI+0x60] > 356: r318:OI=r317:V4SI#0 > > 1 Non pseudo reload: reject++ > alt=0,overall=1,losers=0,rld_nregs=0 > Choosing alt 0 in insn 210: (0) =w (1) w {*aarch64_movoi} > Change to class FP_REGS for r318 > > That looks OK so far, given the allocation. But later we have: > > ********** Assignment #2: ********** > > Assigning to 318 (cl=FP_REGS, orig=318, freq=44, tfirst=318, > tfreq=44)... > Assign 32 to subreg reload r318 (freq=44) > Assigning to 317 (cl=ALL_REGS, orig=317, freq=44, tfirst=317, > tfreq=44)... > Assign 0 to subreg reload r317 (freq=44) > > So LRA is assigning a GPR (x0) to the new V4SI register r317. It's this > allocation that induces the cycling, because we get: > > Cycle danger: overall += LRA_MAX_REJECT > alt=0,overall=606,losers=1,rld_nregs=2 > 0 Spill pseudo into memory: reject+=3 > Using memory insn operand 0: reject+=3 > 0 Non input pseudo reload: reject++ > Cycle danger: overall += LRA_MAX_REJECT > alt=1,overall=619,losers=2,rld_nregs=2 > 1 Spill pseudo into memory: reject+=3 > Using memory insn operand 1: reject+=3 > alt=2,overall=12,losers=1,rld_nregs=0 > Choosing alt 2 in insn 356: (0) w (1) Utv {*aarch64_movoi} > Creating newreg=319, assigning class NO_REGS to r319 > 356: r318:OI=r319:OI > REG_DEAD r317:V4SI > Inserting insn reload before: > 357: r319:OI=r317:V4SI#0 > > 0 Non input pseudo reload: reject++ > alt=0,overall=13,losers=2,rld_nregs=4 > 0 Non pseudo reload: reject++ > alt=1,overall=7,losers=1,rld_nregs=2 > 0 Non input pseudo reload: reject++ > 1 Spill pseudo into memory: reject+=3 > Using memory insn operand 1: reject+=3 > alt=2,overall=19,losers=2 -- refuse > Choosing alt 1 in insn 357: (0) Utv (1) w {*aarch64_movoi} > Creating newreg=320, assigning class FP_REGS to r320 > 357: r319:OI=r320:OI > Inserting insn reload before: > 358: r320:OI=r317:V4SI#0 > > and we keep oscillating between those two choices (alt 2 vs alt 1). > This wouldn't have happened if we'd allocated an FPR instead. > > I think the problem here is that we're always trying to reload the > subreg rather than the inner register, even though the allocation for > the inner register isn't valid for the subreg. There is code in > simplify_operand_subreg to detect this kind of situation, but it > seems to be missing a check for hard_regno_mode_ok. The first patch > below seems to fix that. > > > This patch fixes it by in the backend when we see such a paradoxical > > construction breaking it apart and issuing a clobber to correct the > > liveliness > > information and then emitting a normal subreg write for the component that > > the > > paradoxical subreg was trying to write to. > > > > Concretely we generate this: > > > > (insn 42 41 43 (clobber (reg/v:CI 122 [ diD.5226 ])) "small.i":121:23 -1 > > (nil)) > > > > (insn 43 42 44 (set (subreg:V4SI (reg/v:CI 122 [ diD.5226 ]) 0) > > (reg:V4SI 136)) "small.i":121:23 -1 > > (nil)) > > > > Bootstrapped Regtested on aarch64-none-linux-gnu and no issues. > > > > Ok for master and back-port to GCC 9 and GCC 8 after some stew? > > > > I will look into seeing if we can not generate these at all, but I'm not > > sure > > this is possible since the mid-end would need both the Mode and the Class to > > know that a pseudo will be assigned to multiple hardregs. > > Subreg-based register liveness divides the register into > REGMODE_NATURAL_SIZE chunks (traditionally word-sized chunks). > E.g. read_modify_subreg_p checks whether a particular subreg > preserves part of the old value or not. I think it's these rules > that matter here. > > The second patch below adds that check to emit_group_store and generates > slightly better code than the paradoxical version did. > > Both patches fix the bug independently, although the second is more > of a workaround than a fix. Unfortunately I messed up overnight testing > so they've only been tested on aarch64-linux-gnu so far. I'll try again > tonight... > > I think we should aim to fix this in target-independent code for master, > but I agree that's probably too invasive to backport and so the AArch64 > workaround might be more appropriate for branches. > > > + /* If we have a paradoxical subreg trying to write to <MODE> from and the > > + registers don't overlap then we need to break it apart. What it's > > trying > > + to do is give two kind of information at the same time. It's trying > > to > > + convey liveness information by saying that the entire register will be > > + written to eventually, but it also only wants to write a single part > > of the > > + register. Hence the paradoxical subreg. > > + > > + However reload doesn't understand this concept and it will ultimately > > ICE. > > + Instead of allowing this we will split the two concerns. The liveness > > + information will be conveyed using a clobber and then we break apart > > the > > + paradoxical subreg into just a normal write of the part that it > > wanted to > > + write originally. */ > > As above, we should avoid saying that reload doesn't understand > the concept. We're just working around a specific bug. > > > + if (paradoxical_subreg_p (operands[1])) > > + { > > + if (!reg_overlap_mentioned_p (operands[0], operands[1])) > > + emit_clobber (operands[0]); > > + poly_uint64 offset = SUBREG_BYTE (operands[1]); > > + operands[1] = SUBREG_REG (operands[1]); > > + operands[0] = simplify_gen_subreg (GET_MODE (operands[1]), > > operands[0], > > + <MODE>mode, offset); > > + } > > SUBREG_BYTE is always 0 for paradoxical subregs, so this will do the > wrong thing for big-endian. It's probably better to use gen_lowpart > to get the destination. > > We should probably also check that operands[0] is a REG. In practice > it probably always will be at this point if the source is a SUBREG, > but it doesn't follow automatically, and given that this is supposed > to be a safe backport... > > FWIW, another option would be to make aarch64_preferred_reload_class > return FP_REGS for 128-bit vector modes, if the given class is a > superset of FP_REGS. On the one hand that feels cleaner: there's a > good argument that we should be doing that anyway, since it means > we can reload in Q registers rather than pairs of X registers. > But on the other hand, it feels less safe for backporting. E.g. maybe > we'll find another corner case that fails if a GPR *isn't* chosen. > > Thanks, > Richard > > From f75d209ab3c27fcfea915fe77f37fa4f973f9dd0 Mon Sep 17 00:00:00 2001 > From: Richard Sandiford <richard.sandif...@arm.com> > Date: Mon, 9 Mar 2020 19:42:57 +0000 > Subject: [PATCH 1/2] lra: Tighten check for reloading paradoxical subregs > [PR94052] > > simplify_operand_subreg tries to detect whether the allocation for > a pseudo in a paradoxical subreg is also valid for the outer mode. > The condition it used to check for an invalid combination was: > > else if (REG_P (reg) > && REGNO (reg) >= FIRST_PSEUDO_REGISTER > && (hard_regno = lra_get_regno_hard_regno (REGNO (reg))) >= 0 > && (hard_regno_nregs (hard_regno, innermode) > < hard_regno_nregs (hard_regno, mode)) > && (regclass = lra_get_allocno_class (REGNO (reg))) > && (type != OP_IN > || !in_hard_reg_set_p (reg_class_contents[regclass], > mode, hard_regno) > || overlaps_hard_reg_set_p (lra_no_alloc_regs, > mode, hard_regno))) > > I think there are two problems with this: > > (1) It never actually checks whether the hard register is valid for the > outer mode (in the hard_regno_mode_ok sense). If it isn't, any attempt > to reload in the outer mode is likely to cycle, because the implied > regno/mode combination will be just as invalid next time > curr_insn_transform sees the subreg. > > (2) The check is valid for little-endian only. For big-endian we need > to move hard_regno backwards. > > Using simplify_subreg_regno should avoid both problems. > > As the existing comment says, IRA should always take subreg references > into account when allocating hard registers, so this fix-up should only > really be needed for pseudos allocated by LRA itself. > > 2020-03-10 Richard Sandiford <richard.sandif...@arm.com> > > gcc/ > PR rtl-optimization/94052 > * lra-constraints.c (simplify_operand_subreg): Reload the inner > register of a paradoxical subreg if simplify_subreg_regno fails > to give a valid hard register for the outer mode. > --- > gcc/lra-constraints.c | 24 ++++++++++++++---------- > 1 file changed, 14 insertions(+), 10 deletions(-) > > diff --git a/gcc/lra-constraints.c b/gcc/lra-constraints.c > index f71e0c9ff8d..bf6d4a2fd4b 100644 > --- a/gcc/lra-constraints.c > +++ b/gcc/lra-constraints.c > @@ -1489,7 +1489,7 @@ static bool process_address (int, bool, rtx_insn **, > rtx_insn **); > static bool > simplify_operand_subreg (int nop, machine_mode reg_mode) > { > - int hard_regno; > + int hard_regno, inner_hard_regno; > rtx_insn *before, *after; > machine_mode mode, innermode; > rtx reg, new_reg; > @@ -1735,15 +1735,19 @@ simplify_operand_subreg (int nop, machine_mode > reg_mode) > for the new uses. */ > else if (REG_P (reg) > && REGNO (reg) >= FIRST_PSEUDO_REGISTER > - && (hard_regno = lra_get_regno_hard_regno (REGNO (reg))) >= 0 > - && (hard_regno_nregs (hard_regno, innermode) > - < hard_regno_nregs (hard_regno, mode)) > - && (regclass = lra_get_allocno_class (REGNO (reg))) > - && (type != OP_IN > - || !in_hard_reg_set_p (reg_class_contents[regclass], > - mode, hard_regno) > - || overlaps_hard_reg_set_p (lra_no_alloc_regs, > - mode, hard_regno))) > + && paradoxical_subreg_p (operand) > + && (inner_hard_regno = lra_get_regno_hard_regno (REGNO (reg))) >= 0 > + && ((hard_regno > + = simplify_subreg_regno (inner_hard_regno, innermode, > + SUBREG_BYTE (operand), mode)) < 0 > + || ((hard_regno_nregs (inner_hard_regno, innermode) > + < hard_regno_nregs (hard_regno, mode)) > + && (regclass = lra_get_allocno_class (REGNO (reg))) > + && (type != OP_IN > + || !in_hard_reg_set_p (reg_class_contents[regclass], > + mode, hard_regno) > + || overlaps_hard_reg_set_p (lra_no_alloc_regs, > + mode, hard_regno))))) > { > /* The class will be defined later in curr_insn_transform. */ > enum reg_class rclass > -- > 2.17.1 > > From 19ae0b272c7a191634249cdab90e5d9512b1474b Mon Sep 17 00:00:00 2001 > From: Richard Sandiford <richard.sandif...@arm.com> > Date: Tue, 10 Mar 2020 09:17:17 +0000 > Subject: [PATCH 2/2] expand: Tweak paradoxical subreg handling in > emit_group_store [PR94052] > > emit_group_store normally tries to start the sequence by storing > a paradoxical subreg of one source value into the whole destination. > That seems reasonable when the destination is effectively a single > register, but as explained in the comment, it seems likely to make > things worse then the destination is bigger than that. > > 2020-03-10 Richard Sandiford <richard.sandif...@arm.com> > > gcc/ > PR middle-end/94052 > * expr.c (emit_group_store): Clobber a multi-register destination > before storing to it. > --- > gcc/expr.c | 31 +++++++++++++++++++++++++++++-- > 1 file changed, 29 insertions(+), 2 deletions(-) > > diff --git a/gcc/expr.c b/gcc/expr.c > index b97c217e86d..57529f25769 100644 > --- a/gcc/expr.c > +++ b/gcc/expr.c > @@ -2489,7 +2489,9 @@ emit_group_store (rtx orig_dst, rtx src, tree type > ATTRIBUTE_UNUSED, > for (i = start; i < finish; i++) > { > rtx reg = XEXP (XVECEXP (src, 0, i), 0); > - if (!REG_P (reg) || REGNO (reg) < FIRST_PSEUDO_REGISTER) > + /* Checking for reg == dst (which is very unlikely to be true) ensures > + that we can emit a clobber of dst when dst is a pseudo register. */ > + if (!REG_P (reg) || REGNO (reg) < FIRST_PSEUDO_REGISTER || reg == dst) > { > tmps[i] = gen_reg_rtx (GET_MODE (reg)); > emit_move_insn (tmps[i], reg); > @@ -2530,11 +2532,36 @@ emit_group_store (rtx orig_dst, rtx src, tree type > ATTRIBUTE_UNUSED, > if (!REG_P (dst) || REGNO (dst) < FIRST_PSEUDO_REGISTER) > dst = gen_reg_rtx (outer); > > + /* Clobber the destination if pieces of it can be modified > + independently. This ensures that we can assign to the individual > + pieces of dst without leaving dst upwards exposed. > + > + The alternatives below are less likely to be useful for this case. > + Creating a paradoxical subreg leads to moves like: > + > + (set (reg:TI X) (subreg:TI (reg:DI Y) 0)) > + > + On a 64-bit target, this would normally be split into two DI moves > + after register allocation. We'd then expect one of those DI moves > + to be removed as dead due to later assignments to parts of X. > + However, the instruction must still be reloaded as a TImode move > + and the allocation of Y must be valid for TImode as well as DImode, > + even though we only really want a single DImode move. > + > + If we set dst to zero instead, there's a risk that the > + redundancy will never be eliminated, particularly if the > + destination occupies more than two registers. */ > + if (known_gt (GET_MODE_SIZE (outer), REGMODE_NATURAL_SIZE (outer))) > + { > + emit_clobber (dst); > + done = 1; > + } > + > /* Make life a bit easier for combine. */ > /* If the first element of the vector is the low part > of the destination mode, use a paradoxical subreg to > initialize the destination. */ > - if (start < finish) > + if (!done && start < finish) > { > inner = GET_MODE (tmps[start]); > bytepos = subreg_lowpart_offset (inner, outer); > -- > 2.17.1 > --
diff --git a/gcc/config/aarch64/aarch64-simd.md b/gcc/config/aarch64/aarch64-simd.md index 89aaf8c018e3340dd2d53fc2a6538d3d1220b103..16abe70be0ad90d3befdc5c4dfdb1471f2c98c58 100644 --- a/gcc/config/aarch64/aarch64-simd.md +++ b/gcc/config/aarch64/aarch64-simd.md @@ -5330,6 +5330,26 @@ (define_expand "mov<mode>" if (GET_CODE (operands[0]) != REG) operands[1] = force_reg (<MODE>mode, operands[1]); } + + /* If we have a paradoxical subreg trying to write to <MODE> from and the + registers don't overlap then we need to break it apart. What it's trying + to do is give two kind of information at the same time. It's trying to + convey liveness information by saying that the entire register will be + written to eventually, but it also only wants to write a single part of the + register. Hence the paradoxical subreg. + + Instead of allowing this we will split the two concerns. The liveness + information will be conveyed using a clobber and then we break apart the + paradoxical subreg into just a normal write of the part that it wanted to + write originally. */ + + if (REG_P(operands[0]) && paradoxical_subreg_p (operands[1])) + { + if (!reg_overlap_mentioned_p (operands[0], operands[1])) + emit_clobber (operands[0]); + operands[1] = SUBREG_REG (operands[1]); + operands[0] = gen_lowpart (GET_MODE (operands[1]), operands[0]); + } }) diff --git a/gcc/testsuite/g++.target/aarch64/pr94052.C b/gcc/testsuite/g++.target/aarch64/pr94052.C new file mode 100644 index 0000000000000000000000000000000000000000..d36c9bdc1588533db35eb3cbd2502034edd25452 --- /dev/null +++ b/gcc/testsuite/g++.target/aarch64/pr94052.C @@ -0,0 +1,174 @@ +/* { dg-do compile } */ +/* { dg-additional-options "-O2 -std=gnu++11 -w" } */ + +namespace c { +typedef int d; +template <typename e> struct f { typedef e g; }; +template <bool, typename> struct h; +template <typename e> e aa(typename f<e>::g i) { return i; } +template <typename, typename> struct j {}; +template <d, typename> struct k; +template <class l, class m> struct k<1, j<l, m>> { typedef m g; }; +template <d n, class l, class m> typename k<n, j<l, m>>::g ab(j<l, m>); +} // namespace c +typedef long d; +typedef char o; +typedef int p; +typedef char q; +typedef int r; +namespace { +struct s; +constexpr d t = 6; +template <typename> class ad { +public: + static constexpr d u = t; + d v(); + d x(); + d y(); +}; +class z : ad<int> {}; +struct ae { + p af; +}; +class ag { +public: + ae ah(); +}; +} // namespace +typedef __Int32x4_t ai; +typedef struct { + ai aj[2]; +} ak; +typedef int al; +void am(p *a, ai b) { __builtin_aarch64_st1v4si(a, b); } +namespace an { +class ao { +public: + bool operator==(ao); + d v(); + d x(); +}; +class ap : public ad<r> {}; +class aq { +public: + c::j<int, int> ar(); + int as(); + int at(); +}; +class au { +public: + virtual d av(d); + virtual ap aw(); + virtual ag ax(); +}; +class ay {}; +class az { + virtual void ba(const ay &, const s &); +}; +using bb = az; +class bc; +class bd : bb { + void ba(const ay &, const s &); + bc *be; + bc *bf; + bc *bg; + aq bh; + int bi; + int bj; + ao bk; +}; +namespace bl { +namespace bm { +namespace bn { +class bo; +} +} // namespace bm +} // namespace bl +namespace bn { +template <typename ac = c::h<0, bl::bm ::bn::bo>> +ai bp(ac *, ac *, ac *, al, al, al, d, p); +template <typename ac = c::h<0, bl::bm ::bn::bo>> +ak bq(ac *br, ac *bs, ac *bt, al bu, al bv, al bw, d bx, int, int by) { + ak{bp(br, bs, bt, bu, bv, bw, bx, by), bp(br, bs, bt, bu, bv, bw, bx, by)}; +} +template <typename ac = c::h<0, bl::bm ::bn::bo>> +ak bz(ac *, ac *, ac *, al, al, al &, int, p); +template <int> void ca(p *, const ak &); +template <> void ca<1>(p *buffer, const ak &cb) { + am(buffer, cb.aj[0]); + am(buffer + 4, cb.aj[1]); +} +int cc(int, int); +} // namespace bn +class bc { +public: + virtual au *cd(); +}; +class ce { +public: + q *cf(); +}; +template <d> struct cg { + template <typename ch> static void ci(ay, z cj, ch ck) { ck(cj); } +}; +template <typename ch> void cl(ay w, ch ck) { + z cj; + cg<z::u>::ci(w, cj, c::aa<ch>(ck)); +} +namespace { +template <typename T1, typename cm, int cn> class co { +public: + static void convolve(ay, int cs, bc *cp, bc *cq, bc *cr, aq cw, int, ao ct) { + int by = cp->cd()->ax().ah().af; + int cu = cq->cd()->ax().ah().af; + cp->cd()->aw().v(); + int cv = cp->cd()->aw().x(); + cp->cd()->aw().y(); + cp->cd()->aw(); + int da = cr->cd()->aw().x(); + int cx = cq->cd()->aw().x(); + cq->cd()->aw().y(); + int cy = cr->cd()->av(0); + int cz = cr->cd()->av(1); + bn::cc(cs, cn); + int de = c::ab<1>(cw.ar()); + cw.as(); + cw.at(); + ay db; + ce dc; + ce dd; + ce w; + q *di = w.cf(); + cl(db, [&](z) { + int df; + dc; + di; + cx; + auto dg(cu); + auto dh(cu); + auto dl(cu); + for (; cz; df += de) { + auto br = reinterpret_cast<T1 *>(cv); + auto bs = reinterpret_cast<T1 *>(cv); + auto bt = reinterpret_cast<T1 *>(df * ct.x()); + auto dj = reinterpret_cast<cm *>(dd.cf() + da); + for (int dk; dk < cy; dk += cs, dj += cs) + if (ct == ao()) { + auto vres = bn::bz(br, bs, bt, dg, dh, dl, cn, by); + bn::ca<cn>(dj, vres); + } else + bn::bq(br, bs, bt, dg, dh, dl, ct.v(), cn, by); + } + }); + } +}; +template <typename T1, typename cm> +void bz(ay dm, int cs, bc *cp, bc *cq, bc *cr, aq cw, int dn, ao ct) { + co<T1, cm, 1>::convolve(dm, cs, cp, cq, cr, cw, dn, ct); + co<T1, cm, 2>::convolve(dm, cs, cp, cq, cr, cw, dn, ct); +} +} // namespace +void bd::ba(const ay &dm, const s &) { + bz<o, p>(dm, bi, be, bg, bf, bh, bj, bk); +} +} // namespace an