> -----Original Message-----
> From: Richard Sandiford <richard.sandif...@arm.com>
> Sent: Wednesday, May 15, 2024 10:31 PM
> To: Tamar Christina <tamar.christ...@arm.com>
> Cc: Richard Biener <richard.guent...@gmail.com>; gcc-patches@gcc.gnu.org; nd
> <n...@arm.com>; Richard Earnshaw <richard.earns...@arm.com>; Marcus
> Shawcroft <marcus.shawcr...@arm.com>; ktkac...@gcc.gnu.org
> Subject: Re: [PATCH 0/4]AArch64: support conditional early clobbers on certain
> operations.
> 
> Tamar Christina <tamar.christ...@arm.com> writes:
> >> >> On Wed, May 15, 2024 at 12:29 PM Tamar Christina
> >> >> <tamar.christ...@arm.com> wrote:
> >> >> >
> >> >> > Hi All,
> >> >> >
> >> >> > Some Neoverse Software Optimization Guides (SWoG) have a clause that states
> >> >> > that for predicated operations that also produce a predicate it is preferred
> >> >> > that the codegen should use a different register for the destination than that
> >> >> > of the input predicate in order to avoid a performance overhead.
> >> >> >
> >> >> > This of course has the problem that it increases register pressure and so should
> >> >> > be done with care.  Additionally not all micro-architectures have this
> >> >> > consideration and so it shouldn't be done as a default thing.
> >> >> >
> >> >> > The patch series adds support for doing conditional early clobbers through a
> >> >> > combination of new alternatives and attributes to control their availability.
> >> >>
> >> >> You could have two alternatives, one with early clobber and one with
> >> >> a matching constraint where you'd disparage the matching constraint one?
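
[For readers following along: a minimal sketch of the shape being suggested, using a hypothetical predicate-logic pattern -- the mnemonic and constraints here are illustrative only, not taken from the patch:

     ;; Alternative 1: earlyclobber'd output (&Upa), so the destination must
     ;; differ from the input predicate -- the preferred form.
     ;; Alternative 2: "0" ties operand 2 to the output, and the leading "?"
     ;; on the output disparages this alternative so the RA only falls back
     ;; to the tied form when register pressure leaves no better choice.
     {@ [ cons: =0 , 1   , 2   ]
        [ &Upa     , Upl , Upa ] not\t%0.b, %1/z, %2.b
        [ ?Upa     , Upl , 0   ] ^
     }

This mirrors the gather idiom discussed below.]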
> >> >>
> >> >
> >> > Yeah, that's what I do, though there's no need to disparage the non-early clobber
> >> > alternative as the early clobber alternative will naturally get a penalty if it
> >> > needs a reload.
> >>
> >> But I think Richard's suggestion was to disparage the one with a matching
> >> constraint (not the earlyclobber), to reflect the increased cost of
> >> reusing the register.
> >>
> >> We did take that approach for gathers, e.g.:
> >>
> >>      [&w, Z,   w, Ui1, Ui1, Upl] ld1<Vesize>\t%0.s, %5/z, [%2.s]
> >>      [?w, Z,   0, Ui1, Ui1, Upl] ^
> >>
> >> The (supposed) advantage is that, if register pressure is so tight
> >> that using matching registers is the only alternative, we still
> >> have the opportunity to do that, as a last resort.
> >>
> >> Providing only an earlyclobber version means that using the same
> >> register is prohibited outright.  If no other register is free, the RA
> >> would need to spill something else to free up a temporary register.
> >> And it might then do the equivalent of (pseudo-code):
> >>
> >>       not p1.b, ..., p0.b
> >>       mov p0.d, p1.d
> >>
> >> after spilling what would otherwise have occupied p1.  In that
> >> situation it would be better to use:
> >>
> >>       not p0.b, ..., p0.b
> >>
> >> and not introduce the spill of p1.
> >
> > I think I understood what Richi meant, but I thought it was already working that way.
> 
> The suggestion was to use matching constraints (like "0") though,
> whereas the patch doesn't.  I think your argument is that you don't
> need to use matching constraints.  But that's different from the
> suggestion (and from how we handle gathers).
> 
> I was going to say in response to patch 3 (but got distracted, sorry):
> I don't think we should have:
> 
>    &Upa, Upa, ...
>    Upa, Upa, ...
> 
> (taken from the pure logic ops) enabled at the same time.  Even though
> it works for the testcases, I don't think it has well-defined semantics.
> 
> The problem is that, taken on its own, the second alternative says that
> matching operands are free.  And fundamentally, I don't think the costs
> *must* take the earlyclobber alternative over the non-earlyclobber one
> (when costing during IRA, for instance).  In principle, the cheapest
> is best.
> 
> The aim of the gather approach is to make each alternative correct in
> isolation.  In:
> 
>       [&w, Z,   w, Ui1, Ui1, Upl] ld1<Vesize>\t%0.s, %5/z, [%2.s]
>       [?w, Z,   0, Ui1, Ui1, Upl] ^
> 
> the second alternative says that it is possible to have operands 0
> and 2 be the same vector register, but using that version has the
> cost of an extra reload.  In that sense the alternatives are
> (essentially) consistent about the restriction.
> 
> > i.e. as one of the testcases I had:
> >
> >> aarch64-none-elf-gcc -O3 -g0 -S -o - pred-clobber.c -mcpu=neoverse-n2 -ffixed-p[1-15]
> >
> > foo:
> >         mov     z31.h, w0
> >         ptrue   p0.b, all
> >         cmplo   p0.h, p0/z, z0.h, z31.h
> >         b       use
> >
> > and reload did not force a spill.
> >
> > My understanding of how this works, and how it seems to be working, is that
> > since reload costs alternatives from front to back, the cheapest one wins and
> > it stops evaluating the rest.
> >
> > The early clobber case is first and preferred; however, when it's not possible,
> > i.e. it requires a non-pseudo reload, the reload cost is added to the alternative.
> >
> > However you're right that in the following testcase:
> >
> > -mcpu=neoverse-n2 -ffixed-p1 -ffixed-p2 -ffixed-p3 -ffixed-p4 -ffixed-p5 -ffixed-p6 -ffixed-p7 -ffixed-p8 -ffixed-p9 -ffixed-p10 -ffixed-p11 -ffixed-p12 -ffixed-p13 -ffixed-p14 -fdump-rtl-reload
> >
> > i.e. giving it an extra free register inexplicably causes a spill:
> >
> > foo:
> >         addvl   sp, sp, #-1
> >         mov     z31.h, w0
> >         ptrue   p0.b, all
> >         str     p15, [sp]
> >         cmplo   p15.h, p0/z, z0.h, z31.h
> >         mov     p0.b, p15.b
> >         ldr     p15, [sp]
> >         addvl   sp, sp, #1
> >         b       use
> >
> > so that's unexpected and is very weird, as p15 has no defined value.
> 
> This is because the function implicitly uses the SVE PCS, and so needs
> to preserve p15 for the caller.  It looks like the correct behaviour.

Sure, but p15 isn't live after the call.
It is somewhat of a regression in that if it had chosen the tie version
then p15 wouldn't need preserving.

It's a bit of an artificial case I guess but are we ok with this regression?
Or is there a way to query df to see if a value is live after the call?

I can only see ways to tell if the register is live before the call.
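
For what it's worth, a rough sketch of how I'd expect a liveness query to
look via the df machinery -- the interfaces here are from memory of df.h,
so treat the exact calls (and whether df info is current at this point in
reload/LRA) as assumptions to verify:

     /* Sketch: true if hard register REGNO is live on exit from BB.
        Assumes df liveness info is up to date (i.e. after df_analyze).  */
     static bool
     hard_reg_live_out_p (basic_block bb, unsigned int regno)
     {
       bitmap live_out = df_get_live_out (bb);
       return live_out && bitmap_bit_p (live_out, regno);
     }

But as noted, that answers liveness at block exit, not specifically
"after the call".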

Thanks,
Tamar

> 
> > Now adding the ? as suggested to the non-early clobber alternative does not fix
> > it, and my mental model for how this is supposed to work does not quite line up.
> > Why would making the non-clobber alternative even more expensive help it
> > during high register pressure?
> 
> Hopefully the above answers this.  The non-clobber alternative has
> zero extra cost as things stand.  The costs from one alternative
> (the earlyclobber one) don't carry forward to other alternatives.
> 
> > But with that suggestion the above case does not get fixed
> > and the following case
> >
> > -mcpu=neoverse-n2 -ffixed-p1 -ffixed-p2 -ffixed-p3 -ffixed-p4 -ffixed-p5 -ffixed-p6 -ffixed-p7 -ffixed-p8 -ffixed-p9 -ffixed-p10 -ffixed-p11 -ffixed-p12 -ffixed-p13 -ffixed-p14 -ffixed-p15 -fdump-rtl-reload
> >
> > ICEs:
> >
> > pred-clobber.c: In function 'foo':
> > pred-clobber.c:9:1: error: unable to find a register to spill
> >     9 | }
> >       | ^
> > pred-clobber.c:9:1: error: this is the insn:
> > (insn 10 22 19 2 (parallel [
> >             (set (reg:VNx8BI 110 [104])
> >                 (unspec:VNx8BI [
> >                         (reg:VNx8BI 112)
> >                         (const_int 1 [0x1])
> >                         (ltu:VNx8BI (reg:VNx8HI 32 v0)
> >                             (reg:VNx8HI 63 v31))
> >                     ] UNSPEC_PRED_Z))
> >             (clobber (reg:CC_NZC 66 cc))
> >         ]) "pred-clobber.c":7:19 8687 {aarch64_pred_cmplovnx8hi}
> >      (expr_list:REG_DEAD (reg:VNx8BI 112)
> >         (expr_list:REG_DEAD (reg:VNx8HI 63 v31)
> >             (expr_list:REG_DEAD (reg:VNx8HI 32 v0)
> >                 (expr_list:REG_UNUSED (reg:CC_NZC 66 cc)
> >                     (nil))))))
> > during RTL pass: reload
> > dump file: pred-clobber.c.315r.reload
> 
> Which pattern did you use?
> 
> > and this is because the use of ? has the unintended side-effect of blocking a
> > register class entirely during sched1, as we've recently discovered.
> > i.e. see https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114766
> 
> (Is sched1 the problem here, or is it purely an RA thing?  What happens
> when scheduling is disabled?)
> 
> > in this case it marked the alternative as NO_REGS during sched1 and so it's
> > completely dead.  The use of the ? alternatives has caused quite a bit of code
> > bloat, as we've recently discovered, because of this unexpected and
> > undocumented behavior.
> >
> > To me,
> >
> > diff --git a/gcc/config/aarch64/aarch64-sve.md b/gcc/config/aarch64/aarch64-sve.md
> > index 93ec59e58af..2ee3d8ea35e 100644
> > --- a/gcc/config/aarch64/aarch64-sve.md
> > +++ b/gcc/config/aarch64/aarch64-sve.md
> > @@ -8120,10 +8120,10 @@ (define_insn "@aarch64_pred_cmp<cmp_op><mode>"
> >     (clobber (reg:CC_NZC CC_REGNUM))]
> >    "TARGET_SVE"
> >    {@ [ cons: =0 , 1   , 3 , 4            ; attrs: pred_clobber ]
> > -     [ &Upa     , Upl , w , <sve_imm_con>; yes                 ] cmp<cmp_op>\t%0.<Vetype>, %1/z, %3.<Vetype>, #%4
> > -     [ Upa      , Upl , w , <sve_imm_con>; *                   ] ^
> > -     [ &Upa     , Upl , w , w            ; yes                 ] cmp<cmp_op>\t%0.<Vetype>, %1/z, %3.<Vetype>, %4.<Vetype>
> > -     [ Upa      , Upl , w , w            ; *                   ] ^
> > +     [ ^&Upa    , Upl , w , <sve_imm_con>; yes                 ] cmp<cmp_op>\t%0.<Vetype>, %1/z, %3.<Vetype>, #%4
> > +     [  Upa     , Upl , w , <sve_imm_con>; *                   ] ^
> > +     [ ^&Upa    , Upl , w , w            ; yes                 ] cmp<cmp_op>\t%0.<Vetype>, %1/z, %3.<Vetype>, %4.<Vetype>
> > +     [  Upa     , Upl , w , w            ; *                   ] ^
> >    }
> >  )
> >
> > Would have been the right approach, i.e. we prefer the alternative unless a reload
> > is needed, which should work, no?  Well, if ^ weren't broken the same way
> > as ?.  Perhaps I need to use Wilco's new alternative that doesn't block a
> > register class?
> 
> Hmm, I'm not sure.  It seems odd to mark only the output with ^, since
> reloading the output isn't fundamentally different (costwise) from
> reloading the input.
> 
> But to me, it's the alternative without the earlyclobber that should be
> disparaged, since it's the inherently expensive one.
> 
> The gather-like approach would be something like:
> 
>      [ &Upa     , Upl , w , <sve_imm_con>; yes                 ] cmp<cmp_op>\t%0.<Vetype>, %1/z, %3.<Vetype>, #%4
>      [ ?Upl     , 0   , w , <sve_imm_con>; yes                 ] ^
>      [ Upa      , Upl , w , <sve_imm_con>; no                  ] ^
>      [ &Upa     , Upl , w , w            ; yes                 ] cmp<cmp_op>\t%0.<Vetype>, %1/z, %3.<Vetype>, %4.<Vetype>
>      [ ?Upl     , 0   , w , w            ; yes                 ] ^
>      [ Upa      , Upl , w , w            ; no                  ] ^
> 
> with:
> 
>   (define_attr "pred_clobber" "any,no,yes" (const_string "any"))
> 
> Thanks,
> Richard
