Hi,

> -----Original Message-----
> From: Richard Sandiford [mailto:richard.sandif...@arm.com]
> Sent: Thursday, May 28, 2020 12:07 AM
> To: Yangfei (Felix) <felix.y...@huawei.com>
> Cc: gcc-patches@gcc.gnu.org
> Subject: Re: [PATCH PR95254] aarch64: gcc generate inefficient code with
> fixed sve vector length
>
Snip...

> Ah, OK.  But in that case, shouldn't we allow the change if the original
> unaligned MEM was also “slow”?
>
> I guess there might be cases in which both modes are slow enough for the
> hook to return true for them, but one is worse than the other.
> But I don't think there's much we can do about that as things stand:
> changing the mode might move from a slow mode to a slower mode, but it
> might move in the other direction too.

Good point.

> > +2020-05-27  Felix Yang  <felix.y...@huawei.com>
> > +	     Richard Sandiford  <richard.sandif...@arm.com>
>
> I appreciate the gesture, but I don't think it's appropriate to list me
> as an author.  I haven't written any of the code, I've just reviewed
> it. :-)

OK.

> > diff --git a/gcc/expr.c b/gcc/expr.c
> > index dfbeae71518..3035791c764 100644
> > --- a/gcc/expr.c
> > +++ b/gcc/expr.c
> > @@ -3814,6 +3814,69 @@ emit_move_insn (rtx x, rtx y)
> >    gcc_assert (mode != BLKmode
> > 	      && (GET_MODE (y) == mode || GET_MODE (y) == VOIDmode));
> >
> > +  /* If we have a copy which looks like one of the following patterns:
>
> s/which/that/ (....I think)

OK.

> > +       (set (subreg:M1 (reg:M2 ...)) (subreg:M1 (reg:M2 ...)))
> > +       (set (subreg:M1 (reg:M2 ...)) (mem:M1 ADDR))
> > +       (set (mem:M1 ADDR) (subreg:M1 (reg:M2 ...)))
> > +       (set (subreg:M1 (reg:M2 ...)) (constant C))
> > +     where mode M1 is equal in size to M2 and target hook
> > +     can_change_mode_class (M1, M2, ALL_REGS) returns false, try to
> > +     remove the subreg.  This avoids an implicit round trip through
> > +     memory.  */
>
> How about:
>
>        where mode M1 is equal in size to M2, try to detect whether the
>        mode change involves an implicit round trip through memory.
>        If so, see if we can avoid that by removing the subregs and
>        doing the move in mode M2 instead.  */
>
> > +  else if (x_inner != NULL_RTX
> > +	   && MEM_P (y)
> > +	   && ! targetm.can_change_mode_class (GET_MODE (x_inner),
> > +					       mode, ALL_REGS)
> > +	   /* Stop if the inner mode requires too much alignment.  */
> > +	   && (! targetm.slow_unaligned_access (GET_MODE (x_inner),
> > +						MEM_ALIGN (y))
> > +	       || MEM_ALIGN (y) >= GET_MODE_ALIGNMENT (GET_MODE (x_inner))))
>
> It's better to check the alignment first, since it's cheaper.
> So taking the comment above into account, I think this ends up as:
>
> 	   && (MEM_ALIGN (y) >= GET_MODE_ALIGNMENT (GET_MODE (x_inner))
> 	       || targetm.slow_unaligned_access (mode, MEM_ALIGN (y))
> 	       || !targetm.slow_unaligned_access (GET_MODE (x_inner),
> 						  MEM_ALIGN (y)))
>
> (Note: no space after "!", although the sources aren't as consistent
> about that as they could be.)

OK.

> TBH I think it would be good to avoid duplicating such a complicated
> condition in both directions, so at the risk of getting flamed, how
> about using a lambda?
>
>   auto candidate_mem_p = [&](machine_mode inner_mode, rtx mem) {
>     return ...;
>   };
>
> with ... containing everything after the MEM_P check?

Yes, this avoids duplicating the code.

> Looks good otherwise, thanks,

Thanks for reviewing this.  Attached please find the v5 patch.
Note: we also need to update the local variable "mode" once we match one
of these cases.  I see test failures without this change.

Bootstrapped and tested on aarch64-linux-gnu.  Also bootstrapped on
x86_64-linux-gnu.  Regression testing shows two failed tests on that
platform:

  1) FAIL: gcc.target/i386/avx512f-vcvtps2ph-2.c (test for excess errors)
  2) FAIL: gcc.target/i386/pr67609.c scan-assembler movdqa

I adjusted the second one in the v4 patch, but the first one looks
strange to me.
I see GCC emits invalid x86 vcvtps2ph instructions which look like:

  vcvtps2ph	$0, %zmm0, -112(%rbp){%k1}
  vcvtps2ph	$0, %zmm0, -80(%rbp){%k1}{z}

This happens in the combine phase, where the combined insn looks like:

  Trying 31 -> 33:
     31: r106:V16HI=vec_merge(unspec[r103:V16SF,0] 133,[frame:DI-0x60],r109:HI)
     33: [frame:DI-0x60]=r106:V16HI
	REG_DEAD r106:V16HI
  Successfully matched this instruction:
  (set (mem/j/c:V16HI (plus:DI (reg/f:DI 19 frame)
	      (const_int -96 [0xffffffffffffffa0])) [2 res2.x+0 S32 A256])
      (vec_merge:V16HI (unspec:V16HI [
		  (reg:V16SF 103)
		  (const_int 0 [0])
	      ] UNSPEC_VCVTPS2PH)
	  (mem/j/c:V16HI (plus:DI (reg/f:DI 19 frame)
		  (const_int -96 [0xffffffffffffffa0])) [2 res2.x+0 S32 A256])
	  (reg:HI 109)))
  allowing combination of insns 31 and 33
  original costs 16 + 4 = 20
  replacement cost 16
  deferring deletion of insn with uid = 31.
  modifying insn i3    33: [frame:DI-0x60]=vec_merge(unspec[r103:V16SF,0] 133,[frame:DI-0x60],r109:HI)
  deferring rescan insn with uid = 33.
And this can be matched by the pattern avx512f_vcvtps2ph512_mask:

  (insn 33 31 37 4 (set (mem/j/c:V16HI (plus:DI (reg/f:DI 19 frame)
		  (const_int -96 [0xffffffffffffffa0])) [2 res2.x+0 S32 A256])
	  (vec_merge:V16HI (unspec:V16HI [
		      (reg:V16SF 103)
		      (const_int 0 [0])
		  ] UNSPEC_VCVTPS2PH)
	      (mem/j/c:V16HI (plus:DI (reg/f:DI 19 frame)
		      (const_int -96 [0xffffffffffffffa0])) [2 res2.x+0 S32 A256])
	      (reg:HI 109))) "avx512f-vcvtps2ph-2.c":80:10 5324 {avx512f_vcvtps2ph512_mask}
       (nil))

gcc/config/i386/sse.md:

  (define_insn "<mask_codefor>avx512f_vcvtps2ph512<mask_name>"
    [(set (match_operand:V16HI 0 "nonimmediate_operand" "=vm")
	  (unspec:V16HI
	    [(match_operand:V16SF 1 "register_operand" "v")
	     (match_operand:SI 2 "const_0_to_255_operand" "N")]
	    UNSPEC_VCVTPS2PH))]
    "TARGET_AVX512F"
    "vcvtps2ph\t{%2, %1, %0<mask_operand3>|%0<mask_operand3>, %1, %2}"
    [(set_attr "type" "ssecvt")
     (set_attr "prefix" "evex")
     (set_attr "mode" "V16SF")])

How can that happen?

Felix
pr95254-v4.diff