Hi,

> -----Original Message-----
> From: Richard Sandiford [mailto:richard.sandif...@arm.com]
> Sent: Thursday, May 28, 2020 12:07 AM
> To: Yangfei (Felix) <felix.y...@huawei.com>
> Cc: gcc-patches@gcc.gnu.org
> Subject: Re: [PATCH PR95254] aarch64: gcc generate inefficient code with
> fixed sve vector length
>
Snip...

> Ah, OK.  But in that case, shouldn't we allow the change if the original
> unaligned MEM was also “slow”?
>
> I guess there might be cases in which both modes are slow enough for the
> hook to return true for them, but one is worse than the other.
> But I don't think there's much we can do about that as things stand:
> changing the mode might move from a slow mode to a slower mode, but it
> might move in the other direction too.

Good point.

> > +2020-05-27  Felix Yang  <felix.y...@huawei.com>
> > +	     Richard Sandiford  <richard.sandif...@arm.com>
>
> I appreciate the gesture, but I don't think it's appropriate to list me
> as an author.  I haven't written any of the code, I've just reviewed
> it. :-)

OK.

> > diff --git a/gcc/expr.c b/gcc/expr.c
> > index dfbeae71518..3035791c764 100644
> > --- a/gcc/expr.c
> > +++ b/gcc/expr.c
> > @@ -3814,6 +3814,69 @@ emit_move_insn (rtx x, rtx y)
> >    gcc_assert (mode != BLKmode
> > 	      && (GET_MODE (y) == mode || GET_MODE (y) == VOIDmode));
> >
> > +  /* If we have a copy which looks like one of the following patterns:
>
> s/which/that/ (....I think)

OK.

> > +       (set (subreg:M1 (reg:M2 ...)) (subreg:M1 (reg:M2 ...)))
> > +       (set (subreg:M1 (reg:M2 ...)) (mem:M1 ADDR))
> > +       (set (mem:M1 ADDR) (subreg:M1 (reg:M2 ...)))
> > +       (set (subreg:M1 (reg:M2 ...)) (constant C))
> > +     where mode M1 is equal in size to M2 and target hook
> > +     can_change_mode_class (M1, M2, ALL_REGS) returns false, try to
> > +     remove the subreg.  This avoids an implicit round trip through
> > +     memory.  */
>
> How about:
>
>        where mode M1 is equal in size to M2, try to detect whether the
>        mode change involves an implicit round trip through memory.
>        If so, see if we can avoid that by removing the subregs and
>        doing the move in mode M2 instead.  */
>
> > +  else if (x_inner != NULL_RTX
> > +	   && MEM_P (y)
> > +	   && ! targetm.can_change_mode_class (GET_MODE (x_inner),
> > +					       mode, ALL_REGS)
> > +	   /* Stop if the inner mode requires too much alignment.  */
> > +	   && (! targetm.slow_unaligned_access (GET_MODE (x_inner),
> > +						MEM_ALIGN (y))
> > +	       || MEM_ALIGN (y) >= GET_MODE_ALIGNMENT (GET_MODE (x_inner))))
>
> It's better to check the alignment first, since it's cheaper.
> So taking the comment above into account, I think this ends up as:
>
> 	   && (MEM_ALIGN (y) >= GET_MODE_ALIGNMENT (GET_MODE (x_inner))
> 	       || targetm.slow_unaligned_access (mode, MEM_ALIGN (y))
> 	       || !targetm.slow_unaligned_access (GET_MODE (x_inner),
> 						  MEM_ALIGN (y)))
>
> (Note: no space after "!", although the sources aren't as consistent
> about that as they could be.)

OK.

> TBH I think it would be good to avoid duplicating such a complicated
> condition in both directions, so at the risk of getting flamed, how
> about using a lambda?
>
>   auto candidate_mem_p = [&](machine_mode inner_mode, rtx mem) {
>     return ...;
>   };
>
> with ... containing everything after the MEM_P check?

Yes, this avoids duplicating the code.

> Looks good otherwise, thanks,

Thanks for reviewing this.  Attached please find the v5 patch.
Note: we also need to update the local variable "mode" once we match one
of these cases.  I see test failures without this change.

Bootstrapped and tested on aarch64-linux-gnu.  Also bootstrapped on
x86_64-linux-gnu.  Regression testing shows two failed tests on that
platform:

  1) FAIL: gcc.target/i386/avx512f-vcvtps2ph-2.c (test for excess errors)
  2) FAIL: gcc.target/i386/pr67609.c scan-assembler movdqa

I adjusted the second one in the v4 patch, but the first one looks
strange to me.
I see GCC emits invalid x86 vcvtps2ph instructions which look like:

  vcvtps2ph	$0, %zmm0, -112(%rbp){%k1}
  vcvtps2ph	$0, %zmm0, -80(%rbp){%k1}{z}

This happens in the combine phase, where the combined insn looks like:

  Trying 31 -> 33:
     31: r106:V16HI=vec_merge(unspec[r103:V16SF,0] 133,[frame:DI-0x60],r109:HI)
     33: [frame:DI-0x60]=r106:V16HI
	REG_DEAD r106:V16HI
  Successfully matched this instruction:
  (set (mem/j/c:V16HI (plus:DI (reg/f:DI 19 frame)
	      (const_int -96 [0xffffffffffffffa0])) [2 res2.x+0 S32 A256])
      (vec_merge:V16HI (unspec:V16HI [
		  (reg:V16SF 103)
		  (const_int 0 [0])
	      ] UNSPEC_VCVTPS2PH)
	  (mem/j/c:V16HI (plus:DI (reg/f:DI 19 frame)
		  (const_int -96 [0xffffffffffffffa0])) [2 res2.x+0 S32 A256])
	  (reg:HI 109)))
  allowing combination of insns 31 and 33
  original costs 16 + 4 = 20
  replacement cost 16
  deferring deletion of insn with uid = 31.
  modifying insn i3    33: [frame:DI-0x60]=vec_merge(unspec[r103:V16SF,0] 133,[frame:DI-0x60],r109:HI)
  deferring rescan insn with uid = 33.
And this can be matched by the pattern avx512f_vcvtps2ph512_mask:

  (insn 33 31 37 4 (set (mem/j/c:V16HI (plus:DI (reg/f:DI 19 frame)
		  (const_int -96 [0xffffffffffffffa0])) [2 res2.x+0 S32 A256])
	  (vec_merge:V16HI (unspec:V16HI [
		      (reg:V16SF 103)
		      (const_int 0 [0])
		  ] UNSPEC_VCVTPS2PH)
	      (mem/j/c:V16HI (plus:DI (reg/f:DI 19 frame)
		      (const_int -96 [0xffffffffffffffa0])) [2 res2.x+0 S32 A256])
	      (reg:HI 109))) "avx512f-vcvtps2ph-2.c":80:10 5324 {avx512f_vcvtps2ph512_mask}
       (nil))

gcc/config/i386/sse.md:

  (define_insn "<mask_codefor>avx512f_vcvtps2ph512<mask_name>"
    [(set (match_operand:V16HI 0 "nonimmediate_operand" "=vm")
	  (unspec:V16HI
	    [(match_operand:V16SF 1 "register_operand" "v")
	     (match_operand:SI 2 "const_0_to_255_operand" "N")]
	    UNSPEC_VCVTPS2PH))]
    "TARGET_AVX512F"
    "vcvtps2ph\t{%2, %1, %0<mask_operand3>|%0<mask_operand3>, %1, %2}"
    [(set_attr "type" "ssecvt")
     (set_attr "prefix" "evex")
     (set_attr "mode" "V16SF")])

How can that happen?

Felix
pr95254-v4.diff