[Bug rtl-optimization/113597] [14 Regression] aarch64: Significant code quality regression since r14-8346-ga98d5130a6dcff
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113597 Jeffrey A. Law changed: What|Removed |Added Priority|P3 |P2 CC||law at gcc dot gnu.org
[Bug rtl-optimization/113597] [14 Regression] aarch64: Significant code quality regression since r14-8346-ga98d5130a6dcff
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113597 Richard Biener changed: What|Removed |Added Attachment #57214|0 |1 is obsolete|| --- Comment #13 from Richard Biener --- Created attachment 57252 --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=57252=edit prototype fix Note when I extended the patch to also cover a PARM_DECL base to extent coverage I see FAIL: gcc.dg/torture/pr70421.c -O1 execution test FAIL: gcc.dg/torture/pr70421.c -O2 execution test FAIL: gcc.dg/torture/pr70421.c -O3 -g execution test FAIL: gcc.dg/torture/pr70421.c -Os execution test FAIL: gcc.dg/torture/pr70421.c -O2 -flto -fno-use-linker-plugin -flto-partitio n=none execution test FAIL: gcc.dg/torture/pr70421.c -O2 -flto -fuse-linker-plugin -fno-fat-lto-obje cts execution test on x86_64. It seems that arg_base_value isn't the correct thing to use but it eventually should have been unique_base_value (UNIQUE_BASE_VALUE_ARGP)? I'm not sure whether all the different unique base values mean we'll not be able to derive exactly those classes from MEM_EXPRs.
[Bug rtl-optimization/113597] [14 Regression] aarch64: Significant code quality regression since r14-8346-ga98d5130a6dcff
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113597 --- Comment #12 from Richard Biener --- Created attachment 57214 --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=57214=edit prototype fix The attached prototype fixes the testcase for me.
[Bug rtl-optimization/113597] [14 Regression] aarch64: Significant code quality regression since r14-8346-ga98d5130a6dcff
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113597 --- Comment #11 from Richard Biener --- In DSE the only differences is fbt (0x751a1a50: (plus:DI (reg/v/f:DI 117 [ u ]) -(reg:DI 146 [ _44 ]))) == (address 0) +(reg:DI 146 [ _44 ]))) == (nil) fbt (0x7700b3c0: (reg/f:DI 64 sfp)) == (address:DI -3) -bac false +bac true that's for (mem:BLK (reg/f:DI 64 sfp) [0 A8]) vs (mem:V4SF (plus:DI (reg/v/f:DI 117 [ u ]) (reg:DI 146 [ _44 ])) [0 MEM <__Float32x4_t> [(float * {ref-all})_42]+0 S16 A32]) from #0 0x02ff3796 in scan_reads (insn_info=0x5e5b680, gen=0x5ec2338, kill=0x5ec2358) at /space/rguenther/src/gcc/gcc/dse.cc:3156 #1 0x02ff39b1 in dse_step3_scan (bb=) at /space/rguenther/src/gcc/gcc/dse.cc:3238 processing (insn 62 61 64 5 (set (reg:V4SF 147 [ MEM <__Float32x4_t> [(float * {ref-all})_42] ]) (mem:V4SF (plus:DI (reg/v/f:DI 117 [ u ]) (reg:DI 146 [ _44 ])) [0 MEM <__Float32x4_t> [(float * {ref-all})_42]+0 S16 A32])) "include/arm_neon.h":12531:36 1274 {*aarch64_simd_movv4sf} (expr_list:REG_DEAD (reg:DI 146 [ _44 ]) (nil))) in this case we have _44 point to NONLOCAL only. It got arg_base_value as base value (from the MEM_EXPR and that points-to set we could eventually derive this very same base term as well). But I'll note that (mem:BLK (reg/f:DI 64 sfp) [0 A8]) is artificial, generated by DSE get_group_info via record_store on (insn 13 12 14 2 (set (mem/c:V2x16QI (reg/f:DI 119) [0 +0 S32 A128]) (unspec:V2x16QI [ (reg:V16QI 121) repeated x2 ] UNSPEC_STP)) "t.cc":12:10 discrim 1 92 {*store_pair_16} (nil)) which is figured to be const_or_frame_p () based. That notably lacks a MEM_EXPR (though the bare MEM means only base_alias_check would ever be able to disambiguate here).
[Bug rtl-optimization/113597] [14 Regression] aarch64: Significant code quality regression since r14-8346-ga98d5130a6dcff
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113597 --- Comment #10 from Richard Biener --- Created attachment 57212 --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=57212=edit patch for debugging Btw, I've used the attached to investigate other issues with the change. It will show the outcome of base_alias_check and find_base_term in dumps. One issue is that we're much more dependent on MEM_EXPRs being present. Before figuring there wouldn't be much important regressions the idea was to instead of doing find_base_term have a known base value recorded in the MEM_ATTRs, and as the only important ones should be the special ones for argument frame and stack-based represent that by an enum (rather than the other possibility of using ADDRESS). I'll also note that for spill slots we get around to use spill_slot_decl and set_mem_attrs_for_spill. I've not yet convinced myself that the other special bases we have really form a completely separate memory class. But if they do then accesses should do something similar there (but mind scheduling of frame related instructions ...). Argument stack slots are one important class, set up by init_alias_analysis. But those are also backed by regular decls at times (but not always)? assign_stack_temp "allocated" memory is another class, we're reusing slots during RTL expansion and they get (even if shared) a specific alias set. I don't think we ever release those temps and say re-use the space for spilling so assigning a different decl to each slot should eventually work.
[Bug rtl-optimization/113597] [14 Regression] aarch64: Significant code quality regression since r14-8346-ga98d5130a6dcff
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113597 --- Comment #9 from Alex Coplan --- (In reply to Andrew Pinski from comment #8) > (In reply to Alex Coplan from comment #7) > > I expect the store pairs come from memcpy lowering/expansion in the aarch64 > > backend, that is the only way we get store pairs so early in the RTL > > pipeline IIRC. > > In this case, memset is more likely. Right, yeah. I was using "memcpy lowering" to refer to all the mem{cpy,set,move} expansion we have in the backend. > > Either: > for (int i = 0; i < j; i++) > m[i] = vdupq_n_f32(0.F); > Or > for (int i = 0; i < l; i++) > n[i] = vdupq_n_f32(0.F);
[Bug rtl-optimization/113597] [14 Regression] aarch64: Significant code quality regression since r14-8346-ga98d5130a6dcff
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113597 --- Comment #8 from Andrew Pinski --- (In reply to Alex Coplan from comment #7) > I expect the store pairs come from memcpy lowering/expansion in the aarch64 > backend, that is the only way we get store pairs so early in the RTL > pipeline IIRC. In this case, memset is more likely. Either: for (int i = 0; i < j; i++) m[i] = vdupq_n_f32(0.F); Or for (int i = 0; i < l; i++) n[i] = vdupq_n_f32(0.F);
[Bug rtl-optimization/113597] [14 Regression] aarch64: Significant code quality regression since r14-8346-ga98d5130a6dcff
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113597 --- Comment #7 from Alex Coplan --- I expect the store pairs come from memcpy lowering/expansion in the aarch64 backend, that is the only way we get store pairs so early in the RTL pipeline IIRC.
[Bug rtl-optimization/113597] [14 Regression] aarch64: Significant code quality regression since r14-8346-ga98d5130a6dcff
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113597 --- Comment #6 from Alex Coplan --- Looking at the dump files, the first difference seems to be in 292r.dse1: 8: NOTE_INSN_BASIC_BLOCK 2 2: r116:SI=zero_extend(x0:HI) REG_DEAD x0:HI @@ -178,7 +161,26 @@ 5: NOTE_INSN_FUNCTION_BEG 10: r119:DI=sfp:DI-0x200 12: r121:V16QI=const_vector + 13: [r119:DI]=unspec[r121:V16QI,r121:V16QI] 38 + 14: [r119:DI+0x20]=unspec[r121:V16QI,r121:V16QI] 38 + 15: [r119:DI+0x40]=unspec[r121:V16QI,r121:V16QI] 38 + 16: [r119:DI+0x60]=unspec[r121:V16QI,r121:V16QI] 38 + 17: [r119:DI+0x80]=unspec[r121:V16QI,r121:V16QI] 38 + 18: [r119:DI+0xa0]=unspec[r121:V16QI,r121:V16QI] 38 + 19: [r119:DI+0xc0]=unspec[r121:V16QI,r121:V16QI] 38 + 20: [r119:DI+0xe0]=unspec[r121:V16QI,r121:V16QI] 38 + REG_DEAD r119:DI 21: r122:DI=sfp:DI-0x100 + 24: [r122:DI]=unspec[r121:V16QI,r121:V16QI] 38 + 25: [r122:DI+0x20]=unspec[r121:V16QI,r121:V16QI] 38 + 26: [r122:DI+0x40]=unspec[r121:V16QI,r121:V16QI] 38 + 27: [r122:DI+0x60]=unspec[r121:V16QI,r121:V16QI] 38 + 28: [r122:DI+0x80]=unspec[r121:V16QI,r121:V16QI] 38 + 29: [r122:DI+0xa0]=unspec[r121:V16QI,r121:V16QI] 38 + 30: [r122:DI+0xc0]=unspec[r121:V16QI,r121:V16QI] 38 + 31: [r122:DI+0xe0]=unspec[r121:V16QI,r121:V16QI] 38 + REG_DEAD r122:DI + REG_DEAD r121:V16QI 6: r100:V4SF=const_vector 7: r106:SI=0 32: cc:CC=cmp(r116:SI,0) @@ -254,6 +256,7 @@ 73: r100:V4SF={r147:V4SF*r147:V4SF+r115:V4SF} REG_DEAD r147:V4SF REG_DEAD r115:V4SF + 74: [sfp:DI-0x200]=r100:V4SF 75: r148:SI=r106:SI+0x2 REG_DEAD r106:SI 76: r106:SI=zero_extend(r148:SI#0) (the unspec 38s are store pairs).
[Bug rtl-optimization/113597] [14 Regression] aarch64: Significant code quality regression since r14-8346-ga98d5130a6dcff
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113597 --- Comment #5 from Andrew Pinski --- Note I think this testcase has been reduced too much, but maybe that can be "fixed". The stores to the arguments go past the bounds.
[Bug rtl-optimization/113597] [14 Regression] aarch64: Significant code quality regression since r14-8346-ga98d5130a6dcff
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113597 --- Comment #4 from Alex Coplan --- Created attachment 57211 --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=57211=edit after.s
[Bug rtl-optimization/113597] [14 Regression] aarch64: Significant code quality regression since r14-8346-ga98d5130a6dcff
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113597 --- Comment #3 from Alex Coplan --- Created attachment 57210 --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=57210=edit before.s
[Bug rtl-optimization/113597] [14 Regression] aarch64: Significant code quality regression since r14-8346-ga98d5130a6dcff
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113597 --- Comment #2 from Alex Coplan --- (In reply to Richard Biener from comment #1) > I will have a look - but can you explain for me what I see? I suppose the > testcase was reduced from something? Yeah, the testcase is reduced. > > Is the assembly diff complete? That is, do we really have more fmla or > are they just moved? I think the diff is complete, I can upload the full before/after asm. > > + stp q31, q31, [sp, 256] > > that's a store? A paired store? Aka, the sequence fills a stack(?) > region with replications of q31? That's right. I'll try to take a look at the RTL dumps too to see if I can figure out anything, too.
[Bug rtl-optimization/113597] [14 Regression] aarch64: Significant code quality regression since r14-8346-ga98d5130a6dcff
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113597 Richard Biener changed: What|Removed |Added Target Milestone|--- |14.0 Ever confirmed|0 |1 Last reconfirmed||2024-01-25 Status|UNCONFIRMED |ASSIGNED Assignee|unassigned at gcc dot gnu.org |rguenth at gcc dot gnu.org
[Bug rtl-optimization/113597] [14 Regression] aarch64: Significant code quality regression since r14-8346-ga98d5130a6dcff
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113597 --- Comment #1 from Richard Biener --- I will have a look - but can you explain for me what I see? I suppose the testcase was reduced from something? Is the assembly diff complete? That is, do we really have more fmla or are they just moved? + stp q31, q31, [sp, 256] that's a store? A paired store? Aka, the sequence fills a stack(?) region with replications of q31?