[Bug rtl-optimization/113597] [14 Regression] aarch64: Significant code quality regression since r14-8346-ga98d5130a6dcff

2024-03-07 Thread law at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113597

Jeffrey A. Law  changed:

   What|Removed |Added

   Priority|P3  |P2
 CC||law at gcc dot gnu.org

[Bug rtl-optimization/113597] [14 Regression] aarch64: Significant code quality regression since r14-8346-ga98d5130a6dcff

2024-01-29 Thread rguenth at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113597

Richard Biener  changed:

   What|Removed |Added

  Attachment #57214|0   |1
is obsolete||

--- Comment #13 from Richard Biener  ---
Created attachment 57252
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=57252=edit
prototype fix

Note when I extended the patch to also cover a PARM_DECL base to extent
coverage I see

FAIL: gcc.dg/torture/pr70421.c   -O1  execution test
FAIL: gcc.dg/torture/pr70421.c   -O2  execution test
FAIL: gcc.dg/torture/pr70421.c   -O3 -g  execution test
FAIL: gcc.dg/torture/pr70421.c   -Os  execution test
FAIL: gcc.dg/torture/pr70421.c   -O2 -flto -fno-use-linker-plugin
-flto-partitio
n=none  execution test
FAIL: gcc.dg/torture/pr70421.c   -O2 -flto -fuse-linker-plugin
-fno-fat-lto-obje
cts  execution test

on x86_64.  It seems that arg_base_value isn't the correct thing to use
but it eventually should have been unique_base_value (UNIQUE_BASE_VALUE_ARGP)?
I'm not sure whether all the different unique base values mean we'll not
be able to derive exactly those classes from MEM_EXPRs.

[Bug rtl-optimization/113597] [14 Regression] aarch64: Significant code quality regression since r14-8346-ga98d5130a6dcff

2024-01-25 Thread rguenth at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113597

--- Comment #12 from Richard Biener  ---
Created attachment 57214
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=57214=edit
prototype fix

The attached prototype fixes the testcase for me.

[Bug rtl-optimization/113597] [14 Regression] aarch64: Significant code quality regression since r14-8346-ga98d5130a6dcff

2024-01-25 Thread rguenth at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113597

--- Comment #11 from Richard Biener  ---
In DSE the only differences is

 fbt (0x751a1a50: (plus:DI (reg/v/f:DI 117 [ u ])
-(reg:DI 146 [ _44 ]))) == (address 0)
+(reg:DI 146 [ _44 ]))) == (nil)
 fbt (0x7700b3c0: (reg/f:DI 64 sfp)) == (address:DI -3)
-bac false
+bac true

that's for

(mem:BLK (reg/f:DI 64 sfp) [0  A8])

vs

(mem:V4SF (plus:DI (reg/v/f:DI 117 [ u ])
(reg:DI 146 [ _44 ])) [0 MEM <__Float32x4_t> [(float * {ref-all})_42]+0
S16 A32])

from

#0  0x02ff3796 in scan_reads (insn_info=0x5e5b680, gen=0x5ec2338, 
kill=0x5ec2358) at /space/rguenther/src/gcc/gcc/dse.cc:3156
#1  0x02ff39b1 in dse_step3_scan (bb=)
at /space/rguenther/src/gcc/gcc/dse.cc:3238

processing

(insn 62 61 64 5 (set (reg:V4SF 147 [ MEM <__Float32x4_t> [(float *
{ref-all})_42] ])
(mem:V4SF (plus:DI (reg/v/f:DI 117 [ u ])
(reg:DI 146 [ _44 ])) [0 MEM <__Float32x4_t> [(float *
{ref-all})_42]+0 S16 A32])) "include/arm_neon.h":12531:36 1274
{*aarch64_simd_movv4sf}
 (expr_list:REG_DEAD (reg:DI 146 [ _44 ])
(nil)))

in this case we have _44 point to NONLOCAL only.  It got arg_base_value
as base value (from the MEM_EXPR and that points-to set we could
eventually derive this very same base term as well).

But I'll note that (mem:BLK (reg/f:DI 64 sfp) [0  A8]) is artificial,
generated by DSE get_group_info via record_store on

(insn 13 12 14 2 (set (mem/c:V2x16QI (reg/f:DI 119) [0 +0 S32 A128])
(unspec:V2x16QI [
(reg:V16QI 121) repeated x2
] UNSPEC_STP)) "t.cc":12:10 discrim 1 92 {*store_pair_16}
 (nil))

which is figured to be const_or_frame_p () based.  That notably
lacks a MEM_EXPR (though the bare MEM means only base_alias_check would
ever be able to disambiguate here).

[Bug rtl-optimization/113597] [14 Regression] aarch64: Significant code quality regression since r14-8346-ga98d5130a6dcff

2024-01-25 Thread rguenth at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113597

--- Comment #10 from Richard Biener  ---
Created attachment 57212
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=57212=edit
patch for debugging

Btw, I've used the attached to investigate other issues with the change.  It
will show the outcome of base_alias_check and find_base_term in dumps.

One issue is that we're much more dependent on MEM_EXPRs being present.

Before figuring there wouldn't be much important regressions the idea was to
instead of doing find_base_term have a known base value recorded in the
MEM_ATTRs, and as the only important ones should be the special ones for
argument frame and stack-based represent that by an enum (rather than
the other possibility of using ADDRESS).  I'll also note that for spill
slots we get around to use spill_slot_decl and set_mem_attrs_for_spill.

I've not yet convinced myself that the other special bases we have really
form a completely separate memory class.  But if they do then accesses
should do something similar there (but mind scheduling of frame related
instructions ...).

Argument stack slots are one important class, set up by init_alias_analysis.
But those are also backed by regular decls at times (but not always)?

assign_stack_temp "allocated" memory is another class, we're reusing
slots during RTL expansion and they get (even if shared) a specific
alias set.  I don't think we ever release those temps and say re-use
the space for spilling so assigning a different decl to each slot
should eventually work.

[Bug rtl-optimization/113597] [14 Regression] aarch64: Significant code quality regression since r14-8346-ga98d5130a6dcff

2024-01-25 Thread acoplan at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113597

--- Comment #9 from Alex Coplan  ---
(In reply to Andrew Pinski from comment #8)
> (In reply to Alex Coplan from comment #7)
> > I expect the store pairs come from memcpy lowering/expansion in the aarch64
> > backend, that is the only way we get store pairs so early in the RTL
> > pipeline IIRC.
> 
> In this case, memset is more likely.

Right, yeah.  I was using "memcpy lowering" to refer to all the
mem{cpy,set,move} expansion we have in the backend.

> 
> Either:
> for (int i = 0; i < j; i++)
> m[i] = vdupq_n_f32(0.F);
> Or
> for (int i = 0; i < l; i++)
> n[i] = vdupq_n_f32(0.F);

[Bug rtl-optimization/113597] [14 Regression] aarch64: Significant code quality regression since r14-8346-ga98d5130a6dcff

2024-01-25 Thread pinskia at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113597

--- Comment #8 from Andrew Pinski  ---
(In reply to Alex Coplan from comment #7)
> I expect the store pairs come from memcpy lowering/expansion in the aarch64
> backend, that is the only way we get store pairs so early in the RTL
> pipeline IIRC.

In this case, memset is more likely.  

Either:
for (int i = 0; i < j; i++)
m[i] = vdupq_n_f32(0.F);
Or
for (int i = 0; i < l; i++)
n[i] = vdupq_n_f32(0.F);

[Bug rtl-optimization/113597] [14 Regression] aarch64: Significant code quality regression since r14-8346-ga98d5130a6dcff

2024-01-25 Thread acoplan at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113597

--- Comment #7 from Alex Coplan  ---
I expect the store pairs come from memcpy lowering/expansion in the aarch64
backend, that is the only way we get store pairs so early in the RTL pipeline
IIRC.

[Bug rtl-optimization/113597] [14 Regression] aarch64: Significant code quality regression since r14-8346-ga98d5130a6dcff

2024-01-25 Thread acoplan at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113597

--- Comment #6 from Alex Coplan  ---
Looking at the dump files, the first difference seems to be in 292r.dse1:

 8: NOTE_INSN_BASIC_BLOCK 2
 2: r116:SI=zero_extend(x0:HI)
   REG_DEAD x0:HI
@@ -178,7 +161,26 @@
 5: NOTE_INSN_FUNCTION_BEG
10: r119:DI=sfp:DI-0x200
12: r121:V16QI=const_vector
+   13: [r119:DI]=unspec[r121:V16QI,r121:V16QI] 38
+   14: [r119:DI+0x20]=unspec[r121:V16QI,r121:V16QI] 38
+   15: [r119:DI+0x40]=unspec[r121:V16QI,r121:V16QI] 38
+   16: [r119:DI+0x60]=unspec[r121:V16QI,r121:V16QI] 38
+   17: [r119:DI+0x80]=unspec[r121:V16QI,r121:V16QI] 38
+   18: [r119:DI+0xa0]=unspec[r121:V16QI,r121:V16QI] 38
+   19: [r119:DI+0xc0]=unspec[r121:V16QI,r121:V16QI] 38
+   20: [r119:DI+0xe0]=unspec[r121:V16QI,r121:V16QI] 38
+  REG_DEAD r119:DI
21: r122:DI=sfp:DI-0x100
+   24: [r122:DI]=unspec[r121:V16QI,r121:V16QI] 38
+   25: [r122:DI+0x20]=unspec[r121:V16QI,r121:V16QI] 38
+   26: [r122:DI+0x40]=unspec[r121:V16QI,r121:V16QI] 38
+   27: [r122:DI+0x60]=unspec[r121:V16QI,r121:V16QI] 38
+   28: [r122:DI+0x80]=unspec[r121:V16QI,r121:V16QI] 38
+   29: [r122:DI+0xa0]=unspec[r121:V16QI,r121:V16QI] 38
+   30: [r122:DI+0xc0]=unspec[r121:V16QI,r121:V16QI] 38
+   31: [r122:DI+0xe0]=unspec[r121:V16QI,r121:V16QI] 38
+  REG_DEAD r122:DI
+  REG_DEAD r121:V16QI
 6: r100:V4SF=const_vector
 7: r106:SI=0
32: cc:CC=cmp(r116:SI,0)
@@ -254,6 +256,7 @@
73: r100:V4SF={r147:V4SF*r147:V4SF+r115:V4SF}
   REG_DEAD r147:V4SF
   REG_DEAD r115:V4SF
+   74: [sfp:DI-0x200]=r100:V4SF
75: r148:SI=r106:SI+0x2
   REG_DEAD r106:SI
76: r106:SI=zero_extend(r148:SI#0)

(the unspec 38s are store pairs).

[Bug rtl-optimization/113597] [14 Regression] aarch64: Significant code quality regression since r14-8346-ga98d5130a6dcff

2024-01-25 Thread pinskia at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113597

--- Comment #5 from Andrew Pinski  ---
Note I think this testcase has been reduced too much, but maybe that can be
"fixed". The stores to the arguments go past the bounds.

[Bug rtl-optimization/113597] [14 Regression] aarch64: Significant code quality regression since r14-8346-ga98d5130a6dcff

2024-01-25 Thread acoplan at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113597

--- Comment #4 from Alex Coplan  ---
Created attachment 57211
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=57211=edit
after.s

[Bug rtl-optimization/113597] [14 Regression] aarch64: Significant code quality regression since r14-8346-ga98d5130a6dcff

2024-01-25 Thread acoplan at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113597

--- Comment #3 from Alex Coplan  ---
Created attachment 57210
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=57210=edit
before.s

[Bug rtl-optimization/113597] [14 Regression] aarch64: Significant code quality regression since r14-8346-ga98d5130a6dcff

2024-01-25 Thread acoplan at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113597

--- Comment #2 from Alex Coplan  ---
(In reply to Richard Biener from comment #1)
> I will have a look - but can you explain for me what I see?  I suppose the
> testcase was reduced from something?

Yeah, the testcase is reduced.

> 
> Is the assembly diff complete?  That is, do we really have more fmla or
> are they just moved?

I think the diff is complete, I can upload the full before/after asm.

> 
> + stp q31, q31, [sp, 256] 
> 
> that's a store?  A paired store?  Aka, the sequence fills a stack(?)
> region with replications of q31?

That's right.

I'll try to take a look at the RTL dumps too to see if I can figure out
anything, too.

[Bug rtl-optimization/113597] [14 Regression] aarch64: Significant code quality regression since r14-8346-ga98d5130a6dcff

2024-01-25 Thread rguenth at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113597

Richard Biener  changed:

   What|Removed |Added

   Target Milestone|--- |14.0
 Ever confirmed|0   |1
   Last reconfirmed||2024-01-25
 Status|UNCONFIRMED |ASSIGNED
   Assignee|unassigned at gcc dot gnu.org  |rguenth at gcc dot 
gnu.org

[Bug rtl-optimization/113597] [14 Regression] aarch64: Significant code quality regression since r14-8346-ga98d5130a6dcff

2024-01-25 Thread rguenth at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113597

--- Comment #1 from Richard Biener  ---
I will have a look - but can you explain for me what I see?  I suppose the
testcase was reduced from something?

Is the assembly diff complete?  That is, do we really have more fmla or
are they just moved?

+ stp   q31, q31, [sp, 256]

that's a store?  A paired store?  Aka, the sequence fills a stack(?)
region with replications of q31?