[Bug target/113618] [14 Regression] AArch64: memmove idiom regression

2024-03-13 Thread wilco at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113618

Wilco  changed:

   What|Removed |Added

 Resolution|--- |FIXED
 Status|NEW |RESOLVED

--- Comment #6 from Wilco  ---
Fixed.

[Bug target/113618] [14 Regression] AArch64: memmove idiom regression

2024-03-07 Thread cvs-commit at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113618

--- Comment #5 from GCC Commits  ---
The master branch has been updated by Wilco Dijkstra :

https://gcc.gnu.org/g:19b23bf3c32df3cbb96b3d898a1d7142f7bea4a0

commit r14-9373-g19b23bf3c32df3cbb96b3d898a1d7142f7bea4a0
Author: Wilco Dijkstra 
Date:   Wed Feb 21 23:33:58 2024 +

AArch64: memcpy/memset expansions should not emit LDP/STP [PR113618]

The new RTL introduced for LDP/STP results in regressions due to use of
UNSPEC.
Given the new LDP fusion pass is good at finding LDP opportunities, change
the
memcpy, memmove and memset expansions to emit single vector loads/stores.
This fixes the regression and enables more RTL optimization on the standard
memory accesses.  Handling of unaligned tail of memcpy/memmove is improved
with -mgeneral-regs-only.  SPEC2017 performance improves slightly. 
Codesize
is a bit worse due to missed LDP opportunities as discussed in the PR.

gcc/ChangeLog:
PR target/113618
* config/aarch64/aarch64.cc (aarch64_copy_one_block): Remove.
(aarch64_expand_cpymem): Emit single load/store only.
(aarch64_set_one_block): Emit single stores only.

gcc/testsuite/ChangeLog:
PR target/113618
* gcc.target/aarch64/pr113618.c: New test.

[Bug target/113618] [14 Regression] AArch64: memmove idiom regression

2024-03-07 Thread law at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113618

Jeffrey A. Law  changed:

   What|Removed |Added

 CC||law at gcc dot gnu.org
   Priority|P3  |P2

[Bug target/113618] [14 Regression] AArch64: memmove idiom regression

2024-01-31 Thread wilco at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113618

Wilco  changed:

   What|Removed |Added

   Assignee|unassigned at gcc dot gnu.org  |wilco at gcc dot gnu.org

--- Comment #4 from Wilco  ---
(In reply to Alex Coplan from comment #1)
> Confirmed.
> 
> (In reply to Wilco from comment #0)
> > A possible fix would be to avoid emitting LDP/STP in memcpy/memmove/memset
> > expansions.
> 
> Yeah, so I had posted
> https://gcc.gnu.org/pipermail/gcc-patches/2023-November/636855.html for that
> but held off from committing it at the time as IMO there wasn't enough
> evidence to show that this helps in general (and the pass could in theory
> miss opportunities which would lead to regressions). 
> 
> But perhaps this is a good argument for going ahead with that change (of
> course it will need rebasing).

Yes I have a patch based on current trunk + my outstanding memset cleanup
patch. It's slightly faster but causes a small codesize regression. This
appears mostly due to GCC being overly aggressive in changing loads/stores with
a zero offset into indexing, a non-zero offset or a lo_sym. This not only
blocks LDP opportunities but also increases register pressure and spilling.

[Bug target/113618] [14 Regression] AArch64: memmove idiom regression

2024-01-29 Thread wilco at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113618

--- Comment #3 from Wilco  ---
(In reply to Richard Biener from comment #2)
> It might be good to recognize this pattern in strlenopt or a related pass.
> 
> A purely local transform would turn it into
> 
> memcpy (temp, a, 64);
> memmove (b, a, 64);
> 
> relying on DSE to eliminate the copy to temp if possible.  Not sure if
> that possibly would be a bad transform if copying to temp is required.

This would only be beneficial if you know memmove is inlined if memcpy is - on
almost all targets memmove becomes a library call, so the transformation would
be worse if memcpy can be inlined.

> stp q30, q31, [sp]
> ldp q30, q31, [sp]
> 
> why is CSE not able to catch this?

The new RTL now has UNSPECs in them, so CSE doesn't know it is a plain
load/store:

STP: 

(insn 12 11 13 2 (set (mem/c:V2x16QI (reg:DI 102) [0 +0 S32 A128])
(unspec:V2x16QI [
(reg:V4SI 104)
(reg:V4SI 105)
] UNSPEC_STP)) "/app/example.c":5:5 -1
 (nil))

LDP:

(insn 16 15 17 2 (parallel [
(set (reg:V4SI 108)
(unspec:V4SI [
(mem/c:V2x16QI (reg:DI 107) [0 +0 S32 A128])
] UNSPEC_LDP_FST))
(set (reg:V4SI 109)
(unspec:V4SI [
(mem/c:V2x16QI (reg:DI 107) [0 +0 S32 A128])
] UNSPEC_LDP_SND))
]) "/app/example.c":6:5 -1
 (nil))

[Bug target/113618] [14 Regression] AArch64: memmove idiom regression

2024-01-28 Thread rguenth at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113618

--- Comment #2 from Richard Biener  ---
It might be good to recognize this pattern in strlenopt or a related pass.

A purely local transform would turn it into

memcpy (temp, a, 64);
memmove (b, a, 64);

relying on DSE to eliminate the copy to temp if possible.  Not sure if
that possibly would be a bad transform if copying to temp is required.

stp q30, q31, [sp]
ldp q30, q31, [sp]

why is CSE not able to catch this?

[Bug target/113618] [14 Regression] AArch64: memmove idiom regression

2024-01-26 Thread pinskia at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113618

Andrew Pinski  changed:

   What|Removed |Added

   Keywords||missed-optimization
 CC||pinskia at gcc dot gnu.org
   Target Milestone|--- |14.0

[Bug target/113618] [14 Regression] AArch64: memmove idiom regression

2024-01-26 Thread acoplan at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113618

Alex Coplan  changed:

   What|Removed |Added

   Last reconfirmed||2024-01-26
 Ever confirmed|0   |1
 Status|UNCONFIRMED |NEW
 CC||acoplan at gcc dot gnu.org

--- Comment #1 from Alex Coplan  ---
Confirmed.

(In reply to Wilco from comment #0)
> A possible fix would be to avoid emitting LDP/STP in memcpy/memmove/memset
> expansions.

Yeah, so I had posted
https://gcc.gnu.org/pipermail/gcc-patches/2023-November/636855.html for that
but held off from committing it at the time as IMO there wasn't enough evidence
to show that this helps in general (and the pass could in theory miss
opportunities which would lead to regressions). 

But perhaps this is a good argument for going ahead with that change (of course
it will need rebasing).