[Bug target/113618] [14 Regression] AArch64: memmove idiom regression
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113618 Wilco changed: What|Removed |Added Resolution|--- |FIXED Status|NEW |RESOLVED --- Comment #6 from Wilco --- Fixed.
[Bug target/113618] [14 Regression] AArch64: memmove idiom regression
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113618 --- Comment #5 from GCC Commits --- The master branch has been updated by Wilco Dijkstra : https://gcc.gnu.org/g:19b23bf3c32df3cbb96b3d898a1d7142f7bea4a0 commit r14-9373-g19b23bf3c32df3cbb96b3d898a1d7142f7bea4a0 Author: Wilco Dijkstra Date: Wed Feb 21 23:33:58 2024 + AArch64: memcpy/memset expansions should not emit LDP/STP [PR113618] The new RTL introduced for LDP/STP results in regressions due to use of UNSPEC. Given the new LDP fusion pass is good at finding LDP opportunities, change the memcpy, memmove and memset expansions to emit single vector loads/stores. This fixes the regression and enables more RTL optimization on the standard memory accesses. Handling of unaligned tail of memcpy/memmove is improved with -mgeneral-regs-only. SPEC2017 performance improves slightly. Codesize is a bit worse due to missed LDP opportunities as discussed in the PR. gcc/ChangeLog: PR target/113618 * config/aarch64/aarch64.cc (aarch64_copy_one_block): Remove. (aarch64_expand_cpymem): Emit single load/store only. (aarch64_set_one_block): Emit single stores only. gcc/testsuite/ChangeLog: PR target/113618 * gcc.target/aarch64/pr113618.c: New test.
[Bug target/113618] [14 Regression] AArch64: memmove idiom regression
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113618 Jeffrey A. Law changed: What|Removed |Added CC||law at gcc dot gnu.org Priority|P3 |P2
[Bug target/113618] [14 Regression] AArch64: memmove idiom regression
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113618 Wilco changed: What|Removed |Added Assignee|unassigned at gcc dot gnu.org |wilco at gcc dot gnu.org --- Comment #4 from Wilco --- (In reply to Alex Coplan from comment #1) > Confirmed. > > (In reply to Wilco from comment #0) > > A possible fix would be to avoid emitting LDP/STP in memcpy/memmove/memset > > expansions. > > Yeah, so I had posted > https://gcc.gnu.org/pipermail/gcc-patches/2023-November/636855.html for that > but held off from committing it at the time as IMO there wasn't enough > evidence to show that this helps in general (and the pass could in theory > miss opportunities which would lead to regressions). > > But perhaps this is a good argument for going ahead with that change (of > course it will need rebasing). Yes I have a patch based on current trunk + my outstanding memset cleanup patch. It's slightly faster but causes a small codesize regression. This appears mostly due to GCC being overly aggressive in changing loads/stores with a zero offset into indexing, a non-zero offset or a lo_sym. This not only blocks LDP opportunities but also increases register pressure and spilling.
[Bug target/113618] [14 Regression] AArch64: memmove idiom regression
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113618 --- Comment #3 from Wilco --- (In reply to Richard Biener from comment #2) > It might be good to recognize this pattern in strlenopt or a related pass. > > A purely local transform would turn it into > > memcpy (temp, a, 64); > memmove (b, a, 64); > > relying on DSE to eliminate the copy to temp if possible. Not sure if > that possibly would be a bad transform if copying to temp is required. This would only be beneficial if you know memmove is inlined if memcpy is - on almost all targets memmove becomes a library call, so the transformation would be worse if memcpy can be inlined. > stp q30, q31, [sp] > ldp q30, q31, [sp] > > why is CSE not able to catch this? The new RTL now has UNSPECs in them, so CSE doesn't know it is a plain load/store: STP: (insn 12 11 13 2 (set (mem/c:V2x16QI (reg:DI 102) [0 +0 S32 A128]) (unspec:V2x16QI [ (reg:V4SI 104) (reg:V4SI 105) ] UNSPEC_STP)) "/app/example.c":5:5 -1 (nil)) LDP: (insn 16 15 17 2 (parallel [ (set (reg:V4SI 108) (unspec:V4SI [ (mem/c:V2x16QI (reg:DI 107) [0 +0 S32 A128]) ] UNSPEC_LDP_FST)) (set (reg:V4SI 109) (unspec:V4SI [ (mem/c:V2x16QI (reg:DI 107) [0 +0 S32 A128]) ] UNSPEC_LDP_SND)) ]) "/app/example.c":6:5 -1 (nil))
[Bug target/113618] [14 Regression] AArch64: memmove idiom regression
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113618 --- Comment #2 from Richard Biener --- It might be good to recognize this pattern in strlenopt or a related pass. A purely local transform would turn it into memcpy (temp, a, 64); memmove (b, a, 64); relying on DSE to eliminate the copy to temp if possible. Not sure if that possibly would be a bad transform if copying to temp is required. stp q30, q31, [sp] ldp q30, q31, [sp] why is CSE not able to catch this?
[Bug target/113618] [14 Regression] AArch64: memmove idiom regression
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113618 Andrew Pinski changed: What|Removed |Added Keywords||missed-optimization CC||pinskia at gcc dot gnu.org Target Milestone|--- |14.0
[Bug target/113618] [14 Regression] AArch64: memmove idiom regression
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113618 Alex Coplan changed: What|Removed |Added Last reconfirmed||2024-01-26 Ever confirmed|0 |1 Status|UNCONFIRMED |NEW CC||acoplan at gcc dot gnu.org --- Comment #1 from Alex Coplan --- Confirmed. (In reply to Wilco from comment #0) > A possible fix would be to avoid emitting LDP/STP in memcpy/memmove/memset > expansions. Yeah, so I had posted https://gcc.gnu.org/pipermail/gcc-patches/2023-November/636855.html for that but held off from committing it at the time as IMO there wasn't enough evidence to show that this helps in general (and the pass could in theory miss opportunities which would lead to regressions). But perhaps this is a good argument for going ahead with that change (of course it will need rebasing).