[Bug target/106694] Redundant move instructions in ARM SVE intrinsics use cases
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106694

Bug 106694 depends on bug 99161, which changed state.

Bug 99161 Summary: Suboptimal SVE code for ld4/st4 MLA code
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99161

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|UNCONFIRMED                 |RESOLVED
         Resolution|---                         |FIXED
[Bug target/106694] Redundant move instructions in ARM SVE intrinsics use cases
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106694

Richard Sandiford changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
         Resolution|---                         |FIXED
             Status|ASSIGNED                    |RESOLVED

--- Comment #14 from Richard Sandiford ---
Fixed for this case. The patch only deals with cases that can be allocated without spilling, but Lehua has a more general fix that should go into GCC 15.
[Bug target/106694] Redundant move instructions in ARM SVE intrinsics use cases
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106694

--- Comment #13 from GCC Commits ---
The trunk branch has been updated by Richard Sandiford:

https://gcc.gnu.org/g:9f0f7d802482a8958d6cdc72f1fe0c8549db2182

commit r14-6290-g9f0f7d802482a8958d6cdc72f1fe0c8549db2182
Author: Richard Sandiford
Date: Thu Dec 7 19:41:19 2023 +

aarch64: Add an early RA for strided registers

This pass adds a simple register allocator for FP & SIMD registers. Its main purpose is to make use of SME2's strided LD1, ST1 and LUTI2/4 instructions, which require a very specific grouping structure, and so would be difficult to exploit with general allocation.

The allocator is very simple. It gives up on anything that would require spilling, or that it might not handle well for other reasons.

The allocator needs to track liveness at the level of individual FPRs. Doing that fixes a lot of the PRs relating to redundant moves caused by structure loads and stores. That particular problem is going to be fixed more generally for GCC 15 by Lehua's RA patches.

However, the early-RA pass runs before scheduling, so it has a chance to bag a spill-free allocation of vector code before the scheduler moves things around. It could therefore still be useful for non-SME code (e.g. for hand-scheduled ACLE code) even after Lehua's patches are in.

The pass is controlled by a tristate switch:

- -mearly-ra=all: run on all functions
- -mearly-ra=strided: run on functions that have access to strided registers
- -mearly-ra=none: don't run on any function

The patch makes -mearly-ra=all the default at -O2 and above for now. We can revisit this for GCC 15 once Lehua's patches are in; -mearly-ra=strided might then be more appropriate.

As said previously, the pass is very naive. There's much more that we could do, such as handling invariants better. The main focus is on not committing to a bad allocation, rather than on handling as much as possible.
gcc/
    PR rtl-optimization/106694
    PR rtl-optimization/109078
    PR rtl-optimization/109391
    * config.gcc: Add aarch64-early-ra.o for AArch64 targets.
    * config/aarch64/t-aarch64 (aarch64-early-ra.o): New rule.
    * config/aarch64/aarch64-opts.h (aarch64_early_ra_scope): New enum.
    * config/aarch64/aarch64.opt (mearly_ra): New option.
    * doc/invoke.texi: Document it.
    * common/config/aarch64/aarch64-common.cc
    (aarch_option_optimization_table): Use -mearly-ra=strided by
    default for -O2 and above.
    * config/aarch64/aarch64-passes.def (pass_aarch64_early_ra): New pass.
    * config/aarch64/aarch64-protos.h (aarch64_strided_registers_p)
    (make_pass_aarch64_early_ra): Declare.
    * config/aarch64/aarch64-sme.md (@aarch64_sme_lut): Add a
    stride_type attribute.
    (@aarch64_sme_lut_strided2): New pattern.
    (@aarch64_sme_lut_strided4): Likewise.
    * config/aarch64/aarch64-sve-builtins-base.cc (svld1_impl::expand)
    (svldnt1_impl::expand, svst1_impl::expand, svstn1_impl::expand): Handle
    new way of defining multi-register loads and stores.
    * config/aarch64/aarch64-sve.md (@aarch64_ld1)
    (@aarch64_ldnt1, @aarch64_st1)
    (@aarch64_stnt1): Delete.
    * config/aarch64/aarch64-sve2.md (@aarch64_)
    (@aarch64__strided2): New patterns.
    (@aarch64__strided4): Likewise.
    (@aarch64_): Likewise.
    (@aarch64__strided2): Likewise.
    (@aarch64__strided4): Likewise.
    * config/aarch64/aarch64.cc (aarch64_strided_registers_p): New function.
    * config/aarch64/aarch64.md (UNSPEC_LD1_SVE_COUNT): Delete.
    (UNSPEC_ST1_SVE_COUNT, UNSPEC_LDNT1_SVE_COUNT): Likewise.
    (UNSPEC_STNT1_SVE_COUNT): Likewise.
    (stride_type): New attribute.
    * config/aarch64/constraints.md (Uwd, Uwt): New constraints.
    * config/aarch64/iterators.md (UNSPEC_LD1_COUNT, UNSPEC_LDNT1_COUNT)
    (UNSPEC_ST1_COUNT, UNSPEC_STNT1_COUNT): New unspecs.
    (optab): Handle them.
    (LD1_COUNT, ST1_COUNT): New iterators.
    * config/aarch64/aarch64-early-ra.cc: New file.
gcc/testsuite/
    PR rtl-optimization/106694
    PR rtl-optimization/109078
    PR rtl-optimization/109391
    * gcc.target/aarch64/ldp_stp_16.c (cons4_4_float): Tighten expected
    output test.
    * gcc.target/aarch64/sve/shift_1.c: Allow reversed shifts for .s
    as well as .d.
    * gcc.target/aarch64/sme/strided_1.c: New test.
    * gcc.target/aarch64/pr109078.c: Likewise.
    * gcc.target/aarch64/pr109391.c: Likewise.
    * gcc.target/aarch64/sve/pr106694.c:
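
For readers skimming the commit message above, the tristate -mearly-ra switch can be summarized as a small decision function. This is only a sketch of the semantics as the commit message describes them: the enumerator names, the function name and the has_strided_regs parameter are hypothetical and are not taken from the GCC sources (the ChangeLog only names the enum aarch64_early_ra_scope).

```c
#include <stdbool.h>

/* Hypothetical names for illustration only; they mirror the three
   documented values of -mearly-ra, not GCC's actual enumerators.  */
enum early_ra_scope {
    EARLY_RA_ALL,      /* -mearly-ra=all */
    EARLY_RA_STRIDED,  /* -mearly-ra=strided */
    EARLY_RA_NONE      /* -mearly-ra=none */
};

/* Return true if the early-RA pass would run on a function, given the
   selected scope and whether that function has access to strided
   registers (an assumed boolean input for this sketch).  */
static bool early_ra_runs_on(enum early_ra_scope scope, bool has_strided_regs)
{
    switch (scope) {
    case EARLY_RA_ALL:
        return true;              /* run on all functions */
    case EARLY_RA_STRIDED:
        return has_strided_regs;  /* only where strided registers exist */
    case EARLY_RA_NONE:
        return false;             /* never run */
    }
    return false;
}
```

On the command line, the pass can be switched off for comparison by compiling with -mearly-ra=none alongside the usual optimization level.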
[Bug target/106694] Redundant move instructions in ARM SVE intrinsics use cases
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106694

--- Comment #12 from JuzheZhong ---
(In reply to Richard Sandiford from comment #10)
> Some of the SME changes I'm working on fix this, but I'm not sure how widely
> we'll be able to use them on non-SME code. Assigning myself just in case.

Hi, Richard. My colleague Lehua has sent patches for general subreg liveness tracking. We are confident they fix all the subreg issues for RVV and ARM SVE. We are not sure about SME code, since we were not able to test it. This is a general optimization, and we hope it can land.
[Bug target/106694] Redundant move instructions in ARM SVE intrinsics use cases
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106694

--- Comment #11 from JuzheZhong ---
(In reply to Richard Sandiford from comment #10)
> Some of the SME changes I'm working on fix this, but I'm not sure how widely
> we'll be able to use them on non-SME code. Assigning myself just in case.

Hi, Richard. We are going to fix the subreg issue with subreg liveness tracking in IRA/LRA, hopefully today.
[Bug target/106694] Redundant move instructions in ARM SVE intrinsics use cases
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106694

Richard Sandiford changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
           Assignee|unassigned at gcc dot gnu.org   |rsandifo at gcc dot gnu.org
             Status|NEW                         |ASSIGNED
                 CC|                            |rsandifo at gcc dot gnu.org

--- Comment #10 from Richard Sandiford ---
Some of the SME changes I'm working on fix this, but I'm not sure how widely we'll be able to use them on non-SME code. Assigning myself just in case.
[Bug target/106694] Redundant move instructions in ARM SVE intrinsics use cases
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106694

Andrew Pinski changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
           See Also|                            |https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89967
     Ever confirmed|0                           |1
             Status|UNCONFIRMED                 |NEW
   Last reconfirmed|                            |2022-10-27

--- Comment #9 from Andrew Pinski ---
Oh yes PR 89967.
[Bug target/106694] Redundant move instructions in ARM SVE intrinsics use cases
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106694

--- Comment #8 from Andrew Pinski ---
I suspect you could get a similar testcase with ARM neon too.
[Bug target/106694] Redundant move instructions in ARM SVE intrinsics use cases
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106694

--- Comment #7 from Andrew Pinski ---
*** Bug 107445 has been marked as a duplicate of this bug. ***
[Bug target/106694] Redundant move instructions in ARM SVE intrinsics use cases
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106694

--- Comment #6 from JuzheZhong ---
(In reply to Andrew Pinski from comment #5)
> (In reply to JuzheZhong from comment #4)
> > No. I am not saying the issue of "movprfx". I am saying the issue of the
> > redundant "mov" instructions.:
> > mov z5.d, z24.d
> > mov z4.d, z25.d
> > mov z3.d, z26.d
> > mov z2.d, z27.d
> >
> >
> > This is the issue that "subreg" didn't propagate across the basic block.
>
> Oh ld4 issue. I thought there was another bug filed against that. The
> problem is even without SVE too IIRC.

Recently I found that LLVM optimizes this kind of case: it handles these "subreg" issues through register coalescing. I wonder whether someone could implement this in GCC.

Separately, I am working on upstreaming RVV support to the RISC-V port. I have implemented register coalescing for RVV and open-sourced it in the RISC-V foundation repo. I don't know whether my register coalescing is appropriate to push upstream:
https://github.com/riscv-collab/riscv-gcc/blob/riscv-gcc-rvv-next/gcc/ira-coalesce.cc

If it is not appropriate for the generic code, I will try to keep it inside the RISC-V port without changing ira.cc.
[Bug target/106694] Redundant move instructions in ARM SVE intrinsics use cases
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106694

--- Comment #5 from Andrew Pinski ---
(In reply to JuzheZhong from comment #4)
> No. I am not saying the issue of "movprfx". I am saying the issue of the
> redundant "mov" instructions.:
> mov z5.d, z24.d
> mov z4.d, z25.d
> mov z3.d, z26.d
> mov z2.d, z27.d
>
>
> This is the issue that "subreg" didn't propagate across the basic block.

Oh ld4 issue. I thought there was another bug filed against that. The problem is even without SVE too IIRC.
[Bug target/106694] Redundant move instructions in ARM SVE intrinsics use cases
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106694

--- Comment #4 from JuzheZhong ---
(In reply to Andrew Pinski from comment #1)
> This is backend issue:
> //(insn 27 31 28 (set (reg/v:VNx2DI 37 v5 [orig:98 v0 ] [98])
> //(unspec:VNx2DI [
> //(reg:VNx2BI 68 p0 [orig:105 pg ] [105])
> //(plus:VNx2DI (mult:VNx2DI (reg/v:VNx2DI 37 v5 [orig:98 v0 ] [98])
> //(reg/v:VNx2DI 33 v1 [orig:96 v18 ] [96]))
> //(reg/v:VNx2DI 32 v0 [orig:97 v19 ] [97]))
> //(const_vector:VNx2DI repeat [
> //(const_int 0 [0])
> //])
> //] UNSPEC_SEL)) "/app/example.c":15:14 7415 {*cond_fmavnx2di_any}
> // (nil))
> movprfx z5.d, p0/z, z5.d // 27 [c=4 l=8] *cond_fmavnx2di_any/2
> mad z5.d, p0/m, z1.d, z0.d

No, I am not talking about the "movprfx" issue. I mean the redundant "mov" instructions:
mov z5.d, z24.d
mov z4.d, z25.d
mov z3.d, z26.d
mov z2.d, z27.d

The issue is that the "subreg" is not propagated across basic blocks.
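
One way to picture why copies out of a multi-register value can end up as real "mov" instructions is a toy liveness model. The sketch below is a simplified illustration, not GCC's IRA/LRA or early-RA code; all names are hypothetical. It compares tracking a 4-register tuple as a single unit against tracking each register separately, which is the granularity the later fixes in this thread rely on. The counts it produces describe the model only and are not meant to reproduce the four moves quoted above.

```c
#include <stdint.h>

/* Toy model: a tuple of 4 vector registers (as produced by an ld4-style
   load) whose parts are copied out one at a time, in order.  */
enum { TUPLE_PARTS = 4 };

/* Bit i set => part i of the tuple is still read later.  */
typedef uint8_t part_mask;

/* Whole-value tracking: while any part is live, the tuple is treated
   as occupying all of its registers.  */
static part_mask blocked_whole(part_mask live_parts)
{
    return live_parts ? (part_mask)((1u << TUPLE_PARTS) - 1u) : 0;
}

/* Per-register tracking: only the parts that are still live block
   their registers.  */
static part_mask blocked_per_part(part_mask live_parts)
{
    return live_parts;
}

/* Count the copies that cannot be turned into no-ops.  After part i is
   copied out it is dead in the tuple; the copy needs a real move only
   if register i is still considered occupied by the tuple.  */
static int count_moves(part_mask (*blocked)(part_mask))
{
    part_mask live = (part_mask)((1u << TUPLE_PARTS) - 1u);
    int moves = 0;
    for (int i = 0; i < TUPLE_PARTS; i++) {
        live &= (part_mask)~(1u << i);  /* part i dies at its copy */
        if (blocked(live) & (1u << i))
            moves++;                    /* destination cannot reuse register i */
    }
    return moves;
}
```

In this model, count_moves(blocked_whole) is nonzero because the tuple keeps all of its registers reserved until its last part dies, while count_moves(blocked_per_part) is zero because each destination can take over the register its source part just vacated.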
[Bug target/106694] Redundant move instructions in ARM SVE intrinsics use cases
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106694

Andrew Pinski changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
         Depends on|                            |99161

--- Comment #3 from Andrew Pinski ---
And the same issue as PR 99161.

Referenced Bugs:

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99161
[Bug 99161] Suboptimal SVE code for ld4/st4 MLA code
[Bug target/106694] Redundant move instructions in ARM SVE intrinsics use cases
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106694

Andrew Pinski changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
         Depends on|                            |106146

--- Comment #2 from Andrew Pinski ---
Most likely the same issue as PR 106146.

Referenced Bugs:

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106146
[Bug 106146] a redundant movprfx insn compare to llvm
[Bug target/106694] Redundant move instructions in ARM SVE intrinsics use cases
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106694

--- Comment #1 from Andrew Pinski ---
This is a backend issue:
//(insn 27 31 28 (set (reg/v:VNx2DI 37 v5 [orig:98 v0 ] [98])
//(unspec:VNx2DI [
//(reg:VNx2BI 68 p0 [orig:105 pg ] [105])
//(plus:VNx2DI (mult:VNx2DI (reg/v:VNx2DI 37 v5 [orig:98 v0 ] [98])
//(reg/v:VNx2DI 33 v1 [orig:96 v18 ] [96]))
//(reg/v:VNx2DI 32 v0 [orig:97 v19 ] [97]))
//(const_vector:VNx2DI repeat [
//(const_int 0 [0])
//])
//] UNSPEC_SEL)) "/app/example.c":15:14 7415 {*cond_fmavnx2di_any}
// (nil))
movprfx z5.d, p0/z, z5.d // 27 [c=4 l=8] *cond_fmavnx2di_any/2
mad z5.d, p0/m, z1.d, z0.d
[Bug target/106694] Redundant move instructions in ARM SVE intrinsics use cases
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106694

Andrew Pinski changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
           Keywords|                            |ra
           Severity|normal                      |enhancement