[Bug target/105733] riscv: Poor codegen for large stack frames
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105733 Vineet Gupta changed: What|Removed |Added Status|NEW |RESOLVED Resolution|--- |FIXED --- Comment #6 from Vineet Gupta --- Fixed with aforementioned commit for gcc-15.
[Bug target/105733] riscv: Poor codegen for large stack frames
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105733 --- Comment #5 from GCC Commits --- The master branch has been updated by Vineet Gupta : https://gcc.gnu.org/g:f9cfc192ed0127edb7e79818917dd2859fce4d44 commit r15-757-gf9cfc192ed0127edb7e79818917dd2859fce4d44 Author: Vineet Gupta Date: Mon May 13 11:46:03 2024 -0700 RISC-V: avoid LUI based const mat in prologue/epilogue expansion [PR/105733] If the constant used for stack offset can be expressed as sum of two S12 values, the constant need not be materialized (in a reg) and instead the two S12 bits can be added to instructions involved with frame pointer. This avoids burning a register and more importantly can often get down to be 2 insn vs. 3. The prev patches to generally avoid LUI based const materialization didn't fix this PR and need this directed fix in funcion prologue/epilogue expansion. This fix doesn't move the neddle for SPEC, at all, but it is still a win considering gcc generates one insn fewer than llvm for the test ;-) gcc-13.1 release | gcc 230823 | | |g6619b3d4c15c| This patch | clang/llvm - li t0,-4096 | lit0,-4096 | addi sp,sp,-2048 | addi sp,sp,-2048 addit0,t0,2016 | addi t0,t0,2032| add sp,sp,-16 | addi sp,sp,-32 li a4,4096 | add sp,sp,t0 | add a5,sp,a0| add a1,sp,16 add sp,sp,t0 | addi a5,sp,-2032 | sbzero,0(a5) | add a0,a0,a1 li a5,-4096 | add a0,a5,a0 | addi sp,sp,2032 | sb zero,0(a0) addia4,a4,-2032 | lit0, 4096 | addi sp,sp,32| addi sp,sp,2032 add a4,a4,a5 | sbzero,2032(a0) | ret | addi sp,sp,48 addia5,sp,16 | addi t0,t0,-2032 | | ret add a5,a4,a5 | add sp,sp,t0 | add a0,a5,a0 | ret | li t0,4096 | sd a5,8(sp) | sb zero,2032(a0)| addit0,t0,-2016 | add sp,sp,t0 | ret | gcc/ChangeLog: PR target/105733 * config/riscv/riscv.h: New macros for with aligned offsets. * config/riscv/riscv.cc (riscv_split_sum_of_two_s12): New function to split a sum of two s12 values into constituents. (riscv_expand_prologue): Handle offset being sum of two S12. (riscv_expand_epilogue): Ditto. * config/riscv/riscv-protos.h (riscv_split_sum_of_two_s12): New. gcc/testsuite/ChangeLog: * gcc.target/riscv/pr105733.c: New Test. * gcc.target/riscv/rvv/autovec/vls/spill-1.c: Adjust to not expect LUI 4096. * gcc.target/riscv/rvv/autovec/vls/spill-2.c: Ditto. * gcc.target/riscv/rvv/autovec/vls/spill-3.c: Ditto. * gcc.target/riscv/rvv/autovec/vls/spill-4.c: Ditto. * gcc.target/riscv/rvv/autovec/vls/spill-5.c: Ditto. * gcc.target/riscv/rvv/autovec/vls/spill-6.c: Ditto. * gcc.target/riscv/rvv/autovec/vls/spill-7.c: Ditto. Tested-by: Edwin Lu # pre-commit-CI #1568 Signed-off-by: Vineet Gupta
[Bug target/105733] riscv: Poor codegen for large stack frames
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105733 Vineet Gupta changed: What|Removed |Added CC||vineetg at gcc dot gnu.org --- Comment #4 from Vineet Gupta --- There has been good improvements in gcc codegen specially with commit below. commit 6619b3d4c15cd754798b1048c67f3806bbcc2e6d Author: Jivan Hakobyan Date: Wed Aug 23 14:10:30 2023 -0600 Improve quality of code from LRA register elimination This is primarily Jivan's work, I'm mostly responsible for the write-up and coordinating with Vlad on a few questions. On targets with limitations on immediates usable in arithmetic instructions, LRA's register elimination phase can construct fairly poor code. Tip W/o commit 6619b3d4c| With 6619b3d4c | foo: | foo: li t0,-4096 | li t0,-4096 addit0,t0,2032 | addit0,t0,2032 li a5,0 | li a4,0 | add sp,sp,t0 | add sp,sp,t0 add a4,a4,a5 | add a5,a4,sp | add a5,a5,a0 add a5,a5,a0 | li t0,4096 | li t0,4096 sb zero,0(a5) | sb zero,0(a5) addit0,t0,-2032 | addit0,t0,-2032 add sp,sp,t0 | add sp,sp,t0 jr ra | jr ra We still have the weird LUI 4096 based constant construction. I have a patch to avoid 4096 for certain ranges [-4096,-2049] or [2048,4094] (cribbed from llvm). e.g. 2064 = 2047 + 17 and we could potentially "spread" the 2 parts over 2 adds to SP, avoiding the LUI. However if a const costs more than 1 insn, gcc wants to force it in a register rather than split the add operation into 2 adds with the split constants. expand_binop expand_binop_directly avoid_expensive_constant /* X is to be used in mode MODE as operand OPN to BINOPTAB. If we're optimizing, and if the operand is a constant that costs more than 1 instruction, force the constant into a register and return that register. Return X otherwise. UNSIGNEDP says whether X is unsigned. */
[Bug target/105733] riscv: Poor codegen for large stack frames
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105733 Andrew Pinski changed: What|Removed |Added Severity|normal |enhancement
[Bug target/105733] riscv: Poor codegen for large stack frames
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105733 Andrew Pinski changed: What|Removed |Added CC||vineet.gupta at linux dot dev --- Comment #3 from Andrew Pinski --- *** Bug 106439 has been marked as a duplicate of this bug. ***
[Bug target/105733] riscv: Poor codegen for large stack frames
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105733 Jim Wilson changed: What|Removed |Added CC||wilson at gcc dot gnu.org --- Comment #2 from Jim Wilson --- This is a known problem that I have commented on before. For instance here https://github.com/riscv-collab/riscv-gcc/issues/193 Copying my reply to here since this is a better place for the bug report: This is a known problem. GCC first emits code using a frame pointer, and then tries to eliminate the frame pointer during register allocation. When the frame size is larger than the immediate field of an add, we need temporaries to compute frame offsets. Compiler optimizations like common subexpression elimination (cse) try to optimize the calculation of the temporaries. Then when we try to eliminate the frame pointer, we run into trouble because the cse opt changes interfere with the frame pointer elimination, and we end up with an ugly inefficient mess. This problem isn't RISC-V specific, but since RISC-V has 12-bit immediates and most everyone else has 16-bit immediates, we hit the problem sooner, and thus makes it much more visible for RISC-V. The example above is putting 12KB on the stack, which is larger than 12-bits of range (4KB), but well within 16-bits of range (64KB). I have a prototype of a patch to fix it by allowing >12-bit offsets for frame pointer references, and then fixing this when eliminating the frame pointer, but this can cause other problems, and needs more work to be usable. I have no idea when someone might try to finish the patch. End of inclusion from github. To improve my earlier answer, we have signed 12-bit immediates, so the trouble actually starts at 2048 bytes, and Jessica's example is larger than that. Reduce BUFSIZ to 2044 and you get a reasonable result. foo: li a5,4096 addia5,a5,-2048 addisp,sp,-2048 add a5,a5,a0 add a0,a5,sp li t0,4096 sb zero,-2048(a0) addit0,t0,-2048 add sp,sp,t0 jr ra There is still a bit of oddness here. We are adding 2048 to a0 and then using an address -2048(a0). I think that is more cse trouble. 2048 requires two instructions to load into a register which is likely confusing something somewhere.
[Bug target/105733] riscv: Poor codegen for large stack frames
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105733 Kito Cheng changed: What|Removed |Added CC||kito at gcc dot gnu.org Ever confirmed|0 |1 Status|UNCONFIRMED |NEW Last reconfirmed||2022-05-26 --- Comment #1 from Kito Cheng --- Confirmed, Thanks for reporting this!