[Bug rtl-optimization/79149] bad optimization on MIPS and ARM leading to excessive stack usage in some cases
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=79149

Richard Biener changed:

           What    |Removed     |Added
             Status|NEW         |ASSIGNED
[Bug rtl-optimization/79149] bad optimization on MIPS and ARM leading to excessive stack usage in some cases
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=79149

Ramana Radhakrishnan changed:

           What    |Removed     |Added
             Status|UNCONFIRMED |NEW
   Last reconfirmed|            |2017-06-16
                 CC|            |ramana at gcc dot gnu.org
     Ever confirmed|0           |1

--- Comment #15 from Ramana Radhakrishnan ---
Well given all the comments, confirmed then ... :)
[Bug rtl-optimization/79149] bad optimization on MIPS and ARM leading to excessive stack usage in some cases
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=79149

--- Comment #14 from wilco at gcc dot gnu.org ---
(In reply to Arnd Bergmann from comment #13)
> (In reply to wilco from comment #12)
> > Does wp512 use 64-bit types? If so, this is likely PR77308.
>
> Yes, as seen in the attachment it uses lots of 64-bit operations. However,
> it sounds like PR77308 is ARM specific, but I see the same behavior
> on most other architectures, including 64-bit ones. Quoting the
> kernel patch I linked to, with stack frame sizes for the function
> depending on architecture, optimization flags and compiler version
> (only 4.9 and 7.0 here, there is little difference anyway)

The 64-bit expansion issues in PR77308 are indeed ARM specific, but it also shows that scheduling causes unnecessarily high register pressure and extra spilling. Your results indicate that -fsched-pressure should really be the default. And given that it still results in more spilling compared to not scheduling, it probably needs to be made less aggressive or compute pressure more accurately.
[Bug rtl-optimization/79149] bad optimization on MIPS and ARM leading to excessive stack usage in some cases
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=79149

--- Comment #13 from Arnd Bergmann ---
(In reply to wilco from comment #12)
> Does wp512 use 64-bit types? If so, this is likely PR77308.

Yes, as seen in the attachment it uses lots of 64-bit operations. However, it sounds like PR77308 is ARM specific, but I see the same behavior on most other architectures, including 64-bit ones. Quoting the kernel patch I linked to, with stack frame sizes (in bytes) for the function depending on architecture, optimization flags and compiler version (only 4.9 and 7.0 here, there is little difference anyway):

default: -O2
press:   -O2 -fsched-pressure
nopress: -O2 -fschedule-insns -fno-sched-pressure
nosched: -O2 -fno-schedule-insns (disables sched-pressure)

                              default  press  nopress  nosched
alpha-linux-gcc-4.9.3            1136    848     1136      176
am33_2.0-linux-gcc-4.9.3         2100   2076     2100     2104
arm-linux-gnueabi-gcc-4.9.3       848    848     1048      352
cris-linux-gcc-4.9.3              272    272      272      272
frv-linux-gcc-4.9.3              1128   1000     1128      280
hppa64-linux-gcc-4.9.3           1128    336     1128      184
hppa-linux-gcc-4.9.3              644    308      644      276
i386-linux-gcc-4.9.3              352    352      352      352
m32r-linux-gcc-4.9.3              720    656      720      268
microblaze-linux-gcc-4.9.3       1108    604     1108      256
mips64-linux-gcc-4.9.3           1328    592     1328      208
mips-linux-gcc-4.9.3             1096    624     1096      240
powerpc64-linux-gcc-4.9.3        1088    432     1088      160
powerpc-linux-gcc-4.9.3          1080    584     1080      224
s390-linux-gcc-4.9.3              456    456      624      360
sh3-linux-gcc-4.9.3               292    292      292      292
sparc64-linux-gcc-4.9.3           992    240      992      208
sparc-linux-gcc-4.9.3             680    592      680      312
x86_64-linux-gcc-4.9.3            224    240      272      224
xtensa-linux-gcc-4.9.3           1152    704     1152      304
aarch64-linux-gcc-7.0.0           224    224     1104      208
arm-linux-gnueabi-gcc-7.0.1       824    824     1048      352
mips-linux-gcc-7.0.0             1120    648     1120      272
mips64-linux-gcc-7.0.0           1072    608     1072      224
x86_64-linux-gcc-7.0.1            240    240      304      240
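
For anyone wanting to repeat the comparison, the four configurations named above can be turned into compiler invocations with a small helper. This is only a sketch: the function name `build_commands`, the output file names, and the `-Wframe-larger-than=100` limit are assumptions chosen for illustration (the limit matches the one used earlier in this report), and the script only constructs the command lines without running any compiler.

```python
# Flag sets as named in the table: default, press, nopress, nosched.
FLAG_SETS = {
    "default": ["-O2"],
    "press": ["-O2", "-fsched-pressure"],
    "nopress": ["-O2", "-fschedule-insns", "-fno-sched-pressure"],
    "nosched": ["-O2", "-fno-schedule-insns"],
}


def build_commands(compiler, source, limit=100):
    """Return one argv list per configuration (hypothetical helper).

    Each command compiles `source` to an object file and asks GCC to warn
    when a function's frame exceeds `limit` bytes, so the frame size shows
    up in the diagnostics.
    """
    commands = {}
    for name, flags in FLAG_SETS.items():
        commands[name] = [
            compiler,
            *flags,
            f"-Wframe-larger-than={limit}",
            "-c",
            source,
            "-o",
            f"{name}.o",
        ]
    return commands


if __name__ == "__main__":
    # Example compiler name; substitute whichever cross compiler is installed.
    for name, argv in build_commands("mips64-linux-gcc", "wp512.c").items():
        print(name, " ".join(argv))
```

Each argv list can be passed to `subprocess.run` on a machine that has the corresponding cross compiler installed.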
[Bug rtl-optimization/79149] bad optimization on MIPS and ARM leading to excessive stack usage in some cases
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=79149

wilco at gcc dot gnu.org changed:

           What    |Removed     |Added
                 CC|            |wilco at gcc dot gnu.org

--- Comment #12 from wilco at gcc dot gnu.org ---
Does wp512 use 64-bit types? If so, this is likely PR77308.
[Bug rtl-optimization/79149] bad optimization on MIPS and ARM leading to excessive stack usage in some cases
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=79149

--- Comment #11 from Arnd Bergmann ---
I've submitted a workaround for the kernel now, addressing the stack usage warning on MIPS, as well as performance on ARM and others:

https://patchwork.kernel.org/patch/9555183/

The patch has two different workarounds, as I found that adding -fno-schedule-insns gives us the best results on the whirlpool512 code for both stack size and performance by a wide margin, while -fsched-pressure is better on stack size for "serpent" across architectures and compiler versions.

However, it is interesting to notice that arm-linux-gnueabi-gcc-7 produces worse results with the serpent source code in terms of stack size with the default "-fsched-pressure" ("press") than older versions, and worse than -fno-schedule-insns ("nosched"):

                              default  press  nopress  nosched
arm-linux-gnueabi-gcc-4.4.7       592    440
arm-linux-gnueabi-gcc-4.5.4       776    448      776      544
arm-linux-gnueabi-gcc-4.6.4       776    448      776      544
arm-linux-gnueabi-gcc-4.7.4       768    448      768      544
arm-linux-gnueabi-gcc-4.8.5       488    488      776      544
arm-linux-gnueabi-gcc-4.9.3       552    552      776      536
arm-linux-gnueabi-gcc-5.3.1       552    552      776      536
arm-linux-gnueabi-gcc-6.1.1       560    560      776      536
arm-linux-gnueabi-gcc-7.0.1       616    616      808      536

If we want to continue investigating this, I can try to construct a standalone test case for performance testing on 'serpent' as well.
[Bug rtl-optimization/79149] bad optimization on MIPS and ARM leading to excessive stack usage in some cases
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=79149 --- Comment #10 from Arnd Bergmann --- (In reply to Arnd Bergmann from comment #9) > "-fsched-pressure" on mips64 helps a lot > ... > On arm and aarch64, "-fsched-pressure" has no effect I realized later that on arm and aarch64, -fsched-pressure is enabled by default. Disabling it on these two makes it as bad as mips64, which has it disabled by default.
[Bug rtl-optimization/79149] bad optimization on MIPS and ARM leading to excessive stack usage in some cases
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=79149

--- Comment #9 from Arnd Bergmann ---
The warning seems to reliably disappear with -fno-schedule-insns. On every combination I've tried, it produces better (smaller stack and faster code) or identical results compared to -fno-sched-critical-path-heuristic -fno-sched-dep-count-heuristic for the test case, but the margins vary a lot depending on gcc version and architecture. I tried various gcc versions on ARM, as well as gcc-4.9.3 across many architectures.

"-fsched-pressure" on mips64 helps a lot (factor 2 in both frame size and performance) but is still worse than "-fno-sched-critical-path-heuristic -fno-sched-dep-count-heuristic" or "-fno-schedule-insns", which give factor 2.5 to 3.5 in performance and reduce the stack size to what it should be (220 to 272 bytes). I tried gcc-4.9 and gcc-7.0 here, which show the same behavior, though "gcc-4.9 -fno-schedule-insns" is faster by a good margin (factor 1.1 to 1.2) compared to the second fastest ("gcc-7 -fno-schedule-insns").

On arm and aarch64, "-fsched-pressure" has no effect on this test case that I can see (I have not compared the object files, but frame size and performance are unchanged). "-fno-schedule-insns" is noticeably better, with a frame size of 350 bytes on ARM compared to the default 824, and performance better by factor 1.6 compared to the default -O2. I also looked at the frame sizes on my older arm compilers and saw the same on all 8 versions I have (4.5 through 7.0).
[Bug rtl-optimization/79149] bad optimization on MIPS and ARM leading to excessive stack usage in some cases
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=79149 --- Comment #8 from Andrew Pinski --- The other thing to do is try with -fsched-pressure . PowerPC turns on -fsched-pressure by default (see PR 11488).
[Bug rtl-optimization/79149] bad optimization on MIPS and ARM leading to excessive stack usage in some cases
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=79149

Andrew Pinski changed:

           What    |Removed     |Added
         Depends on|            |11488

--- Comment #7 from Andrew Pinski ---
See also PR11488.

Referenced Bugs:

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=11488
[Bug 11488] Pre-regalloc scheduling severely worsens performance
[Bug rtl-optimization/79149] bad optimization on MIPS and ARM leading to excessive stack usage in some cases
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=79149 --- Comment #6 from Maxim Kuvyrkov --- Without looking at the code (it's 11pm) my guess is that 1st scheduling pass is misbehaving in some way, most likely it is doing a lot of interblock moves. One of the big differences between x86 and ARM/MIPS scheduling is that x86 disables interblock scheduling. Does -fno-schedule-insns fix the warnings on ARM/MIPS?
[Bug rtl-optimization/79149] bad optimization on MIPS and ARM leading to excessive stack usage in some cases
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=79149

--- Comment #5 from Arnd Bergmann ---
-fno-schedule-insns is comparable in stack frame size to "-fno-sched-critical-path-heuristic -fno-sched-dep-count-heuristic" on all architectures (give or take a few bytes), but actually produces much better code. In my simulated mips64 run, I see these numbers:

-O2: 49.0 Mbit/s
-O2 -fno-sched-critical-path-heuristic -fno-sched-dep-count-heuristic: 109.7 Mbit/s
-O2 -fno-schedule-insns: 179.2 Mbit/s

The trend is the same on arm and aarch64 for emulated runs, and I confirmed earlier that the results on real hardware are comparable to what we get in qemu.
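
To put the three throughput figures on a common scale, the following sketch divides each by the plain -O2 result. The numbers are the ones quoted in this comment; the variable names are illustrative only.

```python
# Simulated mips64 throughput in Mbit/s, as reported in comment #5.
throughput = {
    "-O2": 49.0,
    "-O2 -fno-sched-critical-path-heuristic -fno-sched-dep-count-heuristic": 109.7,
    "-O2 -fno-schedule-insns": 179.2,
}

baseline = throughput["-O2"]

# Speedup of each configuration relative to plain -O2, rounded to two decimals.
speedups = {flags: round(rate / baseline, 2) for flags, rate in throughput.items()}

for flags, factor in speedups.items():
    print(f"{factor:5.2f}x  {flags}")
```

On these figures, disabling the two heuristics gives roughly 2.2 times the baseline throughput, and disabling the first scheduling pass entirely gives roughly 3.7 times.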
[Bug rtl-optimization/79149] bad optimization on MIPS and ARM leading to excessive stack usage in some cases
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=79149 --- Comment #4 from Andrew Pinski --- Can you try -fno-schedule-insns?
[Bug rtl-optimization/79149] bad optimization on MIPS and ARM leading to excessive stack usage in some cases
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=79149 --- Comment #3 from Andrew Pinski --- There is another bug referring to the pre-register-allocation scheduler messing up. x86 does not have this scheduler turned on by default.
[Bug rtl-optimization/79149] bad optimization on MIPS and ARM leading to excessive stack usage in some cases
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=79149

Arnd Bergmann changed:

           What    |Removed     |Added
  Attachment #40546|0           |1
        is obsolete|            |

--- Comment #2 from Arnd Bergmann ---
Created attachment 40554
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=40554&action=edit
wp512 reference source code, standalone version

After checking a bit more, I found that the reference source code implementation does behave exactly like the in-kernel version after all, and I was able to do some performance timing (using qemu-user) on it as well.

Building Whirlpool.c using

mips64el-linux-gnuabi64-gcc-5 -O2 -Wframe-larger-than=100 Whirlpool.c -o Whirlpool-mips-smallstack -fno-sched-critical-path-heuristic -fno-sched-dep-count-heuristic

in this case uses 256 bytes of stack in processBuffer and runs for 87 seconds doing 1000 iterations in qemu, while the version without "-fno-sched-critical-path-heuristic -fno-sched-dep-count-heuristic" takes 230 seconds and needs 1520 bytes of stack. The extra time is apparently spent spilling registers to the stack.

The same test with arm32 shows a less significant version of the same behavior, with the stack shrinking from 832 to 352 bytes, and the time improving from 301 seconds to 217 seconds.

Obviously it would be helpful to do the same tests on actual hardware, as benchmarking in an emulated machine can be very misleading.
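
When comparing many compilers and flag combinations, it can help to extract the frame sizes from the -Wframe-larger-than diagnostics automatically rather than reading them by eye. The sketch below assumes the warning text has the form shown elsewhere in this report ("warning: the frame size of N bytes is larger than M bytes"); the function name `frame_sizes` and the regular expression are illustrative, not part of GCC or of the kernel patch.

```python
import re

# Matches lines such as:
#   ../../crypto/wp512.c:987:1: warning: the frame size of 224 bytes is
#   larger than 100 bytes [-Wframe-larger-than=]
# The column number after the line number is treated as optional.
FRAME_RE = re.compile(
    r"(?P<file>[^:\s]+):(?P<line>\d+):(?:\d+:)?\s*"
    r"warning: the frame size of (?P<size>\d+) bytes "
    r"is larger than (?P<limit>\d+) bytes"
)


def frame_sizes(log):
    """Return (file, line, frame_size_in_bytes) for each warning in `log`."""
    return [
        (m.group("file"), int(m.group("line")), int(m.group("size")))
        for m in FRAME_RE.finditer(log)
    ]


if __name__ == "__main__":
    sample = (
        "../../crypto/wp512.c:987:1: warning: the frame size of 224 bytes "
        "is larger than 100 bytes [-Wframe-larger-than=]\n"
        "../../crypto/wp512.c:987:1: warning: the frame size of 368 bytes "
        "is larger than 100 bytes [-Wframe-larger-than=]\n"
    )
    for entry in frame_sizes(sample):
        print(entry)
```

The compiler's standard error output can be fed to `frame_sizes` directly, which makes it straightforward to tabulate results per compiler and flag set.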
[Bug rtl-optimization/79149] bad optimization on MIPS and ARM leading to excessive stack usage in some cases
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=79149

--- Comment #1 from Arnd Bergmann ---
Additional information: I see the same behavior to a varying degree on most other architectures (but notably not x86) using the preprocessed source from the MIPS kernel configuration, these are always one run with -fno-sched-critical-path-heuristic -fno-sched-dep-count-heuristic and one run without:

=== /home/arnd/cross-gcc/bin/aarch64-linux-gcc-5.2.1 ===
../../crypto/wp512.c:987:1: warning: the frame size of 224 bytes is larger than 100 bytes [-Wframe-larger-than=]
../../crypto/wp512.c:987:1: warning: the frame size of 368 bytes is larger than 100 bytes [-Wframe-larger-than=]
=== /home/arnd/cross-gcc/bin/alpha-linux-gcc-4.9.3 ===
../../crypto/wp512.c:987:1: warning: the frame size of 240 bytes is larger than 100 bytes [-Wframe-larger-than=]
../../crypto/wp512.c:987:1: warning: the frame size of 1136 bytes is larger than 100 bytes [-Wframe-larger-than=]
=== /home/arnd/cross-gcc/bin/am33_2.0-linux-gcc-4.9.3 ===
../../crypto/wp512.c:987:1: warning: the frame size of 2092 bytes is larger than 100 bytes [-Wframe-larger-than=]
../../crypto/wp512.c:987:1: warning: the frame size of 2084 bytes is larger than 100 bytes [-Wframe-larger-than=]
=== /home/arnd/cross-gcc/bin/am33_2.0-linux-gcc-5.2.1 ===
../../crypto/wp512.c:987:1: warning: the frame size of 2084 bytes is larger than 100 bytes [-Wframe-larger-than=]
../../crypto/wp512.c:987:1: warning: the frame size of 2208 bytes is larger than 100 bytes [-Wframe-larger-than=]
=== /home/arnd/cross-gcc/bin/cris-linux-gcc-4.9.3 ===
../../crypto/wp512.c:987:1: warning: the frame size of 272 bytes is larger than 100 bytes [-Wframe-larger-than=]
../../crypto/wp512.c:987:1: warning: the frame size of 272 bytes is larger than 100 bytes [-Wframe-larger-than=]
=== /home/arnd/cross-gcc/bin/frv-linux-gcc-4.9.3 ===
../../crypto/wp512.c:987:1: warning: the frame size of 296 bytes is larger than 100 bytes [-Wframe-larger-than=]
../../crypto/wp512.c:987:1: warning: the frame size of 1128 bytes is larger than 100 bytes [-Wframe-larger-than=]
=== /home/arnd/cross-gcc/bin/hppa64-linux-gcc-4.9.3 ===
../../crypto/wp512.c:987:1: warning: the frame size of 192 bytes is larger than 100 bytes [-Wframe-larger-than=]
../../crypto/wp512.c:987:1: warning: the frame size of 1128 bytes is larger than 100 bytes [-Wframe-larger-than=]
=== /home/arnd/cross-gcc/bin/hppa-linux-gcc-4.9.3 ===
../../crypto/wp512.c:987:1: warning: the frame size of 276 bytes is larger than 100 bytes [-Wframe-larger-than=]
../../crypto/wp512.c:987:1: warning: the frame size of 644 bytes is larger than 100 bytes [-Wframe-larger-than=]
=== /home/arnd/cross-gcc/bin/i386-linux-gcc-4.9.3 ===
../../crypto/wp512.c:987:1: warning: the frame size of 352 bytes is larger than 100 bytes [-Wframe-larger-than=]
../../crypto/wp512.c:987:1: warning: the frame size of 352 bytes is larger than 100 bytes [-Wframe-larger-than=]
=== /home/arnd/cross-gcc/bin/m32r-linux-gcc-4.9.3 ===
../../crypto/wp512.c:987:1: warning: the frame size of 332 bytes is larger than 100 bytes [-Wframe-larger-than=]
../../crypto/wp512.c:987:1: warning: the frame size of 716 bytes is larger than 100 bytes [-Wframe-larger-than=]
=== /home/arnd/cross-gcc/bin/m68k-linux-gcc-6.0.0 ===
../../crypto/wp512.c:987:1: warning: the frame size of 364 bytes is larger than 100 bytes [-Wframe-larger-than=]
../../crypto/wp512.c:987:1: warning: the frame size of 364 bytes is larger than 100 bytes [-Wframe-larger-than=]
=== /home/arnd/cross-gcc/bin/microblaze-linux-gcc-4.9.3 ===
../../crypto/wp512.c:987:1: warning: the frame size of 280 bytes is larger than 100 bytes [-Wframe-larger-than=]
../../crypto/wp512.c:987:1: warning: the frame size of 1108 bytes is larger than 100 bytes [-Wframe-larger-than=]
=== /home/arnd/cross-gcc/bin/mips64-linux-gcc-4.9.3 ===
../../crypto/wp512.c:987:1: warning: the frame size of 208 bytes is larger than 100 bytes [-Wframe-larger-than=]
../../crypto/wp512.c:987:1: warning: the frame size of 1328 bytes is larger than 100 bytes [-Wframe-larger-than=]
=== /home/arnd/cross-gcc/bin/mips-linux-gcc-4.9.3 ===
../../crypto/wp512.c:987:1: warning: the frame size of 272 bytes is larger than 100 bytes [-Wframe-larger-than=]
../../crypto/wp512.c:987:1: warning: the frame size of 1096 bytes is larger than 100 bytes [-Wframe-larger-than=]
=== /home/arnd/cross-gcc/bin/mips-linux-gcc-7.0.0 ===
../../crypto/wp512.c:987:1: warning: the frame size of 304 bytes is larger than 100 bytes [-Wframe-larger-than=]
../../crypto/wp512.c:987:1: warning: the frame size of 1128 bytes is larger than 100 bytes [-Wframe-larger-than=]
=== /home/arnd/cross-gcc/bin/powerpc64-linux-gcc-4.9.3 ===
../../crypto/wp512.c:987:1: warning: the frame size of 144 bytes is larger than 100 bytes [-Wframe-larger-than=]
../../crypto/wp512.c:987:1: warning: the frame size of 1088 bytes is larger than 100 bytes [-Wframe-larger-than=]
=== /home/arnd/cross-gcc/bin/powerpc-linux-gcc-4.9.3 ===
../../crypto/wp512.c:987:1: wa