[Bug rtl-optimization/79149] bad optimization on MIPS and ARM leading to excessive stack usage in some cases

2021-05-04 Thread rguenth at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=79149

Richard Biener  changed:

   What|Removed |Added

 Status|NEW |ASSIGNED

[Bug rtl-optimization/79149] bad optimization on MIPS and ARM leading to excessive stack usage in some cases

2017-06-16 Thread ramana at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=79149

Ramana Radhakrishnan  changed:

   What|Removed |Added

 Status|UNCONFIRMED |NEW
   Last reconfirmed||2017-06-16
 CC||ramana at gcc dot gnu.org
 Ever confirmed|0   |1

--- Comment #15 from Ramana Radhakrishnan  ---
Well given all the comments, confirmed then ... :)

[Bug rtl-optimization/79149] bad optimization on MIPS and ARM leading to excessive stack usage in some cases

2017-02-08 Thread wilco at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=79149

--- Comment #14 from wilco at gcc dot gnu.org ---
(In reply to Arnd Bergmann from comment #13)
> (In reply to wilco from comment #12)
> > Does wp512 use 64-bit types? If so, this is likely PR77308.
> 
> Yes, as seen in the attachment it uses lots of 64-bit operations. However,
> it sounds like PR77308 is ARM specific, but I see the same behavior
> on most other architectures, including 64-bit ones. Quoting the
> kernel patch I linked to, with stack frame sizes for the function
> depending on architecture, optimization flags and compiler version
> (only 4.9 and 7.0 here, there is little difference anyway)

The 64-bit expansion issues in PR77308 are ARM specific indeed, but it also
shows scheduling causes unnecessary high register pressure and extra spilling.

Your results indicate that -fsched-pressure should really be the default. And
given that it still results in more spilling compared to not scheduling, it
probably needs to be made less aggressive or compute pressure more accurately.

[Bug rtl-optimization/79149] bad optimization on MIPS and ARM leading to excessive stack usage in some cases

2017-02-07 Thread arnd at linaro dot org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=79149

--- Comment #13 from Arnd Bergmann  ---
(In reply to wilco from comment #12)
> Does wp512 use 64-bit types? If so, this is likely PR77308.

Yes, as seen in the attachment it uses lots of 64-bit operations. However, it
sounds like PR77308 is ARM specific, but I see the same behavior
on most other architectures, including 64-bit ones. Quoting the
kernel patch I linked to, with stack frame sizes for the function
depending on architecture, optimization flags and compiler version
(only 4.9 and 7.0 here, there is little difference anyway)

default: -O2
press:   -O2 -fsched-pressure
nopress: -O2 -fschedule-insns -fno-sched-pressure
nosched: -O2 -no-schedule-insns (disables sched-pressure)

default press   nopress nosched
alpha-linux-gcc-4.9.3   1136848 1136176
am33_2.0-linux-gcc-4.9.32100207621002104
arm-linux-gnueabi-gcc-4.9.3 848 848 1048352
cris-linux-gcc-4.9.3272 272 272 272
frv-linux-gcc-4.9.3 112810001128280
hppa64-linux-gcc-4.9.3  1128336 1128184
hppa-linux-gcc-4.9.3644 308 644 276
i386-linux-gcc-4.9.3352 352 352 352
m32r-linux-gcc-4.9.3720 656 720 268
microblaze-linux-gcc-4.9.3  1108604 1108256
mips64-linux-gcc-4.9.3  1328592 1328208
mips-linux-gcc-4.9.31096624 1096240
powerpc64-linux-gcc-4.9.3   1088432 1088160
powerpc-linux-gcc-4.9.3 1080584 1080224
s390-linux-gcc-4.9.3456 456 624 360
sh3-linux-gcc-4.9.3 292 292 292 292
sparc64-linux-gcc-4.9.3 992 240 992 208
sparc-linux-gcc-4.9.3   680 592 680 312
x86_64-linux-gcc-4.9.3  224 240 272 224
xtensa-linux-gcc-4.9.3  1152704 1152304

aarch64-linux-gcc-7.0.0 224 224 1104208
arm-linux-gnueabi-gcc-7.0.1 824 824 1048352
mips-linux-gcc-7.0.01120648 1120272
mips64-linux-gcc-7.0.0  1072608 1072224
x86_64-linux-gcc-7.0.1  240 240 304 240

[Bug rtl-optimization/79149] bad optimization on MIPS and ARM leading to excessive stack usage in some cases

2017-02-07 Thread wilco at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=79149

wilco at gcc dot gnu.org changed:

   What|Removed |Added

 CC||wilco at gcc dot gnu.org

--- Comment #12 from wilco at gcc dot gnu.org ---
Does wp512 use 64-bit types? If so, this is likely PR77308.

[Bug rtl-optimization/79149] bad optimization on MIPS and ARM leading to excessive stack usage in some cases

2017-02-06 Thread arnd at linaro dot org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=79149

--- Comment #11 from Arnd Bergmann  ---
I've submitted a workaround for the kernel now, addressing the stack usage
warning on MIPS, as well as performance on ARM and others:

https://patchwork.kernel.org/patch/9555183/

The patch has two different workarounds, as I found that adding
-Wno-schedule-insns gives us the best results on the whirlpool512 code for both
stack size and performance by a wide margin, while -fsched-pressure is better
on stack size for "serpent" across architectures and compiler versions

However, it is interesting to notice that arm-linux-gnueabi-gcc-7 produces
worse
results with the serpent source code in terms of stack size with the default
"-fsched-pressure" ("press") than older versions, and worse than
-fno-schedule-insns (nosched):

default press   nopress nosched
arm-linux-gnueabi-gcc-4.4.7 592 440
arm-linux-gnueabi-gcc-4.5.4 776 448 776 544
arm-linux-gnueabi-gcc-4.6.4 776 448 776 544
arm-linux-gnueabi-gcc-4.7.4 768 448 768 544
arm-linux-gnueabi-gcc-4.8.5 488 488 776 544
arm-linux-gnueabi-gcc-4.9.3 552 552 776 536
arm-linux-gnueabi-gcc-5.3.1 552 552 776 536
arm-linux-gnueabi-gcc-6.1.1 560 560 776 536
arm-linux-gnueabi-gcc-7.0.1 616 616 808 536

If we want to continue investigating this, I can try to construct a standalone
test case for performance testing on 'serpent' as well.

[Bug rtl-optimization/79149] bad optimization on MIPS and ARM leading to excessive stack usage in some cases

2017-01-22 Thread arnd at linaro dot org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=79149

--- Comment #10 from Arnd Bergmann  ---
(In reply to Arnd Bergmann from comment #9)

> "-fsched-pressure" on mips64 helps a lot
> ...
> On arm and aarch64, "-fsched-pressure" has no effect

I realized later that on arm and aarch64, -fsched-pressure is enabled by
default. Disabling it on these two makes it as bad as mips64, which has it
disabled by default.

[Bug rtl-optimization/79149] bad optimization on MIPS and ARM leading to excessive stack usage in some cases

2017-01-20 Thread arnd at linaro dot org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=79149

--- Comment #9 from Arnd Bergmann  ---
The warning seems to reliably disappear with -fno-schedule-insns, on every
combination I've tried it produces better (smaller stack and faster code) or
identical results to -fno-sched-critical-path-heuristic
-fno-sched-dep-count-heuristic for the test case, but the margins vary a lot
depending on gcc version and architecture. I tried various gcc versions on ARM,
as well as gcc-4.9.3 across many architectures.

"-fsched-pressure" on mips64 helps a lot (factor 2 in both frame size and
performance) but is still worse than "-fno-sched-critical-path-heuristic
-fno-sched-dep-count-heuristic" or "-fno-schedule-insns", which give factor 2.5
to 3.5 in performance and reduce the stack size to what it should be (220 to
272 bytes).
I tried gcc-4.9 and gcc-7.0 here, which show the same behavior, though "gcc-4.9
-fno-schedule-insns" is faster by a good margin (factor 1.1 to 1.2) compared to
the second fastest ("gcc-7 -fno-schedule-insns").

On arm and aarch64, "-fsched-pressure" has no effect on this test case that I
can see (have not compared the object files, but frame size and performance are
unchanged). "-fno-schedule-insns" is noticeably better, with frame size of 350
bytes on ARM compared to the default 824, and performance better by factor 1.6
compared to the default -O2. I also looked at the frame sizes on my older arm
compilers and saw the same on all 8 versions I have (4.5 through 7.0).

[Bug rtl-optimization/79149] bad optimization on MIPS and ARM leading to excessive stack usage in some cases

2017-01-20 Thread pinskia at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=79149

--- Comment #8 from Andrew Pinski  ---
The other thing to do is try with -fsched-pressure .  PowerPC turns on
-fsched-pressure by default (see PR 11488).

[Bug rtl-optimization/79149] bad optimization on MIPS and ARM leading to excessive stack usage in some cases

2017-01-20 Thread pinskia at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=79149

Andrew Pinski  changed:

   What|Removed |Added

 Depends on||11488

--- Comment #7 from Andrew Pinski  ---
See also PR11488.


Referenced Bugs:

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=11488
[Bug 11488] Pre-regalloc scheduling severely worsens performance

[Bug rtl-optimization/79149] bad optimization on MIPS and ARM leading to excessive stack usage in some cases

2017-01-20 Thread mkuvyrkov at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=79149

--- Comment #6 from Maxim Kuvyrkov  ---
Without looking at the code (it's 11pm) my guess is that 1st scheduling pass is
misbehaving in some way, most likely it is doing a lot of interblock moves. 
One of the big differences between x86 and ARM/MIPS scheduling is that x86
disables interblock scheduling.  Does -fno-schedule-insns fix the warnings on
ARM/MIPS?

[Bug rtl-optimization/79149] bad optimization on MIPS and ARM leading to excessive stack usage in some cases

2017-01-20 Thread arnd at linaro dot org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=79149

--- Comment #5 from Arnd Bergmann  ---
-fno-schedule-insns is comparable in stack frame size to
"-fno-sched-critical-path-heuristic -fno-sched-dep-count-heuristic" on all
architectures (give or take a few bytes), but actually produces much better
code.

In my simulated mips64 run, I see these numbers:

-O2:  49.0Mbit/s
-O2 -fno-sched-critical-path-heuristic -fno-sched-dep-count-heuristic: 109.7
Mbit/s
-O2 -fno-schedule-insns: 179.2 Mbit/s

The trend is the same on arm an aarch64 for emulated runs, and I confirmed
earlier that the results on real hardware are comparable to what we get in
qemu.

[Bug rtl-optimization/79149] bad optimization on MIPS and ARM leading to excessive stack usage in some cases

2017-01-20 Thread pinskia at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=79149

--- Comment #4 from Andrew Pinski  ---
Can you try -fno-sched-insns?

[Bug rtl-optimization/79149] bad optimization on MIPS and ARM leading to excessive stack usage in some cases

2017-01-20 Thread pinskia at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=79149

--- Comment #3 from Andrew Pinski  ---
There is another bug referring to the pre-register allocation schedule messing
up. X86 does not have this scheduler turNed on by default.

[Bug rtl-optimization/79149] bad optimization on MIPS and ARM leading to excessive stack usage in some cases

2017-01-20 Thread arnd at linaro dot org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=79149

Arnd Bergmann  changed:

   What|Removed |Added

  Attachment #40546|0   |1
is obsolete||

--- Comment #2 from Arnd Bergmann  ---
Created attachment 40554
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=40554&action=edit
wp512 reference source code, standalone version

After checking a bit more, I found that the reference source code
implementation does behave exactly like the in-kernel version after all, and I
was able to do some performance timing (using qemu-user) on it as well.

Building Whirlpool.c using "mips64el-linux-gnuabi64-gcc-5 -O2
-Wframe-larger-than=100 Whirlpool.c -o Whirlpool-mips-smallstack
-fno-sched-critical-path-heuristic -fno-sched-dep-count-heuristic" in this case
uses 256 bytes of stack in the processBuffer and run for 87 seconds doing
1000 iterations in qemu, while the version without
"fno-sched-critical-path-heuristic -fno-sched-dep-count-heuristic" takes 230
seconds and needs 1520 bytes of stack. The extra time is apparently spent
spilling registers to the stack.

The same test with arm32 shows a less significant version of the same behavior,
with the stack shrinking from 832 to 352 bytes, and the time improving from 301
seconds to 217 seconds.

Obviously it would be helpful to do the same tests on actual hardware, as
benchmarking in an emulated machine can be very misleading.

[Bug rtl-optimization/79149] bad optimization on MIPS and ARM leading to excessive stack usage in some cases

2017-01-20 Thread arnd at linaro dot org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=79149

--- Comment #1 from Arnd Bergmann  ---
Additional information:

I see the same behavior to a varying degree on most other architectures (but
notably not x86) using the preprocessed source from the MIPS kernel
configuration, these are always one run with -fno-sched-critical-path-heuristic
-fno-sched-dep-count-heuristic and one run without:

=== /home/arnd/cross-gcc/bin/aarch64-linux-gcc-5.2.1 ===
../../crypto/wp512.c:987:1: warning: the frame size of 224 bytes is larger than
100 bytes [-Wframe-larger-than=]
../../crypto/wp512.c:987:1: warning: the frame size of 368 bytes is larger than
100 bytes [-Wframe-larger-than=]
=== /home/arnd/cross-gcc/bin/alpha-linux-gcc-4.9.3 ===
../../crypto/wp512.c:987:1: warning: the frame size of 240 bytes is larger than
100 bytes [-Wframe-larger-than=]
../../crypto/wp512.c:987:1: warning: the frame size of 1136 bytes is larger
than 100 bytes [-Wframe-larger-than=]
=== /home/arnd/cross-gcc/bin/am33_2.0-linux-gcc-4.9.3 ===
../../crypto/wp512.c:987:1: warning: the frame size of 2092 bytes is larger
than 100 bytes [-Wframe-larger-than=]
../../crypto/wp512.c:987:1: warning: the frame size of 2084 bytes is larger
than 100 bytes [-Wframe-larger-than=]
=== /home/arnd/cross-gcc/bin/am33_2.0-linux-gcc-5.2.1 ===
../../crypto/wp512.c:987:1: warning: the frame size of 2084 bytes is larger
than 100 bytes [-Wframe-larger-than=]
../../crypto/wp512.c:987:1: warning: the frame size of 2208 bytes is larger
than 100 bytes [-Wframe-larger-than=]
=== /home/arnd/cross-gcc/bin/cris-linux-gcc-4.9.3 ===
../../crypto/wp512.c:987:1: warning: the frame size of 272 bytes is larger than
100 bytes [-Wframe-larger-than=]
../../crypto/wp512.c:987:1: warning: the frame size of 272 bytes is larger than
100 bytes [-Wframe-larger-than=]
=== /home/arnd/cross-gcc/bin/frv-linux-gcc-4.9.3 ===
../../crypto/wp512.c:987:1: warning: the frame size of 296 bytes is larger than
100 bytes [-Wframe-larger-than=]
../../crypto/wp512.c:987:1: warning: the frame size of 1128 bytes is larger
than 100 bytes [-Wframe-larger-than=]
=== /home/arnd/cross-gcc/bin/hppa64-linux-gcc-4.9.3 ===
../../crypto/wp512.c:987:1: warning: the frame size of 192 bytes is larger than
100 bytes [-Wframe-larger-than=]
../../crypto/wp512.c:987:1: warning: the frame size of 1128 bytes is larger
than 100 bytes [-Wframe-larger-than=]
=== /home/arnd/cross-gcc/bin/hppa-linux-gcc-4.9.3 ===
../../crypto/wp512.c:987:1: warning: the frame size of 276 bytes is larger than
100 bytes [-Wframe-larger-than=]
../../crypto/wp512.c:987:1: warning: the frame size of 644 bytes is larger than
100 bytes [-Wframe-larger-than=]
=== /home/arnd/cross-gcc/bin/i386-linux-gcc-4.9.3 ===
../../crypto/wp512.c:987:1: warning: the frame size of 352 bytes is larger than
100 bytes [-Wframe-larger-than=]
../../crypto/wp512.c:987:1: warning: the frame size of 352 bytes is larger than
100 bytes [-Wframe-larger-than=]
=== /home/arnd/cross-gcc/bin/m32r-linux-gcc-4.9.3 ===
../../crypto/wp512.c:987:1: warning: the frame size of 332 bytes is larger than
100 bytes [-Wframe-larger-than=]
../../crypto/wp512.c:987:1: warning: the frame size of 716 bytes is larger than
100 bytes [-Wframe-larger-than=]
=== /home/arnd/cross-gcc/bin/m68k-linux-gcc-6.0.0 ===
../../crypto/wp512.c:987:1: warning: the frame size of 364 bytes is larger than
100 bytes [-Wframe-larger-than=]
../../crypto/wp512.c:987:1: warning: the frame size of 364 bytes is larger than
100 bytes [-Wframe-larger-than=]
=== /home/arnd/cross-gcc/bin/microblaze-linux-gcc-4.9.3 ===
../../crypto/wp512.c:987:1: warning: the frame size of 280 bytes is larger than
100 bytes [-Wframe-larger-than=]
../../crypto/wp512.c:987:1: warning: the frame size of 1108 bytes is larger
than 100 bytes [-Wframe-larger-than=]
=== /home/arnd/cross-gcc/bin/mips64-linux-gcc-4.9.3 ===
../../crypto/wp512.c:987:1: warning: the frame size of 208 bytes is larger than
100 bytes [-Wframe-larger-than=]
../../crypto/wp512.c:987:1: warning: the frame size of 1328 bytes is larger
than 100 bytes [-Wframe-larger-than=]
=== /home/arnd/cross-gcc/bin/mips-linux-gcc-4.9.3 ===
../../crypto/wp512.c:987:1: warning: the frame size of 272 bytes is larger than
100 bytes [-Wframe-larger-than=]
../../crypto/wp512.c:987:1: warning: the frame size of 1096 bytes is larger
than 100 bytes [-Wframe-larger-than=]
=== /home/arnd/cross-gcc/bin/mips-linux-gcc-7.0.0 ===
../../crypto/wp512.c:987:1: warning: the frame size of 304 bytes is larger than
100 bytes [-Wframe-larger-than=]
../../crypto/wp512.c:987:1: warning: the frame size of 1128 bytes is larger
than 100 bytes [-Wframe-larger-than=]
=== /home/arnd/cross-gcc/bin/powerpc64-linux-gcc-4.9.3 ===
../../crypto/wp512.c:987:1: warning: the frame size of 144 bytes is larger than
100 bytes [-Wframe-larger-than=]
../../crypto/wp512.c:987:1: warning: the frame size of 1088 bytes is larger
than 100 bytes [-Wframe-larger-than=]
=== /home/arnd/cross-gcc/bin/powerpc-linux-gcc-4.9.3 ===
../../crypto/wp512.c:987:1: wa