[Bug middle-end/87869] Unrolled loop leads to excessive code bloat with -Os on ARC EM.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=87869

--- Comment #4 from Nick Bowler ---
(In reply to Richard Biener from comment #3)
> I think a better target for optimizing would be the RTL side, [...]
> I'm sure arc can store to a register address as well.

Yes, if the shortest possible store encoding were used on ARC instead of the
longest possible encoding, then the unrolled loop would not be nearly as
painful, e.g.,

     0:   40c3 f000   mov_s   r0,0xf000
     6:   732c        mov_s   r1,3
     8:   a020        st_s    r1,[r0,0]
     a:   a021        st_s    r1,[r0,0x4]
     c:   a022        st_s    r1,[r0,0x8]
     e:   a023        st_s    r1,[r0,0xc]
    10:   a024        st_s    r1,[r0,0x10]
    12:   a025        st_s    r1,[r0,0x14]
    14:   a026        st_s    r1,[r0,0x18]
    16:   a027        st_s    r1,[r0,0x1c]
    18:   a028        st_s    r1,[r0,0x20]
    1a:   a029        st_s    r1,[r0,0x24]
    1c:   a02a        st_s    r1,[r0,0x28]
    1e:   7ee0        j_s     [blink]
[Bug c/87888] New: Behaviour of __builtin_arc_sr differs from its description in the manual.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=87888

Bug ID: 87888
Summary: Behaviour of __builtin_arc_sr differs from its description in the manual.
Product: gcc
Version: 8.2.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: c
Assignee: unassigned at gcc dot gnu.org
Reporter: nbowler at draconx dot ca
Target Milestone: ---

I happened to notice what appears to be an error in the GCC manual,
§6.59.4 ARC Built-in Functions[1]:

    Built-in Function: void __builtin_arc_sr (unsigned int auxr, unsigned int val)

    The first argument, /auxv/, is the address of an auxiliary register,
    the second argument, /val/, is a compile time constant to be written
    to the register. Generates:

        sr auxr, [val]

This function indeed generates the sr instruction with the parameters exactly
as described, e.g.,

    __builtin_arc_sr(0x123, 0x456)

generates

    sr 0x123, [0x456]

However, the description of those parameters is incorrect: the first operand
of sr is the value to be written, and the second is the address, so the
previous example stores the value 0x123 to aux address 0x456.

Also, I think the note about val being a compile-time constant is an error as
well: the sr instruction does not require constants, and gcc happily accepts
non-constant values as arguments to this builtin.

I suggest the documentation of this builtin should be changed to match its
actual behaviour, perhaps something like:

    Built-in Function: void __builtin_arc_sr (unsigned int val, unsigned int auxr)

    Stores /val/ to the auxiliary register with address /auxr/. Generates:

        sr val, [auxr]

[1] https://gcc.gnu.org/onlinedocs/gcc-8.2.0/gcc/ARC-Built-in-Functions.html#index-_005f_005fbuiltin_005farc_005fsr
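To make the operand order harder to get wrong, one option is a thin wrapper that names its parameters explicitly. The following is only a sketch under stated assumptions: aux_write, sim_aux and SIM_AUX_REGS are hypothetical names invented here, the __arc__ macro is assumed to be predefined by the ARC port, and the argument order passed to __builtin_arc_sr follows the behaviour reported above for GCC 8.2.0 (value first, address second), not the manual's description. The non-ARC branch is a host-side stand-in so the wrapper can be exercised without ARC hardware.

```c
#include <assert.h>
#include <stdint.h>

#if defined(__arc__)   /* assumption: predefined when targeting ARC */

/* Hypothetical wrapper: store val to the auxiliary register at address auxr.
   Per the report above, the builtin emits "sr <arg1>, [<arg2>]", so the
   value must be passed first and the address second. */
static inline void aux_write(uint32_t auxr, uint32_t val)
{
    __builtin_arc_sr(val, auxr);
}

#else

/* Host-side stand-in (an assumption for illustration only): model a small
   auxiliary register space as an array so the wrapper can be tested. */
enum { SIM_AUX_REGS = 0x1000 };
static uint32_t sim_aux[SIM_AUX_REGS];

static inline void aux_write(uint32_t auxr, uint32_t val)
{
    assert(auxr < SIM_AUX_REGS);   /* stay inside the simulated space */
    sim_aux[auxr] = val;
}

#endif
```

With such a wrapper, aux_write(0x456, 0x123) would correspond to the report's example of storing the value 0x123 to aux address 0x456.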
[Bug middle-end/87869] Unrolled loop leads to excessive code bloat with -Os on ARC EM.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=87869

--- Comment #5 from Nick Bowler ---
Looking at some of my other code output, it looks like these long encodings
are emitted a lot more often than they are actually needed. If shorter store
encodings were used more generally, then I'd expect to see significant size
improvements not just in the test case under discussion here but in that
other code as well.
[Bug c/87869] New: Unrolled loop leads to excessive code bloat with -Os on ARC EM.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=87869

Bug ID: 87869
Summary: Unrolled loop leads to excessive code bloat with -Os on ARC EM.
Product: gcc
Version: 8.2.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: c
Assignee: unassigned at gcc dot gnu.org
Reporter: nbowler at draconx dot ca
Target Milestone: ---

Consider the following code:

    % cat >test.c <<'EOF'
    #include <stdint.h>

    void do_stuff_12iter(void)
    {
        volatile uint32_t *blah = (void *)0xf000;
        unsigned i;

        for (i = 0; i < 12; i++) {
            blah[i] = 3;
        }
    }

    void do_stuff_11iter(void)
    {
        volatile uint32_t *blah = (void *)0xf000;
        unsigned i;

        for (i = 0; i < 11; i++) {
            blah[i] = 3;
        }
    }
    EOF

When I compile this with gcc:

    % arc-unknown-elf-gcc -v
    Using built-in specs.
    COLLECT_GCC=/usr/x86_64-pc-linux-gnu/arc-unknown-elf/gcc-bin/8.2.0/arc-unknown-elf-gcc
    COLLECT_LTO_WRAPPER=/usr/libexec/gcc/arc-unknown-elf/8.2.0/lto-wrapper
    Target: arc-unknown-elf
    Configured with: /var/tmp/portage/cross-arc-unknown-elf/gcc-8.2.0-r3/work/gcc-8.2.0/configure --host=x86_64-pc-linux-gnu --target=arc-unknown-elf --build=x86_64-pc-linux-gnu --prefix=/usr --bindir=/usr/x86_64-pc-linux-gnu/arc-unknown-elf/gcc-bin/8.2.0 --includedir=/usr/lib/gcc/arc-unknown-elf/8.2.0/include --datadir=/usr/share/gcc-data/arc-unknown-elf/8.2.0 --mandir=/usr/share/gcc-data/arc-unknown-elf/8.2.0/man --infodir=/usr/share/gcc-data/arc-unknown-elf/8.2.0/info --with-gxx-include-dir=/usr/lib/gcc/arc-unknown-elf/8.2.0/include/g++-v8 --with-python-dir=/share/gcc-data/arc-unknown-elf/8.2.0/python --enable-languages=c,c++ --enable-obsolete --enable-secureplt --disable-werror --with-system-zlib --enable-nls --without-included-gettext --enable-checking=release --with-bugurl=https://bugs.gentoo.org/ --with-pkgversion='Gentoo 8.2.0-r3' --disable-esp --enable-libstdcxx-time --enable-poison-system-directories --disable-libstdcxx-time --with-sysroot=/usr/arc-unknown-elf --disable-bootstrap --with-newlib --enable-multilib --disable-altivec --disable-fixed-point --disable-libgomp --disable-libmudflap --disable-libssp --disable-libmpx --disable-systemtap --disable-vtable-verify --disable-libvtv --disable-libquadmath --enable-lto --without-isl --disable-libsanitizer --disable-default-pie --enable-default-ssp
    Thread model: single
    gcc version 8.2.0 (Gentoo 8.2.0-r3)

    % arc-unknown-elf-gcc -c -Os -mcpu=arcem -mno-sdata -mcode-density -mq-class -mbarrel-shifter -mmpy-option=3 -mswap test.c

The 11-iteration loop gets fully unrolled with pretty horrible results:

    0018 <do_stuff_11iter>:
      18:   730c                  mov_s   r0,3
      1a:   1e00 7000 f000        st      r0,[0xf000]
      22:   1e00 7000 f000 0004   st      r0,[0xf004]
      2a:   1e00 7000 f000 0008   st      r0,[0xf008]
      32:   1e00 7000 f000 000c   st      r0,[0xf00c]
      3a:   1e00 7000 f000 0010   st      r0,[0xf010]
      42:   1e00 7000 f000 0014   st      r0,[0xf014]
      4a:   1e00 7000 f000 0018   st      r0,[0xf018]
      52:   1e00 7000 f000 001c   st      r0,[0xf01c]
      5a:   1e00 7000 f000 0020   st      r0,[0xf020]
      62:   1e00 7000 f000 0024   st      r0,[0xf024]
      6a:   1e00 7000 f000 0028   st      r0,[0xf028]
      72:   7ee0                  j_s     [blink]

That's almost five times the size of the 12-iteration one, which didn't get
unrolled:

    <do_stuff_12iter>:
       0:   41c3 f000   mov_s   r1,0xf000
       6:   734c        mov_s   r2,3
       8:   d80c        mov_s   r0,0xc
       a:   240a 7000   mov     lp_count,r0
       e:   20a8 0140   lp      10 ;16
      12:   1904 0090   st.ab   r2,[r1,4]
      16:   7ee0        j_s     [blink]

That one's pretty good. This specific example could be a _tiny_ bit better,
because the constant values moved to r2 and r0 could be immediates in the
instructions where those registers are used, but I'm not bothered by that.

Since I requested size optimizations, it would be nice if my code size didn't
get quintupled like this.
[Bug c/87869] Unrolled loop leads to excessive code bloat with -Os on ARC EM.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=87869

--- Comment #1 from Nick Bowler ---
Er, I can't count, the unrolled loop is only ~four times the size.
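For reference, the size figures can be checked directly from the addresses in the listings in this thread. The snippet below is just an illustrative sketch: the helper names (span_bytes, unrolled_bytes, looped_bytes, short_store_bytes, ratio_x100) are made up here, and the only inputs are the first address of each listing and the address and 2-byte length of its final j_s instruction.

```c
/* Size in bytes of a code span, given the address of its first instruction
   and the address and length of its last instruction. */
static unsigned span_bytes(unsigned first_addr, unsigned last_addr,
                           unsigned last_len)
{
    return last_addr + last_len - first_addr;
}

/* Unrolled 11-iteration function: 0x18 through the j_s at 0x72. */
static unsigned unrolled_bytes(void)
{
    return span_bytes(0x18, 0x72, 2);
}

/* 12-iteration function that kept its loop: 0x0 through the j_s at 0x16. */
static unsigned looped_bytes(void)
{
    return span_bytes(0x0, 0x16, 2);
}

/* Hand-written short-encoding variant from comment #4: 0x0 through the
   j_s at 0x1e. */
static unsigned short_store_bytes(void)
{
    return span_bytes(0x0, 0x1e, 2);
}

/* Ratio of two sizes scaled by 100, using integer division (truncating). */
static unsigned ratio_x100(unsigned a, unsigned b)
{
    return a * 100u / b;
}
```

By this count the unrolled function is 92 bytes against 24 bytes for the loop, a ratio of about 3.8, which matches the "~four times" correction; the short-encoding variant from comment #4 would be 32 bytes.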