[Bug middle-end/87869] Unrolled loop leads to excessive code bloat with -Os on ARC EM.

2018-11-05 Thread nbowler at draconx dot ca
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=87869

--- Comment #4 from Nick Bowler  ---
(In reply to Richard Biener from comment #3)
> I think a better target for optimizing would be the RTL side,
[...]
> I'm sure arc can store to a register address as well.

Yes, if the shortest possible store encoding were used on ARC instead of
the longest possible encoding, then the unrolled loop would not be nearly
as painful, e.g.,

 :
   0:   40c3 f000               mov_s   r0,0xf000
   6:   732c                    mov_s   r1,3
   8:   a020                    st_s    r1,[r0,0]
   a:   a021                    st_s    r1,[r0,0x4]
   c:   a022                    st_s    r1,[r0,0x8]
   e:   a023                    st_s    r1,[r0,0xc]
  10:   a024                    st_s    r1,[r0,0x10]
  12:   a025                    st_s    r1,[r0,0x14]
  14:   a026                    st_s    r1,[r0,0x18]
  16:   a027                    st_s    r1,[r0,0x1c]
  18:   a028                    st_s    r1,[r0,0x20]
  1a:   a029                    st_s    r1,[r0,0x24]
  1c:   a02a                    st_s    r1,[r0,0x28]
  1e:   7ee0                    j_s     [blink]

[Bug c/87888] New: Behaviour of __builtin_arc_sr differs from its description in the manual.

2018-11-05 Thread nbowler at draconx dot ca
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=87888

          Bug ID: 87888
         Summary: Behaviour of __builtin_arc_sr differs from its
                  description in the manual.
         Product: gcc
         Version: 8.2.0
          Status: UNCONFIRMED
        Severity: normal
        Priority: P3
       Component: c
        Assignee: unassigned at gcc dot gnu.org
        Reporter: nbowler at draconx dot ca
Target Milestone: ---

I happened to notice what appears to be an error in the GCC manual,
§6.59.4 ARC Built-in Functions[1]:

Built-in Function: void __builtin_arc_sr (unsigned int auxr, unsigned int val)

The first argument, /auxv/, is the address of an auxiliary register,
the second argument, /val/, is a compile time constant to be written
to the register. Generates:

sr  auxr, [val]

This function indeed generates the sr instruction with the parameters
exactly as described, e.g., __builtin_arc_sr(0x123, 0x456) generates

   sr 0x123, [0x456]

However, the description of those parameters is incorrect: the first
operand of sr is the value to be written, and the second is the address,
so the previous example stores the value 0x123 to aux address 0x456.

Also I think the note about val being a compile-time constant is an
error as well... the sr instruction does not require constants, and
gcc happily accepts non-constant values as arguments to this builtin.

I suggest the documentation of this builtin should be changed to match
its actual behaviour, perhaps something like:

Built-in Function: void __builtin_arc_sr (unsigned int val, unsigned int auxr)

Stores /val/ to the auxiliary register with address /auxr/.  Generates:

sr  val, [auxr]

[1]
https://gcc.gnu.org/onlinedocs/gcc-8.2.0/gcc/ARC-Built-in-Functions.html#index-_005f_005fbuiltin_005farc_005fsr
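
To make the operand order concrete, here is a minimal sketch of a wrapper
that hides it behind an (address, value) signature.  The name aux_write and
the host-side fake_aux array are hypothetical, invented for illustration
only.  The sketch assumes the behaviour reported above for GCC 8.2.0 (value
first, aux address second) and that the ARC port predefines __arc__.

```c
#include <assert.h>
#include <stdint.h>

#ifdef __arc__
/* On target: per this report, GCC 8.2.0 emits "sr <1st arg>, [<2nd arg>]",
   so the value goes first and the aux address second (assumption taken
   from the report, not from the manual). */
static inline void aux_write(uint32_t auxr, uint32_t val)
{
    __builtin_arc_sr(val, auxr);
}
#else
/* Off target: hypothetical stand-in so callers can be exercised on a host. */
#define FAKE_AUX_REGS 0x1000u
static uint32_t fake_aux[FAKE_AUX_REGS];

static inline void aux_write(uint32_t auxr, uint32_t val)
{
    assert(auxr < FAKE_AUX_REGS);
    fake_aux[auxr] = val;
}
#endif
```

With this wrapper, aux_write(0x456, 0x123) is intended to store the value
0x123 to aux address 0x456, matching what __builtin_arc_sr(0x123, 0x456)
is reported to do.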

[Bug middle-end/87869] Unrolled loop leads to excessive code bloat with -Os on ARC EM.

2018-11-06 Thread nbowler at draconx dot ca
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=87869

--- Comment #5 from Nick Bowler  ---
Looking at some of my other code output, it looks like these long encodings are
emitted a lot more frequently than seems necessary.

If shorter store encodings were used more generally then I'd expect to see
significant size improvements not just to the test case under discussion here
but to that other code as well.

[Bug c/87869] New: Unrolled loop leads to excessive code bloat with -Os on ARC EM.

2018-11-02 Thread nbowler at draconx dot ca
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=87869

          Bug ID: 87869
         Summary: Unrolled loop leads to excessive code bloat with -Os
                  on ARC EM.
         Product: gcc
         Version: 8.2.0
          Status: UNCONFIRMED
        Severity: normal
        Priority: P3
       Component: c
        Assignee: unassigned at gcc dot gnu.org
        Reporter: nbowler at draconx dot ca
Target Milestone: ---

Consider the following code:

  % cat >test.c <<'EOF'
  #include <stdint.h>

  void do_stuff_12iter(void)
  {
      volatile uint32_t *blah = (void *)0xf000;
      unsigned i;

      for (i = 0; i < 12; i++) {
          blah[i] = 3;
      }
  }

  void do_stuff_11iter(void)
  {
      volatile uint32_t *blah = (void *)0xf000;
      unsigned i;

      for (i = 0; i < 11; i++) {
          blah[i] = 3;
      }
  }
  EOF

When I compile this with gcc:

  % arc-unknown-elf-gcc -v
  Using built-in specs.
 
COLLECT_GCC=/usr/x86_64-pc-linux-gnu/arc-unknown-elf/gcc-bin/8.2.0/arc-unknown-elf-gcc
  COLLECT_LTO_WRAPPER=/usr/libexec/gcc/arc-unknown-elf/8.2.0/lto-wrapper
  Target: arc-unknown-elf
  Configured with:
/var/tmp/portage/cross-arc-unknown-elf/gcc-8.2.0-r3/work/gcc-8.2.0/configure
--host=x86_64-pc-linux-gnu --target=arc-unknown-elf --build=x86_64-pc-linux-gnu
--prefix=/usr --bindir=/usr/x86_64-pc-linux-gnu/arc-unknown-elf/gcc-bin/8.2.0
--includedir=/usr/lib/gcc/arc-unknown-elf/8.2.0/include
--datadir=/usr/share/gcc-data/arc-unknown-elf/8.2.0
--mandir=/usr/share/gcc-data/arc-unknown-elf/8.2.0/man
--infodir=/usr/share/gcc-data/arc-unknown-elf/8.2.0/info
--with-gxx-include-dir=/usr/lib/gcc/arc-unknown-elf/8.2.0/include/g++-v8
--with-python-dir=/share/gcc-data/arc-unknown-elf/8.2.0/python
--enable-languages=c,c++ --enable-obsolete --enable-secureplt --disable-werror
--with-system-zlib --enable-nls --without-included-gettext
--enable-checking=release --with-bugurl=https://bugs.gentoo.org/
--with-pkgversion='Gentoo 8.2.0-r3' --disable-esp --enable-libstdcxx-time
--enable-poison-system-directories --disable-libstdcxx-time
--with-sysroot=/usr/arc-unknown-elf --disable-bootstrap --with-newlib
--enable-multilib --disable-altivec --disable-fixed-point --disable-libgomp
--disable-libmudflap --disable-libssp --disable-libmpx --disable-systemtap
--disable-vtable-verify --disable-libvtv --disable-libquadmath --enable-lto
--without-isl --disable-libsanitizer --disable-default-pie --enable-default-ssp
  Thread model: single
  gcc version 8.2.0 (Gentoo 8.2.0-r3) 

  % arc-unknown-elf-gcc -c -Os -mcpu=arcem -mno-sdata -mcode-density -mq-class
-mbarrel-shifter -mmpy-option=3 -mswap test.c

The 11-iteration loop gets fully unrolled with pretty horrible results:

0018 :
  18:   730c                    mov_s   r0,3
  1a:   1e00 7000 f000          st      r0,[0xf000]
  22:   1e00 7000 f000 0004     st      r0,[0xf004]
  2a:   1e00 7000 f000 0008     st      r0,[0xf008]
  32:   1e00 7000 f000 000c     st      r0,[0xf00c]
  3a:   1e00 7000 f000 0010     st      r0,[0xf010]
  42:   1e00 7000 f000 0014     st      r0,[0xf014]
  4a:   1e00 7000 f000 0018     st      r0,[0xf018]
  52:   1e00 7000 f000 001c     st      r0,[0xf01c]
  5a:   1e00 7000 f000 0020     st      r0,[0xf020]
  62:   1e00 7000 f000 0024     st      r0,[0xf024]
  6a:   1e00 7000 f000 0028     st      r0,[0xf028]
  72:   7ee0                    j_s     [blink]

That's almost five times the size of the 12-iteration one which didn't
get unrolled:

 :
   0:   41c3 f000               mov_s   r1,0xf000
   6:   734c                    mov_s   r2,3
   8:   d80c                    mov_s   r0,0xc
   a:   240a 7000               mov     lp_count,r0
   e:   20a8 0140               lp      10  ;16
  12:   1904 0090               st.ab   r2,[r1,4]
  16:   7ee0                    j_s     [blink]

That one's pretty good.  This specific example could be a _tiny_
bit better, because the constant values moved to r2 and r0 could be
immediates in the instructions where those registers are used but
I'm not bothered by that.

Since I requested size optimizations, it would be nice if my code
size didn't get quintupled like this.

[Bug c/87869] Unrolled loop leads to excessive code bloat with -Os on ARC EM.

2018-11-02 Thread nbowler at draconx dot ca
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=87869

--- Comment #1 from Nick Bowler  ---
Er, I can't count, the unrolled loop is only ~four times the size.
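
For reference, the sizes can be checked from the offsets in the listings in
this thread.  The following is only a sketch; the helper names are made up
for illustration, and it assumes each function ends with the 2-byte j_s
shown as its last instruction.

```c
#include <assert.h>
#include <stdio.h>

/* Size in bytes of code spanning [start, end) in a disassembly listing. */
static unsigned code_size(unsigned start, unsigned end)
{
    assert(end > start);
    return end - start;
}

/* Offsets read off the listings; "end" is the j_s offset plus its 2 bytes. */
static unsigned unrolled_size(void) { return code_size(0x18, 0x72 + 2); }
static unsigned looped_size(void)   { return code_size(0x00, 0x16 + 2); }
static unsigned short_st_size(void) { return code_size(0x00, 0x1e + 2); }

/* Print each size and the unrolled-to-looped ratio. */
static void print_sizes(void)
{
    printf("unrolled (long st): %u bytes\n", unrolled_size());
    printf("loop (lp + st.ab):  %u bytes\n", looped_size());
    printf("unrolled (st_s):    %u bytes\n", short_st_size());
    printf("ratio: %.2f\n", (double)unrolled_size() / looped_size());
}
```

Calling print_sizes() gives 92 bytes for the unrolled function as compiled,
24 bytes for the loop, and 32 bytes for the short-encoding variant from
comment #4, so the unrolled code is a bit under four times the loop's size.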