[Bug tree-optimization/88760] GCC unrolling is suboptimal

2019-11-10 Thread guojiufu at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88760

--- Comment #43 from Jiu Fu Guo  ---
Author: guojiufu
Date: Mon Nov 11 06:30:38 2019
New Revision: 278034

URL: https://gcc.gnu.org/viewcvs?rev=278034&root=gcc&view=rev
Log:

rs6000: Refine small loop unroll in loop_unroll_adjust hook

This patch introduces a loop unroll adjust hook for powerpc, in which we
can make target-specific heuristic adjustments.  With this patch,
-funroll-loops is enabled for small loops at O2 and above, with the option
-munroll-only-small-loops to guard the small-loop unrolling, and it works
fine with -flto.
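
For reference, a minimal standalone sketch of the hook's logic (the threshold
and names are illustrative; the real rs6000_loop_unroll_adjust lives in
config/rs6000/rs6000.c and works on GCC's struct loop):

#include <stdbool.h>

struct loop { unsigned ninsns; };           /* stand-in for GCC's struct loop */
static bool unroll_only_small_loops = true;

/* Sketch: unroll only small loops (<= 10 insns here) a modest amount at
   O2; leave larger loops to an explicit -funroll-loops.  */
static unsigned
loop_unroll_adjust (unsigned nunroll, struct loop *loop)
{
  if (unroll_only_small_loops)
    {
      if (loop->ninsns <= 10)
        return nunroll > 2 ? nunroll : 2;
      return 0;                             /* do not unroll larger loops */
    }
  return nunroll;
}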


gcc/
2019-11-11  Jiufu Guo  

PR tree-optimization/88760
* config/rs6000/rs6000.opt (-munroll-only-small-loops): New option.
* common/config/rs6000/rs6000-common.c
(rs6000_option_optimization_table) [OPT_LEVELS_2_PLUS_SPEED_ONLY]:
Turn on -funroll-loops and -munroll-only-small-loops.
[OPT_LEVELS_ALL]: Turn off -fweb and -frename-registers.
* config/rs6000/rs6000.c (rs6000_option_override_internal): Remove
set of PARAM_MAX_UNROLL_TIMES and PARAM_MAX_UNROLLED_INSNS.
Turn off -munroll-only-small-loops for explicit -funroll-loops.
(TARGET_LOOP_UNROLL_ADJUST): Add loop unroll adjust hook.
(rs6000_loop_unroll_adjust): Define it.  Use -munroll-only-small-loops.

gcc/testsuite/
2019-11-11  Jiufu Guo  

PR tree-optimization/88760
* gcc.dg/pr59643.c: Update back to r277550.


Modified:
trunk/gcc/ChangeLog
trunk/gcc/common/config/rs6000/rs6000-common.c
trunk/gcc/config/rs6000/rs6000.c
trunk/gcc/config/rs6000/rs6000.opt
trunk/gcc/testsuite/ChangeLog
trunk/gcc/testsuite/gcc.dg/pr59643.c

[Bug tree-optimization/88760] GCC unrolling is suboptimal

2019-10-27 Thread guojiufu at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88760

--- Comment #42 from Jiu Fu Guo  ---
Author: guojiufu
Date: Mon Oct 28 05:23:24 2019
New Revision: 277501

URL: https://gcc.gnu.org/viewcvs?rev=277501&root=gcc&view=rev
Log:
rs6000: Enable limited unrolling at -O2

In PR88760, there has been some discussion about improving or tuning the
unroller for targets, and we agreed to enable the unroller for small loops
at O2 first.  We see a performance improvement (~10%) for the code below:
```
  subroutine foo (i, i1, block)
integer :: i, i1
integer :: block(9, 9, 9)
block(i:9,1,i1) = block(i:9,1,i1) - 10
  end subroutine foo

```
This kind of code occurs a few times in the exchange2 benchmark.

Similar C code:
```
  for (i = 0; i < n; i++)
arr[i] = arr[i] - 10;
```
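
As a sketch (illustrative only, not actual GCC output), the shape this
unrolling aims for is a 2x-unrolled body plus a scalar remainder iteration:
```
void
sub10 (int *arr, int n)
{
  int i = 0;
  for (; i + 1 < n; i += 2)     /* 2x-unrolled main body */
    {
      arr[i]     = arr[i]     - 10;
      arr[i + 1] = arr[i + 1] - 10;
    }
  if (i < n)                    /* at most one leftover iteration */
    arr[i] = arr[i] - 10;
}
```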

On powerpcle, at O2, enabling -funroll-loops and limiting
PARAM_MAX_UNROLL_TIMES to 2 and PARAM_MAX_UNROLLED_INSNS to 20 gives a >2%
overall improvement on SPEC2017.

This patch is only for rs6000, where we see a visible performance improvement.

gcc/
2019-10-25  Jiufu Guo   

PR tree-optimization/88760
* common/config/rs6000/rs6000-common.c (rs6000_option_optimization_table):
Enable -funroll-loops for -O2 and above.
* config/rs6000/rs6000.c (rs6000_option_override_internal): Set
PARAM_MAX_UNROLL_TIMES to 2 and PARAM_MAX_UNROLLED_INSNS to 20, and
do not turn on -fweb and -frename-registers implicitly if the unroller is not
explicitly enabled.

gcc/testsuite/
2019-10-25  Jiufu Guo  

PR tree-optimization/88760
* gcc.target/powerpc/small-loop-unroll.c: New test.
* c-c++-common/tsan/thread_leak2.c: Update test.
* gcc.dg/pr59643.c: Update test.
* gcc.target/powerpc/loop_align.c: Update test.
* gcc.target/powerpc/ppc-fma-1.c: Update test.
* gcc.target/powerpc/ppc-fma-2.c: Update test.
* gcc.target/powerpc/ppc-fma-3.c: Update test.
* gcc.target/powerpc/ppc-fma-4.c: Update test.
* gcc.target/powerpc/pr78604.c: Update test.

Added:
trunk/gcc/testsuite/gcc.target/powerpc/small-loop-unroll.c
Modified:
trunk/gcc/ChangeLog
trunk/gcc/common/config/rs6000/rs6000-common.c
trunk/gcc/config/rs6000/rs6000.c
trunk/gcc/testsuite/ChangeLog
trunk/gcc/testsuite/c-c++-common/tsan/thread_leak2.c
trunk/gcc/testsuite/gcc.dg/pr59643.c
trunk/gcc/testsuite/gcc.target/powerpc/loop_align.c
trunk/gcc/testsuite/gcc.target/powerpc/ppc-fma-1.c
trunk/gcc/testsuite/gcc.target/powerpc/ppc-fma-2.c
trunk/gcc/testsuite/gcc.target/powerpc/ppc-fma-3.c
trunk/gcc/testsuite/gcc.target/powerpc/ppc-fma-4.c
trunk/gcc/testsuite/gcc.target/powerpc/pr78604.c

[Bug tree-optimization/88760] GCC unrolling is suboptimal

2019-10-22 Thread guojiufu at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88760

--- Comment #41 from Jiu Fu Guo  ---
For this code:

  subroutine foo (i, i1, block)
integer :: i, i1
integer :: block(9, 9, 9)
block(i:9,1,i1) = block(i:9,1,i1) - 10
  end subroutine foo

"-funroll-loops  --param max-unroll-times=2 --param max-unrolled-insns=20"
could help to improve some run time.(~10% on ppcle)

main:
  do n = 0, N
     do i = 1, 9
        do j = 1, 9
           call foo (i, j, block)
        end do
     end do
  end do

[Bug tree-optimization/88760] GCC unrolling is suboptimal

2019-10-14 Thread rguenther at suse dot de
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88760

--- Comment #40 from rguenther at suse dot de  ---
On Sat, 12 Oct 2019, guojiufu at gcc dot gnu.org wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88760
> 
> --- Comment #39 from Jiu Fu Guo  ---
> For small loop (1-2 stmts), in forms of GIMPLE and RTL, it would be around 
> 5-10
> instructions: 2-4 insns per stmt, ~4 insns for idx.
> 
> With current unroller, here is a statistic on spec2017. 
> Using --param max-unrolled-insns=12, there are ~3000 small loops could be
> unrolled totally, and ~40 of these small loops are located in hot-functions.
> 
> Using --param max-unrolled-insns=16, there are ~11000 small loops could be
> unrolled totally, and ~230 of these small loops are located in hot-functions.
> 
> Using --param max-unrolled-insns=20, there are ~15000 small loops could be
> unrolled totally, and ~570 of these small loops are located in hot-functions.
> 
> Using --param max-unrolled-insns=24, there are ~18000 small loops could be
> unrolled totally, and ~680 of these small loops are located in hot-functions.
> 
> 
> if max-unrolled-insns<16, just few small loops are unrolled for hot-functions;
> it may be not very valuable.

So 12 if two times unrolled is already 6 insns, minus IV update and
compare-and-branch (assuming single pattern) that's 4 insns.  On
GIMPLE I'd already call this large since eventual memory loads and
stores would be separate - so there it would be ~16 instead of 12.

I think the better approach is to identify the cases where unrolling
would help, and on which (sub-)architectures, and prepare testcases
for them.

I guess the time when our default unroll factor of 8 (if it fits the
size limits) was a good idea is long gone; I'd expect ILP
to stop improving much earlier (depending on the set of operations).
For ILP you also want to do interleaving of the unrolled iterations,
so I point to SMS again here (SMS suffers from the fact that
loop dependence info is weak on RTL, but it uses the scheduler
model of the target).

[Bug tree-optimization/88760] GCC unrolling is suboptimal

2019-10-12 Thread guojiufu at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88760

--- Comment #39 from Jiu Fu Guo  ---
For a small loop (1-2 stmts), in GIMPLE and RTL forms, it would be around 5-10
instructions: 2-4 insns per stmt, plus ~4 insns for the index.

With the current unroller, here are some statistics on SPEC2017.
Using --param max-unrolled-insns=12, there are ~3000 small loops that could be
unrolled totally, and ~40 of these small loops are located in hot functions.

Using --param max-unrolled-insns=16, there are ~11000 small loops that could be
unrolled totally, and ~230 of these small loops are located in hot functions.

Using --param max-unrolled-insns=20, there are ~15000 small loops that could be
unrolled totally, and ~570 of these small loops are located in hot functions.

Using --param max-unrolled-insns=24, there are ~18000 small loops that could be
unrolled totally, and ~680 of these small loops are located in hot functions.


If max-unrolled-insns < 16, only a few small loops in hot functions are
unrolled; it may not be very valuable.

[Bug tree-optimization/88760] GCC unrolling is suboptimal

2019-10-11 Thread rguenther at suse dot de
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88760

--- Comment #38 from rguenther at suse dot de  ---
On Fri, 11 Oct 2019, segher at gcc dot gnu.org wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88760
> 
> --- Comment #37 from Segher Boessenkool  ---
> -- If it is done in RTL it should really be done earlier, it doesn't get all
> optimisations it should right now.
> 
> -- Unrolling small loops more aggressively (at -O2 even) perhaps needs to be
> done at a different spot in the pipeline then?  It needs no cost estimate
> (other than some simple params).

Well.  Even unrolling a two-stmt loop just once can be detrimental on
x86 CPUs and whether it is beneficial depends a lot on code alignment
which we cannot really track well w/o introducing artificial code
padding everywhere.  But yes, this can be turned off easily on x86.
Just heuristics are not as simple as you might think.

[Bug tree-optimization/88760] GCC unrolling is suboptimal

2019-10-11 Thread segher at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88760

--- Comment #37 from Segher Boessenkool  ---
-- If it is done in RTL it should really be done earlier, it doesn't get all
optimisations it should right now.

-- Unrolling small loops more aggressively (at -O2 even) perhaps needs to be
done at a different spot in the pipeline then?  It needs no cost estimate
(other than some simple params).

[Bug tree-optimization/88760] GCC unrolling is suboptimal

2019-10-11 Thread rguenther at suse dot de
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88760

--- Comment #36 from rguenther at suse dot de  ---
On Fri, 11 Oct 2019, wilco at gcc dot gnu.org wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88760
> 
> --- Comment #32 from Wilco  ---
> (In reply to Segher Boessenkool from comment #31)
> > Gimple passes know a lot about machine details, too.
> > 
> > Irrespective of if this is "low-level" or "high-level", it should be done
> > earlier than it is now.  It should either be done right after expand, or
> > somewhere before it.  Doing it in gimple seems easiest?
> 
> Expand is way too late. Unrolling is high-level because it relies on other
> high-level optimizations - for example to get addressing and induction
> variables right. Without that it's always a loss.

I know I've said it before but I really want to have a "GIMPLE unroller"
integrated with a pass that can compute unrolling benefit.  One candidate
was of course IVOPTs which albeit is already quite complex but which
has register pressure estimates and should have an idea how unrolling
affects these.

Computing unrolling benefit on its own would require some pipeline
modeling on GIMPLE.  So the other obvious candidate would be
the RTL scheduler - here I think SMS does "unrolling".

[Bug tree-optimization/88760] GCC unrolling is suboptimal

2019-10-11 Thread rguenther at suse dot de
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88760

--- Comment #35 from rguenther at suse dot de  ---
On Fri, 11 Oct 2019, wilco at gcc dot gnu.org wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88760
> 
> --- Comment #34 from Wilco  ---
> (In reply to rguent...@suse.de from comment #30)
> > On Fri, 11 Oct 2019, wilco at gcc dot gnu.org wrote:
> > 
> > > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88760
> > > 
> > > --- Comment #29 from Wilco  ---
> > > (In reply to Jiu Fu Guo from comment #28)
> > > > For these kind of small loops, it would be acceptable to unroll in 
> > > > GIMPLE,
> > > > because register pressure and instruction cost may not be major 
> > > > concerns;
> > > > just  like "cunroll" and "cunrolli" passes (complete unroll) which also 
> > > > been
> > > > done at O2.
> > > 
> > > Absolutely, unrolling is a high-level optimization like vectorization.
> > 
> > To expose ILP?  I'd call that low-level though ;)
> > 
> > If it exposes data reuse then I'd call it high-level - and at that level
> > we already have passes like predictive commoning or unroll-and-jam doing
> > exactly that.  Or vectorization.
> > 
> > We've shown though data that unrolling without a good idea on CPU
> > pipeline details is a loss on x86_64.  This further hints at it
> > being low-level.
> 
> That was using the RTL unroller and existing settings right? Those settings 
> are
> not sane - is 16x unrolling still the default?!? If so, that doesn't surprise
> me at all.

We actually did measurements with restricting that to 2x, 3x, 4x, ...
plus selectively enabling some extra unroller features
(fsplit-ivs-in-unroller, fvariable-expansion-in-unroller).  Quite a big
testing matrix but then nothing was an obvious win.

> Note also it's dangerous to assume x86 behaves exactly like Arm, POWER 
> or other ISAs. Arm cores are very wide nowadays (6+ wide) and have 
> supported 2 memory accesses per cycle (loads and/or stores) for years.

Sure.

[Bug tree-optimization/88760] GCC unrolling is suboptimal

2019-10-11 Thread wilco at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88760

--- Comment #34 from Wilco  ---
(In reply to rguent...@suse.de from comment #30)
> On Fri, 11 Oct 2019, wilco at gcc dot gnu.org wrote:
> 
> > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88760
> > 
> > --- Comment #29 from Wilco  ---
> > (In reply to Jiu Fu Guo from comment #28)
> > > For these kind of small loops, it would be acceptable to unroll in GIMPLE,
> > > because register pressure and instruction cost may not be major concerns;
> > > just  like "cunroll" and "cunrolli" passes (complete unroll) which also 
> > > been
> > > done at O2.
> > 
> > Absolutely, unrolling is a high-level optimization like vectorization.
> 
> To expose ILP?  I'd call that low-level though ;)
> 
> If it exposes data reuse then I'd call it high-level - and at that level
> we already have passes like predictive commoning or unroll-and-jam doing
> exactly that.  Or vectorization.
> 
> We've shown though data that unrolling without a good idea on CPU
> pipeline details is a loss on x86_64.  This further hints at it
> being low-level.

That was using the RTL unroller and existing settings right? Those settings are
not sane - is 16x unrolling still the default?!? If so, that doesn't surprise
me at all.

Note also it's dangerous to assume x86 behaves exactly like Arm, POWER or other
ISAs. Arm cores are very wide nowadays (6+ wide) and have supported 2 memory
accesses per cycle (loads and/or stores) for years.

[Bug tree-optimization/88760] GCC unrolling is suboptimal

2019-10-11 Thread rguenther at suse dot de
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88760

--- Comment #33 from rguenther at suse dot de  ---
On Fri, 11 Oct 2019, segher at gcc dot gnu.org wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88760
> 
> --- Comment #31 from Segher Boessenkool  ---
> Gimple passes know a lot about machine details, too.
> 
> Irrespective of if this is "low-level" or "high-level", it should be done
> earlier than it is now.  It should either be done right after expand, or
> somewhere before it.  Doing it in gimple seems easiest?

Sure it's easiest in GIMPLE.  But I'd say it shouldn't be done earlier,
it should be done correctly - RTL unroll simply unrolls everything
without any good costing.  That is what needs to be fixed.

Other than that, where we do cunroll on GIMPLE is a good spot I guess
(need to do it _before_ IVOPTs at least, though I wanted to push that
one later as well).

[Bug tree-optimization/88760] GCC unrolling is suboptimal

2019-10-11 Thread wilco at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88760

--- Comment #32 from Wilco  ---
(In reply to Segher Boessenkool from comment #31)
> Gimple passes know a lot about machine details, too.
> 
> Irrespective of if this is "low-level" or "high-level", it should be done
> earlier than it is now.  It should either be done right after expand, or
> somewhere before it.  Doing it in gimple seems easiest?

Expand is way too late. Unrolling is high-level because it relies on other
high-level optimizations - for example to get addressing and induction
variables right. Without that it's always a loss.

[Bug tree-optimization/88760] GCC unrolling is suboptimal

2019-10-11 Thread segher at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88760

--- Comment #31 from Segher Boessenkool  ---
Gimple passes know a lot about machine details, too.

Irrespective of if this is "low-level" or "high-level", it should be done
earlier than it is now.  It should either be done right after expand, or
somewhere before it.  Doing it in gimple seems easiest?

[Bug tree-optimization/88760] GCC unrolling is suboptimal

2019-10-11 Thread rguenther at suse dot de
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88760

--- Comment #30 from rguenther at suse dot de  ---
On Fri, 11 Oct 2019, wilco at gcc dot gnu.org wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88760
> 
> --- Comment #29 from Wilco  ---
> (In reply to Jiu Fu Guo from comment #28)
> > For these kind of small loops, it would be acceptable to unroll in GIMPLE,
> > because register pressure and instruction cost may not be major concerns;
> > just  like "cunroll" and "cunrolli" passes (complete unroll) which also been
> > done at O2.
> 
> Absolutely, unrolling is a high-level optimization like vectorization.

To expose ILP?  I'd call that low-level though ;)

If it exposes data reuse then I'd call it high-level - and at that level
we already have passes like predictive commoning or unroll-and-jam doing
exactly that.  Or vectorization.

We've shown through data that unrolling without a good idea of CPU
pipeline details is a loss on x86_64.  This further hints at it
being low-level.

[Bug tree-optimization/88760] GCC unrolling is suboptimal

2019-10-11 Thread wilco at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88760

--- Comment #29 from Wilco  ---
(In reply to Jiu Fu Guo from comment #28)
> For these kind of small loops, it would be acceptable to unroll in GIMPLE,
> because register pressure and instruction cost may not be major concerns;
> just  like "cunroll" and "cunrolli" passes (complete unroll) which also been
> done at O2.

Absolutely, unrolling is a high-level optimization like vectorization. Note the
existing unroller doesn't care about register pressure or costs, but a
high-level unroller can certainly take this into account by not unrolling as
aggressively as the existing unroller does.

[Bug tree-optimization/88760] GCC unrolling is suboptimal

2019-10-10 Thread guojiufu at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88760

Jiu Fu Guo  changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |guojiufu at gcc dot gnu.org

--- Comment #28 from Jiu Fu Guo  ---
For these kinds of small loops, it would be acceptable to unroll in GIMPLE,
because register pressure and instruction cost may not be major concerns; just
like the "cunroll" and "cunrolli" passes (complete unrolling), which are also
done at O2.

[Bug tree-optimization/88760] GCC unrolling is suboptimal

2019-10-10 Thread wdijkstr at arm dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88760

Wilco  changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |wdijkstr at arm dot com

--- Comment #27 from Wilco  ---
(In reply to Segher Boessenkool from comment #26)
> Yeah, and it probably should be a param (that different targets can default
> differently, per CPU probably).  On most Power CPUs all loops take a minimum
> number of cycles per iteration (say, three), but that translates to a lot of
> instructions (say, >10).
> 
> Small loops should probably be unrolled at -O2 already as well.  Maybe fixed
> length loops only?

Agreed, there is no reason not to unroll (or vectorize) at -O2, given that
several other compilers do (and beat GCC as a result). It's best to limit
unrolling to small loops, and to 2x only, unless the body is 1-2 instructions.

[Bug tree-optimization/88760] GCC unrolling is suboptimal

2019-09-27 Thread segher at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88760

--- Comment #26 from Segher Boessenkool  ---
Yeah, and it probably should be a param (that different targets can default
differently, per CPU probably).  On most Power CPUs all loops take a minimum
number of cycles per iteration (say, three), but that translates to a lot of
instructions (say, >10).

Small loops should probably be unrolled at -O2 already as well.  Maybe fixed
length loops only?

[Bug tree-optimization/88760] GCC unrolling is suboptimal

2019-09-27 Thread rearnsha at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88760

--- Comment #25 from Richard Earnshaw  ---
Well, very small loops should be unrolled more than slightly larger ones.  So
perhaps if the loop body is only 3 or 4 instructions it should be unrolled four
times, but above that perhaps only twice.

Some consideration should be given, however, to whether the target
micro-architecture has looping buffers as unrolling beyond the limit of such
structures will hurt performance.

[Bug tree-optimization/88760] GCC unrolling is suboptimal

2019-09-25 Thread segher at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88760

--- Comment #24 from Segher Boessenkool  ---
On some (many?) targets it would be good to unroll all loops with a small body
(not containing calls etc.) at -O2 already, some small number of times (2 or 4
maybe).

[Bug tree-optimization/88760] GCC unrolling is suboptimal

2019-01-25 Thread wilco at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88760

--- Comment #23 from Wilco  ---
(In reply to ktkachov from comment #22)

> helps even more. On Cortex-A72 it gives a bit more than 6% (vs 3%)
> improvement on parest, and about 5.3% on a more aggressive CPU.
> I tried unrolling 8x in a similar manner and that was not faster than 4x on
> either target.

The 4x unrolled version has 19 instructions (and microops) vs 7*4 for the
non-unrolled version, a significant reduction (without LDP it would be 21 vs
28). There is potential to use 2 more LDPs and use load+writeback which would
make it 15 vs 28 instructions, so close to 2x reduction in instructions.

[Bug tree-optimization/88760] GCC unrolling is suboptimal

2019-01-24 Thread ktkachov at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88760

--- Comment #22 from ktkachov at gcc dot gnu.org ---
Some more experiments...
Unrolling 4x in a similar way to my previous example and not splitting the
accumulator (separate issue):

unsigned int *colnums;
double *val;

struct foostruct
{
  unsigned int rows;
  unsigned int *colnums;
  unsigned int *rowstart;
};

struct foostruct *cols;

void
foo (double * __restrict__ dst, const double *__restrict__ src)
{
  const unsigned int n_rows = cols->rows;
  const double *val_ptr = &val[cols->rowstart[0]];
  const unsigned int *colnum_ptr = &cols->colnums[cols->rowstart[0]];

  double *dst_ptr = dst;
  for (unsigned int row=0; row<n_rows; ++row)
    {
      double s = 0.;
      const double *const val_end_of_row = &val[cols->rowstart[row+1]];
  __PTRDIFF_TYPE__ diff = val_end_of_row - val_ptr;
  if (diff & 1)
  {
s += *val_ptr++ * src[*colnum_ptr++];
diff--;
  }
  if (diff & 2)
{
  s += val_ptr[0] * src[colnum_ptr[0]];
  s += val_ptr[1] * src[colnum_ptr[1]];
  val_ptr += 2;
  colnum_ptr += 2;
}
  while (val_ptr != val_end_of_row)
{
  s += val_ptr[0] * src[colnum_ptr[0]];
  s += val_ptr[1] * src[colnum_ptr[1]];
  s += val_ptr[2] * src[colnum_ptr[2]];
  s += val_ptr[3] * src[colnum_ptr[3]];
  val_ptr += 4;
  colnum_ptr += 4;
}
  *dst_ptr++ = s;
}
}

helps even more. On Cortex-A72 it gives a bit more than 6% (vs 3%) improvement
on parest, and about 5.3% on a more aggressive CPU.
I tried unrolling 8x in a similar manner and that was not faster than 4x on
either target.

Note that perf profiling shows that the loads are what's hot in these loops,
not the FMAs themselves:
  4.41 │1b8:   ldp    w3, w4, [x0]
  5.85 │       ldp    d3, d4, [x2]
       │       add    x2, x2, #0x20
  3.79 │       ldur   d5, [x2, #-16]
  2.82 │       ldr    d0, [x1, x4, lsl #3]
  2.53 │       ldr    d2, [x1, x3, lsl #3]
  2.10 │       ldp    w4, w3, [x0, #8]
       │       add    x0, x0, #0x10
  0.00 │       cmp    x5, x0
       │       fmul   d0, d0, d4
  4.73 │       ldr    d4, [x1, x4, lsl #3]
       │       fmadd  d0, d3, d2, d0
  2.01 │       ldur   d3, [x2, #-8]
  2.54 │       ldr    d2, [x1, x3, lsl #3]
       │       fmadd  d0, d5, d4, d0
       │       fmadd  d0, d3, d2, d0

[Bug tree-optimization/88760] GCC unrolling is suboptimal

2019-01-24 Thread wilco at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88760

--- Comment #21 from Wilco  ---
(In reply to rguent...@suse.de from comment #20)
> On Thu, 24 Jan 2019, wilco at gcc dot gnu.org wrote:
> 
> > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88760
> > 
> > --- Comment #19 from Wilco  ---
> > (In reply to rguent...@suse.de from comment #18)
> > 
> > > > 1) Unrolling for load-pair-forming vectorisation (Richard Sandiford's
> > > > suggestion)
> > > 
> > > If that helps, sure (I'd have guessed uarchs are going to split
> > > load-multiple into separate loads, but eventually it avoids
> > > load-port contention?)
> > 
> > Many CPUs execute LDP/STP as a single load/store, eg. Cortex-A57 executes a
> > 128-bit LDP in a single cycle (see Optimization Guide).
> > 
> > > > 2) Unrolling and breaking accumulator dependencies.
> > > 
> > > IIRC RTL unrolling can do this (as side-effect, not as main
> > > cost motivation) guarded with an extra switch.
> > > 
> > > > I think more general unrolling and the peeling associated with it can be
> > > > discussed independently of 1) and 2) once we collect more data on more
> > > > microarchitectures.
> > > 
> > > I think both of these can be "implemented" on the RTL unroller
> > > side.
> > 
> > You still need dependence analysis, alias info, ivopt to run again. The 
> > goal is
> > to remove the increment of the index, use efficient addressing modes 
> > (base+imm)
> > and allow scheduling to move instructions between iterations. I don't 
> > believe
> > the RTL unroller supports any of this today.
> 
> There's no way to encode load-multiple on GIMPLE that wouldn't be
> awkward to other GIMPLE optimizers.

I don't think anyone want LDP/STP directly in GIMPLE - that doesn't seem
useful. We don't even form LDP until quite late in RTL. The key to forming
LDP/STP is using base+imm addressing modes and having correct alias info (so
loads/stores from different iterations can be interleaved and then combined
into LDP/STP). The main thing a backend would need to do is tune address costs
to take future LDP formation into account (and yes, the existing cost models
need to be improved anyway).

> Yes, the RTL unroller supports scheduling (sched runs after unrolling)
> and scheduling can do dependence analysis.  Yes, the RTL unroller
> does _not_ do dependence analysis at the moment, so if you want to
> know beforehand whether you can interleave iterations you have to
> actually perform dependence analysis.  Of course you can do that
> on RTL.  And of course you can do IVOPTs on RTL (yes, we don't do that
> at the moment).

Sure we *could* duplicate all high-level loop optimizations to work on RTL.
However is that worth the effort given we have them already at tree level?

> Note I'm not opposed to have IVOPTs on GIMPLE itself perform
> unrolling (I know Bin was against this given IVOPTs is already
> so comples) and a do accumulator breakup.  But I don't see how
> to do the load-multiple thing (yes, you could represent it
> as a vector load plus N element extracts on GIMPLE and it
> would be easy to teach SLP vectorization to perform this
> transform on its own if it is really profitable - which I
> doubt you can reasonably argue before RA, let alone on GIMPLE).

Let's forget about load-multiple in GIMPLE. Kyrill's example shows that
unrolling at the high level means the existing loop optimizations and analysis
work as expected and we end up with good addressing modes, LDPs and
interleaving of different iterations. With the existing RTL unroller this just
isn't happening.

[Bug tree-optimization/88760] GCC unrolling is suboptimal

2019-01-24 Thread rguenther at suse dot de
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88760

--- Comment #20 from rguenther at suse dot de  ---
On Thu, 24 Jan 2019, wilco at gcc dot gnu.org wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88760
> 
> --- Comment #19 from Wilco  ---
> (In reply to rguent...@suse.de from comment #18)
> 
> > > 1) Unrolling for load-pair-forming vectorisation (Richard Sandiford's
> > > suggestion)
> > 
> > If that helps, sure (I'd have guessed uarchs are going to split
> > load-multiple into separate loads, but eventually it avoids
> > load-port contention?)
> 
> Many CPUs execute LDP/STP as a single load/store, eg. Cortex-A57 executes a
> 128-bit LDP in a single cycle (see Optimization Guide).
> 
> > > 2) Unrolling and breaking accumulator dependencies.
> > 
> > IIRC RTL unrolling can do this (as side-effect, not as main
> > cost motivation) guarded with an extra switch.
> > 
> > > I think more general unrolling and the peeling associated with it can be
> > > discussed independently of 1) and 2) once we collect more data on more
> > > microarchitectures.
> > 
> > I think both of these can be "implemented" on the RTL unroller
> > side.
> 
> You still need dependence analysis, alias info, ivopt to run again. The goal 
> is
> to remove the increment of the index, use efficient addressing modes 
> (base+imm)
> and allow scheduling to move instructions between iterations. I don't believe
> the RTL unroller supports any of this today.

There's no way to encode load-multiple on GIMPLE that wouldn't be
awkward to other GIMPLE optimizers.

Yes, the RTL unroller supports scheduling (sched runs after unrolling)
and scheduling can do dependence analysis.  Yes, the RTL unroller
does _not_ do dependence analysis at the moment, so if you want to
know beforehand whether you can interleave iterations you have to
actually perform dependence analysis.  Of course you can do that
on RTL.  And of course you can do IVOPTs on RTL (yes, we don't do that
at the moment).

Note I'm not opposed to having IVOPTs on GIMPLE itself perform
unrolling (I know Bin was against this given IVOPTs is already
so complex) and do accumulator breakup.  But I don't see how
to do the load-multiple thing (yes, you could represent it
as a vector load plus N element extracts on GIMPLE and it
would be easy to teach SLP vectorization to perform this
transform on its own if it is really profitable - which I
doubt you can reasonably argue before RA, let alone on GIMPLE).

[Bug tree-optimization/88760] GCC unrolling is suboptimal

2019-01-24 Thread wilco at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88760

--- Comment #19 from Wilco  ---
(In reply to rguent...@suse.de from comment #18)

> > 1) Unrolling for load-pair-forming vectorisation (Richard Sandiford's
> > suggestion)
> 
> If that helps, sure (I'd have guessed uarchs are going to split
> load-multiple into separate loads, but eventually it avoids
> load-port contention?)

Many CPUs execute LDP/STP as a single load/store, eg. Cortex-A57 executes a
128-bit LDP in a single cycle (see Optimization Guide).

> > 2) Unrolling and breaking accumulator dependencies.
> 
> IIRC RTL unrolling can do this (as side-effect, not as main
> cost motivation) guarded with an extra switch.
> 
> > I think more general unrolling and the peeling associated with it can be
> > discussed independently of 1) and 2) once we collect more data on more
> > microarchitectures.
> 
> I think both of these can be "implemented" on the RTL unroller
> side.

You still need dependence analysis, alias info, ivopt to run again. The goal is
to remove the increment of the index, use efficient addressing modes (base+imm)
and allow scheduling to move instructions between iterations. I don't believe
the RTL unroller supports any of this today.

[Bug tree-optimization/88760] GCC unrolling is suboptimal

2019-01-24 Thread rguenther at suse dot de
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88760

--- Comment #18 from rguenther at suse dot de  ---
On Thu, 24 Jan 2019, ktkachov at gcc dot gnu.org wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88760
> 
> --- Comment #17 from ktkachov at gcc dot gnu.org ---
> I played around with the source to do some conservative 2x manual unrolling in
> the two hottest functions in 510.parest_r (3 more-or-less identical tight FMA
> loops). This was to try out Richard's thinking suggestion in #c10 about
> unrolling for forming load-pairs, and also to break the accumulator 
> dependency.
> 
> So the initial testcase now became:
> unsigned int *colnums;
> double *val;
> 
> struct foostruct
> {
>   unsigned int rows;
>   unsigned int *colnums;
>   unsigned int *rowstart;
> };
> 
> struct foostruct *cols;
> 
> void
> foo (double * __restrict__ dst, const double *__restrict__ src)
> {
>   const unsigned int n_rows = cols->rows;
>   const double *val_ptr = &val[cols->rowstart[0]];
>   const unsigned int *colnum_ptr = &cols->colnums[cols->rowstart[0]];
> 
>   double *dst_ptr = dst;
>   for (unsigned int row=0; row<n_rows; ++row)
>     {
>       double s = 0.;
>       const double *const val_end_of_row = &val[cols->rowstart[row+1]];
>   __PTRDIFF_TYPE__ diff = val_end_of_row - val_ptr;
> 
>   if (diff & 1) // Peel the odd iteration.
> s += *val_ptr++ * src[*colnum_ptr++];
> 
>   double s1 = 0.; // Second accumulator
>   while (val_ptr != val_end_of_row)
> {
>   s += val_ptr[0] * src[colnum_ptr[0]];
>   s1 += val_ptr[1] * src[colnum_ptr[1]];
>   val_ptr += 2;
>   colnum_ptr += 2;
> }
>   *dst_ptr++ = s + s1;
> }
> }
> 
> This transformed the initial loop from:
> .L4:
> ldr w3, [x7, x2, lsl 2]
> cmp x6, x2
> ldr d2, [x5, x2, lsl 3]
> add x2, x2, 1
> ldr d1, [x1, x3, lsl 3]
> fmadd   d0, d2, d1, d0
> bne .L4
> 
> into:
> .L5:
> ldp w6, w5, [x3] // LDP
> add x3, x3, 8
> ldp d5, d3, [x2]  // LDP
> add x2, x2, 16
> ldr d4, [x1, x6, lsl 3]
> cmp x4, x2
> ldr d2, [x1, x5, lsl 3]
> fmadd   d0, d5, d4, d0
> fmadd   d1, d3, d2, d1
> bne .L5
> 
> In parest itself a few of the loops transformed this way did not form LDPs due
> to unrelated LDP-forming inefficiencies but the most did.
> This transformation gave a 3% improvement on a Cortex-A72. There are more
> similar loops in the 3rd, 4th and 5th hottest functions in that benchmark, so 
> I
> suspect if we do something intelligent there as well we'll get even more
> sizeable gains.
> 
> So rather than solving general "unrolling", how about we break this down into
> two desirable transformations:
> 1) Unrolling for load-pair-forming vectorisation (Richard Sandiford's
> suggestion)

If that helps, sure (I'd have guessed uarchs are going to split
load-multiple into separate loads, but eventually it avoids
load-port contention?)

> 2) Unrolling and breaking accumulator dependencies.

IIRC RTL unrolling can do this (as side-effect, not as main
cost motivation) guarded with an extra switch.

> I think more general unrolling and the peeling associated with it can be
> discussed independently of 1) and 2) once we collect more data on more
> microarchitectures.

I think both of these can be "implemented" on the RTL unroller
side.

[Bug tree-optimization/88760] GCC unrolling is suboptimal

2019-01-24 Thread ktkachov at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88760

--- Comment #17 from ktkachov at gcc dot gnu.org ---
I played around with the source to do some conservative 2x manual unrolling in
the two hottest functions in 510.parest_r (3 more-or-less identical tight FMA
loops). This was to try out Richard's suggestion in #c10 about
unrolling for forming load-pairs, and also to break the accumulator dependency.

So the initial testcase now became:
unsigned int *colnums;
double *val;

struct foostruct
{
  unsigned int rows;
  unsigned int *colnums;
  unsigned int *rowstart;
};

struct foostruct *cols;

void
foo (double * __restrict__ dst, const double *__restrict__ src)
{
  const unsigned int n_rows = cols->rows;
  const double *val_ptr = &val[cols->rowstart[0]];
  const unsigned int *colnum_ptr = &cols->colnums[cols->rowstart[0]];

  double *dst_ptr = dst;
  for (unsigned int row=0; row<n_rows; ++row)
    {
      double s = 0.;
      const double *const val_end_of_row = &val[cols->rowstart[row+1]];
  __PTRDIFF_TYPE__ diff = val_end_of_row - val_ptr;

  if (diff & 1) // Peel the odd iteration.
s += *val_ptr++ * src[*colnum_ptr++];

  double s1 = 0.; // Second accumulator
  while (val_ptr != val_end_of_row)
{
  s += val_ptr[0] * src[colnum_ptr[0]];
  s1 += val_ptr[1] * src[colnum_ptr[1]];
  val_ptr += 2;
  colnum_ptr += 2;
}
  *dst_ptr++ = s + s1;
}
}

This transformed the initial loop from:
.L4:
ldr w3, [x7, x2, lsl 2]
cmp x6, x2
ldr d2, [x5, x2, lsl 3]
add x2, x2, 1
ldr d1, [x1, x3, lsl 3]
fmadd   d0, d2, d1, d0
bne .L4

into:
.L5:
ldp w6, w5, [x3] // LDP
add x3, x3, 8
ldp d5, d3, [x2]  // LDP
add x2, x2, 16
ldr d4, [x1, x6, lsl 3]
cmp x4, x2
ldr d2, [x1, x5, lsl 3]
fmadd   d0, d5, d4, d0
fmadd   d1, d3, d2, d1
bne .L5

In parest itself a few of the loops transformed this way did not form LDPs due
to unrelated LDP-forming inefficiencies, but most did.
This transformation gave a 3% improvement on a Cortex-A72. There are more
similar loops in the 3rd, 4th and 5th hottest functions in that benchmark, so I
suspect if we do something intelligent there as well we'll get even more
sizeable gains.

So rather than solving general "unrolling", how about we break this down into
two desirable transformations:
1) Unrolling for load-pair-forming vectorisation (Richard Sandiford's
suggestion)
2) Unrolling and breaking accumulator dependencies.

I think more general unrolling and the peeling associated with it can be
discussed independently of 1) and 2) once we collect more data on more
microarchitectures.

[Bug tree-optimization/88760] GCC unrolling is suboptimal

2019-01-17 Thread wilco at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88760

--- Comment #16 from Wilco  ---
(In reply to rguent...@suse.de from comment #15)

> which is what I refered to for branch prediction.  Your & prompts me
> to a way to do sth similar as duffs device, turning the loop into a nest.
> 
>  head:
>if (n == 0) exit;
><1 iteration>
>if (n & 1)
>  n -= 1, goto head;
><1 iteration>
>if (n & 2)
>  n -= 2, goto head;
><2 iterations>
>if (n & 4)
>  n -= 4, goto head;
><4 iterations>
>n -= 8, goto head;
> 
> the inner loop tests should become well-predicted quickly.
> 
> But as always - more branches make it more likely that one is
> mispredicted.  For a single non-unrolled loop you usually only
> get the actual exit mispredicted.

Yes, overlapping the branches for the tail loop and the main loop will
result in more mispredictions. And there are still 4 branches for an 8x
unrolled loop, blocking optimizations and scheduling. So Duff's device is
always inefficient - the above loop is much faster like this:

if (n & 1)
 do 1 iteration
if (n & 2)
 do 2 iterations
if (n >= 4)
 do
  4 iterations
 while ((n -= 4) > 0)
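
A C rendering of that shape (a sketch; `arr', `n' and the one-instruction
body are illustrative, and the `n &= ~3' is added here to keep the sketch
self-contained):

void
sub10 (int *arr, int n)
{
  if (n & 1)                        /* do 1 iteration */
    *arr++ -= 10;
  if (n & 2)                        /* do 2 iterations */
    { arr[0] -= 10; arr[1] -= 10; arr += 2; }
  n &= ~3;                          /* a multiple of 4 remains */
  if (n >= 4)
    do                              /* 4 iterations per trip */
      {
        arr[0] -= 10; arr[1] -= 10;
        arr[2] -= 10; arr[3] -= 10;
        arr += 4;
      }
    while ((n -= 4) > 0);
}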

[Bug tree-optimization/88760] GCC unrolling is suboptimal

2019-01-17 Thread rguenther at suse dot de
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88760

--- Comment #15 from rguenther at suse dot de  ---
On Thu, 17 Jan 2019, wilco at gcc dot gnu.org wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88760
> 
> --- Comment #14 from Wilco  ---
> (In reply to rguent...@suse.de from comment #13)
> 
> > Usually the peeling is done to improve branch prediction on the
> > prologue/epilogue.
> 
> Modern branch predictors do much better on a loop than with this kind of code:
> 
> andsx12, x11, 7
> beq .L70
> cmp x12, 1
> beq .L55
> cmp x12, 2
> beq .L57
> cmp x12, 3
> beq .L59
> cmp x12, 4
> beq .L61
> cmp x12, 5
> beq .L63
> cmp x12, 6
> bne .L72
> 
> That's way too many branches close together so most predictors will hit the
> maximum branches per fetch block limit and not predict them.
> 
> If you wanted to peel it would have to be like:
> 
> if (n & 4)
>   do 4 iterations
> if (n & 2)
>   do 2 iterations
> if (n & 1)
>   do 1 iteration
> 
> However that's still way too much code explosion for little or no gain. The
> only case where this makes sense is in a handwritten memcpy.

Classic peeling would be

  do 1 iteration
  if (n == 1) exit
  do 1 iteration
  if (n == 2) exit
...

which is what I referred to for branch prediction.  Your & prompts me
to a way to do sth similar to Duff's device, turning the loop into a nest.

 head:
   if (n == 0) exit;
   <1 iteration>
   if (n & 1)
 n -= 1, goto head;
   <1 iteration>
   if (n & 2)
 n -= 2, goto head;
   <2 iterations>
   if (n & 4)
 n -= 4, goto head;
   <4 iterations>
   n -= 8, goto head;

the inner loop tests should become well-predicted quickly.

But as always - more branches make it more likely that one is
mispredicted.  For a single non-unrolled loop you usually only
get the actual exit mispredicted.
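
A literal C transcription of that nest (a sketch; the one-instruction body
is illustrative):

void
sub10 (int *arr, int n)
{
 head:
  if (n == 0)
    return;
  *arr++ -= 10;                     /* <1 iteration> */
  if (n & 1) { n -= 1; goto head; }
  *arr++ -= 10;                     /* <1 iteration> */
  if (n & 2) { n -= 2; goto head; }
  arr[0] -= 10; arr[1] -= 10;       /* <2 iterations> */
  arr += 2;
  if (n & 4) { n -= 4; goto head; }
  arr[0] -= 10; arr[1] -= 10;       /* <4 iterations> */
  arr[2] -= 10; arr[3] -= 10;
  arr += 4;
  n -= 8;
  goto head;
}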

[Bug tree-optimization/88760] GCC unrolling is suboptimal

2019-01-17 Thread wilco at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88760

--- Comment #14 from Wilco  ---
(In reply to rguent...@suse.de from comment #13)

> Usually the peeling is done to improve branch prediction on the
> prologue/epilogue.

Modern branch predictors do much better on a loop than with this kind of code:

andsx12, x11, 7
beq .L70
cmp x12, 1
beq .L55
cmp x12, 2
beq .L57
cmp x12, 3
beq .L59
cmp x12, 4
beq .L61
cmp x12, 5
beq .L63
cmp x12, 6
bne .L72

That's way too many branches close together so most predictors will hit the
maximum branches per fetch block limit and not predict them.

If you wanted to peel it would have to be like:

if (n & 4)
  do 4 iterations
if (n & 2)
  do 2 iterations
if (n & 1)
  do 1 iteration

However that's still way too much code explosion for little or no gain. The
only case where this makes sense is in a handwritten memcpy.

[Bug tree-optimization/88760] GCC unrolling is suboptimal

2019-01-17 Thread rguenther at suse dot de
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88760

--- Comment #13 from rguenther at suse dot de  ---
On Wed, 16 Jan 2019, ktkachov at gcc dot gnu.org wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88760
> 
> --- Comment #11 from ktkachov at gcc dot gnu.org ---
> Thank you all for the input.
> 
> Just to add a bit of data.
> I've instrumented 510.parest_r to count the number of loop iterations to get a
> feel for how much of the unrolled loop is spent in the actual unrolled part
> rather than the prologue/peeled part. Overall, the hot function itself is
> entered 290M times. The distribution of loop iteration counts is:
> 
> Frequency iter:
> 92438870  36
> 87028560  54
> 20404571  24
> 17312960  62
> 14237184  72
> 13403904  108
> 7574437   102
> 7574420   70
> 5564881   40
> 4328249   64
> 4328240   46
> 3142656   48
> 2666496   124
> 1248176   8
> 1236641   16
> 1166592   204
> 1166592   140
> 1134392   4
>  857088   80
>  24   92
>  24   128
>  618320   30
>  613056   1
>  234464   2
>  190464   32
>   95232   60
>   84476   20
>   48272   10
>6896   5
> 
> So the two most common iteration counts are 36 and 54. For an 8x unrolled loop
> that's 4 and 6 iterations spent in the prologue with 4 and 6 times going 
> around
> the 8x unrolled loop respectively.
> 
> As an experiment I hacked the AArch64 assembly of the function generated with
> -funroll-loops to replace the peeled prologue version with a simple
> non-unrolled loop. That gave a sizeable speedup on two AArch64 platforms: >7%.
> 
> So beyond the vectorisation point Richard S. made above, maybe it's worth
> considering replacing the peeled prologue with a simple loop instead?
> Or at least add that as a distinct unrolling strategy and work to come up with
> an analysis that would allow us to choose one over the other?

Patches welcome ;)

Usually the peeling is done to improve branch prediction on the
prologue/epilogue.

[Bug tree-optimization/88760] GCC unrolling is suboptimal

2019-01-16 Thread ktkachov at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88760

--- Comment #12 from ktkachov at gcc dot gnu.org ---
(In reply to ktkachov from comment #11)
> 
> As an experiment I hacked the AArch64 assembly of the function generated
> with -funroll-loops to replace the peeled prologue version with a simple
> non-unrolled loop. That gave a sizeable speedup on two AArch64 platforms:
> >7%.
> 
> So beyond the vectorisation point Richard S. made above, maybe it's worth
> considering replacing the peeled prologue with a simple loop instead?
> Or at least add that as a distinct unrolling strategy and work to come up
> with an analysis that would allow us to choose one over the other?

Upon reflection I think I may have bungled up the assembly hacking (the changes
I made may not be equivalent to the source). I'll redo that experiment soon, so
please disregard that part for now. The iteration count distribution numbers
are still valid though.

[Bug tree-optimization/88760] GCC unrolling is suboptimal

2019-01-16 Thread ktkachov at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88760

--- Comment #11 from ktkachov at gcc dot gnu.org ---
Thank you all for the input.

Just to add a bit of data.
I've instrumented 510.parest_r to count the number of loop iterations to get a
feel for how much of the unrolled loop is spent in the actual unrolled part
rather than the prologue/peeled part. Overall, the hot function itself is
entered 290M times. The distribution of loop iteration counts is:

Frequency iter:
92438870  36
87028560  54
20404571  24
17312960  62
14237184  72
13403904  108
7574437   102
7574420   70
5564881   40
4328249   64
4328240   46
3142656   48
2666496   124
1248176   8
1236641   16
1166592   204
1166592   140
1134392   4
 857088   80
 24   92
 24   128
 618320   30
 613056   1
 234464   2
 190464   32
  95232   60
  84476   20
  48272   10
   6896   5

So the two most common iteration counts are 36 and 54. For an 8x unrolled loop
that's 4 and 6 iterations spent in the prologue with 4 and 6 times going around
the 8x unrolled loop respectively.

As an experiment I hacked the AArch64 assembly of the function generated with
-funroll-loops to replace the peeled prologue version with a simple
non-unrolled loop. That gave a sizeable speedup on two AArch64 platforms: >7%.

So beyond the vectorisation point Richard S. made above, maybe it's worth
considering replacing the peeled prologue with a simple loop instead?
Or at least add that as a distinct unrolling strategy and work to come up with
an analysis that would allow us to choose one over the other?
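
For concreteness, a sketch of that strategy (illustrative body, not the
parest kernel): run the leftover iterations through a plain loop, then fall
into the 8x-unrolled body:

void
sub10 (int *arr, int n)
{
  int i = 0;
  for (; i < (n & 7); i++)          /* simple remainder loop (prologue) */
    arr[i] -= 10;
  for (; i < n; i += 8)             /* 8x-unrolled main loop */
    {
      arr[i]     -= 10; arr[i + 1] -= 10;
      arr[i + 2] -= 10; arr[i + 3] -= 10;
      arr[i + 4] -= 10; arr[i + 5] -= 10;
      arr[i + 6] -= 10; arr[i + 7] -= 10;
    }
}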

[Bug tree-optimization/88760] GCC unrolling is suboptimal

2019-01-16 Thread rsandifo at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88760

rsandifo at gcc dot gnu.org  changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|UNCONFIRMED                 |NEW
   Last reconfirmed|                            |2019-01-16
                 CC|                            |rsandifo at gcc dot gnu.org
     Ever confirmed|0                           |1

--- Comment #10 from rsandifo at gcc dot gnu.org  
---
FWIW, I agree that pure unrolling doesn't feel like a gimple-level
optimisation.  Whether it's a win or not depends on whether the
unrolled loop will make better use of the microarchitecture.
The problem isn't just that that's hard to decide at the gimple level,
but that the result can't be represented directly in gimple.  AIUI
there's no real significance to the schedule of gimple statements
(beyond ensuring valid SSA and functional correctness).

This is different from vectorisation and ivopts, which can represent
the benefit of the transformation directly in gimple (using vector
ops and TARGET_MEM_REFs respectively).

As Kyrill pointed out off-list, LLVM does the unrolling in the vectoriser
rather than a separate unrolling pass.  (Use -mllvm -print-after-all
to see this.)

I think for AArch64 we can view LDP and STP as 2-element vector loads
and stores that have zero-cost insertion and extraction.  So converting:

  ldr x0, [...]
  add x0, x0, 1
  str x0, [...]

into:

  ldp x0, x1, [...]
  add x0, x0, 1
  add x1, x1, 1
  stp x0, x1, [...]

is IMO genuine vectorisation.  The LDPs and STPs are effectively
scalar IFN_LOAD_LANES and IFN_STORE_LANES, although we could also
represent them as single-element (V1) vector ops instead if that
seems more consistent.

Vectorising operations other than loads and stores would simply
involve duplicating the statements VF times.
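
For illustration, a sketch of that view in GNU C vector extensions
(hypothetical example; this is not how the vectoriser currently models it):

typedef unsigned long long v2di __attribute__ ((vector_size (16)));

/* Two scalar load/add/store sequences modelled as one V2 load (the LDP),
   the scalar add duplicated VF=2 times with zero-cost element access, and
   one V2 store (the STP).  */
void
inc2 (unsigned long long *p)
{
  v2di v;
  __builtin_memcpy (&v, p, sizeof v);   /* ldp x0, x1, [p] */
  v2di w = { v[0] + 1, v[1] + 1 };      /* add x0, x0, 1; add x1, x1, 1 */
  __builtin_memcpy (p, &w, sizeof w);   /* stp x0, x1, [p] */
}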

[Bug tree-optimization/88760] GCC unrolling is suboptimal

2019-01-15 Thread rguenther at suse dot de
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88760

--- Comment #9 from rguenther at suse dot de  ---
On Tue, 15 Jan 2019, ktkachov at gcc dot gnu.org wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88760
> 
> --- Comment #8 from ktkachov at gcc dot gnu.org ---
> btw looks likes ICC vectorises this as well as unrolling:
> ..B1.14:
> movl  (%rcx,%rbx,4), %r15d  
> vmovsd(%rdi,%r15,8), %xmm2  
> movl  4(%rcx,%rbx,4), %r15d 
> vmovhpd   (%rdi,%r15,8), %xmm2, %xmm3   
> movl  8(%rcx,%rbx,4), %r15d 
> vfmadd231pd (%r10,%rbx,8), %xmm3, %xmm0 
> vmovsd(%rdi,%r15,8), %xmm4  
> movl  12(%rcx,%rbx,4), %r15d
> vmovhpd   (%rdi,%r15,8), %xmm4, %xmm5   
> movl  16(%rcx,%rbx,4), %r15d
> vfmadd231pd 16(%r10,%rbx,8), %xmm5, %xmm1   
> vmovsd(%rdi,%r15,8), %xmm6  
> movl  20(%rcx,%rbx,4), %r15d
> vmovhpd   (%rdi,%r15,8), %xmm6, %xmm7   
> movl  24(%rcx,%rbx,4), %r15d
> vfmadd231pd 32(%r10,%rbx,8), %xmm7, %xmm0   
> vmovsd(%rdi,%r15,8), %xmm8  
> movl  28(%rcx,%rbx,4), %r15d
> vmovhpd   (%rdi,%r15,8), %xmm8, %xmm9   
> vfmadd231pd 48(%r10,%rbx,8), %xmm9, %xmm1   
> addq  $8, %rbx  
> cmpq  %r14, %rbx
> jb..B1.14 
> 
> Is that something GCC could reasonably do?

GCC could choose a larger vectorization factor, yes.
The longer epilogue could be vectorized with the same
vector size again then.

[Bug tree-optimization/88760] GCC unrolling is suboptimal

2019-01-15 Thread ktkachov at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88760

--- Comment #8 from ktkachov at gcc dot gnu.org ---
btw looks likes ICC vectorises this as well as unrolling:
..B1.14:
movl  (%rcx,%rbx,4), %r15d  
vmovsd(%rdi,%r15,8), %xmm2  
movl  4(%rcx,%rbx,4), %r15d 
vmovhpd   (%rdi,%r15,8), %xmm2, %xmm3   
movl  8(%rcx,%rbx,4), %r15d 
vfmadd231pd (%r10,%rbx,8), %xmm3, %xmm0 
vmovsd(%rdi,%r15,8), %xmm4  
movl  12(%rcx,%rbx,4), %r15d
vmovhpd   (%rdi,%r15,8), %xmm4, %xmm5   
movl  16(%rcx,%rbx,4), %r15d
vfmadd231pd 16(%r10,%rbx,8), %xmm5, %xmm1   
vmovsd(%rdi,%r15,8), %xmm6  
movl  20(%rcx,%rbx,4), %r15d
vmovhpd   (%rdi,%r15,8), %xmm6, %xmm7   
movl  24(%rcx,%rbx,4), %r15d
vfmadd231pd 32(%r10,%rbx,8), %xmm7, %xmm0   
vmovsd(%rdi,%r15,8), %xmm8  
movl  28(%rcx,%rbx,4), %r15d
vmovhpd   (%rdi,%r15,8), %xmm8, %xmm9   
vfmadd231pd 48(%r10,%rbx,8), %xmm9, %xmm1   
addq  $8, %rbx  
cmpq  %r14, %rbx
jb..B1.14 

Is that something GCC could reasonably do?

[Bug tree-optimization/88760] GCC unrolling is suboptimal

2019-01-09 Thread wilco at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88760

--- Comment #7 from Wilco  ---
(In reply to rguent...@suse.de from comment #6)
> On Wed, 9 Jan 2019, wilco at gcc dot gnu.org wrote:
> 
> > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88760
> > 
> > --- Comment #5 from Wilco  ---
> > (In reply to Wilco from comment #4)
> > > (In reply to ktkachov from comment #2)
> > > > Created attachment 45386 [details]
> > > > aarch64-llvm output with -Ofast -mcpu=cortex-a57
> > > > 
> > > > I'm attaching the full LLVM aarch64 output.
> > > > 
> > > > The output you quoted is with -funroll-loops. If that's not given, GCC
> > > > doesn't seem to unroll by default at all (on aarch64 or x86_64 from my
> > > > testing).
> > > > 
> > > > Is there anything we can do to make the default unrolling a bit more
> > > > aggressive?
> > > 
> > > I don't think the RTL unroller works at all. It doesn't have the right
> > > settings, and doesn't understand how to unroll, so we always get 
> > > inefficient
> > > and bloated code.
> > > 
> > > To do unrolling correctly it has to be integrated at tree level - for
> > > example when vectorization isn't possible/beneficial, unrolling might 
> > > still
> > > be a good idea.
> > 
> > To add some numbers to the conversation, the gain LLVM gets from default
> > unrolling is 4.5% on SPECINT2017 and 1.0% on SPECFP2017.
> > 
> > This clearly shows there is huge potential from unrolling, *if* we can teach
> > GCC to unroll properly like LLVM. That means early unrolling, using good
> > default settings and using a trailing loop rather than inefficient peeling.
> 
> I don't see why this cannot be done on RTL where we have vastly more
> information of whether there are execution resources that can be
> used by unrolling.  Note we also want unrolling to interleave
> instructions to not rely on pre-reload scheduling which in turn means
> having a good eye on register pressure (again sth not very well handled
> on GIMPLE)

The main issue is that other loop optimizations are done on tree, so things
like addressing modes, loop invariants, CSEs are run on the non-unrolled
version. Then when we unroll in RTL we end up with very non-optimal code.
Typical unrolled loop starts like this:

add x13, x2, 1
add x14, x2, 2
add x11, x2, 3
add x10, x2, 4
ldr w30, [x4, x13, lsl 2]
add x9, x2, 5
add x5, x2, 6
add x12, x2, 7
ldr d23, [x3, x2, lsl 3]
... rest of unrolled loop

So basically it decides to create a new induction variable for every unrolled
copy in the loop. This often leads to spills just because it creates way too
many redundant addressing instructions. It also blocks scheduling between
iterations since the alias optimization doesn't appear to understand simple
constant differences between indices.

So unrolling should definitely be done at a high level just like vectorization.
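
To make the addressing-mode point concrete, a hand-unrolled sketch
(hypothetical function, not GCC output) of the shape a high-level unroller
should produce: a single induction variable bumped once per body, with the
unrolled copies addressed at constant offsets so the backend can use
base+imm modes and later form LDPs:

double
dot4 (const double *val, const unsigned int *col, const double *src, int n)
{
  double s = 0.0;
  int i = 0;
  for (; i + 3 < n; i += 4)             /* one IV, bumped once per body */
    {
      s += val[i]     * src[col[i]];    /* val offsets 0,8,16,24: base+imm */
      s += val[i + 1] * src[col[i + 1]];
      s += val[i + 2] * src[col[i + 2]];
      s += val[i + 3] * src[col[i + 3]];
    }
  for (; i < n; i++)                    /* scalar remainder */
    s += val[i] * src[col[i]];
  return s;
}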

[Bug tree-optimization/88760] GCC unrolling is suboptimal

2019-01-09 Thread rguenther at suse dot de
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88760

--- Comment #6 from rguenther at suse dot de  ---
On Wed, 9 Jan 2019, wilco at gcc dot gnu.org wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88760
> 
> --- Comment #5 from Wilco  ---
> (In reply to Wilco from comment #4)
> > (In reply to ktkachov from comment #2)
> > > Created attachment 45386 [details]
> > > aarch64-llvm output with -Ofast -mcpu=cortex-a57
> > > 
> > > I'm attaching the full LLVM aarch64 output.
> > > 
> > > The output you quoted is with -funroll-loops. If that's not given, GCC
> > > doesn't seem to unroll by default at all (on aarch64 or x86_64 from my
> > > testing).
> > > 
> > > Is there anything we can do to make the default unrolling a bit more
> > > aggressive?
> > 
> > I don't think the RTL unroller works at all. It doesn't have the right
> > settings, and doesn't understand how to unroll, so we always get inefficient
> > and bloated code.
> > 
> > To do unrolling correctly it has to be integrated at tree level - for
> > example when vectorization isn't possible/beneficial, unrolling might still
> > be a good idea.
> 
> To add some numbers to the conversation, the gain LLVM gets from default
> unrolling is 4.5% on SPECINT2017 and 1.0% on SPECFP2017.
> 
> This clearly shows there is huge potential from unrolling, *if* we can teach
> GCC to unroll properly like LLVM. That means early unrolling, using good
> default settings and using a trailing loop rather than inefficient peeling.

I don't see why this cannot be done on RTL where we have vastly more
information of whether there are execution resources that can be
used by unrolling.  Note we also want unrolling to interleave
instructions to not rely on pre-reload scheduling which in turn means
having a good eye on register pressure (again sth not very well handled
on GIMPLE)

[Bug tree-optimization/88760] GCC unrolling is suboptimal

2019-01-09 Thread wilco at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88760

--- Comment #5 from Wilco  ---
(In reply to Wilco from comment #4)
> (In reply to ktkachov from comment #2)
> > Created attachment 45386 [details]
> > aarch64-llvm output with -Ofast -mcpu=cortex-a57
> > 
> > I'm attaching the full LLVM aarch64 output.
> > 
> > The output you quoted is with -funroll-loops. If that's not given, GCC
> > doesn't seem to unroll by default at all (on aarch64 or x86_64 from my
> > testing).
> > 
> > Is there anything we can do to make the default unrolling a bit more
> > aggressive?
> 
> I don't think the RTL unroller works at all. It doesn't have the right
> settings, and doesn't understand how to unroll, so we always get inefficient
> and bloated code.
> 
> To do unrolling correctly it has to be integrated at tree level - for
> example when vectorization isn't possible/beneficial, unrolling might still
> be a good idea.

To add some numbers to the conversation, the gain LLVM gets from default
unrolling is 4.5% on SPECINT2017 and 1.0% on SPECFP2017.

This clearly shows there is huge potential from unrolling, *if* we can teach
GCC to unroll properly like LLVM. That means early unrolling, using good
default settings and using a trailing loop rather than inefficient peeling.

[Bug tree-optimization/88760] GCC unrolling is suboptimal

2019-01-09 Thread wilco at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88760

Wilco  changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |wilco at gcc dot gnu.org

--- Comment #4 from Wilco  ---
(In reply to ktkachov from comment #2)
> Created attachment 45386 [details]
> aarch64-llvm output with -Ofast -mcpu=cortex-a57
> 
> I'm attaching the full LLVM aarch64 output.
> 
> The output you quoted is with -funroll-loops. If that's not given, GCC
> doesn't seem to unroll by default at all (on aarch64 or x86_64 from my
> testing).
> 
> Is there anything we can do to make the default unrolling a bit more
> aggressive?

I don't think the RTL unroller works at all. It doesn't have the right
settings, and doesn't understand how to unroll, so we always get inefficient
and bloated code.

To do unrolling correctly it has to be integrated at tree level - for example
when vectorization isn't possible/beneficial, unrolling might still be a good
idea.

[Bug tree-optimization/88760] GCC unrolling is suboptimal

2019-01-09 Thread rguenth at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88760

--- Comment #3 from Richard Biener  ---
(In reply to ktkachov from comment #2)
> Created attachment 45386 [details]
> aarch64-llvm output with -Ofast -mcpu=cortex-a57
> 
> I'm attaching the full LLVM aarch64 output.
> 
> The output you quoted is with -funroll-loops. If that's not given, GCC
> doesn't seem to unroll by default at all (on aarch64 or x86_64 from my
> testing).
> 
> Is there anything we can do to make the default unrolling a bit more
> aggressive?

Well, the RTL loop unroller is not enabled by default at any
optimization level (unless you are using FDO).  There's also
related flags not enabled (-fsplit-ivs-in-unroller and
-fvariable-expansion-in-unroller).

The RTL loop unroller is simply not good at estimating benefit
of unrolling (which is also why you usually see it unrolling
--param max-unroll-times times) and the tunables it has are
not very well tuned across targets.

Micha did quite extensive benchmarking (on x86_64) which shows that
the cases where unrolling is profitable are rare and the reason
is often hard to understand.

That's of course in the context of CPUs having caches of
pre-decoded/fused/etc. instructions optimizing issue which
makes peeled prologues expensive as well as even more special
caches for small loops avoiding more frontend costs.

Not sure if arm archs have any of this.

I generally don't believe in unrolling as a separately profitable
transform.  Rather unrolling could be done as part of another
transform (vectorization is the best example).  For sth still
done on RTL that would then include scheduling which is where
the best cost estimates should be available (and if you do
this post-reload then you even have a very good idea of
register pressure).  This is also why I think a standalone
unrolling phase belongs on RTL since I don't see a good way
of estimating cost/benefit on GIMPLE (see how difficult it is
to cost vectorization vs. non-vectorization there).

[Bug tree-optimization/88760] GCC unrolling is suboptimal

2019-01-09 Thread ktkachov at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88760

--- Comment #2 from ktkachov at gcc dot gnu.org ---
Created attachment 45386
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=45386&action=edit
aarch64-llvm output with -Ofast -mcpu=cortex-a57

I'm attaching the full LLVM aarch64 output.

The output you quoted is with -funroll-loops. If that's not given, GCC doesn't
seem to unroll by default at all (on aarch64 or x86_64 from my testing).

Is there anything we can do to make the default unrolling a bit more
aggressive?

[Bug tree-optimization/88760] GCC unrolling is suboptimal

2019-01-09 Thread rguenth at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88760

--- Comment #1 from Richard Biener  ---
So LLVM unrolls 4 times while GCC (always) unrolls 8 times.  The unrolled body
for GCC (x86_64 this time) is

.L4:
movl(%rdx), %ecx
vmovsd  (%rax), %xmm8
addq$32, %rdx
addq$64, %rax
vmovsd  -56(%rax), %xmm9
vmovsd  -48(%rax), %xmm10
vfmadd231sd (%rsi,%rcx,8), %xmm8, %xmm0
movl-28(%rdx), %ecx
vmovsd  -40(%rax), %xmm11
vmovsd  -32(%rax), %xmm12
vfmadd231sd (%rsi,%rcx,8), %xmm9, %xmm0
movl-24(%rdx), %ecx
vmovsd  -24(%rax), %xmm13
vmovsd  -16(%rax), %xmm14
vfmadd231sd (%rsi,%rcx,8), %xmm10, %xmm0
movl-20(%rdx), %ecx
vmovsd  -8(%rax), %xmm15
vfmadd231sd (%rsi,%rcx,8), %xmm11, %xmm0
movl-16(%rdx), %ecx
vfmadd231sd (%rsi,%rcx,8), %xmm12, %xmm0
movl-12(%rdx), %ecx
vfmadd231sd (%rsi,%rcx,8), %xmm13, %xmm0
movl-8(%rdx), %ecx
vfmadd231sd (%rsi,%rcx,8), %xmm14, %xmm0
movl-4(%rdx), %ecx
vfmadd231sd (%rsi,%rcx,8), %xmm15, %xmm0
cmpq%rax, %r9
jne .L4

and what you quoted is the prologue.  You didn't quote LLVM's prologue,
but if I read my clang's output correctly it uses a loop there.
(is there sth like -fdump-tree-optimized for clang?)

Our RTL unroller cannot do a loopy prologue but it always has this
jump-into peeled copies thing.  Using --param max-unroll-times=4
produces

.L4:
movl(%rdx), %ecx
vmovsd  (%rax), %xmm2
addq$16, %rdx
addq$32, %rax
vmovsd  -24(%rax), %xmm3
vmovsd  -16(%rax), %xmm4
vfmadd231sd (%rsi,%rcx,8), %xmm2, %xmm0
movl-12(%rdx), %ecx
vmovsd  -8(%rax), %xmm5
vfmadd231sd (%rsi,%rcx,8), %xmm3, %xmm0
movl-8(%rdx), %ecx
vfmadd231sd (%rsi,%rcx,8), %xmm4, %xmm0
movl-4(%rdx), %ecx
vfmadd231sd (%rsi,%rcx,8), %xmm5, %xmm0
cmpq%rax, %r8
jne .L4

which is nearly equivalent to clang's variant?