[Bug tree-optimization/84037] [8 Regression] Speed regression of polyhedron benchmark since r256644

2018-04-24 Thread rguenth at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=84037
Bug 84037 depends on bug 85491, which changed state.

Bug 85491 Summary: [8 Regression] nbench LU Decomposition test 15% slower than 
GCC 7, 30% slower than peak
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=85491

   What|Removed |Added

 Status|ASSIGNED|RESOLVED
 Resolution|--- |FIXED

[Bug tree-optimization/84037] [8 Regression] Speed regression of polyhedron benchmark since r256644

2018-02-16 Thread rguenth at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=84037

--- Comment #33 from Richard Biener  ---
Author: rguenth
Date: Fri Feb 16 13:47:25 2018
New Revision: 257734

URL: https://gcc.gnu.org/viewcvs?rev=257734&root=gcc&view=rev
Log:
2018-02-16  Richard Biener  

PR tree-optimization/84037
PR tree-optimization/84016
PR target/82862
* config/i386/i386.c (ix86_builtin_vectorization_cost):
Adjust vec_construct for the fact we need additional higher latency
128bit inserts for AVX256 and AVX512 vector builds.
(ix86_add_stmt_cost): Scale vector construction cost for
elementwise loads.

Modified:
trunk/gcc/ChangeLog
trunk/gcc/config/i386/i386.c

[Bug tree-optimization/84037] [8 Regression] Speed regression of polyhedron benchmark since r256644

2018-02-16 Thread rguenth at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=84037

Richard Biener  changed:

   What|Removed |Added

 Status|ASSIGNED|RESOLVED
 Resolution|--- |FIXED

--- Comment #32 from Richard Biener  ---
Fixed.

[Bug tree-optimization/84037] [8 Regression] Speed regression of polyhedron benchmark since r256644

2018-02-14 Thread rguenth at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=84037

Richard Biener  changed:

   What|Removed |Added

 Status|NEW |ASSIGNED
   Assignee|unassigned at gcc dot gnu.org  |rguenth at gcc dot gnu.org

--- Comment #31 from Richard Biener  ---
(In reply to Jakub Jelinek from comment #30)
> Is this fixed now or is there more work to do?

I'm performance testing some more fixes.

[Bug tree-optimization/84037] [8 Regression] Speed regression of polyhedron benchmark since r256644

2018-02-14 Thread jakub at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=84037

--- Comment #30 from Jakub Jelinek  ---
Is this fixed now or is there more work to do?

[Bug tree-optimization/84037] [8 Regression] Speed regression of polyhedron benchmark since r256644

2018-02-12 Thread rguenth at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=84037

--- Comment #29 from Richard Biener  ---
Author: rguenth
Date: Mon Feb 12 13:55:04 2018
New Revision: 257588

URL: https://gcc.gnu.org/viewcvs?rev=257588&root=gcc&view=rev
Log:
2018-02-12  Richard Biener  

PR tree-optimization/84037
* tree-vect-slp.c (vect_analyze_slp_cost): Add visited
parameter, move visited init to caller.
(vect_slp_analyze_operations): Separate cost from validity
check, initialize visited once for all instances.
(vect_schedule_slp): Analyze map to CSE vectorized nodes once
for all instances.
* tree-vect-stmts.c (vect_model_simple_cost): Make early
out an assert.
(vect_model_promotion_demotion_cost): Likewise.
(vectorizable_bswap): Guard cost modeling with !slp_node
instead of !PURE_SLP_STMT to avoid double-counting on hybrid
SLP stmts.
(vectorizable_call): Likewise.
(vectorizable_conversion): Likewise.
(vectorizable_assignment): Likewise.
(vectorizable_shift): Likewise.
(vectorizable_operation): Likewise.
(vectorizable_store): Likewise.
(vectorizable_load): Likewise.
(vectorizable_condition): Likewise.
(vectorizable_comparison): Likewise.

Modified:
trunk/gcc/ChangeLog
trunk/gcc/tree-vect-slp.c
trunk/gcc/tree-vect-stmts.c

[Bug tree-optimization/84037] [8 Regression] Speed regression of polyhedron benchmark since r256644

2018-02-12 Thread rguenth at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=84037

--- Comment #28 from Richard Biener  ---
Author: rguenth
Date: Mon Feb 12 08:54:28 2018
New Revision: 257581

URL: https://gcc.gnu.org/viewcvs?rev=257581&root=gcc&view=rev
Log:
2018-02-12  Richard Biener  

PR tree-optimization/84037
* tree-vect-slp.c (vect_build_slp_tree_2): Try swapping the
matched stmts if we cannot swap the non-matched ones.

Modified:
trunk/gcc/ChangeLog
trunk/gcc/tree-vect-slp.c

[Bug tree-optimization/84037] [8 Regression] Speed regression of polyhedron benchmark since r256644

2018-02-09 Thread rguenth at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=84037

--- Comment #27 from Richard Biener  ---
(In reply to amker from comment #26)
> (In reply to amker from comment #25)
> > I tend to believe this is a register-pressure-based strength-reduction +
> > lim problem rather than an ivopts problem.
> > 
> > So given class of memory references like:
> > 
> >   reg = ...
> > Loop:
> >   MEM[iv_base + reg * 0];
> >   MEM[iv_base + reg * 1];
> >   MEM[iv_base + reg * 2];
> >   MEM[iv_base + reg * 3];
> >   MEM[iv_base + reg * 4];
> >   MEM[iv_base + reg * 5];
> >   MEM[iv_base + reg * 6];
> >   MEM[iv_base + reg * 7];
> > 
> > The best arrangement probably would be:
> > 
> >   reg = ...
> >   regX = reg * 3;
> > Loop:
> >   MEM[iv_base + reg * 0];
> >   MEM[iv_base + reg * 1];
> >   MEM[iv_base + reg * 2];
> >   t = iv_base + regX;
> >   MEM[t];
> >   MEM[iv_base + reg * 4];
> >   MEM[t + reg * 2];
> >   MEM[iv_base + regX * 2];
> >   MEM[t + reg * 4];
> > 
> > Depending on the register pressure, regX should be re-materialized in the
> > loop or hoisted as an invariant.
> 
> Of course the scales supported by addressing modes should be considered, but
> I guess most targets, if they support scaling at all, only support
> [base + reg * 2^x]?

Yes.

Note that the input to IVOPTs is the strength-reduced variant and IVOPTs
generates the variant with multiple hoisted invariants.

The IVOPTs issue I mentioned happens when you compare the following two
functions - for foo it ends up with two IVs (good).

float A[1024];
float B[1024];

void foo(int s, int r)
{
  for (int i = 0; i < 128; i+=4)
    {
      B[(i+0)*r + 0] = A[(i+0)*s + 0];
      B[(i+0)*r + 1] = A[(i+0)*s + 1];

      B[(i+1)*r + 0] = A[(i+1)*s + 0];
      B[(i+1)*r + 1] = A[(i+1)*s + 1];

      B[(i+2)*r + 0] = A[(i+2)*s + 0];
      B[(i+2)*r + 1] = A[(i+2)*s + 1];

      B[(i+3)*r + 0] = A[(i+3)*s + 0];
      B[(i+3)*r + 1] = A[(i+3)*s + 1];
    }
}

void bar(int s, int r)
{
  for (int i = 0; i < 128; i+=4)
    {
      float *b = &B[i*r];
      float *a = &A[i*s];
      b[0] = a[0];
      b[1] = a[1];

      b += r;
      a += s;
      b[0] = a[0];
      b[1] = a[1];

      b += r;
      a += s;
      b[0] = a[0];
      b[1] = a[1];

      b += r;
      a += s;
      b[0] = a[0];
      b[1] = a[1];
    }
}

The vectorizer actually emits something like the following, but it behaves
like bar.

void baz(int s, int r)
{
  float *IVb = &B[0];
  float *IVa = &A[0];
  for (int i = 0; i < 128; i+=4)
    {
      float *b = IVb;
      float *a = IVa;
      b[0] = a[0];
      b[1] = a[1];

      b += r;
      a += s;
      b[0] = a[0];
      b[1] = a[1];

      b += r;
      a += s;
      b[0] = a[0];
      b[1] = a[1];

      b += r;
      a += s;
      b[0] = a[0];
      b[1] = a[1];

      IVb += 4*r;
      IVa += 4*s;
    }
}
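
For reference, the IV choices for foo and bar can be compared directly from
the IVOPTs dump; a possible invocation (assuming a reasonably current gcc) is

  gcc -O3 -c t.c -fdump-tree-ivopts-details

with the functions above placed in t.c, then comparing the candidate and
selected-IV-set sections for the two loops in the resulting .ivopts dump.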

[Bug tree-optimization/84037] [8 Regression] Speed regression of polyhedron benchmark since r256644

2018-02-08 Thread amker at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=84037

--- Comment #26 from amker at gcc dot gnu.org ---
(In reply to amker from comment #25)
> I tend to believe this is a register-pressure-based strength-reduction +
> lim problem rather than an ivopts problem.
> 
> So given class of memory references like:
> 
>   reg = ...
> Loop:
>   MEM[iv_base + reg * 0];
>   MEM[iv_base + reg * 1];
>   MEM[iv_base + reg * 2];
>   MEM[iv_base + reg * 3];
>   MEM[iv_base + reg * 4];
>   MEM[iv_base + reg * 5];
>   MEM[iv_base + reg * 6];
>   MEM[iv_base + reg * 7];
> 
> The best arrangement probably would be:
> 
>   reg = ...
>   regX = reg * 3;
> Loop:
>   MEM[iv_base + reg * 0];
>   MEM[iv_base + reg * 1];
>   MEM[iv_base + reg * 2];
>   t = iv_base + regX;
>   MEM[t];
>   MEM[iv_base + reg * 4];
>   MEM[t + reg * 2];
>   MEM[iv_base + regX * 2];
>   MEM[t + reg * 4];
> 
> Depending on the register pressure, regX should be re-materialized in the
> loop or hoisted as an invariant.

Of course the scales supported by addressing modes should be considered, but I
guess most targets, if they support scaling at all, only support
[base + reg * 2^x]?

[Bug tree-optimization/84037] [8 Regression] Speed regression of polyhedron benchmark since r256644

2018-02-08 Thread amker at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=84037

--- Comment #25 from amker at gcc dot gnu.org ---
I tend to believe this is a register-pressure-based strength-reduction + lim
problem rather than an ivopts problem.

So given class of memory references like:

  reg = ...
Loop:
  MEM[iv_base + reg * 0];
  MEM[iv_base + reg * 1];
  MEM[iv_base + reg * 2];
  MEM[iv_base + reg * 3];
  MEM[iv_base + reg * 4];
  MEM[iv_base + reg * 5];
  MEM[iv_base + reg * 6];
  MEM[iv_base + reg * 7];

The best arrangement probably would be:

  reg = ...
  regX = reg * 3;
Loop:
  MEM[iv_base + reg * 0];
  MEM[iv_base + reg * 1];
  MEM[iv_base + reg * 2];
  t = iv_base + regX;
  MEM[t];
  MEM[iv_base + reg * 4];
  MEM[t + reg * 2];
  MEM[iv_base + regX * 2];
  MEM[t + reg * 4];

Depending on the register pressure, regX should be re-materialized in the loop
or hoisted as an invariant.
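
Spelled out as (hypothetical) C, with stride playing the role of reg and
stride3 the role of regX - a sketch of the arrangement above, not code from
the benchmark:

/* Scales 1, 2 and 4 map directly to x86 addressing modes; the *3, *5,
   *6 and *7 accesses are instead expressed against the hoisted stride3,
   so only two stride registers stay live in the loop.  */
void
consume_rows (float *base, long stride, long n, float *out)
{
  long stride3 = stride * 3;        /* regX: hoisted or re-materialized */
  for (long i = 0; i < n; i++)
    {
      float *t = base + stride3;    /* t = iv_base + regX */
      out[i] = base[0]              /* iv_base + reg * 0 */
               + base[stride]       /* iv_base + reg * 1 */
               + base[stride * 2]   /* iv_base + reg * 2 */
               + t[0]               /* reg * 3 = regX */
               + base[stride * 4]   /* iv_base + reg * 4 */
               + t[stride * 2]      /* reg * 5 = regX + reg * 2 */
               + base[stride3 * 2]  /* reg * 6 = regX * 2 */
               + t[stride * 4];     /* reg * 7 = regX + reg * 4 */
      base += stride * 8;           /* advance the base IV */
    }
}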

[Bug tree-optimization/84037] [8 Regression] Speed regression of polyhedron benchmark since r256644

2018-02-08 Thread rguenth at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=84037
Bug 84037 depends on bug 84278, which changed state.

Bug 84278 Summary: claims initv4sfv2sf is available but inits through stack
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=84278

   What|Removed |Added

 Status|ASSIGNED|RESOLVED
 Resolution|--- |FIXED

[Bug tree-optimization/84037] [8 Regression] Speed regression of polyhedron benchmark since r256644

2018-02-08 Thread rguenth at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=84037

--- Comment #24 from Richard Biener  ---
(In reply to amker from comment #23)
> (In reply to Richard Biener from comment #21)
> > So after r257453 we improve the situation pre-IVOPTs to just
> > 6 IVs (duplicated but trivially equivalent) plus one counting IV.  But then
> > when SLP is enabled IVOPTs comes along and adds another 4 IVs which makes us
> > spill... (for AVX256, so you need -march=core-avx2 for example).
> > 
> > Bin, any chance you can take a look?  In the IVO dump I see
> > 
> >   target_avail_regs 15
> >   target_clobbered_regs 9
> >   target_reg_cost 4
> >   target_spill_cost 8
> >   regs_used 3
> > ^^^
> > 
> > and regs_used looks awfully low to me.  The loop has even more IVs
> > initially, plus variable steps for those IVs, which means we need two
> > regs per IV.
> > 
> > There doesn't seem to be a way to force IVOPTs to use the minimal set of IVs?
> > Or just use the original set, removing the obvious redundancies?  There is
> > a microarchitectural issue left with the vectorization but the spilling
> > obscures the look quite a bit :/
> 
> Sure, I will have a look based on your commit.  Thanks

Note the loop in question is the one starting at line 551; it gets inlined
multiple times, but the issue is visible with -fno-inline as well.
-mavx2 makes things worse (compared to -mavx2 -mprefer-avx128) because
for the strided accesses we choose to compute extra invariants for the
two strides of A and E.  For SSE we keep stride and stride * 3, while
for AVX we additionally compute stride * 5, stride * 6 and stride * 7
(in the cases we don't choose another base IV).  At least computing stride * 6
can be avoided by using stride * 3 with step 2 - but it's probably too
hard to see that within the current IVO model?  I'm not sure avoiding
an invariant in exchange for an extra IV is ever a good idea?  Spilling
an invariant should be cheaper than spilling an IV - but yes, the
addressing mode can possibly absorb as an offset any bias we apply there.

Note the vectorizer itself tries to avoid computing stride * N by
strength-reducing it:

  _711 = (sizetype) iftmp.472_91;
  _712 = _711 * 64;
  _715 = (sizetype) iftmp.472_91;
  _716 = _715 * 8;
...
  # ivtmp_891 = PHI 
...
  _893 = MEM[(real(kind=4) *)ivtmp_891];
  ivtmp_894 = ivtmp_891 + _716;
  _895 = MEM[(real(kind=4) *)ivtmp_894];
  ivtmp_896 = ivtmp_894 + _716;
  _897 = MEM[(real(kind=4) *)ivtmp_896];
  ivtmp_898 = ivtmp_896 + _716;
  _899 = MEM[(real(kind=4) *)ivtmp_898];
  ivtmp_900 = ivtmp_898 + _716;
  _901 = MEM[(real(kind=4) *)ivtmp_900];
  ivtmp_902 = ivtmp_900 + _716;
  _903 = MEM[(real(kind=4) *)ivtmp_902];
  ivtmp_904 = ivtmp_902 + _716;
  _905 = MEM[(real(kind=4) *)ivtmp_904];
  ivtmp_906 = ivtmp_904 + _716;
  _907 = MEM[(real(kind=4) *)ivtmp_906];
  vect_cst__909 = {_893, _895, _897, _899, _901, _903, _905, _907};
...
  ivtmp_892 = ivtmp_891 + _712;

note how it advances the IV in one step at the end though.  Not sure if
IVO is confused by that or the way we compute _716 vs. _712.

That said, the summary is that IVO behavior with unrolled loop bodies with
variable stride isn't helping here ;)

[Bug tree-optimization/84037] [8 Regression] Speed regression of polyhedron benchmark since r256644

2018-02-07 Thread amker at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=84037

--- Comment #23 from amker at gcc dot gnu.org ---
(In reply to Richard Biener from comment #21)
> So after r257453 we improve the situation pre-IVOPTs to just
> 6 IVs (duplicated but trivially equivalent) plus one counting IV.  But then
> when SLP is enabled IVOPTs comes along and adds another 4 IVs which makes us
> spill... (for AVX256, so you need -march=core-avx2 for example).
> 
> Bin, any chance you can take a look?  In the IVO dump I see
> 
>   target_avail_regs 15
>   target_clobbered_regs 9
>   target_reg_cost 4
>   target_spill_cost 8
>   regs_used 3
> ^^^
> 
> and regs_used looks awfully low to me.  The loop has even more IVs initially,
> plus variable steps for those IVs, which means we need two regs per IV.
> 
> There doesn't seem to be a way to force IVOPTs to use the minimal set of IVs?
> Or just use the original set, removing the obvious redundancies?  There is
> a microarchitectural issue left with the vectorization but the spilling
> obscures the look quite a bit :/

Sure, I will have a look based on your commit.  Thanks

[Bug tree-optimization/84037] [8 Regression] Speed regression of polyhedron benchmark since r256644

2018-02-07 Thread rguenth at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=84037

--- Comment #22 from Richard Biener  ---
Author: rguenth
Date: Wed Feb  7 15:46:17 2018
New Revision: 257453

URL: https://gcc.gnu.org/viewcvs?rev=257453&root=gcc&view=rev
Log:
2018-02-07  Richard Biener  

PR tree-optimization/84037
* tree-vectorizer.h (struct _loop_vec_info): Add ivexpr_map member.
(cse_and_gimplify_to_preheader): Declare.
(vect_get_place_in_interleaving_chain): Likewise.
* tree-vect-loop.c (_loop_vec_info::_loop_vec_info): Initialize
ivexpr_map.
(_loop_vec_info::~_loop_vec_info): Delete it.
(cse_and_gimplify_to_preheader): New function.
* tree-vect-slp.c (vect_get_place_in_interleaving_chain): Export.
* tree-vect-stmts.c (vectorizable_store): CSE base and steps.
(vectorizable_load): Likewise.  For grouped stores always base
the IV on the first element.
* tree-vect-loop-manip.c (vect_loop_versioning): Unshare versioning
condition before gimplifying.

Modified:
trunk/gcc/ChangeLog
trunk/gcc/tree-vect-loop-manip.c
trunk/gcc/tree-vect-loop.c
trunk/gcc/tree-vect-slp.c
trunk/gcc/tree-vect-stmts.c
trunk/gcc/tree-vectorizer.h

[Bug tree-optimization/84037] [8 Regression] Speed regression of polyhedron benchmark since r256644

2018-02-07 Thread rguenth at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=84037

--- Comment #21 from Richard Biener  ---
So after r257453 we improve the situation pre-IVOPTs to just
6 IVs (duplicated but trivially equivalent) plus one counting IV.  But then
when SLP is enabled IVOPTs comes along and adds another 4 IVs which makes us
spill... (for AVX256, so you need -march=core-avx2 for example).

Bin, any chance you can take a look?  In the IVO dump I see

  target_avail_regs 15
  target_clobbered_regs 9
  target_reg_cost 4
  target_spill_cost 8
  regs_used 3
^^^

and regs_used looks awfully low to me.  The loop has even more IVs initially,
plus variable steps for those IVs, which means we need two regs per IV.

There doesn't seem to be a way to force IVOPTs to use the minimal set of IVs?
Or just use the original set, removing the obvious redundancies?  There is
a microarchitectural issue left with the vectorization but the spilling
obscures the look quite a bit :/
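
As far as I know there is no switch that forces the minimal IV set; the
existing knobs only bound the candidate search rather than the size of the
chosen set, e.g. (the --param names are real, the values purely illustrative):

  gcc -Ofast -march=core-avx2 --param iv-consider-all-candidates-bound=20 \
      --param iv-max-considered-uses=100 capacita2.f90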

[Bug tree-optimization/84037] [8 Regression] Speed regression of polyhedron benchmark since r256644

2018-01-31 Thread rguenth at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=84037

--- Comment #20 from Richard Biener  ---
Note that targets already have the opportunity to limit vectorization by
adjusting their finish_cost hook - here they even have more useful
information available (kind of).

[Bug tree-optimization/84037] [8 Regression] Speed regression of polyhedron benchmark since r256644

2018-01-31 Thread rguenth at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=84037

--- Comment #19 from Richard Biener  ---
On Zen I measure 23s with --param vect-max-version-for-alias-checks=0 (thus
basically before the rev.) and 33s without.  With the patch and the size
parameter tuned to 146 I get 25s and with 90 it is 22.5s.  So Zen seems
to be a bit pickier (code-size wise) than Broadwell.

[Bug tree-optimization/84037] [8 Regression] Speed regression of polyhedron benchmark since r256644

2018-01-31 Thread rguenth at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=84037

--- Comment #18 from Richard Biener  ---
(In reply to Jan Hubicka from comment #17)
> We already have
> /* This function adjusts the unroll factor based on
>the hardware capabilities. For ex, bdver3 has
>a loop buffer which makes unrolling of smaller
>loops less important. This function decides the
>unroll factor using number of memory references
>(value 32 is used) as a heuristic. */
> 
> static unsigned
> ix86_loop_unroll_adjust (unsigned nunroll, struct loop *loop)
> 
> which triggers with TARGET_ADJUST_UNROLL
> /* X86_TUNE_ADJUST_UNROLL: This enables adjusting the unroll factor based
>    on hardware capabilities. Bdver3 hardware has a loop buffer which makes
>    unrolling small loop less important. For such architectures we adjust
>    the unroll factor so that the unrolled loop fits the loop buffer.  */
> DEF_TUNE (X86_TUNE_ADJUST_UNROLL, "adjust_unroll_factor", m_BDVER3 | m_BDVER4)
> 
> so perhaps what you propose can be done by making this one more general?

Hmm.  Both i386 and s390x expect the function to be in RTL form for this hook
so it doesn't apply on GIMPLE (and gimple unrolling).

If we'd make it more general - handling both IL forms, and returning a
maximum growth factor instead of an unroll factor (a growth factor rather
than a maximum size, to avoid comparing apples (size unit of the target)
and oranges (size unit of the consumer)) - then it could indeed work.  So
something like

/* Return the maximum percentage LOOP is allowed to grow to or -1U for
   no target specific constraints.  */
unsigned targetm.loop_growth_constraint (loop *loop);

the existing uses in loop-unroll.c

  if (targetm.loop_unroll_adjust)
nunroll = targetm.loop_unroll_adjust (nunroll, loop);

would then become something like

  if (targetm.loop_unroll_adjust)
nunroll = MIN (nunroll, targetm.loop_unroll_adjust (loop) / 100);

and existing hooks would have an early out

  if (in_gimple_form)
    return -1U;

and their current return value simply multiplied by 100?

Note the current x86 implementation does

  /* Count the number of memory references within the loop body.
 This value determines the unrolling factor for bdver3 and bdver4
 architectures. */
...
  if (mem_count && mem_count <=32)
return 32/mem_count;

  return nunroll;

and thus returns 32 for mem_count == 1, overriding whatever nunroll the
caller specified (for example 2 because of some user --param).  That looks
bogus to me; we instead want to return MIN (nunroll, 32 / mem_count)?
The s390 implementation gets this right.
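
A minimal sketch of that MIN-based variant (illustrative, not an actual fix;
MIN is GCC's macro from system.h, and mem_count stands for the
memory-reference count computed exactly as in the existing implementation):

/* Cap the unroll factor at the bdver loop-buffer heuristic instead of
   overriding whatever smaller value the caller requested.  */
static unsigned
loop_unroll_adjust_sketch (unsigned nunroll, unsigned mem_count)
{
  if (mem_count && mem_count <= 32)
    return MIN (nunroll, 32 / mem_count);  /* never raise nunroll */
  return nunroll;
}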

[Bug tree-optimization/84037] [8 Regression] Speed regression of polyhedron benchmark since r256644

2018-01-31 Thread hubicka at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=84037

--- Comment #17 from Jan Hubicka  ---
We already have
/* This function adjusts the unroll factor based on
   the hardware capabilities. For ex, bdver3 has
   a loop buffer which makes unrolling of smaller
   loops less important. This function decides the
   unroll factor using number of memory references
   (value 32 is used) as a heuristic. */

static unsigned
ix86_loop_unroll_adjust (unsigned nunroll, struct loop *loop)

which triggers with TARGET_ADJUST_UNROLL
/* X86_TUNE_ADJUST_UNROLL: This enables adjusting the unroll factor based   
   on hardware capabilities. Bdver3 hardware has a loop buffer which makes  
   unrolling small loop less important. For such architectures we adjust
   the unroll factor so that the unrolled loop fits the loop buffer.  */
DEF_TUNE (X86_TUNE_ADJUST_UNROLL, "adjust_unroll_factor", m_BDVER3 | m_BDVER4)  

so perhaps what you propose can be done by making this one more general?

[Bug tree-optimization/84037] [8 Regression] Speed regression of polyhedron benchmark since r256644

2018-01-30 Thread rguenth at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=84037

--- Comment #16 from Richard Biener  ---
So discussion led to the proposal to add another unroll parameter, for example
--param small-loop-size, which serves as a "barrier" we may not cross when
optimizing a loop.  Thus for all loops <= small-loop-size before the transform,
inhibit the transform from growing the loop to > small-loop-size.  Limit other
loops like before.  The motivation is to model things like the
loop-stream-detector or uop caches, where falling out of either has a severe
performance impact.  That's without trying to combine this new heuristic with
the computed speedup from unrolling/vectorization.  The default of the
parameter would be zero (disabling the feature) and targets could set it
based on their micro-architecture and benchmarking.  It should _always_ be
smaller than --param max-unrolled-insns.

Sounds sane?
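
Sketched as a guard (the parameter and its PARAM_VALUE accessor are
hypothetical, following the proposal above; sizes are in the consumer's
own stmt-count units):

/* Return true if growing a loop from SIZE_BEFORE to SIZE_AFTER stmts is
   acceptable under the proposed small-loop barrier.  A parameter value
   of 0 disables the heuristic (the proposed default).  */
static bool
loop_growth_ok_p (unsigned size_before, unsigned size_after)
{
  unsigned barrier = PARAM_VALUE (PARAM_SMALL_LOOP_SIZE);  /* hypothetical */
  if (barrier == 0)
    return true;
  if (size_before <= barrier && size_after > barrier)
    return false;  /* a small loop may not be grown past the barrier */
  return true;     /* larger loops are limited as before */
}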

[Bug tree-optimization/84037] [8 Regression] Speed regression of polyhedron benchmark since r256644

2018-01-30 Thread rguenth at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=84037

--- Comment #15 from Richard Biener  ---
Oh, and if you don't disable inlining then you get down to sizes of 148
(SSE and SLP) and 91 and 75 (SSE and no SLP).  So you won't get rid
of two instances of vectorization regardless of the parameter
(for size 75 I don't apply PARAM_MAX_UNROLLED_INSNS because it's not at least
one full unroll copy when looking at the scalar body size of 43).

With the default param we inhibit use of SLP in

capacita2.f90:226:0: note: estimated vector body size is 19, scalar body size 2
capacita2.f90:226:0: note: not vectorized: loop grows too much.
capacita2.f90:226:0: note: estimated vector body size is 11, scalar body size 2
capacita2.f90:226:0: note: loop vectorized

capacita2.f90:259:0: note: estimated vector body size is 19, scalar body size 2
capacita2.f90:259:0: note: not vectorized: loop grows too much.
capacita2.f90:259:0: note: estimated vector body size is 11, scalar body size 2
capacita2.f90:259:0: note: loop vectorized

in addition to the critical loop (copies) at 551:

capacita2.f90:551:0: note: estimated vector body size is 298, scalar body size 43
capacita2.f90:551:0: note: not vectorized: loop grows too much.
capacita2.f90:551:0: note: estimated vector body size is 259, scalar body size 43
capacita2.f90:551:0: note: not vectorized: loop grows too much.
capacita2.f90:551:0: note: estimated vector body size is 147, scalar body size 43
capacita2.f90:551:0: note: loop vectorized
capacita2.f90:551:0: note: estimated vector body size is 259, scalar body size 43
capacita2.f90:551:0: note: not vectorized: loop grows too much.
capacita2.f90:551:0: note: estimated vector body size is 147, scalar body size 43
capacita2.f90:551:0: note: loop vectorized
capacita2.f90:551:0: note: estimated vector body size is 258, scalar body size 43
capacita2.f90:551:0: note: not vectorized: loop grows too much.
capacita2.f90:551:0: note: estimated vector body size is 259, scalar body size 43
capacita2.f90:551:0: note: not vectorized: loop grows too much.
capacita2.f90:551:0: note: estimated vector body size is 168, scalar body size 43
capacita2.f90:551:0: note: loop vectorized


I do think that applying this sort of heuristic makes sense, even if it doesn't
help the polyhedron case.

Numbers for different values of the parameter are

300 (w/o patch)  20.91user 0.05system 0:20.97elapsed
200 (default)    20.98user 0.08system 0:21.07elapsed 99%CPU
147              19.62user 0.06system 0:19.70elapsed 99%CPU
146              17.27user 0.08system 0:17.36elapsed 99%CPU
140              17.19user 0.06system 0:17.26elapsed 99%CPU
91               17.41user 0.05system 0:17.48elapsed 99%CPU
90               16.98user 0.05system 0:17.04elapsed 99%CPU
75               17.01user 0.04system 0:17.06elapsed 99%CPU
74               16.93user 0.06system 0:16.99elapsed 99%CPU
1                17.02user 0.06system 0:17.08elapsed 100%CPU

the sweet spot for this benchmark seems to be 146...

For reference with -fno-tree-vectorize I get

 18.36user 0.09system 0:18.45elapsed 99%CPU

with --param vect-max-version-for-alias-checks=0

 16.92user 0.06system 0:16.99elapsed 99%CPU
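
For anyone reproducing the sweep: with the patch from comment #14 applied, the
value being varied is the existing unroll parameter it reuses, so something
like (flags per comments #8 and #21, invocation illustrative):

  gfortran -Ofast -funroll-loops -march=core-avx2 \
      --param max-unrolled-insns=146 capacita2.f90 && time ./a.out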

[Bug tree-optimization/84037] [8 Regression] Speed regression of polyhedron benchmark since r256644

2018-01-30 Thread rguenth at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=84037

--- Comment #14 from Richard Biener  ---
Created attachment 43289
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=43289&action=edit
patch limiting growth

So I played with a simple hack limiting the amount of growth in a vectorized
loop based on our existing unroller parameters.  This causes AVX256
vectorization to be rejected, with an estimated scalar loop body size of 43
and a vector loop body size of 289 (with SLP and AVX256) or 259 (without SLP
and AVX256).  But we then still accept vectorization with SSE (vector size
with SLP: 147).

I simply count the number of stmts costed for the estimate and added

  if (vect_body_size / scalar_size > PARAM_VALUE (PARAM_MAX_UNROLL_TIMES)
      || ((vect_body_size / scalar_size > 1)
          && (vect_body_size > PARAM_VALUE (PARAM_MAX_UNROLLED_INSNS))))
    {
      dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
                       "not vectorized: loop grows too much.\n");
      return -1;
    }

where PARAM_MAX_UNROLLED_INSNS is 200 at the moment.  With SSE vectorization
the LSD.UOPS counter still doesn't trigger on the loop, so it's still too big
(it's 958 bytes; the AVX2 variant is 1365 bytes, the scalar loop is already
363 bytes).

So to fix the regression we'd need to lower the unroll parameter to 146.

[Bug tree-optimization/84037] [8 Regression] Speed regression of polyhedron benchmark since r256644

2018-01-29 Thread rguenth at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=84037

Richard Biener  changed:

   What|Removed |Added

 CC||amker at gcc dot gnu.org

--- Comment #13 from Richard Biener  ---
So performance counters on my Broadwell machine say that there are zero hits
for this vectorized loop with LSD.UOPS while there are very many for the scalar
case.  This means the loop body is too large to trigger the LSD (and probably
also to fit the uop cache).

One of the issues is that we require 4 registers for the indexes into the loads

vmovq   (%rax,%r11,2), %xmm7
vpinsrq $1, (%rax,%r13), %xmm7, %xmm4
vmovq   (%rax), %xmm7
vpinsrq $1, (%rax,%r11), %xmm7, %xmm9
vmovq   (%rax,%r15), %xmm7
vpinsrq $1, (%rax,%r12), %xmm7, %xmm3
vmovq   (%rax,%r11,4), %xmm7
vpinsrq $1, (%rax,%r14), %xmm7, %xmm1
vinserti128 $0x1, %xmm4, %ymm9, %ymm9

from

  _1507 = (void *) ivtmp.760_1462;
  _792 = MEM[base: _1507, offset: 0B];
  _1508 = (void *) ivtmp.760_1462;
  _794 = MEM[base: _1508, index: _331, offset: 0B];
  _1509 = (void *) ivtmp.760_1462;
  _796 = MEM[base: _1509, index: _331, step: 2, offset: 0B];
  _1510 = (void *) ivtmp.760_1462;
  _1511 = _331 * 3;
  _798 = MEM[base: _1510, index: _1511, offset: 0B];
  _1512 = (void *) ivtmp.760_1462;
  _800 = MEM[base: _1512, index: _331, step: 4, offset: 0B];
  _1513 = (void *) ivtmp.760_1462;
  _1514 = _331 * 5;
  _802 = MEM[base: _1513, index: _1514, offset: 0B];
  _1515 = (void *) ivtmp.760_1462;
  _1516 = _331 * 6;
  _804 = MEM[base: _1515, index: _1516, offset: 0B];
  _1517 = (void *) ivtmp.760_1462;
  _1518 = _331 * 7;
  _806 = MEM[base: _1517, index: _1518, offset: 0B];
  vect_cst__808 = {_792, _794, _796, _798, _800, _802, _804, _806};

where IVOPTs did a reasonable job.  Later LIM hoists all the invariant
_331 * N indexes.  And IVOPTs failed to realize that _331 * 3 can be used
for _331 * 6 by using step == 2.  But in the end the register-optimal
decision is probably to strength-reduce this (the vectorizer generates
strength-reduced code).

We do end up spilling most IVs in this loop.
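
The LSD measurement above can be reproduced with perf, assuming an Intel core
that exposes the event under this name:

  perf stat -e lsd.uops ./a.out

A near-zero count for the loop-dominated run means the body never executes
out of the loop stream detector.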

[Bug tree-optimization/84037] [8 Regression] Speed regression of polyhedron benchmark since r256644

2018-01-29 Thread rguenth at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=84037

--- Comment #12 from Richard Biener  ---
I have opened PR84102 for the missed optimizations in this particular loop.  I
believe now the interesting one is the other.

  30.25%  a.out  a.out      [.] __solv_cap_MOD_fourir2d
  24.83%  a.out  a.out      [.] __solv_cap_MOD_fourir
  24.83%  a.out  a.out      [.] __solv_cap_MOD_fourir2dx
  18.78%  a.out  [unknown]  [k] 0x813366e7

and the 551 loops are in the innermost nest of fourir(), which is called by /
inlined into the fourir* variants.  It's actually not the array expressions
but the different inline copies that we vectorize.  Cost model for that one:

capacita2.f90:551:0: note: Cost model analysis:
  Vector inside of loop cost: 3756
  Vector prologue cost: 64
  Vector epilogue cost: 1712
  Scalar iteration cost: 516
  Scalar outside cost: 4
  Vector outside cost: 1776
  prologue iterations: 0
  epilogue iterations: 4
  Calculated minimum iters for profitability: 0
capacita2.f90:551:0: note:   Runtime profitability threshold = 8
capacita2.f90:551:0: note:   Static estimate profitability threshold = 11545608

the loop is hybrid SLP, VF is 8

capacita2.f90:551:0: note: improved number of alias checks from 36 to 6

so here it's also dependence analysis breaking down because the step is
unknown:

(compute_affine_dependence
  stmt_a: _170 = REALPART_EXPR <*a.0_107[_54]>;
  stmt_b: _196 = REALPART_EXPR <*a.0_107[_73]>;
(analyze_overlapping_iterations
  (chrec_a = 0)
  (chrec_b = 0)
  (overlap_iterations_a = [0])
  (overlap_iterations_b = [0]))
(analyze_overlapping_iterations
  (chrec_a = {((integer(kind=8)) j0_119 + 1) * iftmp.476_91, +, iftmp.476_91}_3)
  (chrec_b = {((integer(kind=8)) j1_124 + 1) * iftmp.476_91, +, iftmp.476_91}_3)
(analyze_siv_subscript
  siv test failed: unimplemented)
  (overlap_iterations_a = not known)
  (overlap_iterations_b = not known))
) -> dependence analysis failed

the only thing we know about iftmp.476_91 is that it isn't zero (but I'm sure
dependence analysis doesn't use that fact).  I think this specific dependence
should be computable...?  [by just dividing both chrecs by iftmp.476_91?]
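
To make the suggestion concrete: dividing both chrecs by the common nonzero
factor iftmp.476_91 leaves {(integer(kind=8)) j0_119 + 1, +, 1} versus
{(integer(kind=8)) j1_124 + 1, +, 1}, a classic SIV pair.  A tiny
self-contained sketch of the resulting overlap test (illustrative C, not the
dependence-analysis API):

/* Elements (j0 + 1 + i) * s and (j1 + 1 + k) * s with the same nonzero
   stride s coincide iff j0 + i == j1 + k, so the distance j1 - j0 is
   independent of s; for i, k in [0, iters) they can overlap iff
   |j1 - j0| < iters.  */
#include <stdbool.h>
#include <stdlib.h>

static bool
may_overlap (long j0, long j1, long iters)
{
  return labs (j1 - j0) < iters;
}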

[Bug tree-optimization/84037] [8 Regression] Speed regression of polyhedron benchmark since r256644

2018-01-29 Thread rguenth at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=84037

--- Comment #11 from Richard Biener  ---
So probably the big slowdown is because the vectorized loop body is so much
larger.  Unvectorized:

.L61:
vmulss  __solv_cap_MOD_d1(%rip), %xmm4, %xmm0
incl%ecx
vmulss  (%rdx), %xmm0, %xmm0
vmovss  %xmm0, (%rdx)
addq%rax, %rdx
cmpl%r12d, %ecx
jne .L61

vectorized (see how we hoist the load), with -march=haswell:

vmulss  __solv_cap_MOD_d1(%rip), %xmm4, %xmm5
movq144(%rsp), %rdi
leaq(%rbx,%r13), %rdx
xorl%r10d, %r10d
movq%rdx, %rsi
leaq0(%r13,%rdi), %rcx
movq%rcx, %rdi
vbroadcastss%xmm5, %ymm5
.p2align 4,,10
.p2align 3
.L58:
vmovss  (%rcx,%rax,2), %xmm1
vmovss  (%rsi,%rax,2), %xmm0
incl%r10d
vinsertps   $0x10, (%rcx,%r8), %xmm1, %xmm3
vinsertps   $0x10, (%rsi,%r8), %xmm0, %xmm7
vmovss  (%rcx), %xmm1
vmovss  (%rsi), %xmm0
vinsertps   $0x10, (%rcx,%rax), %xmm1, %xmm1
vinsertps   $0x10, (%rsi,%rax), %xmm0, %xmm0
addq%r9, %rcx
addq%r9, %rsi
vmovlhps%xmm7, %xmm0, %xmm0
vmovlhps%xmm3, %xmm1, %xmm1

^^ not sure why we construct in such a strange way - ICC simply
does a single vmovss and then 7 vinsertps

vinsertf128 $0x1, %xmm1, %ymm0, %ymm0
vmulps  %ymm5, %ymm0, %ymm0
vmovss  %xmm0, (%rdx)
vextractps  $1, %xmm0, (%rdx,%rax)
vextractps  $2, %xmm0, (%rdx,%rax,2)
vextractps  $3, %xmm0, (%rdx,%r8)
vextractf128$0x1, %ymm0, %xmm0
addq%r9, %rdx
vmovss  %xmm0, (%rdi)
vextractps  $1, %xmm0, (%rdi,%rax)
vextractps  $2, %xmm0, (%rdi,%rax,2)
vextractps  $3, %xmm0, (%rdi,%r8)

Similar here.  But fixing this would only reduce this by a few stmts.

addq%r9, %rdi
cmpl%r10d, %r14d
jne .L58

Anyway, this size (140 bytes, 9 cache lines) probably blows any loop
stream cache limits (IIRC that was around 3 cache lines).  Compared
to 26 bytes for the scalar version (2 cache lines).

Any such considerations would be best placed in the target's finish_cost
hook where the target knows all stmts that are going to be emitted and
can in theory also cost against the scalar variant (not easily available).

The SSE variant is smaller, so measuring the slowdown with SSE only would
be interesting.  Hmm, the SSE variant is slower (for all of capacita) but
-fno-tree-vectorize is fastest.

[Bug tree-optimization/84037] [8 Regression] Speed regression of polyhedron benchmark since r256644

2018-01-29 Thread rguenth at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=84037

--- Comment #10 from Richard Biener  ---
So strided stores are costed as

  /* Costs of the stores.  */
  if (memory_access_type == VMAT_ELEMENTWISE
  || memory_access_type == VMAT_GATHER_SCATTER)
{
  /* N scalar stores plus extracting the elements.  */
  unsigned int assumed_nunits = vect_nunits_for_cost (vectype);
  inside_cost += record_stmt_cost (body_cost_vec,
   ncopies * assumed_nunits,
   scalar_store, stmt_info, 0, vect_body);
}
...
  if (memory_access_type == VMAT_ELEMENTWISE
  || memory_access_type == VMAT_STRIDED_SLP)
{
  /* N scalar stores plus extracting the elements.  */
  unsigned int assumed_nunits = vect_nunits_for_cost (vectype);
  inside_cost += record_stmt_cost (body_cost_vec,
   ncopies * assumed_nunits,
   vec_to_scalar, stmt_info, 0, vect_body);
}

there's the issue of "overloading" vec_to_scalar with extraction.  It's costed
as generic sse_op which IMHO is reasonable here (vextract*).

The scalar cost is 12 for each of the following stmts

  _66 = *_150[_65];
  d1.76_67 = d1;
  _160 = d1.76_67 * _73;
  _74 = _66 * _160;
  *_150[_65] = _74;

the vector variant is adding the construction/extraction cost compared
to the scalar variant and wins with the two multiplications being costed
once instead of four times.  We don't actually factor in the "win" by
hoisting the vectorized load of 'd1' only in the vector case.
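
As a cross-check, those five statements at cost 12 each account exactly for
the scalar iteration cost reported for this loop in the dump quoted in
comment #9:

  5 stmts * 12 = 60  ("Scalar iteration cost: 60")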

With AVX2 things become even more "cheap" vectorized.  And we of course
peel the epilogue completely.

Ideally we'd interchange this specific loop but interchange doesn't do
anything here because we get niters that might be zero.  Later dependences
would probably wreck things but here this also is a missed optimization.
We have two paths running into the loop loading ng1 and checking it
against zero properly but the PHI result doesn't have this range info
merged (well, VRP sets the info but it needs LIM / PRE to see the
opportunity so it's only set by late VRP).

[Bug tree-optimization/84037] [8 Regression] Speed regression of polyhedron benchmark since r256644

2018-01-29 Thread rguenth at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=84037

--- Comment #9 from Richard Biener  ---
(In reply to Martin Liška from comment #7)
> (In reply to Jakub Jelinek from comment #6)
> > Is it really r256643 and not r256644 that is causing this though?
> 
> Yes, I can verify that it's r256644 that's causing the regression.

This means that those newly vectorized loops make capacita slower.  The loop
at line 551 is

  do m=1,n/4-1
h = A(j0+m) + A(j2+m)*E(m*inc)
A(j2+m) = A(j0+m) - A(j2+m)*E(m*inc)
A(j0+m) = h
eh = conjg(E(ntot/4-m*inc))
h = A(j1+m) - A(j3+m)*eh
A(j3+m) = A(j1+m) + A(j3+m)*eh
A(j1+m) = h
  end do

but it's actually the loops from the array expressions, I guess (they receive
"interesting" locations).  The loop at line 105 is

  do i=1,Ng1 !   .. and multiply charge with x
->  do j=1,Ng2
      X(i,j) = X(i,j) * D1 * (i-(Ng1+1)/2.0_dp)
    end do
  end do

the variable strides are there because those arrays are accessed via array
descriptors.  This means that the stride will very likely be one, so any
"strided XY" vectorization will have quite a big overhead.

Of course in the end it looks like a cost model issue (which might be
just not factoring in the alias runtime check enough).  As in
the 105 case I expect the non-vectorized loop to be quite a "nice"
optimized nest with IVO doing a good job, etc.  If the inner loop
is vectorized conditionally this can wreck code generation and
runtime quite a bit.  Ng1/Ng2 are 1024 (at runtime, read from capacita.in).

For 105 we have

capacita.f90:105:0: note: need run-time check that (ssizetype) ((sizetype) prephitmp_341 * 4) is nonzero
capacita.f90:105:0: note: versioning for alias required: can't determine dependence between d1 and *_150[_65]

so two checks are needed.

capacita.f90:105:0: note: Cost model analysis:
  Vector inside of loop cost: 172
  Vector prologue cost: 60
  Vector epilogue cost: 136
  Scalar iteration cost: 60
  Scalar outside cost: 8
  Vector outside cost: 196
  prologue iterations: 0
  epilogue iterations: 2
  Calculated minimum iters for profitability: 7

I think the bug is that we're somehow assuming the vectorized arithmetic
(two multiplications) offsets the cost of the strided loads and
stores...

Testcase for this loop:

module solv_cap
  implicit none

  public  :: solveP

  integer, parameter, public :: dp = selected_real_kind(5)

  real(kind=dp), private :: Pi, eps0
  real(kind=dp), private :: D1, D2
  integer,   private, save :: Ng1=0, Ng2=0
  integer,   private, pointer, dimension(:,:)  :: Grid

contains

  subroutine solveP(P)
real(kind=dp), intent(out) :: P

real(kind=dp), allocatable, dimension(:,:)  :: Y0, X
integer :: i,j

allocate( Y0(Ng1,Ng2), X(Ng1,Ng2) )

    do i=1,Ng1
      do j=1,Ng2
        Y0(i,j) = D1 * (i-(Ng1+1)/2.0_dp) * Grid(i,j)
      end do
    end do ! RHS for in-field E_x=1.  V = -V_in = x, where metal on grid

call solve( X, Y0 )

X = X - sum(X)/size(X) ! get rid of monopole term ..
    do i=1,Ng1 !   .. and multiply charge with x
      do j=1,Ng2
        X(i,j) = X(i,j) * D1 * (i-(Ng1+1)/2.0_dp)
      end do
    end do
P = sum(X)*D1*D2 * 4*Pi*eps0   ! E-dipole moment in 1 V/m field

deallocate( X, Y0 )
return
  end subroutine solveP

end module solv_cap
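
A possible way to exercise the reduced module above (file name hypothetical,
flags as used elsewhere in this report):

  gfortran -Ofast -march=core-avx2 -c solv_cap.f90 -fdump-tree-vect-details

and then look for the cost-model analysis of the X(i,j) loop in the resulting
.vect dump.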

[Bug tree-optimization/84037] [8 Regression] Speed regression of polyhedron benchmark since r256644

2018-01-26 Thread hubicka at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=84037

--- Comment #8 from Jan Hubicka  ---
https://gcc.opensuse.org/gcc-old/c++bench-czerny/pb11/pb11-summary.txt-2-0.html
runs with -Ofast -funroll-loops, so that indeed does not seem essential to
trigger the regression (it may be two different regressions, of course).

Note that tfft, aermod and tfft2 also show regressions that may be easy to
track/fix.

[Bug tree-optimization/84037] [8 Regression] Speed regression of polyhedron benchmark since r256644

2018-01-25 Thread marxin at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=84037

Martin Liška  changed:

   What|Removed |Added

Summary|[8 Regression] Speed|[8 Regression] Speed
   |regression of polyhedron|regression of polyhedron
   |benchmark since r256643 |benchmark since r256644

--- Comment #7 from Martin Liška  ---
(In reply to Jakub Jelinek from comment #6)
> Is it really r256643 and not r256644 that is causing this though?

Yes, I can verify that it's r256644 that's causing the regression.