[Bug tree-optimization/77498] [7 regression] Performance drop after r239414 on spec2000/172mgrid

2017-03-30 Thread rguenth at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=77498

Richard Biener  changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|ASSIGNED                    |RESOLVED
         Resolution|---                         |FIXED

--- Comment #16 from Richard Biener  ---
Fixed.

[Bug tree-optimization/77498] [7 regression] Performance drop after r239414 on spec2000/172mgrid

2017-03-30 Thread rguenth at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=77498

--- Comment #15 from Richard Biener  ---
Author: rguenth
Date: Thu Mar 30 07:15:39 2017
New Revision: 246583

URL: https://gcc.gnu.org/viewcvs?rev=246583&root=gcc&view=rev
Log:
2017-03-30  Richard Biener  

PR tree-optimization/77498
* tree-ssa-pre.c (phi_translate_1): Do not allow simplifications
to non-constants over backedges.

* gfortran.dg/pr77498.f: New testcase.

Added:
trunk/gcc/testsuite/gfortran.dg/pr77498.f
Modified:
trunk/gcc/ChangeLog
trunk/gcc/testsuite/ChangeLog
trunk/gcc/tree-ssa-pre.c

[Bug tree-optimization/77498] [7 regression] Performance drop after r239414 on spec2000/172mgrid

2017-03-28 Thread ramana at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=77498

Ramana Radhakrishnan  changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Target|arm-none-eabi               |
                 CC|                            |ramana at gcc dot gnu.org

--- Comment #14 from Ramana Radhakrishnan  ---
I don't think arm is a valid target for this, given that PR80155 was opened as a
consequence of fixing PR77498.

[Bug tree-optimization/77498] [7 regression] Performance drop after r239414 on spec2000/172mgrid

2017-03-22 Thread thopre01 at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=77498

Thomas Preud'homme  changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|thopre01 at gcc dot gnu.org |

--- Comment #13 from Thomas Preud'homme  ---
Ack, thanks Richard. Opened PR80155

[Bug tree-optimization/77498] [7 regression] Performance drop after r239414 on spec2000/172mgrid

2017-03-22 Thread rguenther at suse dot de
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=77498

--- Comment #12 from rguenther at suse dot de  ---
On Wed, 22 Mar 2017, thopre01 at gcc dot gnu.org wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=77498
> 
> --- Comment #11 from Thomas Preud'homme  ---
> (In reply to Thomas Preud'homme from comment #9)
> > Sadly I could not come up with a minimal testcase so far. What I can see
> > from the code is that tree code hoisting increases the live range of some
> > values which then translates into more spilling in reload.
> > 
> > As an approximation I'm wondering if the maximum distance (computed in
> > number of blocks traversed) from the definition to the use could be used to
> > limit when the optimization is applied when optimizing for speed.
> 
> I finally managed. The bug can be reproduced by building the following for
> arm-none-eabi with -S -O2 -mcpu=cortex-m7 and looking for the push in the
> resulting assembly code.
> 
> fn1() {
>   char *a;
>   char b;
>   for (; *a; a++) {
>     if (b)
>       a++;
>     fn2();
>   }
> }
> 
> With -O2: r3, r4, r5 and lr are pushed.
> With -O2 -fno-code-hoisting: only r4 and lr are pushed.
> 
> 
> Similarly for -mcpu=cortex-m0plus:
> 
> enum { ENUM1, ENUM2, ENUM3 } a;
> fn1() {
>   char *b;
>   for (; *b && a != ENUM2; b++)
>     switch (a) {
>     case ENUM1: a = ENUM3;
>     }
> }

But that's not caused by r239414 so please open a new bug for this.
(confirmed with a cross)

Transform:

   [85.00%]:
  # a_14 = PHI 
  if (b_7(D) != 0)
goto ; [50.00%]
  else
goto ; [50.00%]

   [42.50%]:
  goto ; [100.00%]

   [42.50%]:
  a_8 = a_14 + 1;

   [85.00%]:
  # a_2 = PHI 
  fn2 ();
  a_10 = a_2 + 1;

to

   [85.00%]:
  # a_14 = PHI 
  _4 = a_14 + 1;
  if (b_7(D) != 0)
goto ; [50.00%]
  else
goto ; [50.00%]

   [42.50%]:
  _3 = _4 + 1;

   [85.00%]:
  # a_2 = PHI 
  # prephitmp_12 = PHI <_4(3), _3(4)>
  fn2 ();

that's because the hoisting (which itself isn't a problem) makes
a_2 + 1 partially redundant over the latch.  We see this issue
in related testcases where PRE can compute a constant for the
first iteration value of expressions and thus inserts IVs for
them.  So it's nothing new and a fix would hopefully fix those
cases as well.
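
Read back into C, the two GIMPLE dumps above correspond roughly to the
following pair of functions.  This is only an illustrative sketch: the
parameters, the `calls` counter and the names `walk_original` and
`walk_hoisted` are not part of the testcase (which uses uninitialized
locals), and the second function is a hand-written approximation of the
transformed GIMPLE, not compiler output.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in for the external call in the testcase; the counter exists
   only so that the two variants can be compared.  */
static int calls;
static void fn2 (void) { calls++; }

/* Source form: 'a + 1' is computed on the conditional path and again
   in the loop increment.  */
size_t
walk_original (const char *a, char b)
{
  const char *start = a;
  assert (a != NULL);
  for (; *a; a++)
    {
      if (b)
        a++;
      fn2 ();
    }
  return (size_t) (a - start);
}

/* Approximation of the transformed loop: 'a + 1' (_4) is hoisted above
   the branch and the next pointer value (prephitmp_12) is selected
   before the call to fn2 () rather than after it.  */
size_t
walk_hoisted (const char *a, char b)
{
  const char *start = a;
  assert (a != NULL);
  while (*a)
    {
      const char *t4 = a + 1;   /* _4 = a_14 + 1 */
      const char *next = t4;    /* prephitmp_12 = PHI <_4, _3> */
      if (b)
        next = t4 + 1;          /* _3 = _4 + 1 */
      fn2 ();
      a = next;
    }
  return (size_t) (a - start);
}
```

Both functions visit the same characters and make the same number of
calls; with a non-zero `b` the input must have even length, since the
loop then only tests every second character for the terminator.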

[Bug tree-optimization/77498] [7 regression] Performance drop after r239414 on spec2000/172mgrid

2017-03-22 Thread thopre01 at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=77498

--- Comment #11 from Thomas Preud'homme  ---
(In reply to Thomas Preud'homme from comment #9)
> Sadly I could not come up with a minimal testcase so far. What I can see
> from the code is that tree code hoisting increases the live range of some
> values which then translates into more spilling in reload.
> 
> As an approximation I'm wondering if the maximum distance (computed in
> number of blocks traversed) from the definition to the use could be used to
> limit when the optimization is applied when optimizing for speed.

I finally managed. The bug can be reproduced by building the following for
arm-none-eabi with -S -O2 -mcpu=cortex-m7 and looking for the push in the
resulting assembly code.

fn1() {
  char *a;
  char b;
  for (; *a; a++) {
    if (b)
      a++;
    fn2();
  }
}

With -O2: r3, r4, r5 and lr are pushed.
With -O2 -fno-code-hoisting: only r4 and lr are pushed.


Similarly for -mcpu=cortex-m0plus:

enum { ENUM1, ENUM2, ENUM3 } a;
fn1() {
  char *b;
  for (; *b && a != ENUM2; b++)
    switch (a) {
    case ENUM1: a = ENUM3;
    }
}

[Bug tree-optimization/77498] [7 regression] Performance drop after r239414 on spec2000/172mgrid

2017-03-20 Thread rguenther at suse dot de
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=77498

--- Comment #10 from rguenther at suse dot de  ---
On Mon, 20 Mar 2017, thopre01 at gcc dot gnu.org wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=77498
> 
> --- Comment #9 from Thomas Preud'homme  ---
> Sadly I could not come up with a minimal testcase so far. What I can see from
> the code is that tree code hoisting increases the live range of some values
> which then translates into more spilling in reload.
> 
> As an approximation I'm wondering if the maximum distance (computed in number
> of blocks traversed) from the definition to the use could be used to limit when
> the optimization is applied when optimizing for speed.

Sadly the data-flow used to compute the opportunities is not suitable
for determining this.  It would probably require "aging" of exprs in
the hoistable sets when propagating the dataflow for ANTIC_IN (in
principle PRE would have a similar issue).  We already restrict
"distance" by requiring at least one successor of the hoisting point
to provide the value directly but we do not limit proximity further.

See do_hoist_insertion.
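
The "aging" of expressions mentioned above could look roughly like the
following toy model.  It is not based on the actual ANTIC_IN computation
in tree-ssa-pre.c: `NUM_EXPRS`, `MAX_AGE`, `NOT_IN_SET` and
`propagate_aged` are hypothetical names, and a real implementation would
have to work on the pass's bitmap sets.  Each anticipated expression
carries the number of blocks it has been propagated through, and
expressions that would exceed a limit are dropped from the set.

```c
#include <assert.h>
#include <stddef.h>

#define NUM_EXPRS 4      /* hypothetical: size of the expression universe */
#define NOT_IN_SET (-1)  /* marks an expression that is not anticipated */
#define MAX_AGE 2        /* hypothetical limit on blocks traversed */

/* Propagate the ages of anticipated expressions from a successor block
   into its predecessor.  An expression that survives gets one block
   older; one that would exceed MAX_AGE is dropped.  Returns the number
   of expressions kept.  */
size_t
propagate_aged (const int succ_age[NUM_EXPRS], int pred_age[NUM_EXPRS])
{
  size_t kept = 0;
  assert (succ_age != NULL && pred_age != NULL);
  for (size_t i = 0; i < NUM_EXPRS; ++i)
    {
      if (succ_age[i] == NOT_IN_SET || succ_age[i] + 1 > MAX_AGE)
        pred_age[i] = NOT_IN_SET;
      else
        {
          pred_age[i] = succ_age[i] + 1;
          kept++;
        }
    }
  return kept;
}
```

With such a scheme the hoistable set at a block would only contain
expressions whose nearest computation is at most MAX_AGE blocks away,
which approximates the distance limit asked about in comment #9.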

[Bug tree-optimization/77498] [7 regression] Performance drop after r239414 on spec2000/172mgrid

2017-03-20 Thread thopre01 at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=77498

--- Comment #9 from Thomas Preud'homme  ---
Sadly I could not come up with a minimal testcase so far. What I can see from
the code is that tree code hoisting increases the live range of some values
which then translates into more spilling in reload.

As an approximation I'm wondering if the maximum distance (computed in number
of blocks traversed) from the definition to the use could be used to limit when
the optimization is applied when optimizing for speed.

[Bug tree-optimization/77498] [7 regression] Performance drop after r239414 on spec2000/172mgrid

2017-03-10 Thread thopre01 at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=77498

Thomas Preud'homme  changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |thopre01 at gcc dot gnu.org

--- Comment #8 from Thomas Preud'homme  ---
(In reply to Richard Biener from comment #7)
> Ok, so given we can't have PRE do as good as predcom and a "cost model" for
> PRE is out of the question for GCC 7 the following dumbs down PRE again.  It
> does so in the very much simplest way rather than trying to block this only
> during elimination / insertion.  This should be definitely revisited for GCC
> 8.
> 
> Index: gcc/tree-ssa-pre.c
> ===
> --- gcc/tree-ssa-pre.c  (revision 246026)
> +++ gcc/tree-ssa-pre.c  (working copy)
> @@ -1468,10 +1468,20 @@ phi_translate_1 (pre_expr expr, bitmap_s
>leader for it.  */
> if (constant->kind != CONSTANT)
>   {
> -   unsigned value_id = get_expr_value_id (constant);
> -   constant = find_leader_in_sets (value_id, set1, set2);
> -   if (constant)
> - return constant;
> +   /* Do not allow simplifications to non-constants over
> +  backedges as this will likely result in a loop PHI node
> +  to be inserted and increased register pressure.
> +  See PR77498 - this avoids doing predcoms work in
> +  a less efficient way.  */
> +   if (find_edge (pred, phiblock)->flags & EDGE_DFS_BACK)
> + ;
> +   else
> + {
> +   unsigned value_id = get_expr_value_id (constant);
> +   constant = find_leader_in_sets (value_id, set1, set2);
> +   if (constant)
> + return constant;
> + }
>   }
> else
>   return constant;

I don't know for Yuri's issue but at least it sadly does not help with the
problem reported by Andre for arm-none-eabi [1]. I'll try to come up with a
testcase next week.

[1] https://gcc.gnu.org/bugzilla/show_bug.cgi?id=77498#c2

[Bug tree-optimization/77498] [7 regression] Performance drop after r239414 on spec2000/172mgrid

2017-03-10 Thread rguenth at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=77498

--- Comment #7 from Richard Biener  ---
Ok, so given we can't have PRE do as good as predcom and a "cost model" for PRE
is out of the question for GCC 7 the following dumbs down PRE again.  It does
so in the very much simplest way rather than trying to block this only during
elimination / insertion.  This should be definitely revisited for GCC 8.

Index: gcc/tree-ssa-pre.c
===
--- gcc/tree-ssa-pre.c  (revision 246026)
+++ gcc/tree-ssa-pre.c  (working copy)
@@ -1468,10 +1468,20 @@ phi_translate_1 (pre_expr expr, bitmap_s
   leader for it.  */
if (constant->kind != CONSTANT)
  {
-   unsigned value_id = get_expr_value_id (constant);
-   constant = find_leader_in_sets (value_id, set1, set2);
-   if (constant)
- return constant;
+   /* Do not allow simplifications to non-constants over
+  backedges as this will likely result in a loop PHI node
+  to be inserted and increased register pressure.
+  See PR77498 - this avoids doing predcoms work in
+  a less efficient way.  */
+   if (find_edge (pred, phiblock)->flags & EDGE_DFS_BACK)
+ ;
+   else
+ {
+   unsigned value_id = get_expr_value_id (constant);
+   constant = find_leader_in_sets (value_id, set1, set2);
+   if (constant)
+ return constant;
+ }
  }
else
  return constant;
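
The patch keys off the EDGE_DFS_BACK flag on the edge from pred to
phiblock.  As a reminder of what that flag means, here is a small
standalone model of back-edge marking by depth-first search, followed by
the decision the patch makes.  Nothing here is GCC code: the
adjacency-matrix CFG, `mark_back_edges` and `may_use_simplified` are made
up for illustration (in GCC the flag is set by mark_dfs_back_edges).

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define MAX_BB 8

struct toy_cfg
{
  int n;                        /* number of blocks, block 0 is the entry */
  bool edge[MAX_BB][MAX_BB];    /* edge[s][d]: there is an edge s -> d */
  bool back[MAX_BB][MAX_BB];    /* filled in by mark_back_edges */
};

enum color { WHITE, GRAY, BLACK };

static void
dfs (struct toy_cfg *g, int bb, enum color *col)
{
  col[bb] = GRAY;
  for (int d = 0; d < g->n; ++d)
    if (g->edge[bb][d])
      {
        if (col[d] == GRAY)
          g->back[bb][d] = true;   /* target is still on the DFS stack */
        else if (col[d] == WHITE)
          dfs (g, d, col);
      }
  col[bb] = BLACK;
}

/* Mark every edge whose destination is an ancestor in the DFS tree.  */
void
mark_back_edges (struct toy_cfg *g)
{
  enum color col[MAX_BB];
  assert (g != NULL && g->n > 0 && g->n <= MAX_BB);
  for (int i = 0; i < MAX_BB; ++i)
    col[i] = WHITE;
  memset (g->back, 0, sizeof g->back);
  dfs (g, 0, col);
}

/* Mirror of the patched condition: a simplification to a constant is
   always usable; one to a non-constant is only considered (subject to
   the leader lookup) when pred -> phiblock is not a back edge.  */
bool
may_use_simplified (const struct toy_cfg *g, int pred, int phiblock,
                    bool is_constant)
{
  return is_constant || !g->back[pred][phiblock];
}
```

For a simple loop with preheader 0, header 1, latch 2 and exit 3, only
the edge 2 -> 1 is marked, so a non-constant simplification is rejected
when translating through the latch but still allowed from the preheader.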

[Bug tree-optimization/77498] [7 regression] Performance drop after r239414 on spec2000/172mgrid

2017-02-20 Thread rguenth at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=77498

--- Comment #6 from Richard Biener  ---
For a testcase trying to show the issue:

double U[1024];
double V[1024];

void foo (void)
{
  for (unsigned i = 1; i < 1023; ++i)
V[i] = U[i-1] + U[i] + U[i+1];
}

we get from PRE (.optimized, w/ IVO disabled):



   [1.00%]:
  pretmp_19 = U[0];
  pretmp_21 = U[1];

   [99.00%]:
  # i_15 = PHI <_5(3), 1(2)>
  # prephitmp_20 = PHI 
  # prephitmp_22 = PHI <_6(3), pretmp_21(2)>
  # ivtmp_1 = PHI 
  _5 = i_15 + 1;
  _6 = U[_5];
  _17 = _6 + prephitmp_22;
  _7 = _17 + prephitmp_20;
  V[i_15] = _7;
  ivtmp_18 = ivtmp_1 + 4294967295;
  if (ivtmp_18 != 0)
goto ; [98.99%]
  else
goto ; [1.01%]

   [1.00%]:
  return;

while predcom does the same transform but unrolls the loop:

   [50.00%]:
  # i_15 = PHI <1(2), _43(3)>
  # ivtmp_18 = PHI <1022(2), ivtmp_49(3)>
  # U_I_lsm0.3_30 = PHI <_33(2), _6(3)>
  # U_I_lsm1.4_31 = PHI <_34(2), _44(3)>
  # ivtmp_50 = PHI <1021(2), ivtmp_51(3)>
  _5 = i_15 + 1;
  _6 = U[_5];
  _37 = U_I_lsm0.3_30 + U_I_lsm1.4_31;
  _7 = _6 + _37;
  V[i_15] = _7;
  _43 = i_15 + 2;
  _44 = U[_43];
  _46 = U_I_lsm1.4_31 + _44;
  _47 = _6 + _46;
  V[_5] = _47;
  ivtmp_49 = ivtmp_18 + 4294967294;
  ivtmp_51 = ivtmp_50 + 4294967294;
  if (ivtmp_51 > 1)
goto ; [98.00%]
  else
goto ; [2.00%]

The register pressure created by both is the same.
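
At the source level the two dumps above correspond roughly to the
following rewrites of the loop.  This is a hand-written sketch, not
compiler output: the function names and the pointer parameters are
assumptions made so the variants can be compared, and the operand order
of the additions follows the dumps, which differs from the source order.

```c
#include <assert.h>
#include <stddef.h>

#define N 1024

/* Source form: three loads per iteration.  */
void
stencil_plain (const double *u, double *v)
{
  assert (u != NULL && v != NULL);
  for (size_t i = 1; i < N - 1; ++i)
    v[i] = u[i - 1] + u[i] + u[i + 1];
}

/* PRE-like form: one load per iteration, two loop-carried values
   (prephitmp_20 and prephitmp_22 in the dump).  */
void
stencil_carried (const double *u, double *v)
{
  assert (u != NULL && v != NULL);
  double prev = u[0];           /* pretmp_19 */
  double cur = u[1];            /* pretmp_21 */
  for (size_t i = 1; i < N - 1; ++i)
    {
      double next = u[i + 1];
      v[i] = (next + cur) + prev;
      prev = cur;
      cur = next;
    }
}

/* predcom-like form: the same reuse, with the loop unrolled by two.
   The trip count N - 2 is even, so no remainder loop is needed.  */
void
stencil_unrolled (const double *u, double *v)
{
  assert (u != NULL && v != NULL);
  double a = u[0], b = u[1];    /* U_I_lsm0, U_I_lsm1 */
  for (size_t i = 1; i < N - 1; i += 2)
    {
      double c = u[i + 1];
      v[i] = c + (a + b);
      double d = u[i + 2];
      v[i + 1] = c + (b + d);
      a = c;
      b = d;
    }
}
```

Because the additions are reassociated, the three variants are only
guaranteed to give bit-identical results when the sums are exact (for
example with small integer-valued inputs) or when reassociation is
permitted, as with -ffast-math.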

For the more complex testcase PRE misses the "combination chains" because
we exclude them as

Found partial redundancy for expression {plus_expr,_2,_3} (0012)
Skipping insertion of phi for partial redundancy: Looks like an induction variable

When we mitigate that, PRE can handle all cases where association is correct.
Predictive commoning in addition to that does re-association of adds to
enable more chains (so the mitigation doesn't help for the original testcase).

Testcase that is helped this way:

double U[1024], W[1024];
double V[1024];

void foo (void)
{
  for (unsigned i = 1; i < 1023; ++i)
V[i] = (U[i-1] + W[i-1]) + (U[i] + W[i]) + (U[i+1] + W[i+1]);
}

and PRE produces

   [99.00%]:
  # i_21 = PHI <_9(3), 1(2)>
  # prephitmp_43 = PHI <_23(3), _42(2)>
  # prephitmp_45 = PHI 
  # ivtmp_40 = PHI 
  _9 = i_21 + 1;
  _10 = U[_9];
  _11 = W[_9];
  _23 = _10 + _11;
  _1 = _23 + prephitmp_43;
  _13 = _1 + prephitmp_45;
  V[i_21] = _13;
  ivtmp_38 = ivtmp_40 + 4294967295;
  if (ivtmp_38 != 0)
goto ; [98.99%]
  else
goto ; [1.01%]

note how we PRE U[] + W[] rather than U[] and W[].

Index: gcc/tree-ssa-pre.c
===
--- gcc/tree-ssa-pre.c  (revision 245594)
+++ gcc/tree-ssa-pre.c  (working copy)
@@ -3008,7 +3008,9 @@ insert_into_preds_of_block (basic_block
EDGE_PRED (block, 1)->src);
   /* Induction variables only have one edge inside the loop.  */
   if ((firstinsideloop ^ secondinsideloop)
- && expr->kind != REFERENCE)
+ && expr->kind != REFERENCE
+ && (expr->kind != NARY
+ || INTEGRAL_TYPE_P (PRE_EXPR_NARY (expr)->type)))
{
  if (dump_file && (dump_flags & TDF_DETAILS))
fprintf (dump_file, "Skipping insertion of phi for partial redundancy: Looks like an induction variable\n");

for the original testcase in this bug this removes two IVs (but the IV
detection is lame).  What we mostly want to avoid here is creating
IVs for derived IVs, it might be enough to disregard

  && expr->kind == NARY
  && (PRE_EXPR_NARY (expr)->opcode == PLUS_EXPR
  || PRE_EXPR_NARY (expr)->opcode == POINTER_PLUS_EXPR
  || PRE_EXPR_NARY (expr)->opcode == MINUS_EXPR)
  && TREE_CODE (PRE_EXPR_NARY (expr)->op[1]) == INTEGER_CST)

but as said, a solution inside PRE might improve some cases but it won't
catch the cases needing re-association.  Running pcom before PRE would
be best here but then we've been there before (and put pcom after
vectorization from previously before it).

[Bug tree-optimization/77498] [7 regression] Performance drop after r239414 on spec2000/172mgrid

2017-01-16 Thread amker at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=77498

--- Comment #5 from amker at gcc dot gnu.org ---
(In reply to Richard Biener from comment #4)
> CCing Bin, he was looking into PRE/predcom as well AFAIR.  predictive
> commoning here performs unrolling to be able to avoid some loop-carried
> dependencies
> while PRE has the larger distances covered by for example
> 
>  [85.00%]:
> # prephitmp_656 = PHI <_125(6), pretmp_655(5)>
> # prephitmp_674 = PHI 
> 
> this kind of loop carried PHIs should be a hint for a tree level unroller
> to perform unrolling (just in case they literally appear in source for
> example).
> OTOH if unrolling can solve the RA problem then it must be solvable not
> unrolled as well?  Note that with predcom we end up with 11 pointer IVs while
> with PRE we have just one (but use 20 others from the outer loop...) -
> possibly
> the versioning predcom performs makes IVO not do any outer loop IVO.  Using
> -fschedule-insns -fsched-pressure helps somewhat but not much.
> 
> So it looks like a RA related issue and IVO is as much relevant as PRE doing
> predictive commoning at -O2 (and at -O3 doing predcoms job but worse in this
> case).
> 
> During PHI translation we can tame this down to a level pre this rev. again,
> for example with the following.  But ideally we'd compute antic and do
> insertion
> for the full dataflow problem and only apply this "cost modeling" during
> elimination to not lose secondary level transforms that are profitable
> (also below we do not know whether we need to insert a PHI for the value in
> the end).
> 
> Index: gcc/tree-ssa-pre.c
> ===
> --- gcc/tree-ssa-pre.c  (revision 244484)
> +++ gcc/tree-ssa-pre.c  (working copy)
> @@ -1465,16 +1465,16 @@ phi_translate_1 (pre_expr expr, bitmap_s
>   {
> /* For non-CONSTANTs we have to make sure we can eventually
>insert the expression.  Which means we need to have a
> -  leader for it.  */
> -   if (constant->kind != CONSTANT)
> +  leader for it.  Avoid doing this across backedges though.  */
> +   if (constant->kind == CONSTANT)
> + return constant;
> +   else if (! dominated_by_p (CDI_DOMINATORS, pred, phiblock))
>   {
> unsigned value_id = get_expr_value_id (constant);
> constant = find_leader_in_sets (value_id, set1, set2);
> if (constant)
>   return constant;
>   }
> -   else
> - return constant;
>   }
>  
> tree result = vn_nary_op_lookup_pieces (newnary->length,
> 
> 
> But as said, a whole different question is whether we want PRE to add IVs at
> all
> (but we do have some testcases requesting exactly that, for example
> gcc.dg/tree-ssa/pr71347.c or ssa-pre-23.c requesting store-motion w/o
> actually sinking the store).
> 
> Index: gcc/tree-ssa-pre.c
> ===
> --- gcc/tree-ssa-pre.c  (revision 244484)
> +++ gcc/tree-ssa-pre.c  (working copy)
> @@ -4290,6 +4290,31 @@ eliminate_dom_walker::before_dom_childre
>VN_INFO_RANGE_INFO (lhs));
> }
>  
> + if (sprime
> + && TREE_CODE (sprime) == SSA_NAME
> + && do_pre
> + && loop_outer (b->loop_father)
> + && has_zero_uses (sprime)
> + && bitmap_bit_p (inserted_exprs, SSA_NAME_VERSION (sprime)))
> +   {
> + gimple *def_stmt = SSA_NAME_DEF_STMT (sprime);
> + basic_block def_bb = gimple_bb (def_stmt);
> + if (gimple_code (def_stmt) == GIMPLE_PHI
> + && def_bb->loop_father->header == def_bb)
> +   {
> + bool ok = true;
> + edge_iterator ei;
> + edge e;
> + FOR_EACH_EDGE (e, ei, def_bb->preds)
> +   if (dominated_by_p (CDI_DOMINATORS, e->src, e->dest)
> +   && TREE_CODE (PHI_ARG_DEF_FROM_EDGE (def_stmt, e)) == SSA_NAME)
> + ok = false;
> + /* Don't keep sprime available.  */
> + if (!ok)
> +   sprime = NULL_TREE;
> +   }
> +   }
> +
>   /* Inhibit the use of an inserted PHI on a loop header when
>  the address of the memory reference is a simple induction
>  variable.  In other cases the vectorizer won't do anything

I am trying to model register pressure and use that information to direct
predcom.  So far it detects only one case, 436.cactusADM, but improves that one
a lot.  It is hard to model the cost of a pre_expr in general, but for
loop-carried ones we may be able to simply control their number using the
pressure information.

[Bug tree-optimization/77498] [7 regression] Performance drop after r239414 on spec2000/172mgrid

2017-01-16 Thread rguenth at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=77498

Richard Biener  changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
           Keywords|                            |missed-optimization, ra
                 CC|                            |amker at gcc dot gnu.org

--- Comment #4 from Richard Biener  ---
CCing Bin, he was looking into PRE/predcom as well AFAIR.  predictive commoning
here performs unrolling to be able to avoid some loop-carried dependencies
while PRE has the larger distances covered by for example

 [85.00%]:
# prephitmp_656 = PHI <_125(6), pretmp_655(5)>
# prephitmp_674 = PHI 

this kind of loop carried PHIs should be a hint for a tree level unroller
to perform unrolling (just in case they literally appear in source for
example).
OTOH if unrolling can solve the RA problem then it must be solvable not
unrolled as well?  Note that with predcom we end up with 11 pointer IVs while
with PRE we have just one (but use 20 others from the outer loop...) - possibly
the versioning predcom performs makes IVO not do any outer loop IVO.  Using
-fschedule-insns -fsched-pressure helps somewhat but not much.

So it looks like a RA related issue and IVO is as much relevant as PRE doing
predictive commoning at -O2 (and at -O3 doing predcoms job but worse in this
case).

During PHI translation we can tame this down to a level pre this rev. again,
for example with the following.  But ideally we'd compute antic and do insertion
for the full dataflow problem and only apply this "cost modeling" during
elimination to not lose secondary level transforms that are profitable
(also below we do not know whether we need to insert a PHI for the value in
the end).

Index: gcc/tree-ssa-pre.c
===
--- gcc/tree-ssa-pre.c  (revision 244484)
+++ gcc/tree-ssa-pre.c  (working copy)
@@ -1465,16 +1465,16 @@ phi_translate_1 (pre_expr expr, bitmap_s
  {
/* For non-CONSTANTs we have to make sure we can eventually
   insert the expression.  Which means we need to have a
-  leader for it.  */
-   if (constant->kind != CONSTANT)
+  leader for it.  Avoid doing this across backedges though.  */
+   if (constant->kind == CONSTANT)
+ return constant;
+   else if (! dominated_by_p (CDI_DOMINATORS, pred, phiblock))
  {
unsigned value_id = get_expr_value_id (constant);
constant = find_leader_in_sets (value_id, set1, set2);
if (constant)
  return constant;
  }
-   else
- return constant;
  }

tree result = vn_nary_op_lookup_pieces (newnary->length,


But as said, a whole different question is whether we want PRE to add IVs at all
(but we do have some testcases requesting exactly that, for example
gcc.dg/tree-ssa/pr71347.c or ssa-pre-23.c requesting store-motion w/o actually
sinking the store).

Index: gcc/tree-ssa-pre.c
===
--- gcc/tree-ssa-pre.c  (revision 244484)
+++ gcc/tree-ssa-pre.c  (working copy)
@@ -4290,6 +4290,31 @@ eliminate_dom_walker::before_dom_childre
   VN_INFO_RANGE_INFO (lhs));
}

+ if (sprime
+ && TREE_CODE (sprime) == SSA_NAME
+ && do_pre
+ && loop_outer (b->loop_father)
+ && has_zero_uses (sprime)
+ && bitmap_bit_p (inserted_exprs, SSA_NAME_VERSION (sprime)))
+   {
+ gimple *def_stmt = SSA_NAME_DEF_STMT (sprime);
+ basic_block def_bb = gimple_bb (def_stmt);
+ if (gimple_code (def_stmt) == GIMPLE_PHI
+ && def_bb->loop_father->header == def_bb)
+   {
+ bool ok = true;
+ edge_iterator ei;
+ edge e;
+ FOR_EACH_EDGE (e, ei, def_bb->preds)
+   if (dominated_by_p (CDI_DOMINATORS, e->src, e->dest)
+   && TREE_CODE (PHI_ARG_DEF_FROM_EDGE (def_stmt, e)) == SSA_NAME)
+ ok = false;
+ /* Don't keep sprime available.  */
+ if (!ok)
+   sprime = NULL_TREE;
+   }
+   }
+
  /* Inhibit the use of an inserted PHI on a loop header when
 the address of the memory reference is a simple induction
 variable.  In other cases the vectorizer won't do anything

[Bug tree-optimization/77498] [7 regression] Performance drop after r239414 on spec2000/172mgrid

2017-01-12 Thread rguenth at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=77498

Richard Biener  changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
           Priority|P3                          |P1

[Bug tree-optimization/77498] [7 regression] Performance drop after r239414 on spec2000/172mgrid

2016-09-07 Thread rguenth at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=77498

Richard Biener  changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|UNCONFIRMED                 |ASSIGNED
   Last reconfirmed|                            |2016-09-07
           Assignee|unassigned at gcc dot gnu.org      |rguenth at gcc dot gnu.org
   Target Milestone|---                         |7.0
     Ever confirmed|0                           |1

--- Comment #3 from Richard Biener  ---
Note this revision isn't really related to code hoisting.  It merely allows PRE
to perform simple predictive commoning and more PRE in general.  The commoning
can interfere with sinking (see the adjusted testcase).

For the testcase we apply commoning which increases register pressure.

The pcom pass does a better job (well, it was designed for this).

I suppose this PRE improvement raises the general question (again) whether
we want it to introduce loop-carried dependences at all.  In this case
it trades 18 loads for 18 loop-carried dependences - optimally reg coalesced
and thus "free", maybe reg-reg copies or at worst spills (as seen here).

I'll need to think about this (again).

[Bug tree-optimization/77498] [7 regression] Performance drop after r239414 on spec2000/172mgrid

2016-09-06 Thread avieira at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=77498

avieira at gcc dot gnu.org changed:

   What|Removed |Added

 Target||arm-none-eabi
 CC||avieira at gcc dot gnu.org

--- Comment #2 from avieira at gcc dot gnu.org ---
I am observing some regressions for arm-none-eabi on a Cortex-M0+ for a popular
embedded benchmark following this patch.

I believe register pressure might also be the root cause of this, given the
significant increase in loads from and stores to the stack. Though I
need to have a better look.

Passing the option -fno-code-hoisting brings the performance numbers back up.

[Bug tree-optimization/77498] [7 regression] Performance drop after r239414 on spec2000/172mgrid

2016-09-06 Thread ysrumyan at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=77498

--- Comment #1 from Yuri Rumyantsev  ---
Created attachment 39574
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=39574&action=edit
test-case to reproduce

Need to compile with -O2 -ffast-math to reproduce.