[Bug tree-optimization/77445] [7 Regression] Performance drop after r239219 on coremark test

2017-02-04 Thread hubicka at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=77445

Jan Hubicka  changed:

   What|Removed |Added

 Status|ASSIGNED|RESOLVED
 Resolution|--- |FIXED

--- Comment #25 from Jan Hubicka  ---
We now do all the threading possible :)

[Bug tree-optimization/77445] [7 Regression] Performance drop after r239219 on coremark test

2017-02-02 Thread hubicka at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=77445

--- Comment #24 from Jan Hubicka  ---
Author: hubicka
Date: Thu Feb  2 20:22:13 2017
New Revision: 245135

URL: https://gcc.gnu.org/viewcvs?rev=245135=gcc=rev
Log:

PR middle-end/77445
* gcc.dg/tree-ssa/pr77445-2.c: Update testcase to check that all
threading is done.
* tree-ssa-threadbackward.c (profitable_jump_thread_path): Dump
statistics of the analyzed path; allow threading for speed when
any of BBs along the path are optimized for speed.

Modified:
trunk/gcc/ChangeLog
trunk/gcc/testsuite/ChangeLog
trunk/gcc/testsuite/gcc.dg/tree-ssa/pr77445-2.c
trunk/gcc/tree-ssa-threadbackward.c

[Bug tree-optimization/77445] [7 Regression] Performance drop after r239219 on coremark test

2017-01-24 Thread hubicka at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=77445

--- Comment #23 from Jan Hubicka  ---
I will implement the "allow size growth when there is a hot bb on a path"
heuristics.  We may try to get smarter on when the peeling actually helps the
unpeeled path, but that is bit hard to do because there may be many threadable
paths on the way (just as in the example) and we will never be able to fully
judge optimization oppurtunities enabled by peeling (such as better VRP
propagation, for example)

I doubt being on aggressive side here will have large effect on code size.

On unrelated note this testcase made me think that perhaps for -Ofast we want
to adjust hot/cold heuristics so we are more aggressive on guessing that BB is
hot as well (for -O2 we want to put into account code size implications, but
for -Ofast we are probably permitted to drop them)

Concerning peeling, we are probably always safe to decide to peel when the
transformation decreases code size which is the case of all those early
transforms I looked into (about 200-300 out of 700 on tramp3d). In fact such
complete peeling ought to happen somewhere in the early pass queue. 
We can punt on non-size-decreasing peelings leaving them to dedicated pass. The
loop peeling pass has its own way on predicting if something optimizes away. It
does take into account induction variables but no phis that happens to be
constant across the peeled path. Perhaps it should be merged with jump threader
and the analysis part should be done in a common way.
Backward threader should likely get smarter over time, taking into account, for
example, the value ranges and this way both passes will benefit.

[Bug tree-optimization/77445] [7 Regression] Performance drop after r239219 on coremark test

2017-01-23 Thread law at redhat dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=77445

--- Comment #22 from Jeffrey A. Law  ---
In response to c#19.  Yes, a thread path which is cold except for a hot block
in the middle might be profitable to thread as it will sometimes expose path
specific redundancies/simplifications in the hot block.

Whether or not it does is mostly a function of the PHI nodes in the hot block. 
ISTM that these opportunities are most likely to arise if threading results in
a degenerate PHI in the hot block.  Which is something we ought to be able to
test for.

Not sure if that helps this testcase or not.

[Bug tree-optimization/77445] [7 Regression] Performance drop after r239219 on coremark test

2017-01-23 Thread law at redhat dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=77445

--- Comment #21 from Jeffrey A. Law  ---
In response to c#20.  We have to be careful about fixing up the loop after
rotation and avoid repeatedly peeling.  The old threader dealt with those by
being fairly conservative in passes before the loop optimizer and much more
aggressive later.

Zdenek's general argument was that peeling in the threader is fundamentally
wrong because we really only know general loop structure, but not iteration
information, data dependencies, etc (which directly affect vectorization).  If
peeling a loop inhibits vectorization, then we've most likely lost in a big
way.

I think the backward threader punts all these issues, which is probably the
better choice than aggressively trying to handle them right now.

[Bug tree-optimization/77445] [7 Regression] Performance drop after r239219 on coremark test

2017-01-23 Thread rguenther at suse dot de
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=77445

--- Comment #20 from rguenther at suse dot de  ---
On Mon, 23 Jan 2017, hubicka at gcc dot gnu.org wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=77445
> 
> --- Comment #19 from Jan Hubicka  ---
> One change that would make sense to me is to always thread when there is a
> non-cold block on the path: we are not only improving the path taken when
> threading but because we eliminate incoming edges we also permit better
> optimization of non-thread path later.
> 
> This way we thread the branch, because there is still non-cold BB on a way.
> 
> On related note, I have noticed that we give up on about 30% of jump threads 
> in
> early optimization because of:
> 
>/* Skip edges pointing outside the current loop.  */
>if (!arg || var_bb->loop_father != bbi->loop_father)
> continue;
> 
> This is very common case such as in:
> char *c;
> int t()
> {
>   for (int i=0;i<5000;i++)
> c[i]=i;
> }
> 
> before loop header copying there is threadable branch at first iteration.
> Threading it before profile construction would be quite desirable.
> 
> There seem to be code deciding on loops when checking path profitability. It
> seems to me that at least the code can be adjusted to accept thread starting 
> by
> loop header when there is only single edge in becausee that can't make
> irreducible loops?

But it will rotate the loop and thus change the loop header block?  Then
we need to make sure to preserve that properly to not destroy loop
annotations by removing/rediscovering the loop.

Also we need to make sure to not repeatedly peel the loop via such 
mechanism.

[Bug tree-optimization/77445] [7 Regression] Performance drop after r239219 on coremark test

2017-01-23 Thread hubicka at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=77445

--- Comment #19 from Jan Hubicka  ---
One change that would make sense to me is to always thread when there is a
non-cold block on the path: we are not only improving the path taken when
threading but because we eliminate incoming edges we also permit better
optimization of non-thread path later.

This way we thread the branch, because there is still non-cold BB on a way.

On related note, I have noticed that we give up on about 30% of jump threads in
early optimization because of:

   /* Skip edges pointing outside the current loop.  */
   if (!arg || var_bb->loop_father != bbi->loop_father)
continue;

This is very common case such as in:
char *c;
int t()
{
  for (int i=0;i<5000;i++)
c[i]=i;
}

before loop header copying there is threadable branch at first iteration.
Threading it before profile construction would be quite desirable.

There seem to be code deciding on loops when checking path profitability. It
seems to me that at least the code can be adjusted to accept thread starting by
loop header when there is only single edge in becausee that can't make
irreducible loops?

Honza

[Bug tree-optimization/77445] [7 Regression] Performance drop after r239219 on coremark test

2017-01-21 Thread law at redhat dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=77445

--- Comment #18 from Jeffrey A. Law  ---
>From just looking at the paths, I would expect it to matter -- they're still
cases where we're threading the multiway branch and that's the key to this
benchmark -- avoiding the multiway branch.

[Bug tree-optimization/77445] [7 Regression] Performance drop after r239219 on coremark test

2017-01-19 Thread hubicka at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=77445

--- Comment #17 from Jan Hubicka  ---
As reported in PR77484, coremark is now up by 20% or more.
Are we out of regression land now? If not does the patch in #15 help?

[Bug tree-optimization/77445] [7 Regression] Performance drop after r239219 on coremark test

2017-01-17 Thread seurer at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=77445

seurer at gcc dot gnu.org changed:

   What|Removed |Added

 CC||seurer at gcc dot gnu.org

--- Comment #16 from seurer at gcc dot gnu.org ---
One of the new tests fails on powerpc (both LE and BE):

Executing on host: /home/seurer/gcc/build/gcc-test2/gcc/xgcc
-B/home/seurer/gcc/build/gcc-test2/gcc/
/home/seurer/gcc/gcc-test2/gcc/testsuite/gcc.dg/tree-ssa/pr77445-2.c 
-fno-diagnostics-show-caret -fdiagnostics-color=never   -O2
-fdump-tree-thread1-details-blocks-stats -S   -o pr77445-2.s(timeout = 300)
spawn -ignore SIGHUP /home/seurer/gcc/build/gcc-test2/gcc/xgcc
-B/home/seurer/gcc/build/gcc-test2/gcc/
/home/seurer/gcc/gcc-test2/gcc/testsuite/gcc.dg/tree-ssa/pr77445-2.c
-fno-diagnostics-show-caret -fdiagnostics-color=never -O2
-fdump-tree-thread1-details-blocks-stats -S -o pr77445-2.s
PASS: gcc.dg/tree-ssa/pr77445-2.c (test for excess errors)
FAIL: gcc.dg/tree-ssa/pr77445-2.c scan-tree-dump thread1 "Jumps threaded: 16"
PASS: gcc.dg/tree-ssa/pr77445-2.c scan-tree-dump-times thread1 "Invalid sum" 2
testcase /home/seurer/gcc/gcc-test2/gcc/testsuite/gcc.dg/tree-ssa/tree-ssa.exp
completed in 0 seconds

=== gcc Summary ===

# of expected passes2
# of unexpected failures1

[Bug tree-optimization/77445] [7 Regression] Performance drop after r239219 on coremark test

2017-01-17 Thread hubicka at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=77445

--- Comment #15 from Jan Hubicka  ---
Note that the remaining missed threads loop exit condition test state !=
INVALID which after sequence of threads gets to probability 0 because original
guess from profile_estimate is not realistic.

I guess such repeated threading may be common. It may be interesting to
benchmark the following hack both for speed and size:

Index: tree-ssa-threadbackward.c
===
--- tree-ssa-threadbackward.c   (revision 244527)
+++ tree-ssa-threadbackward.c   (working copy)
@@ -311,7 +311,11 @@ profitable_jump_thread_path (veclength () - 1],
+  (*path)[path->length () - 2]
 {
   if (n_insns >= PARAM_VALUE (PARAM_MAX_FSM_THREAD_PATH_INSNS))
{

It makes us to consider threading the path if at least it starts by an hot
edge.

[Bug tree-optimization/77445] [7 Regression] Performance drop after r239219 on coremark test

2017-01-17 Thread hubicka at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=77445

--- Comment #14 from Jan Hubicka  ---
With the patch we only give up on some threading in thread4:

q.c.181t.thread4:FSM jump-thread path not considered: duplication of 5 insns is
needed and optimizing for size.
q.c.181t.thread4:FSM jump-thread path not considered: duplication of 5 insns is
needed and optimizing for size.
q.c.181t.thread4:FSM jump-thread path not considered: duplication of 5 insns is
needed and optimizing for size.
q.c.181t.thread4:FSM jump-thread path not considered: duplication of 5 insns is
needed and optimizing for size.
q.c.181t.thread4:FSM jump-thread path not considered: duplication of 5 insns is
needed and optimizing for size.
q.c.181t.thread4:FSM jump-thread path not considered: duplication of 5 insns is
needed and optimizing for size.
q.c.181t.thread4:FSM jump-thread path not considered: duplication of 5 insns is
needed and optimizing for size.
q.c.181t.thread4:FSM jump-thread path not considered: duplication of 5 insns is
needed and optimizing for size.
q.c.181t.thread4:FSM jump-thread path not considered: duplication of 5 insns is
needed and optimizing for size.


How does that affect the actual benchmark? We may experiment with jump
threading whenever the destination BB is hot, not just edge itself as
duplicating the path also simplifies the surrounding code.

[Bug tree-optimization/77445] [7 Regression] Performance drop after r239219 on coremark test

2017-01-17 Thread hubicka at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=77445

--- Comment #13 from Jan Hubicka  ---
Author: hubicka
Date: Tue Jan 17 12:49:41 2017
New Revision: 244528

URL: https://gcc.gnu.org/viewcvs?rev=244528=gcc=rev
Log:

PR middle-end/77445
* tree-ssa-threadupdate.c (remove_ctrl_stmt_and_useless_edges):
correctly set frequency of oudgoing edge.
(duplicate_thread_path): Fix profile updating.
* gcc.dg/tree-ssa/pr77445-2.c: New testcase.
* gcc.dg/tree-ssa/pr77445.c: New testcase.

Added:
trunk/gcc/testsuite/gcc.dg/tree-ssa/pr77445-2.c
trunk/gcc/testsuite/gcc.dg/tree-ssa/pr77445.c
Modified:
trunk/gcc/ChangeLog
trunk/gcc/testsuite/ChangeLog
trunk/gcc/tree-ssa-threadupdate.c

[Bug tree-optimization/77445] [7 Regression] Performance drop after r239219 on coremark test

2017-01-16 Thread hubicka at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=77445

--- Comment #12 from Jan Hubicka  ---
Created attachment 40526
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=40526=edit
Patch I am testing

The profile is quite inconsistent since thread1.
The problem is that duplicate_thread_path does
  if (total_count)  
{   
  scale_bbs_frequencies_gcov_type (region, n_region,
   total_count - entry_count,   
   total_count);
  scale_bbs_frequencies_gcov_type (region_copy, n_region, entry_count,  
   total_count);
}   
  else  
{   
  scale_bbs_frequencies_int (region, n_region, total_freq - entry_freq, 
 total_freq);   
  scale_bbs_frequencies_int (region_copy, n_region, entry_freq,
total_freq);
}   

This is wrong when the duplicated path have extra entries that are not
duplicated which happens in the testcase.

The attached patch fixes it.  It is still not 100% correct as the outgoing
probabilities of the original path should be updated based on the path being
copied out, but I am not sure it is worth the trouble (and risk of roundoff
errors). I will think a bit more about this tomorrow.

[Bug tree-optimization/77445] [7 Regression] Performance drop after r239219 on coremark test

2017-01-12 Thread rguenth at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=77445

Richard Biener  changed:

   What|Removed |Added

   Priority|P3  |P1

[Bug tree-optimization/77445] [7 Regression] Performance drop after r239219 on coremark test

2016-12-17 Thread hubicka at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=77445

Jan Hubicka  changed:

   What|Removed |Added

 Status|NEW |ASSIGNED
   Assignee|unassigned at gcc dot gnu.org  |hubicka at gcc dot 
gnu.org

--- Comment #11 from Jan Hubicka  ---
I will take a look. I think we should try to fix the profile updating after
jump threading again and see if we need any hacks after that.

[Bug tree-optimization/77445] [7 Regression] Performance drop after r239219 on coremark test

2016-12-16 Thread rguenther at suse dot de
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=77445

--- Comment #10 from rguenther at suse dot de  ---
On Wed, 14 Dec 2016, jgreenhalgh at gcc dot gnu.org wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=77445
> 
> --- Comment #8 from James Greenhalgh  ---
> Is anyone currently looking at this?
> 
> If the bug is still blocked on correcting the profile information (which 
> sounds
> like a large job)

But it's the actual regression...

[Bug tree-optimization/77445] [7 Regression] Performance drop after r239219 on coremark test

2016-12-14 Thread law at redhat dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=77445

Jeffrey A. Law  changed:

   What|Removed |Added

 CC||law at redhat dot com

--- Comment #9 from Jeffrey A. Law  ---
Not looking at it right now, but given it's a regression relative to gcc-6
it'll have to be looked at during this release cycle.

[Bug tree-optimization/77445] [7 Regression] Performance drop after r239219 on coremark test

2016-12-14 Thread jgreenhalgh at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=77445

--- Comment #8 from James Greenhalgh  ---
Is anyone currently looking at this?

If the bug is still blocked on correcting the profile information (which sounds
like a large job), should we consider weakening or reverting the cost model for
GCC 7?

[Bug tree-optimization/77445] [7 Regression] Performance drop after r239219 on coremark test

2016-11-30 Thread jgreenhalgh at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=77445

--- Comment #7 from James Greenhalgh  ---
Right, I've trimmed too much context from my message.

This performance regression starts with r239219 which adds a cost model to the
threader which relies on frequency information (arguably this is a bad cost
model for threading, as at a switch statement you might expect multiple cold
edges, and still want to thread the switch, but that's a separate discussion).
The threader does a bad job of updating frequency information when it creates
new paths, with the effect that the edges we'd want to thread in this test case
appear to be cold. The new cost model from r239219 sees the cold edges, and
rejects the threading opportunity.

The message I was replying to above had said:

  > Hmm, this is interesting. The patch should have "fixed" the previous
  > degradation by making the profile correct (backward threader still doe not
  > update it, but because most threading now happens early and profile is
built
  > afterwards this should be less of issue).  I am now looking into the
profile
  > update issues and will try to check why coremarks degrade again.

The answer to which is that the early-threader has hard-coded that it is
compiling for size, which causes most backward threading to be rejected, so
wouldn't fix this issue.

However, if we were to use optimize_bb_for_size_p in
pass_early_thread_jumps::execute rather than just passing false then the early
threader would have resolved this issue (as the profile information is not used
to decide if the edge should be optimised for speed).

[Bug tree-optimization/77445] [7 Regression] Performance drop after r239219 on coremark test

2016-11-30 Thread rguenth at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=77445

Richard Biener  changed:

   What|Removed |Added

 CC||rguenth at gcc dot gnu.org

--- Comment #6 from Richard Biener  ---
But the early threader is new and can't be responsible for the regression.

[Bug tree-optimization/77445] [7 Regression] Performance drop after r239219 on coremark test

2016-11-30 Thread jgreenhalgh at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=77445

James Greenhalgh  changed:

   What|Removed |Added

   Last reconfirmed|2016-09-03 00:00:00 |2016-11-30
 CC||law at gcc dot gnu.org

--- Comment #5 from James Greenhalgh  ---
I posted this on list a few weeks back:

  https://gcc.gnu.org/ml/gcc-patches/2016-10/msg01454.html

The early threader is running with speed_p set to false (second parameter to
find_jump_threads_backwards)

  unsigned int
  pass_early_thread_jumps::execute (function *fun)
  {
/* Try to thread each block with more than one successor.  */
basic_block bb;
FOR_EACH_BB_FN (bb, fun)
  {
if (EDGE_COUNT (bb->succs) > 1)
find_jump_threads_backwards (bb, false);
  }
thread_through_all_blocks (true);
return 0;
  }

So even though profile information is ignored, we think we are compiling for
size and won't thread. The relevant check in profitable_jump_thread_path is:

  if (speed_p && optimize_edge_for_speed_p (taken_edge))
{
  
}
  else if (n_insns > 1)
{
  if (dump_file && (dump_flags & TDF_DETAILS))
fprintf (dump_file, "FSM jump-thread path not considered: "
 "duplication of %i insns is needed and optimizing for
size.\n",
 n_insns);
  path->pop ();
  return NULL;
}

Changing false to true (or even to optimize_bb_for_size_p ) in the above hunk
looks like it would enable some of the threading we're relying on here.

[Bug tree-optimization/77445] [7 Regression] Performance drop after r239219 on coremark test

2016-11-14 Thread ysrumyan at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=77445

--- Comment #4 from Yuri Rumyantsev  ---
Ping.
Do you have any progress on this?

Thanks.

[Bug tree-optimization/77445] [7 Regression] Performance drop after r239219 on coremark test

2016-09-03 Thread pinskia at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=77445

Andrew Pinski  changed:

   What|Removed |Added

 Status|UNCONFIRMED |NEW
   Last reconfirmed||2016-09-03
 Ever confirmed|0   |1

--- Comment #3 from Andrew Pinski  ---
Confirmed.  I had reported it too:
https://gcc.gnu.org/ml/gcc-patches/2016-08/msg00552.html

[Bug tree-optimization/77445] [7 Regression] Performance drop after r239219 on coremark test

2016-09-01 Thread rguenth at gcc dot gnu.org
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=77445

Richard Biener  changed:

   What|Removed |Added

   Keywords||missed-optimization
 CC||hubicka at gcc dot gnu.org
   Target Milestone|--- |7.0

--- Comment #2 from Richard Biener  ---
I think r239219 looks bogus in that it considers the outgoing edge frequency
but not the frequency of the path to it.  That is, we usually have

   if ()
/ \
   B   A
   \  /
X
if ()
   \
taken_edge

and we should duplicate 'X' if the path leading to taken_edge (either
through A or B) is to be optimized for speed (or ! for size).  The
frequency of the edge out of the condition isn't the only relevant
one (in fact whether we should consider it hot or cold needs to be
adjusted by the incoming path frequency given the now static prediction).

Given r239219 uses ninsns > 1 in the else path it IMHO should use
!optimize_edge_for_size_p or retain a > 1 PARAM use.

[Bug tree-optimization/77445] [7 Regression] Performance drop after r239219 on coremark test

2016-09-01 Thread ysrumyan at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=77445

--- Comment #1 from Yuri Rumyantsev  ---
Created attachment 39535
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=39535=edit
test-case to reproduce

It is sufficient to compile it with -Ofast option.