[Bug tree-optimization/77445] [7 Regression] Performance drop after r239219 on coremark test
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=77445

Jan Hubicka changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|ASSIGNED                    |RESOLVED
         Resolution|---                         |FIXED

--- Comment #25 from Jan Hubicka ---
We now do all the threading possible :)
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=77445

--- Comment #24 from Jan Hubicka ---
Author: hubicka
Date: Thu Feb 2 20:22:13 2017
New Revision: 245135

URL: https://gcc.gnu.org/viewcvs?rev=245135&root=gcc&view=rev
Log:
	PR middle-end/77445
	* gcc.dg/tree-ssa/pr77445-2.c: Update testcase to check that all
	threading is done.
	* tree-ssa-threadbackward.c (profitable_jump_thread_path): Dump
	statistics of the analyzed path; allow threading for speed when
	any of BBs along the path are optimized for speed.

Modified:
	trunk/gcc/ChangeLog
	trunk/gcc/testsuite/ChangeLog
	trunk/gcc/testsuite/gcc.dg/tree-ssa/pr77445-2.c
	trunk/gcc/tree-ssa-threadbackward.c
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=77445

--- Comment #23 from Jan Hubicka ---
I will implement the "allow size growth when there is a hot BB on the path"
heuristic.  We may try to get smarter about when peeling actually helps the
unpeeled path, but that is a bit hard to do because there may be many
threadable paths on the way (just as in the example) and we will never be
able to fully judge the optimization opportunities enabled by peeling (such
as better VRP propagation, for example).  I doubt being on the aggressive
side here will have a large effect on code size.

On an unrelated note, this testcase made me think that perhaps for -Ofast we
want to adjust the hot/cold heuristics so that we are more aggressive about
guessing that a BB is hot as well (for -O2 we want to take code size
implications into account, but for -Ofast we are probably permitted to drop
them).

Concerning peeling, we are probably always safe deciding to peel when the
transformation decreases code size, which is the case for all those early
transforms I looked into (about 200-300 out of 700 on tramp3d).  In fact such
complete peeling ought to happen somewhere in the early pass queue.  We can
punt on non-size-decreasing peelings, leaving them to the dedicated pass.

The loop peeling pass has its own way of predicting whether something
optimizes away.  It does take into account induction variables, but not PHIs
that happen to be constant across the peeled path.  Perhaps it should be
merged with the jump threader and the analysis part should be done in a
common way.  The backward threader should likely get smarter over time,
taking into account, for example, value ranges, and this way both passes
would benefit.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=77445

--- Comment #22 from Jeffrey A. Law ---
In response to c#19.  Yes, a thread path which is cold except for a hot block
in the middle might be profitable to thread, as it will sometimes expose
path-specific redundancies/simplifications in the hot block.  Whether or not
it does is mostly a function of the PHI nodes in the hot block.  ISTM that
these opportunities are most likely to arise if threading results in a
degenerate PHI in the hot block, which is something we ought to be able to
test for.  Not sure if that helps this testcase or not.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=77445

--- Comment #21 from Jeffrey A. Law ---
In response to c#20.  We have to be careful about fixing up the loop after
rotation, and avoid repeatedly peeling.  The old threader dealt with those by
being fairly conservative in passes before the loop optimizer and much more
aggressive later.

Zdenek's general argument was that peeling in the threader is fundamentally
wrong because we really only know the general loop structure, but not
iteration information, data dependencies, etc. (which directly affect
vectorization).  If peeling a loop inhibits vectorization, then we've most
likely lost in a big way.

I think the backward threader punts on all these issues, which is probably
the better choice than aggressively trying to handle them right now.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=77445

--- Comment #20 from rguenther at suse dot de ---
On Mon, 23 Jan 2017, hubicka at gcc dot gnu.org wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=77445
>
> --- Comment #19 from Jan Hubicka ---
> One change that would make sense to me is to always thread when there is a
> non-cold block on the path: we are not only improving the path taken when
> threading, but because we eliminate incoming edges we also permit better
> optimization of the non-threaded path later.
>
> This way we thread the branch, because there is still a non-cold BB on the
> way.
>
> On a related note, I have noticed that we give up on about 30% of jump
> threads in early optimization because of:
>
>   /* Skip edges pointing outside the current loop.  */
>   if (!arg || var_bb->loop_father != bbi->loop_father)
>     continue;
>
> This is a very common case, such as in:
>
>   char *c;
>   int t()
>   {
>     for (int i=0;i<5000;i++)
>       c[i]=i;
>   }
>
> Before loop header copying there is a threadable branch at the first
> iteration.  Threading it before profile construction would be quite
> desirable.
>
> There seems to be code deciding on loops when checking path profitability.
> It seems to me that at least the code can be adjusted to accept threads
> starting at a loop header when there is only a single entry edge, because
> that can't create irreducible loops?

But it will rotate the loop and thus change the loop header block?  Then we
need to make sure to preserve that properly to not destroy loop annotations
by removing/rediscovering the loop.  Also we need to make sure to not
repeatedly peel the loop via such a mechanism.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=77445

--- Comment #19 from Jan Hubicka ---
One change that would make sense to me is to always thread when there is a
non-cold block on the path: we are not only improving the path taken when
threading, but because we eliminate incoming edges we also permit better
optimization of the non-threaded path later.

This way we thread the branch, because there is still a non-cold BB on the
way.

On a related note, I have noticed that we give up on about 30% of jump
threads in early optimization because of:

  /* Skip edges pointing outside the current loop.  */
  if (!arg || var_bb->loop_father != bbi->loop_father)
    continue;

This is a very common case, such as in:

  char *c;
  int t()
  {
    for (int i=0;i<5000;i++)
      c[i]=i;
  }

Before loop header copying there is a threadable branch at the first
iteration.  Threading it before profile construction would be quite
desirable.

There seems to be code deciding on loops when checking path profitability.
It seems to me that at least the code can be adjusted to accept threads
starting at a loop header when there is only a single entry edge, because
that can't create irreducible loops?

Honza
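To make the loop-header-copying point concrete, here is a minimal hand-written
sketch (hypothetical illustration, not GCC output) of the rotated, do-while
form that header copying would produce for the loop above.  In the rotated
form, the first-iteration test becomes a separate, statically-true branch —
exactly the threadable branch the comment describes:

```c
#include <assert.h>
#include <string.h>

char c0[5000], c1[5000];

/* The loop as written: one header test guards both loop entry and every
   back edge.  */
void t_original (char *c)
{
  for (int i = 0; i < 5000; i++)
    c[i] = i;
}

/* After loop header copying (do-while form): the copied entry test is
   trivially true here, which is the threadable first-iteration branch.  */
void t_header_copied (char *c)
{
  int i = 0;
  if (i < 5000)      /* copied header: statically true, threadable away */
    do
      {
        c[i] = i;
        i++;
      }
    while (i < 5000);
}
```

Both versions fill the buffer identically; the transformation only changes
where the entry test sits, which is what exposes it to the threader before
profile construction.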
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=77445

--- Comment #18 from Jeffrey A. Law ---
From just looking at the paths, I would expect it to matter -- these are
still cases where we're threading the multiway branch, and that's the key to
this benchmark -- avoiding the multiway branch.
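For readers without the benchmark at hand, the shape under discussion is
roughly the following (a minimal hand-written sketch, not the actual coremark
source): a state machine whose transitions all funnel through one switch
inside a loop.  Threading the multiway branch wires each case directly to the
code for its successor state instead of returning to the shared switch on
every iteration:

```c
#include <assert.h>

enum state { START, IN_DIGIT, DONE };

/* Count the digits in the leading digit run of S.  Every iteration goes
   back through the switch; jump threading can connect, e.g., the
   IN_DIGIT -> IN_DIGIT transition directly, bypassing the multiway
   branch.  */
static int leading_digits (const char *s)
{
  enum state st = START;
  int n = 0, i = 0;
  while (st != DONE)
    switch (st)
      {
      case START:
      case IN_DIGIT:
        if (s[i] >= '0' && s[i] <= '9')
          {
            n++;
            i++;
            st = IN_DIGIT;
          }
        else
          st = DONE;
        break;
      case DONE:
        break;
      }
  return n;
}
```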
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=77445

--- Comment #17 from Jan Hubicka ---
As reported in PR77484, coremark is now up by 20% or more.  Are we out of
regression land now?  If not, does the patch in #15 help?
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=77445

seurer at gcc dot gnu.org changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |seurer at gcc dot gnu.org

--- Comment #16 from seurer at gcc dot gnu.org ---
One of the new tests fails on powerpc (both LE and BE):

Executing on host: /home/seurer/gcc/build/gcc-test2/gcc/xgcc
-B/home/seurer/gcc/build/gcc-test2/gcc/
/home/seurer/gcc/gcc-test2/gcc/testsuite/gcc.dg/tree-ssa/pr77445-2.c
-fno-diagnostics-show-caret -fdiagnostics-color=never -O2
-fdump-tree-thread1-details-blocks-stats -S -o pr77445-2.s  (timeout = 300)
spawn -ignore SIGHUP /home/seurer/gcc/build/gcc-test2/gcc/xgcc
-B/home/seurer/gcc/build/gcc-test2/gcc/
/home/seurer/gcc/gcc-test2/gcc/testsuite/gcc.dg/tree-ssa/pr77445-2.c
-fno-diagnostics-show-caret -fdiagnostics-color=never -O2
-fdump-tree-thread1-details-blocks-stats -S -o pr77445-2.s
PASS: gcc.dg/tree-ssa/pr77445-2.c (test for excess errors)
FAIL: gcc.dg/tree-ssa/pr77445-2.c scan-tree-dump thread1 "Jumps threaded: 16"
PASS: gcc.dg/tree-ssa/pr77445-2.c scan-tree-dump-times thread1 "Invalid sum" 2
testcase /home/seurer/gcc/gcc-test2/gcc/testsuite/gcc.dg/tree-ssa/tree-ssa.exp
completed in 0 seconds

=== gcc Summary ===

# of expected passes		2
# of unexpected failures	1
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=77445

--- Comment #15 from Jan Hubicka ---
Note that the remaining missed threads are on the loop exit condition testing
state != INVALID, which after a sequence of threads gets to probability 0
because the original guess from profile_estimate is not realistic.  I guess
such repeated threading may be common.

It may be interesting to benchmark the following hack, both for speed and for
size:

Index: tree-ssa-threadbackward.c
===================================================================
--- tree-ssa-threadbackward.c	(revision 244527)
+++ tree-ssa-threadbackward.c	(working copy)
@@ -311,7 +311,11 @@ profitable_jump_thread_path (vec<basic_block, va_gc> *&path,
+  if (!speed_p)
+    speed_p = optimize_bb_for_speed_p ((*path)[path->length () - 1])
+	      && optimize_bb_for_speed_p ((*path)[path->length () - 2]);
+
   if (n_insns >= PARAM_VALUE (PARAM_MAX_FSM_THREAD_PATH_INSNS))
     {

It makes us consider threading the path if at least it starts with a hot
edge.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=77445

--- Comment #14 from Jan Hubicka ---
With the patch we only give up on some threading in thread4:

q.c.181t.thread4:FSM jump-thread path not considered: duplication of 5 insns is needed and optimizing for size.
q.c.181t.thread4:FSM jump-thread path not considered: duplication of 5 insns is needed and optimizing for size.
q.c.181t.thread4:FSM jump-thread path not considered: duplication of 5 insns is needed and optimizing for size.
q.c.181t.thread4:FSM jump-thread path not considered: duplication of 5 insns is needed and optimizing for size.
q.c.181t.thread4:FSM jump-thread path not considered: duplication of 5 insns is needed and optimizing for size.
q.c.181t.thread4:FSM jump-thread path not considered: duplication of 5 insns is needed and optimizing for size.
q.c.181t.thread4:FSM jump-thread path not considered: duplication of 5 insns is needed and optimizing for size.
q.c.181t.thread4:FSM jump-thread path not considered: duplication of 5 insns is needed and optimizing for size.
q.c.181t.thread4:FSM jump-thread path not considered: duplication of 5 insns is needed and optimizing for size.

How does that affect the actual benchmark?  We may experiment with jump
threading whenever the destination BB is hot, not just the edge itself, as
duplicating the path also simplifies the surrounding code.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=77445

--- Comment #13 from Jan Hubicka ---
Author: hubicka
Date: Tue Jan 17 12:49:41 2017
New Revision: 244528

URL: https://gcc.gnu.org/viewcvs?rev=244528&root=gcc&view=rev
Log:
	PR middle-end/77445
	* tree-ssa-threadupdate.c (remove_ctrl_stmt_and_useless_edges):
	Correctly set frequency of outgoing edge.
	(duplicate_thread_path): Fix profile updating.
	* gcc.dg/tree-ssa/pr77445-2.c: New testcase.
	* gcc.dg/tree-ssa/pr77445.c: New testcase.

Added:
	trunk/gcc/testsuite/gcc.dg/tree-ssa/pr77445-2.c
	trunk/gcc/testsuite/gcc.dg/tree-ssa/pr77445.c
Modified:
	trunk/gcc/ChangeLog
	trunk/gcc/testsuite/ChangeLog
	trunk/gcc/tree-ssa-threadupdate.c
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=77445

--- Comment #12 from Jan Hubicka ---
Created attachment 40526
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=40526&action=edit
Patch I am testing

The profile is quite inconsistent since thread1.  The problem is that
duplicate_thread_path does

  if (total_count)
    {
      scale_bbs_frequencies_gcov_type (region, n_region,
				       total_count - entry_count, total_count);
      scale_bbs_frequencies_gcov_type (region_copy, n_region, entry_count,
				       total_count);
    }
  else
    {
      scale_bbs_frequencies_int (region, n_region, total_freq - entry_freq,
				 total_freq);
      scale_bbs_frequencies_int (region_copy, n_region, entry_freq,
				 total_freq);
    }

This is wrong when the duplicated path has extra entries that are not
duplicated, which happens in the testcase.  The attached patch fixes it.  It
is still not 100% correct, as the outgoing probabilities of the original path
should be updated based on the path being copied out, but I am not sure it is
worth the trouble (and the risk of roundoff errors).  I will think a bit more
about this tomorrow.
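The intended arithmetic of the two scale calls can be sketched as follows
(hypothetical helper, not GCC code): each block's frequency is split so that
the copy receives the entry_freq/total_freq share and the original region
keeps the remainder, so the two halves still sum to the original frequency.
This accounting only holds when entry_freq covers every edge that actually
enters the duplicated path — extra non-duplicated entries break it, which is
the inconsistency described above:

```c
#include <assert.h>

struct split { long orig; long copy; };

/* Split BB_FREQ between the original region and the duplicated path,
   mirroring the scale_bbs_frequencies_int calls quoted above: the copy
   gets bb_freq * entry_freq / total_freq, the original keeps the rest.  */
static struct split split_freq (long bb_freq, long entry_freq,
                                long total_freq)
{
  struct split s;
  s.copy = total_freq ? bb_freq * entry_freq / total_freq : 0;
  s.orig = bb_freq - s.copy;
  return s;
}
```

For a block of frequency 1000 on a path entered with frequency 300 out of
1000, the copy gets 300 and the original keeps 700; if some of that 1000 in
fact flows in through an edge not being duplicated, entry_freq no longer
describes the flow into the copy and the split drifts from reality.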
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=77445

Richard Biener changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
           Priority|P3                          |P1
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=77445

Jan Hubicka changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |ASSIGNED
           Assignee|unassigned at gcc dot gnu.org|hubicka at gcc dot gnu.org

--- Comment #11 from Jan Hubicka ---
I will take a look.  I think we should try to fix the profile updating after
jump threading again and see if we need any hacks after that.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=77445

--- Comment #10 from rguenther at suse dot de ---
On Wed, 14 Dec 2016, jgreenhalgh at gcc dot gnu.org wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=77445
>
> --- Comment #8 from James Greenhalgh ---
> Is anyone currently looking at this?
>
> If the bug is still blocked on correcting the profile information (which
> sounds like a large job)

But it's the actual regression...
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=77445

Jeffrey A. Law changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |law at redhat dot com

--- Comment #9 from Jeffrey A. Law ---
Not looking at it right now, but given it's a regression relative to gcc-6 it
will have to be looked at during this release cycle.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=77445

--- Comment #8 from James Greenhalgh ---
Is anyone currently looking at this?

If the bug is still blocked on correcting the profile information (which
sounds like a large job), should we consider weakening or reverting the cost
model for GCC 7?
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=77445

--- Comment #7 from James Greenhalgh ---
Right, I've trimmed too much context from my message.

This performance regression starts with r239219, which adds a cost model to
the threader that relies on frequency information (arguably this is a bad
cost model for threading, as at a switch statement you might expect multiple
cold edges and still want to thread the switch, but that's a separate
discussion).  The threader does a bad job of updating frequency information
when it creates new paths, with the effect that the edges we'd want to
thread in this test case appear to be cold.  The new cost model from r239219
sees the cold edges and rejects the threading opportunity.

The message I was replying to above had said:

> Hmm, this is interesting.  The patch should have "fixed" the previous
> degradation by making the profile correct (the backward threader still
> does not update it, but because most threading now happens early and the
> profile is built afterwards this should be less of an issue).  I am now
> looking into the profile update issues and will try to check why coremark
> degrades again.

The answer to which is that the early threader has hard-coded that it is
compiling for size, which causes most backward threading to be rejected, so
it wouldn't fix this issue.  However, if we were to use
optimize_bb_for_size_p in pass_early_thread_jumps::execute rather than just
passing false, then the early threader would have resolved this issue (as
the profile information is not used to decide if the edge should be
optimized for speed).
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=77445

Richard Biener changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |rguenth at gcc dot gnu.org

--- Comment #6 from Richard Biener ---
But the early threader is new and can't be responsible for the regression.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=77445

James Greenhalgh changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
   Last reconfirmed|2016-09-03 00:00:00         |2016-11-30
                 CC|                            |law at gcc dot gnu.org

--- Comment #5 from James Greenhalgh ---
I posted this on list a few weeks back:
https://gcc.gnu.org/ml/gcc-patches/2016-10/msg01454.html

The early threader is running with speed_p set to false (second parameter to
find_jump_threads_backwards):

  unsigned int
  pass_early_thread_jumps::execute (function *fun)
  {
    /* Try to thread each block with more than one successor.  */
    basic_block bb;
    FOR_EACH_BB_FN (bb, fun)
      {
        if (EDGE_COUNT (bb->succs) > 1)
          find_jump_threads_backwards (bb, false);
      }
    thread_through_all_blocks (true);
    return 0;
  }

So even though profile information is ignored, we think we are compiling for
size and won't thread.  The relevant check in profitable_jump_thread_path is:

  if (speed_p && optimize_edge_for_speed_p (taken_edge))
    {
    }
  else if (n_insns > 1)
    {
      if (dump_file && (dump_flags & TDF_DETAILS))
	fprintf (dump_file, "FSM jump-thread path not considered: "
		 "duplication of %i insns is needed and optimizing for size.\n",
		 n_insns);
      path->pop ();
      return NULL;
    }

Changing false to true (or even to optimize_bb_for_size_p) in the above hunk
looks like it would enable some of the threading we're relying on here.
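The effect of the hard-coded false can be distilled into a tiny predicate (a
hypothetical model, not GCC code): with speed_p false, the profile-driven arm
is never taken, so any path needing more than one duplicated instruction is
rejected regardless of how hot the edge actually is:

```c
#include <stdbool.h>

/* Simplified model of the gate in profitable_jump_thread_path quoted
   above: threading is allowed either when optimizing the taken edge for
   speed, or when duplication is essentially free.  */
static bool path_considered (bool speed_p, bool edge_hot, int n_insns)
{
  if (speed_p && edge_hot)
    return true;               /* optimizing for speed: allow code growth */
  return n_insns <= 1;         /* otherwise only near-free duplication */
}
```

With speed_p pinned to false, a hot 5-insn path is rejected exactly as the
dump messages in this thread show; passing true (or a per-BB query) flips
that outcome without touching the size-mode behaviour.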
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=77445

--- Comment #4 from Yuri Rumyantsev ---
Ping.  Do you have any progress on this?

Thanks.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=77445

Andrew Pinski changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|UNCONFIRMED                 |NEW
   Last reconfirmed|                            |2016-09-03
     Ever confirmed|0                           |1

--- Comment #3 from Andrew Pinski ---
Confirmed.  I had reported it too:
https://gcc.gnu.org/ml/gcc-patches/2016-08/msg00552.html
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=77445

Richard Biener changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
           Keywords|                            |missed-optimization
                 CC|                            |hubicka at gcc dot gnu.org
   Target Milestone|---                         |7.0

--- Comment #2 from Richard Biener ---
I think r239219 looks bogus in that it considers the outgoing edge frequency
but not the frequency of the path to it.  That is, we usually have

       if ()
       /    \
      B      A
       \    /
         X
       if ()
           \
          taken_edge

and we should duplicate 'X' if the path leading to taken_edge (either
through A or B) is to be optimized for speed (or ! for size).  The frequency
of the edge out of the condition isn't the only relevant one (in fact
whether we should consider it hot or cold needs to be adjusted by the
incoming path frequency, given the now static prediction).

Given r239219 uses n_insns > 1 in the else path, it IMHO should use
!optimize_edge_for_size_p or retain a > 1 PARAM use.
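The adjustment described above can be written out in one line (hypothetical
helper; the permille frequency scale is an assumption for illustration): the
hotness attributed to taken_edge for a given thread path is the edge's branch
probability weighted by the frequency of the path actually reaching the
condition, not the edge's standalone frequency at X:

```c
#include <assert.h>

/* Frequency attributed to taken_edge when reached through one specific
   incoming path: the path's frequency times the edge's branch
   probability (here in permille), rather than the edge's raw frequency
   summed over all predecessors.  */
static long path_adjusted_freq (long path_freq, long branch_prob_permille)
{
  return path_freq * branch_prob_permille / 1000;
}
```

A 90% branch reached through a path of frequency 10 thus contributes only 9,
i.e. it is cold along that path even though the edge looks hot when judged
in isolation at X.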
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=77445

--- Comment #1 from Yuri Rumyantsev ---
Created attachment 39535
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=39535&action=edit
test-case to reproduce

It is sufficient to compile it with the -Ofast option.