[Bug tree-optimization/46590] [4.6/4.7/4.8 Regression] long compile time with -O2 and many loops
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=46590 --- Comment #36 from Richard Guenther rguenth at gcc dot gnu.org 2012-09-05 10:59:52 UTC --- If I fix that (PR54489) by iterating over immediate dominators when querying AVAIL_OUT instead of accumulating then other loop opts quickly take over in compile-time, but memory usage stays reasonable at -O1. LIM is now the pass that pushes memory usage to 1.8GB - all other optimization passes are happy with just ~800MB. The issue with LIM is that it analyzes the whole function instead of working on outermost loops at a time (PR54488). Then of course IRA comes along and wrecks memory usage again ... (create_loop_tree_nodes). One can tame down IRA a bit using -fno-ira-loop-pressure -fira-region=one. We then arrive at roughly a constant 900MB memory usage for the full(!) testcase at -O1 and Execution times (seconds) phase opt and generate : 495.90 (99%) usr 1.98 (98%) sys 499.91 (99%) wall 870508 kB (92%) ggc df reaching defs: 19.16 ( 4%) usr 0.06 ( 3%) sys 19.18 ( 4%) wall 0 kB ( 0%) ggc alias stmt walking : 28.75 ( 6%) usr 0.21 (10%) sys 29.12 ( 6%) wall 2336 kB ( 0%) ggc tree SSA rewrite: 63.42 (13%) usr 0.02 ( 1%) sys 63.77 (13%) wall 18830 kB ( 2%) ggc tree SSA incremental: 74.64 (15%) usr 0.03 ( 1%) sys 74.44 (15%) wall 25886 kB ( 3%) ggc dominance frontiers : 101.71 (20%) usr 0.09 ( 4%) sys 102.17 (20%) wall 0 kB ( 0%) ggc dominance computation : 52.56 (11%) usr 0.09 ( 4%) sys 53.35 (11%) wall 0 kB ( 0%) ggc loop invariant motion : 101.20 (20%) usr 0.10 ( 5%) sys 101.75 (20%) wall 2700 kB ( 0%) ggc TOTAL : 498.79 2.03 502.87 947764 kB (all entries 10s) The incremental SSA stuff is complete loop unrolling / IV canonicalization which does SSA update once per loop (similar to what loop header copying formerly did). Fixing that leads to Execution times (seconds) phase opt and generate : 214.62 (99%) usr 1.53 (96%) sys 217.41 (99%) wall 870508 kB (92%) ggc df reaching defs: 23.07 (11%) usr 0.01 ( 1%) sys 23.10 (10%) wall 0 kB ( 0%) ggc alias stmt walking : 28.51 (13%) usr 0.23 (14%) sys 28.93 (13%) wall 2336 kB ( 0%) ggc loop invariant motion : 105.43 (48%) usr 0.01 ( 1%) sys 106.22 (48%) wall 2700 kB ( 0%) ggc TOTAL : 217.56 1.59 220.44 947764 kB so RTL invariant motion is now the main offender ;)
[Bug tree-optimization/46590] [4.6/4.7/4.8 Regression] long compile time with -O2 and many loops
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=46590 --- Comment #37 from Richard Guenther rguenth at gcc dot gnu.org 2012-09-05 13:29:20 UTC --- Author: rguenth Date: Wed Sep 5 13:29:13 2012 New Revision: 190978 URL: http://gcc.gnu.org/viewcvs?root=gccview=revrev=190978 Log: 2012-09-05 Richard Guenther rguent...@suse.de PR tree-optimization/46590 * tree-ssa-loop-ivcanon.c (try_unroll_loop_completely): Do not update SSA form here. (canonicalize_induction_variables): Assert we do not need to update SSA form. (tree_unroll_loops_completely): Update SSA form here. * tree-ssa-loop-manip.c (gimple_duplicate_loop_to_header_edge): Do not verify loop-closed SSA form if SSA form is not up-to-date. Modified: trunk/gcc/ChangeLog trunk/gcc/tree-ssa-loop-ivcanon.c trunk/gcc/tree-ssa-loop-manip.c
[Bug tree-optimization/46590] [4.6/4.7/4.8 Regression] long compile time with -O2 and many loops
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=46590 --- Comment #35 from Richard Guenther rguenth at gcc dot gnu.org 2012-09-04 13:17:53 UTC --- With -O1 FRE uses loads of memory and compile-time. Not the value-numbering itself but compute_avail ()! Bah. Of course AVAIL sets get bigger and bigger going down the dominator tree. For FRE a single domwalk after SCCVN computing avail and doing elimiation would be enough. Especially as all the overhead in AVAIL is just SSA DEFs and their value.
[Bug tree-optimization/46590] [4.6/4.7/4.8 Regression] long compile time with -O2 and many loops
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=46590 Richard Guenther rguenth at gcc dot gnu.org changed: What|Removed |Added Status|ASSIGNED|NEW AssignedTo|rguenth at gcc dot gnu.org |unassigned at gcc dot ||gnu.org --- Comment #33 from Richard Guenther rguenth at gcc dot gnu.org 2012-09-03 10:45:06 UTC --- Not mine anymore.
[Bug tree-optimization/46590] [4.6/4.7/4.8 Regression] long compile time with -O2 and many loops
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=46590 --- Comment #34 from Michael Matz matz at gcc dot gnu.org 2012-09-03 15:39:20 UTC --- Author: matz Date: Mon Sep 3 15:39:15 2012 New Revision: 190897 URL: http://gcc.gnu.org/viewcvs?root=gccview=revrev=190897 Log: PR tree-optimization/46590 * tree-cfg.c (gimple_duplicate_sese_region): Don't update SSA web here ... * tree-ssa-loop-ch.c (copy_loop_headers): ... but here. Modified: trunk/gcc/ChangeLog trunk/gcc/tree-cfg.c trunk/gcc/tree-ssa-loop-ch.c
[Bug tree-optimization/46590] [4.6/4.7/4.8 Regression] long compile time with -O2 and many loops
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=46590 --- Comment #30 from rguenther at suse dot de rguenther at suse dot de 2012-08-23 07:13:13 UTC --- On Wed, 22 Aug 2012, stevenb.gcc at gmail dot com wrote: http://gcc.gnu.org/bugzilla/show_bug.cgi?id=46590 --- Comment #29 from stevenb.gcc at gmail dot com stevenb.gcc at gmail dot com 2012-08-22 17:33:00 UTC --- I thought that loop header copying wouldn't need to insert new PHI nodes and thus can do with TODO_update_ssa_no_phi if we are in loop-closed SSA form, but appearantly that is not the case (I didn't inverstigate further here, but possibly that's just because virtual operands are not in loop-closed SSA form). I'll experiment with VOPs in LC-SSA. You might have seen the patch I posted - it passed testing and I will install it today.
[Bug tree-optimization/46590] [4.6/4.7/4.8 Regression] long compile time with -O2 and many loops
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=46590 --- Comment #31 from Richard Guenther rguenth at gcc dot gnu.org 2012-08-23 11:41:50 UTC --- Btw, I have experimented with Index: tree-into-ssa.c === --- tree-into-ssa.c (revision 190594) +++ tree-into-ssa.c (working copy) @@ -3224,11 +3224,35 @@ update_ssa (unsigned update_flags) bitmap_head *dfs; /* If the caller requested PHI nodes to be added, compute -dominance frontiers. */ +dominance frontiers for all interesting blocks +up to the dominating start_bb. */ dfs = XNEWVEC (bitmap_head, last_basic_block); FOR_EACH_BB (bb) bitmap_initialize (dfs[bb-index], bitmap_default_obstack); - compute_dominance_frontiers (dfs); + EXECUTE_IF_SET_IN_BITMAP (blocks_to_update, 0, i, bi) + { + basic_block b = BASIC_BLOCK (i); + edge_iterator ei; + edge p; + if (EDGE_COUNT (b-preds) = 2) + FOR_EACH_EDGE (p, ei, b-preds) + { + basic_block runner = p-src; + basic_block domsb; + if (runner == start_bb) + continue; + + domsb = get_immediate_dominator (CDI_DOMINATORS, b); + while (runner != domsb) + { + if (!bitmap_set_bit (dfs[runner-index], +b-index)) + break; + runner = get_immediate_dominator (CDI_DOMINATORS, + runner); + } + } + } if (sbitmap_first_set_bit (old_ssa_names) = 0) { which helps reducing the time spent in computing dominance frontiers. But as we no longer have bitmaps but bitmap_heads in dfs it's hard to verify we only ever access dfs for entries we computed ... but we should, looking at how compute_idf works(?).
[Bug tree-optimization/46590] [4.6/4.7/4.8 Regression] long compile time with -O2 and many loops
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=46590 --- Comment #32 from Steven Bosscher steven at gcc dot gnu.org 2012-08-23 13:44:53 UTC --- (In reply to comment #31) which helps reducing the time spent in computing dominance frontiers. But as we no longer have bitmaps but bitmap_heads in dfs it's hard to verify we only ever access dfs for entries we computed ... but we should, looking at how compute_idf works(?). I don't understand this comment. You can still always do: bitmap_empty_p (dfs[bb-index]) to see if something was computed for bb.
[Bug tree-optimization/46590] [4.6/4.7/4.8 Regression] long compile time with -O2 and many loops
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=46590 --- Comment #27 from Richard Guenther rguenth at gcc dot gnu.org 2012-08-22 11:39:56 UTC --- The issue with loop header copying is that we use incremental SSA updating with PHI insertion for each individual loop header copied. That computes dominance frontiers which is always for the whole function. I thought that loop header copying wouldn't need to insert new PHI nodes and thus can do with TODO_update_ssa_no_phi if we are in loop-closed SSA form, but appearantly that is not the case (I didn't inverstigate further here, but possibly that's just because virtual operands are not in loop-closed SSA form).
[Bug tree-optimization/46590] [4.6/4.7/4.8 Regression] long compile time with -O2 and many loops
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=46590 --- Comment #28 from Richard Guenther rguenth at gcc dot gnu.org 2012-08-22 13:17:42 UTC --- Author: rguenth Date: Wed Aug 22 13:17:26 2012 New Revision: 190594 URL: http://gcc.gnu.org/viewcvs?root=gccview=revrev=190594 Log: 2012-08-22 Richard Guenther rguent...@suse.de PR tree-optimization/46590 * tree-ssa-alias.h (get_continuation_for_phi): Add alias query counter output argument. (walk_non_aliased_vuses): Add alias query counter argument to the walker callback. * tree-ssa-alias.c (maybe_skip_until): Add alias query counter output argument and count alias queries. (get_continuation_for_phi_1): Likewise. (get_continuation_for_phi): Likewise. (walk_non_aliased_vuses): Add alias query counter argument to the walker callback and allow it to abort the walk by returning -1. * tree-ssa-pre.c (translate_vuse_through_block): Adjust. * tree-ssa-sccvn.c (vn_reference_lookup_2): Add alias query counter parmeter, abort walk if that is bigger than --param sccvn-max-alias-queries-per-access. * params.def (sccvn-max-alias-queries-per-access): New param. * doc/invoke.texi (sccvn-max-alias-queries-per-access): Document. Modified: trunk/gcc/ChangeLog trunk/gcc/doc/invoke.texi trunk/gcc/params.def trunk/gcc/tree-ssa-alias.c trunk/gcc/tree-ssa-alias.h trunk/gcc/tree-ssa-pre.c trunk/gcc/tree-ssa-sccvn.c
[Bug tree-optimization/46590] [4.6/4.7/4.8 Regression] long compile time with -O2 and many loops
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=46590 --- Comment #29 from stevenb.gcc at gmail dot com stevenb.gcc at gmail dot com 2012-08-22 17:33:00 UTC --- I thought that loop header copying wouldn't need to insert new PHI nodes and thus can do with TODO_update_ssa_no_phi if we are in loop-closed SSA form, but appearantly that is not the case (I didn't inverstigate further here, but possibly that's just because virtual operands are not in loop-closed SSA form). I'll experiment with VOPs in LC-SSA.
[Bug tree-optimization/46590] [4.6/4.7/4.8 Regression] long compile time with -O2 and many loops
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=46590 Richard Guenther rguenth at gcc dot gnu.org changed: What|Removed |Added CC||nbhargava at google dot com --- Comment #22 from Richard Guenther rguenth at gcc dot gnu.org 2012-08-21 08:09:55 UTC --- *** Bug 54337 has been marked as a duplicate of this bug. ***
[Bug tree-optimization/46590] [4.6/4.7/4.8 Regression] long compile time with -O2 and many loops
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=46590 Steven Bosscher steven at gcc dot gnu.org changed: What|Removed |Added CC||steven at gcc dot gnu.org --- Comment #23 from Steven Bosscher steven at gcc dot gnu.org 2012-08-21 09:42:42 UTC --- (In reply to comment #13) The I/O parts of the FRE cost are due to value-numbering stores in visit_reference_op_store. They can be drastically cut by an equivalent of (not generating code) Index: trans-io.c === --- trans-io.c (revision 167111) +++ trans-io.c (working copy) @@ -1670,6 +1670,7 @@ build_dt (tree function, gfc_code * code gfc_init_block (post_iu_block); var = gfc_create_var (st_parameter[IOPARM_ptype_dt].type, dt_parm); + gfc_add_modify (block, var, build_constructor (TREE_TYPE (var), NULL)); set_error_locus (block, var, code-loc); You didn't post/commit this, but it looks like a reasonable change to me.
[Bug tree-optimization/46590] [4.6/4.7/4.8 Regression] long compile time with -O2 and many loops
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=46590 --- Comment #24 from rguenther at suse dot de rguenther at suse dot de 2012-08-21 09:59:41 UTC --- On Tue, 21 Aug 2012, steven at gcc dot gnu.org wrote: http://gcc.gnu.org/bugzilla/show_bug.cgi?id=46590 Steven Bosscher steven at gcc dot gnu.org changed: What|Removed |Added CC||steven at gcc dot gnu.org --- Comment #23 from Steven Bosscher steven at gcc dot gnu.org 2012-08-21 09:42:42 UTC --- (In reply to comment #13) The I/O parts of the FRE cost are due to value-numbering stores in visit_reference_op_store. They can be drastically cut by an equivalent of (not generating code) Index: trans-io.c === --- trans-io.c (revision 167111) +++ trans-io.c (working copy) @@ -1670,6 +1670,7 @@ build_dt (tree function, gfc_code * code gfc_init_block (post_iu_block); var = gfc_create_var (st_parameter[IOPARM_ptype_dt].type, dt_parm); + gfc_add_modify (block, var, build_constructor (TREE_TYPE (var), NULL)); set_error_locus (block, var, code-loc); You didn't post/commit this, but it looks like a reasonable change to me. I think this is obsolete now with Michas change to have CLOBBERs. Or if still useful, the above should be a CLOBBER now. Richard.
[Bug tree-optimization/46590] [4.6/4.7/4.8 Regression] long compile time with -O2 and many loops
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=46590 --- Comment #25 from Richard Guenther rguenth at gcc dot gnu.org 2012-08-21 14:10:38 UTC --- I have a patch for the SCCVN issue, but trying to gather current trunk status first.
[Bug tree-optimization/46590] [4.6/4.7/4.8 Regression] long compile time with -O2 and many loops
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=46590 --- Comment #26 from Richard Guenther rguenth at gcc dot gnu.org 2012-08-21 14:56:41 UTC --- For a somewhat reduced testcase I now get at -O1: alias stmt walking : 105.51 (45%) usr 0.33 (24%) sys tree SSA rewrite: 22.01 ( 9%) usr 0.04 ( 3%) sys tree SSA incremental: 25.25 (11%) usr 0.07 ( 5%) sys dominance frontiers : 35.35 (15%) usr 0.02 ( 1%) sys dominance computation : 14.60 ( 6%) usr 0.09 ( 7%) sys TOTAL : 234.28 1.38 as previously said most of the non-alias-stmt walk time is spent on loop header copying. WIth -O1 -fno-tree-ch we have alias stmt walking : 101.52 (68%) usr 0.37 (34%) sys tree SSA rewrite: 4.14 ( 3%) usr 0.01 ( 1%) sys tree SSA incremental: 8.00 ( 5%) usr 0.02 ( 2%) sys dominance frontiers : 6.14 ( 4%) usr 0.01 ( 1%) sys dominance computation : 4.74 ( 3%) usr 0.06 ( 6%) sys TOTAL : 150.14 1.09 limiting stmt walk results in the ability to arbitrarily scale down its cost with a param (we can either limit alias oracle query numbers or SCCVN table lookups). With 100 alias oracle queries per load/store we end up with alias stmt walking : 1.60 ( 3%) usr 0.05 ( 5%) sys with 100 SCCVN table lookups the figure is alias stmt walking : 1.60 ( 3%) usr 0.06 ( 6%) sys one assumes the lookups are expensive, the other one assumes the walk itself is. Increasing the latter to 1000 SCCVN table lookup produces alias stmt walking : 9.24 (16%) usr 0.18 (19%) sys which is around the expected 10-fold increase (but still reasonable given the artificial nature of the testcase). We could also, instead of limiting each walk to a constant cost, have a per-function budget that we can use up first before limiting further walks individually (helps to not regress reasonably sized cases).