[Bug tree-optimization/46590] [4.6/4.7/4.8 Regression] long compile time with -O2 and many loops

2012-09-05 Thread rguenth at gcc dot gnu.org
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=46590

--- Comment #36 from Richard Guenther rguenth at gcc dot gnu.org 2012-09-05 
10:59:52 UTC ---
If I fix that (PR54489) by iterating over immediate dominators when querying
AVAIL_OUT
instead of accumulating then other loop opts quickly take over in compile-time,
but memory usage stays reasonable at -O1.  LIM is now the pass that pushes
memory usage to 1.8GB - all other optimization passes are happy with just
~800MB.  The issue with LIM is that it analyzes the whole function instead
of working on outermost loops at a time (PR54488).  Then of course IRA
comes along and wrecks memory usage again ... (create_loop_tree_nodes).
One can tame down IRA a bit using -fno-ira-loop-pressure -fira-region=one.
We then arrive at roughly a constant 900MB memory usage for the full(!)
testcase at -O1 and

Execution times (seconds)
 phase opt and generate  : 495.90 (99%) usr   1.98 (98%) sys 499.91 (99%) wall 
870508 kB (92%) ggc
 df reaching defs:  19.16 ( 4%) usr   0.06 ( 3%) sys  19.18 ( 4%) wall 
 0 kB ( 0%) ggc
 alias stmt walking  :  28.75 ( 6%) usr   0.21 (10%) sys  29.12 ( 6%) wall 
  2336 kB ( 0%) ggc
 tree SSA rewrite:  63.42 (13%) usr   0.02 ( 1%) sys  63.77 (13%) wall 
 18830 kB ( 2%) ggc
 tree SSA incremental:  74.64 (15%) usr   0.03 ( 1%) sys  74.44 (15%) wall 
 25886 kB ( 3%) ggc
 dominance frontiers : 101.71 (20%) usr   0.09 ( 4%) sys 102.17 (20%) wall 
 0 kB ( 0%) ggc
 dominance computation   :  52.56 (11%) usr   0.09 ( 4%) sys  53.35 (11%) wall 
 0 kB ( 0%) ggc
 loop invariant motion   : 101.20 (20%) usr   0.10 ( 5%) sys 101.75 (20%) wall 
  2700 kB ( 0%) ggc
 TOTAL : 498.79 2.03   502.87
947764 kB

(all entries  10s)

The incremental SSA stuff is complete loop unrolling / IV canonicalization
which does SSA update once per loop (similar to what loop header copying
formerly did).  Fixing that leads to

Execution times (seconds)
 phase opt and generate  : 214.62 (99%) usr   1.53 (96%) sys 217.41 (99%) wall 
870508 kB (92%) ggc
 df reaching defs:  23.07 (11%) usr   0.01 ( 1%) sys  23.10 (10%) wall 
 0 kB ( 0%) ggc
 alias stmt walking  :  28.51 (13%) usr   0.23 (14%) sys  28.93 (13%) wall 
  2336 kB ( 0%) ggc
 loop invariant motion   : 105.43 (48%) usr   0.01 ( 1%) sys 106.22 (48%) wall 
  2700 kB ( 0%) ggc
 TOTAL : 217.56 1.59   220.44
947764 kB

so RTL invariant motion is now the main offender ;)


[Bug tree-optimization/46590] [4.6/4.7/4.8 Regression] long compile time with -O2 and many loops

2012-09-05 Thread rguenth at gcc dot gnu.org
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=46590

--- Comment #37 from Richard Guenther rguenth at gcc dot gnu.org 2012-09-05 
13:29:20 UTC ---
Author: rguenth
Date: Wed Sep  5 13:29:13 2012
New Revision: 190978

URL: http://gcc.gnu.org/viewcvs?root=gccview=revrev=190978
Log:
2012-09-05  Richard Guenther  rguent...@suse.de

PR tree-optimization/46590
* tree-ssa-loop-ivcanon.c (try_unroll_loop_completely): Do not
update SSA form here.
(canonicalize_induction_variables): Assert we do not need to
update SSA form.
(tree_unroll_loops_completely): Update SSA form here.
* tree-ssa-loop-manip.c (gimple_duplicate_loop_to_header_edge):
Do not verify loop-closed SSA form if SSA form is not up-to-date.

Modified:
trunk/gcc/ChangeLog
trunk/gcc/tree-ssa-loop-ivcanon.c
trunk/gcc/tree-ssa-loop-manip.c


[Bug tree-optimization/46590] [4.6/4.7/4.8 Regression] long compile time with -O2 and many loops

2012-09-04 Thread rguenth at gcc dot gnu.org
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=46590

--- Comment #35 from Richard Guenther rguenth at gcc dot gnu.org 2012-09-04 
13:17:53 UTC ---
With -O1 FRE uses loads of memory and compile-time.  Not the value-numbering
itself but compute_avail ()!  Bah.  Of course AVAIL sets get bigger and
bigger going down the dominator tree.  For FRE a single domwalk after
SCCVN computing avail and doing elimiation would be enough.  Especially
as all the overhead in AVAIL is just SSA DEFs and their value.


[Bug tree-optimization/46590] [4.6/4.7/4.8 Regression] long compile time with -O2 and many loops

2012-09-03 Thread rguenth at gcc dot gnu.org
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=46590

Richard Guenther rguenth at gcc dot gnu.org changed:

   What|Removed |Added

 Status|ASSIGNED|NEW
 AssignedTo|rguenth at gcc dot gnu.org  |unassigned at gcc dot
   ||gnu.org

--- Comment #33 from Richard Guenther rguenth at gcc dot gnu.org 2012-09-03 
10:45:06 UTC ---
Not mine anymore.


[Bug tree-optimization/46590] [4.6/4.7/4.8 Regression] long compile time with -O2 and many loops

2012-09-03 Thread matz at gcc dot gnu.org
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=46590

--- Comment #34 from Michael Matz matz at gcc dot gnu.org 2012-09-03 15:39:20 
UTC ---
Author: matz
Date: Mon Sep  3 15:39:15 2012
New Revision: 190897

URL: http://gcc.gnu.org/viewcvs?root=gccview=revrev=190897
Log:
PR tree-optimization/46590
* tree-cfg.c (gimple_duplicate_sese_region): Don't update
SSA web here ...
* tree-ssa-loop-ch.c (copy_loop_headers): ... but here.

Modified:
trunk/gcc/ChangeLog
trunk/gcc/tree-cfg.c
trunk/gcc/tree-ssa-loop-ch.c


[Bug tree-optimization/46590] [4.6/4.7/4.8 Regression] long compile time with -O2 and many loops

2012-08-23 Thread rguenther at suse dot de
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=46590

--- Comment #30 from rguenther at suse dot de rguenther at suse dot de 
2012-08-23 07:13:13 UTC ---
On Wed, 22 Aug 2012, stevenb.gcc at gmail dot com wrote:

 http://gcc.gnu.org/bugzilla/show_bug.cgi?id=46590
 
 --- Comment #29 from stevenb.gcc at gmail dot com stevenb.gcc at gmail dot 
 com 2012-08-22 17:33:00 UTC ---
  I thought that loop header copying wouldn't need to insert new PHI nodes
  and thus can do with TODO_update_ssa_no_phi if we are in loop-closed SSA 
  form,
  but appearantly that is not the case (I didn't inverstigate further here,
  but possibly that's just because virtual operands are not in loop-closed
  SSA form).
 
 I'll experiment with VOPs in LC-SSA.

You might have seen the patch I posted - it passed testing and I will
install it today.


[Bug tree-optimization/46590] [4.6/4.7/4.8 Regression] long compile time with -O2 and many loops

2012-08-23 Thread rguenth at gcc dot gnu.org
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=46590

--- Comment #31 from Richard Guenther rguenth at gcc dot gnu.org 2012-08-23 
11:41:50 UTC ---
Btw, I have experimented with

Index: tree-into-ssa.c
===
--- tree-into-ssa.c (revision 190594)
+++ tree-into-ssa.c (working copy)
@@ -3224,11 +3224,35 @@ update_ssa (unsigned update_flags)
   bitmap_head *dfs;

   /* If the caller requested PHI nodes to be added, compute
-dominance frontiers.  */
+dominance frontiers for all interesting blocks
+up to the dominating start_bb.  */
   dfs = XNEWVEC (bitmap_head, last_basic_block);
   FOR_EACH_BB (bb)
bitmap_initialize (dfs[bb-index], bitmap_default_obstack);
-  compute_dominance_frontiers (dfs);
+  EXECUTE_IF_SET_IN_BITMAP (blocks_to_update, 0, i, bi)
+   {
+ basic_block b = BASIC_BLOCK (i);
+ edge_iterator ei;
+ edge p;
+ if (EDGE_COUNT (b-preds) = 2)
+   FOR_EACH_EDGE (p, ei, b-preds)
+ {
+   basic_block runner = p-src;
+   basic_block domsb;
+   if (runner == start_bb)
+ continue;
+
+   domsb = get_immediate_dominator (CDI_DOMINATORS, b);
+   while (runner != domsb)
+ {
+   if (!bitmap_set_bit (dfs[runner-index],
+b-index))
+ break;
+   runner = get_immediate_dominator (CDI_DOMINATORS,
+ runner);
+ }
+ }
+   }

   if (sbitmap_first_set_bit (old_ssa_names) = 0)
{

which helps reducing the time spent in computing dominance frontiers.  But
as we no longer have bitmaps but bitmap_heads in dfs it's hard to verify
we only ever access dfs for entries we computed ... but we should,
looking at how compute_idf works(?).


[Bug tree-optimization/46590] [4.6/4.7/4.8 Regression] long compile time with -O2 and many loops

2012-08-23 Thread steven at gcc dot gnu.org
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=46590

--- Comment #32 from Steven Bosscher steven at gcc dot gnu.org 2012-08-23 
13:44:53 UTC ---
(In reply to comment #31)
 which helps reducing the time spent in computing dominance frontiers.  But
 as we no longer have bitmaps but bitmap_heads in dfs it's hard to verify
 we only ever access dfs for entries we computed ... but we should,
 looking at how compute_idf works(?).

I don't understand this comment. You can still always do:
bitmap_empty_p (dfs[bb-index]) to see if something was
computed for bb.


[Bug tree-optimization/46590] [4.6/4.7/4.8 Regression] long compile time with -O2 and many loops

2012-08-22 Thread rguenth at gcc dot gnu.org
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=46590

--- Comment #27 from Richard Guenther rguenth at gcc dot gnu.org 2012-08-22 
11:39:56 UTC ---
The issue with loop header copying is that we use incremental SSA updating
with PHI insertion for each individual loop header copied.  That computes
dominance frontiers which is always for the whole function.

I thought that loop header copying wouldn't need to insert new PHI nodes
and thus can do with TODO_update_ssa_no_phi if we are in loop-closed SSA form,
but appearantly that is not the case (I didn't inverstigate further here,
but possibly that's just because virtual operands are not in loop-closed
SSA form).


[Bug tree-optimization/46590] [4.6/4.7/4.8 Regression] long compile time with -O2 and many loops

2012-08-22 Thread rguenth at gcc dot gnu.org
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=46590

--- Comment #28 from Richard Guenther rguenth at gcc dot gnu.org 2012-08-22 
13:17:42 UTC ---
Author: rguenth
Date: Wed Aug 22 13:17:26 2012
New Revision: 190594

URL: http://gcc.gnu.org/viewcvs?root=gccview=revrev=190594
Log:
2012-08-22  Richard Guenther  rguent...@suse.de

PR tree-optimization/46590
* tree-ssa-alias.h (get_continuation_for_phi): Add alias query
counter output argument.
(walk_non_aliased_vuses): Add alias query counter argument
to the walker callback.
* tree-ssa-alias.c (maybe_skip_until): Add alias query counter
output argument and count alias queries.
(get_continuation_for_phi_1): Likewise.
(get_continuation_for_phi): Likewise.
(walk_non_aliased_vuses): Add alias query counter argument
to the walker callback and allow it to abort the walk by
returning -1.
* tree-ssa-pre.c (translate_vuse_through_block): Adjust.
* tree-ssa-sccvn.c (vn_reference_lookup_2): Add alias query
counter parmeter, abort walk if that is bigger than
--param sccvn-max-alias-queries-per-access.
* params.def (sccvn-max-alias-queries-per-access): New param.
* doc/invoke.texi (sccvn-max-alias-queries-per-access): Document.

Modified:
trunk/gcc/ChangeLog
trunk/gcc/doc/invoke.texi
trunk/gcc/params.def
trunk/gcc/tree-ssa-alias.c
trunk/gcc/tree-ssa-alias.h
trunk/gcc/tree-ssa-pre.c
trunk/gcc/tree-ssa-sccvn.c


[Bug tree-optimization/46590] [4.6/4.7/4.8 Regression] long compile time with -O2 and many loops

2012-08-22 Thread stevenb.gcc at gmail dot com
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=46590

--- Comment #29 from stevenb.gcc at gmail dot com stevenb.gcc at gmail dot 
com 2012-08-22 17:33:00 UTC ---
 I thought that loop header copying wouldn't need to insert new PHI nodes
 and thus can do with TODO_update_ssa_no_phi if we are in loop-closed SSA form,
 but appearantly that is not the case (I didn't inverstigate further here,
 but possibly that's just because virtual operands are not in loop-closed
 SSA form).

I'll experiment with VOPs in LC-SSA.


[Bug tree-optimization/46590] [4.6/4.7/4.8 Regression] long compile time with -O2 and many loops

2012-08-21 Thread rguenth at gcc dot gnu.org
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=46590

Richard Guenther rguenth at gcc dot gnu.org changed:

   What|Removed |Added

 CC||nbhargava at google dot com

--- Comment #22 from Richard Guenther rguenth at gcc dot gnu.org 2012-08-21 
08:09:55 UTC ---
*** Bug 54337 has been marked as a duplicate of this bug. ***


[Bug tree-optimization/46590] [4.6/4.7/4.8 Regression] long compile time with -O2 and many loops

2012-08-21 Thread steven at gcc dot gnu.org
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=46590

Steven Bosscher steven at gcc dot gnu.org changed:

   What|Removed |Added

 CC||steven at gcc dot gnu.org

--- Comment #23 from Steven Bosscher steven at gcc dot gnu.org 2012-08-21 
09:42:42 UTC ---
(In reply to comment #13)
 The I/O parts of the FRE cost are due to value-numbering stores in
 visit_reference_op_store.  They can be drastically cut by an equivalent
 of (not generating code)
 
 Index: trans-io.c
 ===
 --- trans-io.c  (revision 167111)
 +++ trans-io.c  (working copy)
 @@ -1670,6 +1670,7 @@ build_dt (tree function, gfc_code * code
gfc_init_block (post_iu_block);
 
var = gfc_create_var (st_parameter[IOPARM_ptype_dt].type, dt_parm);
 +  gfc_add_modify (block, var, build_constructor (TREE_TYPE (var), NULL));
 
set_error_locus (block, var, code-loc);
 

You didn't post/commit this, but it looks like a reasonable change to me.


[Bug tree-optimization/46590] [4.6/4.7/4.8 Regression] long compile time with -O2 and many loops

2012-08-21 Thread rguenther at suse dot de
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=46590

--- Comment #24 from rguenther at suse dot de rguenther at suse dot de 
2012-08-21 09:59:41 UTC ---
On Tue, 21 Aug 2012, steven at gcc dot gnu.org wrote:

 http://gcc.gnu.org/bugzilla/show_bug.cgi?id=46590
 
 Steven Bosscher steven at gcc dot gnu.org changed:
 
What|Removed |Added
 
  CC||steven at gcc dot gnu.org
 
 --- Comment #23 from Steven Bosscher steven at gcc dot gnu.org 2012-08-21 
 09:42:42 UTC ---
 (In reply to comment #13)
  The I/O parts of the FRE cost are due to value-numbering stores in
  visit_reference_op_store.  They can be drastically cut by an equivalent
  of (not generating code)
  
  Index: trans-io.c
  ===
  --- trans-io.c  (revision 167111)
  +++ trans-io.c  (working copy)
  @@ -1670,6 +1670,7 @@ build_dt (tree function, gfc_code * code
 gfc_init_block (post_iu_block);
  
 var = gfc_create_var (st_parameter[IOPARM_ptype_dt].type, dt_parm);
  +  gfc_add_modify (block, var, build_constructor (TREE_TYPE (var), NULL));
  
 set_error_locus (block, var, code-loc);
  
 
 You didn't post/commit this, but it looks like a reasonable change to me.

I think this is obsolete now with Michas change to have CLOBBERs.  Or
if still useful, the above should be a CLOBBER now.

Richard.


[Bug tree-optimization/46590] [4.6/4.7/4.8 Regression] long compile time with -O2 and many loops

2012-08-21 Thread rguenth at gcc dot gnu.org
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=46590

--- Comment #25 from Richard Guenther rguenth at gcc dot gnu.org 2012-08-21 
14:10:38 UTC ---
I have a patch for the SCCVN issue, but trying to gather current trunk status
first.


[Bug tree-optimization/46590] [4.6/4.7/4.8 Regression] long compile time with -O2 and many loops

2012-08-21 Thread rguenth at gcc dot gnu.org
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=46590

--- Comment #26 from Richard Guenther rguenth at gcc dot gnu.org 2012-08-21 
14:56:41 UTC ---
For a somewhat reduced testcase I now get at -O1:

 alias stmt walking  : 105.51 (45%) usr   0.33 (24%) sys
 tree SSA rewrite:  22.01 ( 9%) usr   0.04 ( 3%) sys
 tree SSA incremental:  25.25 (11%) usr   0.07 ( 5%) sys
 dominance frontiers :  35.35 (15%) usr   0.02 ( 1%) sys
 dominance computation   :  14.60 ( 6%) usr   0.09 ( 7%) sys
 TOTAL : 234.28 1.38

as previously said most of the non-alias-stmt walk time is spent
on loop header copying.  WIth -O1 -fno-tree-ch we have

 alias stmt walking  : 101.52 (68%) usr   0.37 (34%) sys
 tree SSA rewrite:   4.14 ( 3%) usr   0.01 ( 1%) sys
 tree SSA incremental:   8.00 ( 5%) usr   0.02 ( 2%) sys
 dominance frontiers :   6.14 ( 4%) usr   0.01 ( 1%) sys
 dominance computation   :   4.74 ( 3%) usr   0.06 ( 6%) sys
 TOTAL : 150.14 1.09

limiting stmt walk results in the ability to arbitrarily scale down its cost
with a param (we can either limit alias oracle query numbers or SCCVN
table lookups).  With 100 alias oracle queries per load/store we end up with

 alias stmt walking  :   1.60 ( 3%) usr   0.05 ( 5%) sys

with 100 SCCVN table lookups the figure is

 alias stmt walking  :   1.60 ( 3%) usr   0.06 ( 6%) sys

one assumes the lookups are expensive, the other one assumes the walk itself
is.
Increasing the latter to 1000 SCCVN table lookup produces

 alias stmt walking  :   9.24 (16%) usr   0.18 (19%) sys

which is around the expected 10-fold increase (but still reasonable given
the artificial nature of the testcase).  We could also, instead of
limiting each walk to a constant cost, have a per-function budget that
we can use up first before limiting further walks individually (helps
to not regress reasonably sized cases).