Re: better wpa [1/n]: merge types during read-in

2011-04-29 Thread Michael Matz
Hi,

On Thu, 21 Apr 2011, Richard Guenther wrote:

   It would have been nice to have the top-level tree merging as a
   separate patch, as I am not convinced it is correct, but see below ...
 
  I'll split it out.
 
  Like so (also including the other remarks).
 
  Regstrapping on x86_64-linux in progress.
 
 Ok if it passed.

r173155.


Ciao,
Michael.


Re: better wpa [1/n]: merge types during read-in

2011-04-22 Thread Jan Hubicka
Hi,
I run the patch on Mozilla.  W/o the patch it is:
Execution times (seconds)
 garbage collection:  20.19 ( 3%) usr   0.02 ( 0%) sys  20.22 ( 3%) wall
   0 kB ( 0%) ggc
 callgraph optimization:   3.53 ( 1%) usr   0.01 ( 0%) sys   3.53 ( 1%) wall   
15248 kB ( 1%) ggc
 varpool construction  :   0.77 ( 0%) usr   0.02 ( 0%) sys   0.80 ( 0%) wall   
51607 kB ( 4%) ggc
 ipa cp:   2.12 ( 0%) usr   0.10 ( 1%) sys   2.23 ( 0%) wall  
119701 kB (10%) ggc
 ipa lto gimple in :   0.07 ( 0%) usr   0.02 ( 0%) sys   0.07 ( 0%) wall
   0 kB ( 0%) ggc
 ipa lto gimple out:  11.63 ( 2%) usr   1.01 ( 8%) sys  12.63 ( 2%) wall
   0 kB ( 0%) ggc
 ipa lto decl in   : 182.15 (28%) usr   4.06 (32%) sys 188.10 (28%) wall  
392863 kB (31%) ggc
 ipa lto decl out  : 149.86 (23%) usr   0.32 ( 3%) sys 150.25 (22%) wall
   0 kB ( 0%) ggc
 ipa lto decl init I/O :   0.14 ( 0%) usr   0.03 ( 0%) sys   0.16 ( 0%) wall
  31 kB ( 0%) ggc
 ipa lto cgraph I/O:   2.09 ( 0%) usr   0.27 ( 2%) sys   2.37 ( 0%) wall  
428623 kB (34%) ggc
 ipa lto decl merge: 219.70 (33%) usr   1.93 (15%) sys 221.75 (33%) wall  
162687 kB (13%) ggc
 ipa lto cgraph merge  :   2.68 ( 0%) usr   0.00 ( 0%) sys   2.69 ( 0%) wall   
15895 kB ( 1%) ggc
 whopr wpa :   1.65 ( 0%) usr   0.04 ( 0%) sys   1.71 ( 0%) wall
   1 kB ( 0%) ggc
 whopr wpa I/O :   2.20 ( 0%) usr   4.55 (36%) sys   7.20 ( 1%) wall
   0 kB ( 0%) ggc
 ipa reference :   4.12 ( 1%) usr   0.00 ( 0%) sys   4.09 ( 1%) wall
   0 kB ( 0%) ggc
 ipa profile   :   0.18 ( 0%) usr   0.00 ( 0%) sys   0.17 ( 0%) wall
   0 kB ( 0%) ggc
 ipa pure const:   3.15 ( 0%) usr   0.04 ( 0%) sys   3.19 ( 0%) wall
   0 kB ( 0%) ggc
 parser:   1.56 ( 0%) usr   0.00 ( 0%) sys   1.56 ( 0%) wall   
37684 kB ( 3%) ggc
 inline heuristics :  47.26 ( 7%) usr   0.05 ( 0%) sys  47.33 ( 7%) wall   
21988 kB ( 2%) ggc
 callgraph verifier:   0.42 ( 0%) usr   0.04 ( 0%) sys   0.47 ( 0%) wall
   0 kB ( 0%) ggc
 varconst  :   0.02 ( 0%) usr   0.01 ( 0%) sys   0.06 ( 0%) wall
   0 kB ( 0%) ggc
 unaccounted todo  :   1.19 ( 0%) usr   0.00 ( 0%) sys   1.17 ( 0%) wall
   0 kB ( 0%) ggc
 TOTAL : 657.0712.64   672.26
1247550 kB

note that total GGC use seems obviously wrong.  The peak GGC report reads: {GC
4079042k - 4043085k}

with the patch
Execution times (seconds)
 garbage collection:  13.85 ( 3%) usr   0.02 ( 0%) sys  13.88 ( 3%) wall
   0 kB ( 0%) ggc
 callgraph optimization:   2.40 ( 0%) usr   0.00 ( 0%) sys   2.40 ( 0%) wall   
15248 kB ( 1%) ggc
 varpool construction  :   0.69 ( 0%) usr   0.03 ( 0%) sys   0.71 ( 0%) wall   
51621 kB ( 4%) ggc
 ipa cp:   1.86 ( 0%) usr   0.11 ( 1%) sys   1.97 ( 0%) wall  
119697 kB ( 9%) ggc
 ipa lto gimple in :   0.04 ( 0%) usr   0.02 ( 0%) sys   0.06 ( 0%) wall
   0 kB ( 0%) ggc
 ipa lto gimple out:  11.86 ( 2%) usr   0.92 ( 9%) sys  12.80 ( 2%) wall
   0 kB ( 0%) ggc
 ipa lto decl in   : 287.52 (54%) usr   3.49 (35%) sys 291.13 (54%) wall  
713694 kB (51%) ggc
 ipa lto decl out  : 127.76 (24%) usr   0.94 ( 9%) sys 128.79 (24%) wall
   0 kB ( 0%) ggc
 ipa lto decl init I/O :   0.13 ( 0%) usr   0.02 ( 0%) sys   0.15 ( 0%) wall
  31 kB ( 0%) ggc
 ipa lto cgraph I/O:   1.66 ( 0%) usr   0.29 ( 3%) sys   1.94 ( 0%) wall  
428623 kB (30%) ggc
 ipa lto decl merge:  18.12 ( 3%) usr   0.13 ( 1%) sys  18.26 ( 3%) wall
 978 kB ( 0%) ggc
 ipa lto cgraph merge  :   1.90 ( 0%) usr   0.00 ( 0%) sys   1.91 ( 0%) wall   
15143 kB ( 1%) ggc
 whopr wpa :   1.99 ( 0%) usr   0.05 ( 0%) sys   2.01 ( 0%) wall
   1 kB ( 0%) ggc
 whopr wpa I/O :   2.40 ( 0%) usr   3.77 (38%) sys   6.47 ( 1%) wall
   0 kB ( 0%) ggc
 ipa reference :   4.56 ( 1%) usr   0.00 ( 0%) sys   4.58 ( 1%) wall
   0 kB ( 0%) ggc
 ipa profile   :   0.16 ( 0%) usr   0.00 ( 0%) sys   0.15 ( 0%) wall
   0 kB ( 0%) ggc
 ipa pure const:   3.33 ( 1%) usr   0.03 ( 0%) sys   3.36 ( 1%) wall
   0 kB ( 0%) ggc
 parser:   1.85 ( 0%) usr   0.03 ( 0%) sys   1.87 ( 0%) wall   
37684 kB ( 3%) ggc
 inline heuristics :  47.34 ( 9%) usr   0.04 ( 0%) sys  47.42 ( 9%) wall   
21988 kB ( 2%) ggc
 tree CFG construction :   0.00 ( 0%) usr   0.00 ( 0%) sys   0.01 ( 0%) wall
   1 kB ( 0%) ggc
 callgraph verifier:   0.45 ( 0%) usr   0.05 ( 0%) sys   0.55 ( 0%) wall
   0 kB ( 0%) ggc
 varconst  :   0.00 ( 0%) usr   0.03 ( 0%) sys   0.03 ( 0%) wall
   0 kB ( 0%) ggc
 unaccounted todo  :   1.38 ( 0%) usr   0.00 ( 0%) sys   1.37 ( 0%) wall
   0 kB ( 0%) ggc
 TOTAL : 531.6610.05   542.31
1405930 kB

and peak memory use 2688637k - 2136908k.  So 50% GGC memory (we need another
about 4G for non-GGC memory, probaly largely in mmap pool) and 

Re: better wpa [1/n]: merge types during read-in

2011-04-22 Thread Richard Guenther
On Fri, Apr 22, 2011 at 1:58 PM, Jan Hubicka hubi...@ucw.cz wrote:
 Hi,
 I run the patch on Mozilla.  W/o the patch it is:
 Execution times (seconds)
  garbage collection    :  20.19 ( 3%) usr   0.02 ( 0%) sys  20.22 ( 3%) wall  
      0 kB ( 0%) ggc
  callgraph optimization:   3.53 ( 1%) usr   0.01 ( 0%) sys   3.53 ( 1%) wall  
  15248 kB ( 1%) ggc
  varpool construction  :   0.77 ( 0%) usr   0.02 ( 0%) sys   0.80 ( 0%) wall  
  51607 kB ( 4%) ggc
  ipa cp                :   2.12 ( 0%) usr   0.10 ( 1%) sys   2.23 ( 0%) wall  
 119701 kB (10%) ggc
  ipa lto gimple in     :   0.07 ( 0%) usr   0.02 ( 0%) sys   0.07 ( 0%) wall  
      0 kB ( 0%) ggc
  ipa lto gimple out    :  11.63 ( 2%) usr   1.01 ( 8%) sys  12.63 ( 2%) wall  
      0 kB ( 0%) ggc
  ipa lto decl in       : 182.15 (28%) usr   4.06 (32%) sys 188.10 (28%) wall  
 392863 kB (31%) ggc
  ipa lto decl out      : 149.86 (23%) usr   0.32 ( 3%) sys 150.25 (22%) wall  
      0 kB ( 0%) ggc
  ipa lto decl init I/O :   0.14 ( 0%) usr   0.03 ( 0%) sys   0.16 ( 0%) wall  
     31 kB ( 0%) ggc
  ipa lto cgraph I/O    :   2.09 ( 0%) usr   0.27 ( 2%) sys   2.37 ( 0%) wall  
 428623 kB (34%) ggc
  ipa lto decl merge    : 219.70 (33%) usr   1.93 (15%) sys 221.75 (33%) wall  
 162687 kB (13%) ggc
  ipa lto cgraph merge  :   2.68 ( 0%) usr   0.00 ( 0%) sys   2.69 ( 0%) wall  
  15895 kB ( 1%) ggc
  whopr wpa             :   1.65 ( 0%) usr   0.04 ( 0%) sys   1.71 ( 0%) wall  
      1 kB ( 0%) ggc
  whopr wpa I/O         :   2.20 ( 0%) usr   4.55 (36%) sys   7.20 ( 1%) wall  
      0 kB ( 0%) ggc
  ipa reference         :   4.12 ( 1%) usr   0.00 ( 0%) sys   4.09 ( 1%) wall  
      0 kB ( 0%) ggc
  ipa profile           :   0.18 ( 0%) usr   0.00 ( 0%) sys   0.17 ( 0%) wall  
      0 kB ( 0%) ggc
  ipa pure const        :   3.15 ( 0%) usr   0.04 ( 0%) sys   3.19 ( 0%) wall  
      0 kB ( 0%) ggc
  parser                :   1.56 ( 0%) usr   0.00 ( 0%) sys   1.56 ( 0%) wall  
  37684 kB ( 3%) ggc
  inline heuristics     :  47.26 ( 7%) usr   0.05 ( 0%) sys  47.33 ( 7%) wall  
  21988 kB ( 2%) ggc
  callgraph verifier    :   0.42 ( 0%) usr   0.04 ( 0%) sys   0.47 ( 0%) wall  
      0 kB ( 0%) ggc
  varconst              :   0.02 ( 0%) usr   0.01 ( 0%) sys   0.06 ( 0%) wall  
      0 kB ( 0%) ggc
  unaccounted todo      :   1.19 ( 0%) usr   0.00 ( 0%) sys   1.17 ( 0%) wall  
      0 kB ( 0%) ggc
  TOTAL                 : 657.07            12.64           672.26            
 1247550 kB

 note that total GGC use seems obviously wrong.  The peak GGC report reads: {GC
 4079042k - 4043085k}

 with the patch
 Execution times (seconds)
  garbage collection    :  13.85 ( 3%) usr   0.02 ( 0%) sys  13.88 ( 3%) wall  
      0 kB ( 0%) ggc
  callgraph optimization:   2.40 ( 0%) usr   0.00 ( 0%) sys   2.40 ( 0%) wall  
  15248 kB ( 1%) ggc
  varpool construction  :   0.69 ( 0%) usr   0.03 ( 0%) sys   0.71 ( 0%) wall  
  51621 kB ( 4%) ggc
  ipa cp                :   1.86 ( 0%) usr   0.11 ( 1%) sys   1.97 ( 0%) wall  
 119697 kB ( 9%) ggc
  ipa lto gimple in     :   0.04 ( 0%) usr   0.02 ( 0%) sys   0.06 ( 0%) wall  
      0 kB ( 0%) ggc
  ipa lto gimple out    :  11.86 ( 2%) usr   0.92 ( 9%) sys  12.80 ( 2%) wall  
      0 kB ( 0%) ggc
  ipa lto decl in       : 287.52 (54%) usr   3.49 (35%) sys 291.13 (54%) wall  
 713694 kB (51%) ggc
  ipa lto decl out      : 127.76 (24%) usr   0.94 ( 9%) sys 128.79 (24%) wall  
      0 kB ( 0%) ggc
  ipa lto decl init I/O :   0.13 ( 0%) usr   0.02 ( 0%) sys   0.15 ( 0%) wall  
     31 kB ( 0%) ggc
  ipa lto cgraph I/O    :   1.66 ( 0%) usr   0.29 ( 3%) sys   1.94 ( 0%) wall  
 428623 kB (30%) ggc
  ipa lto decl merge    :  18.12 ( 3%) usr   0.13 ( 1%) sys  18.26 ( 3%) wall  
    978 kB ( 0%) ggc
  ipa lto cgraph merge  :   1.90 ( 0%) usr   0.00 ( 0%) sys   1.91 ( 0%) wall  
  15143 kB ( 1%) ggc
  whopr wpa             :   1.99 ( 0%) usr   0.05 ( 0%) sys   2.01 ( 0%) wall  
      1 kB ( 0%) ggc
  whopr wpa I/O         :   2.40 ( 0%) usr   3.77 (38%) sys   6.47 ( 1%) wall  
      0 kB ( 0%) ggc
  ipa reference         :   4.56 ( 1%) usr   0.00 ( 0%) sys   4.58 ( 1%) wall  
      0 kB ( 0%) ggc
  ipa profile           :   0.16 ( 0%) usr   0.00 ( 0%) sys   0.15 ( 0%) wall  
      0 kB ( 0%) ggc
  ipa pure const        :   3.33 ( 1%) usr   0.03 ( 0%) sys   3.36 ( 1%) wall  
      0 kB ( 0%) ggc
  parser                :   1.85 ( 0%) usr   0.03 ( 0%) sys   1.87 ( 0%) wall  
  37684 kB ( 3%) ggc
  inline heuristics     :  47.34 ( 9%) usr   0.04 ( 0%) sys  47.42 ( 9%) wall  
  21988 kB ( 2%) ggc
  tree CFG construction :   0.00 ( 0%) usr   0.00 ( 0%) sys   0.01 ( 0%) wall  
      1 kB ( 0%) ggc
  callgraph verifier    :   0.45 ( 0%) usr   0.05 ( 0%) sys   0.55 ( 0%) wall  
      0 kB ( 0%) ggc
  varconst              :   0.00 ( 0%) usr   0.03 ( 0%) sys   0.03 ( 0%) wall  
      0 kB ( 0%) ggc
  unaccounted todo      :   1.38 ( 0%) usr   0.00 ( 0%) sys   1.37 ( 0%) wall  
      0 kB ( 0%) ggc
  TOTAL                 : 531.66            10.05           

Re: better wpa [1/n]: merge types during read-in

2011-04-22 Thread Jan Hubicka
 Yes, that's very likely.  If we'd get around to re-do the LTO option saving 
 code
 we might want to forbid -g0 compile and -g link (dropping -g at link
 time as soon
 as we see a single module compiled with -g0).  Then we can free some more
 stuff, at least with -g0 - though I'm not sure -g0 matters in practice.

Don't know if it matters to care about it terribly much, but definitely default
mozilla builds are -g0.

I assume that now we want to be more selective about what to save for -g and 
also
try to push more stuff to bypass WPA. I don't see need for all the type merging
for stuff that is only referred from debug info. It will be fun to get this done
however. (either by early debug info output that goes to extra section and is
distributed to ltrans same way as we distribute bodies or by some other means)

You had the tree code memory use statistics patch.  If you send me pointer or
commit it, I can generate tree code statistics.

Honza
 
 Maybe we can shift numbers back from ipa lto decl in to ipa lto decl
 merge by some
 timevar adjustments?
 
 Richard.
 
  Honza
 


Re: better wpa [1/n]: merge types during read-in

2011-04-22 Thread Jan Hubicka
  Yes, that's very likely.  If we'd get around to re-do the LTO option saving 
  code
  we might want to forbid -g0 compile and -g link (dropping -g at link
  time as soon
  as we see a single module compiled with -g0).  Then we can free some more
  stuff, at least with -g0 - though I'm not sure -g0 matters in practice.
 
 Don't know if it matters to care about it terribly much, but definitely 
 default
 mozilla builds are -g0.
 
 I assume that now we want to be more selective about what to save for -g and 
 also
 try to push more stuff to bypass WPA. I don't see need for all the type 
 merging
 for stuff that is only referred from debug info. It will be fun to get this 
 done
 however. (either by early debug info output that goes to extra section and is
 distributed to ltrans same way as we distribute bodies or by some other means)

But overall 2GB of IL for Mozilla is not that bad.  Ohter compilers consume
easilly more.  However we need to figure out where the other 4GB is going (mmap
pool is obviously just part of it, probably hashtables consume good part of it,
too) and do something about those 10 minutes of hashtable lookups needed to
read and write out data.

Problem is amount of data we produce for ltrans, that is partly because we ship
debug info into units that won't output it.

I guess I should look into perf and try to produce low overhead profile with
backtraces so we know what hashtables are worse than others. Though we probably
won't see there much surprises.

Honza


Re: better wpa [1/n]: merge types during read-in

2011-04-22 Thread Jan Hubicka
 Yes, that's very likely.  If we'd get around to re-do the LTO option saving 
 code
 we might want to forbid -g0 compile and -g link (dropping -g at link
 time as soon
 as we see a single module compiled with -g0).  Then we can free some more
 stuff, at least with -g0 - though I'm not sure -g0 matters in practice.

I would a lot preffer if dwaf2out was able to do the right thing here: output
debug info for things that came with debug info in (i.e. was originally compiled
with -g) and not for thing that did not.

Traditionally it was not neccesary to set -g or -g0 at linktime and we should
just get to the same behaviour, perhaps with option to overwrite to -g0
explicitely.

Honza


Re: better wpa [1/n]: merge types during read-in

2011-04-21 Thread Michael Matz
Hi,

On Wed, 20 Apr 2011, Michael Matz wrote:

  It would have been nice to have the top-level tree merging as a 
  separate patch, as I am not convinced it is correct, but see below ...
 
 I'll split it out.

Like so (also including the other remarks).

Regstrapping on x86_64-linux in progress.


Ciao,
Michael.

* lto-streamer.c (lto_streamer_cache_insert_1): Accept to override
other trees that just builtins.
(lto_record_common_node): Don't leave NULL TYPE_CANONICAL.

lto/
* lto.c (toplevel): Include tree-flow.h.
(lto_read_in_decl_state): Don't merge types here.
(tree_with_vars): New static hash table.
(remember_with_vars): New static functions.
(LTO_FIXUP_TYPE): New macro.
(lto_ft_common, lto_ft_decl_minimal, lto_ft_decl_common,
lto_ft_decl_with_vis, lto_ft_decl_non_common, lto_ft_function,
lto_ft_field_decl, lto_ft_type, lto_ft_binfo, lto_ft_constructor,
lto_ft_expr, lto_fixup_types, uniquify_nodes): New static functions.
(lto_read_decls): Uniquify while reading in trees.
(lto_fixup_data_t, LTO_FIXUP_SUBTREE,
LTO_REGISTER_TYPE_AND_FIXUP_SUBTREE, no_fixup_p, lto_fixup_common,
lto_fixup_decl_minimal, lto_fixup_decl_common, lto_fixup_decl_with_vis,
lto_fixup_decl_non_common, lto_fixup_function, lto_fixup_field_decl,
lto_fixup_type, lto_fixup_binfo, lto_fixup_constructor,
lto_fixup_tree): Remove.
(lto_fixup_state): Remove data argument.  Use
lto_symtab_prevailing_decl.
(LTO_SET_PREVAIL, LTO_NO_PREVAIL): New macros.
(lto_fixup_prevailing_decls): New function.
(lto_fixup_state_aux): Argument aux is unused.
(lto_fixup_decls): Don't allocate pointer sets, don't use
lto_fixup_tree, use lto_fixup_prevailing_decls.
(read_cgraph_and_symbols): Allocate and remove tree_with_vars.
* Make-lang.in (lto/lto.o): Depend on $(TREE_FLOW_H).

Index: lto-streamer.c
===
*** lto-streamer.c  (revision 172769)
--- lto-streamer.c  (working copy)
*** lto_streamer_cache_insert_1 (struct lto_
*** 383,401 
{
  /* If the caller wants to insert T at a specific slot
 location, and ENTRY-TO does not match *IX_P, add T to
!the requested location slot.  This situation arises when
!streaming builtin functions.
! 
!For instance, on the writer side we could have two
!FUNCTION_DECLS T1 and T2 that are represented by the same
!builtin function.  The reader will only instantiate the
!canonical builtin, but since T1 and T2 had been
!originally stored in different cache slots (S1 and S2),
!the reader must be able to find the canonical builtin
!function at slots S1 and S2.  */
! gcc_assert (lto_stream_as_builtin_p (t));
  ix = *ix_p;
- 
  lto_streamer_cache_add_to_node_array (cache, ix, t);
}
  
--- 383,390 
{
  /* If the caller wants to insert T at a specific slot
 location, and ENTRY-TO does not match *IX_P, add T to
!the requested location slot.  */
  ix = *ix_p;
  lto_streamer_cache_add_to_node_array (cache, ix, t);
}
  
*** lto_record_common_node (tree *nodep, VEC
*** 513,518 
--- 502,509 
TYPE_CANONICAL (node) = NULL_TREE;
node = gimple_register_type (node);
TYPE_CANONICAL (node) = gimple_register_canonical_type (node);
+   if (in_lto_p)
+   TYPE_CANONICAL (*nodep) = TYPE_CANONICAL (node);
*nodep = node;
  }
  
Index: lto/lto.c
===
*** lto/lto.c   (revision 172769)
--- lto/lto.c   (working copy)
*** along with GCC; see the file COPYING3.
*** 24,29 
--- 24,30 
  #include opts.h
  #include toplev.h
  #include tree.h
+ #include tree-flow.h
  #include diagnostic-core.h
  #include tm.h
  #include cgraph.h
*** lto_read_in_decl_state (struct data_in *
*** 215,228 
tree *decls = ggc_alloc_vec_tree (size);
  
for (j = 0; j  size; j++)
!   {
! decls[j] = lto_streamer_cache_get (data_in-reader_cache, data[j]);
! 
! /* Register every type in the global type table.  If the
!type existed already, use the existing type.  */
! if (TYPE_P (decls[j]))
!   decls[j] = gimple_register_type (decls[j]);
!   }
  
state-streams[i].size = size;
state-streams[i].trees = decls;
--- 216,222 
tree *decls = ggc_alloc_vec_tree (size);
  
for (j = 0; j  size; j++)
!   decls[j] = lto_streamer_cache_get (data_in-reader_cache, data[j]);
  
state-streams[i].size = size;
state-streams[i].trees = decls;
*** lto_read_in_decl_state (struct data_in *

Re: better wpa [1/n]: merge types during read-in

2011-04-21 Thread Richard Guenther
On Thu, Apr 21, 2011 at 3:46 PM, Michael Matz m...@suse.de wrote:
 Hi,

 On Wed, 20 Apr 2011, Michael Matz wrote:

  It would have been nice to have the top-level tree merging as a
  separate patch, as I am not convinced it is correct, but see below ...

 I'll split it out.

 Like so (also including the other remarks).

 Regstrapping on x86_64-linux in progress.

Ok if it passed.

Thanks,
Richard.


 Ciao,
 Michael.

        * lto-streamer.c (lto_streamer_cache_insert_1): Accept to override
        other trees that just builtins.
        (lto_record_common_node): Don't leave NULL TYPE_CANONICAL.

 lto/
        * lto.c (toplevel): Include tree-flow.h.
        (lto_read_in_decl_state): Don't merge types here.
        (tree_with_vars): New static hash table.
        (remember_with_vars): New static functions.
        (LTO_FIXUP_TYPE): New macro.
        (lto_ft_common, lto_ft_decl_minimal, lto_ft_decl_common,
        lto_ft_decl_with_vis, lto_ft_decl_non_common, lto_ft_function,
        lto_ft_field_decl, lto_ft_type, lto_ft_binfo, lto_ft_constructor,
        lto_ft_expr, lto_fixup_types, uniquify_nodes): New static functions.
        (lto_read_decls): Uniquify while reading in trees.
        (lto_fixup_data_t, LTO_FIXUP_SUBTREE,
        LTO_REGISTER_TYPE_AND_FIXUP_SUBTREE, no_fixup_p, lto_fixup_common,
        lto_fixup_decl_minimal, lto_fixup_decl_common, lto_fixup_decl_with_vis,
        lto_fixup_decl_non_common, lto_fixup_function, lto_fixup_field_decl,
        lto_fixup_type, lto_fixup_binfo, lto_fixup_constructor,
        lto_fixup_tree): Remove.
        (lto_fixup_state): Remove data argument.  Use
        lto_symtab_prevailing_decl.
        (LTO_SET_PREVAIL, LTO_NO_PREVAIL): New macros.
        (lto_fixup_prevailing_decls): New function.
        (lto_fixup_state_aux): Argument aux is unused.
        (lto_fixup_decls): Don't allocate pointer sets, don't use
        lto_fixup_tree, use lto_fixup_prevailing_decls.
        (read_cgraph_and_symbols): Allocate and remove tree_with_vars.
        * Make-lang.in (lto/lto.o): Depend on $(TREE_FLOW_H).

 Index: lto-streamer.c
 ===
 *** lto-streamer.c      (revision 172769)
 --- lto-streamer.c      (working copy)
 *** lto_streamer_cache_insert_1 (struct lto_
 *** 383,401 
        {
          /* If the caller wants to insert T at a specific slot
             location, and ENTRY-TO does not match *IX_P, add T to
 !            the requested location slot.  This situation arises when
 !            streaming builtin functions.
 !
 !            For instance, on the writer side we could have two
 !            FUNCTION_DECLS T1 and T2 that are represented by the same
 !            builtin function.  The reader will only instantiate the
 !            canonical builtin, but since T1 and T2 had been
 !            originally stored in different cache slots (S1 and S2),
 !            the reader must be able to find the canonical builtin
 !            function at slots S1 and S2.  */
 !         gcc_assert (lto_stream_as_builtin_p (t));
          ix = *ix_p;
 -
          lto_streamer_cache_add_to_node_array (cache, ix, t);
        }

 --- 383,390 
        {
          /* If the caller wants to insert T at a specific slot
             location, and ENTRY-TO does not match *IX_P, add T to
 !            the requested location slot.  */
          ix = *ix_p;
          lto_streamer_cache_add_to_node_array (cache, ix, t);
        }

 *** lto_record_common_node (tree *nodep, VEC
 *** 513,518 
 --- 502,509 
        TYPE_CANONICAL (node) = NULL_TREE;
        node = gimple_register_type (node);
        TYPE_CANONICAL (node) = gimple_register_canonical_type (node);
 +       if (in_lto_p)
 +       TYPE_CANONICAL (*nodep) = TYPE_CANONICAL (node);
        *nodep = node;
      }

 Index: lto/lto.c
 ===
 *** lto/lto.c   (revision 172769)
 --- lto/lto.c   (working copy)
 *** along with GCC; see the file COPYING3.
 *** 24,29 
 --- 24,30 
  #include opts.h
  #include toplev.h
  #include tree.h
 + #include tree-flow.h
  #include diagnostic-core.h
  #include tm.h
  #include cgraph.h
 *** lto_read_in_decl_state (struct data_in *
 *** 215,228 
        tree *decls = ggc_alloc_vec_tree (size);

        for (j = 0; j  size; j++)
 !       {
 !         decls[j] = lto_streamer_cache_get (data_in-reader_cache, data[j]);
 !
 !         /* Register every type in the global type table.  If the
 !            type existed already, use the existing type.  */
 !         if (TYPE_P (decls[j]))
 !           decls[j] = gimple_register_type (decls[j]);
 !       }

        state-streams[i].size = size;
        state-streams[i].trees = decls;
 --- 216,222 
        tree *decls = ggc_alloc_vec_tree (size);

        for (j = 0; j  size; j++)
 !       decls[j] = lto_streamer_cache_get (data_in-reader_cache, data[j]);

   

Re: better wpa [1/n]: merge types during read-in

2011-04-20 Thread Michael Matz
Hi,

On Wed, 20 Apr 2011, Richard Guenther wrote:

  + /* A hashtable of trees that potentially refer to variables or functions
  +    that must be replaced with their prevailing variant.  */
  + static GTY((if_marked (ggc_marked_p), param_is (union tree_node))) 
  htab_t
  +   tree_with_vars;
  + /* A hashtable of top-level trees that can potentially be merged with 
  trees
  +    from other compilation units.  */
 
 It would have been nice to have the top-level tree merging as a separate
 patch, as I am not convinced it is correct, but see below ...

I'll split it out.

  +       {
  +         tree *t2 = (tree *) htab_find_slot (top_decls, t, INSERT);
  +         if (*t2)
  +           t = *t2;
  +         else
  +           *t2 = t;
 
 I don't think we can share TREE_LISTs.  Where do they hang off
 when you do this?  Are not all but one of those dead?  All
 TREE_LIST manipulation routines I know of do not even think about
 the possibility of shared lists.

Well, it clearly works for the use in global trees :-)  The problem I 
solved with this is, that all trees are stored in the reader cache.  If 
(say) two function types, or enum types are merged, one of their 
TYPE_ARG_TYPES or TYPE_VALUES lists should be dead.  But as they're still 
referred from the cache they aren't collected.  I could remove them from 
the cache in a similar way to how I deal with the FIELD_DECLs.  Or I could 
store tree_list nodes not in the cache at all.  The latter seems more 
worthwhile to me, so I'm going to try this.

  + /* Fix up fields of a field_decl T.  */
  +
  + static void
  + lto_ft_field_decl (tree t)
  + {
  +   lto_ft_decl_common (t);
  +   LTO_FIXUP_TYPE (DECL_FIELD_OFFSET (t));
  +   LTO_FIXUP_TYPE (DECL_BIT_FIELD_TYPE (t));
  +   LTO_FIXUP_TYPE (DECL_QUALIFIER (t));
 
 here (and earlier) we had some no_fixup_p asserts.  What happened to
 them?

no_fixup_p checked only that the tree wasn't a type or a 
var/function_decl.

 In particular, don't we need to fixup the DECL_FIELD_BIT_OFFSET 
 integer-cst?

But I was indeed too eager to remove those lines, I should still deal with 
DECL_SECTION_NAME, DECL_FIELD_BIT_OFFSET and BINFO_OFFSET.  Thanks for 
catching this.

  +
  +       /* First fixup the fields of T.  */
  +       lto_fixup_types (t);
  +
  +       /* Now try to find a canonical variant of T itself.  */
  +       if (TYPE_P (t))
  +       {
 
 If t is a type, why fix up its field if it may not be the canonical variant?

Because type merging to work sometimes requires already canonicalized 
fields, at least that's what I found in investigating why some types 
weren't merged that should have been.  Hence I'm first canonicalizing all 
fields of everything and then see if something merged.

  +         if (t == oldt
  +              TYPE_MAIN_VARIANT (t) != t)
  +           {
  +             /* If this is its own type, link it into the variant
  +                chain.  */
  +             TYPE_NEXT_VARIANT (t) = TYPE_NEXT_VARIANT (TYPE_MAIN_VARIANT 
  (t));
  +             TYPE_NEXT_VARIANT (TYPE_MAIN_VARIANT (t)) = t;
 
 Hmm - isn't this taken care of in lto_fixup_types?

Nope.  I've taken it out of the (old) lto_fixup_type, because (now that 
I've removed the seen pointer_set) it can be called for the same type 
multiple times, which would lead to it being included multiple times in 
the NEXT_VARIANT chain.  I found it more clear to do this kind of list 
manipulation at the place where we're sure to see every type only once.

  +             for (f1 = TYPE_FIELDS (t), f2 = TYPE_FIELDS (oldt);
  +                  f1  f2; f1 = TREE_CHAIN (f1), f2 = TREE_CHAIN (f2))
  +               if (TREE_CODE (f1) == FIELD_DECL
  +                    f1 != f2
 
 This should always be true unless TYPE_FIELDS (t) == TYPE_FIELDS (oldt)

Think shared field_decl chains.  I'll have fixed up the chain for one of 
the type pairs already and can later come to a type referring exactly the 
same field_decls again.

  +                    lto_streamer_cache_lookup (cache, f2, ix))
 
 This lookup should always succeed, no?

Yes.  I'll assert this.

  +                 {
  +                   gcc_assert (DECL_NAME (f1) == DECL_NAME (f2));
  +                   /* If we're going to replace an element which we'd
  +                      still visit in the next iterations, we wouldn't
  +                      handle it, so do it here.  */
  +                   if (ix  i)
  +                     lto_fixup_types (f2);
 
 ?  But it's dead, no?

That is true, but it can refer to integer_cst which I can only fix up by 
walking the fields.  If I don't do that, then even though the field_decl 
will not be in the cache anymore, its integer_cst (and their non-fixed up 
types) will stay uncollected.

  +                   lto_streamer_cache_insert_at (cache, f1, ix);
  +                 }
 
 Btw, nice.  Does this get rid of the need for the field-decl fixup code
 in input_gimple_stmt?

Hmm, it's gross but seems to me still required for the 

Re: better wpa [1/n]: merge types during read-in

2011-04-20 Thread Michael Matz
Hi,

On Wed, 20 Apr 2011, Richard Guenther wrote:

  If t is a type, why fix up its field if it may not be the canonical 
  variant?
 
  Because type merging to work sometimes requires already canonicalized
  fields, at least that's what I found in investigating why some types
  weren't merged that should have been.  Hence I'm first canonicalizing all
  fields of everything and then see if something merged.
 
 That sounds like a bug in type-merging.  You don't happen to have
 a small testcase? ;)

cc1 was my testcase :-/

  Think shared field_decl chains.  I'll have fixed up the chain for one of
  the type pairs already and can later come to a type referring exactly the
  same field_decls again.
 
 But only in case the first one is already equal.  What I wanted to say is
 that we shouldn't have partially shared chains, so
 
   if (TYPE_FIELDS (t) != TYPE_FIELDS (oldt))
 for (...)
   if (TREE_CODE (f1) == FIELD_DECL)
 ...
 
 should be enough, no?

Indeed.

 In fact, why restrict fixing up the cache to
 FIELD_DECLs and not also do it for TYPE_DECLs or FUNCTION_DECLs
 that may reside in this chain?

non-FIELD_DECLs are removed in free_lang_data.  But even more so I can 
remove the test for FIELD_DECL.

  Hmm, it's gross but seems to me still required for the diagnostic and 
  to emit the VIEW_CONVERT_EXPR, at least for invalid input code.  OTOH 
  if the streamed out code ensures that a field_decl in a component_ref 
  always is included in its DECL_CONTEXT, then the new merging should 
  indeed make sure that this also holds after streaming in.  Do we have 
  testcases specifically trying this code?  greping for mismatching in 
  testsuite/ doesn't show anything relevant.
 
 The lto testsuite harness doesn't support dg-error/warning, so there are 
 no testcases.  There are testcases that ICEd (type verification) before 
 introducing these fixups though.

Okay, I'll leave investigating this to a follow up.


Ciao,
Michael.

better wpa [1/n]: merge types during read-in

2011-04-19 Thread Michael Matz
Hi,

I have a backlog of random improvements to the WPA phase of LTO 
compilations, all with the common goal of reducing peak memory usage.  I 
was basically dumping all trees that the WPA phase read in, and then tried 
to think about which trees can be merged with already existing ones very 
early or alternatively which shouldn't have been in the global section 
from the beginning.  I.e. whenever there were two trees left after reading 
in all units, that IMO should have been the same, I either tried to merge 
them, or find reasons why they in fact were not the same.

This is the first part, that in essence merely moves the merging of types 
somewhat earlier.  Currently all non-local trees of all object files are 
read in, then symbols are merged, then types and prevailing decls are set.  
That's the maximum of peak memory usage you can get, because during 
read-in virtually no trees will be garbage collected.

There's much one can merge already during read-in, types are the easiest 
thing, some obvious top-level trees are in the same league (integer and 
string constants, tree_lists used to collect arguments, field_decls of the 
just replaced types).  Later patches will extend this set.  This current 
one will deal just with the mentioned entities.

One nice property of our reader is, that all trees are nicely collected in 
a sequential array (the reader cache).  This implies that we don't have to 
use walk_tree to get to all reachable trees, we merely have to iterate 
through that array and are sure we'll catch everything (except some 
builtin trees).  We also don't have to remember which trees we've already 
visited, we'll naturally visit each one just once.

The current callbacks of walk_tree are used for two purposes: merging 
types and setting prevailing decls.  The latter can only be done after the 
prevailing decls are known, and that is only when we've seen all object 
files (even in the case of the linker plugin).  Hence, while uniquifying 
the read trees I'm also remembering those that (potentially) refer to 
variables or functions in a hashtable, which conveniently has support in 
the garbage collector to remove unreachable trees (so, we'll later only 
iterate over those trees that are known to be reachable).  I'm using that 
hash at the place where currently walk_tree is used to iterate over only 
those trees that potentially refer to a var/function_decl.

The type merger is in essence a moved and slightly changed copy of the 
original walk_tree callback routines.  Via some asserts I've verified that 
I'm replacing all reachable vars or functions with their prevailing decl, 
before removing the old callbacks (the same cannot be done with the merged 
types, due to ordering issues type merging depends on other types already 
being merged, or, worse, some decls being merged; that's already currently 
the case, rerunning the type merger will produce more merged types).

I'm starting to merge tree as soon as a top-level invocation of 
lto_input_tree is finished.  That is the earliest point at which all 
recursively reachable trees are known and read in.  It's better to do it 
at that place than to defer it until the whole unit is read in because so 
the next invocations of input_tree (possibly referring back to already 
existing trees in the cache) will make use of the already merged variants.

Some statistics for an LTO bootstrap, in this case only for the cc1 
executable, only for the WPA phase, generated with a -O0 compiled lto1 
(the cc1 object are from stage3 of a normal lto bootstrap).  Without the 
patch:

Merging declarations
 {GC 320745k - 204020k}Reading summaries
 ipa lto gimple out:   6.08 ( 7%) usr   0.14 ( 9%) sys   6.23 ( 7%) wall   
0 kB ( 0%) ggc
 ipa lto decl in   :  21.24 (23%) usr   0.38 (24%) sys  21.65 (23%) wall  
311556 kB (65%) ggc
 ipa lto decl out  :  17.89 (19%) usr   0.01 ( 1%) sys  17.91 (19%) wall   
0 kB ( 0%) ggc
 ipa lto decl merge:  12.09 (13%) usr   0.13 ( 8%) sys  12.22 (13%) wall   
7704 kB ( 2%) ggc
 inline heuristics :  21.05 (23%) usr   0.00 ( 0%) sys  21.07 (22%) wall   
28972 kB ( 6%) ggc
 TOTAL :  92.99 1.5894.68 
479817 kB

I.e. right after reading in all units we have 320MB of ggc memory.

With the patch:

Merging declarations
 {GC 193371k - 176876k}Reading summaries
 ipa lto gimple out:   6.63 ( 8%) usr   0.19 ( 9%) sys  11.13 (12%) wall
   0 kB ( 0%) ggc
 ipa lto decl in   :  27.09 (32%) usr   0.30 (14%) sys  27.79 (30%) wall  
318044 kB (66%) ggc
 ipa lto decl out  :  13.96 (17%) usr   0.02 ( 1%) sys  13.99 (15%) wall
   0 kB ( 0%) ggc
 ipa lto decl merge:   0.32 ( 0%) usr   0.00 ( 0%) sys   0.33 ( 0%) wall
   5 kB ( 0%) ggc
 inline heuristics :  21.04 (25%) usr   0.03 ( 1%) sys  21.10 (23%) wall   
28972 kB ( 6%) ggc
 TOTAL :  84.10 2.1791.78 
478594 kB

So, about 193MB of ggc memory after 

Re: better wpa [1/n]: merge types during read-in

2011-04-19 Thread Jan Hubicka
 Hi,
 
 I have a backlog of random improvements to the WPA phase of LTO 
 compilations, all with the common goal of reducing peak memory usage.  I 
 was basically dumping all trees that the WPA phase read in, and then tried 
 to think about which trees can be merged with already existing ones very 
 early or alternatively which shouldn't have been in the global section 
 from the beginning.  I.e. whenever there were two trees left after reading 
 in all units, that IMO should have been the same, I either tried to merge 
 them, or find reasons why they in fact were not the same.

Thanks, this hard work was on my TODO list for a while.  I am very heappy you
beat me on this!
I will give a try your patch on Mozilla, I think the reuslts should be a lot
better than on GCC.

 There's much one can merge already during read-in, types are the easiest 
 thing, some obvious top-level trees are in the same league (integer and 
 string constants, tree_lists used to collect arguments, field_decls of the 
 just replaced types).  Later patches will extend this set.  This current 
 one will deal just with the mentioned entities.

For DECls, obviously the IPA passes will be in the way.
You mentioned the code that just reads things in and applies to same cgraph
nodes.  This will break in non-trivial situation.
I however think that all we need is a way to read in a DECL and or cgraph ID
from stream file and get the corresponding (merged) DECL or cgraph plus
information if the merged DECL is one the ID originally referred to. I guess
all the existing stream in code for cgraph  IPA passes can be quite easilly
restructured to ignore or merge the data on prevailed nodes during read.
In fact I planned this design originally. I also planned to organize the 
summaries
into per-function sections, instead of per-pass sections, so one can read in
only data of those that won in the merging battle.  Obviously the implementation
went other way early in LTO branch design.
 
 One nice property of our reader is, that all trees are nicely collected in 
 a sequential array (the reader cache).  This implies that we don't have to 
 use walk_tree to get to all reachable trees, we merely have to iterate 
 through that array and are sure we'll catch everything (except some 
 builtin trees).  We also don't have to remember which trees we've already 
 visited, we'll naturally visit each one just once.
 
 The current callbacks of walk_tree are used for two purposes: merging 
 types and setting prevailing decls.  The latter can only be done after the 
 prevailing decls are known, and that is only when we've seen all object 
 files (even in the case of the linker plugin).  Hence, while uniquifying 
 the read trees I'm also remembering those that (potentially) refer to 
 variables or functions in a hashtable, which conveniently has support in 
 the garbage collector to remove unreachable trees (so, we'll later only 
 iterate over those trees that are known to be reachable).  I'm using that 
 hash at the place where currently walk_tree is used to iterate over only 
 those trees that potentially refer to a var/function_decl.
 
 The type merger is in essence a moved and slightly changed copy of the 
 original walk_tree callback routines.  Via some asserts I've verified that 
 I'm replacing all reachable vars or functions with their prevailing decl, 
 before removing the old callbacks (the same cannot be done with the merged 
 types, due to ordering issues type merging depends on other types already 
 being merged, or, worse, some decls being merged; that's already currently 
 the case, rerunning the type merger will produce more merged types).

Yeah, this also sounds all good. I wondered why the code is not organized this 
way
and instead keeps walking everything recursively.
 
 Merging declarations
  {GC 320745k - 204020k}Reading summaries
  ipa lto gimple out:   6.08 ( 7%) usr   0.14 ( 9%) sys   6.23 ( 7%) wall  
  0 kB ( 0%) ggc
  ipa lto decl in   :  21.24 (23%) usr   0.38 (24%) sys  21.65 (23%) wall  
 311556 kB (65%) ggc
  ipa lto decl out  :  17.89 (19%) usr   0.01 ( 1%) sys  17.91 (19%) wall  
  0 kB ( 0%) ggc
  ipa lto decl merge:  12.09 (13%) usr   0.13 ( 8%) sys  12.22 (13%) wall  
  7704 kB ( 2%) ggc
  inline heuristics :  21.05 (23%) usr   0.00 ( 0%) sys  21.07 (22%) wall  
  28972 kB ( 6%) ggc
  TOTAL :  92.99 1.5894.68 
 479817 kB
 
 I.e. right after reading in all units we have 320MB of ggc memory.
 
 With the patch:
 
 Merging declarations
  {GC 193371k - 176876k}Reading summaries
  ipa lto gimple out:   6.63 ( 8%) usr   0.19 ( 9%) sys  11.13 (12%) wall  
  0 kB ( 0%) ggc
  ipa lto decl in   :  27.09 (32%) usr   0.30 (14%) sys  27.79 (30%) wall  
 318044 kB (66%) ggc
  ipa lto decl out  :  13.96 (17%) usr   0.02 ( 1%) sys  13.99 (15%) wall  
  0 kB ( 0%) ggc
  ipa lto decl merge:   0.32 ( 0%) usr   0.00 ( 0%) sys   0.33 ( 0%) wall  
  5 kB (