Re: better wpa [1/n]: merge types during read-in
Hi, On Thu, 21 Apr 2011, Richard Guenther wrote: It would have been nice to have the top-level tree merging as a separate patch, as I am not convinced it is correct, but see below ... I'll split it out. Like so (also including the other remarks). Regstrapping on x86_64-linux in progress. Ok if it passed. r173155. Ciao, Michael.
Re: better wpa [1/n]: merge types during read-in
Hi, I run the patch on Mozilla. W/o the patch it is: Execution times (seconds) garbage collection: 20.19 ( 3%) usr 0.02 ( 0%) sys 20.22 ( 3%) wall 0 kB ( 0%) ggc callgraph optimization: 3.53 ( 1%) usr 0.01 ( 0%) sys 3.53 ( 1%) wall 15248 kB ( 1%) ggc varpool construction : 0.77 ( 0%) usr 0.02 ( 0%) sys 0.80 ( 0%) wall 51607 kB ( 4%) ggc ipa cp: 2.12 ( 0%) usr 0.10 ( 1%) sys 2.23 ( 0%) wall 119701 kB (10%) ggc ipa lto gimple in : 0.07 ( 0%) usr 0.02 ( 0%) sys 0.07 ( 0%) wall 0 kB ( 0%) ggc ipa lto gimple out: 11.63 ( 2%) usr 1.01 ( 8%) sys 12.63 ( 2%) wall 0 kB ( 0%) ggc ipa lto decl in : 182.15 (28%) usr 4.06 (32%) sys 188.10 (28%) wall 392863 kB (31%) ggc ipa lto decl out : 149.86 (23%) usr 0.32 ( 3%) sys 150.25 (22%) wall 0 kB ( 0%) ggc ipa lto decl init I/O : 0.14 ( 0%) usr 0.03 ( 0%) sys 0.16 ( 0%) wall 31 kB ( 0%) ggc ipa lto cgraph I/O: 2.09 ( 0%) usr 0.27 ( 2%) sys 2.37 ( 0%) wall 428623 kB (34%) ggc ipa lto decl merge: 219.70 (33%) usr 1.93 (15%) sys 221.75 (33%) wall 162687 kB (13%) ggc ipa lto cgraph merge : 2.68 ( 0%) usr 0.00 ( 0%) sys 2.69 ( 0%) wall 15895 kB ( 1%) ggc whopr wpa : 1.65 ( 0%) usr 0.04 ( 0%) sys 1.71 ( 0%) wall 1 kB ( 0%) ggc whopr wpa I/O : 2.20 ( 0%) usr 4.55 (36%) sys 7.20 ( 1%) wall 0 kB ( 0%) ggc ipa reference : 4.12 ( 1%) usr 0.00 ( 0%) sys 4.09 ( 1%) wall 0 kB ( 0%) ggc ipa profile : 0.18 ( 0%) usr 0.00 ( 0%) sys 0.17 ( 0%) wall 0 kB ( 0%) ggc ipa pure const: 3.15 ( 0%) usr 0.04 ( 0%) sys 3.19 ( 0%) wall 0 kB ( 0%) ggc parser: 1.56 ( 0%) usr 0.00 ( 0%) sys 1.56 ( 0%) wall 37684 kB ( 3%) ggc inline heuristics : 47.26 ( 7%) usr 0.05 ( 0%) sys 47.33 ( 7%) wall 21988 kB ( 2%) ggc callgraph verifier: 0.42 ( 0%) usr 0.04 ( 0%) sys 0.47 ( 0%) wall 0 kB ( 0%) ggc varconst : 0.02 ( 0%) usr 0.01 ( 0%) sys 0.06 ( 0%) wall 0 kB ( 0%) ggc unaccounted todo : 1.19 ( 0%) usr 0.00 ( 0%) sys 1.17 ( 0%) wall 0 kB ( 0%) ggc TOTAL : 657.0712.64 672.26 1247550 kB note that total GGC use seems obviously wrong. The peak GGC report reads: {GC 4079042k - 4043085k} with the patch Execution times (seconds) garbage collection: 13.85 ( 3%) usr 0.02 ( 0%) sys 13.88 ( 3%) wall 0 kB ( 0%) ggc callgraph optimization: 2.40 ( 0%) usr 0.00 ( 0%) sys 2.40 ( 0%) wall 15248 kB ( 1%) ggc varpool construction : 0.69 ( 0%) usr 0.03 ( 0%) sys 0.71 ( 0%) wall 51621 kB ( 4%) ggc ipa cp: 1.86 ( 0%) usr 0.11 ( 1%) sys 1.97 ( 0%) wall 119697 kB ( 9%) ggc ipa lto gimple in : 0.04 ( 0%) usr 0.02 ( 0%) sys 0.06 ( 0%) wall 0 kB ( 0%) ggc ipa lto gimple out: 11.86 ( 2%) usr 0.92 ( 9%) sys 12.80 ( 2%) wall 0 kB ( 0%) ggc ipa lto decl in : 287.52 (54%) usr 3.49 (35%) sys 291.13 (54%) wall 713694 kB (51%) ggc ipa lto decl out : 127.76 (24%) usr 0.94 ( 9%) sys 128.79 (24%) wall 0 kB ( 0%) ggc ipa lto decl init I/O : 0.13 ( 0%) usr 0.02 ( 0%) sys 0.15 ( 0%) wall 31 kB ( 0%) ggc ipa lto cgraph I/O: 1.66 ( 0%) usr 0.29 ( 3%) sys 1.94 ( 0%) wall 428623 kB (30%) ggc ipa lto decl merge: 18.12 ( 3%) usr 0.13 ( 1%) sys 18.26 ( 3%) wall 978 kB ( 0%) ggc ipa lto cgraph merge : 1.90 ( 0%) usr 0.00 ( 0%) sys 1.91 ( 0%) wall 15143 kB ( 1%) ggc whopr wpa : 1.99 ( 0%) usr 0.05 ( 0%) sys 2.01 ( 0%) wall 1 kB ( 0%) ggc whopr wpa I/O : 2.40 ( 0%) usr 3.77 (38%) sys 6.47 ( 1%) wall 0 kB ( 0%) ggc ipa reference : 4.56 ( 1%) usr 0.00 ( 0%) sys 4.58 ( 1%) wall 0 kB ( 0%) ggc ipa profile : 0.16 ( 0%) usr 0.00 ( 0%) sys 0.15 ( 0%) wall 0 kB ( 0%) ggc ipa pure const: 3.33 ( 1%) usr 0.03 ( 0%) sys 3.36 ( 1%) wall 0 kB ( 0%) ggc parser: 1.85 ( 0%) usr 0.03 ( 0%) sys 1.87 ( 0%) wall 37684 kB ( 3%) ggc inline heuristics : 47.34 ( 9%) usr 0.04 ( 0%) sys 47.42 ( 9%) wall 21988 kB ( 2%) ggc tree CFG construction : 0.00 ( 0%) usr 0.00 ( 0%) sys 0.01 ( 0%) wall 1 kB ( 0%) ggc callgraph verifier: 0.45 ( 0%) usr 0.05 ( 0%) sys 0.55 ( 0%) wall 0 kB ( 0%) ggc varconst : 0.00 ( 0%) usr 0.03 ( 0%) sys 0.03 ( 0%) wall 0 kB ( 0%) ggc unaccounted todo : 1.38 ( 0%) usr 0.00 ( 0%) sys 1.37 ( 0%) wall 0 kB ( 0%) ggc TOTAL : 531.6610.05 542.31 1405930 kB and peak memory use 2688637k - 2136908k. So 50% GGC memory (we need another about 4G for non-GGC memory, probaly largely in mmap pool) and
Re: better wpa [1/n]: merge types during read-in
On Fri, Apr 22, 2011 at 1:58 PM, Jan Hubicka hubi...@ucw.cz wrote: Hi, I run the patch on Mozilla. W/o the patch it is: Execution times (seconds) garbage collection : 20.19 ( 3%) usr 0.02 ( 0%) sys 20.22 ( 3%) wall 0 kB ( 0%) ggc callgraph optimization: 3.53 ( 1%) usr 0.01 ( 0%) sys 3.53 ( 1%) wall 15248 kB ( 1%) ggc varpool construction : 0.77 ( 0%) usr 0.02 ( 0%) sys 0.80 ( 0%) wall 51607 kB ( 4%) ggc ipa cp : 2.12 ( 0%) usr 0.10 ( 1%) sys 2.23 ( 0%) wall 119701 kB (10%) ggc ipa lto gimple in : 0.07 ( 0%) usr 0.02 ( 0%) sys 0.07 ( 0%) wall 0 kB ( 0%) ggc ipa lto gimple out : 11.63 ( 2%) usr 1.01 ( 8%) sys 12.63 ( 2%) wall 0 kB ( 0%) ggc ipa lto decl in : 182.15 (28%) usr 4.06 (32%) sys 188.10 (28%) wall 392863 kB (31%) ggc ipa lto decl out : 149.86 (23%) usr 0.32 ( 3%) sys 150.25 (22%) wall 0 kB ( 0%) ggc ipa lto decl init I/O : 0.14 ( 0%) usr 0.03 ( 0%) sys 0.16 ( 0%) wall 31 kB ( 0%) ggc ipa lto cgraph I/O : 2.09 ( 0%) usr 0.27 ( 2%) sys 2.37 ( 0%) wall 428623 kB (34%) ggc ipa lto decl merge : 219.70 (33%) usr 1.93 (15%) sys 221.75 (33%) wall 162687 kB (13%) ggc ipa lto cgraph merge : 2.68 ( 0%) usr 0.00 ( 0%) sys 2.69 ( 0%) wall 15895 kB ( 1%) ggc whopr wpa : 1.65 ( 0%) usr 0.04 ( 0%) sys 1.71 ( 0%) wall 1 kB ( 0%) ggc whopr wpa I/O : 2.20 ( 0%) usr 4.55 (36%) sys 7.20 ( 1%) wall 0 kB ( 0%) ggc ipa reference : 4.12 ( 1%) usr 0.00 ( 0%) sys 4.09 ( 1%) wall 0 kB ( 0%) ggc ipa profile : 0.18 ( 0%) usr 0.00 ( 0%) sys 0.17 ( 0%) wall 0 kB ( 0%) ggc ipa pure const : 3.15 ( 0%) usr 0.04 ( 0%) sys 3.19 ( 0%) wall 0 kB ( 0%) ggc parser : 1.56 ( 0%) usr 0.00 ( 0%) sys 1.56 ( 0%) wall 37684 kB ( 3%) ggc inline heuristics : 47.26 ( 7%) usr 0.05 ( 0%) sys 47.33 ( 7%) wall 21988 kB ( 2%) ggc callgraph verifier : 0.42 ( 0%) usr 0.04 ( 0%) sys 0.47 ( 0%) wall 0 kB ( 0%) ggc varconst : 0.02 ( 0%) usr 0.01 ( 0%) sys 0.06 ( 0%) wall 0 kB ( 0%) ggc unaccounted todo : 1.19 ( 0%) usr 0.00 ( 0%) sys 1.17 ( 0%) wall 0 kB ( 0%) ggc TOTAL : 657.07 12.64 672.26 1247550 kB note that total GGC use seems obviously wrong. The peak GGC report reads: {GC 4079042k - 4043085k} with the patch Execution times (seconds) garbage collection : 13.85 ( 3%) usr 0.02 ( 0%) sys 13.88 ( 3%) wall 0 kB ( 0%) ggc callgraph optimization: 2.40 ( 0%) usr 0.00 ( 0%) sys 2.40 ( 0%) wall 15248 kB ( 1%) ggc varpool construction : 0.69 ( 0%) usr 0.03 ( 0%) sys 0.71 ( 0%) wall 51621 kB ( 4%) ggc ipa cp : 1.86 ( 0%) usr 0.11 ( 1%) sys 1.97 ( 0%) wall 119697 kB ( 9%) ggc ipa lto gimple in : 0.04 ( 0%) usr 0.02 ( 0%) sys 0.06 ( 0%) wall 0 kB ( 0%) ggc ipa lto gimple out : 11.86 ( 2%) usr 0.92 ( 9%) sys 12.80 ( 2%) wall 0 kB ( 0%) ggc ipa lto decl in : 287.52 (54%) usr 3.49 (35%) sys 291.13 (54%) wall 713694 kB (51%) ggc ipa lto decl out : 127.76 (24%) usr 0.94 ( 9%) sys 128.79 (24%) wall 0 kB ( 0%) ggc ipa lto decl init I/O : 0.13 ( 0%) usr 0.02 ( 0%) sys 0.15 ( 0%) wall 31 kB ( 0%) ggc ipa lto cgraph I/O : 1.66 ( 0%) usr 0.29 ( 3%) sys 1.94 ( 0%) wall 428623 kB (30%) ggc ipa lto decl merge : 18.12 ( 3%) usr 0.13 ( 1%) sys 18.26 ( 3%) wall 978 kB ( 0%) ggc ipa lto cgraph merge : 1.90 ( 0%) usr 0.00 ( 0%) sys 1.91 ( 0%) wall 15143 kB ( 1%) ggc whopr wpa : 1.99 ( 0%) usr 0.05 ( 0%) sys 2.01 ( 0%) wall 1 kB ( 0%) ggc whopr wpa I/O : 2.40 ( 0%) usr 3.77 (38%) sys 6.47 ( 1%) wall 0 kB ( 0%) ggc ipa reference : 4.56 ( 1%) usr 0.00 ( 0%) sys 4.58 ( 1%) wall 0 kB ( 0%) ggc ipa profile : 0.16 ( 0%) usr 0.00 ( 0%) sys 0.15 ( 0%) wall 0 kB ( 0%) ggc ipa pure const : 3.33 ( 1%) usr 0.03 ( 0%) sys 3.36 ( 1%) wall 0 kB ( 0%) ggc parser : 1.85 ( 0%) usr 0.03 ( 0%) sys 1.87 ( 0%) wall 37684 kB ( 3%) ggc inline heuristics : 47.34 ( 9%) usr 0.04 ( 0%) sys 47.42 ( 9%) wall 21988 kB ( 2%) ggc tree CFG construction : 0.00 ( 0%) usr 0.00 ( 0%) sys 0.01 ( 0%) wall 1 kB ( 0%) ggc callgraph verifier : 0.45 ( 0%) usr 0.05 ( 0%) sys 0.55 ( 0%) wall 0 kB ( 0%) ggc varconst : 0.00 ( 0%) usr 0.03 ( 0%) sys 0.03 ( 0%) wall 0 kB ( 0%) ggc unaccounted todo : 1.38 ( 0%) usr 0.00 ( 0%) sys 1.37 ( 0%) wall 0 kB ( 0%) ggc TOTAL : 531.66 10.05
Re: better wpa [1/n]: merge types during read-in
Yes, that's very likely. If we'd get around to re-do the LTO option saving code we might want to forbid -g0 compile and -g link (dropping -g at link time as soon as we see a single module compiled with -g0). Then we can free some more stuff, at least with -g0 - though I'm not sure -g0 matters in practice. Don't know if it matters to care about it terribly much, but definitely default mozilla builds are -g0. I assume that now we want to be more selective about what to save for -g and also try to push more stuff to bypass WPA. I don't see need for all the type merging for stuff that is only referred from debug info. It will be fun to get this done however. (either by early debug info output that goes to extra section and is distributed to ltrans same way as we distribute bodies or by some other means) You had the tree code memory use statistics patch. If you send me pointer or commit it, I can generate tree code statistics. Honza Maybe we can shift numbers back from ipa lto decl in to ipa lto decl merge by some timevar adjustments? Richard. Honza
Re: better wpa [1/n]: merge types during read-in
Yes, that's very likely. If we'd get around to re-do the LTO option saving code we might want to forbid -g0 compile and -g link (dropping -g at link time as soon as we see a single module compiled with -g0). Then we can free some more stuff, at least with -g0 - though I'm not sure -g0 matters in practice. Don't know if it matters to care about it terribly much, but definitely default mozilla builds are -g0. I assume that now we want to be more selective about what to save for -g and also try to push more stuff to bypass WPA. I don't see need for all the type merging for stuff that is only referred from debug info. It will be fun to get this done however. (either by early debug info output that goes to extra section and is distributed to ltrans same way as we distribute bodies or by some other means) But overall 2GB of IL for Mozilla is not that bad. Ohter compilers consume easilly more. However we need to figure out where the other 4GB is going (mmap pool is obviously just part of it, probably hashtables consume good part of it, too) and do something about those 10 minutes of hashtable lookups needed to read and write out data. Problem is amount of data we produce for ltrans, that is partly because we ship debug info into units that won't output it. I guess I should look into perf and try to produce low overhead profile with backtraces so we know what hashtables are worse than others. Though we probably won't see there much surprises. Honza
Re: better wpa [1/n]: merge types during read-in
Yes, that's very likely. If we'd get around to re-do the LTO option saving code we might want to forbid -g0 compile and -g link (dropping -g at link time as soon as we see a single module compiled with -g0). Then we can free some more stuff, at least with -g0 - though I'm not sure -g0 matters in practice. I would a lot preffer if dwaf2out was able to do the right thing here: output debug info for things that came with debug info in (i.e. was originally compiled with -g) and not for thing that did not. Traditionally it was not neccesary to set -g or -g0 at linktime and we should just get to the same behaviour, perhaps with option to overwrite to -g0 explicitely. Honza
Re: better wpa [1/n]: merge types during read-in
Hi, On Wed, 20 Apr 2011, Michael Matz wrote: It would have been nice to have the top-level tree merging as a separate patch, as I am not convinced it is correct, but see below ... I'll split it out. Like so (also including the other remarks). Regstrapping on x86_64-linux in progress. Ciao, Michael. * lto-streamer.c (lto_streamer_cache_insert_1): Accept to override other trees that just builtins. (lto_record_common_node): Don't leave NULL TYPE_CANONICAL. lto/ * lto.c (toplevel): Include tree-flow.h. (lto_read_in_decl_state): Don't merge types here. (tree_with_vars): New static hash table. (remember_with_vars): New static functions. (LTO_FIXUP_TYPE): New macro. (lto_ft_common, lto_ft_decl_minimal, lto_ft_decl_common, lto_ft_decl_with_vis, lto_ft_decl_non_common, lto_ft_function, lto_ft_field_decl, lto_ft_type, lto_ft_binfo, lto_ft_constructor, lto_ft_expr, lto_fixup_types, uniquify_nodes): New static functions. (lto_read_decls): Uniquify while reading in trees. (lto_fixup_data_t, LTO_FIXUP_SUBTREE, LTO_REGISTER_TYPE_AND_FIXUP_SUBTREE, no_fixup_p, lto_fixup_common, lto_fixup_decl_minimal, lto_fixup_decl_common, lto_fixup_decl_with_vis, lto_fixup_decl_non_common, lto_fixup_function, lto_fixup_field_decl, lto_fixup_type, lto_fixup_binfo, lto_fixup_constructor, lto_fixup_tree): Remove. (lto_fixup_state): Remove data argument. Use lto_symtab_prevailing_decl. (LTO_SET_PREVAIL, LTO_NO_PREVAIL): New macros. (lto_fixup_prevailing_decls): New function. (lto_fixup_state_aux): Argument aux is unused. (lto_fixup_decls): Don't allocate pointer sets, don't use lto_fixup_tree, use lto_fixup_prevailing_decls. (read_cgraph_and_symbols): Allocate and remove tree_with_vars. * Make-lang.in (lto/lto.o): Depend on $(TREE_FLOW_H). Index: lto-streamer.c === *** lto-streamer.c (revision 172769) --- lto-streamer.c (working copy) *** lto_streamer_cache_insert_1 (struct lto_ *** 383,401 { /* If the caller wants to insert T at a specific slot location, and ENTRY-TO does not match *IX_P, add T to !the requested location slot. This situation arises when !streaming builtin functions. ! !For instance, on the writer side we could have two !FUNCTION_DECLS T1 and T2 that are represented by the same !builtin function. The reader will only instantiate the !canonical builtin, but since T1 and T2 had been !originally stored in different cache slots (S1 and S2), !the reader must be able to find the canonical builtin !function at slots S1 and S2. */ ! gcc_assert (lto_stream_as_builtin_p (t)); ix = *ix_p; - lto_streamer_cache_add_to_node_array (cache, ix, t); } --- 383,390 { /* If the caller wants to insert T at a specific slot location, and ENTRY-TO does not match *IX_P, add T to !the requested location slot. */ ix = *ix_p; lto_streamer_cache_add_to_node_array (cache, ix, t); } *** lto_record_common_node (tree *nodep, VEC *** 513,518 --- 502,509 TYPE_CANONICAL (node) = NULL_TREE; node = gimple_register_type (node); TYPE_CANONICAL (node) = gimple_register_canonical_type (node); + if (in_lto_p) + TYPE_CANONICAL (*nodep) = TYPE_CANONICAL (node); *nodep = node; } Index: lto/lto.c === *** lto/lto.c (revision 172769) --- lto/lto.c (working copy) *** along with GCC; see the file COPYING3. *** 24,29 --- 24,30 #include opts.h #include toplev.h #include tree.h + #include tree-flow.h #include diagnostic-core.h #include tm.h #include cgraph.h *** lto_read_in_decl_state (struct data_in * *** 215,228 tree *decls = ggc_alloc_vec_tree (size); for (j = 0; j size; j++) ! { ! decls[j] = lto_streamer_cache_get (data_in-reader_cache, data[j]); ! ! /* Register every type in the global type table. If the !type existed already, use the existing type. */ ! if (TYPE_P (decls[j])) ! decls[j] = gimple_register_type (decls[j]); ! } state-streams[i].size = size; state-streams[i].trees = decls; --- 216,222 tree *decls = ggc_alloc_vec_tree (size); for (j = 0; j size; j++) ! decls[j] = lto_streamer_cache_get (data_in-reader_cache, data[j]); state-streams[i].size = size; state-streams[i].trees = decls; *** lto_read_in_decl_state (struct data_in *
Re: better wpa [1/n]: merge types during read-in
On Thu, Apr 21, 2011 at 3:46 PM, Michael Matz m...@suse.de wrote: Hi, On Wed, 20 Apr 2011, Michael Matz wrote: It would have been nice to have the top-level tree merging as a separate patch, as I am not convinced it is correct, but see below ... I'll split it out. Like so (also including the other remarks). Regstrapping on x86_64-linux in progress. Ok if it passed. Thanks, Richard. Ciao, Michael. * lto-streamer.c (lto_streamer_cache_insert_1): Accept to override other trees that just builtins. (lto_record_common_node): Don't leave NULL TYPE_CANONICAL. lto/ * lto.c (toplevel): Include tree-flow.h. (lto_read_in_decl_state): Don't merge types here. (tree_with_vars): New static hash table. (remember_with_vars): New static functions. (LTO_FIXUP_TYPE): New macro. (lto_ft_common, lto_ft_decl_minimal, lto_ft_decl_common, lto_ft_decl_with_vis, lto_ft_decl_non_common, lto_ft_function, lto_ft_field_decl, lto_ft_type, lto_ft_binfo, lto_ft_constructor, lto_ft_expr, lto_fixup_types, uniquify_nodes): New static functions. (lto_read_decls): Uniquify while reading in trees. (lto_fixup_data_t, LTO_FIXUP_SUBTREE, LTO_REGISTER_TYPE_AND_FIXUP_SUBTREE, no_fixup_p, lto_fixup_common, lto_fixup_decl_minimal, lto_fixup_decl_common, lto_fixup_decl_with_vis, lto_fixup_decl_non_common, lto_fixup_function, lto_fixup_field_decl, lto_fixup_type, lto_fixup_binfo, lto_fixup_constructor, lto_fixup_tree): Remove. (lto_fixup_state): Remove data argument. Use lto_symtab_prevailing_decl. (LTO_SET_PREVAIL, LTO_NO_PREVAIL): New macros. (lto_fixup_prevailing_decls): New function. (lto_fixup_state_aux): Argument aux is unused. (lto_fixup_decls): Don't allocate pointer sets, don't use lto_fixup_tree, use lto_fixup_prevailing_decls. (read_cgraph_and_symbols): Allocate and remove tree_with_vars. * Make-lang.in (lto/lto.o): Depend on $(TREE_FLOW_H). Index: lto-streamer.c === *** lto-streamer.c (revision 172769) --- lto-streamer.c (working copy) *** lto_streamer_cache_insert_1 (struct lto_ *** 383,401 { /* If the caller wants to insert T at a specific slot location, and ENTRY-TO does not match *IX_P, add T to ! the requested location slot. This situation arises when ! streaming builtin functions. ! ! For instance, on the writer side we could have two ! FUNCTION_DECLS T1 and T2 that are represented by the same ! builtin function. The reader will only instantiate the ! canonical builtin, but since T1 and T2 had been ! originally stored in different cache slots (S1 and S2), ! the reader must be able to find the canonical builtin ! function at slots S1 and S2. */ ! gcc_assert (lto_stream_as_builtin_p (t)); ix = *ix_p; - lto_streamer_cache_add_to_node_array (cache, ix, t); } --- 383,390 { /* If the caller wants to insert T at a specific slot location, and ENTRY-TO does not match *IX_P, add T to ! the requested location slot. */ ix = *ix_p; lto_streamer_cache_add_to_node_array (cache, ix, t); } *** lto_record_common_node (tree *nodep, VEC *** 513,518 --- 502,509 TYPE_CANONICAL (node) = NULL_TREE; node = gimple_register_type (node); TYPE_CANONICAL (node) = gimple_register_canonical_type (node); + if (in_lto_p) + TYPE_CANONICAL (*nodep) = TYPE_CANONICAL (node); *nodep = node; } Index: lto/lto.c === *** lto/lto.c (revision 172769) --- lto/lto.c (working copy) *** along with GCC; see the file COPYING3. *** 24,29 --- 24,30 #include opts.h #include toplev.h #include tree.h + #include tree-flow.h #include diagnostic-core.h #include tm.h #include cgraph.h *** lto_read_in_decl_state (struct data_in * *** 215,228 tree *decls = ggc_alloc_vec_tree (size); for (j = 0; j size; j++) ! { ! decls[j] = lto_streamer_cache_get (data_in-reader_cache, data[j]); ! ! /* Register every type in the global type table. If the ! type existed already, use the existing type. */ ! if (TYPE_P (decls[j])) ! decls[j] = gimple_register_type (decls[j]); ! } state-streams[i].size = size; state-streams[i].trees = decls; --- 216,222 tree *decls = ggc_alloc_vec_tree (size); for (j = 0; j size; j++) ! decls[j] = lto_streamer_cache_get (data_in-reader_cache, data[j]);
Re: better wpa [1/n]: merge types during read-in
Hi, On Wed, 20 Apr 2011, Richard Guenther wrote: + /* A hashtable of trees that potentially refer to variables or functions + that must be replaced with their prevailing variant. */ + static GTY((if_marked (ggc_marked_p), param_is (union tree_node))) htab_t + tree_with_vars; + /* A hashtable of top-level trees that can potentially be merged with trees + from other compilation units. */ It would have been nice to have the top-level tree merging as a separate patch, as I am not convinced it is correct, but see below ... I'll split it out. + { + tree *t2 = (tree *) htab_find_slot (top_decls, t, INSERT); + if (*t2) + t = *t2; + else + *t2 = t; I don't think we can share TREE_LISTs. Where do they hang off when you do this? Are not all but one of those dead? All TREE_LIST manipulation routines I know of do not even think about the possibility of shared lists. Well, it clearly works for the use in global trees :-) The problem I solved with this is, that all trees are stored in the reader cache. If (say) two function types, or enum types are merged, one of their TYPE_ARG_TYPES or TYPE_VALUES lists should be dead. But as they're still referred from the cache they aren't collected. I could remove them from the cache in a similar way to how I deal with the FIELD_DECLs. Or I could store tree_list nodes not in the cache at all. The latter seems more worthwhile to me, so I'm going to try this. + /* Fix up fields of a field_decl T. */ + + static void + lto_ft_field_decl (tree t) + { + lto_ft_decl_common (t); + LTO_FIXUP_TYPE (DECL_FIELD_OFFSET (t)); + LTO_FIXUP_TYPE (DECL_BIT_FIELD_TYPE (t)); + LTO_FIXUP_TYPE (DECL_QUALIFIER (t)); here (and earlier) we had some no_fixup_p asserts. What happened to them? no_fixup_p checked only that the tree wasn't a type or a var/function_decl. In particular, don't we need to fixup the DECL_FIELD_BIT_OFFSET integer-cst? But I was indeed too eager to remove those lines, I should still deal with DECL_SECTION_NAME, DECL_FIELD_BIT_OFFSET and BINFO_OFFSET. Thanks for catching this. + + /* First fixup the fields of T. */ + lto_fixup_types (t); + + /* Now try to find a canonical variant of T itself. */ + if (TYPE_P (t)) + { If t is a type, why fix up its field if it may not be the canonical variant? Because type merging to work sometimes requires already canonicalized fields, at least that's what I found in investigating why some types weren't merged that should have been. Hence I'm first canonicalizing all fields of everything and then see if something merged. + if (t == oldt + TYPE_MAIN_VARIANT (t) != t) + { + /* If this is its own type, link it into the variant + chain. */ + TYPE_NEXT_VARIANT (t) = TYPE_NEXT_VARIANT (TYPE_MAIN_VARIANT (t)); + TYPE_NEXT_VARIANT (TYPE_MAIN_VARIANT (t)) = t; Hmm - isn't this taken care of in lto_fixup_types? Nope. I've taken it out of the (old) lto_fixup_type, because (now that I've removed the seen pointer_set) it can be called for the same type multiple times, which would lead to it being included multiple times in the NEXT_VARIANT chain. I found it more clear to do this kind of list manipulation at the place where we're sure to see every type only once. + for (f1 = TYPE_FIELDS (t), f2 = TYPE_FIELDS (oldt); + f1 f2; f1 = TREE_CHAIN (f1), f2 = TREE_CHAIN (f2)) + if (TREE_CODE (f1) == FIELD_DECL + f1 != f2 This should always be true unless TYPE_FIELDS (t) == TYPE_FIELDS (oldt) Think shared field_decl chains. I'll have fixed up the chain for one of the type pairs already and can later come to a type referring exactly the same field_decls again. + lto_streamer_cache_lookup (cache, f2, ix)) This lookup should always succeed, no? Yes. I'll assert this. + { + gcc_assert (DECL_NAME (f1) == DECL_NAME (f2)); + /* If we're going to replace an element which we'd + still visit in the next iterations, we wouldn't + handle it, so do it here. */ + if (ix i) + lto_fixup_types (f2); ? But it's dead, no? That is true, but it can refer to integer_cst which I can only fix up by walking the fields. If I don't do that, then even though the field_decl will not be in the cache anymore, its integer_cst (and their non-fixed up types) will stay uncollected. + lto_streamer_cache_insert_at (cache, f1, ix); + } Btw, nice. Does this get rid of the need for the field-decl fixup code in input_gimple_stmt? Hmm, it's gross but seems to me still required for the
Re: better wpa [1/n]: merge types during read-in
Hi, On Wed, 20 Apr 2011, Richard Guenther wrote: If t is a type, why fix up its field if it may not be the canonical variant? Because type merging to work sometimes requires already canonicalized fields, at least that's what I found in investigating why some types weren't merged that should have been. Hence I'm first canonicalizing all fields of everything and then see if something merged. That sounds like a bug in type-merging. You don't happen to have a small testcase? ;) cc1 was my testcase :-/ Think shared field_decl chains. I'll have fixed up the chain for one of the type pairs already and can later come to a type referring exactly the same field_decls again. But only in case the first one is already equal. What I wanted to say is that we shouldn't have partially shared chains, so if (TYPE_FIELDS (t) != TYPE_FIELDS (oldt)) for (...) if (TREE_CODE (f1) == FIELD_DECL) ... should be enough, no? Indeed. In fact, why restrict fixing up the cache to FIELD_DECLs and not also do it for TYPE_DECLs or FUNCTION_DECLs that may reside in this chain? non-FIELD_DECLs are removed in free_lang_data. But even more so I can remove the test for FIELD_DECL. Hmm, it's gross but seems to me still required for the diagnostic and to emit the VIEW_CONVERT_EXPR, at least for invalid input code. OTOH if the streamed out code ensures that a field_decl in a component_ref always is included in its DECL_CONTEXT, then the new merging should indeed make sure that this also holds after streaming in. Do we have testcases specifically trying this code? greping for mismatching in testsuite/ doesn't show anything relevant. The lto testsuite harness doesn't support dg-error/warning, so there are no testcases. There are testcases that ICEd (type verification) before introducing these fixups though. Okay, I'll leave investigating this to a follow up. Ciao, Michael.
better wpa [1/n]: merge types during read-in
Hi, I have a backlog of random improvements to the WPA phase of LTO compilations, all with the common goal of reducing peak memory usage. I was basically dumping all trees that the WPA phase read in, and then tried to think about which trees can be merged with already existing ones very early or alternatively which shouldn't have been in the global section from the beginning. I.e. whenever there were two trees left after reading in all units, that IMO should have been the same, I either tried to merge them, or find reasons why they in fact were not the same. This is the first part, that in essence merely moves the merging of types somewhat earlier. Currently all non-local trees of all object files are read in, then symbols are merged, then types and prevailing decls are set. That's the maximum of peak memory usage you can get, because during read-in virtually no trees will be garbage collected. There's much one can merge already during read-in, types are the easiest thing, some obvious top-level trees are in the same league (integer and string constants, tree_lists used to collect arguments, field_decls of the just replaced types). Later patches will extend this set. This current one will deal just with the mentioned entities. One nice property of our reader is, that all trees are nicely collected in a sequential array (the reader cache). This implies that we don't have to use walk_tree to get to all reachable trees, we merely have to iterate through that array and are sure we'll catch everything (except some builtin trees). We also don't have to remember which trees we've already visited, we'll naturally visit each one just once. The current callbacks of walk_tree are used for two purposes: merging types and setting prevailing decls. The latter can only be done after the prevailing decls are known, and that is only when we've seen all object files (even in the case of the linker plugin). Hence, while uniquifying the read trees I'm also remembering those that (potentially) refer to variables or functions in a hashtable, which conveniently has support in the garbage collector to remove unreachable trees (so, we'll later only iterate over those trees that are known to be reachable). I'm using that hash at the place where currently walk_tree is used to iterate over only those trees that potentially refer to a var/function_decl. The type merger is in essence a moved and slightly changed copy of the original walk_tree callback routines. Via some asserts I've verified that I'm replacing all reachable vars or functions with their prevailing decl, before removing the old callbacks (the same cannot be done with the merged types, due to ordering issues type merging depends on other types already being merged, or, worse, some decls being merged; that's already currently the case, rerunning the type merger will produce more merged types). I'm starting to merge tree as soon as a top-level invocation of lto_input_tree is finished. That is the earliest point at which all recursively reachable trees are known and read in. It's better to do it at that place than to defer it until the whole unit is read in because so the next invocations of input_tree (possibly referring back to already existing trees in the cache) will make use of the already merged variants. Some statistics for an LTO bootstrap, in this case only for the cc1 executable, only for the WPA phase, generated with a -O0 compiled lto1 (the cc1 object are from stage3 of a normal lto bootstrap). Without the patch: Merging declarations {GC 320745k - 204020k}Reading summaries ipa lto gimple out: 6.08 ( 7%) usr 0.14 ( 9%) sys 6.23 ( 7%) wall 0 kB ( 0%) ggc ipa lto decl in : 21.24 (23%) usr 0.38 (24%) sys 21.65 (23%) wall 311556 kB (65%) ggc ipa lto decl out : 17.89 (19%) usr 0.01 ( 1%) sys 17.91 (19%) wall 0 kB ( 0%) ggc ipa lto decl merge: 12.09 (13%) usr 0.13 ( 8%) sys 12.22 (13%) wall 7704 kB ( 2%) ggc inline heuristics : 21.05 (23%) usr 0.00 ( 0%) sys 21.07 (22%) wall 28972 kB ( 6%) ggc TOTAL : 92.99 1.5894.68 479817 kB I.e. right after reading in all units we have 320MB of ggc memory. With the patch: Merging declarations {GC 193371k - 176876k}Reading summaries ipa lto gimple out: 6.63 ( 8%) usr 0.19 ( 9%) sys 11.13 (12%) wall 0 kB ( 0%) ggc ipa lto decl in : 27.09 (32%) usr 0.30 (14%) sys 27.79 (30%) wall 318044 kB (66%) ggc ipa lto decl out : 13.96 (17%) usr 0.02 ( 1%) sys 13.99 (15%) wall 0 kB ( 0%) ggc ipa lto decl merge: 0.32 ( 0%) usr 0.00 ( 0%) sys 0.33 ( 0%) wall 5 kB ( 0%) ggc inline heuristics : 21.04 (25%) usr 0.03 ( 1%) sys 21.10 (23%) wall 28972 kB ( 6%) ggc TOTAL : 84.10 2.1791.78 478594 kB So, about 193MB of ggc memory after
Re: better wpa [1/n]: merge types during read-in
Hi, I have a backlog of random improvements to the WPA phase of LTO compilations, all with the common goal of reducing peak memory usage. I was basically dumping all trees that the WPA phase read in, and then tried to think about which trees can be merged with already existing ones very early or alternatively which shouldn't have been in the global section from the beginning. I.e. whenever there were two trees left after reading in all units, that IMO should have been the same, I either tried to merge them, or find reasons why they in fact were not the same. Thanks, this hard work was on my TODO list for a while. I am very heappy you beat me on this! I will give a try your patch on Mozilla, I think the reuslts should be a lot better than on GCC. There's much one can merge already during read-in, types are the easiest thing, some obvious top-level trees are in the same league (integer and string constants, tree_lists used to collect arguments, field_decls of the just replaced types). Later patches will extend this set. This current one will deal just with the mentioned entities. For DECls, obviously the IPA passes will be in the way. You mentioned the code that just reads things in and applies to same cgraph nodes. This will break in non-trivial situation. I however think that all we need is a way to read in a DECL and or cgraph ID from stream file and get the corresponding (merged) DECL or cgraph plus information if the merged DECL is one the ID originally referred to. I guess all the existing stream in code for cgraph IPA passes can be quite easilly restructured to ignore or merge the data on prevailed nodes during read. In fact I planned this design originally. I also planned to organize the summaries into per-function sections, instead of per-pass sections, so one can read in only data of those that won in the merging battle. Obviously the implementation went other way early in LTO branch design. One nice property of our reader is, that all trees are nicely collected in a sequential array (the reader cache). This implies that we don't have to use walk_tree to get to all reachable trees, we merely have to iterate through that array and are sure we'll catch everything (except some builtin trees). We also don't have to remember which trees we've already visited, we'll naturally visit each one just once. The current callbacks of walk_tree are used for two purposes: merging types and setting prevailing decls. The latter can only be done after the prevailing decls are known, and that is only when we've seen all object files (even in the case of the linker plugin). Hence, while uniquifying the read trees I'm also remembering those that (potentially) refer to variables or functions in a hashtable, which conveniently has support in the garbage collector to remove unreachable trees (so, we'll later only iterate over those trees that are known to be reachable). I'm using that hash at the place where currently walk_tree is used to iterate over only those trees that potentially refer to a var/function_decl. The type merger is in essence a moved and slightly changed copy of the original walk_tree callback routines. Via some asserts I've verified that I'm replacing all reachable vars or functions with their prevailing decl, before removing the old callbacks (the same cannot be done with the merged types, due to ordering issues type merging depends on other types already being merged, or, worse, some decls being merged; that's already currently the case, rerunning the type merger will produce more merged types). Yeah, this also sounds all good. I wondered why the code is not organized this way and instead keeps walking everything recursively. Merging declarations {GC 320745k - 204020k}Reading summaries ipa lto gimple out: 6.08 ( 7%) usr 0.14 ( 9%) sys 6.23 ( 7%) wall 0 kB ( 0%) ggc ipa lto decl in : 21.24 (23%) usr 0.38 (24%) sys 21.65 (23%) wall 311556 kB (65%) ggc ipa lto decl out : 17.89 (19%) usr 0.01 ( 1%) sys 17.91 (19%) wall 0 kB ( 0%) ggc ipa lto decl merge: 12.09 (13%) usr 0.13 ( 8%) sys 12.22 (13%) wall 7704 kB ( 2%) ggc inline heuristics : 21.05 (23%) usr 0.00 ( 0%) sys 21.07 (22%) wall 28972 kB ( 6%) ggc TOTAL : 92.99 1.5894.68 479817 kB I.e. right after reading in all units we have 320MB of ggc memory. With the patch: Merging declarations {GC 193371k - 176876k}Reading summaries ipa lto gimple out: 6.63 ( 8%) usr 0.19 ( 9%) sys 11.13 (12%) wall 0 kB ( 0%) ggc ipa lto decl in : 27.09 (32%) usr 0.30 (14%) sys 27.79 (30%) wall 318044 kB (66%) ggc ipa lto decl out : 13.96 (17%) usr 0.02 ( 1%) sys 13.99 (15%) wall 0 kB ( 0%) ggc ipa lto decl merge: 0.32 ( 0%) usr 0.00 ( 0%) sys 0.33 ( 0%) wall 5 kB (