Re: DECL_STRUCT_FUNCTION(cgraph_node->decl) == NULL when attempting Link-Time Optimization in plugin
On Mon, Oct 8, 2012 at 11:29 PM, David Malcolm dmalc...@redhat.com wrote: On Mon, 2012-10-08 at 18:21 +0200, Richard Guenther wrote: On Mon, Oct 8, 2012 at 5:17 PM, David Malcolm dmalc...@redhat.com wrote:

I'm working on a static analysis extension to GCC via my gcc-python-plugin [1]. The analysis is interprocedural (memory leak detection, as it happens). I have it working on one translation unit at a time, and I'm attempting to get it to work within Link Time Optimization so that it can see the interactions of all of the functions within a program or library, but I'm running into what might be a bug (or I could be misunderstanding LTO). I have my plugin working within lto1, using -flto and -fplugin; my command line looks like:

$ gcc -fPIC -shared -flto -g -fplugin=/home/david/coding/gcc-python/gcc-python/sm/python.so -fplugin-arg-python-script=examples/show-lto-supergraph.py tests/sm/lto/input-*.c --save-temps

lto1 appears to be invoked twice by gcc. The first time, there are two passes:

  IPA_PASS named "whole-program"
  IPA_PASS named "inline"

and in_lto_p is true. The second invocation has these passes, called per-function:

  GIMPLE_PASS *free_cfg_annotations
  GIMPLE_PASS cplxlower0
  GIMPLE_PASS optimized
  RTL_PASS expand
  GIMPLE_PASS *rest_of_compilation
  RTL_PASS *init_function
  ...etc., the normal rest of compilation from the RTL stage onwards

and in_lto_p is false. I've attempted to wire in my interprocedural analysis as a new IPA pass before or after "whole-program" and "inline" within the first invocation of lto1. In each case it is able to walk the callgraph, and the callgraph it sees appears to be correct: walking the linked list of cgraph_nodes, the various cgraph_edge and cgraph_node instances are as expected, and for each cgraph_node, cgraph_node->decl is a non-NULL tree instance of type FUNCTION_DECL (as expected); I'm able to generate a graphviz rendering of the whole-program callgraph within lto1 from my plugin.
However, for every cgraph_node, DECL_STRUCT_FUNCTION(cgraph_node->decl) is (struct function *)NULL, which thwarts my code's attempt to look up the CFG of each underlying function and work on the underlying gimple. Looking with eu-readelf at the .o files shows these lto sections are present (f is the name of one of the functions whose gimple CFG I'd like to access):

$ eu-readelf -a input-f.o|grep lto
[ 5] .gnu.lto_.inline.c8904cb9a96e7417 PROGBITS 006c 0026 0 E 0 0 1
[ 6] .gnu.lto_f.c8904cb9a96e7417 PROGBITS 0092 0164 0 E 0 0 1
[ 7] .gnu.lto_.cgraph.c8904cb9a96e7417 PROGBITS 01f6 0032 0 E 0 0 1
[ 8] .gnu.lto_.vars.c8904cb9a96e7417 PROGBITS 0228 0012 0 E 0 0 1
[ 9] .gnu.lto_.refs.c8904cb9a96e7417 PROGBITS 023a 0013 0 E 0 0 1
[10] .gnu.lto_.statics.c8904cb9a96e7417 PROGBITS 024d 0014 0 E 0 0 1
[11] .gnu.lto_.decls.c8904cb9a96e7417 PROGBITS 0261 01f3 0 E 0 0 1
[12] .gnu.lto_.symtab.c8904cb9a96e7417 PROGBITS 0454 0011 0 E 0 0 1
[13] .gnu.lto_.opts PROGBITS 0465 00e9 0 E 0 0 1
25: 0001 1 OBJECT GLOBAL DEFAULT COMMON __gnu_lto_v1
[2a] __gnu_lto_v1

Are the (struct function *) and CFG meant to exist at this stage, and be in gimple form, and is DECL_STRUCT_FUNCTION meant to be non-NULL for functions in translation units that were compiled with -flto? (Or have I misunderstood LTO?)

It depends: if you are in the WPA stage (lto1 was invoked with -fwpa), no function bodies and thus no CFGs are available at all; if you are in the LTRANS stage (lto1 was invoked with -fltrans), it sees _part_ of the whole program's callgraph and function bodies (and thus CFGs). As a workaround you can probably make your example work by using -flto-partition=none, which merges the WPA and LTRANS stages and pulls the whole program into a single link-time TU.
Many thanks: -fwpa was indeed being passed by gcc to lto1, and on adding -flto-partition=none to the gcc command line -fwpa goes away, and my plugin is able to see all of the gimple CFG from all of the .o files (and thus generate pretty graphviz renderings of the supergraph of all CFGs etc). FWIW the -fwpa behavior was surprising to me, and felt like something of a gotcha. Based on my reading of http://gcc.gnu.org/onlinedocs/gccint/LTO-Overview.html#LTO-Overview I had assumed that LTO mode was the default when setting -flto, and that WHOPR mode was something that I had to add an extra option to enable (and thus I merely skimmed those parts of the docs), whereas it seems to be the other way around. In my defense
Re: patch to fix constant math
On Mon, Oct 8, 2012 at 9:55 PM, Richard Sandiford rdsandif...@googlemail.com wrote: Robert Dewar de...@adacore.com writes: On 10/8/2012 11:01 AM, Nathan Froyd wrote: - Original Message - Btw, as for Richard's idea of conditionally placing the length field in rtx_def: that looks like overkill to me. These days we'd merely want to optimize for 64-bit hosts, thus unconditionally adding a 32-bit field to rtx_def looks ok to me (you can wrap that inside a union to allow both descriptive names and eventual different use - see what I've done to tree_base).

IMHO, unconditionally adding that field isn't "optimizing for 64-bit hosts", but gratuitously making one of the major compiler data structures bigger on 32-bit hosts. Not everybody can cross-compile from a 64-bit host. And even those people who can don't necessarily want to. Please try to consider what's best for all the people who use GCC, not just the cases you happen to be working with every day.

I think that's reasonable in general, but as time goes on, and every $300 laptop is 64-bit capable, one should not go TOO far out of the way trying to make sure we can compile everything on a 32-bit machine.

It's not a 64-bit machine vs. a 32-bit machine. It's an LP64 ABI vs. an ILP32 ABI. HJ & co. have put considerable effort into developing the x32 ABI for x86_64 precisely because ILP32 is still useful for 64-bit machines. Just as it was for MIPS when SGI invented n32 (which is still useful now). I believe 64-bit SPARC has a similar thing, and no doubt other architectures do too. After all, there shouldn't be much need for more than 2GB of virtual address space in an AVR cross compiler. So why pay the cache penalty of 64-bit pointers and longs (GCC generally tries to avoid using "long" directly) when a 32-bit pointer will do? Many years ago, I moved the HOST_WIDE_INT fields out of rtunion and into the main rtx_def union because it produced a significant speed-up on n32 IRIX.
That was before tree-level optimisation, but I don't think we've really pruned that much RTL optimisation since then, so I'd be surprised if much has changed.

But of course that doesn't change the alignment requirement of that union, which is the issue on LP64 hosts. Also, as HOST_WIDE_INT is 64 bits for most ILP32 targets as well, it doesn't necessarily save ILP32 hosts from having this issue (unless long long isn't aligned to 8 bytes for them). So what I said is that arranging for the 32-bit hole to contain useful data for most RTXen should be possible. Richard.
Re: Small cleanup/memory leak plugs for lto
On Mon, Oct 8, 2012 at 10:05 PM, Tobias Burnus bur...@net-b.de wrote: Some more issues found by the Coverity scanner.

lto-cgraph.c: The code seems to be unused; besides, it's a zero-trip loop, as parm_num is set to 0 and then checked for nonzeroness.

lto-opts.c: The check whether first_p is non-NULL is always false: all calls have a variable reference as argument - and first_p is unconditionally dereferenced.

lto_obj_file_open: One could additionally check whether lo is NULL, but that would have to be directly after the XCNEW, as lto_file_init already dereferences lo.

static void
append_to_collect_gcc_options (struct obstack *ob, bool *first_p,
                               const char *opt)
{
  const char *p, *q = opt;
-  if (!first_p)
-    obstack_grow (ob, " ", 1);

Bogus change. It should be

  if (!*first_p)
    obstack_grow (ob, " ", 1);

The rest looks ok. Please re-test with the above change. Thanks, Richard.

Build and regtested on x86-64-gnu-linux. Tobias
Re: [lra] 3rd patch to speed more compilation of PR54146
On Tue, Oct 9, 2012 at 6:10 PM, Vladimir Makarov vmaka...@redhat.com wrote: On 10/08/2012 05:14 PM, Steven Bosscher wrote: On Mon, Oct 8, 2012 at 10:26 PM, Vladimir Makarov vmaka...@redhat.com wrote: So I checked it on a big file with hundreds of functions on an Intel machine and got, before the part of the patch implementing the insn stack as an sbitmap:

real=243.40 user=241.61 system=1.00

and after that part of the patch:

real=244.89 user=243.02 system=0.96

Is that more than just noise, you think? A ~1.5s difference on ~240s total isn't very much. I measured the timings on my set of cc1-i files, and sometimes the without-patch compiler was faster by a tiny amount, and sometimes it was slower. Even on an average of 10 runs I really couldn't say that the patch was a win or loss on the whole. I measured this on a mostly idle machine at home, not gcc17, which seems to be even more busy than usual lately, thanks to you and me :-)

Sorry, Steven. It was noise. I ran it again 3 times now and found that it was noise. After some thinking, I realized that an sbitmap really is the best representation for this particular case. That is because on the 1st constraint pass all bits are set, as we process *all* insns on the 1st pass. So the sbitmap (at least before the extension, and if we have pretty compact UIDs) will always be smaller than a bitmap. I committed your patch below to the branch (as rev. 192264). And again, sorry for the false conclusions.

Btw, congratulations on all the speedups (even though they probably are noise for regular programs?)! I'm looking forward to the merge of LRA for x86 now. Richard.

2012-10-09 Steven Bosscher ste...@gcc.gnu.org

	* lra-int.h (lra_constraint_insn_stack_bitmap,
	lra_constraint_insn_stack): Remove.
	(lra_pop_insn, lra_insn_stack_length): New prototypes.
	* lra.c (lra_constraint_insn_stack_bitmap): Make static sbitmap.
	(lra_constraint_insn_stack): Make static.
	(lra_push_insn_1): New function.
	(lra_push_insn): Rewrite using lra_push_insn_1.
	(lra_push_insn_and_update_insn_regno_info): Likewise.
	(lra_pop_insn, lra_insn_stack_length): New functions.
	* lra-constraints.c (lra_constraints): Use new interface to the
	insn stack instead of manipulating it in-place.
Re: TODO_rebuild_alias and -O0
On Mon, Oct 8, 2012 at 11:26 AM, Ludovic Courtès l...@gnu.org wrote: Hello, Consider the attached plug-in, which adds a new dummy pass after the “ssa” pass, with ‘TODO_rebuild_alias’ as its start flags. When compiling a non-trivial file with that plug-in at -O0, one ends up with:

numbers.c:8728:10: internal compiler error: in make_decl_rtl, at varasm.c:1163
gcc: internal compiler error: Aborted (program cc1)

and a backtrace like this:

--8<---cut here---start--->8---
#6  0x0056bfa4 in fancy_abort (file=optimized out, line=1163, function=0xc2a3f5 make_decl_rtl) at ../../gcc-4.6.2/gcc/diagnostic.c:893
#7  0x00853bae in make_decl_rtl (decl=0x7f3cfd0458c0) at ../../gcc-4.6.2/gcc/varasm.c:1159
#8  0x005bd485 in expand_expr_real_1 (exp=0x7f3cfd0458c0, target=optimized out, tmode=VOIDmode, modifier=EXPAND_WRITE, alt_rtl=optimized out) at ../../gcc-4.6.2/gcc/expr.c:8462
#9  0x005c83a8 in expand_expr (modifier=EXPAND_WRITE, mode=VOIDmode, target=0x0, exp=0x7f3cfd0458c0) at ../../gcc-4.6.2/gcc/expr.h:422
#10 expand_assignment (nontemporal=0 '\000', from=0x7f3cfc967160, to=0x7f3cfd0458c0) at ../../gcc-4.6.2/gcc/expr.c:4422
#11 expand_assignment (to=0x7f3cfd0458c0, from=0x7f3cfc967160, nontemporal=0 '\000') at ../../gcc-4.6.2/gcc/expr.c:4119
#12 0x005350c8 in expand_gimple_stmt_1 (stmt=0x7f3cfd046460) at ../../gcc-4.6.2/gcc/cfgexpand.c:1975
#13 expand_gimple_stmt (stmt=0x7f3cfd046460) at ../../gcc-4.6.2/gcc/cfgexpand.c:2084
#14 0x00536218 in expand_gimple_basic_block (bb=0x7f3cfd047a90) at ../../gcc-4.6.2/gcc/cfgexpand.c:3626
#15 0x00539bbe in gimple_expand_cfg () at ../../gcc-4.6.2/gcc/cfgexpand.c:4109
#16 0x0068b589 in execute_one_pass (pass=0x10a3dc0) at ../../gcc-4.6.2/gcc/passes.c:1556
#17 0x0068b835 in execute_pass_list (pass=0x10a3dc0) at ../../gcc-4.6.2/gcc/passes.c:1611
#18 0x00757751 in tree_rest_of_compilation (fndecl=0x7f3cfdb03900) at ../../gcc-4.6.2/gcc/tree-optimize.c:422
#19 0x008978cf in cgraph_expand_function (node=0x7f3cfd2c8420) at
../../gcc-4.6.2/gcc/cgraphunit.c:1576
#20 0x0089975e in cgraph_output_in_order () at ../../gcc-4.6.2/gcc/cgraphunit.c:1733
#21 cgraph_optimize () at ../../gcc-4.6.2/gcc/cgraphunit.c:1894
#22 0x0089992a in cgraph_finalize_compilation_unit () at ../../gcc-4.6.2/gcc/cgraphunit.c:1096
#23 0x004a3df5 in c_write_global_declarations () at ../../gcc-4.6.2/gcc/c-decl.c:9871
--8<---cut here---end--->8---

There’s no such problem when optimizations are turned on, though. Does it mean that extra work is needed at -O0 before one can use alias info?

At -O0 no virtual operands are produced. TODO_rebuild_alias only computes points-to sets, which by themselves are not useful. What do you want to achieve with TODO_rebuild_alias? Richard.

Thanks, Ludo’.
Re: TODO_rebuild_alias and -O0
On Mon, Oct 8, 2012 at 11:58 AM, Ludovic Courtès l...@gnu.org wrote: Richard Guenther richard.guent...@gmail.com skribis: At -O0 no virtual operands are produced. TODO_rebuild_alias only computes points-to sets which are in itself not useful. What do you want to achieve with TODO_rebuild_alias? I basically want to use ‘ptr_derefs_may_alias_p’ in this particular pass. That should work. Thanks, Ludo’.
Re: TODO_rebuild_alias and -O0
On Mon, Oct 8, 2012 at 1:42 PM, Ludovic Courtès l...@gnu.org wrote: Richard Guenther richard.guent...@gmail.com skribis: On Mon, Oct 8, 2012 at 11:58 AM, Ludovic Courtès l...@gnu.org wrote: Richard Guenther richard.guent...@gmail.com skribis: At -O0 no virtual operands are produced. TODO_rebuild_alias only computes points-to sets which are in itself not useful. What do you want to achieve with TODO_rebuild_alias? I basically want to use ‘ptr_derefs_may_alias_p’ in this particular pass. That should work.

It actually does, except that ‘ptr_derefs_may_alias_p’ returns true for the two SSA names in cases like this:

double *p, *q;
p = malloc (123);
q = malloc (234);

(Where ‘malloc’ has the ‘malloc’ attribute.) For some reason, there’s no such false positive when using TODO_rebuild_alias.

Well, “that should work” means it works non-conservatively with TODO_rebuild_alias, of course. Richard.

Ludo’.
Re: TODO_rebuild_alias and -O0
On Mon, Oct 8, 2012 at 3:32 PM, Ludovic Courtès l...@gnu.org wrote: Richard Guenther richard.guent...@gmail.com skribis: On Mon, Oct 8, 2012 at 1:42 PM, Ludovic Courtès l...@gnu.org wrote: Richard Guenther richard.guent...@gmail.com skribis: On Mon, Oct 8, 2012 at 11:58 AM, Ludovic Courtès l...@gnu.org wrote: Richard Guenther richard.guent...@gmail.com skribis: At -O0 no virtual operands are produced. TODO_rebuild_alias only computes points-to sets which are in itself not useful. What do you want to achieve with TODO_rebuild_alias? I basically want to use ‘ptr_derefs_may_alias_p’ in this particular pass. That should work. It actually does, except that ‘ptr_derefs_may_alias_p’ returns true for two SSA names in cases like this: double *p, *q; p = malloc (123); q = malloc (234); (Where ‘malloc’ has the ‘malloc’ attribute.) For some reason, there’s no such false positive when using TODO_rebuild_alias. Well, “that should work” means it works non-conservatively with TODO_rebuild_alias, of course.

Right. :-) So how can I get maximum accuracy, while avoiding the assertion failure at -O0?

You have to analyze the assert more. From the backtrace I cannot see anything suspicious.

Thanks, Ludo’.
Re: TODO_rebuild_alias and -O0
On Mon, Oct 8, 2012 at 6:04 PM, Ludovic Courtès ludovic.cour...@inria.fr wrote: Richard Guenther richard.guent...@gmail.com skribis: On Mon, Oct 8, 2012 at 3:32 PM, Ludovic Courtès l...@gnu.org wrote: Richard Guenther richard.guent...@gmail.com skribis: On Mon, Oct 8, 2012 at 1:42 PM, Ludovic Courtès l...@gnu.org wrote: Richard Guenther richard.guent...@gmail.com skribis: On Mon, Oct 8, 2012 at 11:58 AM, Ludovic Courtès l...@gnu.org wrote: Richard Guenther richard.guent...@gmail.com skribis: At -O0 no virtual operands are produced. TODO_rebuild_alias only computes points-to sets which are in itself not useful. What do you want to achieve with TODO_rebuild_alias? I basically want to use ‘ptr_derefs_may_alias_p’ in this particular pass. That should work. It actually does, except that ‘ptr_derefs_may_alias_p’ returns true for two SSA names in cases like this: double *p, *q; p = malloc (123); q = malloc (234); (Where ‘malloc’ has the ‘malloc’ attribute.) For some reason, there’s no such false positive when using TODO_rebuild_alias. Well, “that should work” means it works non-conservatively with TODO_rebuild_alias, of course. Right. :-) So how can I get maximum accuracy, while avoiding the assertion failure at -O0? You have to analyze the assert more. From the backtrace I cannot see anything suspicious.

The assert in ‘make_decl_rtl’ is:

gcc_assert (TREE_CODE (decl) != VAR_DECL
            || TREE_STATIC (decl)
            || TREE_PUBLIC (decl)
            || DECL_EXTERNAL (decl)
            || DECL_REGISTER (decl));

Upon inspection, ‘decl’ is a VAR_DECL, but has all 4 flags above cleared.
In my example source file, the assertion is hit here:

numbers.c: In function ‘scm_c_make_polar’:
numbers.c:8728:10: internal compiler error: in make_decl_rtl, at varasm.c:1163

That line in numbers.c reads:

sincos (ang, &s, &c);

And indeed, the dump of my pass (with -fdump-tree-all) reads:

;; Function scm_c_make_polar (scm_c_make_polar)
Now a gimple register: D.26884
No longer having address taken: s
No longer having address taken: c
Symbols to be put in SSA form { s c D.26884 }
Incremental SSA update started at block: 0
Number of blocks in CFG: 8
Number of blocks to update: 7 ( 88%)

So it looks like there’s additional state corresponding to these variables that needs updating?

Ah, no - but I suppose TODO_update_address_taken doesn't really work at -O0, so you need to fix that to do nothing at -O0. Because s and c _do_ have their address taken. Richard.

Thanks, Ludo’.
Re: DECL_STRUCT_FUNCTION(cgraph_node->decl) == NULL when attempting Link-Time Optimization in plugin
On Mon, Oct 8, 2012 at 5:17 PM, David Malcolm dmalc...@redhat.com wrote:

I'm working on a static analysis extension to GCC via my gcc-python-plugin [1]. The analysis is interprocedural (memory leak detection, as it happens). I have it working on one translation unit at a time, and I'm attempting to get it to work within Link Time Optimization so that it can see the interactions of all of the functions within a program or library, but I'm running into what might be a bug (or I could be misunderstanding LTO). I have my plugin working within lto1, using -flto and -fplugin; my command line looks like:

$ gcc -fPIC -shared -flto -g -fplugin=/home/david/coding/gcc-python/gcc-python/sm/python.so -fplugin-arg-python-script=examples/show-lto-supergraph.py tests/sm/lto/input-*.c --save-temps

lto1 appears to be invoked twice by gcc. The first time, there are two passes:

  IPA_PASS named "whole-program"
  IPA_PASS named "inline"

and in_lto_p is true. The second invocation has these passes, called per-function:

  GIMPLE_PASS *free_cfg_annotations
  GIMPLE_PASS cplxlower0
  GIMPLE_PASS optimized
  RTL_PASS expand
  GIMPLE_PASS *rest_of_compilation
  RTL_PASS *init_function
  ...etc., the normal rest of compilation from the RTL stage onwards

and in_lto_p is false. I've attempted to wire in my interprocedural analysis as a new IPA pass before or after "whole-program" and "inline" within the first invocation of lto1. In each case it is able to walk the callgraph, and the callgraph it sees appears to be correct: walking the linked list of cgraph_nodes, the various cgraph_edge and cgraph_node instances are as expected, and for each cgraph_node, cgraph_node->decl is a non-NULL tree instance of type FUNCTION_DECL (as expected); I'm able to generate a graphviz rendering of the whole-program callgraph within lto1 from my plugin.
However, for every cgraph_node, DECL_STRUCT_FUNCTION(cgraph_node->decl) is (struct function *)NULL, which thwarts my code's attempt to look up the CFG of each underlying function and work on the underlying gimple. Looking with eu-readelf at the .o files shows these lto sections are present (f is the name of one of the functions whose gimple CFG I'd like to access):

$ eu-readelf -a input-f.o|grep lto
[ 5] .gnu.lto_.inline.c8904cb9a96e7417 PROGBITS 006c 0026 0 E 0 0 1
[ 6] .gnu.lto_f.c8904cb9a96e7417 PROGBITS 0092 0164 0 E 0 0 1
[ 7] .gnu.lto_.cgraph.c8904cb9a96e7417 PROGBITS 01f6 0032 0 E 0 0 1
[ 8] .gnu.lto_.vars.c8904cb9a96e7417 PROGBITS 0228 0012 0 E 0 0 1
[ 9] .gnu.lto_.refs.c8904cb9a96e7417 PROGBITS 023a 0013 0 E 0 0 1
[10] .gnu.lto_.statics.c8904cb9a96e7417 PROGBITS 024d 0014 0 E 0 0 1
[11] .gnu.lto_.decls.c8904cb9a96e7417 PROGBITS 0261 01f3 0 E 0 0 1
[12] .gnu.lto_.symtab.c8904cb9a96e7417 PROGBITS 0454 0011 0 E 0 0 1
[13] .gnu.lto_.opts PROGBITS 0465 00e9 0 E 0 0 1
25: 0001 1 OBJECT GLOBAL DEFAULT COMMON __gnu_lto_v1
[2a] __gnu_lto_v1

Are the (struct function *) and CFG meant to exist at this stage, and be in gimple form, and is DECL_STRUCT_FUNCTION meant to be non-NULL for functions in translation units that were compiled with -flto? (Or have I misunderstood LTO?)

It depends: if you are in the WPA stage (lto1 was invoked with -fwpa), no function bodies and thus no CFGs are available at all; if you are in the LTRANS stage (lto1 was invoked with -fltrans), it sees _part_ of the whole program's callgraph and function bodies (and thus CFGs). As a workaround you can probably make your example work by using -flto-partition=none, which merges the WPA and LTRANS stages and pulls the whole program into a single link-time TU. Richard.

This is gcc 4.7 on Fedora 17, specifically gcc-4.7.0-5.fc17.x86_64. Thanks, Dave

[1] https://fedorahosted.org/gcc-python-plugin/
Re: Fixup INTEGER_CST
On Sun, Oct 7, 2012 at 7:22 PM, Jan Hubicka hubi...@ucw.cz wrote: On Sun, Oct 7, 2012 at 5:15 PM, Jan Hubicka hubi...@ucw.cz wrote: Hi, I added a sanity check that after fixup all types that lost in the merging are really dead. And it turns out we have some zombies around. INTEGER_CST needs special care because it is special-cased by the streamer. We also do not want to do in-place modifications on it because that would corrupt the hash table used by tree.c's sharing code. Bootstrapped/regtested x86_64-linux, OK?

No, I don't think we want to fix up INTEGER_CSTs this way. Instead we want to fix them up where they end up being used unfixed.

Erm, I think that is what the patch does? Ah, indeed. It replaces pointers to an integer_cst whose type did not survive with a pointer to a new integer_cst (with the optimization that an INTEGER_CST with overflow is changed in place, because that is allowed). Btw ...

@@ -1526,6 +1549,11 @@ lto_ft_type (tree t)
   LTO_FIXUP_TREE (t->type_non_common.binfo);
   LTO_FIXUP_TREE (TYPE_CONTEXT (t));
+
+  if (TREE_CODE (t) == METHOD_TYPE)
+    TYPE_METHOD_BASETYPE (t);
+  if (TREE_CODE (t) == OFFSET_TYPE)
+    TYPE_OFFSET_BASETYPE (t);

that looks like a no-op to me ... (both are TYPE_MAXVAL, which is already fixed up). Thus, ok with removing the above hunk. Thanks, Richard.

} /* Fix up fields of a BINFO T. */
Re: handle isl and cloog in contrib/download_prerequisites
On Mon, Oct 8, 2012 at 3:16 AM, Jonathan Wakely jwakely@gmail.com wrote: On 7 October 2012 21:31, Manuel López-Ibáñez wrote: On 7 October 2012 22:13, Jonathan Wakely jwakely@gmail.com wrote: On Oct 7, 2012 12:00 AM, NightStrike nightstr...@gmail.com wrote: On Sat, Oct 6, 2012 at 7:30 AM, Manuel López-Ibáñez lopeziba...@gmail.com wrote: Hi, GCC now requires ISL and a very new CLooG, but download_prerequisites does not download those. Also, there is only one sensible place to

As of what version are isl/cloog no longer optional? If they're really no longer optional, then the prerequisites page and the 4.8 changes page need to be updated. The patch downloads isl and cloog unconditionally; does gcc build them unconditionally if they're found in the source dir? If they are still optional, I don't want download_prerequisites to fetch files that will slow down building gcc by building libs and enabling features I don't use.

I guess they are optional in the sense that you can configure gcc to not require them. But the default configure on x86_64-gnu-linux requires isl and cloog.

Are you sure? Seems to me the default is still the same as it always has been, i.e. Graphite optimisations can be enabled if ISL and CLooG are present, but they're not required. I can bootstrap without ISL anyway.

If good enough ISL and CLooG are not found, graphite is simply disabled, unless you explicitly enabled it via specifying either the ISL or the CLooG configury. Richard.
Re: [lra] patch to speed more compilation of PR54146
On Sun, Oct 7, 2012 at 11:27 PM, Steven Bosscher stevenb@gmail.com wrote: On Sun, Oct 7, 2012 at 5:59 PM, Vladimir Makarov wrote: The following patch speeds LRA up more on PR54146. Below are times for compilation of the test on gcc17.fsffrance.org (an AMD machine):

Before: real=1214.71 user=1192.05 system=22.48
After:  real=1144.37 user=1124.31 system=20.11

Hi Vlad, the next bottleneck in my timings is in lra-eliminate.c:lra_eliminate(), in this loop:

FOR_EACH_BB (bb)
  FOR_BB_INSNS_SAFE (bb, insn, temp)
    {
      if (bitmap_bit_p (insns_with_changed_offsets, INSN_UID (insn)))
        process_insn_for_elimination (insn, final_p);
    }

The problem is in bitmap_bit_p. Random access to a large bitmap can be very slow. I'm playing with a patch to expand the insns_with_changed_offsets bitmap to an sbitmap, and will send a patch if this works better.

Or make insns_with_changed_offsets a VEC of insns (or a pointer-set). Richard.

Ciao! Steven
Re: [ping patch] Predict for loop exits in short-circuit conditions
On Mon, Oct 8, 2012 at 4:50 AM, Dehao Chen de...@google.com wrote: Attached is the updated patch. Yes, if we add a VRP pass before the profile pass, this patch would be unnecessary. Should we add a VRP pass?

No, we don't want VRP in early optimizations. Richard.

Thanks, Dehao

On Sat, Oct 6, 2012 at 9:38 AM, Jan Hubicka hubi...@ucw.cz wrote: ping^2 Honza, do you think this patch can make it into 4.8 stage 1?

+ if (check_value_one ^ integer_onep (val))

Probably better as != (especially because the GNU coding standard allows predicates to return more than just a boolean).

+    {
+      edge e1;
+      edge_iterator ei;
+      tree val = gimple_phi_arg_def (phi_stmt, i);
+      edge e = gimple_phi_arg_edge (phi_stmt, i);
+
+      if (!TREE_CONSTANT (val) || !(integer_zerop (val) || integer_onep (val)))
+        continue;
+      if (check_value_one ^ integer_onep (val))
+        continue;
+      if (VEC_length (edge, e->src->succs) != 1)
+        {
+          if (!predicted_by_p (exit_edge->src, PRED_LOOP_ITERATIONS_GUESSED)
+              && !predicted_by_p (exit_edge->src, PRED_LOOP_ITERATIONS)
+              && !predicted_by_p (exit_edge->src, PRED_LOOP_EXIT))
+            predict_edge_def (e, PRED_LOOP_EXIT, NOT_TAKEN);
+          continue;
+        }
+
+      FOR_EACH_EDGE (e1, ei, e->src->preds)
+        if (!predicted_by_p (exit_edge->src, PRED_LOOP_ITERATIONS_GUESSED)
+            && !predicted_by_p (exit_edge->src, PRED_LOOP_ITERATIONS)
+            && !predicted_by_p (exit_edge->src, PRED_LOOP_EXIT))
+          predict_edge_def (e1, PRED_LOOP_EXIT, NOT_TAKEN);

Here you found an edge that you know is going to terminate the loop, and you want to predict all paths to this edge as unlikely. Perhaps you want to use predict_paths_leading_to_edge for the edge? You do not need to check PRED_LOOP_ITERATIONS and PRED_LOOP_ITERATIONS_GUESSED because those never go to the non-exit edges. The nature of the predict_paths_for_bb type of heuristic is that they are not really additive: if the path leads to two different aborts, that does not make it more certain to be unlikely. So perhaps you can add a !predicted_by_p (e, pred) check before the predict_edge_def call in the function?
I wonder, if we did VRP just before branch prediction to jump-thread the shortcut conditions into loopback edges, would there still be cases where this optimization would match? Honza
Re: [PATCH] Fix PR54489 - FRE needing AVAIL_OUT
On Fri, 5 Oct 2012, Steven Bosscher wrote: On Fri, Sep 14, 2012 at 2:26 PM, Richard Guenther rguent...@suse.de wrote: If you can figure out a better name for the function we should probably move it to cfganal.c.

It looks like my previous e-mail about this has gone lost somehow, so retry: your my_rev_post_order_compute is simply inverted_post_order_compute. The only difference is that you'll want to ignore EXIT_BLOCK, which is always added to the list by inverted_post_order_compute.

Indeed. inverted_post_order_compute seems to handle a CFG whose infinite loops and noreturns are not connected to exit, though. Possibly that's why it doesn't care about not handling entry/exit. I'm testing a patch to use inverted_post_order_compute from PRE. Richard.
Re: vec_cond_expr adjustments
On Fri, Oct 5, 2012 at 5:01 PM, Marc Glisse marc.gli...@inria.fr wrote: [I am still a little confused, sorry for the long email...] On Tue, 2 Oct 2012, Richard Guenther wrote:

+  if (TREE_CODE (op0) == VECTOR_CST && TREE_CODE (op1) == VECTOR_CST)
+    {
+      int count = VECTOR_CST_NELTS (op0);
+      tree *elts = XALLOCAVEC (tree, count);
+      gcc_assert (TREE_CODE (type) == VECTOR_TYPE);
+
+      for (int i = 0; i < count; i++)
+        {
+          tree elem_type = TREE_TYPE (type);
+          tree elem0 = VECTOR_CST_ELT (op0, i);
+          tree elem1 = VECTOR_CST_ELT (op1, i);
+
+          elts[i] = fold_relational_const (code, elem_type,
+                                           elem0, elem1);
+
+          if (elts[i] == NULL_TREE)
+            return NULL_TREE;
+
+          elts[i] = fold_negate_const (elts[i], elem_type);

I think you need to invent something new similar to STORE_FLAG_VALUE, or use STORE_FLAG_VALUE here. With the above you try to map {0, 1} to {0, -1}, which is only true if the operation on the element types returns {0, 1} (thus, STORE_FLAG_VALUE is 1).

Er, it seems to me that constant folding of a scalar comparison in the front/middle-end only returns {0, 1}.

[and later] I'd say adjust your fold-const patch to not negate the scalar result but build a proper -1 / 0 value based on integer_zerop().

I don't mind doing it that way, but I would like to understand first. LT_EXPR on scalars is guaranteed (in generic.texi) to be 0 or 1, so negating should be the same as testing with integer_zerop to build -1 or 0. Is it just a matter of style (then I am ok), or am I missing a reason which makes the negation wrong?

Just a matter of style. Negating is a lot less descriptive of the actual set of return values we produce.

The point is we need to define some semantics for vector comparison results.

Yes. I think a documentation patch should come first: generic.texi is missing an entry for VEC_COND_EXPR, and the entry for LT_EXPR doesn't mention vectors. But before that we need to decide what to put there...
One variant is to make it target independent, which in turn would inhibit (or make it more difficult) exploiting some target features. You for example use {0, -1} for truth values - probably to exploit target features - even though the most natural middle-end way would be to use {0, 1} as for everything else. (Caveat: there may be both signed and unsigned bools; we don't allow vector components with non-mode precision, thus you could argue that a signed bool : 1 is just sign-extended for your solution.)

Actually it was mostly because that is the meaning in the language. OpenCL says that a < b is a vector of 0s and -1s, and that ?: only looks at the MSB of the elements in the condition. The fact that it matches what some targets do is a simple consequence of the fact that OpenCL was based on what hardware already did.

Yes, it seems that the {0, -1} choice is the most reasonable one for GENERIC. So let's document that.

I agree that it would be natural and convenient in a number of places. Not sure how that would translate into the code.

A different variant is to make it target dependent, to leverage optimization opportunities - that's why STORE_FLAG_VALUE exists.

That's an interesting possibility... AFAICS it only appears when we go from gimple to RTL, not before (and there is already a VECTOR_STORE_FLAG_VALUE, although no target defines it).

Which doesn't mean we couldn't make it appear earlier for vectors. For example, with vector comparisons, for an a < v result, when performing bitwise operations on it, you either have to make the target expand code to produce {0, -1} even if the natural compare instruction would, say, produce {0, 0x8} - or not constrain the possible values of its result (like forwprop would do with your patch). In general we want constant folding to yield the same results as if the HW carried out the operation, to make -O0 code not diverge from -O1.
Thus, v4si g; int main() { g = (v4si){ 1, 2, 3, 4 } < (v4si){ 4, 3, 2, 1 }; } should not assign different values to g depending on whether constant propagation is performed or not. That one is clear, OpenCL constrains the answer to be {-1,-1,0,0}, whether your target likes it or not. Depending on how things are handled, comparisons could be constrained internally to only appear (possibly indirectly) in the first argument of a vec_cond_expr. Yes, I realized that later. The easiest way out is something like STORE_FLAG_VALUE if there does not exist a middle-end choice for vector true / false components that can be easily generated from what the target produces. Like if you perform a FP comparison int main () { double x = 1.0; static _Bool b; b = x < 3.0; } you get without CCP on x86_64:

ucomisd	-8(%rbp), %xmm0
seta	%al
movb	%al, b.1715(%rip)

thus the equivalent of flag_reg = x < 3.0; b = flag_reg ? 1 : 0; where
Re: patch to fix constant math - second small patch
On Sat, Oct 6, 2012 at 12:48 AM, Kenneth Zadeck zad...@naturalbridge.com wrote: This patch adds machinery to genmodes.c so that the largest possible sizes of various data structures can be determined at gcc build time. These functions create 3 symbols that are available in insn-modes.h: MAX_BITSIZE_MODE_INT - the bitsize of the largest int. MAX_BITSIZE_MODE_PARTIAL_INT - the bitsize of the largest partial int. MAX_BITSIZE_MODE_ANY_INT - the largest bitsize of any kind of int. Ok. Please document these macros in rtl.texi. Richard.
Re: patch to fix constant math - third small patch
On Sat, Oct 6, 2012 at 5:55 PM, Kenneth Zadeck zad...@naturalbridge.com wrote: This is the third patch in the series of patches to fix constant math. This one changes some predicates at the rtl level to use the new predicate CONST_SCALAR_INT_P. I did not include a few that were tightly intertwined with other changes. Not all of these changes are strictly mechanical. Richard, when reviewing this, had me make additional changes to remove what he thought were latent bugs at the rtl level. However, it appears that the bugs were not latent. I do not know what is going on here, but I am smart enough to not look a gift horse in the mouth. All of this was done on the same machine with no changes and identical configs. It is an x86-64 with Ubuntu 12.04. Ok for commit? Patch missing, but if it's just mechanical changes and introduction of CONST_SCALAR_INT_P consider it pre-approved. Richard. In the logs below, gbBaseline is a trunk from Friday and gbWide is the same revision but with my patches. Some of this, like gfortran.dg/pr32627, is obviously flutter, but the rest does not appear to be. =

heracles:~/gcc(13) gccBaseline/contrib/compare_tests gbBaseline/gcc/testsuite/gcc/gcc.log gbWide/gcc/testsuite/gcc/gcc.log
New tests that PASS:
gcc.dg/builtins-85.c scan-assembler mysnprintf
gcc.dg/builtins-85.c scan-assembler-not __chk_fail
gcc.dg/builtins-85.c (test for excess errors)

heracles:~/gcc(14) gccBaseline/contrib/compare_tests gbBaseline/gcc/testsuite/gfortran/gfortran.log gbWide/gcc/testsuite/gfortran/gfortran.log
New tests that PASS:
gfortran.dg/pr32627.f03 -O3 -fomit-frame-pointer -funroll-loops (test for excess errors)
gfortran.dg/pr32627.f03 -O3 -fomit-frame-pointer (test for excess errors)
gfortran.dg/pr32627.f03 -Os (test for excess errors)
gfortran.dg/pr32635.f -O0 execution test
gfortran.dg/pr32635.f -O0 (test for excess errors)
gfortran.dg/substr_6.f90 -O2 (test for excess errors)

Old tests that passed, that have disappeared: (Eeek!)
gfortran.dg/pr32627.f03 -O1 (test for excess errors)
gfortran.dg/pr32627.f03 -O3 -fomit-frame-pointer -funroll-all-loops -finline-functions (test for excess errors)
gfortran.dg/pr32627.f03 -O3 -g (test for excess errors)
gfortran.dg/substring_equivalence.f90 -O (test for excess errors)

Using /home/zadeck/gcc/gccBaseline/gcc/testsuite/config/default.exp as tool-and-target-specific interface file.

=== g++ Summary ===
# of expected passes 49793
# of expected failures 284
# of unsupported tests 601
runtest completed at Fri Oct 5 16:10:20 2012

heracles:~/gcc(16) tail gbWide/gcc/testsuite/g++/g++.log
Using /usr/share/dejagnu/config/unix.exp as generic interface file for target.
Using /home/zadeck/gcc/gccWide/gcc/testsuite/config/default.exp as tool-and-target-specific interface file.

=== g++ Summary ===
# of expected passes 50472
# of expected failures 284
# of unsupported tests 613
runtest completed at Fri Oct 5 19:51:50 2012
Re: Check that unlinked uses do not contain ssa-names when renaming.
On Sun, Oct 7, 2012 at 12:44 PM, Tom de Vries tom_devr...@mentor.com wrote: Richard, attached patch checks that unlinked uses do not contain ssa-names when renaming. This assert triggers when compiling (without the fix) the PR54735 example. AFAIU, it was due to chance that we caught the PR54735 bug by hitting the verification failure, because the new vdef introduced by renaming happened to be the same name as the ssa name referenced in the invalid unlinked use (in terms of maybe_replace_use: rdef == use). The assert from this patch catches all cases that an unlinked use contains an ssa-name. Bootstrapped and reg-tested on x86_64 (Ada inclusive). OK for trunk? I don't think that is exactly what we should assert here ... (I thought about adding checking myself ...). What we'd want to assert is that before any new DEF is registered (which may re-allocate an SSA name) that no uses with SSA_NAME_IN_FREELIST appear. Thus, a light verification pass would be necessary at the beginning of update_ssa (which I queued onto my TODO list ...). We'd want that anyway to for example catch the case where a non-virtual operand is partially renamed. Thanks, Richard. Thanks, - Tom 2012-10-07 Tom de Vries t...@codesourcery.com * tree-into-ssa.c (maybe_replace_use): Add assert.
Re: patch to fix constant math
On Sun, Oct 7, 2012 at 4:58 PM, Kenneth Zadeck zad...@naturalbridge.com wrote: On 10/07/2012 09:19 AM, Richard Guenther wrote: In fact, you could argue that the tree level did it wrong (not that I am suggesting to change this). But it makes me think about what was going on when the decision to make TYPE_PRECISION be an INT_CST rather than just a HWI was made. For that there is an implication that it could never take more than a HWI, since no place in the code even checks TREE_INT_CST_HIGH for these. Well - on the tree level we now always have two HWIs for all INTEGER_CSTs. If we can, based on the size of the underlying mode, reduce that to one HWI we already win something. If we add an explicit length to allow a smaller encoding for larger modes (tree_base conveniently has an available 'int' for this ...) then we'd win in more cases. Thus, is CONST_INT really necessarily better than optimized CONST_WIDE_INT storage? I have to admit that looking at these data structures gives me a headache. This all looks like something that Rube Goldberg would have invented had he done object-oriented design (Richard S did not know who Rube Goldberg was when I mentioned this name to him a few years ago, since this is an American thing, but the British had their own equivalent and I assume the Germans do too.). I did the first cut of changing the rtl level structure, and Richard S threw up on it and suggested what is there now, which happily (for me) I was able to get Mike to implement. Mike also did the tree level version of the data structures for me. I will make sure he used that leftover length field. The bottom line is that you most likely just save the length, but that is a big percent of something this small. Both of the rtl ints have a mode, so if we can make that change later, it will be space neutral. Yes. Btw, Richard's idea of conditionally placing the length field in rtx_def looks like overkill to me.
These days we'd merely want to optimize for 64-bit hosts, thus unconditionally adding a 32-bit field to rtx_def looks ok to me (you can wrap that inside a union to allow both descriptive names and eventual different uses - see what I've done to tree_base). Richard.
Re: Fixup INTEGER_CST
On Mon, Oct 8, 2012 at 11:18 AM, Jan Hubicka hubi...@ucw.cz wrote: On Sun, Oct 7, 2012 at 7:22 PM, Jan Hubicka hubi...@ucw.cz wrote: On Sun, Oct 7, 2012 at 5:15 PM, Jan Hubicka hubi...@ucw.cz wrote: Hi, I added a sanity check that after fixup all types that lost in the merging are really dead. And it turns out we have some zombies around. INTEGER_CST needs special care because it is special-cased by the streamer. We also do not want to do in-place modifications on it because that would corrupt the hash table used by tree.c's sharing code. Bootstrapped/regtested x86_64-linux, OK? No, I don't think we want to fixup INTEGER_CSTs this way. Instead we want to fix them up where they end up used unfixed. Erm, I think that is what the patch does? Ah, indeed. It replaces pointers to an integer_cst whose type did not survive by a pointer to a new integer_cst (with the optimization that an INTEGER_CST with overflow is changed in place, because that is allowed). Btw ...

@@ -1526,6 +1549,11 @@ lto_ft_type (tree t)
   LTO_FIXUP_TREE (t->type_non_common.binfo);
   LTO_FIXUP_TREE (TYPE_CONTEXT (t));
+
+  if (TREE_CODE (t) == METHOD_TYPE)
+    TYPE_METHOD_BASETYPE (t);
+  if (TREE_CODE (t) == OFFSET_TYPE)
+    TYPE_OFFSET_BASETYPE (t);

that looks like a no-op to me ... (both are TYPE_MAXVAL, which is already fixed up). Ah, indeed. They were the result of experimenting with the stale pointers to the obsoleted types and field decls. I now understand where they come from. The reason is twofold. 1) After merging records we replace the field decls in the cache by new ones. This however does not mean that they die, because the existing pointers to them are not replaced. I have a WIP patch for that; it however requires one extra pass over the list of all trees. Yes, I think this is also why we do /* ??? Not sure the above is all relevant in this path canonicalizing TYPE_FIELDS to that of the main variant.
*/ if (ix < i) lto_fixup_types (f2); streamer_tree_cache_insert_at (cache, f1, ix); something I dislike as well and something we should try to address in a more formal way. 2) As we query the type_hash while we are rewriting the types, we run into instability of the hash table. This manifests itself as an ICE when one adds a sanity check that while merging function types their arg types are equivalent, too. This ICEs compiling e.g. sqlite, but I did not really manage to reduce it. I tracked it down to the argument type being inserted into gimple_type_hash, but at the time we query the new argument type, the original is no longer found, despite their hashes being equivalent. The problem is hidden when things fit into the leader cache, so one needs a rather big testcase. Ugh. For reduction you can disable those caches though. The above means there is a disconnect between hashing and comparing. Maybe it's something weird with the early out if (TYPE_ARG_TYPES (t1) == TYPE_ARG_TYPES (t2)) goto same_types; ? So I tried to register all gimple types first, use TREE_VISITED within the merging code to mark that a type is not a leader, and then TREE_CHAIN to point to the leader. This avoids the need to re-query the hash table from the later fixups. We only look for types with TREE_VISITED and replace them by TREE_CHAIN. TREE_CHAIN is unused for types? But we probably shouldn't add a new use ... This has two passes. First we compute the main variants and mark field_decls and type_decls for merging, and in a last pass we finally do fixup on what remained in the table. This allows me to poison pointers in the removed types in a way such that GGC would ICE if they stayed reachable. I however need the extra pass because 1) I can not update the type_decls/field_decls while registering types or I run into the hash table problems 2) I can not merge the second two passes because at the time I find type/field decls equivalent there may be earlier pointers to them.
You need to merge all trees reachable from the one you start at once (what I'm working on from time to time - work per tree SCC, in a DFS walk). Richard. Honza
Re: vec_cond_expr adjustments
On Mon, Oct 8, 2012 at 11:34 AM, Marc Glisse marc.gli...@inria.fr wrote: On Mon, 8 Oct 2012, Richard Guenther wrote: VEC_COND_EXPR is more complicated. We could for instance require that it takes as first argument a vector of -1 and 0 (thus < 0, != 0 and the neon thing are equivalent). Which would leave to decide what the expansion of vec_cond_expr passes to the targets when the first argument is not a comparison, between != 0, < 0, == -1 or others (I vote for < 0 because of OpenCL). One issue is that targets wouldn't know if it was a dummy comparison that can safely be ignored because the other part is the result of logical operations on comparisons (thus composed of -1 and 0), or a genuine comparison with an arbitrary vector, so a new optimization would be needed (in the back-end I guess, or we would need an alternate instruction to vcond) to detect if a vector is a signed boolean vector. We could instead say that vec_cond_expr really follows OpenCL's semantics and looks at the MSB of each element. I am not sure that would change much; it would mostly delay the apparition of < 0 to RTL expansion time (and thus make gimple slightly lighter). I think we should delay the decision on how to optimize this. It's indeed not trivial, and the GIMPLE middle-end aggressively forwards feeding comparisons into the VEC_COND_EXPR expressions already (somewhat defeating any CSE that might be possible here) in forwprop. Thanks for going through the long email :-) What does that imply for the first argument of VEC_COND_EXPR? Currently, the expander asserts that it is a comparison, but that is not reflected in the gimple checkers. And I don't think we should reflect that in the gimple checkers; rather fix up the expander (transparently use p != 0 or p < 0). If we document that VEC_COND_EXPR takes a vector of -1 and 0 (which is the case for a comparison), I don't think it prevents us from later relaxing that to < 0 or != 0.
But then I don't know how to handle expansion when the argument is neither a comparison (vcond) nor a constant (vec_merge? I haven't tried, but that should be doable); I would have to pass < 0 or != 0 to the target. Yes. So is the best choice to document that VEC_COND_EXPR takes as first argument a comparison and make gimple checking reflect that? (seems sad, but at least that would tell me what I can/can't do) No, that would just mean that in GIMPLE you'd add this p != 0 or p < 0. And at some point in the future I really really want to push this embedded expression to a separate statement so I have an SSA definition for it. By the way, since we are documenting comparisons as returning 0 and -1, does that bring back the integer_truep predicate? Not sure, true would still be != 0 or all_onesp (all bits of the precision are 1), no? Richard. -- Marc Glisse
Re: [RFC] Make vectorizer to skip loops with small iteration estimate
On Sat, Oct 6, 2012 at 11:34 AM, Jan Hubicka hubi...@ucw.cz wrote: Hi, I benchmarked the patch moving loop header copying and it is a quite noticeable win. Some testsuite updating is needed. In many cases it is just because the optimizations are now happening earlier. There are however a few testsuite failures I have trouble dealing with:

./testsuite/gcc/gcc.sum:FAIL: gcc.dg/tree-ssa/pr21559.c scan-tree-dump-times vrp1 "Threaded jump" 3
./testsuite/gcc/gcc.sum:FAIL: gcc.dg/tree-ssa/ssa-dom-thread-2.c scan-tree-dump-times vrp1 "Jumps threaded: 1" 1
./testsuite/gcc/gcc.sum:FAIL: gcc.dg/vect/O3-slp-reduc-10.c scan-tree-dump-times vect "vectorized 1 loops" 2
./testsuite/g++/g++.sum:FAIL: g++.dg/tree-ssa/pr18178.C -std=gnu++98 scan-tree-dump-times vrp1 "if" 1
./testsuite/g++/g++.sum:FAIL: g++.dg/tree-ssa/pr18178.C -std=gnu++11 scan-tree-dump-times vrp1 "if" 1

This is mostly about VRP losing its ability to thread some jumps from the duplicated loop header out of the loop across the loopback edge. This seems to be due to the loop updating logic. Do we care about these? Yes, I think so. At least we care that the optimized result is the same. Can you elaborate on "due to the loop updating logic"? Can you elaborate on the def_split_header_continue_p change? Which probably should be tested and installed separately? Thanks, Richard. Honza

Index: tree-ssa-threadupdate.c
===
*** tree-ssa-threadupdate.c (revision 192123)
--- tree-ssa-threadupdate.c (working copy)
*** static bool
*** 846,854
  def_split_header_continue_p (const_basic_block bb, const void *data)
  {
    const_basic_block new_header = (const_basic_block) data;
!   return (bb != new_header
!           && (loop_depth (bb->loop_father)
!               >= loop_depth (new_header->loop_father)));
  }
  /* Thread jumps through the header of LOOP.  Returns true if cfg changes.
--- 846,860
  def_split_header_continue_p (const_basic_block bb, const void *data)
  {
    const_basic_block new_header = (const_basic_block) data;
!   const struct loop *l;
!
!   if (bb == new_header
!       || loop_depth (bb->loop_father) < loop_depth (new_header->loop_father))
!     return false;
!   for (l = bb->loop_father; l; l = loop_outer (l))
!     if (l == new_header->loop_father)
!       return true;
!   return false;
  }
  /* Thread jumps through the header of LOOP.  Returns true if cfg changes.

Index: testsuite/gcc.dg/unroll_2.c
===
*** testsuite/gcc.dg/unroll_2.c (revision 192123)
--- testsuite/gcc.dg/unroll_2.c (working copy)
***
*** 1,5
  /* { dg-do compile { target i?86-*-linux* x86_64-*-linux* } } */
! /* { dg-options "-O2 -fdump-rtl-loop2_unroll -fno-peel-loops -fdisable-tree-cunroll=foo -fdisable-tree-cunrolli=foo -fenable-rtl-loop2_unroll" } */
  unsigned a[100], b[100];
  inline void bar()
--- 1,5
  /* { dg-do compile { target i?86-*-linux* x86_64-*-linux* } } */
! /* { dg-options "-O2 -fdump-rtl-loop2_unroll -fno-peel-loops -fdisable-tree-cunroll=foo -fdisable-tree-cunrolli=foo -fenable-rtl-loop2_unroll -fno-tree-dominator-opts" } */
  unsigned a[100], b[100];
  inline void bar()

Index: testsuite/gcc.dg/unroll_3.c
===
*** testsuite/gcc.dg/unroll_3.c (revision 192123)
--- testsuite/gcc.dg/unroll_3.c (working copy)
***
*** 1,5
  /* { dg-do compile { target i?86-*-linux* x86_64-*-linux* } } */
! /* { dg-options "-O2 -fdump-rtl-loop2_unroll -fno-peel-loops -fdisable-tree-cunroll -fdisable-tree-cunrolli -fenable-rtl-loop2_unroll=foo" } */
  unsigned a[100], b[100];
  inline void bar()
--- 1,5
  /* { dg-do compile { target i?86-*-linux* x86_64-*-linux* } } */
! /* { dg-options "-O2 -fdump-rtl-loop2_unroll -fno-peel-loops -fdisable-tree-cunroll -fdisable-tree-cunrolli -fenable-rtl-loop2_unroll=foo -fno-tree-dominator-opts" } */
  unsigned a[100], b[100];
  inline void bar()

Index: testsuite/gcc.dg/torture/pr23821.c
===
*** testsuite/gcc.dg/torture/pr23821.c (revision 192123)
--- testsuite/gcc.dg/torture/pr23821.c (working copy)
***
*** 1,9
  /* { dg-do compile } */
  /* { dg-skip-if { *-*-* } { -O0 -fno-fat-lto-objects } { } } */
! /* At -O1 DOM threads a jump in a non-optimal way which leads to the bogus propagation.
*/
! /* { dg-skip-if { *-*-* } { -O1 } { } } */
! /* { dg-options "-fdump-tree-ivcanon-details" } */
  int a[199];
--- 1,8
  /* { dg-do compile } */
  /* { dg-skip-if { *-*-* } { -O0 -fno-fat-lto-objects } { } } */
! /* DOM threads a jump in a non-optimal way which leads to the bogus propagation. */
! /* { dg-options "-fdump-tree-ivcanon-details -fno-tree-dominator-opts" } */
  int a[199];
Re: [ping patch] Predict for loop exits in short-circuit conditions
On Mon, Oct 8, 2012 at 12:01 PM, Jan Hubicka hubi...@ucw.cz wrote: On Mon, Oct 8, 2012 at 11:04 AM, Jan Hubicka hubi...@ucw.cz wrote: On Mon, Oct 8, 2012 at 4:50 AM, Dehao Chen de...@google.com wrote: Attached is the updated patch. Yes, if we add a VRP pass before the profile pass, this patch would be unnecessary. Should we add a VRP pass? No, we don't want VRP in early optimizations. I am not quite sure about that. VRP 1) makes branch prediction work better by doing jump threading early. Well ... but jump threading may need basic-block duplication, which may increase code size. Also, VRP and FRE have pass ordering issues. 2) is, after FRE, the most effective tree pass at removing code, by my profile statistics. We also don't have DSE in early opts. I don't want to end up with the situation that we do everything in early opts ... we should do _less_ there (but eventually iterate properly when processing cycles). Yep, I am not quite sure about the most sane variant. Missed simple jump threading in early opts definitely confuses both the profile estimate and the inline size estimates. But I am also not thrilled by adding more passes to early opts at all. Also, last time I looked into this, CCP missed a lot of CCP opportunities, making VRP artificially look more useful. Eh .. that shouldn't happen. Do you have testcases by any chance? I used to duplicate each SSA propagator pass and checked -fdump-statistics-stats for whether the 2nd pass does nothing (thus chaining CCP doesn't improve results). But maybe that's not the issue you run into here? I have a patch that improves profile updating after jump threading a bit (i.e. re-does the profile for simple cases), but still jump threading is the most common cause of the profile becoming inconsistent after expand. On a related note, with -fprofile-report I can easily track how much code each pass in the queue removed. I was thinking about running this on Mozilla at -O1 and removing those passes that did almost nothing.
Those are mostly re-run passes, both at the Gimple and RTL level. Our pass manager is not terribly friendly for controlling passes per-repetition. Sure. You can also more thoroughly instrument passes and use -fdump-statistics for that (I've done that), but we usually have testcases that require that each pass that is still there is present ... With the introduction of the -Og pass queue, do you think introducing an -O1 pass queue for late tree passes (that will be quite short) is sane? Yes. I don't like the dump-file naming mess that results, though, but if we want to support the optimized attribute switching between -O1 and -O2 then I guess we have to live with that ... Originally I wanted to base -Og on -O1 (thus have them mostly share the pass queue) and retain the same pass queue for -O2 and -Os. Maybe that's what we eventually want to do. Thus, add a (off for -Og) loop optimizer sub-pass to the queue and schedule some scalar cleanups after it but inside it. What about the RTL level? I guess we can split the queues for RTL optimizations, too. All optimization passes prior to register allocation are sort of optional, and I guess there are also -Og candidates. Yes. Though I first wanted to see actual issues with the RTL optimizers and -Og. I however find the three-times-duplicated queues a bit uncool, too, but I guess it is most compatible with the PM organization. Indeed ;) We should at least try to share the queues for -Og and -O1. At -O3 the most effective passes on combine.c are:

cfg (because of cfg cleanup) -1.5474%
Early inlining -0.4991%
FRE -7.9369%
VRP -0.9321% (if run early), ccp does -0.2273%

I think VRP has the advantage of taking loop iteration counts into account. Maybe we can add sth similar to CCP. It's sad that VRP is too expensive; it really is a form of CCP, so merging both passes would be best (we can at a single point, add_equivalence, turn off equivalence processing - the most expensive part of VRP - and call that CCP ...).
tailr -0.5305%

After IPA:
copyrename -2.2850% (it packs cleanups after inlining)
forwprop -0.5432%
VRP -0.9700% (if rerun after early passes, otherwise it is about 2%)
PRE -2.4123%
DOM -0.5182%

RTL passes:
into_cfglayout -3.1400% (i.e. the first cleanup_cfg)
fwprop1 -3.0467%
cprop -2.7786%
combine -3.3346%
IRA -3.4912% (i.e. the cost model prefers hard regs)
bbro -0.9765%

The numbers on tramp3d and the LTO cc1 binary are not that different. Yes. Richard. Honza
Re: [Patch] Fix PR53397
On Mon, Oct 8, 2012 at 12:01 PM, Kumar, Venkataramanan venkataramanan.ku...@amd.com wrote: Hi Richard, I have incorporated your comments. Yes, call dump_mem_ref then, instead of repeating parts of its body. The reference object is not yet created at the place we check for invariance; it is still a tree expression. I created a common function and used it at all places to dump the step, base and delta values of the memory reference being analyzed. Please find the modified patch attached. GCC regression make check -k passes with x86_64-unknown-linux-gnu. I presume also bootstrapped. Ok. Thanks, Richard. Regards, Venkat. -Original Message- From: Richard Guenther [mailto:richard.guent...@gmail.com] Sent: Thursday, October 04, 2012 6:26 PM To: Kumar, Venkataramanan Cc: Richard Guenther; gcc-patches@gcc.gnu.org Subject: Re: [Patch] Fix PR53397 On Tue, Oct 2, 2012 at 6:40 PM, Kumar, Venkataramanan venkataramanan.ku...@amd.com wrote: Hi Richi, (Snip)

+      && (!cst_and_fits_in_hwi (step))
+    {
+      if (loop->inner != NULL)
+        {
+          if (dump_file && (dump_flags & TDF_DETAILS))
+            {
+              fprintf (dump_file, "Reference %p:\n", (void *) ref);
+              fprintf (dump_file, "(base ");
+              print_generic_expr (dump_file, base, TDF_SLIM);
+              fprintf (dump_file, ", step ");
+              print_generic_expr (dump_file, step, TDF_TREE);
+              fprintf (dump_file, ")\n");

No need to repeat this - all references are dumped when we gather them. (Snip) The dumping happens at record_ref, which is called after these statements to record the references. When the step is invariant we return from the function without recording the references, so I thought of dumping the references here. Is there a cleaner way to dump the references at one place? Yes, call dump_mem_ref then, instead of repeating parts of its body. Richard. Regards, Venkat.
-Original Message- From: Richard Guenther [mailto:rguent...@suse.de] Sent: Tuesday, October 02, 2012 5:42 PM To: Kumar, Venkataramanan Cc: gcc-patches@gcc.gnu.org Subject: Re: [Patch] Fix PR53397 On Mon, 1 Oct 2012, venkataramanan.ku...@amd.com wrote: Hi, The below patch fixes the FFT/SciMark regression caused by useless prefetch generation. This fix tries to make prefetch less aggressive by prefetching arrays in the inner loop only when the step is invariant in the entire loop nest. GCC currently tries to prefetch invariant steps when they are in the inner loop, but does not check if the step is variant in outer loops. In the SciMark FFT case, the trip count of the inner loop varies by a non-constant step which is invariant in the inner loop, but the step variable varies in the outer loop. This makes the inner loop trip count small (at run time it varies, sometimes to as little as 1 iteration). Prefetching ahead x iterations when the inner loop trip count is smaller than x leads to useless prefetches. Flags used: -O3 -march=amdfam10

Before:
** SciMark2 Numeric Benchmark, see http://math.nist.gov/scimark **
** for details. (Results can be submitted to p...@nist.gov) **
Using 2.00 seconds min time per kenel.
Composite Score: 550.50
FFT Mflops: 38.66 (N=1024)
SOR Mflops: 617.61 (100 x 100)
MonteCarlo: Mflops: 173.74
Sparse matmult Mflops: 675.63 (N=1000, nz=5000)
LU Mflops: 1246.88 (M=100, N=100)

After:
** SciMark2 Numeric Benchmark, see http://math.nist.gov/scimark **
** for details. (Results can be submitted to p...@nist.gov) **
Using 2.00 seconds min time per kenel.
Composite Score: 639.20
FFT Mflops: 479.19 (N=1024)
SOR Mflops: 617.61 (100 x 100)
MonteCarlo: Mflops: 173.18
Sparse matmult Mflops: 679.13 (N=1000, nz=5000)
LU Mflops: 1246.88 (M=100, N=100)

GCC regression make check -k passes with x86_64-unknown-linux-gnu.
New tests that PASS:
gcc.dg/pr53397-1.c scan-assembler prefetcht0
gcc.dg/pr53397-1.c scan-tree-dump aprefetch Issued prefetch
gcc.dg/pr53397-1.c (test for excess errors)
gcc.dg/pr53397-2.c scan-tree-dump aprefetch loop variant step
gcc.dg/pr53397-2.c scan-tree-dump aprefetch Not prefetching
gcc.dg/pr53397-2.c (test for excess errors)

Checked CPU2006 and polyhedron on the latest AMD processor, no regressions noted. Ok to commit in trunk? regards, Venkat

gcc/ChangeLog
+2012-10-01 Venkataramanan Kumar venkataramanan.ku...@amd.com
+
+	* tree-ssa-loop-prefetch.c (gather_memory_references_ref):
+	Perform non constant step prefetching in inner loop, only
[PATCH] Remove my_rev_post_order_compute
This replaces my_rev_post_order_compute in PRE by the already existing inverted_post_order_compute, with the necessary adjustments. Bootstrapped and tested on x86_64-unknown-linux-gnu, applied. Richard. 2012-10-08 Richard Guenther rguent...@suse.de * tree-ssa-pre.c (postorder_num): New global. (compute_antic): Initialize all blocks and adjust for generic postorder. (my_rev_post_order_compute): Remove. (init_pre): Use inverted_post_order_compute. Index: gcc/tree-ssa-pre.c === --- gcc/tree-ssa-pre.c (revision 192119) +++ gcc/tree-ssa-pre.c (working copy) @@ -430,6 +430,7 @@ typedef struct bb_bitmap_sets /* Basic block list in postorder. */ static int *postorder; +static int postorder_num; /* This structure is used to keep track of statistics on what optimization PRE was able to perform. */ @@ -2456,7 +2457,7 @@ compute_antic (void) has_abnormal_preds = sbitmap_alloc (last_basic_block); sbitmap_zero (has_abnormal_preds); - FOR_EACH_BB (block) + FOR_ALL_BB (block) { edge_iterator ei; edge e; @@ -2480,9 +2481,7 @@ compute_antic (void) } /* At the exit block we anticipate nothing. */ - ANTIC_IN (EXIT_BLOCK_PTR) = bitmap_set_new (); BB_VISITED (EXIT_BLOCK_PTR) = 1; - PA_IN (EXIT_BLOCK_PTR) = bitmap_set_new (); changed_blocks = sbitmap_alloc (last_basic_block + 1); sbitmap_ones (changed_blocks); @@ -2496,7 +2495,7 @@ compute_antic (void) for PA ANTIC computation. */ num_iterations++; changed = false; - for (i = n_basic_blocks - NUM_FIXED_BLOCKS - 1; i >= 0; i--) + for (i = postorder_num - 1; i >= 0; i--) { if (TEST_BIT (changed_blocks, postorder[i])) { @@ -2525,7 +2524,7 @@ compute_antic (void) fprintf (dump_file, "Starting iteration %d\n", num_iterations); num_iterations++; changed = false; - for (i = n_basic_blocks - NUM_FIXED_BLOCKS - 1 ; i >= 0; i--) + for (i = postorder_num - 1 ; i >= 0; i--) { if (TEST_BIT (changed_blocks, postorder[i])) { @@ -4593,78 +4592,6 @@ remove_dead_inserted_code (void) BITMAP_FREE (worklist); } -/* Compute a reverse post-order in *POST_ORDER.
If INCLUDE_ENTRY_EXIT is - true, then then ENTRY_BLOCK and EXIT_BLOCK are included. Returns - the number of visited blocks. */ - -static int -my_rev_post_order_compute (int *post_order, bool include_entry_exit) -{ - edge_iterator *stack; - int sp; - int post_order_num = 0; - sbitmap visited; - - if (include_entry_exit) -post_order[post_order_num++] = EXIT_BLOCK; - - /* Allocate stack for back-tracking up CFG. */ - stack = XNEWVEC (edge_iterator, n_basic_blocks + 1); - sp = 0; - - /* Allocate bitmap to track nodes that have been visited. */ - visited = sbitmap_alloc (last_basic_block); - - /* None of the nodes in the CFG have been visited yet. */ - sbitmap_zero (visited); - - /* Push the last edge on to the stack. */ - stack[sp++] = ei_start (EXIT_BLOCK_PTR->preds); - - while (sp) -{ - edge_iterator ei; - basic_block src; - basic_block dest; - - /* Look at the edge on the top of the stack. */ - ei = stack[sp - 1]; - src = ei_edge (ei)->src; - dest = ei_edge (ei)->dest; - - /* Check if the edge source has been visited yet. */ - if (src != ENTRY_BLOCK_PTR && ! TEST_BIT (visited, src->index)) -{ - /* Mark that we have visited the destination. */ - SET_BIT (visited, src->index); - - if (EDGE_COUNT (src->preds) > 0) -/* Since the SRC node has been visited for the first - time, check its predecessors. */ -stack[sp++] = ei_start (src->preds); - else -post_order[post_order_num++] = src->index; -} - else -{ - if (ei_one_before_end_p (ei) && dest != EXIT_BLOCK_PTR) -post_order[post_order_num++] = dest->index; - - if (!ei_one_before_end_p (ei)) -ei_next (&stack[sp - 1]); - else -sp--; -} -} - - if (include_entry_exit) -post_order[post_order_num++] = ENTRY_BLOCK; - - free (stack); - sbitmap_free (visited); - return post_order_num; -} - /* Initialize data structures used by PRE.
*/ @@ -4686,9 +4613,8 @@ init_pre (void) connect_infinite_loops_to_exit (); memset (pre_stats, 0, sizeof (pre_stats)); - - postorder = XNEWVEC (int, n_basic_blocks - NUM_FIXED_BLOCKS); - my_rev_post_order_compute (postorder, false); + postorder = XNEWVEC (int, n_basic_blocks); + postorder_num = inverted_post_order_compute (postorder); alloc_aux_for_blocks (sizeof (struct bb_bitmap_sets));
[PATCH] Fix PR54825
This fixes PR54825, properly FRE/PRE-ing vector BIT_FIELD_REFs. Bootstrapped and tested on x86_64-unknown-linux-gnu, applied.

Richard.

2012-10-08  Richard Guenther  rguent...@suse.de

	PR tree-optimization/54825
	* tree-ssa-sccvn.c (vn_nary_length_from_stmt): Handle BIT_FIELD_REF.
	(init_vn_nary_op_from_stmt): Likewise.
	* tree-ssa-pre.c (compute_avail): Use vn_nary_op_lookup_stmt.
	* tree-ssa-sccvn.h (sizeof_vn_nary_op): Avoid overflow.

Index: gcc/tree-ssa-sccvn.c
===================================================================
*** gcc/tree-ssa-sccvn.c	(revision 192120)
--- gcc/tree-ssa-sccvn.c	(working copy)
*************** vn_nary_length_from_stmt (gimple stmt)
*** 2194,2199 ****
--- 2194,2202 ----
      case VIEW_CONVERT_EXPR:
        return 1;

+     case BIT_FIELD_REF:
+       return 3;
+
      case CONSTRUCTOR:
        return CONSTRUCTOR_NELTS (gimple_assign_rhs1 (stmt));
*************** init_vn_nary_op_from_stmt (vn_nary_op_t
*** 2220,2225 ****
--- 2223,2235 ----
        vno->op[0] = TREE_OPERAND (gimple_assign_rhs1 (stmt), 0);
        break;

+     case BIT_FIELD_REF:
+       vno->length = 3;
+       vno->op[0] = TREE_OPERAND (gimple_assign_rhs1 (stmt), 0);
+       vno->op[1] = TREE_OPERAND (gimple_assign_rhs1 (stmt), 1);
+       vno->op[2] = TREE_OPERAND (gimple_assign_rhs1 (stmt), 2);
+       break;
+
      case CONSTRUCTOR:
        vno->length = CONSTRUCTOR_NELTS (gimple_assign_rhs1 (stmt));
        for (i = 0; i < vno->length; ++i)
*************** init_vn_nary_op_from_stmt (vn_nary_op_t
*** 2227,2232 ****
--- 2237,2243 ----
        break;

      default:
+       gcc_checking_assert (!gimple_assign_single_p (stmt));
        vno->length = gimple_num_ops (stmt) - 1;
        for (i = 0; i < vno->length; ++i)
	vno->op[i] = gimple_op (stmt, i + 1);
Index: gcc/tree-ssa-pre.c
===================================================================
*** gcc/tree-ssa-pre.c	(revision 192120)
--- gcc/tree-ssa-pre.c	(working copy)
*************** compute_avail (void)
*** 3850,3860 ****
	      || code == VEC_COND_EXPR)
	    continue;

!	  vn_nary_op_lookup_pieces (gimple_num_ops (stmt) - 1,
!				    code,
!				    gimple_expr_type (stmt),
!				    gimple_assign_rhs1_ptr (stmt),
!				    &nary);
	  if (!nary)
	    continue;
--- 3850,3856 ----
	      || code == VEC_COND_EXPR)
	    continue;

!	  
vn_nary_op_lookup_stmt (stmt, &nary);
	  if (!nary)
	    continue;
Index: gcc/tree-ssa-sccvn.h
===================================================================
*** gcc/tree-ssa-sccvn.h	(revision 192120)
--- gcc/tree-ssa-sccvn.h	(working copy)
*************** typedef const struct vn_nary_op_s *const
*** 51,57 ****
  static inline size_t
  sizeof_vn_nary_op (unsigned int length)
  {
!   return sizeof (struct vn_nary_op_s) + sizeof (tree) * (length - 1);
  }

  /* Phi nodes in the hashtable consist of their non-VN_TOP phi
--- 51,57 ----
  static inline size_t
  sizeof_vn_nary_op (unsigned int length)
  {
!   return sizeof (struct vn_nary_op_s) + sizeof (tree) * length - sizeof (tree);
  }

  /* Phi nodes in the hashtable consist of their non-VN_TOP phi
Re: patch to fix constant math - third small patch
On Mon, Oct 8, 2012 at 1:36 PM, Kenneth Zadeck zad...@naturalbridge.com wrote: yes, my bad. here it is with the patches. Just for the record, ok! Thanks, Richard. On 10/06/2012 11:55 AM, Kenneth Zadeck wrote: This is the third patch in the series of patches to fix constant math. this one changes some predicates at the rtl level to use the new predicate CONST_SCALAR_INT_P. I did not include a few that were tightly intertwined with other changes. Not all of these changes are strictly mechanical. Richard, when reviewing this had me make additional changes to remove what he thought were latent bugs at the rtl level. However, it appears that the bugs were not latent.I do not know what is going on here but i am smart enough to not look a gift horse in the mouth. All of this was done on the same machine with no changes and identical configs. It is an x86-64 with ubuntu 12-4. ok for commit? in the logs below, gbBaseline is a trunk from friday and the gbWide is the same revision but with my patches. Some of this like gfortran.dg/pr32627 is obviously flutter, but the rest does not appear to be. = heracles:~/gcc(13) gccBaseline/contrib/compare_tests gbBaseline/gcc/testsuite/gcc/gcc.log gbWide/gcc/testsuite/gcc/gcc.log New tests that PASS: gcc.dg/builtins-85.c scan-assembler mysnprintf gcc.dg/builtins-85.c scan-assembler-not __chk_fail gcc.dg/builtins-85.c (test for excess errors) heracles:~/gcc(14) gccBaseline/contrib/compare_tests gbBaseline/gcc/testsuite/gfortran/gfortran.log gbWide/gcc/testsuite/gfortran/gfortran.log New tests that PASS: gfortran.dg/pr32627.f03 -O3 -fomit-frame-pointer -funroll-loops (test for excess errors) gfortran.dg/pr32627.f03 -O3 -fomit-frame-pointer (test for excess errors) gfortran.dg/pr32627.f03 -Os (test for excess errors) gfortran.dg/pr32635.f -O0 execution test gfortran.dg/pr32635.f -O0 (test for excess errors) gfortran.dg/substr_6.f90 -O2 (test for excess errors) Old tests that passed, that have disappeared: (Eeek!) 
gfortran.dg/pr32627.f03 -O1 (test for excess errors) gfortran.dg/pr32627.f03 -O3 -fomit-frame-pointer -funroll-all-loops -finline-functions (test for excess errors) gfortran.dg/pr32627.f03 -O3 -g (test for excess errors) gfortran.dg/substring_equivalence.f90 -O (test for excess errors) Using /home/zadeck/gcc/gccBaseline/gcc/testsuite/config/default.exp as tool-and-target-specific interface file. === g++ Summary === # of expected passes49793 # of expected failures284 # of unsupported tests601 runtest completed at Fri Oct 5 16:10:20 2012 heracles:~/gcc(16) tail gbWide/gcc/testsuite/g++/g++.log Using /usr/share/dejagnu/config/unix.exp as generic interface file for target. Using /home/zadeck/gcc/gccWide/gcc/testsuite/config/default.exp as tool-and-target-specific interface file. === g++ Summary === # of expected passes50472 # of expected failures284 # of unsupported tests613 runtest completed at Fri Oct 5 19:51:50 2012
Re: [RFC] Implement load sinking in loops
On Mon, Oct 8, 2012 at 12:38 PM, Eric Botcazou ebotca...@adacore.com wrote: Hi, we recently noticed that, even at -O3, the compiler doesn't figure out that the following loop is dumb: #define SIZE 64 int foo (int v[]) { int r; for (i = 0; i < SIZE; i++) r = v[i]; return r; } which was a bit of a surprise. On second thoughts, this isn't entirely unexpected, as it probably matters only for (slightly) pathological cases. The attached patch nevertheless implements a form of load sinking in loops so as to optimize these cases. It's combined with invariant motion to optimize: int foo (int v[], int a) { int r, i; for (i = 0; i < SIZE; i++) r = v[i] + a; return r; } and with store sinking to optimize: int foo (int v1[], int v2[]) { int r[SIZE]; int i, j; for (j = 0; j < SIZE; j++) for (i = 0; i < SIZE; i++) r[j] = v1[j] + v2[i]; return r[SIZE - 1]; } The optimization is enabled at -O2 in the patch for measurement purposes but, given how rarely it triggers (e.g. exactly 10 occurrences in a GCC bootstrap, compiler-only, all languages except Go), it's probably best suited to -O3. Or perhaps we don't care and it should simply be dropped... Thoughts? Incidentally we have scev-const-prop to deal with the similar case of scalar computations. But I realize this doesn't work for expressions that are dependent on a loop-variant load. @@ -103,6 +103,7 @@ typedef struct mem_ref_loc { tree *ref; /* The reference itself. */ gimple stmt; /* The statement in that it occurs. */ + tree lhs; /* The (ultimate) LHS for a load. */ } *mem_ref_loc_p; isn't that the lhs of stmt? +static gimple_seq +copy_load_and_single_use_chain (mem_ref_loc_p loc, tree *new_lhs) +{ + tree mem = *loc->ref; + tree lhs, tmp_var, ssa_name; + gimple_seq seq = NULL; + gimple stmt; + unsigned n = 0; + + /* First copy the load and create the new LHS for it.
*/ + lhs = gimple_assign_lhs (loc->stmt); + tmp_var = create_tmp_reg (TREE_TYPE (lhs), get_lsm_tmp_name (mem, n++)); use make_temp_ssa_name or simply copy_ssa_name (not sure you need fancy names here). + if (gimple_assign_rhs1 (use_stmt) == lhs) + { + op1 = ssa_name; + op2 = gimple_assign_rhs2 (use_stmt); + } + else + { + op1 = gimple_assign_rhs1 (use_stmt); + op2 = ssa_name; + } this may enlarge the lifetime of the other operand? And it looks like it would break with unary stmts (accessing out-of-bounds op2). Also for an is_gimple_min_invariant other operand, which may be for example a.b, you need to unshare_expr it. + lhs = gimple_assign_lhs (use_stmt); + tmp_var = create_tmp_reg (TREE_TYPE (lhs), get_lsm_tmp_name (mem, n++)); + stmt = gimple_build_assign_with_ops (rhs_code, tmp_var, op1, op2); + ssa_name = make_ssa_name (tmp_var, stmt); + gimple_assign_set_lhs (stmt, ssa_name); see above. This can now be simplified to lhs = gimple_assign_lhs (use_stmt); ssa_name = copy_ssa_name (lhs, NULL); stmt = gimple_build_assign_with_ops (rhs_code, ssa_name, op1, op2); Btw - isn't this all a bit backward (I mean the analysis in execute_lm)? What you want is to apply this transform to as many of the _DEF_s of the loop-closed PHI nodes as possible - only values used outside of the loop are interesting. That's (sort of) what SCEV const-prop does (well, it also uses SCEV to compute the overall effect of the iterations). So what you want to know is whether, when walking the DEF chain of the loop-closed PHI, you end up at definitions before the loop or at definitions that are not otherwise used inside the loop. Which means it is really expression sinking. Does tree-ssa-sink manage to sink anything out of a loop? Even scalar computation parts, I mean? For for (..) { a = x[i]; y[i] = a; b = a * 2; } ... = b; it should be able to sink b = a*2. So I think the more natural place to implement this is either SCEV cprop or tree-ssa-sink.c.
And process things from the loop-closed PHI use walking the DEFs (first process all, marking interesting things to also catch commonly used exprs for two PHI uses). Again you might simply want to open a bugreport for this unless you want to implement it yourself. Thanks, Richard. Tested on x86_64-suse-linux. 2012-10-08 Eric Botcazou ebotca...@adacore.com * gimple.h (gsi_insert_seq_on_edge_before): Declare. * gimple-iterator.c (gsi_insert_seq_on_edge_before): New function. * tree-ssa-loop-im.c (struct mem_ref_loc): Add LHS field. (mem_ref_in_stmt): Remove gcc_assert. (copy_load_and_single_use_chain): New function. (execute_lm): Likewise. (hoist_memory_references): Hoist the loads after the stores. (ref_always_accessed_p): Rename into... (ref_always_stored_p): ...this. Remove STORE_P and add ONCE_P. (can_lsm_ref_p): New
Re: [RFC] Implement load sinking in loops
On Mon, Oct 8, 2012 at 2:32 PM, Richard Guenther richard.guent...@gmail.com wrote: On Mon, Oct 8, 2012 at 12:38 PM, Eric Botcazou ebotca...@adacore.com wrote: Hi, we recently noticed that, even at -O3, the compiler doesn't figure out that the following loop is dumb: #define SIZE 64 int foo (int v[]) { int r; for (i = 0; i < SIZE; i++) r = v[i]; return r; } which was a bit of a surprise. On second thoughts, this isn't entirely unexpected, as it probably matters only for (slightly) pathological cases. The attached patch nevertheless implements a form of load sinking in loops so as to optimize these cases. It's combined with invariant motion to optimize: int foo (int v[], int a) { int r, i; for (i = 0; i < SIZE; i++) r = v[i] + a; return r; } and with store sinking to optimize: int foo (int v1[], int v2[]) { int r[SIZE]; int i, j; for (j = 0; j < SIZE; j++) for (i = 0; i < SIZE; i++) r[j] = v1[j] + v2[i]; return r[SIZE - 1]; } The optimization is enabled at -O2 in the patch for measurement purposes but, given how rarely it triggers (e.g. exactly 10 occurrences in a GCC bootstrap, compiler-only, all languages except Go), it's probably best suited to -O3. Or perhaps we don't care and it should simply be dropped... Thoughts? Incidentally we have scev-const-prop to deal with the similar case of scalar computations. But I realize this doesn't work for expressions that are dependent on a loop-variant load. @@ -103,6 +103,7 @@ typedef struct mem_ref_loc { tree *ref; /* The reference itself. */ gimple stmt; /* The statement in that it occurs. */ + tree lhs; /* The (ultimate) LHS for a load. */ } *mem_ref_loc_p; isn't that the lhs of stmt? +static gimple_seq +copy_load_and_single_use_chain (mem_ref_loc_p loc, tree *new_lhs) +{ + tree mem = *loc->ref; + tree lhs, tmp_var, ssa_name; + gimple_seq seq = NULL; + gimple stmt; + unsigned n = 0; + + /* First copy the load and create the new LHS for it.
*/ + lhs = gimple_assign_lhs (loc->stmt); + tmp_var = create_tmp_reg (TREE_TYPE (lhs), get_lsm_tmp_name (mem, n++)); use make_temp_ssa_name or simply copy_ssa_name (not sure you need fancy names here). + if (gimple_assign_rhs1 (use_stmt) == lhs) + { + op1 = ssa_name; + op2 = gimple_assign_rhs2 (use_stmt); + } + else + { + op1 = gimple_assign_rhs1 (use_stmt); + op2 = ssa_name; + } this may enlarge the lifetime of the other operand? And it looks like it would break with unary stmts (accessing out-of-bounds op2). Also for an is_gimple_min_invariant other operand, which may be for example a.b, you need to unshare_expr it. + lhs = gimple_assign_lhs (use_stmt); + tmp_var = create_tmp_reg (TREE_TYPE (lhs), get_lsm_tmp_name (mem, n++)); + stmt = gimple_build_assign_with_ops (rhs_code, tmp_var, op1, op2); + ssa_name = make_ssa_name (tmp_var, stmt); + gimple_assign_set_lhs (stmt, ssa_name); see above. This can now be simplified to lhs = gimple_assign_lhs (use_stmt); ssa_name = copy_ssa_name (lhs, NULL); stmt = gimple_build_assign_with_ops (rhs_code, ssa_name, op1, op2); Btw - isn't this all a bit backward (I mean the analysis in execute_lm)? What you want is to apply this transform to as many of the _DEF_s of the loop-closed PHI nodes as possible - only values used outside of the loop are interesting. That's (sort of) what SCEV const-prop does (well, it also uses SCEV to compute the overall effect of the iterations). So what you want to know is whether, when walking the DEF chain of the loop-closed PHI, you end up at definitions before the loop or at definitions that are not otherwise used inside the loop. Which means it is really expression sinking. Does tree-ssa-sink manage to sink anything out of a loop? Even scalar computation parts, I mean? For for (..) { a = x[i]; y[i] = a; b = a * 2; } ... = b; it should be able to sink b = a*2. So I think the more natural place to implement this is either SCEV cprop or tree-ssa-sink.c.
And process things from the loop-closed PHI use walking the DEFs (first process all, marking interesting things to also catch commonly used exprs for two PHI uses). Again you might simply want to open a bugreport for this unless you want to implement it yourself. We indeed sink 2*tem but not a[i] here, because tree-ssa-sink.c doesn't sink loads (IIRC) at all, but I've seen patches to fix that (IIRC). int a[256]; int foo (int x) { int i, k = 0; for (i = 0; i < x; ++i) { int tem = a[i]; k = 2*tem; } return k; } Richard. Thanks, Richard. Tested on x86_64-suse-linux. 2012-10-08 Eric Botcazou ebotca...@adacore.com * gimple.h (gsi_insert_seq_on_edge_before): Declare. * gimple-iterator.c
Re: gcc/lto/lto.c: Free lto_file struct after closing the file
On Mon, Oct 8, 2012 at 2:39 PM, Tobias Burnus bur...@net-b.de wrote: lto_obj_file_open allocates: lo = XCNEW (struct lto_simple_object); However, the data is never freed - neither explicitly nor in lto_obj_file_close. In the attached patch, I free the memory now after the call to lto_obj_file_close. Build and regtested on x86-64-gnu-linux. OK for the trunk? Ok. Thanks, Richard. Tobias
Re: patch to fix constant math
On Mon, Oct 8, 2012 at 5:01 PM, Nathan Froyd froy...@mozilla.com wrote: - Original Message - Btw, as for Richard's idea of conditionally placing the length field in rtx_def: that looks like overkill to me. These days we'd merely want to optimize for 64-bit hosts, thus unconditionally adding a 32-bit field to rtx_def looks ok to me (you can wrap that inside a union to allow both descriptive names and eventual different use - see what I've done to tree_base). IMHO, unconditionally adding that field isn't optimizing for 64-bit hosts, but gratuitously making one of the major compiler data structures bigger on 32-bit hosts. Not everybody can cross-compile from a 64-bit host. And even those people who can don't necessarily want to. Please try to consider what's best for all the people who use GCC, not just the cases you happen to be working with every day. The challenge would of course be to have the overhead only for a minority of all RTX codes. After all, those 32 bits are free to be used by every one. And I would not consider RTX a 'major compiler data structure' - of course that makes the whole issue somewhat moot ;) Richard. -Nathan
Re: reverse bitfield patch
On Fri, Oct 5, 2012 at 8:15 PM, DJ Delorie d...@redhat.com wrote: Why do you need to change varasm.c at all? The hunks seem to be completely separate of the attribute. Because static constructors have fields in the original order, not the reversed order. Otherwise code like this is miscompiled: Err - the struct also has fields in the original order - only the bit positions of the fields are different because of the layouting option. struct foo a = { 1, 2, 3 }; because the 1, 2, 3 are in the C layout order, but the underlying data needs to be stored in the reversed order. And you expect no other code looks at fields of a structure and its initializer? It's bad to keep this not in-sync. Thus I don't think it's viable to re-order fields just because bit allocation is reversed. which will severely pessimize bitfield accesses to structs with the bitfield-order attribute. The typical use-case for this feature is memory-mapped hardware, where pessimum access is preferred anyway. I doubt that, looking at constraints for strict volatile bitfields. so you are supporting this as #pragma. Which ends up tacking bit_order to each type. Rather than this, why not operate similar to the packed pragma, thus, adjust a global variable in stor-layout.c. Because when I first proposed this feature, I was told to do it this way. I don't see a value in attaching 'native' or 'msb'/'lsb' if it is equal to 'native'. You un-necessarily pessimize code generation (is different code even desired for a no-op bit_order attribute?). If the attribute corresponds to the native mode, it should be a no-op. The pessimizing only happens when the fields are actually in reverse order. No, because you check for the attribute presence. So no, I don't like this post-process layouting thing. It's a layouting mode so it should have effects at bitfield layout time. The actual reversal happens in stor-layout.c. Everything else is there to compensate for a possible non-linear layout.
Re: patch to fix constant math
On Fri, Oct 5, 2012 at 6:34 PM, Kenneth Zadeck zad...@naturalbridge.com wrote: Richard S., there are two comments that I deferred to you - the ones that have the word Richard in them. Richi, thanks, I will start doing this now. kenny On 10/05/2012 09:49 AM, Richard Guenther wrote: On Fri, Oct 5, 2012 at 2:39 PM, Richard Guenther richard.guent...@gmail.com wrote: Ok, I see where you are going. Let me look at the patch again. * The introduction and use of CONST_SCALAR_INT_P could be split out (obvious and good) * DEF_RTL_EXPR (CONST_WIDE_INT, "const_wide_int", "", RTX_CONST_OBJ) defining that new RTX is orthogonal to providing the wide_int abstraction for operating on CONST_INT and CONST_DOUBLE, right? @@ -144,6 +144,7 @@ static bool prefer_and_bit_test (enum machine_mode mode, int bitnum) { bool speed_p; + wide_int mask = wide_int::set_bit_in_zero (bitnum, mode); set_bit_in_zero ... why use this instead of wide_int::zero (mode).set_bit (bitnum), which would match the old double_int_zero.set_bit interface and be more composed of primitives? adding something like this was just based on usage. We do this operation all over the place, why not make it concise and efficient. The API is very bottom up. I looked at what the clients were doing all over the place and added those functions. wide-int has both and_not and or_not. double-int only has and_not, but there are a lot of places where you do or_nots, so we have or_not also. if (and_test == 0) { @@ -164,8 +165,7 @@ prefer_and_bit_test (enum machine_mode m } /* Fill in the integers. */ - XEXP (and_test, 1) = immed_double_int_const (double_int_zero.set_bit (bitnum), mode); + XEXP (and_test, 1) = immed_wide_int_const (mask); I suppose immed_wide_int_const may create a CONST_INT, a CONST_DOUBLE or a CONST_WIDE_INT?
yes, on converted targets it builds const_ints and const_wide_ints and on unconverted targets it builds const_ints and const_doubles. The reason i did not want to convert the targets is that the code that lives in the targets generally wants to use the values to create constants and this code is very very very target specific. +#if TARGET_SUPPORTS_WIDE_INT +/* Returns 1 if OP is an operand that is a CONST_INT or CONST_WIDE_INT. */ +int +const_scalar_int_operand (rtx op, enum machine_mode mode) +{ why is it necessary to condition anything on TARGET_SUPPORTS_WIDE_INT? It seems testing/compile coverage of this code will be bad ... Can't you simply map CONST_WIDE_INT to CONST_DOUBLE for targets The accessors for the two are completely different. const_doubles know that there are exactly two hwi's. For const_wide_ints, you only know that the len is greater than 1. Anything with len 1 would be a CONST_INT. In a modern C++ world, const_int would be a subtype of const_wide_int, but that is a huge patch. not supporting CONST_WIDE_INT (what does it mean to support CONST_WIDE_INT? Does a target that supports CONST_WIDE_INT no longer use CONST_DOUBLEs for integers?) yes, that is exactly what it means. + if (!CONST_WIDE_INT_P (op)) +return 0; hm, that doesn't match the comment (CONST_INT is supposed to return 1 as well). The comment doesn't really tell what the function does it seems, I need some context here to reply. + int prec = GET_MODE_PRECISION (mode); + int bitsize = GET_MODE_BITSIZE (mode); + + if (CONST_WIDE_INT_NUNITS (op) * HOST_BITS_PER_WIDE_INT > bitsize) + return 0; its mode argument is not documented. And this doesn't even try to see if the constant would fit the mode anyway (like if it's zero). Not sure what the function is for. I will upgrade the comments, they were inherited from the old version with the const_double changed to the const_wide_int + { + /* Multiword partial int.
*/ + HOST_WIDE_INT x + = CONST_WIDE_INT_ELT (op, CONST_WIDE_INT_NUNITS (op) - 1); + return (wide_int::sext (x, prec & (HOST_BITS_PER_WIDE_INT - 1)) + == x); err - so now we have wide_int with a mode but we pass in another mode and see if they have the same value? Same value as-in signed value? Which probably is why you need to rule out different size constants above? Because you don't know whether to sign or zero extend? no, it is worse than that. I made the decision, which I think is correct, that we were not going to carry the mode inside const_wide_int. The justification was that we do not do it for const_int. There is a comment in the constructor for const_wide_int that says it would be so easy to just put this in. But this is an old API that did not change; only the internals changed to support const_wide_int. +/* Returns 1 if OP is an operand that is a CONST_WIDE_INT. */ +int +const_wide_int_operand (rtx op, enum machine_mode mode) +{ similar comments, common code should be factored out
Re: patch to fix constant math
On Sun, Oct 7, 2012 at 3:11 PM, Kenneth Zadeck zad...@naturalbridge.com wrote: On 10/07/2012 08:47 AM, Richard Guenther wrote: len seems redundant unless you want to optimize encoding. len == (precision + HOST_BITS_PER_WIDE_INT - 1) / HOST_BITS_PER_WIDE_INT. that is exactly what we do. However, we are optimizing time, not space. the value is always stored in compressed form, i.e. the same representation as is used inside the CONST_WIDE_INTs and INT_CSTs. this makes the transformation between them very fast. And since we do this a lot, it needs to be fast. So the len is the number of HWIs needed to represent the value, which is typically less than what would be implied by the precision. But doesn't this require a sign? Otherwise how do you encode TImode 0xffffffffffffffff? Would len be 1 here? (and talking about CONST_WIDE_INT vs CONST_INT, wouldn't CONST_INT be used anyway for all ints we can encode small? Or is it (hopefully) the goal to replace CONST_INT with CONST_WIDE_INT everywhere?) Or do you say wide_int is always 'signed', thus TImode 0xffffffffffffffff needs len == 2 and an explicit zero upper HWI word? the compression of this has len == 2 with the explicit upper being 0. Or do you say wide_int is always 'unsigned', thus TImode -1 needs len == 2? Noting that double-ints were supposed to be twos-complement (thus, 'unsigned') numbers, having implicit non-zero bits sounds error-prone. That said - I don't see any value in optimizing storage for short-lived (as you say) wide-ints (apart from limiting it to the mode size). For example mutating operations on wide-ints are not really possible if you need to extend storage. the compression used is independent of whether it is signed or unsigned. but the compression looks signed. i.e. you do not represent the upper hwi if they would just be a sign extension of the top hwi that is represented. However, this does not imply anything about the math that was used to set those bits that way, and you can always go back to the full rep independent of the sign.
I do not have any long term plans to merge CONST_INT into CONST_WIDE_INT. It would be a huge change in the ports and would be fairly space inefficient. My guess is that in real code, less than 1% of real constants will have a length greater than 1 even on a 32 bit host. CONST_INT is very space efficient. This could have been mentioned as part of Richard's response to your comment about the way we represent the CONST_WIDE_INTs. In practice, there are almost none of them anyway. In fact, you could argue that the tree level did it wrong (not that i am suggesting to change this). But it makes me think what was going on when the decision to make TYPE_PRECISION be an INT_CST rather than just a HWI was made. For that there is an implication that it could never take more than a HWI since no place in the code even checks TREE_INT_CST_HIGH for these. Well - on the tree level we now always have two HWIs for all INTEGER_CSTs. If we can, based on the size of the underlying mode, reduce that to one HWI we already win something. If we add an explicit length to allow a smaller encoding for larger modes (tree_base conveniently has an available 'int' for this ...) then we'd win in more cases. Thus, is CONST_INT really necessarily better than optimized CONST_WIDE_INT storage? Richard. + enum Op { +NONE, we don't have sth like this in double-int.h, why was the double-int mechanism not viable? i have chosen to use enums for things rather than passing booleans. But it's bad to use an enum with 4 values for a thing that can only possibly expect two. You cannot optimize tests as easily. Please retain the bool uns parameters instead. I am breaking it into two enums.
Re: Fixup INTEGER_CST
On Sun, Oct 7, 2012 at 5:15 PM, Jan Hubicka hubi...@ucw.cz wrote: Hi, I added a sanity check that after fixup all types that lost in the merging are really dead. And it turns out we have some zombies around. INTEGER_CST needs special care because it is special cased by the streamer. We also do not want to do in-place modifications on it because that would corrupt the hashtable used by tree.c's sharing code Bootstrapped/regtested x86_64-linux, OK? No, I don't think we want to fixup INTEGER_CSTs this way. Instead we want to fixup them where they end up used unfixed. Richard. PR lto/54839 * lto/lto.c (remember_with_vars): Also fixup INTEGER_CST. (fixup_integer_cst): New function. (lto_ft_type): Fixup BASETYPE of methods and offsets. Index: lto/lto.c === --- lto/lto.c (revision 192164) +++ lto/lto.c (working copy) @@ -1408,11 +1408,36 @@ remember_with_vars (tree t) (tt) = GIMPLE_REGISTER_TYPE (tt); \ if (VAR_OR_FUNCTION_DECL_P (tt) && TREE_PUBLIC (tt)) \ remember_with_vars (t); \ + if (TREE_CODE (tt) == INTEGER_CST) \ + (tt) = fixup_integer_cst (tt); \ } \ } while (0) static void lto_fixup_types (tree); +/* Return integer_cst T with updated type. */ + +static tree +fixup_integer_cst (tree t) +{ + tree type = GIMPLE_REGISTER_TYPE (TREE_TYPE (t)); + + if (type == TREE_TYPE (t)) +return t; + + /* If overflow was set, streamer_read_integer_cst + produced local copy of T. */ + if (TREE_OVERFLOW (t)) +{ + TREE_TYPE (t) = type; + return t; +} + else + /* Otherwise produce new shared node for the new type. */ +return build_int_cst_wide (type, TREE_INT_CST_LOW (t), + TREE_INT_CST_HIGH (t)); +} + /* Fix up fields of a tree_typed T. */ static void @@ -1526,6 +1549,11 @@ lto_ft_type (tree t) LTO_FIXUP_TREE (t->type_non_common.binfo); LTO_FIXUP_TREE (TYPE_CONTEXT (t)); + + if (TREE_CODE (t) == METHOD_TYPE) +LTO_FIXUP_TREE (TYPE_METHOD_BASETYPE (t)); + if (TREE_CODE (t) == OFFSET_TYPE) +LTO_FIXUP_TREE (TYPE_OFFSET_BASETYPE (t)); } /* Fix up fields of a BINFO T. */
Re: inlined memcpy/memset degradation in gcc 4.6 or later
On Thu, Oct 4, 2012 at 7:44 PM, Joe Buck joe.b...@synopsys.com wrote: On Tue, Oct 2, 2012 at 4:19 PM, Walter Lee w...@tilera.com wrote: On TILE-Gx, I'm observing a degradation in inlined memcpy/memset in gcc 4.6 and later versus gcc 4.4. Though I find the problem on TILE-Gx, I think this is a problem for any architectures with SLOW_UNALIGNED_ACCESS set to 1. Consider the following program: struct foo { int x; }; void copy(struct foo* f0, struct foo* f1) { memcpy (f0, f1, sizeof(struct foo)); } In gcc 4.4, I get the desired inline memcpy: ... In gcc 4.7, however, I get inlined byte-by-byte copies: ... On Thu, Oct 04, 2012 at 01:58:54PM +0200, Richard Guenther wrote: There is no way to fix it. memcpy does not require aligned arguments and the mere presence of a typed pointer contains zero information of alignment for the middle-end. If you want to exercise C's notion of alignment requirements then do not use memcpy but *f0 = *f1; which works equally well. Perhaps I'm missing something. While memcpy is not permitted to assume alignment of its arguments, copy is. Otherwise, if I wrote void copy(struct foo* f0, struct foo* f1) { *f0 = *f1; } the compiler would also be forbidden from assuming any alignment. So, when compiling copy, do we lack the information that the memcpy call is to the standard ISO memcpy function? If we know that it is the standard function we should be able to do the copy assuming everything is properly aligned. If we see the above aggregate copy then we should be able to compile the function assuming that f0 and f1 are properly aligned for type struct foo. If we see C source code using memcpy (f0, f1, sizeof (struct foo)) then we cannot assume anything about the alignment of f0 or f1 based on the fact that the code uses the ISO memcpy function. Thus, I do not understand your question. When we compile copy how is memcpy relevant? Richard. Btw, the new behavior even fixed bugs. Could you point to a PR that was fixed by the change?
There must be some way to distinguish this case from those cases.
Re: Functions that are CSEable but not pure
On Thu, Oct 4, 2012 at 8:02 PM, Jason Merrill ja...@redhat.com wrote: On 10/04/2012 01:42 PM, Richard Guenther wrote: So I suppose the testcase that would be valid but break with using pure would be instead int main() { int x = init_count; int *p = get_me(); if (init_count == x) __builtin_abort(); int *q = get_me(); if (init_count == x) __builtin_abort(); } here when get_me is pure we CSE init_count over the _first_ call of get_me. That's OK for C++ thread_local semantics; the initialization is specified to happen some time before the first use, so the testcase is making an invalid assumption. This might not be desirable for user-written singleton functions, though. Ok, so then let me make another example to try breaking a valid program with the thread_local wrapper being pure: int main() { int &p = get_me(); *p; if (init_count == 1) __builtin_abort(); } note my using of C++ references (which in this case cannot bind to NULL and thus *p might not trap(?)). So the C++ source would be something like thread_local int init_count; int foo () { init_count = 1; return 0; } thread_local int i = foo (); int main() { i; if (init_count != 1) __builtin_abort (); } is that valid? When lowered to the above we would DCE the load of i and the call to the initialization wrapper (because it is pure). Obviously then init_count is no longer 1. Thus my question is - is a valid program allowed to access side-effects of the dynamic TLS initializer by not going through the TLS variable? I realize that's somewhat ill-formed if the TLS variable was a class with a static method doing the access to init_count - using that method would not keep the class instance live. Preventing DCE but not CSE for const/pure functions can be for example done by using the looping-const-or-pure flag (but that also tells the compiler that this function may not return, so it is very specific about the kind of possible side-effect to be preserved).
The very specific nature of thread_local TLS init semantics ('somewhen before') is hard to make use of, so if we want to change looping-const-or-pure to something like const-or-pure-with-side-effects we should constrain things more. Richard. Jason
Re: reverse bitfield patch
On Thu, Oct 4, 2012 at 8:38 PM, DJ Delorie d...@redhat.com wrote: ChangeLog missing, new functions need a toplevel comment documenting function, argument and return value as per coding conventions. Any review of the patch itself? I know the overhead is not there... Why do you need to change varasm.c at all? The hunks seem to be completely separate from the attribute. I don't like that you post-process the layout. How does reversal inter-operate with all the other features we have, like strict volatile bitfields? I for example see + /* If the bitfield-order attribute has been used on this +structure, the fields might not be in bit-order. In that +case, we need a separate representative for each +field. */ + else if (DECL_FIELD_OFFSET (field) < DECL_FIELD_OFFSET (repr) + || (DECL_FIELD_OFFSET (field) == DECL_FIELD_OFFSET (repr) + && DECL_FIELD_BIT_OFFSET (field) < DECL_FIELD_BIT_OFFSET (repr))) + { + finish_bitfield_representative (repr, prev); + repr = start_bitfield_representative (field); + } which will severely pessimize bitfield accesses to structs with the bitfield-order attribute. + + type_attr_list = tree_cons (get_identifier (bit_order), + build_tree_list (NULL_TREE, get_identifier (bit_order)), + type_attr_list); + + TYPE_ATTRIBUTES (node) = type_attr_list; +} + +void +rx_note_pragma_bitorder (char *mode) +{ + if (mode == NULL) so you are supporting this as #pragma. Which ends up tacking bit_order to each type. Rather than this, why not operate similarly to the packed pragma, thus, adjust a global variable in stor-layout.c. +static tree +handle_bitorder_attribute (tree *ARG_UNUSED (node), tree ARG_UNUSED (name), + tree ARG_UNUSED (args), + int ARG_UNUSED (flags), bool *no_add_attrs) +{ + tree bmode; + const char *bname; + + /* Allow no arguments to mean native.
*/ + if (args == NULL_TREE) +return NULL_TREE; + + bmode = TREE_VALUE (args); + + bname = IDENTIFIER_POINTER (bmode); + if (strcmp (bname, "msb") + && strcmp (bname, "lsb") + && strcmp (bname, "swapped") + && strcmp (bname, "native")) +{ + error ("%qE is not a valid bit_order - use lsb, msb, native, or swapped", bmode); + *no_add_attrs = true; +} + + return NULL_TREE; I don't see a value in attaching 'native' or 'msb'/'lsb' if it is equal to 'native'. You unnecessarily pessimize code generation (is different code even desired for a no-op bit_order attribute?). So no, I don't like this post-process layouting thing. It's a layouting mode so it should have effects at bitfield layout time. Richard.
Re: libbacktrace and darwin
On Fri, Oct 5, 2012 at 9:15 AM, Tristan Gingold ging...@adacore.com wrote: On Oct 4, 2012, at 11:23 PM, Ian Lance Taylor wrote: On Thu, Oct 4, 2012 at 1:32 PM, Jack Howarth howa...@bromo.med.uc.edu wrote: Is libbacktrace currently functional in gcc trunk and is it expected to function on darwin? While I could understand it not working on installed binaries of FSF gcc that were stripped, I would think it should work for make check in the build tree since all of the debug code should be present in the object files. Or doesn't libbacktrace know to look there for the dwarf code? Thanks in advance for any clarifications. libbacktrace is functional in GCC trunk. However, it does not yet support the Mach-O object file format. I hope to work on that at some point, but it would be great if somebody else tackled it. It's probably straightforward to implement based on code in libiberty/simple-object-mach-o.c. I doubt it will, as dwarf info aren't in the executable. I think it will be simpler to support .dsym (separate debug file) at first. So I take it that libbacktrace doesn't work with the separate DWARF debuginfo as shipped by all Linux distributions either? Richard. Tristan.
Re: Variable DECL_SIZE
On Fri, Oct 5, 2012 at 10:28 AM, Ilya Enkovich enkovich@gmail.com wrote: 2012/10/4 Richard Guenther richard.guent...@gmail.com: On Wed, Oct 3, 2012 at 7:05 PM, Ilya Enkovich enkovich@gmail.com wrote: Hi, I run into an SSA verification failure when trying to pass a field's DECL_SIZE as an operand for CALL_EXPR. The failure occurs if the field's size is not a constant. In such a case DECL_SIZE holds a VAR_DECL and I need to find its proper SSA_NAME. I thought it may be the default one but gimple_default_def returns NULL. What is the right way to obtain the correct ssa_name in such a case? There is no way. You have to know that you need its size at the point of gimplification. Later there is no way to recover it. Wow. It means I also cannot get the address of subsequent fields in the structure. It looks weird. Is there a way to somehow preserve this information during gimplification and use it in later passes? By using it ;) See how gimplification of COMPONENT_REF works for example. The offset becomes an explicit operand and its computation gets put into explicit statements. There is no 'reverse-mapping' to look up or even re-construct this information at a later time. Richard. Ilya Richard. Thanks, Ilya
Re: Variable DECL_SIZE
On Fri, Oct 5, 2012 at 2:36 PM, Ilya Enkovich enkovich@gmail.com wrote: 2012/10/5 Richard Guenther richard.guent...@gmail.com: On Fri, Oct 5, 2012 at 10:28 AM, Ilya Enkovich enkovich@gmail.com wrote: 2012/10/4 Richard Guenther richard.guent...@gmail.com: On Wed, Oct 3, 2012 at 7:05 PM, Ilya Enkovich enkovich@gmail.com wrote: Hi, I fall into ssa verification failure when try to pass field's DECL_SIZE as an operand for CALL_EXPR. The fail occurs if field's size is not a constant. In such case DECL_SIZE holds a VAR_DECL and I need to find it's proper SSA_NAME. I thought it may be the default one but gimple_default_def returns NULL. What is the right way to obtain correct ssa_name in such case? There is no way. You have to know that you need it's size at the point of gimplification. Later there is no way to recover it. Wow. It means I cannot also get an address of subsequent fields in the structure. It looks weird. Is there a way to somehow preserve this information during gimplification and use in later passes? By using it ;) See how gimplification of COMPONENT_REF works for example. The offset becomes an explicit operand and its computation get put into explicit statements. There is no 'reverse-mapping' to lookup or even re-construct this information at later time. I think gimplification pass is too early for me to decide if I want to use all these offsets and sizes. Actually when I a look into GIMPLE dumps I see that all these values are computed and corresponding SSA_NAMEs exist in the code. 
I use the following source example: int foo(int len) { struct { int a; int buf[len]; int b; } s; return foo1(s.buf); } In GIMPLE I see a lot of values computed for all sizes, unit sizes and offsets: saved_stack.2_1 = __builtin_stack_save (); len.0_3 = len_2(D); D.2241_4 = (bitsizetype) len.0_3; D.2242_5 = D.2241_4 * 32; D.2243_6 = (sizetype) len.0_3; D.2244_7 = D.2243_6 * 4; D.2245_8 = (long int) len.0_3; D.2246_9 = D.2245_8 + -1; D.2247_10 = (sizetype) D.2246_9; D.2241_11 = (bitsizetype) len.0_3; D.2242_12 = D.2241_11 * 32; D.2243_13 = (sizetype) len.0_3; D.2244_14 = D.2243_13 * 4; D.2243_15 = (sizetype) len.0_3; D.2248_16 = D.2243_15 + 1; D.2249_17 = D.2248_16 * 4; D.2243_18 = (sizetype) len.0_3; D.2250_19 = D.2243_18 + 2; D.2251_20 = D.2250_19 * 32; D.2252_21 = D.2251_20; ... I suppose there is always a single SSA_NAME for such vars and therefore I may associate them, e.g. via the default definition. The only problem then is to not kill them until the very end. Currently all those unused SSA_NAMEs are killed by the early optimization passes. But even at O0 we now have all values computed and available for use until expand, yet cannot access them. But that's exactly the issue. Nothing keeps them live if they are not used. And there is no back-mapping between DECL_SIZE and the SSA name holding its value at a certain point in the program. Richard. Ilya Richard. Ilya Richard. Thanks, Ilya
Re: RFC: C++ PATCH to support dynamic initialization and destruction of C++11 and OpenMP TLS variables
On Thu, Oct 4, 2012 at 7:38 PM, Jason Merrill ja...@redhat.com wrote: Both C++11 and OpenMP specify that thread_local/threadprivate variables can have dynamic initialization and destruction semantics. This sequence of patches implements that. The first patch adds the C++11 thread_local keyword and implements the C++11 parsing rules for thread_local, which differ somewhat from __thread. It also allows dynamic initialization of function-local thread_local variables. The second patch adds a __cxa_thread_atexit interface for registering cleanups to be run when a thread exits. The current implementation is not fully conformant; on exit, TLS destructors are supposed to run before any destructors for non-TLS variables, but I think that will require glibc support for __cxa_thread_atexit. The third patch implements dynamic initialization of non-function-local TLS variables using the init-on-first-use idiom: uses of such a variable are replaced with calls to a wrapper function, so that int f(); thread_local int i = f(); implies void i_init() { static bool initialized; if (!initialized) { initialized = true; i = f(); } } inline int i_wrapper() { i_init(); return i; } Note that if we see extern thread_local int i; in a header, we don't know whether it has a dynamic initializer in its defining translation unit, so we need to make conservative assumptions. For a type that doesn't always get dynamic initialization, we make i_init a weakref and only call it if it exists. In the same translation unit as the definition, we optimize appropriately. The wrapper function is the function I'm talking about in http://gcc.gnu.org/ml/gcc/2012-10/msg00024.html Any comments before I check this in? I wonder if an implementation is conforming that performs non-local TLS variable inits at thread creation time instead (probably also would require glibc support)? 
Or if we have the extra indirection via a reference anyway, we could have a pointer TLS variable (NULL initialized) that traps on the first access, where in the trap handler we could then perform initialization and setup of that pointer. Should we document our choice of implementation somewhere in the manual? And thus implicitly provide the hint that using non-local TLS variables with dynamic initialization comes with an abstraction penalty that we might not easily be able to remove? Thanks, Richard. Jason
Re: Use conditional casting with symtab_node
On Thu, Oct 4, 2012 at 8:16 PM, Diego Novillo dnovi...@google.com wrote: On Thu, Oct 4, 2012 at 2:14 PM, Lawrence Crowl cr...@googlers.com wrote: So, Jan Hubicka requested and approved the current spelling. What now? I don't think we should hold this up. The names Jan requested seem reasonable enough. We seem to be running in circles here. I suppose I have your promise that we'll release with consistent names. Please allocate some work hours on your side for the renaming of cgraph_node and varpool_node for the case Honza doesn't get to it in time. I see all these patches with mixed feelings - it puts the brakes on all developers because they need to learn the new interface, which does not bring any immediate benefit. So I think _your_ development time would be better spent by fixing open bugs or by tackling some of the existing scalability issues in GCC (rather than quoting funny '0.001% faster with 99% confidence' stuff). Thanks, Richard. Diego.
Re: patch to fix constant math
On Thu, Oct 4, 2012 at 9:27 PM, Richard Sandiford rdsandif...@googlemail.com wrote: Kenneth Zadeck zad...@naturalbridge.com writes: On 10/04/2012 12:58 PM, Richard Guenther wrote: On Thu, Oct 4, 2012 at 3:55 PM, Kenneth Zadeck zad...@naturalbridge.com wrote: Let me talk about the mode here first. What this interface/patch provides is a facility where the constant math that is done in optimizations is done exactly the way that it would be done on the target machine. What we have now is a compiler that only does this if it is convenient to do on the host. I admit that I care about this more than others right now, but if Intel adds a couple more instructions to their vector units, other people will start to really care about this issue. If you take an OImode value with the current compiler and left shift it by 250 the middle end will say that the result is 0. This is just wrong!!! What this means is that the bitsize and precision of the operations need to be carried along when doing math. When wide-int checks for overflow on the multiply or add, it is not checking whether the value overflowed on two HWIs; it is checking whether the add overflowed in the mode of the types that are represented on the target. When we do a shift, we are not doing a shift within two HWIs, we are truncating the shift value (if this is appropriate) according to the bitsize and shifting according to the precision. I think that an argument could be made that storing the mode should be changed to an explicit precision and bitsize. (A possible other option would be to store a tree type, but this would make the usage at the rtl level very cumbersome since types are rare.) Aside from the work, you would not get much push back. But the signedness is a different argument. At the rtl level, the signedness is a matter of context. (You could argue that this is a mistake and I would agree, but that is an even bigger change.)
But more to the point, at the tree level, there are a surprising number of places where the operation desired does not follow the sign of the types that were used to construct the constants. Furthermore, not carrying the sign is more consistent with the double int code, which as you point out carries nothing. Well, on RTL the signedness is on the operation (you have sdiv and udiv, etc.). Yes, there is a complete enough set of operations that allows you to specify the signedness where this matters. double-int tries to present a sign-less twos-complement entity of size 2 * HOST_BITS_PER_WIDE_INT. I think that is sensible and for obvious reasons should not change. Both tree and RTL rely on this. I disagree, at least from an RTL perspective. HOST_BITS_PER_WIDE_INT is a host property, and as Kenny says, it's not really sensible to tie our main target-specific IR to something host-specific. We've only done that for historical reasons. Oh, I agree - that HOST_WIDE_INT provides the limit for the largest integer constants we can encode on a target is a bug! On a target that's been converted to wide_int, I don't think a pair of HWIs (whether or not it's expressed as a double_int) has any significance at all at the RTL level. As far as the wide_ints recording a mode or precision goes: we're in the lucky position of having tried both options. Trees record the type (and thus precision) of all compile-time integer constants. RTL doesn't. And the RTL way is a right pain. It leads to interfaces in which rtx arguments often have to be accompanied by a mode. And it leads to clunky differences like those between simplify_unary_operation (which takes two mode arguments) and simplify_binary_operation (which with our current set of binary operations requires only one). But the issue here is not that double_int doesn't record a mode or precision (or a sign). The issue is that CONST_DOUBLE and CONST_INT don't! The _tree_ INTEGER_CST consists of a type and a double-int.
I see double-int (and the proposed wide-int) as a common building-block used for kind of arbitrary precision (as far as the target needs) integer constants on both tree and RTL. And having a common worker implementation requires it to operate on something different than tree types or RTL mode plus signedness. To put it another way: every wide_int operation needs to know the precision of its arguments and the precision of its result. That's not true. Every tree or RTL operation does, not every wide_int operation. double_int's are just twos-complement numbers of precision 2 * HOST_BITS_PER_WIDE_INT. wide_int's should be just twos-complement numbers of precision len * WHATEVER_HOST_COMPONENT_TYPE_IS_SUITABLE_FOR_A_FAST_IMPLEMENTATION. Operations on double_int and wide_int are bare-metal, in nearly all of the times you should use routines that do proper sign-/zero-extending to a possibly smaller precision. When using bare-metal you are supposed to do that yourself. Which is why I was suggesting
Re: patch to fix constant math
On Fri, Oct 5, 2012 at 11:26 AM, Richard Guenther richard.guent...@gmail.com wrote: Look at RTL users of the double-int routines and provide wrappers that take RTXen as inputs. Enforce that all CONSTs have a mode. Which would, btw, allow merging CONST_INT, CONST_DOUBLE and CONST_WIDE by making the storage size variable and its length specified by the mode of the RTX (I never liked the distinction of CONST_INT and CONST_DOUBLE, and you are right, the CONST_DOUBLE paths are seldom exercised). Richard.
Re: patch to fix constant math
On Fri, Oct 5, 2012 at 11:55 AM, Richard Sandiford rdsandif...@googlemail.com wrote: Richard Guenther richard.guent...@gmail.com writes: As far as the wide_ints recording a mode or precision goes: we're in the lucky position of having tried both options. Trees record the type (and thus precision) of all compile-time integer constants. RTL doesn't. And the RTL way is a right pain. It leads to interfaces in which rtx arguments often have to be accompanied by a mode. And it leads to clunky differences like those between simplify_unary_operation (which takes two mode arguments) and simplify_binary_operation (which with our current set of binary operations requires only one). But the issue here is not that double_int doesn't record a mode or precision (or a sign). The issue is that CONST_DOUBLE and CONST_INT don't! The _tree_ INTEGER_CST contains of a type and a double-int. I see double-int (and the proposed wide-int) as a common building-block used for kind of arbitrary precision (as far as the target needs) integer constants on both tree and RTL. And having a common worker implementation requires it to operate on something different than tree types or RTL mode plus signedness. To put it another way: every wide_int operation needs to know the precision of its arguments and the precision of its result. That's not true. Every tree or RTL operation does, not every wide_int operation. double_int's are just twos-complement numbers of precision 2 * HOST_BITS_PER_WIDE_INT. wide_int's should be just twos-complement numbers of precision len * WHATEVER_HOST_COMPONENT_TYPE_IS_SUITABLE_FOR_A_FAST_IMPLEMENTATION. Operations on double_int and wide_int are bare-metal, in nearly all of the times you should use routines that do proper sign-/zero-extending to a possibly smaller precision. When using bare-metal you are supposed to do that yourself. Which is why I was suggesting to add double_sint, double_uint, double_sintp (with precision), etc., wrappers and operations. 
Not storing the precision in the wide_int is putting the onus on the caller to keep track of the precision separately. But that's a matter of providing something high-level on top of the bare-metal double-int/wide-int (something shared between RTL and trees). Duplicating information in the bare-metal doesn't make sense (and no, INTEGER_CSTs on the tree level are _not_ short-lived, and how can a double-int make sense on the tree level when you say it doesn't make sense on the RTL level?) I think we're talking at cross purposes here. To the extent that I'm not really sure where to begin. :-) Just in case this is it: the idea is that wide_int is the type used to _process_ integers. It is not supposed to be the type used to store integers in the IRs. The way trees store integers and the way that RTL stores integers are up to them. For RTL the idea is that we continue to store integers that happen to fit into a HWI as a CONST_INT (regardless of the size of the CONST_INT's context-determined mode). Wider integers are stored as a CONST_DOUBLE (for unconverted targets) or a CONST_WIDE_INT (for converted targets). None of the three use the wide_int type; they use more compact representations instead. And as Kenny says, using CONST_INT continues to be an important way of reducing the IR footprint. Whenever we want to do something interesting with the integers, we construct a wide_int for them and use the same processing code for all three rtx types. This avoids having the separate single-HWI and double-HWI paths that we have now. It also copes naturally with cases where we start out with single-HWI values but end up with wider ones. But the operations that we do on these wide_ints will all be to a mode or precision. Shifts of QImode integers are different from shifts of HImode integers, etc. If you knew all that already (you probably did) and I've completely missed the point, please say so. :-) I'm not sure what you mean by bare metal.
The issue is that unlike RTL where we construct double-ints from CONST_INT/CONST_DOUBLE right now, tree has the double-int _embedded_. That's why I say that the thing we embed in a CONST_WIDE_INT or a tree INTEGER_CST needs to be bare metal, and that's what I would call wide-int. I think you have two things mixed in this patch which might add to the confusion (heh). One is, making the thing you work on in RTL (which used to be double-ints before this patch) be wide-ints which carry additional information taken from the IL RTX piece at construction time. That's all nice and good. The other thing is adding a CONST_WIDE_INT to support wider integer constants than the weird host-dependent limitation we have right now. Mixing those things together probably made you be able to make Kenny work on it ... heh ;) So to me wide-ints provide the higher-level abstraction ontop of double-ints (which would remain the storage entity). Such higher-level
Re: [PATCH, gdc] - Merging gdc (GNU D Compiler) into gcc
On Fri, Oct 5, 2012 at 12:07 PM, Iain Buclaw ibuc...@ubuntu.com wrote: On 5 October 2012 01:06, Joseph S. Myers jos...@codesourcery.com wrote: On Thu, 4 Oct 2012, Iain Buclaw wrote: The only patches to gcc proper are documentation-related and adding the D frontend / libphobos to configure and make files. I would have thought that these would typically only be included with the actual front-end? Looking back at my previous review comments, I suggested that you might need to split up c-common.[ch] so that certain parts of attribute handling could be shared with D, because duplicate code copied from elsewhere in GCC was not an appropriate implementation approach. Have you then eliminated the duplicate code in some other way that does not involve splitting up those files so code can be shared? Ah, no; thanks for reminding me of this. The code duplicated from c-common.[ch] is the handlers for C __attributes__, however gdc doesn't use all of them because some just don't have a fitting place eg: gnu_inline, artificial. Would the best approach be to move all handle_* functions and any helper functions into a new source file that can be shared between frontends, and define two new frontend hooks, LANG_HOOK_ATTRIBUTE_TABLE and LANG_HOOK_FORMAT_ATTRIBUTE_TABLE ? Btw, the LTO frontend also has most of the stuff duplicated ... (see lto/lto-lang.c). Not sure why ... Richard. Regards -- Iain Buclaw *(p < e ? p++ : p) = (c & 0x0f) + '0';
Re: Use conditional casting with symtab_node
On Fri, Oct 5, 2012 at 12:50 PM, Nathan Froyd froy...@mozilla.com wrote: - Original Message - I see all these patches with mixed feeling - it puts breaks on all developers because they need to learn the new interface which does not bring any immediate benefit. So I think _your_ development time would be better spent by fixing open bugs or by tackling some of the existing scalability issues in GCC (rather than quoting funny '0.001% faster with 99% confidence' stuff). This tone is unnecessary. Sorry, that wasn't intended. I question these numbers because unless you bootstrap say 100 times the noise in bootstrap speed is way too high to make such claims. Of course critical information is missing: The new code bootstraps .616% faster with a 99% confidence of being faster. 99% confidence on what basis? What's your sample size? Why does the patch need this kind of marketing? I, for one, think that it's excellent that Lawrence is writing these cleanup patches and measuring what impact they have on performance. Bonus points that they are making the compiler faster. Speed of the compiler *is* a scalability issue, and it's one that GCC doesn't appear to have paid all that much attention to over the years. I just don't believe the 0.5% numbers. Richard. -Nathan
Re: patch to fix constant math
On Fri, Oct 5, 2012 at 1:24 PM, Richard Sandiford rdsandif...@googlemail.com wrote: Richard Guenther richard.guent...@gmail.com writes: On Fri, Oct 5, 2012 at 11:55 AM, Richard Sandiford rdsandif...@googlemail.com wrote: Richard Guenther richard.guent...@gmail.com writes: As far as the wide_ints recording a mode or precision goes: we're in the lucky position of having tried both options. Trees record the type (and thus precision) of all compile-time integer constants. RTL doesn't. And the RTL way is a right pain. It leads to interfaces in which rtx arguments often have to be accompanied by a mode. And it leads to clunky differences like those between simplify_unary_operation (which takes two mode arguments) and simplify_binary_operation (which with our current set of binary operations requires only one). But the issue here is not that double_int doesn't record a mode or precision (or a sign). The issue is that CONST_DOUBLE and CONST_INT don't! The _tree_ INTEGER_CST contains of a type and a double-int. I see double-int (and the proposed wide-int) as a common building-block used for kind of arbitrary precision (as far as the target needs) integer constants on both tree and RTL. And having a common worker implementation requires it to operate on something different than tree types or RTL mode plus signedness. To put it another way: every wide_int operation needs to know the precision of its arguments and the precision of its result. That's not true. Every tree or RTL operation does, not every wide_int operation. double_int's are just twos-complement numbers of precision 2 * HOST_BITS_PER_WIDE_INT. wide_int's should be just twos-complement numbers of precision len * WHATEVER_HOST_COMPONENT_TYPE_IS_SUITABLE_FOR_A_FAST_IMPLEMENTATION. Operations on double_int and wide_int are bare-metal, in nearly all of the times you should use routines that do proper sign-/zero-extending to a possibly smaller precision. When using bare-metal you are supposed to do that yourself. 
Which is why I was suggesting to add double_sint, double_uint, double_sintp (with precision), etc., wrappers and operations. Not storing the precision in the wide_int is putting the onus on the caller to keep track of the precision separately. But that's a matter of providing something high-level ontop of the bare-metal double-int/wide-int (something shared between RTL and trees). Duplicating information in the bare-metal doesn't make sense (and no, INTEGER_CSTs on the tree level are _not_ short-lived, and how can a double-int make sense on the tree level when you say it doesn't make sense on the RTL level?) I think we're talking at cross purposes here. To the extent that I'm not really sure where to begin. :-) Just in case this is it: the idea is that wide_int is the type used to _process_ integers. It is not suppoed to be the type used to store integers in the IRs. The way trees store integers and the way that RTL stores integers are up to them. For RTL the idea is that we continue to store integers that happen to fit into a HWI as a CONST_INT (regardless of the size of the CONST_INT's context-determined mode). Wider integers are stored as a CONST_DOUBLE (for unconverted targets) or a CONST_WIDE_INT (for converted targets). None of the three use the wide_int type; they use more compact representations instead. And as Kenny says, using CONST_INT continues to be an important way of reducing the IR footprint. Whenever we want to do something interesting with the integers, we construct a wide_int for them and use the same processing code for all three rtx types. This avoids having the separate single-HWI and double-HWI paths that we have now. It also copes naturally with cases where we start out with single-HWI values but end up with wider ones. But the operations that we do on these wide_ints will all be to a mode or precision. Shifts of QImode integers are different from shifts of HImode integers, etc. 
If you knew all that already (you probably did) and I've completely missed the point, please say so. :-) I'm not sure what you mean by bare metal. The issue is that unlike RTL where we construct double-ints from CONST_INT/CONST_DOUBLE right now, tree has the double-int _embedded_. That's why I say that the thing we embed in a CONST_WIDE_INT or a tree INTEGER_CST needs to be bare metal, and that's what I would call wide-int. OK, but that's deliberately not what Kenny's patch calls wide int. The whole idea is that the bare metal thing we embed in a CONST_WIDE_INT or tree isn't (and doesn't need to be) the same as the type that we use to operate on integers. That bare-metal thing doesn't even have to have a fixed width. (CONST_WIDE_INT doesn't have a fixed width, it's only as big as the integer needs it to be.) Ok, let's rephrase it this way then: the bare metal thing used for the storage should ideally be the same in the tree IL and the RTL
Re: [PATCH] Improve debug info for partial inlining (PR debug/54519, take 2)
On Wed, 3 Oct 2012, Jakub Jelinek wrote: On Tue, Sep 11, 2012 at 03:59:56PM +0200, Jakub Jelinek wrote: As discussed in the PR, right now we do a very bad job for debug info of partially inlined functions (both when they are kept only partially inlined, or when partial inlining is performed, but doesn't seem to be useful and foo.part.N is inlined back, either into the original function, or into a function into which foo has been inlined first). This patch improves that by doing something similar to what ipa-prop.c does, in particular for arguments that aren't actually passed to foo.part.N we add debug args and corresponding debug bind and debug source bind stmts to provide better debug info (if foo.part.N isn't inlined, then DW_OP_GNU_parameter_ref is going to be used together with corresponding call site arguments). Bootstrapped/regtested on x86_64-linux and i686-linux, some of the tests still fail with some option combinations, am going to file a DF VTA PR for that momentarily. Ok for trunk? Now that Alex fixed the DF VTA issue, most of the non-LTO FAILs are gone, the remaining one on x86_64 looks like a GDB bug, and there is one -Os failure on pr54519-3.c which could be either alias oracle or var-tracking bug/missing feature - basically there is a non-addressable parameter in stack slot, and vt_canon_true_dep - canon_true_dependence thinks an argument push insn might alias with it, because it doesn't have a MEM_EXPR and ao_ref_from_mem fails. Posting an updated patch, because last night I found that even when the patch should have fixed static inline void foo (int x, int y) { asm volatile (nop); } static inline void bar (int z) { foo (z, 0); foo (z, 1); } int main () { bar (0); bar (1); return 0; } it didn't, there was a confusion when DECL_ORIGIN should be used and when not. This version of the patch fixes that, on this testcase (adjusted added as pr54519-6.c) p x, p y and up; p z now work well. 
Bootstrapped/regtested on x86_64-linux and i686-linux, ok for trunk? 2012-10-03 Jakub Jelinek ja...@redhat.com PR debug/54519 * ipa-split.c (split_function): Add debug args and debug source and normal stmts for args_to_skip which are gimple regs. * tree-inline.c (copy_debug_stmt): When inlining, adjust source debug bind stmts to debug binds of corresponding DEBUG_EXPR_DECL. * gcc.dg/guality/pr54519-1.c: New test. * gcc.dg/guality/pr54519-2.c: New test. * gcc.dg/guality/pr54519-3.c: New test. * gcc.dg/guality/pr54519-4.c: New test. * gcc.dg/guality/pr54519-5.c: New test. * gcc.dg/guality/pr54519-6.c: New test. --- gcc/ipa-split.c.jj2012-09-25 14:26:52.612821323 +0200 +++ gcc/ipa-split.c 2012-10-02 17:41:31.030357922 +0200 @@ -1059,6 +1059,7 @@ split_function (struct split_point *spli gimple last_stmt = NULL; unsigned int i; tree arg, ddef; + VEC(tree, gc) **debug_args = NULL; if (dump_file) { @@ -1232,6 +1233,65 @@ split_function (struct split_point *spli gimple_set_block (call, DECL_INITIAL (current_function_decl)); VEC_free (tree, heap, args_to_pass); The following could use a comment on what you are doing ... 
+  if (args_to_skip)
+    for (parm = DECL_ARGUMENTS (current_function_decl), num = 0;
+         parm; parm = DECL_CHAIN (parm), num++)
+      if (bitmap_bit_p (args_to_skip, num)
+          && is_gimple_reg (parm))
+        {
+          tree ddecl;
+          gimple def_temp;
+
+          arg = get_or_create_ssa_default_def (cfun, parm);
+          if (!MAY_HAVE_DEBUG_STMTS)
+            continue;
+          if (debug_args == NULL)
+            debug_args = decl_debug_args_insert (node->symbol.decl);
+          ddecl = make_node (DEBUG_EXPR_DECL);
+          DECL_ARTIFICIAL (ddecl) = 1;
+          TREE_TYPE (ddecl) = TREE_TYPE (parm);
+          DECL_MODE (ddecl) = DECL_MODE (parm);
+          VEC_safe_push (tree, gc, *debug_args, DECL_ORIGIN (parm));
+          VEC_safe_push (tree, gc, *debug_args, ddecl);
+          def_temp = gimple_build_debug_bind (ddecl, unshare_expr (arg),
+                                              call);
+          gsi_insert_after (&gsi, def_temp, GSI_NEW_STMT);
+        }
+  if (debug_args != NULL)
+    {
+      unsigned int i;
+      tree var, vexpr;
+      gimple_stmt_iterator cgsi;
+      gimple def_temp;
+
+      push_cfun (DECL_STRUCT_FUNCTION (node->symbol.decl));

What do you need the push/pop_cfun for? I see ENTRY_BLOCK_PTR (easily fixable), but else?

+      var = BLOCK_VARS (DECL_INITIAL (node->symbol.decl));
+      i = VEC_length (tree, *debug_args);
+      cgsi = gsi_after_labels (single_succ (ENTRY_BLOCK_PTR));
+      do
+        {
+          i -= 2;
+          while (var != NULL_TREE
+                 && DECL_ABSTRACT_ORIGIN (var)
+                    != VEC_index (tree, *debug_args, i))
+            var = TREE_CHAIN (var);
+          if (var == NULL_TREE)
+            break;
Re: patch to fix constant math
On Fri, Oct 5, 2012 at 2:26 PM, Richard Sandiford rdsandif...@googlemail.com wrote: Richard Guenther richard.guent...@gmail.com writes: On Fri, Oct 5, 2012 at 1:24 PM, Richard Sandiford rdsandif...@googlemail.com wrote: Richard Guenther richard.guent...@gmail.com writes: The issue is that unlike RTL where we construct double-ints from CONST_INT/CONST_DOUBLE right now, tree has the double-int _embedded_. That's why I say that the thing we embed in a CONST_WIDE_INT or a tree INTEGER_CST needs to be bare metal, and that's what I would call wide-int. OK, but that's deliberately not what Kenny's patch calls wide int. The whole idea is that the bare-metal thing we embed in a CONST_WIDE_INT or tree isn't (and doesn't need to be) the same as the type that we use to operate on integers. That bare-metal thing doesn't even have to have a fixed width. (CONST_WIDE_INT doesn't have a fixed width; it's only as big as the integer needs it to be.) Ok, let's rephrase it this way then: the bare-metal thing used for the storage should ideally be the same in the tree IL and the RTL IL _and_ the higher-level abstract wide-int. Hmm, OK, that's a straight disagreement then. So to me wide-ints provide the higher-level abstraction on top of double-ints (which would remain the storage entity). Such a higher-level abstraction is very useful, also for double-ints and also on the tree level. There is no requirement to provide a bigger double-int (or wide-int) for this. Let's call this abstraction wide-int (as opposed to my suggested more piecemeal double_sint, double_uint). You can perfectly model it on top of the existing double_int storage. As for providing larger double-ints - there is not much code left (ok, quite an overstatement ;)) that relies on the implementation detail of double-int containing exactly two HOST_WIDE_INTs. The exact purpose of double-ints was to _hide_ this (previously we only had routines like mul_double_with_sign which take two HOST_WIDE_INT components). 
Most code dealing with the implementation detail is code interfacing with middle-end routines that take a HOST_WIDE_INT argument (thus the double_int_fits_hwi_p predicates) - even wide_int has to support this kind of interfacing. So, after introducing wide_int that just wraps double_int and changing all user code (hopefully all, I guess mostly all), we'd tackle the issue that the size of double_ints is host-dependent. A simple solution is to make its size dependent on a target macro (yes, macro ...), so on a 32bit HWI host targeting a 64bit 'HWI' target you'd simply have four HWIs in the 'double-int' storage (and arithmetic needs to be generalized to support this). I just don't see why this is a good thing. The constraints are different when storing integers and when operating on them. When operating on them, we want something that is easily constructed on the stack, so we can create temporary structures very cheaply, and can easily pass and return them. We happen to know at GCC's compile time how big the biggest integer will ever be, so it makes sense for wide_int to be that wide. I'm not arguing against this. I'm just saying that the internal representation will depend on the host - not the number of total bits, but the number of pieces. Sure, but Kenny already has a macro to specify how many bits we need (MAX_BITSIZE_MODE_ANY_INT). We can certainly wrap:

  HOST_WIDE_INT val[MAX_BITSIZE_MODE_ANY_INT / HOST_BITS_PER_WIDE_INT];

in a typedef if that's what you prefer. I'd prefer it to be initially double_int, and later fixed to double_int with a member like the above. Possibly renamed as well. But when storing integers we want something compact. If your machine supports 256-bit integers, but the code you're compiling makes heavy use of 128-bit integers, why would you want to waste 128 of 256 bits on every stored integer? Which is why even CONST_WIDE_INT doesn't have a fixed width. 
You seem to be arguing that the type used to store integers in the IR has to be same as the one that we use when performing compile-time arithmetic, but I think it's better to have an abstraction between the two. Well, you don't want to pay the cost dividing 256bit numbers all the time when most of your numbers are only 128bit. So we don't really want to perform compile-time arithmetic on the biggest possible precision either. wide_int doesn't perform them in the biggest possible precision. It performs them in the precision of the result. It just happens to have enough storage to cope with all possible precisions (because the actual precision of the result is usually only known at GCC's runtime). So if you think a pair of HWIs continues to be a useful way of storing integers at the tree level, then we can easily continue to use a pair of HWIs there. How do you store larger ints there then? You'd need another tree code for wider integers. I'm not saying that's a good idea
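The storage/working-type split argued over here can be sketched outside GCC. The names below (work_int, stored_int, MAX_HWIS, store, load) are invented for illustration and are not GCC's actual wide_int/CONST_WIDE_INT types; the point is only that a fixed-size stack value used for arithmetic can coexist with a trailing-array stored form that is only as big as each constant needs:

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for MAX_BITSIZE_MODE_ANY_INT / HOST_BITS_PER_WIDE_INT.  */
#define MAX_HWIS 4

/* Working type: fixed maximum size, cheap to create on the stack,
   easy to pass and return by value.  */
typedef struct { long val[MAX_HWIS]; int len; int precision; } work_int;

/* Stored form: trailing flexible array, allocated only as big as the
   value actually needs (the CONST_WIDE_INT idea).  */
typedef struct { int num_elem; long elem[]; } stored_int;

/* Compress a working value into its compact stored form.  */
static stored_int *store (const work_int *w)
{
  stored_int *s = malloc (sizeof *s + w->len * sizeof (long));
  s->num_elem = w->len;
  memcpy (s->elem, w->val, w->len * sizeof (long));
  return s;
}

/* Expand a stored constant back into a working value; the precision is
   supplied by context (the mode/type), not by the storage.  */
static work_int load (const stored_int *s, int precision)
{
  work_int w = { { 0 }, s->num_elem, precision };
  memcpy (w.val, s->elem, s->num_elem * sizeof (long));
  return w;
}
```

A 128-bit constant whose value fits one word costs one stored word, while arithmetic on it still runs in the fixed-size working type.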
Re: patch to fix constant math
On Fri, Oct 5, 2012 at 2:39 PM, Richard Guenther richard.guent...@gmail.com wrote: Ok, I see where you are going. Let me look at the patch again. * The introduction and use of CONST_SCALAR_INT_P could be split out (obvious and good) * DEF_RTL_EXPR(CONST_WIDE_INT, "const_wide_int", "", RTX_CONST_OBJ) defining that new RTX is orthogonal to providing the wide_int abstraction for operating on CONST_INT and CONST_DOUBLE, right?

@@ -144,6 +144,7 @@
 static bool
 prefer_and_bit_test (enum machine_mode mode, int bitnum)
 {
   bool speed_p;
+  wide_int mask = wide_int::set_bit_in_zero (bitnum, mode);

set_bit_in_zero ... why use this instead of wide_int::zero (mode).set_bit (bitnum) that would match the old double_int_zero.set_bit interface and be more composed of primitives?

   if (and_test == 0)
     {
@@ -164,8 +165,7 @@ prefer_and_bit_test (enum machine_mode m
     }

   /* Fill in the integers.  */
-  XEXP (and_test, 1)
-    = immed_double_int_const (double_int_zero.set_bit (bitnum), mode);
+  XEXP (and_test, 1) = immed_wide_int_const (mask);

I suppose immed_wide_int_const may create a CONST_INT, a CONST_DOUBLE or a CONST_WIDE_INT?

+#if TARGET_SUPPORTS_WIDE_INT
+/* Returns 1 if OP is an operand that is a CONST_INT or CONST_WIDE_INT.  */
+int
+const_scalar_int_operand (rtx op, enum machine_mode mode)
+{

why is it necessary to condition anything on TARGET_SUPPORTS_WIDE_INT? It seems testing/compile coverage of this code will be bad ... Can't you simply map CONST_WIDE_INT to CONST_DOUBLE for targets not supporting CONST_WIDE_INT (what does it mean to support CONST_WIDE_INT? Does a target that supports CONST_WIDE_INT no longer use CONST_DOUBLEs for integers?)

+  if (!CONST_WIDE_INT_P (op))
+    return 0;

hm, that doesn't match the comment (CONST_INT is supposed to return 1 as well). 
The comment doesn't really tell what the function does it seems,

+  int prec = GET_MODE_PRECISION (mode);
+  int bitsize = GET_MODE_BITSIZE (mode);
+
+  if (CONST_WIDE_INT_NUNITS (op) * HOST_BITS_PER_WIDE_INT > bitsize)
+    return 0;

its mode argument is not documented. And this doesn't even try to see if the constant would fit the mode anyway (like if it's zero). Not sure what the function is for.

+  {
+    /* Multiword partial int.  */
+    HOST_WIDE_INT x
+      = CONST_WIDE_INT_ELT (op, CONST_WIDE_INT_NUNITS (op) - 1);
+    return (wide_int::sext (x, prec & (HOST_BITS_PER_WIDE_INT - 1))
+            == x);

err - so now we have wide_int with a mode but we pass in another mode and see if they have the same value? Same value as-in signed value? Which probably is why you need to rule out different size constants above? Because you don't know whether to sign or zero extend?

+/* Returns 1 if OP is an operand that is a CONST_WIDE_INT.  */
+int
+const_wide_int_operand (rtx op, enum machine_mode mode)
+{

similar comments, common code should be factored out at least. Instead of conditioning a whole set of new functions on supports-wide-int please add cases to the existing functions to avoid diverging in pieces that both kinds of targets support.

@@ -2599,7 +2678,7 @@ constrain_operands (int strict)
 	  break;
 	case 'n':
-	  if (CONST_INT_P (op) || CONST_DOUBLE_AS_INT_P (op))
+	  if (CONST_SCALAR_INT_P (op))
 	    win = 1;

factoring out this change would really make the patch less noisy. skipping to rtl.h bits

+struct GTY((variable_size)) hwivec_def {
+  int num_elem;		/* number of elements */
+  HOST_WIDE_INT elem[1];
+};

num_elem seems redundant and computable from GET_MODE_SIZE. Or do you want to optimize encoding like for CONST_INT (and unlike CONST_DOUBLE)? I doubt the above packs nicely into rtx_def? A general issue of it though - we waste 32bits on 64bit hosts in rtx_def between the bits and the union. Perfect place for num_elem (if you really need it) and avoids another 4 byte hole. Similar issue exists in rtvec_def. 
back to where I was

@@ -645,6 +673,10 @@ iterative_hash_rtx (const_rtx x, hashval
       return iterative_hash_object (i, hash);
     case CONST_INT:
       return iterative_hash_object (INTVAL (x), hash);
+    case CONST_WIDE_INT:
+      for (i = 0; i < CONST_WIDE_INT_NUNITS (x); i++)
+        hash = iterative_hash_object (CONST_WIDE_INT_ELT (x, i), hash);
+      return hash;

I see CONST_DOUBLE's value is not hashed. Why hash wide-int's value? Seeing

+#define HWI_GET_NUM_ELEM(HWIVEC)	((HWIVEC)->num_elem)
+#define HWI_PUT_NUM_ELEM(HWIVEC, NUM)	((HWIVEC)->num_elem = (NUM))

skipping to

+#if TARGET_SUPPORTS_WIDE_INT
+  {
+    rtx value = const_wide_int_alloc (len);
+    unsigned int i;
+
+    /* It is so tempting to just put the mode in here.  Must control
+       myself ... */
+    PUT_MODE (value, VOIDmode);
+    HWI_PUT_NUM_ELEM (CONST_WIDE_INT_VEC (value), len);

why is NUM_ELEM not set in const_wide_int_alloc, and only there?

+extern rtx rtx_alloc_stat_v
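A side note on "only as big as the integer needs it to be": the usual trick, sketched here with an invented helper (this is not GCC's actual code), is to canonicalize away high words that are pure sign-extension of the word below, so that num_elem records the minimal length:

```c
#include <assert.h>

/* Minimal number of HOST_WIDE_INT-like words needed to represent a
   two's-complement value: drop each high word that is merely the
   sign-extension (all zeros or all ones) of the word below it.
   Hypothetical helper for illustration only.  */
static int canon_len (const long *val, int len)
{
  while (len > 1
         && val[len - 1] == (val[len - 2] < 0 ? -1L : 0L))
    len--;
  return len;
}
```

With this canonical form, a 256-bit-capable representation still stores small positive and small negative values in a single word.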
Re: patch to fix constant math
On Fri, Oct 5, 2012 at 3:18 PM, Richard Sandiford rdsandif...@googlemail.com wrote: Richard Sandiford rdsandif...@googlemail.com writes: How is CONST_WIDE_INT variable size? It's just the usual trailing variable-length array thing. Good. Do you get rid of CONST_DOUBLE (for integers) at the same time? Yeah. I initially thought it might be OK to keep them and have CONST_INT, integer CONST_DOUBLEs and CONST_WIDE_INT alongside each other. (The way the patch is structured means that the choice of whether to keep integer CONST_DOUBLEs can be changed very easily.) But Kenny convinced me it was a bad idea. Sorry to follow up on myself, but to clarify: I was talking about converted targets here. (As in, I originally thought even converted targets could continue to use integer CONST_DOUBLEs.) Unconverted targets continue to use CONST_DOUBLE. Why is it that not all targets are converted? What's the difficulty with that? I really do not like partially transitioning there. Richard. Richard
Re: [PATCH] Improve debug info for partial inlining (PR debug/54519, take 2)
On Fri, Oct 5, 2012 at 2:49 PM, Jakub Jelinek ja...@redhat.com wrote: On Fri, Oct 05, 2012 at 02:20:13PM +0200, Richard Guenther wrote: The following could use a comment on what you are doing ... Will add something.

+  if (args_to_skip)
+    for (parm = DECL_ARGUMENTS (current_function_decl), num = 0;
+         parm; parm = DECL_CHAIN (parm), num++)
+      if (bitmap_bit_p (args_to_skip, num)
+          && is_gimple_reg (parm))
+        {
+          tree ddecl;
+          gimple def_temp;
+
+          arg = get_or_create_ssa_default_def (cfun, parm);
+          if (!MAY_HAVE_DEBUG_STMTS)
+            continue;
+          if (debug_args == NULL)
+            debug_args = decl_debug_args_insert (node->symbol.decl);
+          ddecl = make_node (DEBUG_EXPR_DECL);
+          DECL_ARTIFICIAL (ddecl) = 1;
+          TREE_TYPE (ddecl) = TREE_TYPE (parm);
+          DECL_MODE (ddecl) = DECL_MODE (parm);
+          VEC_safe_push (tree, gc, *debug_args, DECL_ORIGIN (parm));
+          VEC_safe_push (tree, gc, *debug_args, ddecl);
+          def_temp = gimple_build_debug_bind (ddecl, unshare_expr (arg),
+                                              call);
+          gsi_insert_after (&gsi, def_temp, GSI_NEW_STMT);
+        }
+  if (debug_args != NULL)
+    {
+      unsigned int i;
+      tree var, vexpr;
+      gimple_stmt_iterator cgsi;
+      gimple def_temp;
+
+      push_cfun (DECL_STRUCT_FUNCTION (node->symbol.decl));

What do you need the push/pop_cfun for? I see ENTRY_BLOCK_PTR (easily fixable), but else?

I believe that gsi_insert_before in another function isn't going to work well. E.g. update_modified_stmt starts with

  if (!ssa_operands_active (cfun))
    return;

Hmm, indeed. Or is it ok to use gsi_insert_before_without_update and expect that when compilation resumes for the modified callee, it will update all modified stmts? There is not much that needs to be updated on the stmts: a source bind has no setters nor uses of any SSA_NAMEs or virtual defs/sets, and the normal bind has a DEBUG_EXPR_DECL on the RHS and so is not a use or def of anything as well. I don't think we want to rely on that ... so just keep the push/pop_cfun. Thanks, Richard. 
If push_cfun/pop_cfun is not needed, I guess the two loops could be merged; the reason for two of them was just that I didn't want to push_cfun/pop_cfun for each optimized-away parameter.

+      var = BLOCK_VARS (DECL_INITIAL (node->symbol.decl));
+      i = VEC_length (tree, *debug_args);
+      cgsi = gsi_after_labels (single_succ (ENTRY_BLOCK_PTR));
+      do
+        {
+          i -= 2;
+          while (var != NULL_TREE
+                 && DECL_ABSTRACT_ORIGIN (var)
+                    != VEC_index (tree, *debug_args, i))
+            var = TREE_CHAIN (var);
+          if (var == NULL_TREE)
+            break;
+          vexpr = make_node (DEBUG_EXPR_DECL);
+          parm = VEC_index (tree, *debug_args, i);
+          DECL_ARTIFICIAL (vexpr) = 1;
+          TREE_TYPE (vexpr) = TREE_TYPE (parm);
+          DECL_MODE (vexpr) = DECL_MODE (parm);
+          def_temp = gimple_build_debug_source_bind (vexpr, parm,
+                                                     NULL);
+          gsi_insert_before (&cgsi, def_temp, GSI_SAME_STMT);
+          def_temp = gimple_build_debug_bind (var, vexpr, NULL);
+          gsi_insert_before (&cgsi, def_temp, GSI_SAME_STMT);
+        }
+      while (i);
+      pop_cfun ();
+    }

+      t = VEC_index (tree, *debug_args, i + 1);
+      gimple_debug_source_bind_set_value (stmt, t);
+      stmt->gsbase.subcode = GIMPLE_DEBUG_BIND;

That looks fishy ... shouldn't it be at least the other way around?

  stmt->gsbase.subcode = GIMPLE_DEBUG_BIND;
  gimple_debug_bind_set_value (stmt, t);

It surely can. I was considering whether to create a new wrapper in gimple.h for the subcode change of a debug stmt, but if there are no further places needing this (don't see them right now), that might be overkill. Otherwise this looks ok. Thanks. Jakub
Re: patch to fix constant math
On Fri, Oct 5, 2012 at 4:14 PM, Richard Sandiford rdsandif...@googlemail.com wrote: Richard Guenther richard.guent...@gmail.com writes: On Fri, Oct 5, 2012 at 3:18 PM, Richard Sandiford rdsandif...@googlemail.com wrote: Richard Sandiford rdsandif...@googlemail.com writes: How is CONST_WIDE_INT variable size? It's just the usual trailing variable-length array thing. Good. Do you get rid of CONST_DOUBLE (for integers) at the same time? Yeah. I initially thought it might be OK to keep them and have CONST_INT, integer CONST_DOUBLEs and CONST_WIDE_INT alongside each other. (The way the patch is structured means that the choice of whether to keep integer CONST_DOUBLEs can be changed very easily.) But Kenny convinced me it was a bad idea. Sorry to follow up on myself, but to clarify: I was talking about converted targets here. (As in, I originally thought even converted targets could continue to use integer CONST_DOUBLEs.) Unconverted targets continue to use CONST_DOUBLE. Why is it that not all targets are converted? What's the difficulty with that? I really do not like partially transitioning there. The problem is that CONST_DOUBLE as it exists today has two meanings: a floating-point meaning and an integer meaning. Ports that handle CONST_DOUBLEs are aware of this and expect the two things to have the same rtx code. Whereas in a converted port, the two things have different rtx codes, and the integers have a different representation from the current low/high pair. So to take the first hit in config/alpha/alpha.c, namely alpha_rtx_costs:

    case CONST_DOUBLE:
      if (x == CONST0_RTX (mode))
        *total = 0;
      else if ((outer_code == PLUS && add_operand (x, VOIDmode))
               || (outer_code == AND && and_operand (x, VOIDmode)))
        *total = 0;
      else if (add_operand (x, VOIDmode) || and_operand (x, VOIDmode))
        *total = 2;
      else
        *total = COSTS_N_INSNS (2);
      return true;

What could the patch do to make this work without modification? 
The middle two cases are for integers, but the first and last presumably apply to both integers and floats. I didn't say it does not require changes, just that the transition should be finished. Some ports have little CONST_DOUBLE uses (which is what I am grepping for), and if the max-wide-int-size matches that of the current CONST_DOUBLE there is little chance for code generation differences (in fact there should be none, correct?). In the current patch no target is converted (maybe that's just going to be 5/n?), I'd like to see at least one commonly tested target converted and if we don't convert all targets another commonly tested target to stay unconverted (just check gcc-testresults for which pair of targets that would work). Definitely at the end of stage3 we should have zero unconverted targets (but I doubt converting them is a huge task - I have converted the VECTOR_CST representation as well and we've had the location + BLOCK merge and other changes affecting all targets). Richard. Richard
Re: Variable DECL_SIZE
On Wed, Oct 3, 2012 at 7:05 PM, Ilya Enkovich enkovich@gmail.com wrote: Hi, I run into an SSA verification failure when I try to pass a field's DECL_SIZE as an operand for a CALL_EXPR. The failure occurs if the field's size is not a constant. In such a case DECL_SIZE holds a VAR_DECL and I need to find its proper SSA_NAME. I thought it might be the default one, but gimple_default_def returns NULL. What is the right way to obtain the correct ssa_name in such a case? There is no way. You have to know that you need its size at the point of gimplification. Later there is no way to recover it. Richard. Thanks, Ilya
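The reason the size must be captured at gimplification is that for a variably modified type the size is a run-time value, not a compile-time constant. Plain C99 shows the same property (a sketch, independent of GCC internals):

```c
#include <assert.h>

/* sizeof applied to a C99 variable-length array is evaluated at run
   time, using the value n had when the array was declared - there is
   no constant the compiler could substitute later.  */
static unsigned long vla_size (int n)
{
  char buf[n];
  return sizeof buf;
}
```

Once the function has been lowered, the expression that computed that size is gone unless it was explicitly saved into a temporary at lowering time, which is the situation described for DECL_SIZE above.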
Re: Functions that are CSEable but not pure
On Wed, Oct 3, 2012 at 9:00 PM, Jason Merrill ja...@redhat.com wrote: In C++ there is a common idiom called initialize on first use. In its simplest form it looks like

  int lazy_i() { static int i = init (); return i; }

If the initialization is expensive or order-sensitive, this is a useful alternative to initialization on load (http://www.parashift.com/c++-faq/static-init-order-on-first-use.html). An interesting property of such functions is that they only have side-effects the first time they are called, so subsequent calls can be optimized away to just use the return value of the first call. If the result is not needed, are we allowed to remove a call to this function? Basically, is the function allowed to detect whether it was called before? If we can remove it, and the function thus may not exploit knowledge of whether it was called before, then we can also perform partial redundancy elimination in

  if () lazy_i (); lazy_i ();

so I really hope we can also DCE these calls - otherwise their property would be really weird, as CSE also involves a DCE piece:

  i = lazy_i (); j = lazy_i ();
  -> i = lazy_i (); lazy_i (); j = i;
  -> i = lazy_i (); j = i;

But if we are allowed to DCE a lazy_i call with unused result then it really is pure. Basically with this property we consider the static variable it writes to as local, as it cannot be inspected in any other way than by calling the function or inspecting its return value. So - what's wrong with using pure? That it doesn't feel correct? Then I suppose we should simply re-word its definition ;) Currently there is no way to express this in GCC so that the optimizers know that multiple calls are redundant. Marking the function as pure causes calls to be CSEd as desired, but also asserts that the first call has no side-effects, which is not the case. My implementation of dynamic initialization of TLS variables as mandated by the C++11 and OpenMP standards uses this idiom implicitly, so I'd really like to be able to optimize away multiple calls. 
Right now we have the following ECF flags (among others): const = no read, no write, no side-effects, cseable pure = no write, no side-effects, cseable novops = no read, no write note that I consider novops a red-herring. novops is basically 'const' volatile, but we don't have a volatile flag on calls (we have it, but that's for the store to the result). I'd rather get rid of novops ... looping_const_or_pure = wait, actually there are side-effects Seems like the difference between novops and looping_const_or_pure is whether calls are subject to CSE. Is that right? No, novops and looping_const_or_pure tell you whether they are subject to DCE, all const/pure functions are subject to CSE, novops functions are not (as said above, novops is really a volatile const). Rather than checking these flag bundles, it seems to me that we ought to break them up into no reads no writes no side-effects cseable side-effects no side-effects and cseable side-effects doesn't sound like a good distinction to me - side-effects are not 'cse'-able or they would not be side-effects. But I suppose it all boils down to my question above, where the PRE case is really the most interesting (can the formerly second call detect it used the initialized path or the non-initialized path?) So const would be noread|nowrite|noside pure would be nowrite|noside const|looping would be noread|nowrite|cseside pure|looping would be nowrite|cseside novops would be noread|nowrite and the behavior I want would be just cseside. Does this make sense? Anyone more familiar with this area of the code want to implement it for me? :} I'd say it's already implemented, just poorly documented. Use pure! Richard. Jason
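The proposed decomposition of the flag bundles can be made concrete. The names and values below are invented for illustration (they are not GCC's actual ECF_* flags); the encoding just captures the table at the end of the message, with DCE-ability and CSE-ability derived from the individual bits:

```c
#include <assert.h>

/* Hypothetical decomposition of the ECF_* bundles discussed in the
   thread; not GCC's real flag values.  */
enum {
  F_NOREAD  = 1 << 0,  /* does not read global memory */
  F_NOWRITE = 1 << 1,  /* does not write global memory */
  F_NOSIDE  = 1 << 2,  /* no side-effects at all */
  F_CSESIDE = 1 << 3   /* side-effects, but repeated calls are redundant */
};

#define ATTR_CONST         (F_NOREAD | F_NOWRITE | F_NOSIDE)
#define ATTR_PURE          (F_NOWRITE | F_NOSIDE)
#define ATTR_CONST_LOOPING (F_NOREAD | F_NOWRITE | F_CSESIDE)
#define ATTR_PURE_LOOPING  (F_NOWRITE | F_CSESIDE)
#define ATTR_NOVOPS        (F_NOREAD | F_NOWRITE)

/* A call with an unused result may be deleted only if it has no
   side-effects.  */
static int can_dce (int flags) { return (flags & F_NOSIDE) != 0; }

/* Two identical calls may be combined if either there are no
   side-effects or the side-effects only matter once.  */
static int can_cse (int flags) { return (flags & (F_NOSIDE | F_CSESIDE)) != 0; }
```

Under this encoding, Jason's initialize-on-first-use wrapper would carry just F_CSESIDE: CSEable but not DCEable, which is exactly the distinction the bundled flags cannot express.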
Re: reverse bitfield patch
On Wed, Oct 3, 2012 at 12:07 AM, DJ Delorie d...@redhat.com wrote: Here's my current patch for the bitfield reversal feature I've been working on for a while, with an RX-specific pragma to apply it globally. Could someone please review this? It would be nice to get it in before stage1 closes again... ChangeLog missing, new functions need a toplevel comment documenting function, argument and return value as per coding conventions. Richard. Index: gcc/doc/extend.texi === --- gcc/doc/extend.texi (revision 192009) +++ gcc/doc/extend.texi (working copy) @@ -5427,12 +5427,74 @@ Note that the type visibility is applied associated with the class (vtable, typeinfo node, etc.). In particular, if a class is thrown as an exception in one shared object and caught in another, the class must have default visibility. Otherwise the two shared objects will be unable to use the same typeinfo node and exception handling will break. +@item bit_order +Normally, GCC allocates bitfields from either the least significant or +most significant bit in the underlying type, such that bitfields +happen to be allocated from lowest address to highest address. +Specifically, big-endian targets allocate the MSB first, where +little-endian targets allocate the LSB first. The @code{bit_order} +attribute overrides this default, allowing you to force allocation to +be MSB-first, LSB-first, or the opposite of whatever gcc defaults to. The +@code{bit_order} attribute takes an optional argument: + +@table @code + +@item native +This is the default, and also the mode when no argument is given. GCC +allocates LSB-first on little endian targets, and MSB-first on big +endian targets. + +@item swapped +Bitfield allocation is the opposite of @code{native}. + +@item lsb +Bits are allocated LSB-first. + +@item msb +Bits are allocated MSB-first. 
+ +@end table + +A short example demonstrates bitfield allocation: + +@example +struct __attribute__((bit_order(msb))) @{ + char a:3; + char b:3; +@} foo = @{ 3, 5 @}; +@end example + +With LSB-first allocation, @code{foo.a} will be in the 3 least +significant bits (mask 0x07) and @code{foo.b} will be in the next 3 +bits (mask 0x38). With MSB-first allocation, @code{foo.a} will be in +the 3 most significant bits (mask 0xE0) and @code{foo.b} will be in +the next 3 bits (mask 0x1C). + +Note that it is entirely up to the programmer to define bitfields that +make sense when swapped. Consider: + +@example +struct __attribute__((bit_order(msb))) @{ + short a:7; + char b:6; +@} foo = @{ 3, 5 @}; +@end example + +On some targets, or if the struct is @code{packed}, GCC may only use +one byte of storage for A despite it being a @code{short} type. +Swapping the bit order of A would cause it to overlap B. Worse, the +bitfield for B may span bytes, so ``swapping'' would no longer be +defined as there is no ``char'' to swap within. To avoid such +problems, the programmer should either fully-define each underlying +type, or ensure that their target's ABI allocates enough space for +each underlying type regardless of how much of it is used. + @end table To specify multiple attributes, separate them by commas within the double parentheses: for example, @samp{__attribute__ ((aligned (16), packed))}. 
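The default ("native") allocation described above can be observed directly. This sketch assumes a little-endian target where GCC allocates LSB-first; it uses unsigned char fields to sidestep signedness of plain char bitfields, and it demonstrates only stock GCC behavior, not the proposed bit_order attribute:

```c
#include <assert.h>
#include <string.h>

/* Default (native) allocation: on a little-endian target GCC packs
   these LSB-first, so a occupies mask 0x07 and b mask 0x38.  */
struct bits { unsigned char a : 3; unsigned char b : 3; };

/* Inspect the underlying storage byte without aliasing tricks.  */
static unsigned char raw_byte (struct bits v)
{
  unsigned char byte;
  memcpy (&byte, &v, 1);
  return byte;
}
```

With MSB-first allocation (what bit_order(msb) would request), the same fields would instead land in masks 0xE0 and 0x1C, as the documentation text explains.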
Index: gcc/c-family/c-common.c
===
--- gcc/c-family/c-common.c	(revision 192009)
+++ gcc/c-family/c-common.c	(working copy)
@@ -310,12 +310,13 @@ struct visibility_flags visibility_optio
 static tree c_fully_fold_internal (tree expr, bool, bool *, bool *);
 static tree check_case_value (tree);
 static bool check_case_bounds (tree, tree, tree *, tree *);
 static tree handle_packed_attribute (tree *, tree, tree, int, bool *);
+static tree handle_bitorder_attribute (tree *, tree, tree, int, bool *);
 static tree handle_nocommon_attribute (tree *, tree, tree, int, bool *);
 static tree handle_common_attribute (tree *, tree, tree, int, bool *);
 static tree handle_noreturn_attribute (tree *, tree, tree, int, bool *);
 static tree handle_hot_attribute (tree *, tree, tree, int, bool *);
 static tree handle_cold_attribute (tree *, tree, tree, int, bool *);
 static tree handle_noinline_attribute (tree *, tree, tree, int, bool *);
@@ -601,12 +602,14 @@ const unsigned int num_c_common_reswords
 const struct attribute_spec c_common_attribute_table[] =
 {
   /* { name, min_len, max_len, decl_req, type_req, fn_type_req, handler,
        affects_type_identity } */
   { "packed",    0, 0, false, false, false, handle_packed_attribute , false},
+  { "bit_order", 0, 1, false, true, false,
+		 handle_bitorder_attribute , false},
   { "nocommon",  0, 0, true, false, false,
Re: inlined memcpy/memset degradation in gcc 4.6 or later
On Tue, Oct 2, 2012 at 4:19 PM, Walter Lee w...@tilera.com wrote: On TILE-Gx, I'm observing a degradation in inlined memcpy/memset in gcc 4.6 and later versus gcc 4.4. Though I find the problem on TILE-Gx, I think this is a problem for any architecture with SLOW_UNALIGNED_ACCESS set to 1. Consider the following program:

struct foo { int x; };
void copy(struct foo* f0, struct foo* f1)
{
  memcpy (f0, f1, sizeof(struct foo));
}

In gcc 4.4, I get the desired inline memcpy:

copy:
	ld4s	r1, r1
	st4	r0, r1
	jrp	lr

In gcc 4.7, however, I get inlined byte-by-byte copies:

copy:
	ld1u_add	r10, r1, 1
	st1_add	r0, r10, 1
	ld1u_add	r10, r1, 1
	st1_add	r0, r10, 1
	ld1u_add	r10, r1, 1
	st1_add	r0, r10, 1
	ld1u	r10, r1
	st1	r0, r10
	jrp	lr

The inlining of memcpy is done in expand_builtin_memcpy in builtins.c. Tracing through that, I see that the alignment of src_align and dest_align, which is computed by get_pointer_alignment, has degraded: in gcc 4.4 they are 32 bits, but in gcc 4.7 they are 8 bits. This causes the loads generated by the inlined memcpy to be per-byte instead of per-4-byte. Looking further, gcc 4.7 uses the align field in struct ptr_info_def to compute the alignment. This field appears to be initialized in get_ptr_info in tree-ssanames.c but it is always initialized to 1 byte and does not appear to change. gcc 4.4 computes its alignment information differently. I get the same byte copies with gcc 4.8 and gcc 4.6. I see a couple of related open PRs: 50417, 53535, but no suggested fixes for them yet. Can anyone advise on how this can be fixed? Should I file a new bug, or add this info to one of the existing PRs? There is no way to fix it. memcpy does not require aligned arguments and the mere presence of a typed pointer contains zero information of alignment for the middle-end. If you want to exercise C's notion of alignment requirements then do not use memcpy but

  *f0 = *f1;

which works equally well. 
ISTR Tom had various patches improving alignment info for pointers, but we cannot improve this degenerate case. Btw, the new behavior even fixed bugs. Richard. Thanks, Walter
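The suggested workaround can be shown side by side. Both copies below are correct, but only the structure assignment carries the type's alignment guarantee, so the compiler is entitled to emit a single aligned 4-byte load/store for it even on a strict-alignment target (a sketch; the TILE-Gx codegen shown above is the motivating case, not something this snippet can assert):

```c
#include <assert.h>
#include <string.h>

struct foo { int x; };

/* memcpy makes no alignment promise about its pointer arguments, so on
   a SLOW_UNALIGNED_ACCESS target the expander may fall back to
   byte-by-byte copies when alignment tracking loses information.  */
static void copy_memcpy (struct foo *f0, struct foo *f1)
{
  memcpy (f0, f1, sizeof (struct foo));
}

/* A direct assignment is typed: both operands are struct foo lvalues,
   so the middle-end may assume the alignment of struct foo.  */
static void copy_assign (struct foo *f0, struct foo *f1)
{
  *f0 = *f1;
}
```

Note the flip side: the memcpy version stays well-defined even for misaligned pointers manufactured by casts, while the assignment requires properly aligned struct foo objects.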
Re: Functions that are CSEable but not pure
On Thu, Oct 4, 2012 at 2:56 PM, Jason Merrill ja...@redhat.com wrote: On 10/04/2012 08:45 AM, Jakub Jelinek wrote: On Thu, Oct 04, 2012 at 10:42:59AM +0200, Richard Guenther wrote: On Wed, Oct 3, 2012 at 9:00 PM, Jason Merrill ja...@redhat.com wrote: If the result is not needed, are we allowed to remove a call to this function? No. Unless you know the same function has been already called. Yes, we are. If we optimize away a use of the variable, we can also remove the call to the function. We can also hoist the call so its side-effects occur before other side-effects, as the standard only requires that the initialization of the variable occur some time between thread creation and the first use of the variable. So - what's wrong with using pure? That it doesn't feel correct? No, it can modify memory. Right, that's the issue. The back end assumes that calls to pure functions won't clobber memory; calls to these functions can clobber arbitrary memory, but only on the first call. Ugh. Especially with the above (you can DCE those calls) this makes it severely mis-specified ... and any implementation error-prone (look what mess our loosely defined 'malloc' attribute opened ...). I thought of a testcase like

  int *p = get_me (); .. = *p;
  int *q = get_me (); .. = *q;

and get_me allocating/initializing and returning a singleton. But you tell me it's more complicated and get_me () needs to be a barrier for any load/store (as it may modify arbitrary memory, but only on the first call). I think that may modify arbitrary memory isn't going to fly and my answer would be, better don't try to optimize anything here, at least not in generic terms. How would you handle this in the alias oracle? How would you then make CSE recognize two functions return the same value and are CSEable? I think the plan was for these functions not to return any value. No, I'm talking about the wrapper function which returns a reference to the variable (as in my example). 
Can you come up with a short but complete testcase illustrating the issue better (preferably C, so I don't need to read-in magic points where construction happens)? Thanks, Richard. Jason
Re: Functions that are CSEable but not pure
On Thu, Oct 4, 2012 at 5:22 PM, Jason Merrill ja...@redhat.com wrote: On 10/04/2012 09:07 AM, Richard Guenther wrote: Ugh. Especially with the above (you can DCE those calls) this makes it severely mis-specified ... and any implementation error-prone (look what mess our loosely defined 'malloc' attribute opened ...). I thought of a testcase like

  int *p = get_me (); .. = *p;
  int *q = get_me (); .. = *q;

and get_me allocating/initializing and returning a singleton. Right. But you tell me it's more complicated and get_me () needs to be a barrier for any load/store (as it may modify arbitrary memory, but only on the first call). Yes, because the initialization is user-written code. I think that may modify arbitrary memory isn't going to fly and my answer would be, better don't try to optimize anything here, at least not in generic terms. How would you handle this in the alias oracle? How would you then make CSE recognize two functions return the same value and are CSEable? For aliasing purposes, the call is like a call to a normal function. For CSE purposes, we want to recognize identical calls and combine them. I don't know the GCC bits well enough to be any more specific. Can you come up with a short but complete testcase illustrating the issue better (preferably C, so I don't need to read-in magic points where construction happens)?

int init_count;
int data;

void init() {
  static int initialized;
  if (!initialized) {
    data = ++init_count;
    initialized = 1;
  }
}

inline int *get_me() __attribute ((pure));
inline int *get_me() { init (); return &data; }

int sink;

int main() {
  sink = init_count;
  int *p = get_me();
  if (init_count != 1) __builtin_abort();
  int *q = get_me();
  if (init_count != 1) __builtin_abort();
  return *p + *q;
}

On this testcase, gcc -O2 doesn't reload init_count after the call to get_me because it thinks that the call can't have modified init_count. 
I want the compiler to know that it is safe to discard the redundant assignment, but not make assumptions about memory. But isn't it a fact that it _cannot_ modify init_count? If the second call is CSEable then it cannot have side-effects that are observable at the call site. Is the following an example you would consider to fall under your CSEing? int init_count; int data; int initialized; void init() { if (!initialized) { data = ++init_count; initialized = 1; } } inline int *get_me() __attribute ((pure)); inline int *get_me() { init (); return &data; } int sink; int main() { sink = init_count; int *p = get_me(); if (init_count != 1) __builtin_abort(); initialized = 0; int *q = get_me(); if (init_count != 2) __builtin_abort(); return *p + *q; } ? If so, then why can we assume get_me returns the same pointer even here? Richard. On 10/04/2012 08:59 AM, Jakub Jelinek wrote: On Thu, Oct 04, 2012 at 08:56:03AM -0400, Jason Merrill wrote: Sure, but I thought you want to inline the wrapper function as soon as possible. Or do you want to keep it as some kind of builtin that gets expanded during expansion to the test and call? Ah, I see your point; if get_me is inlined we end up with two calls to init, so it would be good to mark init with the same hypothetical attribute. Jason
Re: Functions that are CSEable but not pure
On Thu, Oct 4, 2012 at 7:08 PM, Richard Guenther richard.guent...@gmail.com wrote: On Thu, Oct 4, 2012 at 5:22 PM, Jason Merrill ja...@redhat.com wrote: On 10/04/2012 09:07 AM, Richard Guenther wrote: Ugh. Especially with the above (you can DCE those calls) makes this severely mis-specified ... and any implementation error-prone (look what mess our loosely defined 'malloc' attribute opened ...). I thought of a testcase like int *p = get_me (); .. = *p; int *q = get_me (); .. = *q; and get_me allocating/initializing and returning a singleton. Right. But you tell me it's more complicated and get_me () needs to be a barrier for any load/store (as it may modify arbitrary memory, but only on the first call). Yes, because the initialization is user-written code. I think that may modify arbitrary memory isn't going to fly and my answer would be, better don't try to optimize anything here, at least not in generic terms. How would you handle this in the alias oracle? How would you then make CSE recognize two functions return the same value and are CSEable? For aliasing purposes, the call is like a call to a normal function. For CSE purposes, we want to recognize identical calls and combine them. I don't know the GCC bits well enough to be any more specific. Can you come up with a short but complete testcase illustrating the issue better (preferably C, so I don't need to read-in magic points where construction happens)? int init_count; int data; void init() { static int initialized; if (!initialized) { data = ++init_count; initialized = 1; } } inline int *get_me() __attribute ((pure)); inline int *get_me() { init (); return &data; } int sink; int main() { sink = init_count; int *p = get_me(); if (init_count != 1) __builtin_abort(); int *q = get_me(); if (init_count != 1) __builtin_abort(); return *p + *q; } On this testcase, gcc -O2 doesn't reload init_count after the call to get_me because it thinks that the call can't have modified init_count. 
I want the compiler to know that it is safe to discard the redundant assignment, but not make assumptions about memory. But isn't it a fact that it _cannot_ modify init_count? If the second call is CSEable then it cannot have side-effects that are observable at the call site. Is the following an example you would consider to fall under your CSEing? int init_count; int data; int initialized; void init() { if (!initialized) { data = ++init_count; initialized = 1; } } inline int *get_me() __attribute ((pure)); inline int *get_me() { init (); return &data; } int sink; int main() { sink = init_count; int *p = get_me(); if (init_count != 1) __builtin_abort(); initialized = 0; int *q = get_me(); if (init_count != 2) __builtin_abort(); return *p + *q; } ? If so, then why can we assume get_me returns the same pointer even here? Or similar: int main() { sink = init_count; int *p = get_me(); if (init_count != 1) __builtin_abort(); } is this required to not abort? p is unused and with 'pure' you'd DCE the call. That is, I am confused about the distinction you seem to make between the static variable 'initialized' and the global 'init_count'. You seem to imply that the attribute would mean that we can CSE side-effects to global memory but that those side-effects may be value-changing! In practice for any CSE implementation this hypothetical cse-side-effects attribute would only allow us to CSE if there are no intermediate side-effects between two such calls, and as soon as the second get_me call would be CSEd we'd also CSE the init_count value (which you didn't want to CSE?!) Still confused ;) Richard. Richard. On 10/04/2012 08:59 AM, Jakub Jelinek wrote: On Thu, Oct 04, 2012 at 08:56:03AM -0400, Jason Merrill wrote: Sure, but I thought you want to inline the wrapper function as soon as possible. Or do you want to keep it as some kind of builtin that gets expanded during expansion to the test and call? 
Ah, I see your point; if get_me is inlined we end up with two calls to init, so it would be good to mark init with the same hypothetical attribute. Jason
Re: Functions that are CSEable but not pure
On Thu, Oct 4, 2012 at 7:15 PM, Jakub Jelinek ja...@redhat.com wrote: On Thu, Oct 04, 2012 at 07:08:02PM +0200, Richard Guenther wrote: But isn't it a fact that it _cannot_ modify init_count? If the second call is CSEable then it cannot have side-effects that are observable at the call site. Is the following an example you would consider to fall under your CSEing? int init_count; int data; int initialized; void init() { if (!initialized) { data = ++init_count; initialized = 1; } } inline int *get_me() __attribute ((pure)); inline int *get_me() { init (); return &data; } int sink; int main() { sink = init_count; int *p = get_me(); if (init_count != 1) __builtin_abort(); initialized = 0; int *q = get_me(); if (init_count != 2) __builtin_abort(); return *p + *q; } The above isn't a singleton function, as you are clearing initialized, therefore it doesn't have the properties the thread_local var get_address wrapper has. The singleton function really is void singleton (void) { static __thread bool initialized; if (!initialized) { initialized = true; call_some_function_that_may_modify_memory (); } } and has side effects just first time in a thread. So I suppose the testcase that would be valid but break with using pure would be instead int init_count; int data; void init() { static int initialized; if (!initialized) { data = ++init_count; initialized = 1; } } inline int *get_me() __attribute ((pure)); inline int *get_me() { init (); return &data; } int main() { int x = init_count; int *p = get_me(); if (init_count == x) __builtin_abort(); int *q = get_me(); if (init_count == x) __builtin_abort(); } here when get_me is pure we CSE init_count over the _first_ call of get_me. Technically we only might CSE it over the second call of get_me. 
Thus there is flow-sensitive information coming out of nowhere when we try to skip the second get_me () call in looking up the init_count load after it (coming out of nowhere for the alias-oracle query which only includes 'init_count' and the stmt 'int *q = get_me ()'). Thus my answer is, for the alias-oracle the hypothetical 'singleton_init' attribute provides no useful information. Thus CSE and/or DCE of 'singleton_init' calls has to use special code (which I can think of being added to DCE, harder to existing SCCVN, maybe easier to DOM) which will usually not fit into the algorithm (so you might be able to teach FRE elimination how to eliminate the 2nd call to get_me () but you will need a 2nd FRE pass to then figure out that init_count can be CSEd as well - which makes this all very undesirable). Inlining will also make this harder (you lose the attribute and fall into the lucas situation ...). At some point this singleton-ness should be auto-detected (there is inlined code in SPEC 2k lucas that would benefit from constant propagation, and SPEC 2k6 libquantum that uses a related concept - change / restore of a global in a function to the net effect that the function should not be considered clobbering it). Richard. Jakub
Re: RFC: Using DECL_NO_LIMIT_STACK as a backend specific flag
On Wed, Oct 3, 2012 at 5:20 PM, nick clifton ni...@redhat.com wrote: Hi Ian, Can't you just keep a list of the decls for which you have issued the warning? Yes - that would work too. In fact I agree that this would be a cleaner solution in my particular case. I'll create a new patch... How many decls do you expect to be on that list? Not very many. Maybe two or three at most. But I am interested to know if there is a way for targets to add their own information to function decls (and decls in general). If not for this particular case, then for problems to come in the future. Not to the decl, no. But I suppose you are only looking at defined functions, so it should suffice to amend struct function. It's reasonable to allow the target to associate extra info with struct function and we already have a way to do that via the init_machine_status target hook. Richard. Cheers Nick
Re: RFA: add lock_length attribute to break branch-shortening cycles
On Wed, Oct 3, 2012 at 8:22 PM, Joern Rennecke joern.renne...@embecosm.com wrote: The ARCompact architecture has some pipelining features that result in the vanilla branch shortening not always converging. Moreover, there are some short, complex branch instructions with very short offsets; replacing them with a multi-insn sequence when the offset doesn't reach makes the code significantly longer. Thus, when starting branch shortening with pessimistic assumptions, the short branches are often not used because the pessimistic branch lengths cause the offsets to go out of range. This problem can be avoided by starting with a low estimate and working upwards. However, that makes the incidence of infinite branch shortening cycles higher, and also makes it impossible to just break out after some iteration count. To address these issues, I've made the generator programs recognize the optional lock_length attribute. To quote from the documentation added for this feature: If you define the `lock_length' attribute, branch shortening will work the other way round: it starts out assuming minimum instruction lengths and iterates from there. In addition, the value of the `lock_length' attribute does not decrease across iterations, and the value computed for the `length' attribute will be no smaller than that of the `lock_length' attribute. I miss a few things in this description: - what is the value of lock_length supposed to be? From the lock prefix it sounds like it is something unchanging, maybe even constant, thus a maximum? - the length attribute still needs to be specified when lock_length is? how do they relate? Is lock_length always smaller / bigger than length? - what happens if you have patterns with lock_length and patterns without? - what patterns does lock_length apply to? In general optimistically attacking this kind of problem should always be better - did you try simply switching this for all targets? 
It shouldn't be slower and the only thing you need to guarantee is that during iteration you never make insn lengths smaller again. Richard. bootstrapped and regression tested on i686-pc-linux-gnu 2012-10-03 Joern Rennecke joern.renne...@embecosm.com * final.c (get_attr_length_1): Use direct recursion rather than calling get_attr_length. (get_attr_lock_length): New function. (INSN_VARIABLE_LENGTH_P): Define. (shorten_branches): Take HAVE_ATTR_lock_length into account. Don't overwrite non-delay slot insn lengths with the lengths of delay slot insns with same uid. * genattrtab.c (lock_length_str): New variable. (make_length_attrs): New parameter base. (main): Initialize lock_length_str. Generate lock_lengths attributes. * genattr.c (gen_attr): Emit declarations for lock_length attribute related functions. * doc/md.texi (node Insn Lengths): Document lock_length attribute. Index: doc/md.texi === --- doc/md.texi (revision 192036) +++ doc/md.texi (working copy) @@ -8004,6 +8004,20 @@ (define_insn jump (const_int 6)))]) @end smallexample +@cindex lock_length +Usually, branch shortening is done assuming the worst case (i.e. longest) +lengths, and then iterating (if optimizing) to smaller lengths till +no further changes occur. This does not work so well for architectures +that have very small minimum offsets and considerable jumps in instruction +lengths. + +If you define the @code{lock_length} attribute, branch shortening will +work the other way round: it starts out assuming minimum instruction +lengths and iterates from there. In addition, the value of the +@code{lock_length} attribute does not decrease across iterations, and +the value computed for the @code{length} attribute will be no smaller +than that of the @code{lock_length} attribute. + @end ifset @ifset INTERNALS @node Constant Attributes Index: final.c === --- final.c (revision 192036) +++ final.c (working copy) @@ -312,6 +312,7 @@ dbr_sequence_length (void) `insn_current_length'. 
*/ static int *insn_lengths; +static char *uid_lock_length; VEC(int,heap) *insn_addresses_; @@ -447,6 +448,20 @@ get_attr_length (rtx insn) return get_attr_length_1 (insn, insn_default_length); } +#ifdef HAVE_ATTR_lock_length +int +get_attr_lock_length (rtx insn) +{ + if (uid_lock_length && insn_lengths_max_uid > INSN_UID (insn)) + return uid_lock_length[INSN_UID (insn)]; + return get_attr_length_1 (insn, insn_min_lock_length); +} +#define INSN_VARIABLE_LENGTH_P(INSN) \ + (insn_variable_length_p (INSN) || insn_variable_lock_length_p (INSN)) +#else +#define INSN_VARIABLE_LENGTH_P(INSN) (insn_variable_length_p (INSN)) +#endif + /* Obtain the current length of an insn.
Re: opts.c, gcc.c: Plug some memory leaks - and an out-of-bounds memory access
On Wed, Oct 3, 2012 at 11:01 PM, Tobias Burnus bur...@net-b.de wrote: Found using http://scan5.coverity.com/ Build on x86-64-gnu-linux with C/C++/Fortran. I will now do an all-language build/regtest. OK when it passes? (Note to the save_string call: I reduced it by 2: The +1 in the call makes it long (out of bounds) and the +1 in temp_filename_length is not needed (but also doesn't harm) as tmp is null terminated and save_string adds another '\0' after copying len bytes.) - prefix = concat (target_sysroot_suffix, prefix, NULL); - prefix = concat (sysroot_no_trailing_dir_separator, prefix, NULL); + { + char *tmp; + tmp = concat (target_sysroot_suffix, prefix, NULL); + prefix = concat (sysroot_no_trailing_dir_separator, tmp, NULL); + free (tmp); + } prefix = concat (sysroot_no_trailing_dir_separator, target_sysroot_suffix, prefix, NULL); should be equivalent and easier to read, no? + else + prefix = concat (sysroot_no_trailing_dir_separator, prefix, NULL); + btw, we're not caring too much about memleaks in the driver ... Otherwise the patch looks ok with the above change. Thanks, Richard. Tobias
Re: [Ping]FW: [PATCH] Cilk Plus merging to trunk (2 of n)
On Thu, Oct 4, 2012 at 2:22 AM, Iyer, Balaji V balaji.v.i...@intel.com wrote: Hi Joseph, Did you get a chance to look at this submission? I think I have fixed all the changes you have mentioned. Is it OK for trunk? Thanks, Balaji V. Iyer. -Original Message- From: gcc-patches-ow...@gcc.gnu.org [mailto:gcc-patches- ow...@gcc.gnu.org] On Behalf Of Iyer, Balaji V Sent: Wednesday, September 26, 2012 7:16 PM To: Joseph Myers Cc: gcc-patches@gcc.gnu.org; al...@redhat.com; r...@redhat.com; l...@redhat.com Subject: RE: [PATCH] Cilk Plus merging to trunk (2 of n) Hello Joseph, In my last patch, I forgot to add the change Richard Guenther wanted me to make. He wanted me to move the ARRAY_NOTATION_REF node from tree.def to c-family/c-common.def. Here is a new one that has this change. I am sorry for this. Here are ChangeLog entries: gcc/ChangeLog 2012-09-26 Balaji V. Iyer balaji.v.i...@intel.com * tree.h (array_notation_reduce_type): Added new enumerator. This should be moved to c-tree.h then, and ... * Makefile.in (OBJS): Added array-notation-common.o. * doc/passes.texi (Cilk Plus Transformation): Documented array notation and overall transformations for Cilk Plus. * doc/invoke.texi (C Dialect Options): Documented -fcilkplus flag. * doc/generic.texi (Storage References): Documented ARRAY_NOTATION_REF tree addition. * tree-pretty-pretty.c (dump_generic_node): Added ARRAY_NOTATION_REF case. ... this to c-pretty-print.c and * array-notation-common.c: New file. ... this to the c-common/ directory. Basically this should be a completely frontend-only patch. Richard. gcc/c-family/ChangeLog 2012-09-26 Balaji V. Iyer balaji.v.i...@intel.com * c-common.h (build_array_notation_expr): New function declaration. (ARRAY_NOTATION_ARRAY): Added new #define. (ARRAY_NOTATION_CHECK): Likewise. (ARRAY_NOTATION_START): Likewise. (ARRAY_NOTATION_LENGTH): Likewise. (ARRAY_NOTATION_STRIDE): Likewise. (ARRAY_NOTATION_TYPE): Likewise. * c-common.def: Added new tree ARRAY_NOTATION_REF. 
* c-common.c (c_define_builtins): Added a call to initialize array notation builtin functions. (c_common_init_ts): Set ARRAY_NOTATION_REF as typed. * c.opt (-fcilkplus): Define new command line switch. gcc/c/ChangeLog 2012-09-26 Balaji V. Iyer balaji.v.i...@intel.com * c-typeck.c (convert_arguments): Added a check if tree contains array notation expressions before throwing errors or doing anything. * Make-lang.in (C_AND_OBJC_OBJS): Added c-array-notation.o. * c-parser.c (c_parser_compound_statement): Check if array notation code is used in tree, if so, then transform them into appropriate C code. (c_parser_expr_no_commas): Check if array notation is used in LHS or RHS, if so, then build array notation expression instead of regular modify. (c_parser_postfix_expression_after_primary): Added a check for colon(s) after square braces, if so then handle it like an array notation. Also, break up array notations in unary op if found. (c_parser_array_notation): New function. * c-array-notation.c: New file. gcc/testsuite/ChangeLog 2012-09-26 Balaji V. Iyer balaji.v.i...@intel.com * gcc.dg/cilk-plus/array_notation/execute/execute.exp: New script. * gcc.dg/cilk-plus/array_notation/compile/compile.exp: Likewise. * gcc.dg/cilk-plus/array_notation/errors/errors.exp: Likewise. * gcc.dg/cilk-plus/array_notation/execute/sec_implicit_ex.c: New test. * gcc.dg/cilk-plus/array_notation/execute/if_test.c: Likewise. * gcc.dg/cilk-plus/array_notation/execute/gather_scatter.c: Likewise. * gcc.dg/cilk-plus/array_notation/execute/builtin_func_double2.c: Likewise. * gcc.dg/cilk-plus/array_notation/execute/builtin_func_double.c: Likewise. * gcc.dg/cilk-plus/array_notation/execute/builtin_fn_custom.c: Likewise. * gcc.dg/cilk-plus/array_notation/execute/builtin_fn_mutating.c: Likewise. * gcc.dg/cilk-plus/array_notation/execute/array_test_ND.c: Likewise. * gcc.dg/cilk-plus/array_notation/execute/array_test2.c: Likewise. * gcc.dg/cilk-plus/array_notation/execute/array_test1.c: Likewise. 
* gcc.dg/cilk-plus/array_notation/compile/sec_implicit_ex.c: Likewise. * gcc.dg/cilk-plus/array_notation/compile/gather_scatter.c: Likewise. * gcc.dg/cilk-plus/array_notation/compile/builtin_func_double2.c: Likewise. * gcc.dg/cilk-plus/array_notation/compile/array_test_ND.c: Likewise. * gcc.dg/cilk-plus/array_notation/compile/if_test.c: Likewise. * gcc.dg/cilk-plus/array_notation/compile/builtin_func_double.c: Likewise. * gcc.dg/cilk-plus/array_notation/compile/array_test1.c
Re: Fix -fdump-ada-spec
On Thu, Oct 4, 2012 at 10:26 AM, Arnaud Charlet char...@adacore.com wrote: After changes by Sharad (Add option for dumping to stderr (issue6190057)), -fdump-ada-spec is broken, and is now a no-op. Admittedly, this is because -fdump-ada-spec is handled differently from other -fdump-* switches, so this patch fixes support for -fdump-ada-spec by using an approach similar to -fdump-go-spec, and use regular switches via c/c.opt. I've removed the handling of TDF_RAW, which was a debugging option, and never really used, so can be simply deleted. Change is mostly trivial/mechanical. Tested on x86_64-pc-linux-gnu, OK for trunk? Much cleaner indeed. Ok, Thanks, Richard. gcc/ 2012-10-04 Arnaud Charlet char...@adacore.com * dumpfile.h, dumpfile.c: Remove TDI_ada. c-family/ 2012-10-04 Arnaud Charlet char...@adacore.com * c-ada-spec.c (print_ada_declaration): Remove handling of TDF_RAW. * c.opt (-fdump-ada-spec, -fdump-ada-spec-slim): Move switch definition out of dumpfile.h. c/ 2012-10-04 Arnaud Charlet char...@adacore.com * c-decl.c (c_write_global_declarations): Fix handling of -fdump-ada-spec*. cp/ 2012-10-04 Arnaud Charlet char...@adacore.com * decl2.c (cp_write_global_declarations): Fix handling of -fdump-ada-spec*. 
Arno -- Index: c-family/c.opt === --- c-family/c.opt (revision 192062) +++ c-family/c.opt (working copy) @@ -799,6 +799,14 @@ fdollars-in-identifiers C ObjC C++ ObjC++ Permit '$' as an identifier character +fdump-ada-spec +C ObjC C++ ObjC++ RejectNegative Var(flag_dump_ada_spec) +Write all declarations as Ada code transitively + +fdump-ada-spec-slim +C ObjC C++ ObjC++ RejectNegative Var(flag_dump_ada_spec_slim) +Write all declarations as Ada code for the given file only + felide-constructors C++ ObjC++ Var(flag_elide_constructors) Init(1) Index: c-family/c-ada-spec.c === --- c-family/c-ada-spec.c (revision 192062) +++ c-family/c-ada-spec.c (working copy) @@ -2535,7 +2535,6 @@ print_ada_declaration (pretty_printer *b int is_class = false; tree name = TYPE_NAME (TREE_TYPE (t)); tree decl_name = DECL_NAME (t); - bool dump_internal = get_dump_file_info (TDI_ada)->pflags & TDF_RAW; tree orig = NULL_TREE; if (cpp_check && cpp_check (t, IS_TEMPLATE)) @@ -2705,8 +2704,7 @@ print_ada_declaration (pretty_printer *b } else { - if (!dump_internal - && TREE_CODE (t) == VAR_DECL + if (TREE_CODE (t) == VAR_DECL && decl_name && *IDENTIFIER_POINTER (decl_name) == '_') return 0; @@ -2796,8 +2794,7 @@ print_ada_declaration (pretty_printer *b /* If this function has an entry in the dispatch table, we cannot omit it. */ - if (!dump_internal && !DECL_VINDEX (t) - && *IDENTIFIER_POINTER (decl_name) == '_' + if (!DECL_VINDEX (t) && *IDENTIFIER_POINTER (decl_name) == '_') { if (IDENTIFIER_POINTER (decl_name)[1] == '_') return 0; Index: c/c-decl.c === --- c/c-decl.c (revision 192062) +++ c/c-decl.c (working copy) @@ -10079,10 +10079,10 @@ c_write_global_declarations (void) gcc_assert (!current_scope); /* Handle -fdump-ada-spec[-slim]. 
*/ - if (dump_initialized_p (TDI_ada)) + if (flag_dump_ada_spec || flag_dump_ada_spec_slim) { /* Build a table of files to generate specs for */ - if (get_dump_file_info (TDI_ada)->pflags & TDF_SLIM) + if (flag_dump_ada_spec_slim) collect_source_ref (main_input_filename); else for_each_global_decl (collect_source_ref_cb); Index: cp/decl2.c === --- cp/decl2.c (revision 192062) +++ cp/decl2.c (working copy) @@ -3698,9 +3698,9 @@ cp_write_global_declarations (void) cgraph_process_same_body_aliases (); /* Handle -fdump-ada-spec[-slim] */ - if (dump_initialized_p (TDI_ada)) + if (flag_dump_ada_spec || flag_dump_ada_spec_slim) { - if (get_dump_file_info (TDI_ada)->pflags & TDF_SLIM) + if (flag_dump_ada_spec_slim) collect_source_ref (main_input_filename); else collect_source_refs (global_namespace); Index: dumpfile.c === --- dumpfile.c (revision 192062) +++ dumpfile.c (working copy) @@ -57,8 +57,7 @@ static struct dump_file_info dump_files[ 0, 0, 0, 5}, {".vcg", "tree-vcg", NULL, NULL, NULL, NULL, NULL, TDF_TREE, 0, 0, 0, 6}, - {".ads", "ada-spec", NULL, NULL, NULL, NULL, NULL, 0, 0, 0, 0, 7}, -#define FIRST_AUTO_NUMBERED_DUMP 8 +#define FIRST_AUTO_NUMBERED_DUMP 7 {NULL, "tree-all", NULL, NULL, NULL, NULL, NULL, TDF_TREE,
Re: [patch][lra] Improve initial program point density in lra-lives.c (was: Re: RFC: LRA for x86/x86-64 [7/9])
On Thu, Oct 4, 2012 at 11:43 AM, Steven Bosscher stevenb@gmail.com wrote: On Wed, Oct 3, 2012 at 5:35 PM, Steven Bosscher stevenb@gmail.com wrote: The worst result is this: Compressing live ranges: from 726174 to 64496 - 8%, pre_count 40476128, post_count 12483414 But that's still a lot better than before the patch for the same function: Compressing live ranges: from 1742569 to 73069 - 4%, pre_count 40842330, post_count 12479992 Walking basic blocks with FOR_EACH_BB_REVERSE gives: Only FOR_EACH_BB_REVERSE: Compressing live ranges: from 1742579 to 429746 - 24% pre_count 41106212, post_count 34376494 Compressing live ranges: from 1742569 to 63000 - 3% pre_count 40835340, post_count 11055747 FOR_EACH_BB_REVERSE + need_curr_point_incr: Compressing live ranges: from 726184 to 416529 - 57% pre_count 40743516, post_count 34376846 Compressing live ranges: from 726174 to 61840 - 8% pre_count 40472806, post_count 11055747 The combination of the two changes takes ~20s off the ~180s for LRA create live ranges. Isn't _REVERSE vs. non-_REVERSE still kind-of random order? Thus, doesn't the above show there exists an optimal order for processing which we could use? (I realize _REVERSE is a simple solution, but might there not exist a pathological case where _REVERSE is even worse than non-_REVERSE?) Richard. Ciao! Steven
Re: patch to fix
On Wed, Oct 3, 2012 at 7:15 PM, Kenneth Zadeck zad...@naturalbridge.com wrote: The enclosed patch is the third of at least four patches that fix the problems associated with supporting integers on the target that are wider than two HOST_WIDE_INTs. While GCC claims to support OI mode, and we have two public ports that make minor use of this mode, in practice, compilation that uses OImode commonly gives the wrong result or ICEs. We have a private port of GCC for an architecture that is further down the road to needing comprehensive OImode and we have discovered that this is unusable. We have decided to fix it in a general way so that it is most beneficial to the GCC community. It is our belief that we are just a little ahead of the X86 and the NEON and these patches will shortly be essential. The first two of these patches were primarily lexical and have already been committed. They transformed the uses of CONST_DOUBLE so that it is easy to tell what the intended usage is. The underlying structures in the next two patches are very general: once they are added to the compiler, the compiler will be able to support targets with any size of integer from hosts of any size integer. The patch enclosed deals with the portable RTL parts of the compiler. The next patch, which is currently under construction, deals with the tree level. However, this patch can be put on the trunk as is, and it will alleviate many, but not all, of the current limitations in the rtl parts of the compiler. Some of the patch is conditional, depending on a port defining the symbol 'TARGET_SUPPORTS_WIDE_INT' to be non zero. Defining this symbol to be non zero is declaring that the port has been converted to use the new form of integer constants. However, the patch is completely backwards compatible to allow ports that do not need this immediately to convert at their leisure. 
The conversion process is not difficult, but it does require some knowledge of the port, so we are not volunteering to do this for all ports. OVERVIEW OF THE PATCH: The patch defines a new datatype, a 'wide_int' (defined in wide-int.[ch]), and this datatype will be used to perform all of the integer constant math in the compiler. Externally, wide-int is very similar to double-int except that it does not have the limitation that math must be done on exactly two HOST_WIDE_INTs. Internally, a wide_int is a structure that contains a fixed sized array of HOST_WIDE_INTs, a length field and a mode. The size of the That it has a mode sounds odd to me and makes it subtly different from HOST_WIDE_INT and double-int. Maybe the patch will tell why this is so. array is determined at generation time by dividing the number of bits of the largest integer supported on the target by the number of bits in a HOST_WIDE_INT of the host. Thus, with this format, any length of integer can be supported on any host. A new rtx type is created, the CONST_WIDE_INT, which contains a garbage collected array of HOST_WIDE_INTs that is large enough to hold the constant. For the targets that define TARGET_SUPPORTS_WIDE_INT to be non zero, CONST_DOUBLEs are only used to hold floating point values. If the target leaves TARGET_SUPPORTS_WIDE_INT defined as 0, CONST_WIDE_INTs are not used and CONST_DOUBLEs are as they were before. CONST_INT does not change except that it is defined to hold all constants that fit in exactly one HOST_WIDE_INT. Note that this is slightly different than the current trunk. Before this patch, the TImode constant '5' could either be in a CONST_INT or CONST_DOUBLE depending on which code path was used to create it. This patch changes this so that if the constant fits in a CONST_INT then it is represented in a CONST_INT no matter how it is created. 
For the array inside a CONST_WIDE_INT, and internally in wide-int, we use a compressed form for integers that need more than one HOST_WIDE_INT. Higher elements of the array are not needed if they are just a sign extension of the elements below them. This does not imply that constants are signed or are sign extended; this is only a compression technique. While it might seem to be more esthetically pleasing to have not introduced the CONST_WIDE_INT and to have changed the representation of the CONST_INT to accommodate larger numbers, this would have both used more space and would have been a time consuming change for the port maintainers. We believe that most ports can be quickly converted with the current scheme because there is just not a lot of code in the back ends that cares about large constants. Furthermore, the CONST_INT is very space efficient and even in a program that was heavy in large values, most constants would still fit in a CONST_INT. All of the parts of the rtl level that deal with CONST_DOUBLE now conditionally work with CONST_WIDE_INTs depending on the value of TARGET_SUPPORTS_WIDE_INT. We believe that this
Re: [Patch] Fix PR53397
On Tue, Oct 2, 2012 at 6:40 PM, Kumar, Venkataramanan venkataramanan.ku...@amd.com wrote: Hi Richi, (Snip) + (!cst_and_fits_in_hwi (step)) +{ + if (loop->inner != NULL) +{ + if (dump_file && (dump_flags & TDF_DETAILS)) +{ + fprintf (dump_file, "Reference %p:\n", (void *) ref); + fprintf (dump_file, "(base "); + print_generic_expr (dump_file, base, TDF_SLIM); + fprintf (dump_file, ", step "); + print_generic_expr (dump_file, step, TDF_TREE); + fprintf (dump_file, ")\n"); No need to repeat this - all references are dumped when we gather them. (Snip) The dumping happens at record_ref which is called after these statements to record these references. When the step is invariant we return from the function without recording the references, so I thought of dumping the references here. Is there a cleaner way to dump the references at one place? Yes, call dump_mem_ref then, instead of repeating parts of its body. Richard. Regards, Venkat. -Original Message- From: Richard Guenther [mailto:rguent...@suse.de] Sent: Tuesday, October 02, 2012 5:42 PM To: Kumar, Venkataramanan Cc: gcc-patches@gcc.gnu.org Subject: Re: [Patch] Fix PR53397 On Mon, 1 Oct 2012, venkataramanan.ku...@amd.com wrote: Hi, The below patch fixes the FFT/Scimark regression caused by useless prefetch generation. This fix tries to make prefetch less aggressive by prefetching arrays in the inner loop only when the step is invariant in the entire loop nest. GCC currently tries to prefetch invariant steps when they are in the inner loop, but does not check if the step is variant in outer loops. In the scimark FFT case, the trip count of the inner loop varies by a non constant step, which is invariant in the inner loop, but the step variable is varying in the outer loop. This makes the inner loop trip count small (at run time it sometimes shrinks to as little as 1 iteration). Prefetching x iterations ahead when the inner loop trip count is smaller than x leads to useless prefetches. 
Flag used: -O3 -march=amdfam10 Before ** ** ** SciMark2 Numeric Benchmark, see http://math.nist.gov/scimark ** ** for details. (Results can be submitted to p...@nist.gov) ** ** ** Using 2.00 seconds min time per kernel. Composite Score: 550.50 FFT Mflops: 38.66 (N=1024) SOR Mflops: 617.61 (100 x 100) MonteCarlo: Mflops: 173.74 Sparse matmult Mflops: 675.63 (N=1000, nz=5000) LU Mflops: 1246.88 (M=100, N=100) After ** ** ** SciMark2 Numeric Benchmark, see http://math.nist.gov/scimark ** ** for details. (Results can be submitted to p...@nist.gov) ** ** ** Using 2.00 seconds min time per kernel. Composite Score: 639.20 FFT Mflops: 479.19 (N=1024) SOR Mflops: 617.61 (100 x 100) MonteCarlo: Mflops: 173.18 Sparse matmult Mflops: 679.13 (N=1000, nz=5000) LU Mflops: 1246.88 (M=100, N=100) GCC regression make check -k passes with x86_64-unknown-linux-gnu New tests that PASS: gcc.dg/pr53397-1.c scan-assembler prefetcht0 gcc.dg/pr53397-1.c scan-tree-dump aprefetch "Issued prefetch" gcc.dg/pr53397-1.c (test for excess errors) gcc.dg/pr53397-2.c scan-tree-dump aprefetch "loop variant step" gcc.dg/pr53397-2.c scan-tree-dump aprefetch "Not prefetching" gcc.dg/pr53397-2.c (test for excess errors) Checked CPU2006 and polyhedron on latest AMD processor, no regressions noted. Ok to commit in trunk? regards, Venkat gcc/ChangeLog +2012-10-01 Venkataramanan Kumar venkataramanan.ku...@amd.com + + * tree-ssa-loop-prefetch.c (gather_memory_references_ref): + Perform non-constant step prefetching in the inner loop, only + when it is invariant in the entire loop nest. + * testsuite/gcc.dg/pr53397-1.c: New test case. + Checks we are prefetching for loop invariant steps. + * testsuite/gcc.dg/pr53397-2.c: New test case. + Checks we are not prefetching for loop variant steps. + Index: gcc/testsuite/gcc.dg/pr53397-1.c === --- gcc/testsuite/gcc.dg/pr53397-1.c (revision 0) +++ gcc/testsuite/gcc.dg/pr53397-1.c (revision 0) @@ -0,0 +1,28 @@ +/* Prefetching when the step is loop invariant.
*/ + +/* { dg-do compile } */ +/* { dg-options "-O3 -fprefetch-loop-arrays -fdump-tree-aprefetch-details --param min-insn-to-prefetch-ratio=3 --param simultaneous-prefetches=10 -fdump-tree-aprefetch-details" } */ + + +double data[16384]; +void prefetch_when_non_constant_step_is_invariant(int step, int n) { + int
Re: patch to fix constant math
this last night. It is an artifact of the way I produced the patch, and the responsible people have been sacked. However, it shows that you read the patch carefully, and I really appreciate that. I owe you a beer (not that you need another at this time of year). You also didn't mention the missing tree bits ... was this just a 1/n patch or is it at all usable for you in this state? Where do the large integers magically come from? Richard. Kenny On 10/04/2012 08:48 AM, Richard Guenther wrote: On Wed, Oct 3, 2012 at 7:15 PM, Kenneth Zadeck zad...@naturalbridge.com wrote: The enclosed patch is the third of at least four patches that fix the problems associated with supporting integers on the target that are wider than two HOST_WIDE_INTs. While GCC claims to support OImode, and we have two public ports that make minor use of this mode, in practice compilation that uses OImode commonly gives the wrong result or ICEs. We have a private port of GCC for an architecture that is further down the road to needing comprehensive OImode, and we have discovered that the current support is unusable. We have decided to fix it in a general way so that it is most beneficial to the GCC community. It is our belief that we are just a little ahead of the x86 and NEON ports, and these patches will shortly be essential. The first two of these patches were primarily lexical and have already been committed. They transformed the uses of CONST_DOUBLE so that it is easy to tell what the intended usage is. The underlying structures in the next two patches are very general: once they are added to the compiler, the compiler will be able to support targets with any size of integer from hosts of any size integer. The patch enclosed deals with the portable RTL parts of the compiler. The next patch, which is currently under construction, deals with the tree level.
However, this patch can be put on the trunk as is, and it will alleviate many, but not all, of the current limitations in the RTL parts of the compiler. Some of the patch is conditional, depending on a port defining the symbol 'TARGET_SUPPORTS_WIDE_INT' to be nonzero. Defining this symbol to be nonzero declares that the port has been converted to use the new form of integer constants. However, the patch is completely backwards compatible, allowing ports that do not need this immediately to convert at their leisure. The conversion process is not difficult, but it does require some knowledge of the port, so we are not volunteering to do this for all ports. OVERVIEW OF THE PATCH: The patch defines a new datatype, a 'wide_int' (defined in wide-int.[ch]), and this datatype will be used to perform all of the integer constant math in the compiler. Externally, wide-int is very similar to double-int except that it does not have the limitation that math must be done on exactly two HOST_WIDE_INTs. Internally, a wide_int is a structure that contains a fixed-sized array of HOST_WIDE_INTs, a length field and a mode. The size of the That it has a mode sounds odd to me and makes it subtly different from HOST_WIDE_INT and double-int. Maybe the patch will tell why this is so. array is determined at generation time by dividing the number of bits of the largest integer supported on the target by the number of bits in a HOST_WIDE_INT of the host. Thus, with this format, any length of integer can be supported on any host. A new rtx type is created, the CONST_WIDE_INT, which contains a garbage-collected array of HOST_WIDE_INTs that is large enough to hold the constant. For the targets that define TARGET_SUPPORTS_WIDE_INT to be nonzero, CONST_DOUBLEs are only used to hold floating point values. If the target leaves TARGET_SUPPORTS_WIDE_INT defined as 0, CONST_WIDE_INTs are not used and CONST_DOUBLEs are as they were before.
CONST_INT does not change except that it is defined to hold all constants that fit in exactly one HOST_WIDE_INT. Note that this is slightly different from the current trunk. Before this patch, the TImode constant '5' could be in either a CONST_INT or a CONST_DOUBLE depending on which code path was used to create it. This patch changes this so that if the constant fits in a CONST_INT then it is represented in a CONST_INT no matter how it is created. For the array inside a CONST_WIDE_INT, and internally in wide-int, we use a compressed form for integers that need more than one HOST_WIDE_INT. Higher elements of the array are not needed if they are just a sign extension of the elements below them. This does not imply that constants are signed or are sign extended; this is only a compression technique. While it might seem more aesthetically pleasing not to have introduced the CONST_WIDE_INT and to have changed the representation of the CONST_INT to accommodate larger numbers, this would both have used more space and have been a time-consuming change for the port maintainers. We
Re: [lra] patch to solve most scalability problems for LRA
On Thu, Oct 4, 2012 at 7:07 PM, Steven Bosscher stevenb@gmail.com wrote: On Thu, Oct 4, 2012 at 5:37 PM, Vladimir Makarov vmaka...@redhat.com wrote: The only issue now is PR54146 compilation time for IRA+LRA although it was improved significantly. I will continue work on PR54146. But now I am going to focus on proposals from reviews. Right, there still are opportunities to improve things. (The real solution may be to stop SRA from creating so many simultaneously live pseudos in the first place...) + lra_simple_p = (ira_use_lra_p && max_reg_num () >= (1 << 26) / last_basic_block); I think you should use n_basic_blocks here instead of last_basic_block, in case this runs without compacting the cfg first (n_basic_blocks is the real number of basic blocks in the cfg, last_basic_block is the highest index, so last_basic_block >= n_basic_blocks). I also noticed that switching to IRA_REGION_ONE improves things when we have a large number of loops (profile points to some loop code in IRA). Note that the magic number above should be a new --param, and once we have a diagnostic flag that shows whenever we back off like this it should notify the user of that fact (and the params we have overflown) - this just reminded me of that idea from somebody else ;) Thanks for working on this! Indeed ;) It, btw, also applies to IRA + reload ... Richard. Ciao! Steven
Re: PATCH trunk: gengtype honoring mark_hook-s inside struct inside union-s
On Thu, Oct 4, 2012 at 7:24 PM, Basile Starynkevitch bas...@starynkevitch.net wrote: On Thu, Oct 04, 2012 at 06:51:35PM +0300, Laurynas Biveinis wrote: 2012-10-03 Basile Starynkevitch bas...@starynkevitch.net * gengtype.c (walk_type): Emit mark_hook when inside a struct of a union member. This is OK. Thanks, committed revision 192092 to trunk. I believe this patch should be backported into GCC 4.7 and 4.6. I see no reason for this unless it is a regression. Richard. Regards. -- Basile STARYNKEVITCH http://starynkevitch.net/Basile/ email: basileatstarynkevitchdotnet mobile: +33 6 8501 2359 8, rue de la Faiencerie, 92340 Bourg La Reine, France *** opinions {are only mines, sont seulement les miennes} ***
Re: [PATCH] Teach VRP to handle if ((unsigned_narrowing_cast) x != 0) similarly to if ((x & 0xffff) != 0) (PR tree-optimization/54810)
On Thu, Oct 4, 2012 at 6:31 PM, Jakub Jelinek ja...@redhat.com wrote: Hi! This patch handles unsigned narrowing casts the same as BIT_AND_EXPR with the unsigned narrow type's max value. Bootstrapped/regtested on x86_64-linux and i686-linux, ok for trunk? Ok. Thanks, Richard. 2012-10-04 Jakub Jelinek ja...@redhat.com PR tree-optimization/54810 * tree-vrp.c (register_edge_assert_for_2): Handle NAME = (unsigned) NAME2; if (NAME cmp CST) for narrowing casts to unsigned integral type like NAME = NAME2 & CST2; if (NAME cmp CST) where CST2 is the max value of the unsigned integral type. --- gcc/tree-vrp.c.jj 2012-09-25 14:45:48.0 +0200 +++ gcc/tree-vrp.c 2012-10-04 11:43:32.334988401 +0200 @@ -4712,6 +4712,11 @@ register_edge_assert_for_2 (tree name, e tree val2 = NULL_TREE; double_int mask = double_int_zero; unsigned int prec = TYPE_PRECISION (TREE_TYPE (val)); + unsigned int nprec = prec; + enum tree_code rhs_code = ERROR_MARK; + + if (is_gimple_assign (def_stmt)) + rhs_code = gimple_assign_rhs_code (def_stmt); /* Add asserts for NAME cmp CST and NAME being defined as NAME = (int) NAME2. */ @@ -4721,7 +4726,7 @@ register_edge_assert_for_2 (tree name, e && gimple_assign_cast_p (def_stmt)) { name2 = gimple_assign_rhs1 (def_stmt); - if (CONVERT_EXPR_CODE_P (gimple_assign_rhs_code (def_stmt)) + if (CONVERT_EXPR_CODE_P (rhs_code) && INTEGRAL_TYPE_P (TREE_TYPE (name2)) && TYPE_UNSIGNED (TREE_TYPE (name2)) && prec == TYPE_PRECISION (TREE_TYPE (name2)) @@ -4767,8 +4772,7 @@ register_edge_assert_for_2 (tree name, e NAME = NAME2 >> CST2. Extract CST2 from the right shift. */ - if (is_gimple_assign (def_stmt) - && gimple_assign_rhs_code (def_stmt) == RSHIFT_EXPR) + if (rhs_code == RSHIFT_EXPR) { name2 = gimple_assign_rhs1 (def_stmt); cst2 = gimple_assign_rhs2 (def_stmt); @@ -4840,21 +4844,37 @@ register_edge_assert_for_2 (tree name, e /* Add asserts for NAME cmp CST and NAME being defined as NAME = NAME2 & CST2. -Extract CST2 from the and. */ +Extract CST2 from the and.
+ +Also handle +NAME = (unsigned) NAME2; +casts where NAME's type is unsigned and has smaller precision +than NAME2's type as if it was NAME = NAME2 & MASK. */ names[0] = NULL_TREE; names[1] = NULL_TREE; cst2 = NULL_TREE; - if (is_gimple_assign (def_stmt) - && gimple_assign_rhs_code (def_stmt) == BIT_AND_EXPR) + if (rhs_code == BIT_AND_EXPR + || (CONVERT_EXPR_CODE_P (rhs_code) + && TREE_CODE (TREE_TYPE (val)) == INTEGER_TYPE + && TYPE_UNSIGNED (TREE_TYPE (val)) + && TYPE_PRECISION (TREE_TYPE (gimple_assign_rhs1 (def_stmt))) > prec + && !retval)) { name2 = gimple_assign_rhs1 (def_stmt); - cst2 = gimple_assign_rhs2 (def_stmt); + if (rhs_code == BIT_AND_EXPR) + cst2 = gimple_assign_rhs2 (def_stmt); + else + { + cst2 = TYPE_MAX_VALUE (TREE_TYPE (val)); + nprec = TYPE_PRECISION (TREE_TYPE (name2)); + } if (TREE_CODE (name2) == SSA_NAME && INTEGRAL_TYPE_P (TREE_TYPE (name2)) && TREE_CODE (cst2) == INTEGER_CST && !integer_zerop (cst2) - && prec <= HOST_BITS_PER_DOUBLE_INT - && (prec > 1 + && nprec <= HOST_BITS_PER_DOUBLE_INT + && (nprec > 1 || TYPE_UNSIGNED (TREE_TYPE (val { gimple def_stmt2 = SSA_NAME_DEF_STMT (name2); @@ -4881,12 +4901,12 @@ register_edge_assert_for_2 (tree name, e bool valid_p = false, valn = false, cst2n = false; enum tree_code ccode = comp_code; - valv = tree_to_double_int (val).zext (prec); - cst2v = tree_to_double_int (cst2).zext (prec); + valv = tree_to_double_int (val).zext (nprec); + cst2v = tree_to_double_int (cst2).zext (nprec); if (!TYPE_UNSIGNED (TREE_TYPE (val))) { - valn = valv.sext (prec).is_negative (); - cst2n = cst2v.sext (prec).is_negative (); + valn = valv.sext (nprec).is_negative (); + cst2n = cst2v.sext (nprec).is_negative (); } /* If CST2 doesn't have most significant bit set, but VAL is negative, we have comparison like @@ -4894,7 +4914,7 @@ register_edge_assert_for_2 (tree name, e if (!cst2n && valn) ccode = ERROR_MARK; if (cst2n) - sgnbit = double_int_one.llshift (prec - 1, prec).zext (prec); + sgnbit = double_int_one.llshift (nprec - 1, nprec).zext (nprec);
Re: How much time left till phase 3?
On Tue, Oct 2, 2012 at 10:44 AM, Joern Rennecke joern.renne...@embecosm.com wrote: I'll have to prepare a few more patches to (supposedly) generic code to support the ARCompact port, which we (Synopsys and Embecosm) would like to contribute in time for gcc 4.8. How much time is left till we switch from phase 1 to phase 3? I expect stage1 to close mid to end of October (by which point it will have lasted for more than 7 months). Richard.
Re: How much time left till phase 3?
On Tue, Oct 2, 2012 at 11:34 AM, Richard Guenther richard.guent...@gmail.com wrote: On Tue, Oct 2, 2012 at 10:44 AM, Joern Rennecke joern.renne...@embecosm.com wrote: I'll have to prepare a few more patches to (supposedly) generic code to support the ARCompact port, which we (Synopsys and Embecosm) would like contribute in time for gcc 4.8. How much time is left till we switch from phase 1 to phase 3? I expect stage1 to close mid to end of October (after which it lasted for more than 7 months). Btw, I realize that the aarch64 port probably also wants to merge even if I didn't see a merge proposal or know whether they have patches to generic code. Anybody else with things they want to merge during stage1? Thanks, Richard.
Re: [RFC] Make vectorizer to skip loops with small iteration estimate
On Mon, 1 Oct 2012, Jan Hubicka wrote: So the unvectorized cost is SIC * niters. The vectorized path is SOC + VIC * ((niters - PL_ITERS - EP_ITERS) / VF) + VOC. The scalar path of the vectorized loop is SIC * niters + SOC. Note that 'th' is used for the runtime profitability check which is done at the time the setup cost has already been taken (yes, we Yes, I understand that. probably should make it more conservative but then guard the whole set of loops by the check, not only the vectorized path). See PR53355 for the general issue. Yep, we may reduce the cost of SOC by outputting an early guard for the non-vectorized path better than we do now. However... Of course this is a very simple benchmark; in reality the vectorization can be a lot more harmful by complicating more complex control flows. So I guess we have two options 1) go with the new formula and try to make the cost model a bit more realistic. 2) stay with the original formula that is quite close to reality, but I think more by accident. I think we need to improve it as a whole, thus I'd prefer 2). ... I do not see why. Even if we make the check cheaper we will only distribute part of SOC to the vector prologues/epilogues. Still I think the formula is wrong, i.e. accounting SOC where it should not. The cost of the scalar path without vectorization is niters * SIC, while with vectorization we have the scalar path niters * SIC + SOC and the vector path SOC + VIC * ((niters - PL_ITERS - EP_ITERS) / VF) + VOC. So SOC cancels out in the runtime check. I still think we need two formulas - one determining if vectorization is profitable, the other specifying the threshold for the scalar path at runtime (that will generally give lower values). True, we want two values. But part of the scalar path right now is all the computation required for alias and alignment runtime checks (because of the way all the conditions are combined). I'm not much into the details of what we account for in SOC (I suppose it's everything we insert in the preheader of the vector loop).
+ if (vect_print_dump_info (REPORT_UNVECTORIZED_LOCATIONS)) + fprintf (vect_dump, "not vectorized: estimated iteration count too small."); + if (vect_print_dump_info (REPORT_DETAILS)) + fprintf (vect_dump, "not vectorized: estimated iteration count smaller than " + "user specified loop bound parameter or minimum " + "profitable iterations (whichever is more conservative)."); this won't work anymore btw - dumping infrastructure changed. I suppose your patch is a step in the right direction, but to really make progress we need to re-organize the loop and predicate structure produced by the vectorizer. So, please update your patch, re-test and then it's ok. 2) Even when the loop iterates 2 times, it is estimated to 4 iterations by estimated_stmt_executions_int with profile feedback. The reason is the loop_ch pass. Given a rolled loop with exit probability 30%, it proceeds by duplicating the header with the original probabilities. This makes the loop be executed with 60% probability. Because the loop body counts remain the same (and they should), the expected number of iterations increases by the decrease of the entry edge to the header. I wonder what to do about this. Obviously without path profiling loop_ch can not really do a good job. We can artificially make the header succeed more likely, that is the reality, but that requires non-trivial loop profile updating. We can also simply record the iteration bound into the loop structure and ignore that the profile is not realistic. But we don't preserve loop structure from header copying ... From what time do we keep the loop structure? In general I would like to eventually drop value histograms to the loop structure specifying the number of iterations with profile feedback.
We preserve it from the start of the tree loop optimizers (it's easy to preserve them from earlier points as long as you don't cross inlining, but to lower the impact of the change I placed it where it was enough to prevent the excessive unrolling/peeling done by RTL). Finally we can duplicate loop headers before profiling. I implemented that via an early_ch pass executed only with profile generation or feedback. I guess it makes sense to do, even if it breaks the assumption that we should do strictly -Os generation on paths where Well, there are CH cases that do not increase code size and I doubt that loop header copying is generally bad for -Os ... we are not good at handling non-copied loop headers. There is a comment saying /* Loop header copying usually increases size of the code. This used not to be true, since quite often it is possible to verify that the condition is satisfied in the first
Re: Convert more non-GTY htab_t to hash_table.
On Mon, 1 Oct 2012, Lawrence Crowl wrote: Change more non-GTY hash tables to use the new type-safe template hash table. Constify member function parameters that can be const. Correct a couple of expressions in formerly uninstantiated templates. The new code is 0.362% faster in bootstrap, with a 99.5% confidence of being faster. Tested on x86-64. Okay for trunk? You are changing a hashtable used by fold checking, did you test with fold checking enabled? +/* Data structures used to maintain mapping between basic blocks and + copies. */ +static hash_table <bb_copy_hasher> bb_original; +static hash_table <bb_copy_hasher> bb_copy; note that because hash_table has a constructor we now get global CTORs for all statics :( (and mx-protected local inits ...) Can you please try to remove the constructor from hash_table to avoid this overhead? (as a followup - that is, don't initialize htab) The cfg.c, dse.c and hash-table.h parts are ok for trunk, I'll leave the rest to respective maintainers of the pieces of the compiler. Thanks, Richard. Index: gcc/java/ChangeLog 2012-10-01 Lawrence Crowl cr...@google.com * Make-lang.in (JAVA_OBJS): Add dependence on hash-table.o. (JCFDUMP_OBJS): Add dependence on hash-table.o. (jcf-io.o): Add dependence on hash-table.h. * jcf-io.c (memoized_class_lookups): Change to use type-safe hash table. Index: gcc/c/ChangeLog 2012-10-01 Lawrence Crowl cr...@google.com * Make-lang.in (c-decl.o): Add dependence on hash-table.h. * c-decl.c (detect_field_duplicates_hash): Change to new type-safe hash table. Index: gcc/objc/ChangeLog 2012-10-01 Lawrence Crowl cr...@google.com * Make-lang.in (OBJC_OBJS): Add dependence on hash-table.o. (objc-act.o): Add dependence on hash-table.h. * objc-act.c (objc_detect_field_duplicates): Change to new type-safe hash table. Index: gcc/ChangeLog 2012-10-01 Lawrence Crowl cr...@google.com * Makefile.in (fold-const.o): Add dependence on hash-table.h. (dse.o): Likewise. (cfg.o): Likewise.
* fold-const.c (fold_checksum_tree): Change to new type-safe hash table. * (print_fold_checksum): Likewise. * cfg.c (var bb_original): Likewise. * (var bb_copy): Likewise. * (var loop_copy): Likewise. * hash-table.h (template hash_table): Constify parameters for find... and remove_elt... member functions. (hash_table::empty) Correct size expression. (hash_table::clear_slot) Correct deleted entry assignment. * dse.c (var rtx_group_table): Change to new type-safe hash table. Index: gcc/cp/ChangeLog 2012-10-01 Lawrence Crowl cr...@google.com * Make-lang.in (class.o): Add dependence on hash-table.h. (tree.o): Likewise. (semantics.o): Likewise. * class.c (fixed_type_or_null): Change to new type-safe hash table. * tree.c (verify_stmt_tree): Likewise. (verify_stmt_tree_r): Likewise. * semantics.c (struct nrv_data): Likewise. Index: gcc/java/Make-lang.in === --- gcc/java/Make-lang.in (revision 191941) +++ gcc/java/Make-lang.in (working copy) @@ -83,10 +83,10 @@ JAVA_OBJS = java/class.o java/decl.o jav java/zextract.o java/jcf-io.o java/win32-host.o java/jcf-parse.o java/mangle.o \ java/mangle_name.o java/builtins.o java/resource.o \ java/jcf-depend.o \ - java/jcf-path.o java/boehm.o java/java-gimplify.o + java/jcf-path.o java/boehm.o java/java-gimplify.o hash-table.o JCFDUMP_OBJS = java/jcf-dump.o java/jcf-io.o java/jcf-depend.o java/jcf-path.o \ - java/win32-host.o java/zextract.o ggc-none.o + java/win32-host.o java/zextract.o ggc-none.o hash-table.o JVGENMAIN_OBJS = java/jvgenmain.o java/mangle_name.o @@ -326,7 +326,7 @@ java/java-gimplify.o: java/java-gimplify # jcf-io.o needs $(ZLIBINC) added to cflags. CFLAGS-java/jcf-io.o += $(ZLIBINC) java/jcf-io.o: java/jcf-io.c $(CONFIG_H) $(SYSTEM_H) coretypes.h \ - $(JAVA_TREE_H) java/zipfile.h + $(JAVA_TREE_H) java/zipfile.h $(HASH_TABLE_H) # jcf-path.o needs a -D. 
CFLAGS-java/jcf-path.o += \ Index: gcc/java/jcf-io.c === --- gcc/java/jcf-io.c (revision 191941) +++ gcc/java/jcf-io.c (working copy) @@ -31,7 +31,7 @@ The Free Software Foundation is independ #include "jcf.h" #include "tree.h" #include "java-tree.h" -#include "hashtab.h" +#include "hash-table.h" #include <dirent.h> #include <zlib.h> @@ -271,20 +271,34 @@ find_classfile (char *filename, JCF *jcf return open_class (filename, jcf, fd, dep_name); } -/* Returns 1 if the CLASSNAME (really a char *) matches the name - stored in TABLE_ENTRY (also a char *). */ -static int -memoized_class_lookup_eq (const void *table_entry, const void *classname) +/* Hash table
Re: [PATCH] Fix powerpc breakage, was: Add option for dumping to stderr (issue6190057)
On Tue, Oct 2, 2012 at 1:11 AM, Xinliang David Li davi...@google.com wrote: On Mon, Oct 1, 2012 at 4:05 PM, Sharad Singhai sing...@google.com wrote: Thanks for tracking down and fixing the powerpc port. The dump_kind_p () check is redundant but canonical form here. I think blocks of dump code guarded by if dump_kind_p (...) might be easier to read/maintain. I find it confusing to be honest. The redundant check serves no purpose. The check should be inlined and avoid the call to the diagnostic routine, thus speeding up compile time. We should use this pattern, especially if it guards multiple calls. Richard. David Sharad Sharad On Mon, Oct 1, 2012 at 3:45 PM, Xinliang David Li davi...@google.com wrote: On Mon, Oct 1, 2012 at 2:37 PM, Michael Meissner meiss...@linux.vnet.ibm.com wrote: I tracked down some of the other code that previously used REPORT_DETAILS, and MSG_NOTE is the new way to do the same thing. This bootstraps and no unexpected errors occur during make check. Is it ok to install? 2012-10-01 Michael Meissner meiss...@linux.vnet.ibm.com * config/rs6000/rs6000.c (toplevel): Include dumpfile.h. (rs6000_density_test): Rework to accommodate 09-30 change by Sharad Singhai. * config/rs6000/t-rs6000 (rs6000.o): Add dumpfile.h dependency.
Index: gcc/config/rs6000/rs6000.c === --- gcc/config/rs6000/rs6000.c (revision 191932) +++ gcc/config/rs6000/rs6000.c (working copy) @@ -58,6 +58,7 @@ #include "tm-constrs.h" #include "opts.h" #include "tree-vectorizer.h" +#include "dumpfile.h" #if TARGET_XCOFF #include "xcoffout.h" /* get declarations of xcoff_*_section_name */ #endif @@ -3518,11 +3519,11 @@ rs6000_density_test (rs6000_cost_data *d vec_cost + not_vec_cost > DENSITY_SIZE_THRESHOLD) { data->cost[vect_body] = vec_cost * (100 + DENSITY_PENALTY) / 100; - if (vect_print_dump_info (REPORT_DETAILS)) - fprintf (vect_dump, "density %d%%, cost %d exceeds threshold, penalizing " "loop body cost by %d%%", density_pct, vec_cost + not_vec_cost, DENSITY_PENALTY); + if (dump_kind_p (MSG_NOTE)) Is this check needed? Seems redundant. David + dump_printf_loc (MSG_NOTE, vect_location, "density %d%%, cost %d exceeds threshold, penalizing " "loop body cost by %d%%", density_pct, vec_cost + not_vec_cost, DENSITY_PENALTY); } } Index: gcc/config/rs6000/t-rs6000 === --- gcc/config/rs6000/t-rs6000 (revision 191932) +++ gcc/config/rs6000/t-rs6000 (working copy) @@ -26,7 +26,7 @@ rs6000.o: $(CONFIG_H) $(SYSTEM_H) corety $(OBSTACK_H) $(TREE_H) $(EXPR_H) $(OPTABS_H) except.h function.h \ output.h dbxout.h $(BASIC_BLOCK_H) toplev.h $(GGC_H) $(HASHTAB_H) \ $(TM_P_H) $(TARGET_H) $(TARGET_DEF_H) langhooks.h reload.h gt-rs6000.h \ - cfgloop.h $(OPTS_H) $(COMMON_TARGET_H) + cfgloop.h $(OPTS_H) $(COMMON_TARGET_H) dumpfile.h rs6000-c.o: $(srcdir)/config/rs6000/rs6000-c.c \ $(srcdir)/config/rs6000/rs6000-protos.h \ -- Michael Meissner, IBM 5 Technology Place Drive, M/S 2757, Westford, MA 01886-3141, USA meiss...@linux.vnet.ibm.com fax +1 (978) 399-6899
Re: [PATCH] Fix test breakage, was: Add option for dumping to stderr (issue6190057)
On Tue, Oct 2, 2012 at 1:31 AM, Sharad Singhai sing...@google.com wrote: Here is a patch to fix test breakage caused by r191883. Bootstrapped on x86_64 and tested with make -k check RUNTESTFLAGS=--target_board=unix/\{,-m32\}. Okay for trunk? Ok. Thanks, Richard. Thanks, Sharad 2012-10-01 Sharad Singhai sing...@google.com * tree-vect-stmts.c (vectorizable_operation): Add missing return. testsuite/ChangeLog * gfortran.dg/vect/vect.exp: Change verbose vectorizer dump options to fix test failures caused by r191883. * gcc.dg/tree-ssa/gen-vect-11.c: Likewise. * gcc.dg/tree-ssa/gen-vect-2.c: Likewise. * gcc.dg/tree-ssa/gen-vect-32.c: Likewise. * gcc.dg/tree-ssa/gen-vect-25.c: Likewise. * gcc.dg/tree-ssa/gen-vect-11a.c: Likewise. * gcc.dg/tree-ssa/gen-vect-26.c: Likewise. * gcc.dg/tree-ssa/gen-vect-11b.c: Likewise. * gcc.dg/tree-ssa/gen-vect-11c.c: Likewise. * gcc.dg/tree-ssa/gen-vect-28.c: Likewise. * testsuite/gcc.target/i386/vect-double-1.c: Fix test. Missing entry from r191883. Index: testsuite/gfortran.dg/vect/vect.exp === --- testsuite/gfortran.dg/vect/vect.exp (revision 191883) +++ testsuite/gfortran.dg/vect/vect.exp (working copy) @@ -26,7 +26,7 @@ set DEFAULT_VECTCFLAGS # These flags are used for all targets. lappend DEFAULT_VECTCFLAGS -O2 -ftree-vectorize -fno-vect-cost-model \ - -ftree-vectorizer-verbose=4 -fdump-tree-vect-stats + -fdump-tree-vect-details # If the target system supports vector instructions, the default action # for a test is 'run', otherwise it's 'compile'. Save current default.
Index: testsuite/gcc.dg/tree-ssa/gen-vect-11.c === --- testsuite/gcc.dg/tree-ssa/gen-vect-11.c (revision 191883) +++ testsuite/gcc.dg/tree-ssa/gen-vect-11.c (working copy) @@ -1,6 +1,6 @@ /* { dg-do run { target vect_cmdline_needed } } */ -/* { dg-options "-O2 -ftree-vectorize -ftree-vectorizer-verbose=3 -fwrapv -fdump-tree-vect-stats" } */ -/* { dg-options "-O2 -ftree-vectorize -ftree-vectorizer-verbose=3 -fwrapv -fdump-tree-vect-stats -mno-sse" { target { i?86-*-* x86_64-*-* } } } */ +/* { dg-options "-O2 -ftree-vectorize -fwrapv -fdump-tree-vect-details" } */ +/* { dg-options "-O2 -ftree-vectorize -fwrapv -fdump-tree-vect-details -mno-sse" { target { i?86-*-* x86_64-*-* } } } */ #include <stdlib.h> Index: testsuite/gcc.dg/tree-ssa/gen-vect-2.c === --- testsuite/gcc.dg/tree-ssa/gen-vect-2.c (revision 191883) +++ testsuite/gcc.dg/tree-ssa/gen-vect-2.c (working copy) @@ -1,6 +1,6 @@ /* { dg-do run { target vect_cmdline_needed } } */ -/* { dg-options "-O2 -ftree-vectorize -ftree-vectorizer-verbose=4 -fdump-tree-vect-stats" } */ -/* { dg-options "-O2 -ftree-vectorize -ftree-vectorizer-verbose=4 -fdump-tree-vect-stats -mno-sse" { target { i?86-*-* x86_64-*-* } } } */ +/* { dg-options "-O2 -ftree-vectorize -fdump-tree-vect-details" } */ +/* { dg-options "-O2 -ftree-vectorize -fdump-tree-vect-details -mno-sse" { target { i?86-*-* x86_64-*-* } } } */ #include <stdlib.h> Index: testsuite/gcc.dg/tree-ssa/gen-vect-32.c === --- testsuite/gcc.dg/tree-ssa/gen-vect-32.c (revision 191883) +++ testsuite/gcc.dg/tree-ssa/gen-vect-32.c (working copy) @@ -1,6 +1,6 @@ /* { dg-do run { target vect_cmdline_needed } } */ -/* { dg-options "-O2 -ftree-vectorize -ftree-vectorizer-verbose=4 -fdump-tree-vect-stats" } */ -/* { dg-options "-O2 -ftree-vectorize -ftree-vectorizer-verbose=4 -fdump-tree-vect-stats -mno-sse" { target { i?86-*-* x86_64-*-* } } } */ +/* { dg-options "-O2 -ftree-vectorize -fdump-tree-vect-details" } */ +/* { dg-options "-O2 -ftree-vectorize -fdump-tree-vect-details -mno-sse" { target { i?86-*-*
x86_64-*-* } } } */ #include <stdlib.h> Index: testsuite/gcc.dg/tree-ssa/gen-vect-25.c === --- testsuite/gcc.dg/tree-ssa/gen-vect-25.c (revision 191883) +++ testsuite/gcc.dg/tree-ssa/gen-vect-25.c (working copy) @@ -1,6 +1,6 @@ /* { dg-do run { target vect_cmdline_needed } } */ -/* { dg-options "-O2 -ftree-vectorize -ftree-vectorizer-verbose=4 -fdump-tree-vect-stats" } */ -/* { dg-options "-O2 -ftree-vectorize -ftree-vectorizer-verbose=4 -fdump-tree-vect-stats -mno-sse" { target { i?86-*-* x86_64-*-* } } } */ +/* { dg-options "-O2 -ftree-vectorize -fdump-tree-vect-details" } */ +/* { dg-options "-O2 -ftree-vectorize -fdump-tree-vect-details -mno-sse" { target { i?86-*-* x86_64-*-* } } } */ #include <stdlib.h> Index: testsuite/gcc.dg/tree-ssa/gen-vect-11a.c === --- testsuite/gcc.dg/tree-ssa/gen-vect-11a.c (revision 191883) +++
Re: [PATCH] Add option for dumping to stderr (issue6190057)
On Mon, Oct 1, 2012 at 8:39 PM, Gabriel Dos Reis g...@integrable-solutions.net wrote: On Mon, Oct 1, 2012 at 1:27 PM, Michael Meissner meiss...@linux.vnet.ibm.com wrote: On Mon, Oct 01, 2012 at 02:02:26PM -0400, Michael Meissner wrote: Your change on September 30th breaks the powerpc port because the REPORT_DETAILS value in the enumeration is no longer there, and the rs6000_density_test function was using that. Please in the future, when you are making global changes, grep for uses of enum values in all of the machine dependent directories so we can avoid breakage like this. Also, in looking at the changes, given we are already up to 28 TDF_ flags, I would recommend immediately adding a new type that is the TDF flagword type. Thus it will be a lot simpler when we add 4 more TDF flags and have to change the type from int to HOST_WIDE_INT. Agreed that we need an abstraction here. Some TLC as well - the flags have various meanings (some control dumping; some, like TDF_TREE, seem to be unrelated - the MSG ones probably don't need the same number-space as well, and not all flags are used anymore - TDF_MEMSYMS?). But yes, an abstraction is needed. But I wouldn't suggest HOST_WIDE_INT but rather uint32_t (possibly going uint64_t). Richard. -- Gaby
Re: [Patch] Fix PR53397
On Mon, 1 Oct 2012, venkataramanan.ku...@amd.com wrote: Hi, The below patch fixes the FFT/SciMark regression caused by useless prefetch generation. This fix makes prefetching less aggressive by prefetching arrays in the inner loop only when the step is invariant in the entire loop nest. GCC currently tries to prefetch invariant steps when they are in the inner loop, but does not check whether the step is variant in outer loops. In the SciMark FFT case, the trip count of the inner loop varies by a non-constant step, which is invariant in the inner loop, but the step variable varies in the outer loop. This makes the inner loop trip count small (at run time it sometimes drops to as little as 1 iteration). Prefetching x iterations ahead when the inner loop trip count is smaller than x leads to useless prefetches. Flags used: -O3 -march=amdfam10 Before ** ** ** SciMark2 Numeric Benchmark, see http://math.nist.gov/scimark ** ** for details. (Results can be submitted to p...@nist.gov) ** ** ** Using 2.00 seconds min time per kernel. Composite Score: 550.50 FFT Mflops: 38.66(N=1024) SOR Mflops: 617.61(100 x 100) MonteCarlo: Mflops: 173.74 Sparse matmult Mflops: 675.63(N=1000, nz=5000) LU Mflops: 1246.88(M=100, N=100) After ** ** ** SciMark2 Numeric Benchmark, see http://math.nist.gov/scimark ** ** for details. (Results can be submitted to p...@nist.gov) ** ** ** Using 2.00 seconds min time per kernel. 
Composite Score: 639.20 FFT Mflops: 479.19(N=1024) SOR Mflops: 617.61(100 x 100) MonteCarlo: Mflops: 173.18 Sparse matmult Mflops: 679.13(N=1000, nz=5000) LU Mflops: 1246.88(M=100, N=100) GCC regression make check -k passes with x86_64-unknown-linux-gnu New tests that PASS: gcc.dg/pr53397-1.c scan-assembler prefetcht0 gcc.dg/pr53397-1.c scan-tree-dump aprefetch "Issued prefetch" gcc.dg/pr53397-1.c (test for excess errors) gcc.dg/pr53397-2.c scan-tree-dump aprefetch "loop variant step" gcc.dg/pr53397-2.c scan-tree-dump aprefetch "Not prefetching" gcc.dg/pr53397-2.c (test for excess errors) Checked CPU2006 and polyhedron on latest AMD processor, no regressions noted. Ok to commit in trunk? regards, Venkat gcc/ChangeLog +2012-10-01 Venkataramanan Kumar venkataramanan.ku...@amd.com + + * tree-ssa-loop-prefetch.c (gather_memory_references_ref): + Perform non-constant step prefetching in the inner loop only + when the step is invariant in the entire loop nest. + * testsuite/gcc.dg/pr53397-1.c: New test case. + Checks we are prefetching for loop invariant steps. + * testsuite/gcc.dg/pr53397-2.c: New test case. + Checks we are not prefetching for loop variant steps. + Index: gcc/testsuite/gcc.dg/pr53397-1.c === --- gcc/testsuite/gcc.dg/pr53397-1.c (revision 0) +++ gcc/testsuite/gcc.dg/pr53397-1.c (revision 0) @@ -0,0 +1,28 @@ +/* Prefetching when the step is loop invariant. 
*/ + +/* { dg-do compile } */ +/* { dg-options "-O3 -fprefetch-loop-arrays -fdump-tree-aprefetch-details --param min-insn-to-prefetch-ratio=3 --param simultaneous-prefetches=10 -fdump-tree-aprefetch-details" } */ + + +double data[16384]; +void prefetch_when_non_constant_step_is_invariant(int step, int n) +{ + int a; + int b; + for (a = 1; a < step; a++) { +for (b = 0; b < n; b += 2 * step) { + + int i = 2*(b + a); + int j = 2*(b + a + step); + + + data[j] = data[i]; + data[j+1] = data[i+1]; +} + } +} + +/* { dg-final { scan-tree-dump "Issued prefetch" "aprefetch" } } */ +/* { dg-final { scan-assembler "prefetcht0" } } */ This (and the case below) needs to be adjusted to only run on the appropriate hardware. See for example gcc.dg/tree-ssa/prefetch-8.c for how to do this. +/* { dg-final { cleanup-tree-dump "aprefetch" } } */ Index: gcc/testsuite/gcc.dg/pr53397-2.c === --- gcc/testsuite/gcc.dg/pr53397-2.c (revision 0) +++ gcc/testsuite/gcc.dg/pr53397-2.c (revision 0) @@ -0,0 +1,29 @@ +/* Not prefetching when the step is loop variant. */ + +/* { dg-do compile } */ +/* { dg-options "-O3 -fprefetch-loop-arrays -fdump-tree-aprefetch-details --param min-insn-to-prefetch-ratio=3 --param simultaneous-prefetches=10 -fdump-tree-aprefetch-details" } */ + + +double data[16384]; +void donot_prefetch_when_non_constant_step_is_variant(int step, int n) +{ + int a; + int b; + for (a = 1; a < step; a++,step*=2) {
Re: vec_cond_expr adjustments
On Mon, Oct 1, 2012 at 5:57 PM, Marc Glisse marc.gli...@inria.fr wrote: [merging both threads, thanks for the answers] On Mon, 1 Oct 2012, Richard Guenther wrote: optabs should be fixed instead, an is_gimple_val condition is implicitly val != 0. For vectors, I think it should be val < 0 (with an appropriate cast of val to a signed integer vector type if necessary). Or (val & highbit) != 0, but that's longer. I don't think so. Throughout the compiler we generally assume false == 0 and anything else is true. (yes, for FP there is STORE_FLAG_VALUE, but its scope is quite limited - if we want sth similar for vectors we'd have to invent it). See below. If we for example have predicate = a < b; x = predicate ? d : e; y = predicate ? f : g; we ideally want to re-use the predicate computation on targets where that would be optimal (and combine should be able to recover the case where it is not). That I don't understand. The vcond instruction implemented by targets takes as arguments d, e, cmp, a, b and emits the comparison itself. I don't see how I can avoid sending to the targets both (d,e,<,a,b) and (f,g,<,a,b). They will notice eventually that a < b is computed twice and remove one of the two, but I don't see how to do that in optabs.c. Or I can compute x = a < b, use x < 0 as the comparison passed to the targets, and expect targets (those for which it is true) to recognize that < 0 is useless in a vector condition (PR54700), or is useless on a comparison result. But that's a limitation of how vcond works. ISTR there is/was a vselect instruction as well, taking a mask and two vectors to select from. At least that's how vcond works internally for some sub-targets. vselect seems to only appear in config/. Would it be defined as: vselect(m,a,b) = (a & m) | (b & ~m)? I would almost be tempted to just define a pattern in .md files and let combine handle it, although it might be one instruction too long for that (and if m is x < y, ~m might look like x >= y). 
Or would it match the OpenCL select: for each component of a vector type, result[i] = MSB of c[i] is set ? b[i] : a[i]? Or the pattern with & and | but with a precondition that the value of each element of the mask must be 0 or -1? I don't find vcond that bad, as long as targets check for trivial comparisons in the expansion (what trivial means may depend on the platform). It is quite flexible for targets. Well, ok. On Mon, 1 Oct 2012, Richard Guenther wrote: tmp = fold_build2_loc (gimple_location (def_stmt), code, - boolean_type_node, + TREE_TYPE (cond), That's obvious. Ok, I'll test and commit that line separately. + if (TREE_CODE (op0) == VECTOR_CST && TREE_CODE (op1) == VECTOR_CST) +{ + int count = VECTOR_CST_NELTS (op0); + tree *elts = XALLOCAVEC (tree, count); + gcc_assert (TREE_CODE (type) == VECTOR_TYPE); + + for (int i = 0; i < count; i++) + { + tree elem_type = TREE_TYPE (type); + tree elem0 = VECTOR_CST_ELT (op0, i); + tree elem1 = VECTOR_CST_ELT (op1, i); + + elts[i] = fold_relational_const (code, elem_type, + elem0, elem1); + + if (elts[i] == NULL_TREE) + return NULL_TREE; + + elts[i] = fold_negate_const (elts[i], elem_type); I think you need to invent something new similar to STORE_FLAG_VALUE or use STORE_FLAG_VALUE here. With the above you try to map {0, 1} to {0, -1} which is only true if the operation on the element types returns {0, 1} (thus, STORE_FLAG_VALUE is 1). Er, seems to me that constant folding of a scalar comparison in the front/middle-end only returns {0, 1}. The point is we need to define some semantics for vector comparison results. One variant is to make it target independent, which in turn would inhibit (or at least make it more difficult) exploiting some target features. 
You for example use {0, -1} for truth values - probably to exploit target features - even though the most natural middle-end way would be to use {0, 1} as for everything else (caveat: there may be both signed and unsigned bools, we don't allow vector components with non-mode precision, thus you could argue that a signed bool : 1 is just sign-extended for your solution). A different variant is to make it target dependent to leverage optimization opportunities - that's why STORE_FLAG_VALUE exists. For example with a vector comparison a < v result, when performing bitwise operations on it, you either have to make the target expand code to produce {0, -1} even if the natural compare instruction would, say, produce {0, 0x8} - or not constrain the possible values of its result (like forwprop would do with your patch). In general we want constant folding to yield the same results as if the HW carried out the operation to make -O0 code not diverge from
Re: [PATCH] Vector CONSTRUCTOR verifier
On Tue, Oct 2, 2012 at 3:01 PM, Jakub Jelinek ja...@redhat.com wrote: Hi! As discussed in the PR and on IRC, this patch verifies that a vector CONSTRUCTOR in GIMPLE either is an empty CONSTRUCTOR, or contains scalar elements of a type compatible with the vector element type (then the verification is less strict: it allows fewer than TYPE_VECTOR_SUBPARTS elements and allows non-NULL indexes if they are consecutive (no holes); this is because CONSTRUCTORs with those properties often leak into the IL from FEs, and a change in the gimplifier to canonicalize them wasn't enough, they keep leaking even from non-gimplified DECL_INITIAL values etc.), or contains vector elements (element types must be compatible, the vector elements must be of the same type and their number must fill the whole wider vector - these are created/used by tree-vect-generic lowering if HW supports only shorter vectors than what is requested in source). Bootstrapped/regtested on x86_64-linux and i686-linux, ok for trunk? Ok with ... 2012-10-02 Jakub Jelinek ja...@redhat.com PR tree-optimization/54713 * expr.c (categorize_ctor_elements_1): Don't assume purpose is non-NULL. * tree-cfg.c (verify_gimple_assign_single): Add verification of vector CONSTRUCTORs. * tree-ssa-sccvn.c (vn_reference_lookup_3): For VECTOR_TYPE CONSTRUCTORs, don't do anything if element type is VECTOR_TYPE, and don't check index. * tree-vect-slp.c (vect_get_constant_vectors): VIEW_CONVERT_EXPR ctor elements first if their type isn't compatible with vector element type. 
--- gcc/expr.c.jj 2012-09-27 12:45:53.0 +0200 +++ gcc/expr.c 2012-10-01 18:21:40.885122833 +0200 @@ -5491,7 +5491,7 @@ categorize_ctor_elements_1 (const_tree c { HOST_WIDE_INT mult = 1; - if (TREE_CODE (purpose) == RANGE_EXPR) + if (purpose && TREE_CODE (purpose) == RANGE_EXPR) { tree lo_index = TREE_OPERAND (purpose, 0); tree hi_index = TREE_OPERAND (purpose, 1); --- gcc/tree-cfg.c.jj 2012-10-01 17:28:17.469921927 +0200 +++ gcc/tree-cfg.c 2012-10-02 11:24:11.686155889 +0200 @@ -4000,6 +4000,80 @@ verify_gimple_assign_single (gimple stmt return res; case CONSTRUCTOR: + if (TREE_CODE (rhs1_type) == VECTOR_TYPE) + { + unsigned int i; + tree elt_i, elt_v, elt_t = NULL_TREE; + + if (CONSTRUCTOR_NELTS (rhs1) == 0) + return res; + /* For vector CONSTRUCTORs we require that either it is empty +CONSTRUCTOR, or it is a CONSTRUCTOR of smaller vector elements +(then the element count must be correct to cover the whole +outer vector and index must be NULL on all elements), or it is +a CONSTRUCTOR of scalar elements, where we as an exception allow +smaller number of elements (assuming zero filling) and +consecutive indexes as compared to NULL indexes (such +CONSTRUCTORs can appear in the IL from FEs). 
*/ + FOR_EACH_CONSTRUCTOR_ELT (CONSTRUCTOR_ELTS (rhs1), i, elt_i, elt_v) + { + if (elt_t == NULL_TREE) + { + elt_t = TREE_TYPE (elt_v); + if (TREE_CODE (elt_t) == VECTOR_TYPE) + { + tree elt_t = TREE_TYPE (elt_v); + if (!useless_type_conversion_p (TREE_TYPE (rhs1_type), + TREE_TYPE (elt_t))) + { + error ("incorrect type of vector CONSTRUCTOR " + "elements"); + debug_generic_stmt (rhs1); + return true; + } + else if (CONSTRUCTOR_NELTS (rhs1) + * TYPE_VECTOR_SUBPARTS (elt_t) + != TYPE_VECTOR_SUBPARTS (rhs1_type)) + { + error ("incorrect number of vector CONSTRUCTOR " + "elements"); + debug_generic_stmt (rhs1); + return true; + } + } + else if (!useless_type_conversion_p (TREE_TYPE (rhs1_type), + elt_t)) + { + error ("incorrect type of vector CONSTRUCTOR elements"); + debug_generic_stmt (rhs1); + return true; + } + else if (CONSTRUCTOR_NELTS (rhs1) + > TYPE_VECTOR_SUBPARTS (rhs1_type)) + { + error ("incorrect number of vector CONSTRUCTOR elements"); + debug_generic_stmt (rhs1); + return true; +
[PATCH] Fix PR54735
This fixes PR54735 - a bad interaction of non-up-to-date virtual SSA form, update-SSA and cfg cleanup. Moral of the story: cfg cleanup can remove blocks and thus release SSA names - SSA update is rightfully confused when such a released SSA name is still used at update time. The following patch fixes the case in question simply by making sure to run update-SSA before cfg cleanup. [eventually release_ssa_name could treat virtual operands the same as regular SSA names when they are scheduled for a re-write, that would make this whole mess more robust - I am thinking of this] Bootstrapped on x86_64-unknown-linux-gnu, testing in progress. Richard. 2012-10-02 Richard Guenther rguent...@suse.de PR middle-end/54735 * tree-ssa-pre.c (do_pre): Make sure to update virtual SSA form before cleaning up the CFG. * g++.dg/torture/pr54735.C: New testcase. Index: gcc/tree-ssa-pre.c === --- gcc/tree-ssa-pre.c (revision 191969) +++ gcc/tree-ssa-pre.c (working copy) @@ -4820,6 +4820,13 @@ do_pre (void) free_scc_vn (); + /* Tail merging invalidates the virtual SSA web, together with + cfg-cleanup opportunities exposed by PRE this will wreck the + SSA updating machinery. So make sure to run update-ssa + manually, before eventually scheduling cfg-cleanup as part of + the todo. 
*/ + update_ssa (TODO_update_ssa_only_virtuals); + return todo; } @@ -4845,8 +4852,7 @@ struct gimple_opt_pass pass_pre = 0, /* properties_provided */ 0, /* properties_destroyed */ TODO_rebuild_alias, /* todo_flags_start */ - TODO_update_ssa_only_virtuals | TODO_ggc_collect - | TODO_verify_ssa /* todo_flags_finish */ + TODO_ggc_collect | TODO_verify_ssa /* todo_flags_finish */ } }; Index: gcc/testsuite/g++.dg/torture/pr54735.C === --- gcc/testsuite/g++.dg/torture/pr54735.C (revision 0) +++ gcc/testsuite/g++.dg/torture/pr54735.C (working copy) @@ -0,0 +1,179 @@ +// { dg-do compile } + +class Gmpfr +{}; +class M : Gmpfr +{ +public: + Gmpfr infconst; + M(int); +}; +template<typename>struct A; +template<typename, int, int, int = 0 ? : 0, int = 0, int = 0>class N; +template<typename>class O; +template<typename>struct B; +struct C +{ + enum + { value }; +}; +class D +{ +public: + enum + { ret }; +}; +struct F +{ + enum + { ret = 0 ? : 0 }; +}; +template<typename Derived>struct G +{ + typedef O<Derived> type; +}; +struct H +{ + void operator * (); +}; +struct I +{ + enum + { RequireInitialization = C::value ? : 0, ReadCost }; +}; +template<typename Derived>struct J +{ + enum + { ret = A<Derived>::InnerStrideAtCompileTime }; +}; +template<typename Derived>struct K +{ + enum + { ret = A<Derived>::OuterStrideAtCompileTime }; +}; +template<typename Derived>class P : H +{ +public: + using H::operator *; + typedef typename A<Derived>::Scalar Scalar; + enum + { RowsAtCompileTime= + A<Derived>::RowsAtCompileTime, ColsAtCompileTime = + A<Derived>::ColsAtCompileTime, SizeAtCompileTime = + F::ret, MaxRowsAtCompileTime = + A<Derived>::MaxRowsAtCompileTime, MaxColsAtCompileTime = + A<Derived>::MaxColsAtCompileTime, MaxSizeAtCompileTime = + F::ret, Flags = + A<Derived>::Flags ? : 0 ? : 0, CoeffReadCost = + A<Derived>::CoeffReadCost, InnerStrideAtCompileTime= + J<Derived>::ret, OuterStrideAtCompileTime = K<Derived>::ret }; + B<Derived> operator<< (const Scalar&); };

template<typename Derived>class O : public P<Derived> {}; +
+template<int _Cols>class L +{ +public: + + int cols() + { +return _Cols; + } +}; +template<typename Derived>class Q : public G<Derived>::type +{ +public: + typedef typename G<Derived>::type Base; + typedef typename A<Derived>::Index Index; + typedef typename A<Derived>::Scalar Scalar; + L<Base::ColsAtCompileTime> m_storage; + Index cols() + { +return m_storage.cols(); + } + + Scalar coeffRef(Index, + Index); +}; + +template<typename _Scalar, int _Rows, int _Cols, int _Options, int _MaxRows, + int _MaxCols>struct A<N<_Scalar, _Rows, _Cols, _Options, _MaxRows, + _MaxCols> > +{ + typedef _Scalar Scalar; + typedef int Index; + enum + { RowsAtCompileTime, ColsAtCompileTime = + _Cols, MaxRowsAtCompileTime, MaxColsAtCompileTime, Flags= + D::ret, CoeffReadCost = + I::ReadCost, InnerStrideAtCompileTime, OuterStrideAtCompileTime = + 0 ? : 0 }; +}; +template<typename _Scalar, int, int _Cols, int, int, + int>class N : public Q<N<_Scalar, 0, _Cols> > +{ +public: + Q<N> Base; + template<typename T0, typename T1>N(const T0
Re: [patch] Minor TARGET_MEM_REF cleanup
On Sat, Sep 29, 2012 at 1:17 PM, Eric Botcazou ebotca...@adacore.com wrote: Hi, for simple loops like: extern int a[]; extern int b[]; void foo (int l) { int i; for (i = 0; i < l; i++) a[i] = b [i]; } you get in the .lim3 dump: Unanalyzed memory reference 0: _5 = MEM[symbol: b, index: ivtmp.3_1, step: 4, offset: 0B]; Memory reference 1: MEM[symbol: a, index: ivtmp.3_1, step: 4, offset: 0B] so the pass analyzes the store but not the load, which seems an oversight. The patch also folds copy_mem_ref_info into its only user and removes it. Tested on x86_64-suse-linux, OK for mainline? Please take the opportunity to clean up simple_mem_ref_in_stmt some more. Both loads and stores in assigns require gimple_assign_single_p, thus do if (!gimple_assign_single_p (stmt)) return NULL; before deciding on store/load. To decide that the stmt is a load then do if (TREE_CODE (*lhs) == SSA_NAME && gimple_vuse (stmt)) it is a store if gimple_vdef (stmt) && (TREE_CODE (*rhs) == SSA_NAME || is_gimple_min_invariant (*rhs)) else it may still be an aggregate copy but LIM doesn't handle those (though they may still be interesting for disambiguation ...) The tree-ssa-address parts are ok as-is. Thanks, Richard. 2012-09-29 Eric Botcazou ebotca...@adacore.com * tree.h (copy_mem_ref_info): Delete. * tree-ssa-address.c (copy_mem_ref_info): Likewise. (maybe_fold_tmr): Copy flags manually. * tree-ssa-loop-im.c (simple_mem_ref_in_stmt): Accept TARGET_MEM_REF on the RHS as well. -- Eric Botcazou
Re: [PATCH] Fix PR middle-end/54759
On Sun, Sep 30, 2012 at 9:03 PM, Dehao Chen de...@google.com wrote: Hi, This patch fixes the bug when comparing location to UNKNOWN_LOC. Bootstrapped and passed gcc regression test. Okay for trunk? Ok. Thanks, Richard. Thanks, Dehao 2012-09-30 Dehao Chen de...@google.com PR middle-end/54759 * gcc/tree-vect-loop-manip.c (slpeel_make_loop_iterate_ntimes): Use LOCATION_LOCUS to compare with UNKNOWN_LOCATION. (slpeel_tree_peel_loop_to_edge): Likewise. * gcc/tree-vectorizer.c (vectorize_loops): Likewise.
Re: RFC: LRA for x86/x86-64 [0/9]
On Mon, Oct 1, 2012 at 7:48 AM, Jakub Jelinek ja...@redhat.com wrote: On Sun, Sep 30, 2012 at 06:50:50PM -0400, Vladimir Makarov wrote: But I think that LRA cpu time problem for this test can be fixed. But I don't think I can fix it for 2 weeks. So if people believe that current LRA behaviour on this PR is a stopper to include it into gcc4.8 then we should postpone its inclusion until gcc4.9 when I hope to fix it. I think this testcase shouldn't be a show stopper for LRA inclusion into 4.8, but something to look at for stage3. I agree here. I think a lot of GCC passes have scalability issues on that testcase, that is why it must be compiled with -O1 and not higher optimization options, so perhaps it would be enough to choose a faster algorithm generating worse code for the huge functions and -O1. Yes, we spent quite some time in making basic optimization work for insane testcases (basically avoid quadratic or bigger complexity in any IL size variable (number of basic-blocks, edges, instructions, pseudos, etc.)). And indeed if you use -O2 we do have issues with existing passes (and even at -O1 points-to analysis can wreck things, or even profile guessing - see existing bugs for that). Basically I would tune -O1 towards being able to compile and optimize insane testcases with memory and compile-time requirements that are linear in any of the above complexity measures. Thus, falling back to the -O0 register allocating strategy at certain thresholds for the above complexity measures is fine (existing IRA for example has really bad scaling on the number of loops in the function, but you can tweak with flags to make it not consider that). And I agree it is primarily a bug in the generator that it creates such huge functions; those can't perform very well. Well, not for -O2+, yes, but at least we should try(!) hard. Thanks, Richard. Jakub
Re: Profile housekeeping 5/n (make RTL loop optimizers to use loop bounds better)
On Mon, Oct 1, 2012 at 11:37 AM, Jan Hubicka hubi...@ucw.cz wrote: Hi, this patch commonizes the maximal iteration estimate logic between SCEV and loop-iv. Both are now using loop->nb_iterations_upper_bound. I decided to keep the same API for SCEV code as for RTL code, so I made estimated_loop_iterations and max_loop_iterations not try to recompute bounds, and ICE when invoked without SCEV initialized. The patch updates RTL optimizers to use estimated_loop_iterations and max_loop_iterations. This has a few advantages: 1) the loop unroller can now take into account estimates stored into the loop structure by an earlier pass (I think none exist though). It is however better than using expected_loop_iterations since the profile might get out of date with expansion. 2) the loop peeling code now uses max iteration bounds. This makes it possible to e.g. peel vectorizer prologues/epilogues/scalar loops, so -fpeel-loops now improves my low iteration count testcase by about 10%. 3) Same for loop unswitching. I am not really a friend of the new double_int API. I copied some existing examples but find it ugly. Why do we miss operators for comparisons and division? Why can't from_*/to_* be a cast at least for basic integer types? Regtested/bootstrapped x86_64-linux, seems sane? Yes. Can you add a testcase or two? I tweaked RTL unroll/peel after preserving loops to not blindly unroll/peel everything 8 times (not sure if _I_ added testcases ...). I also wonder if the loop vectorizer should not update the estimates after the loop iteration count is reduced by vectorizing. Probably yes. Thanks, Richard. Honza * loop-unswitch.c (unswitch_single_loop): Use estimated_loop_iterations_int to prevent unswitching when loop is known to not roll. * tree-ssa-loop-niter.c (estimated_loop_iterations): Do not segfault when SCEV is not initialized. (max_loop_iterations): Likewise. * tree-ssa-loop-unswitch.c (tree_ssa_unswitch_loops): Use estimated_loop_iterations_int to prevent unswitching when loop is known to not roll. 
* tree-scalar-evolution.c (scev_initialized_p): New function. * tree-scalar-evolution.h (scev_initialized_p): Likewise. * loop-unroll.c (decide_peel_once_rolling): Use max_loop_iterations_int. (unroll_loop_constant_iterations): Update nb_iterations_upper_bound and nb_iterations_estimate. (decide_unroll_runtime_iterations): Use estimated_loop_iterations or max_loop_iterations. (unroll_loop_runtime_iterations): Fix profile updating. (decide_peel_simple): Use estimated_loop_iterations and max_loop_iterations. (decide_unroll_stupid): Use estimated_loop_iterations and max_loop_iterations. * loop-doloop.c (doloop_modify): Use max_loop_iterations_int. (doloop_optimize): Likewise. * loop-iv.c (iv_number_of_iterations): Use record_niter_bound. (find_simple_exit): Likewise. * cfgloop.h (struct niter_desc): Remove niter_max. Index: loop-unswitch.c === *** loop-unswitch.c (revision 191867) --- loop-unswitch.c (working copy) *** unswitch_single_loop (struct loop *loop, *** 257,262 --- 257,263 rtx cond, rcond = NULL_RTX, conds, rconds, acond, cinsn; int repeat; edge e; + HOST_WIDE_INT iterations; /* Do not unswitch too much. */ if (num > PARAM_VALUE (PARAM_MAX_UNSWITCH_LEVEL)) *** unswitch_single_loop (struct loop *loop, *** 299,305 } /* Nor if the loop usually does not roll. */ ! if (expected_loop_iterations (loop) < 1) { if (dump_file) fprintf (dump_file, ";; Not unswitching, loop iterations < 1\n"); --- 300,307 } /* Nor if the loop usually does not roll. */ ! iterations = estimated_loop_iterations_int (loop); ! if (iterations >= 0 && iterations <= 1) { if (dump_file) fprintf (dump_file, ";; Not unswitching, loop iterations < 1\n"); Index: tree-ssa-loop-niter.c === *** tree-ssa-loop-niter.c (revision 191867) --- tree-ssa-loop-niter.c (working copy) *** estimate_numbers_of_iterations_loop (str *** 3012,3020 bool estimated_loop_iterations (struct loop *loop, double_int *nit) { ! estimate_numbers_of_iterations_loop (loop); if (!loop->any_estimate) ! 
return false; *nit = loop->nb_iterations_estimate; return true; --- 3012,3034 bool estimated_loop_iterations (struct loop *loop, double_int *nit) { ! /* When SCEV information is available, try to update loop iterations ! estimate. Otherwise just return whatever we recorded earlier. */ ! if (scev_initialized_p ()) !
Re: Constant-fold vector comparisons
On Sat, Sep 29, 2012 at 3:25 PM, Marc Glisse marc.gli...@inria.fr wrote: Hello, this patch does 2 things (I should have split it in 2, but the questions go together): 1) it handles constant folding of vector comparisons, 2) it fixes another place where vectors are not expected (I'll probably wait to have front-end support and testcases to do more of those, but there is something to discuss). I wasn't sure what integer_truep should test exactly. For integer: == 1 or != 0? For vectors: == -1 or < 0? I chose the one that worked best for the forwprop case where I used it. It seems that before this patch, the middle-end didn't know how comparison results were encoded (a good reason for VEC_COND_EXPR to require a comparison as its first argument). I am using the OpenCL encoding that what matters is the high bit of each vector element. I am not quite sure what happens for targets (are there any?) that use a different encoding. When expanding vcond, they can do the comparison as they like. When expanding an isolated comparison, I expect they have to expand it as vcond(a<b,-1,0). So it should be ok, but I could easily have missed something. Comments below 2012-10-01 Marc Glisse marc.gli...@inria.fr gcc/ * tree.c (integer_truep): New function. * tree.h (integer_truep): Declare. * tree-ssa-forwprop.c (forward_propagate_into_cond): Call it. Don't use boolean_type_node for vectors. * fold-const.c (fold_relational_const): Handle VECTOR_CST. gcc/testsuite/ * gcc.dg/tree-ssa/foldconst-6.c: New testcase. -- Marc Glisse Index: gcc/tree.h === --- gcc/tree.h (revision 191850) +++ gcc/tree.h (working copy) @@ -5272,20 +5272,25 @@ extern int integer_zerop (const_tree); /* integer_onep (tree x) is nonzero if X is an integer constant of value 1. */ extern int integer_onep (const_tree); /* integer_all_onesp (tree x) is nonzero if X is an integer constant all of whose significant bits are 1. 
*/ extern int integer_all_onesp (const_tree); +/* integer_truep (tree x) is nonzero if X is an integer constant of value 1, + or a vector constant of value < 0. */ + +extern bool integer_truep (const_tree); + /* integer_pow2p (tree x) is nonzero is X is an integer constant with exactly one bit 1. */ extern int integer_pow2p (const_tree); /* integer_nonzerop (tree x) is nonzero if X is an integer constant with a nonzero value. */ extern int integer_nonzerop (const_tree); Index: gcc/tree-ssa-forwprop.c === --- gcc/tree-ssa-forwprop.c (revision 191850) +++ gcc/tree-ssa-forwprop.c (working copy) @@ -564,46 +564,46 @@ forward_propagate_into_cond (gimple_stmt enum tree_code code; tree name = cond; gimple def_stmt = get_prop_source_stmt (name, true, NULL); if (!def_stmt || !can_propagate_from (def_stmt)) return 0; code = gimple_assign_rhs_code (def_stmt); if (TREE_CODE_CLASS (code) == tcc_comparison) tmp = fold_build2_loc (gimple_location (def_stmt), code, - boolean_type_node, + TREE_TYPE (cond), That's obvious. gimple_assign_rhs1 (def_stmt), gimple_assign_rhs2 (def_stmt)); else if ((code == BIT_NOT_EXPR && TYPE_PRECISION (TREE_TYPE (cond)) == 1) || (code == BIT_XOR_EXPR - && integer_onep (gimple_assign_rhs2 (def_stmt + && integer_truep (gimple_assign_rhs2 (def_stmt See below. 
{ tmp = gimple_assign_rhs1 (def_stmt); swap = true; } } if (tmp && is_gimple_condexpr (tmp)) { if (dump_file && tmp) { fprintf (dump_file, "Replaced '"); print_generic_expr (dump_file, cond, 0); fprintf (dump_file, "' with '"); print_generic_expr (dump_file, tmp, 0); fprintf (dump_file, "'\n"); } - if (integer_onep (tmp)) + if (integer_truep (tmp)) gimple_assign_set_rhs_from_tree (gsi_p, gimple_assign_rhs2 (stmt)); else if (integer_zerop (tmp)) gimple_assign_set_rhs_from_tree (gsi_p, gimple_assign_rhs3 (stmt)); else { gimple_assign_set_rhs1 (stmt, unshare_expr (tmp)); if (swap) { tree t = gimple_assign_rhs2 (stmt); gimple_assign_set_rhs2 (stmt, gimple_assign_rhs3 (stmt)); Index: gcc/testsuite/gcc.dg/tree-ssa/foldconst-6.c === --- gcc/testsuite/gcc.dg/tree-ssa/foldconst-6.c (revision 0) +++ gcc/testsuite/gcc.dg/tree-ssa/foldconst-6.c (revision 0) @@ -0,0 +1,14 @@ +/* {
[PATCH] Fix PR47799 - debug info for early-inlining with LTO
This tries to emit proper debug information for early-inlined functions from LTO LTRANS phase (thus, emit DW_TAG_inlined_subroutine and allow gdb to set breakpoints). We need to avoid confusing LTO and dwarf2out with the full abstract block tree, so this patch flattens the abstract block tree by always using the ultimate origin for BLOCK_ABSTRACT_ORIGIN on blocks which are inlined_function_outer_scope_p. Thus, it tries to output the minimal info dwarf2out.c needs to emit the desired debug information. As with LTO all abstract inline instances get generated late for early inlined functions I had to amend the hack for extern inline functions to always output dies for decls that come their way through dwarf2out_abstract_function. And further down not crash on a NULL DECL_INITIAL (when LTO decided to output the function body in another LTRANS unit or if it does not get output at all). Currently LTO-bootstrapping and testing on x86_64-unknown-linux-gnu. Jason, are the dwarf2out bits ok with you? I've sofar toyed with examples like int x, y; static inline int foo (int i) { y = i; return y; } static inline int bar (int i) { x = i; return foo (x); } int main () { int k = 0; int res = bar (k); return res; } and debug information with/without LTO is now reasonably the same and I can set breakpoints on the inlined instances. Thanks, Richard. 2012-10-01 Richard Guenther rguent...@suse.de PR lto/47788 * tree-streamer-out.c (write_ts_block_tree_pointers): For inlined functions outer scopes write the ultimate origin as BLOCK_ABSTRACT_ORIGIN and BLOCK_SOURCE_LOCATION. Do not stream the fragment chains. (lto_input_ts_block_tree_pointers): Likewise. * dwarf2out.c (gen_subprogram_die): Handle NULL DECL_INITIAL. (dwarf2out_decl): Always output DECL_ABSTRACT function decls. 
Index: gcc/tree-streamer-in.c === *** gcc/tree-streamer-in.c (revision 191824) --- gcc/tree-streamer-in.c (working copy) *** static void *** 789,810 lto_input_ts_block_tree_pointers (struct lto_input_block *ib, struct data_in *data_in, tree expr) { - /* Do not stream BLOCK_SOURCE_LOCATION. We cannot handle debug information - for early inlining so drop it on the floor instead of ICEing in - dwarf2out.c. */ BLOCK_VARS (expr) = streamer_read_chain (ib, data_in); - /* Do not stream BLOCK_NONLOCALIZED_VARS. We cannot handle debug information - for early inlining so drop it on the floor instead of ICEing in - dwarf2out.c. */ - BLOCK_SUPERCONTEXT (expr) = stream_read_tree (ib, data_in); ! /* Do not stream BLOCK_ABSTRACT_ORIGIN. We cannot handle debug information ! for early inlining so drop it on the floor instead of ICEing in dwarf2out.c. */ ! BLOCK_FRAGMENT_ORIGIN (expr) = stream_read_tree (ib, data_in); ! BLOCK_FRAGMENT_CHAIN (expr) = stream_read_tree (ib, data_in); /* We re-compute BLOCK_SUBBLOCKS of our parent here instead of streaming it. For non-BLOCK BLOCK_SUPERCONTEXTs we still --- 789,810 lto_input_ts_block_tree_pointers (struct lto_input_block *ib, struct data_in *data_in, tree expr) { BLOCK_VARS (expr) = streamer_read_chain (ib, data_in); BLOCK_SUPERCONTEXT (expr) = stream_read_tree (ib, data_in); ! /* Stream BLOCK_ABSTRACT_ORIGIN and BLOCK_SOURCE_LOCATION for ! the limited cases we can handle - those that represent inlined ! function scopes. For the rest, drop them on the floor instead of ICEing in dwarf2out.c. */ ! BLOCK_ABSTRACT_ORIGIN (expr) = stream_read_tree (ib, data_in); ! BLOCK_SOURCE_LOCATION (expr) = lto_input_location (ib, data_in); ! /* Do not stream BLOCK_NONLOCALIZED_VARS. We cannot handle debug information ! for early inlined BLOCKs so drop it on the floor instead of ICEing in ! dwarf2out.c. */ ! ! /* BLOCK_FRAGMENT_ORIGIN and BLOCK_FRAGMENT_CHAIN are not live at LTO ! streaming time. 
 */
 
    /* We re-compute BLOCK_SUBBLOCKS of our parent here instead
       of streaming it.  For non-BLOCK BLOCK_SUPERCONTEXTs we still

Index: gcc/tree-streamer-out.c
===
*** gcc/tree-streamer-out.c	(revision 191824)
--- gcc/tree-streamer-out.c	(working copy)
*** write_ts_exp_tree_pointers (struct outpu
*** 682,702 ***
  static void
  write_ts_block_tree_pointers (struct output_block *ob, tree expr, bool ref_p)
  {
-   /* Do not stream BLOCK_SOURCE_LOCATION.  We cannot handle debug information
-      for early inlining so drop it on the floor instead of ICEing in
-      dwarf2out.c.  */
    streamer_write_chain (ob, BLOCK_VARS (expr), ref_p);
 
    /* Do not stream BLOCK_NONLOCALIZED_VARS.  We cannot handle debug information
!      for early inlining so drop it on the floor
Re: [RFC] Make vectorizer to skip loops with small iteration estimate
On Sun, 30 Sep 2012, Jan Hubicka wrote:

Hi,
the point of the following patch is to make the vectorizer not vectorize the following testcase with profile feedback:

int a[1];
int i=5;
int k=2;
int val;
__attribute__ ((noinline,noclone))
test()
{
  int j;
  for(j=0;j<k;j++)
    a[j]=val;
}
main()
{
  while (i)
    {
      test ();
      i--;
    }
}

Here the compiler should work out that the second loop iterates 2 times on average and thus it is not a good candidate for vectorizing. In my first attempt I added the following:

@@ -1474,6 +1478,18 @@ vect_analyze_loop_operations (loop_vec_i
       return false;
     }
 
+  if ((estimated_niter = estimated_stmt_executions_int (loop)) != -1
+      && (unsigned HOST_WIDE_INT) estimated_niter <= th)
+    {
+      if (vect_print_dump_info (REPORT_UNVECTORIZED_LOCATIONS))
+        fprintf (vect_dump, "not vectorized: estimated iteration count too small.");
+      if (vect_print_dump_info (REPORT_DETAILS))
+        fprintf (vect_dump, "not vectorized: estimated iteration count smaller than "
+                 "user specified loop bound parameter or minimum "
+                 "profitable iterations (whichever is more conservative).");
+      return false;
+    }
+

But to my surprise it does not help. There are two things:

1) the value of TH is a bit low. The way the cost model works is that it finds the minimal niters where the vectorized loop with all the setup costs is cheaper than the scalar loop. I.e.

  /* Calculate number of iterations required to make the vector version
     profitable, relative to the loop bodies only.  The following condition
     must hold true:
     SIC * niters + SOC > VIC * ((niters - PL_ITERS - EP_ITERS) / VF) + VOC  (A)
     where
     SIC = scalar iteration cost, VIC = vector iteration cost,
     VOC = vector outside cost, VF = vectorization factor,
     PL_ITERS = prologue iterations, EP_ITERS = epilogue iterations,
     SOC = scalar outside cost for run time cost model check.  */

This value is used both for 1) the decision whether the number of iterations is too low (when the maximum number of iterations is known) and 2) the runtime decision whether to take the vectorized path or the scalar path.
The vectorized loop looks like:

  k.1_10 = k;
  if (k.1_10 > 0)
    {
      pretmp_2 = val;
      niters.8_4 = (unsigned int) k.1_10;
      bnd.9_13 = niters.8_4 >> 2;
      ratio_mult_vf.10_1 = bnd.9_13 << 2;
      _18 = niters.8_4 <= 3;
      _19 = ratio_mult_vf.10_1 == 0;
      _20 = _19 | _18;
      if (_20 != 0)
        scalar loop
      else
        vector prologue
    }

So the unvectorized cost is
  SIC * niters
The vectorized path costs
  SOC + VIC * ((niters - PL_ITERS - EP_ITERS) / VF) + VOC
and the scalar path of the vectorized loop costs
  SIC * niters + SOC

Note that 'th' is used for the runtime profitability check which is done at the time the setup cost has already been taken (yes, we probably should make it more conservative but then guard the whole set of loops by the check, not only the vectorized path).  See PR53355 for the general issue.

It makes sense to vectorize if
  SIC * niters > SOC + VIC * ((niters - PL_ITERS - EP_ITERS) / VF) + VOC  (B)
That is, in the optimal case where we actually vectorize, the overall speed of the vectorized loop including the runtime check is better.

It makes sense to take the vector loop if
  SIC * niters > VIC * ((niters - PL_ITERS - EP_ITERS) / VF) + VOC  (C)
because the scalar path is otherwise taken.

The attached patch implements formula (C) and uses it to determine the decision based on the number-of-iterations estimate (that is usually provided by the feedback).

As a reality check, I tried my testcase:

9: Cost model analysis:
  Vector inside of loop cost: 1
  Vector prologue cost: 7
  Vector epilogue cost: 2
  Scalar iteration cost: 1
  Scalar outside cost: 6
  Vector outside cost: 9
  prologue iterations: 0
  epilogue iterations: 2
  Calculated minimum iters for profitability: 4
9: Profitability threshold = 3
9: Profitability estimated iterations threshold = 20

This is overrated. The loop starts to be beneficial at about 4 iterations in reality. I guess the values are kind of wrong. Vector inside of loop cost and Scalar iteration cost seem to ignore the fact that the loops do contain some control flow that should account for at least one extra cycle.
Vector prologue cost seems a bit overrated for one pack operation. Of course this is a very simple benchmark; in reality the vectorization can be a lot more harmful by complicating more complex control flows. So I guess we have two options 1) go with the new formula and try to make the cost model a bit more realistic, 2) stay with the original formula that is quite close to
[PATCH] Fix -frounding-math builtins
I noticed that we attach the no-vops attribute to -frounding-math math functions. That's bogus, as can be seen from the testcase

int fesetround(int);
double asinh(double x);
double foo (double x, int b)
{
  double y = 0.0, z;
  if (b)
    y = asinh (x);
  fesetround (0x400 /*FE_DOWNWARD*/);
  z = asinh (x);
  return y + z;
}

where PRE, rightfully so, removes a seeming partial redundancy by inserting an asinh call into the else block. That's because it exactly does _not_ get to see the rounding-mode clobbering fesetround call, as asinh does not have a virtual operand. Fixed as follows.

Bootstrapped and tested on x86_64-unknown-linux-gnu.

Richard.

2012-10-01  Richard Guenther  rguent...@suse.de

	* builtins.def (ATTR_MATHFN_FPROUNDING): Do not use no-vops
	with -frounding-math.
	* builtin-attrs.def (ATTR_PURE_NOTHROW_NOVOPS_LIST): Remove.
	(ATTR_PURE_NOTHROW_NOVOPS_LEAF_LIST): Likewise.

Index: gcc/builtins.def
===
*** gcc/builtins.def	(revision 191917)
--- gcc/builtins.def	(working copy)
*** along with GCC; see the file COPYING3.
*** 163,169 ***
     memory.  */
  #undef ATTR_MATHFN_FPROUNDING
  #define ATTR_MATHFN_FPROUNDING (flag_rounding_math ? \
! 				ATTR_PURE_NOTHROW_NOVOPS_LEAF_LIST : ATTR_CONST_NOTHROW_LEAF_LIST)
 
  /* Define an attribute list for math functions that are normally
     impure because some of them may write into global memory for
--- 163,169 ---
     memory.  */
  #undef ATTR_MATHFN_FPROUNDING
  #define ATTR_MATHFN_FPROUNDING (flag_rounding_math ? \
! 
ATTR_PURE_NOTHROW_LEAF_LIST : ATTR_CONST_NOTHROW_LEAF_LIST)
 
  /* Define an attribute list for math functions that are normally
     impure because some of them may write into global memory for

Index: gcc/builtin-attrs.def
===
*** gcc/builtin-attrs.def	(revision 191917)
--- gcc/builtin-attrs.def	(working copy)
*** DEF_ATTR_TREE_LIST (ATTR_PURE_NOTHROW_LI
*** 127,136 ***
  		    ATTR_NULL, ATTR_NOTHROW_LIST)
  DEF_ATTR_TREE_LIST (ATTR_PURE_NOTHROW_LEAF_LIST, ATTR_PURE, \
  		    ATTR_NULL, ATTR_NOTHROW_LEAF_LIST)
- DEF_ATTR_TREE_LIST (ATTR_PURE_NOTHROW_NOVOPS_LIST, ATTR_NOVOPS, \
- 		    ATTR_NULL, ATTR_PURE_NOTHROW_LIST)
- DEF_ATTR_TREE_LIST (ATTR_PURE_NOTHROW_NOVOPS_LEAF_LIST, ATTR_NOVOPS,\
- 		    ATTR_NULL, ATTR_PURE_NOTHROW_LEAF_LIST)
  DEF_ATTR_TREE_LIST (ATTR_NORETURN_NOTHROW_LIST, ATTR_NORETURN,\
  		    ATTR_NULL, ATTR_NOTHROW_LIST)
  DEF_ATTR_TREE_LIST (ATTR_NORETURN_NOTHROW_LEAF_LIST, ATTR_NORETURN,\
--- 127,132
Re: [PATCH] Add option for dumping to stderr (issue6190057)
On Mon, Oct 1, 2012 at 3:55 PM, Sharad Singhai sing...@google.com wrote: On Mon, Oct 1, 2012 at 6:52 AM, H.J. Lu hjl.to...@gmail.com wrote: On Mon, Oct 1, 2012 at 6:49 AM, Sharad Singhai sing...@google.com wrote:

I am sorry, I didn't enable all the languages. Will fix the Fortran test breakage shortly.

It is not just Fortran. There are some failures in C testcases.

I checked and those files looked like generator files for Fortran tests and thus were not exercised in my configuration. I am really sorry about that. I am fixing it.

As I said, you should not enable/disable anything special but configure with all default languages enabled (no --enable-languages) and do a toplevel make -k check, preferably also exercising multilibs with RUNTESTFLAGS=--target_board=unix/\{,-m32\}

Richard.

UNSUPPORTED: gcc.dg/tree-ssa/gen-vect-11.c
UNSUPPORTED: gcc.dg/tree-ssa/gen-vect-11a.c
UNSUPPORTED: gcc.dg/tree-ssa/gen-vect-11b.c
UNSUPPORTED: gcc.dg/tree-ssa/gen-vect-11c.c
UNSUPPORTED: gcc.dg/tree-ssa/gen-vect-2.c
UNSUPPORTED: gcc.dg/tree-ssa/gen-vect-25.c
UNSUPPORTED: gcc.dg/tree-ssa/gen-vect-26.c
UNSUPPORTED: gcc.dg/tree-ssa/gen-vect-28.c
UNSUPPORTED: gcc.dg/tree-ssa/gen-vect-32.c

Thanks,
Sharad

On Mon, Oct 1, 2012 at 4:50 AM, H.J. Lu hjl.to...@gmail.com wrote: On Sun, Sep 30, 2012 at 11:36 PM, Sharad Singhai sing...@google.com wrote:

Resend to gcc-patches

I have addressed the comments by fixing all the minor issues, bootstrapped and tested on x86_64. I did the recommended reshuffling by moving non-tree code from tree-dump.c into a new file dumpfile.c. I committed two successive revisions:

r191883  Main patch with the dump infrastructure changes. However, I accidentally left out a new file, dumpfile.c.
r191884  Added dumpfile.c, and did the renaming of dump_* functions from gimple_pretty_print.[ch].

As things stand right now, r191883 is broken because of the missing file 'dumpfile.c', which the very next commit fixes.
Anyone who got broken revision r191883, please svn update. I am really very sorry about that. I have a couple more minor patches which deal with renaming; I plan to address those later.

It caused:

FAIL: gcc.dg/tree-ssa/gen-vect-11.c (test for excess errors)
FAIL: gcc.dg/tree-ssa/gen-vect-11.c scan-tree-dump-times vect "vectorized 1 loops" 1
FAIL: gcc.dg/tree-ssa/gen-vect-11a.c (test for excess errors)
FAIL: gcc.dg/tree-ssa/gen-vect-11a.c scan-tree-dump-times vect "vectorized 1 loops" 1
FAIL: gcc.dg/tree-ssa/gen-vect-11b.c (test for excess errors)
FAIL: gcc.dg/tree-ssa/gen-vect-11b.c scan-tree-dump-times vect "vectorized 0 loops" 1
FAIL: gcc.dg/tree-ssa/gen-vect-11c.c (test for excess errors)
FAIL: gcc.dg/tree-ssa/gen-vect-11c.c scan-tree-dump-times vect "vectorized 0 loops" 1
FAIL: gcc.dg/tree-ssa/gen-vect-2.c (test for excess errors)
FAIL: gcc.dg/tree-ssa/gen-vect-2.c scan-tree-dump-times vect "vectorized 1 loops" 1
FAIL: gcc.dg/tree-ssa/gen-vect-25.c (test for excess errors)
FAIL: gcc.dg/tree-ssa/gen-vect-25.c scan-tree-dump-times vect "vectorized 2 loops" 1
FAIL: gcc.dg/tree-ssa/gen-vect-26.c (test for excess errors)
FAIL: gcc.dg/tree-ssa/gen-vect-26.c scan-tree-dump-times vect "Alignment of access forced using peeling" 1
FAIL: gcc.dg/tree-ssa/gen-vect-26.c scan-tree-dump-times vect "vectorized 1 loops" 1
FAIL: gcc.dg/tree-ssa/gen-vect-28.c (test for excess errors)
FAIL: gcc.dg/tree-ssa/gen-vect-28.c scan-tree-dump-times vect "Alignment of access forced using peeling" 1
FAIL: gcc.dg/tree-ssa/gen-vect-28.c scan-tree-dump-times vect "vectorized 1 loops" 1
FAIL: gcc.dg/tree-ssa/gen-vect-32.c (test for excess errors)
FAIL: gcc.dg/tree-ssa/gen-vect-32.c scan-tree-dump-times vect "vectorized 1 loops" 1
FAIL: gfortran.dg/vect/O3-pr36119.f90 (test for excess errors)
FAIL: gfortran.dg/vect/O3-pr39595.f (test for excess errors)
FAIL: gfortran.dg/vect/Ofast-pr50414.f90 (test for excess errors)
FAIL: gfortran.dg/vect/cost-model-pr34445.f (test for excess errors)
FAIL:
gfortran.dg/vect/cost-model-pr34445a.f (test for excess errors)
FAIL: gfortran.dg/vect/fast-math-pr38968.f90 (test for excess errors)
FAIL: gfortran.dg/vect/fast-math-pr38968.f90 scan-tree-dump vect "vectorized 1 loops"
FAIL: gfortran.dg/vect/fast-math-real8-pr40801.f90 (test for excess errors)
FAIL: gfortran.dg/vect/fast-math-real8-pr40801.f90 (test for excess errors)
FAIL: gfortran.dg/vect/fast-math-vect-8.f90 (test for excess errors)
FAIL: gfortran.dg/vect/fast-math-vect-8.f90 scan-tree-dump-times vect "vectorized 1 loops" 1
FAIL: gfortran.dg/vect/no-fre-no-copy-prop-O3-pr51704.f90 (test for excess errors)
FAIL: gfortran.dg/vect/no-vfa-pr32377.f90 (test for excess errors)
FAIL: gfortran.dg/vect/no-vfa-pr32377.f90 scan-tree-dump-times vect "vectorized 2 loops" 1
FAIL: gfortran.dg/vect/no-vfa-pr32457.f90 (test for excess errors)
FAIL: gfortran.dg/vect/no-vfa-pr32457.f90 scan-tree-dump-times vect "vectorized 0 loops" 1
FAIL: gfortran.dg/vect/pr19049.f90
Re: RFC: LRA for x86/x86-64 [0/9]
On Sun, Sep 30, 2012 at 6:52 PM, Steven Bosscher stevenb@gmail.com wrote:

Hi,

To look at it in yet another way:

integrated RA            : 189.34 (16%) usr
LRA non-specific         :  59.82 ( 5%) usr
LRA virtuals elimination :  56.79 ( 5%) usr
LRA create live ranges   : 175.30 (15%) usr
LRA hard reg assignment  : 130.85 (11%) usr

The IRA pass is slower than the next-slowest pass (tree PRE) by almost a factor 2.5. Each of the individually-measured *phases* of LRA is slower than the complete IRA *pass*. These 5 timevars together make up 52% of all compile time.

That figure indeed makes IRA + LRA look bad. Did you by chance identify anything obvious that can be done to improve the situation?

Thanks,
Richard.

IRA already has scalability problems, let's not add more of that with LRA.

Ciao!
Steven
Re: RFC: LRA for x86/x86-64 [0/9]
On Fri, Sep 28, 2012 at 12:56 AM, Vladimir Makarov vmaka...@redhat.com wrote:

Originally I was going to submit LRA at the very beginning of stage1 for gcc4.9, as discussed at this summer's GNU Tools Cauldron. After some thinking, I've decided to submit LRA now, but only switched on for the *x86/x86-64* target. The reasons for that are:

o I am already pretty confident in LRA for this target in terms of reliability, performance, code size, and compiler speed.
o I am confident that I can fix LRA bugs and pitfalls which might be recognized and reported during stage2 and 3 of gcc4.8.
o Wider LRA testing for x86/x86-64 will smooth the hard transition of other targets to LRA during gcc4.9 development.

During development of gcc4.9, I'd like to switch major targets to LRA as was planned before. I hope that all targets will be switched for the next release after gcc4.9 (although that will depend mostly on the target maintainers). When/if that is done, reload and reload-oriented machine-dependent code can be removed.

The LRA project was presented at the 2012 GNU Tools Cauldron (http://gcc.gnu.org/wiki/cauldron2012). The presentation contains a high-level description of LRA and the project status.

The following patches make LRA work for x86/x86-64. Individually, the patches mostly do nothing until the last patch switches on LRA for x86/x86-64. Although the compiler bootstraps after applying each patch in the given order, the division is only for review convenience. Any comments and proposals are appreciated. Even if the GCC community decides that it is too late to submit it for gcc4.8, early reviews are always useful.

From a release-manager point of view the patch is in time for 4.8, in that it is during stage1 (which I expect to last another two to four weeks). Note that there is no such thing as stage2 anymore but we go straight to stage3 (bugfixing mode, no new features) from stage1.
After three months of stage3 we go into regression-fixes only mode for as long as there are release-blocking bugs (regressions with priority P1). You will have roughly half a year to fix LRA for 4.8.0 after stage1 closes. Thanks, Richard. The patches were successfully bootstrapped and tested for x86/x86-64.
Re: vec_cond_expr adjustments
On Fri, Sep 28, 2012 at 12:42 AM, Marc Glisse marc.gli...@inria.fr wrote:

Hello,

I have been experimenting with generating VEC_COND_EXPR from the front-end, and these are just a couple of things I noticed.

1) optabs.c requires that the first argument of vec_cond_expr be a comparison, but verify_gimple_assign_ternary only checks is_gimple_condexpr, like for COND_EXPR. In the long term, it seems better to also allow ssa_name and vector_cst (thus match the is_gimple_condexpr condition), but for now I just want to know early if I created an invalid vec_cond_expr.

optabs should be fixed instead; an is_gimple_val condition is implicitly val != 0.

2) a little refactoring of the code to find a suitable vector type for comparison results, and one more place where it should be used (no testcase yet because I don't know if that path can be taken without front-end changes first).

Yes, it looks fine to me. I did wonder, for tree-ssa-forwprop, about using directly TREE_TYPE (cond) without truth_type_for.

Yes, that should work.

Hmm, now I am wondering whether I should have waited until I had front-end vec_cond_expr support to submit everything at once... ;)

The tree.[ch] and gimple-fold.c hunks are ok if tested properly, the tree-ssa-forwprop.c idea of using TREE_TYPE (cond), too. I don't like the tree-cfg.c change; instead re-factor optabs.c to get a decomposed cond for vector_compare_rtx and appropriately decompose a non-comparison-class cond in expand_vec_cond_expr. If we for example have

  predicate = a > b;
  x = predicate ? d : e;
  y = predicate ? f : g;

we ideally want to re-use the predicate computation on targets where that would be optimal (and combine should be able to recover the case where it is not).

Thanks,
Richard.

2012-09-27  Marc Glisse  marc.gli...@inria.fr

	* tree-cfg.c (verify_gimple_assign_ternary): Stricter check on
	first argument of VEC_COND_EXPR.
	* tree.c (truth_type_for): New function.
	* tree.h (truth_type_for): Declare.
	* gimple-fold.c (and_comparisons_1): Call it.
	(or_comparisons_1): Likewise.
	* tree-ssa-forwprop.c (forward_propagate_into_cond): Likewise.

-- 
Marc Glisse

Index: gcc/tree-ssa-forwprop.c
===
--- gcc/tree-ssa-forwprop.c	(revision 191810)
+++ gcc/tree-ssa-forwprop.c	(working copy)
@@ -549,21 +549,22 @@
 static bool
 forward_propagate_into_cond (gimple_stmt_iterator *gsi_p)
 {
   gimple stmt = gsi_stmt (*gsi_p);
   tree tmp = NULL_TREE;
   tree cond = gimple_assign_rhs1 (stmt);
   bool swap = false;
 
   /* We can do tree combining on SSA_NAME and comparison expressions.  */
   if (COMPARISON_CLASS_P (cond))
     tmp = forward_propagate_into_comparison_1 (stmt, TREE_CODE (cond),
-					       boolean_type_node,
+					       truth_type_for
+						 (TREE_TYPE (cond)),
					       TREE_OPERAND (cond, 0),
					       TREE_OPERAND (cond, 1));
   else if (TREE_CODE (cond) == SSA_NAME)
     {
       enum tree_code code;
       tree name = cond;
       gimple def_stmt = get_prop_source_stmt (name, true, NULL);
       if (!def_stmt || !can_propagate_from (def_stmt))
         return 0;

Index: gcc/tree-cfg.c
===
--- gcc/tree-cfg.c	(revision 191810)
+++ gcc/tree-cfg.c	(working copy)
@@ -3758,22 +3758,24 @@ verify_gimple_assign_ternary (gimple stm
   tree rhs2_type = TREE_TYPE (rhs2);
   tree rhs3 = gimple_assign_rhs3 (stmt);
   tree rhs3_type = TREE_TYPE (rhs3);
 
   if (!is_gimple_reg (lhs))
     {
       error ("non-register as LHS of ternary operation");
       return true;
     }
 
-  if (((rhs_code == VEC_COND_EXPR || rhs_code == COND_EXPR)
-       ? !is_gimple_condexpr (rhs1) : !is_gimple_val (rhs1))
+  if (((rhs_code == COND_EXPR) ? !is_gimple_condexpr (rhs1)
+       : (rhs_code == VEC_COND_EXPR) ? (!is_gimple_condexpr (rhs1)
+					|| is_gimple_val (rhs1))
+       : !is_gimple_val (rhs1))
       || !is_gimple_val (rhs2)
       || !is_gimple_val (rhs3))
     {
       error ("invalid operands in ternary operation");
       return true;
     }
 
   /* First handle operations that involve different types.  */
   switch (rhs_code)
     {

Index: gcc/gimple-fold.c
===
--- gcc/gimple-fold.c	(revision 191810)
+++ gcc/gimple-fold.c	(working copy)
@@ -23,21 +23,20 @@ along with GCC; see the file COPYING3.
 #include "coretypes.h"
 #include "tm.h"
 #include "tree.h"
 #include "flags.h"
 #include "function.h"
 #include "dumpfile.h"
 #include "tree-flow.h"
 #include "tree-ssa-propagate.h"
 #include "target.h"
 #include "gimple-fold.h"
-#include
[PATCH][LTO] properly stream PARM_DECLs
This avoids streaming PARM_DECLs both in the global type/decl and the local function sections. They are needed on the global level, so properly stream them there. Fixes part of the issues we have with debug information for early inlining, too.

LTO bootstrapped and tested on x86_64-unknown-linux-gnu, applied.

Richard.

2012-09-28  Richard Guenther  rguent...@suse.de

	PR lto/47799
	* lto-streamer-out.c (tree_is_indexable): Make PARM_DECLs global.
	(lto_output_tree_ref): Handle references to them.
	(output_function): Do not output function arguments again.
	* lto-streamer-in.c (input_function): Do not input arguments
	again, nor overwrite them.

Index: gcc/lto-streamer-out.c
===
*** gcc/lto-streamer-out.c.orig	2012-09-26 16:47:18.0 +0200
--- gcc/lto-streamer-out.c	2012-09-28 11:18:35.438055184 +0200
*** static bool
*** 125,131 ***
  tree_is_indexable (tree t)
  {
    if (TREE_CODE (t) == PARM_DECL)
!     return false;
    else if (TREE_CODE (t) == VAR_DECL
	   && decl_function_context (t)
	   && !TREE_STATIC (t))
      return false;
--- 125,131 ---
  tree_is_indexable (tree t)
  {
    if (TREE_CODE (t) == PARM_DECL)
!     return true;
    else if (TREE_CODE (t) == VAR_DECL
	   && decl_function_context (t)
	   && !TREE_STATIC (t))
      return false;
*** lto_output_tree_ref (struct output_block
*** 237,242 ***
--- 237,243 ---
      case VAR_DECL:
      case DEBUG_EXPR_DECL:
        gcc_assert (decl_function_context (expr) == NULL || TREE_STATIC (expr));
+     case PARM_DECL:
        streamer_write_record_start (ob, LTO_global_decl_ref);
        lto_output_var_decl_index (ob->decl_state, ob->main_stream, expr);
        break;
*** output_function (struct cgraph_node *nod
*** 806,814 ***
    output_struct_function_base (ob, fn);
 
-   /* Output the head of the arguments list.  */
-   stream_write_tree (ob, DECL_ARGUMENTS (function), true);
-
    /* Output all the SSA names used in the function.
 */
    output_ssa_names (ob, fn);
--- 807,812 ---

Index: gcc/lto-streamer-in.c
===
*** gcc/lto-streamer-in.c.orig	2012-09-21 10:59:45.0 +0200
--- gcc/lto-streamer-in.c	2012-09-28 11:14:53.835068419 +0200
*** input_function (tree fn_decl, struct dat
*** 823,829 ***
    gimple *stmts;
    basic_block bb;
    struct cgraph_node *node;
-   tree args, narg, oarg;
 
    fn = DECL_STRUCT_FUNCTION (fn_decl);
    tag = streamer_read_record_start (ib);
--- 823,828 ---
*** input_function (tree fn_decl, struct dat
*** 834,855 ***
    input_struct_function_base (fn, data_in, ib);
 
-   /* Read all function arguments.  We need to re-map them here to the
-      arguments of the merged function declaration.  */
-   args = stream_read_tree (ib, data_in);
-   for (oarg = args, narg = DECL_ARGUMENTS (fn_decl);
-        oarg && narg;
-        oarg = TREE_CHAIN (oarg), narg = TREE_CHAIN (narg))
-     {
-       unsigned ix;
-       bool res;
-       res = streamer_tree_cache_lookup (data_in->reader_cache, oarg, &ix);
-       gcc_assert (res);
-       /* Replace the argument in the streamer cache.  */
-       streamer_tree_cache_insert_at (data_in->reader_cache, narg, ix);
-     }
-   gcc_assert (!oarg && !narg);
-
    /* Read all the SSA names.  */
    input_ssa_names (ib, data_in, fn);
--- 833,838