Re: DECL_STRUCT_FUNCTION(cgraph_node->decl) == NULL when attempting Link-Time Optimization in plugin

2012-10-09 Thread Richard Guenther
On Mon, Oct 8, 2012 at 11:29 PM, David Malcolm dmalc...@redhat.com wrote:
 On Mon, 2012-10-08 at 18:21 +0200, Richard Guenther wrote:
 On Mon, Oct 8, 2012 at 5:17 PM, David Malcolm dmalc...@redhat.com wrote:
  I'm working on a static analysis extension to GCC via my
  gcc-python-plugin [1]
 
  The analysis is interprocedural (memory leak detection, as it happens).
  I have it working on one translation unit at a time, and I'm attempting
  to get it to work within Link Time Optimization so that it can see the
  interactions of all of the functions within a program or library, but
  I'm running into what might be a bug (or I could be misunderstanding LTO).
 
  I have my plugin working within lto1, using -flto and -fplugin; my
  command line looks like:
  $ gcc -fPIC -shared -flto -g
  -fplugin=/home/david/coding/gcc-python/gcc-python/sm/python.so
  -fplugin-arg-python-script=examples/show-lto-supergraph.py
  tests/sm/lto/input-*.c --save-temps
 
  lto1 appears to be invoked twice by gcc.  The first time, there are two
  passes:
   IPA_PASS named whole-program
   IPA_PASS named inline
  and in_lto_p is true.
 
  The second invocation has these passes, called per-function:
GIMPLE_PASS *free_cfg_annotations
GIMPLE_PASS cplxlower0
GIMPLE_PASS optimized
RTL_PASS expand
GIMPLE_PASS *rest_of_compilation
RTL_PASS *init_function
...etc.., normal rest of compilation from RTL stage onwards
  and in_lto_p is false.
 
  I've attempted to wire in my interprocedural analysis as a new IPA pass
  before or after whole-program and inline within the first invocation
  of lto1.  In each case it is able to walk the callgraph, and the
  callgraph it sees appears to be correct: walking the linked list of
  cgraph_nodes the various cgraph_edge and cgraph_node are as expected,
  and for each cgraph_node, cgraph_node->decl is a non-NULL tree instance
  of type FUNCTION_DECL (as expected); I'm able to generate a graphviz
  rendering of the whole-program callgraph within lto1 from my plugin.
 
  However, for every cgraph_node, the
  DECL_STRUCT_FUNCTION(cgraph_node->decl) is (struct function*)NULL, which
  thwarts my code's attempt to look up the CFG of each underlying function
  and work on the underlying gimple.
 
  Looking with eu-readelf at the .o files shows these lto sections are
  present (f is the name of one of the functions that I'd like to access
  the gimple CFG of):
  $ eu-readelf -a input-f.o|grep lto
  [ 5] .gnu.lto_.inline.c8904cb9a96e7417 PROGBITS  
  006c 0026  0 E  0   0  1
  [ 6] .gnu.lto_f.c8904cb9a96e7417 PROGBITS  0092 
  0164  0 E  0   0  1
  [ 7] .gnu.lto_.cgraph.c8904cb9a96e7417 PROGBITS  
  01f6 0032  0 E  0   0  1
  [ 8] .gnu.lto_.vars.c8904cb9a96e7417 PROGBITS  
  0228 0012  0 E  0   0  1
  [ 9] .gnu.lto_.refs.c8904cb9a96e7417 PROGBITS  
  023a 0013  0 E  0   0  1
  [10] .gnu.lto_.statics.c8904cb9a96e7417 PROGBITS  
  024d 0014  0 E  0   0  1
  [11] .gnu.lto_.decls.c8904cb9a96e7417 PROGBITS  
  0261 01f3  0 E  0   0  1
  [12] .gnu.lto_.symtab.c8904cb9a96e7417 PROGBITS  
  0454 0011  0 E  0   0  1
  [13] .gnu.lto_.opts   PROGBITS  0465 00e9  
  0 E  0   0  1
 25: 0001  1 OBJECT  GLOBAL DEFAULT   COMMON __gnu_lto_v1
[2a]  __gnu_lto_v1
 
  Are the (struct function*) and CFG meant to exist at this stage, and be
  in gimple form, and is DECL_STRUCT_FUNCTION meant to be non-NULL for
  functions in translation units that were compiled with -flto?  (or have
  I misunderstood LTO?)

 It depends on whether you are in the WPA stage (lto1 was invoked with -fwpa),
 in which case
 no function bodies and thus no CFG is available at all, or if you are in 
 LTRANS
 stage (lto1 was invoked with -fltrans), which sees _part_ of the whole program's
 callgraph and function bodies (and thus CFGs).  As a workaround you probably
 can make your example work by using -flto-partition=none which merges WPA
 and LTRANS stages and pulls the whole program into a single link-time TU.
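
 I.e. with the command line from your report that would just be (untested
 here, shown only as an illustration):

   $ gcc -fPIC -shared -flto -flto-partition=none -g \
         -fplugin=/home/david/coding/gcc-python/gcc-python/sm/python.so \
         -fplugin-arg-python-script=examples/show-lto-supergraph.py \
         tests/sm/lto/input-*.c --save-temps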

 Many thanks: -fwpa was indeed being passed by gcc to lto1, and on
 adding -flto-partition=none to the gcc command line -fwpa goes away,
 and my plugin is able to see all of the gimple CFG from all of the .o
 files (and thus generate pretty graphviz renderings of the supergraph of
 all CFGs etc).
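
 For reference, the walk my pass now does over the merged callgraph is
 roughly the following (GCC 4.7-era internals assumed; a sketch only, the
 real code lives in the Python plugin):

   static void
   walk_all_function_cfgs (void)
   {
     struct cgraph_node *node;
     for (node = cgraph_nodes; node; node = node->next)
       {
         /* NULL under -fwpa or for functions without a body, hence the
            guard that started this thread.  */
         struct function *fun = DECL_STRUCT_FUNCTION (node->decl);
         basic_block bb;
         if (!fun || !fun->cfg)
           continue;
         FOR_EACH_BB_FN (bb, fun)
           {
             gimple_stmt_iterator gsi;
             for (gsi = gsi_start_bb (bb); !gsi_end_p (gsi); gsi_next (&gsi))
               {
                 /* analyze gsi_stmt (gsi) here */
               }
           }
       }
   }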

 FWIW the -fwpa behavior was surprising to me, and felt like something of
 a gotcha.  Based on my reading of
 http://gcc.gnu.org/onlinedocs/gccint/LTO-Overview.html#LTO-Overview
 I had assumed that LTO mode was the default when setting -flto, and
 that WHOPR mode was something that I had to add an extra option to
 enable (and thus I merely skimmed those parts of the docs), whereas it
 seems to be the other way around.  In my defense

Re: patch to fix constant math

2012-10-09 Thread Richard Guenther
On Mon, Oct 8, 2012 at 9:55 PM, Richard Sandiford
rdsandif...@googlemail.com wrote:
 Robert Dewar de...@adacore.com writes:
 On 10/8/2012 11:01 AM, Nathan Froyd wrote:
 - Original Message -
 Btw, as for Richards idea of conditionally placing the length field
 in
 rtx_def looks like overkill to me.  These days we'd merely want to
 optimize for 64bit hosts, thus unconditionally adding a 32 bit
 field to rtx_def looks ok to me (you can wrap that inside a union to
 allow both descriptive names and eventual different use - see what
 I've done to tree_base)

 IMHO, unconditionally adding that field isn't "optimize for 64-bit
 hosts", but "gratuitously make one of the major compiler data
 structures bigger on 32-bit hosts".  Not everybody can cross-compile
 from a 64-bit host.  And even those people who can don't necessarily
 want to.  Please try to consider what's best for all the people who
 use GCC, not just the cases you happen to be working with every day.

 I think that's reasonable in general, but as time goes on, and every
 $300 laptop is 64-bit capable, one should not go TOO far out of the
 way trying to make sure we can compile everything on a 32-bit machine.

 It's not 64-bit machine vs. 32-bit machine.  It's an LP64 ABI vs.
 an ILP32 ABI.  HJ  co. have put considerable effort into developing
 the x32 ABI for x86_64 precisely because ILP32 is still useful for
 64-bit machines.  Just as it was for MIPS when SGI invented n32
 (which is still useful now).  I believe 64-bit SPARC has a similar
 thing, and no doubt other architectures do too.

 After all, there shouldn't be much need for more than 2GB of virtual
 address space in an AVR cross compiler.  So why pay the cache penalty
 of 64-bit pointers and longs (GCC generally tries to avoid using long
 directly) when a 32-bit pointer will do?

 Many years ago, I moved the HOST_WIDE_INT fields out of rtunion
 and into the main rtx_def union because it produced a significant
 speed-up on n32 IRIX.  That was before tree-level optimisation,

But of course that doesn't change the alignment requirement of
that union, which is the issue on lp64 hosts.  Also, as HOST_WIDE_INT
is 64 bits for most ilp32 targets as well, it doesn't necessarily save
ilp32 hosts from having this issue (unless long long isn't aligned
to 8 bytes for them).  So what I said is that arranging for the 32bit
hole to contain useful data for most RTXen should be possible.

Richard.

 but I don't think we've really pruned that much RTL optimisation
 since then, so I'd be surprised if much has changed.


Re: Small cleanup/memory leak plugs for lto

2012-10-09 Thread Richard Guenther
On Mon, Oct 8, 2012 at 10:05 PM, Tobias Burnus bur...@net-b.de wrote:
 Some more issues found by Coverity scanner.

 lto-cgraph.c: The code seems to be unused; besides, it's a zero-trip loop, as
 parm_num is set to 0 and then checked for nonzeroness.

 lto-opts: The check whether first_p is NULL is always false: all calls
 have a variable ref as argument - and first_p is unconditionally
 dereferenced.

 lto_obj_file_open: One could additionally check whether lo is NULL, but that
 then has to be directly after the XCNEW, as lto_file_init already
 dereferences lo.


 static void
 append_to_collect_gcc_options (struct obstack *ob,
   bool *first_p, const char *opt)
 {
   const char *p, *q = opt;
-  if (!first_p)
-    obstack_grow (ob, " ", 1);

bogus change.  It should be

  if (!*first_p)
    obstack_grow (ob, " ", 1);

The rest looks ok.  Please re-test with the above change.

Thanks,
Richard.

 Build and regtested on x86-64-gnu-linux

 Tobias


Re: [lra] 3rd patch to speed more compilation of PR54146

2012-10-09 Thread Richard Guenther
On Tue, Oct 9, 2012 at 6:10 PM, Vladimir Makarov vmaka...@redhat.com wrote:
 On 10/08/2012 05:14 PM, Steven Bosscher wrote:

 On Mon, Oct 8, 2012 at 10:26 PM, Vladimir Makarov vmaka...@redhat.com
 wrote:

  So I checked it on a big file with a hundred functions on an Intel machine and
 got

 before a part of the patch implementing the insn stack as sbitmap
 real=243.40 user=241.61 system=1.00

 after the part of the patch:
 real=244.89 user=243.02 system=0.96

 Is that more than just noise, you think? A ~1.5s difference on ~240
 total isn't very much. I measured the timings on my set of cc1-i
 files, and sometimes the without-patch compiler was faster by a tiny
 amount, and sometimes it was slower. Even on an average of 10 runs I
 really couldn't say that the patch was a win or loss on the whole.

 I measured this on a mostly idle machine at home, not gcc17, which
 seems to be even more busy than usual lately, thanks to you and me :-)


  Sorry, Steven.  It was noise.  I ran it again 3 times now and found that it
  was noise.

 After some thinking, I realized that sbitmap is really the best
 representation for this particular case.  That is because at the 1st
  constraint pass all bits are set, as we process *all* insns on the
  1st pass.  So sbitmap (at least before the extension, and if we have pretty
  compact UIDs) will always be smaller than bitmap.
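
  For illustration, the flat representation is just (sbitmap API of this
  period; a sketch only, not the committed code):

    sbitmap in_stack = sbitmap_alloc (get_max_uid () + 1);
    sbitmap_zero (in_stack);
    /* push marks the insn ... */
    SET_BIT (in_stack, INSN_UID (insn));
    /* ... and membership is a single O(1) bit test, no list walking:  */
    if (TEST_BIT (in_stack, INSN_UID (insn)))
      ;
    sbitmap_free (in_stack);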

  I committed your patch below to the branch (as rev. 192264).

 And again, sorry for the false conclusions.

Btw, congratulations for all the speedups (even though they probably
are noise for regular programs?)!  I'm looking forward to the
merge of LRA for x86 now.

Richard.

 2012-10-09  Steven Bosscher  ste...@gcc.gnu.org


 * lra-int.h (lra_constraint_insn_stack_bitmap,
 lra_constraint_insn_stack): Remove.
 (lra_pop_insn, lra_insn_stack_length): New prototypes.
 * lra.c (lra_constraint_insn_stack_bitmap): Make static sbitmap.
 (lra_constraint_insn_stack): Make static.
 (lra_push_insn_1): New function.
 (lra_push_insn): Rewrite using lra_push_insn_1.
 (lra_push_insn_and_update_insn_regno_info): Likewise.
 (lra_pop_insn, lra_insn_stack_length): New functions.
 * lra_constraints.c (lra_constraints): Use new interface to
 insns stack instead of manipulating in-place.



Re: TODO_rebuild_alias and -O0

2012-10-08 Thread Richard Guenther
On Mon, Oct 8, 2012 at 11:26 AM, Ludovic Courtès l...@gnu.org wrote:
 Hello,

 Consider the attached plug-in, which adds a new dummy pass after the
 “ssa” pass, with ‘TODO_rebuild_alias’ as its start flags:



 When compiling with -O0 a non-trivial file with that plug-in, one ends
 up with:

   numbers.c:8728:10: internal compiler error: in make_decl_rtl, at 
 varasm.c:1163
   gcc: internal compiler error: Aborted (program cc1)

 and a backtrace like this:

 --8---cut here---start-8---
 #6  0x0056bfa4 in fancy_abort (file=optimized out, line=1163,
 function=0xc2a3f5 make_decl_rtl) at ../../gcc-4.6.2/gcc/diagnostic.c:893
 #7  0x00853bae in make_decl_rtl (decl=0x7f3cfd0458c0) at 
 ../../gcc-4.6.2/gcc/varasm.c:1159
 #8  0x005bd485 in expand_expr_real_1 (exp=0x7f3cfd0458c0, 
 target=optimized out,
 tmode=VOIDmode, modifier=EXPAND_WRITE, alt_rtl=optimized out)
 at ../../gcc-4.6.2/gcc/expr.c:8462
 #9  0x005c83a8 in expand_expr (modifier=EXPAND_WRITE, mode=VOIDmode, 
 target=0x0,
 exp=0x7f3cfd0458c0) at ../../gcc-4.6.2/gcc/expr.h:422
 #10 expand_assignment (nontemporal=0 '\000', from=0x7f3cfc967160, 
 to=0x7f3cfd0458c0)
 at ../../gcc-4.6.2/gcc/expr.c:4422
 #11 expand_assignment (to=0x7f3cfd0458c0, from=0x7f3cfc967160, nontemporal=0 
 '\000')
 at ../../gcc-4.6.2/gcc/expr.c:4119
 #12 0x005350c8 in expand_gimple_stmt_1 (stmt=0x7f3cfd046460)
 at ../../gcc-4.6.2/gcc/cfgexpand.c:1975
 #13 expand_gimple_stmt (stmt=0x7f3cfd046460) at 
 ../../gcc-4.6.2/gcc/cfgexpand.c:2084
 #14 0x00536218 in expand_gimple_basic_block (bb=0x7f3cfd047a90)
 at ../../gcc-4.6.2/gcc/cfgexpand.c:3626
 #15 0x00539bbe in gimple_expand_cfg () at 
 ../../gcc-4.6.2/gcc/cfgexpand.c:4109
 #16 0x0068b589 in execute_one_pass (pass=0x10a3dc0) at 
 ../../gcc-4.6.2/gcc/passes.c:1556
 #17 0x0068b835 in execute_pass_list (pass=0x10a3dc0) at 
 ../../gcc-4.6.2/gcc/passes.c:1611
 #18 0x00757751 in tree_rest_of_compilation (fndecl=0x7f3cfdb03900)
 at ../../gcc-4.6.2/gcc/tree-optimize.c:422
 #19 0x008978cf in cgraph_expand_function (node=0x7f3cfd2c8420)
 at ../../gcc-4.6.2/gcc/cgraphunit.c:1576
 #20 0x0089975e in cgraph_output_in_order () at 
 ../../gcc-4.6.2/gcc/cgraphunit.c:1733
 #21 cgraph_optimize () at ../../gcc-4.6.2/gcc/cgraphunit.c:1894
 #22 0x0089992a in cgraph_finalize_compilation_unit ()
 at ../../gcc-4.6.2/gcc/cgraphunit.c:1096
 #23 0x004a3df5 in c_write_global_declarations () at 
 ../../gcc-4.6.2/gcc/c-decl.c:9871
 --8---cut here---end---8---

 There’s no such problem when optimizations are turned on, though.

 Does it mean that extra work is needed at -O0 before one can use alias info?

At -O0 no virtual operands are produced.  TODO_rebuild_alias only computes
points-to sets, which are not in themselves useful.

What do you want to achieve with TODO_rebuild_alias?

Richard.

 Thanks,
 Ludo’.



Re: TODO_rebuild_alias and -O0

2012-10-08 Thread Richard Guenther
On Mon, Oct 8, 2012 at 11:58 AM, Ludovic Courtès l...@gnu.org wrote:
 Richard Guenther richard.guent...@gmail.com skribis:

 At -O0 no virtual operands are produced.  TODO_rebuild_alias only computes
 points-to sets which are in itself not useful.

 What do you want to achieve with TODO_rebuild_alias?

 I basically want to use ‘ptr_derefs_may_alias_p’ in this particular pass.

That should work.

 Thanks,
 Ludo’.


Re: TODO_rebuild_alias and -O0

2012-10-08 Thread Richard Guenther
On Mon, Oct 8, 2012 at 1:42 PM, Ludovic Courtès l...@gnu.org wrote:
 Richard Guenther richard.guent...@gmail.com skribis:

 On Mon, Oct 8, 2012 at 11:58 AM, Ludovic Courtès l...@gnu.org wrote:
 Richard Guenther richard.guent...@gmail.com skribis:

 At -O0 no virtual operands are produced.  TODO_rebuild_alias only computes
 points-to sets which are in itself not useful.

 What do you want to achieve with TODO_rebuild_alias?

 I basically want to use ‘ptr_derefs_may_alias_p’ in this particular pass.

 That should work.

 It actually does, except that ‘ptr_derefs_may_alias_p’ returns true for
 two SSA names in cases like this:

   double *p, *q;
   p = malloc (123);
   q = malloc (234);

 (Where ‘malloc’ has the ‘malloc’ attribute.)

 For some reason, there’s no such false positive when using TODO_rebuild_alias.

Well, "that should work" means it works non-conservatively with
TODO_rebuild_alias, of course.
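
To make that concrete with the example above (p and q standing for the
SSA names p_1 and q_2; a sketch, not real pass code):

  /* Without points-to information (plain -O0, no TODO_rebuild_alias)
     the query has to answer conservatively and returns true.  With
     TODO_rebuild_alias in the pass's todo_flags_start the two malloc
     results get distinct points-to solutions and it returns false.  */
  bool may_alias = ptr_derefs_may_alias_p (p, q);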

Richard.

 Ludo’.


Re: TODO_rebuild_alias and -O0

2012-10-08 Thread Richard Guenther
On Mon, Oct 8, 2012 at 3:32 PM, Ludovic Courtès l...@gnu.org wrote:
 Richard Guenther richard.guent...@gmail.com skribis:

 On Mon, Oct 8, 2012 at 1:42 PM, Ludovic Courtès l...@gnu.org wrote:
 Richard Guenther richard.guent...@gmail.com skribis:

 On Mon, Oct 8, 2012 at 11:58 AM, Ludovic Courtès l...@gnu.org wrote:
 Richard Guenther richard.guent...@gmail.com skribis:

 At -O0 no virtual operands are produced.  TODO_rebuild_alias only 
 computes
 points-to sets which are in itself not useful.

 What do you want to achieve with TODO_rebuild_alias?

 I basically want to use ‘ptr_derefs_may_alias_p’ in this particular pass.

 That should work.

 It actually does, except that ‘ptr_derefs_may_alias_p’ returns true for
 two SSA names in cases like this:

   double *p, *q;
   p = malloc (123);
   q = malloc (234);

 (Where ‘malloc’ has the ‘malloc’ attribute.)

 For some reason, there’s no such false positive when using 
 TODO_rebuild_alias.

 Well, that should work means it works non-conservatively with
 TODO_rebuild_alias, of course.

 Right.  :-)  So how can I get maximum accuracy, while avoiding the assertion
 failure on -O0?

You have to analyze the assert more.  From the backtrace I cannot see
anything suspicious.

 Thanks,
 Ludo’.


Re: TODO_rebuild_alias and -O0

2012-10-08 Thread Richard Guenther
On Mon, Oct 8, 2012 at 6:04 PM, Ludovic Courtès
ludovic.cour...@inria.fr wrote:
 Richard Guenther richard.guent...@gmail.com skribis:

 On Mon, Oct 8, 2012 at 3:32 PM, Ludovic Courtès l...@gnu.org wrote:
 Richard Guenther richard.guent...@gmail.com skribis:

 On Mon, Oct 8, 2012 at 1:42 PM, Ludovic Courtès l...@gnu.org wrote:
 Richard Guenther richard.guent...@gmail.com skribis:

 On Mon, Oct 8, 2012 at 11:58 AM, Ludovic Courtès l...@gnu.org wrote:
 Richard Guenther richard.guent...@gmail.com skribis:

 At -O0 no virtual operands are produced.  TODO_rebuild_alias only 
 computes
 points-to sets which are in itself not useful.

 What do you want to achieve with TODO_rebuild_alias?

 I basically want to use ‘ptr_derefs_may_alias_p’ in this particular 
 pass.

 That should work.

 It actually does, except that ‘ptr_derefs_may_alias_p’ returns true for
 two SSA names in cases like this:

   double *p, *q;
   p = malloc (123);
   q = malloc (234);

 (Where ‘malloc’ has the ‘malloc’ attribute.)

 For some reason, there’s no such false positive when using 
 TODO_rebuild_alias.

 Well, that should work means it works non-conservatively with
 TODO_rebuild_alias, of course.

 Right.  :-)  So how can I get maximum accuracy, while avoid the assertion
 failure on -O0?

 You have to analyze the assert more.  From the backtrace I cannot see
 anything suspicious.

 The assert in ‘make_decl_rtl’ is:

   gcc_assert (TREE_CODE (decl) != VAR_DECL
   || TREE_STATIC (decl)
   || TREE_PUBLIC (decl)
   || DECL_EXTERNAL (decl)
   || DECL_REGISTER (decl));

 Upon inspection, ‘decl’ is VAR_DECL, but has all 4 flags above cleared.

 In my example source file, the assertion is hit here:

   numbers.c: In function ‘scm_c_make_polar’:
   numbers.c:8728:10: internal compiler error: in make_decl_rtl, at 
 varasm.c:1163

 That line in numbers.c reads this:

   sincos (ang, s, c);

 And indeed, the dump of my pass (with -ftree-dump-all) reads this:

   ;; Function scm_c_make_polar (scm_c_make_polar)

   Now a gimple register: D.26884
   No longer having address taken: s
   No longer having address taken: c


   Symbols to be put in SSA form

   { s c D.26884 }


   Incremental SSA update started at block: 0

   Number of blocks in CFG: 8
   Number of blocks to update: 7 ( 88%)

 So it looks like there’s additional state corresponding to these
 variables that needs updating?

Ah, no, but I suppose TODO_update_address_taken doesn't really work at -O0
so you need to fix that to do nothing at -O0.  Because s and c _do_ have their
address taken.
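
Something along these lines (the exact function and signature in tree-ssa.c
may differ; this is only a sketch of "do nothing at -O0", not a tested fix):

  static unsigned int
  execute_update_addresses_taken (void)
  {
    /* At -O0 nothing is rewritten into SSA registers, so un-marking
       addressable locals here only leaves stale state behind (as in
       the backtrace above).  */
    if (!optimize)
      return 0;

    /* ... existing body ... */
    return 0;
  }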

Richard.

 Thanks,
 Ludo’.


Re: DECL_STRUCT_FUNCTION(cgraph_node->decl) == NULL when attempting Link-Time Optimization in plugin

2012-10-08 Thread Richard Guenther
On Mon, Oct 8, 2012 at 5:17 PM, David Malcolm dmalc...@redhat.com wrote:
 I'm working on a static analysis extension to GCC via my
 gcc-python-plugin [1]

 The analysis is interprocedural (memory leak detection, as it happens).
 I have it working on one translation unit at a time, and I'm attempting
 to get it to work within Link Time Optimization so that it can see the
 interactions of all of the functions within a program or library, but
 I'm running into what might be a bug (or I could be misunderstanding LTO).

 I have my plugin working within lto1, using -flto and -fplugin; my
 command line looks like:
 $ gcc -fPIC -shared -flto -g
 -fplugin=/home/david/coding/gcc-python/gcc-python/sm/python.so
 -fplugin-arg-python-script=examples/show-lto-supergraph.py
 tests/sm/lto/input-*.c --save-temps

 lto1 appears to be invoked twice by gcc.  The first time, there are two
 passes:
  IPA_PASS named whole-program
  IPA_PASS named inline
 and in_lto_p is true.

 The second invocation has these passes, called per-function:
   GIMPLE_PASS *free_cfg_annotations
   GIMPLE_PASS cplxlower0
   GIMPLE_PASS optimized
   RTL_PASS expand
   GIMPLE_PASS *rest_of_compilation
   RTL_PASS *init_function
   ...etc.., normal rest of compilation from RTL stage onwards
 and in_lto_p is false.

 I've attempted to wire in my interprocedural analysis as a new IPA pass
 before or after whole-program and inline within the first invocation
 of lto1.  In each case it is able to walk the callgraph, and the
 callgraph it sees appears to be correct: walking the linked list of
 cgraph_nodes the various cgraph_edge and cgraph_node are as expected,
 and for each cgraph_node, cgraph_node->decl is a non-NULL tree instance
 of type FUNCTION_DECL (as expected); I'm able to generate a graphviz
 rendering of the whole-program callgraph within lto1 from my plugin.

 However, for every cgraph_node, the
 DECL_STRUCT_FUNCTION(cgraph_node->decl) is (struct function*)NULL, which
 thwarts my code's attempt to look up the CFG of each underlying function
 and work on the underlying gimple.

 Looking with eu-readelf at the .o files shows these lto sections are
 present (f is the name of one of the functions that I'd like to access
 the gimple CFG of):
 $ eu-readelf -a input-f.o|grep lto
 [ 5] .gnu.lto_.inline.c8904cb9a96e7417 PROGBITS  006c 
 0026  0 E  0   0  1
 [ 6] .gnu.lto_f.c8904cb9a96e7417 PROGBITS  0092 
 0164  0 E  0   0  1
 [ 7] .gnu.lto_.cgraph.c8904cb9a96e7417 PROGBITS  01f6 
 0032  0 E  0   0  1
 [ 8] .gnu.lto_.vars.c8904cb9a96e7417 PROGBITS  0228 
 0012  0 E  0   0  1
 [ 9] .gnu.lto_.refs.c8904cb9a96e7417 PROGBITS  023a 
 0013  0 E  0   0  1
 [10] .gnu.lto_.statics.c8904cb9a96e7417 PROGBITS  
 024d 0014  0 E  0   0  1
 [11] .gnu.lto_.decls.c8904cb9a96e7417 PROGBITS  0261 
 01f3  0 E  0   0  1
 [12] .gnu.lto_.symtab.c8904cb9a96e7417 PROGBITS  0454 
 0011  0 E  0   0  1
 [13] .gnu.lto_.opts   PROGBITS  0465 00e9  0 
 E  0   0  1
25: 0001  1 OBJECT  GLOBAL DEFAULT   COMMON __gnu_lto_v1
   [2a]  __gnu_lto_v1

 Are the (struct function*) and CFG meant to exist at this stage, and be
 in gimple form, and is DECL_STRUCT_FUNCTION meant to be non-NULL for
 functions in translation units that were compiled with -flto?  (or have
 I misunderstood LTO?)

It depends on whether you are in the WPA stage (lto1 was invoked with -fwpa),
in which case
no function bodies and thus no CFG is available at all, or if you are in LTRANS
stage (lto1 was invoked with -fltrans), which sees _part_ of the whole program's
callgraph and function bodies (and thus CFGs).  As a workaround you probably
can make your example work by using -flto-partition=none which merges WPA
and LTRANS stages and pulls the whole program into a single link-time TU.

Richard.

 This is gcc 4.7 on Fedora 17, specifically gcc-4.7.0-5.fc17.x86_64

 Thanks
 Dave
 [1] https://fedorahosted.org/gcc-python-plugin/



Re: Fixup INTEGER_CST

2012-10-08 Thread Richard Guenther
On Sun, Oct 7, 2012 at 7:22 PM, Jan Hubicka hubi...@ucw.cz wrote:
 On Sun, Oct 7, 2012 at 5:15 PM, Jan Hubicka hubi...@ucw.cz wrote:
  Hi,
  I added a sanity check that after fixup all types that lost in the merging 
  are
  really dead.  And it turns out we have some zombies around.
 
  INTEGER_CST needs special care because it is special cased by the 
  streamer.  We also
  do not want to do in-place modifications on it because that would corrupt 
  the hashtable
  used by tree.c's sharing code
 
  Bootstrapped/regtested x86_64-linux, OK?

 No, I don't think we want to fixup INTEGER_CSTs this way.  Instead we
 want to fixup
 them where they end up used unfixed.

 Erm, I think it is what the patch does?

Ah, indeed.

 It replaces pointers to an integer_cst whose type did not survive by pointers
 to a new integer_cst (with the optimization that an INTEGER_CST with overflow
 is changed in place because it is allowed to do so).

Btw ...

  @@ -1526,6 +1549,11 @@ lto_ft_type (tree t)
 LTO_FIXUP_TREE (t->type_non_common.binfo);
 
 LTO_FIXUP_TREE (TYPE_CONTEXT (t));
  +
  +  if (TREE_CODE (t) == METHOD_TYPE)
  +TYPE_METHOD_BASETYPE (t);
  +  if (TREE_CODE (t) == OFFSET_TYPE)
  +TYPE_OFFSET_BASETYPE (t);

that looks like a no-op to me ... (both are TYPE_MAXVAL which
is already fixed up).

Thus, ok with removing the above hunk.

Thanks,
Richard.

   }
 
   /* Fix up fields of a BINFO T.  */


Re: handle isl and cloog in contrib/download_prerequisites

2012-10-08 Thread Richard Guenther
On Mon, Oct 8, 2012 at 3:16 AM, Jonathan Wakely jwakely@gmail.com wrote:
 On 7 October 2012 21:31, Manuel López-Ibáñez wrote:
 On 7 October 2012 22:13, Jonathan Wakely jwakely@gmail.com wrote:

 On Oct 7, 2012 12:00 AM, NightStrike nightstr...@gmail.com wrote:

 On Sat, Oct 6, 2012 at 7:30 AM, Manuel López-Ibáñez
 lopeziba...@gmail.com wrote:
  Hi,
 
  GCC now requires ISL and a very new CLOOG but download_prerequisites
  does not download those. Also, there is only one sensible place to

 As of what version is isl/cloog no longer optional?

 If they're really no longer optional then the prerequisites page and 4.8
 changes page need to be updated.

 The patch downloads isl and cloog unconditionally, does gcc build them
 unconditionally if they're found in the source dir?  If they are still
 optional I don't want download_prerequisites to fetch files that will slow
 down building gcc by building libs and enabling features I don't use.

 I guess they are optional in the sense that you can configure gcc to
 not require them. But the default configure in x86_64-gnu-linux
 requires isl and cloog.

 Are you sure?

 Seems to me the default is still the same as it always has been, i.e.
 Graphite optimisations can be enabled if ISL and cloog are present,
 but they're not required.  I can bootstrap without ISL anyway.

If good-enough ISL and CLooG are not found, graphite is simply disabled,
unless you explicitly enabled it by specifying either the ISL or the CLooG
configury.

Richard.


Re: [lra] patch to speed more compilation of PR54146

2012-10-08 Thread Richard Guenther
On Sun, Oct 7, 2012 at 11:27 PM, Steven Bosscher stevenb@gmail.com wrote:
 On Sun, Oct 7, 2012 at 5:59 PM, Vladimir Makarov wrote:
 The following patch speeds LRA up more on PR54146.  Below times for
 compilation of the test on gcc17.fsffrance.org (an AMD machine):

 Before:
 real=1214.71 user=1192.05 system=22.48
 After:
 real=1144.37 user=1124.31 system=20.11

 Hi Vlad,

 The next bottle-neck in my timings is in
 lra-eliminate.c:lra_eliminate(), in this loop:

 FOR_EACH_BB (bb)
   FOR_BB_INSNS_SAFE (bb, insn, temp)
     {
       if (bitmap_bit_p (insns_with_changed_offsets, INSN_UID (insn)))
         process_insn_for_elimination (insn, final_p);
     }

 The problem is in bitmap_bit_p. Random access to a large bitmap can be
 very slow.

 I'm playing with a patch to expand the insns_with_changed_offsets
 bitmap to an sbitmap, and will send a patch if this works better.

Or make insns_with_changed_offsets a VEC of insns (or a pointer-set).
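
E.g. with a pointer-set (pointer-set.h API; just a sketch):

  struct pointer_set_t *changed = pointer_set_create ();
  /* ... pointer_set_insert (changed, insn); at each point an insn's
     offsets change ... */
  FOR_EACH_BB (bb)
    FOR_BB_INSNS_SAFE (bb, insn, temp)
      if (pointer_set_contains (changed, insn))
        process_insn_for_elimination (insn, final_p);
  pointer_set_destroy (changed);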

Richard.

 Ciao!
 Steven


Re: [ping patch] Predict for loop exits in short-circuit conditions

2012-10-08 Thread Richard Guenther
On Mon, Oct 8, 2012 at 4:50 AM, Dehao Chen de...@google.com wrote:
 Attached is the updated patch. Yes, if we add a VRP pass before
 profile pass, this patch would be unnecessary. Should we add a VRP
 pass?

No, we don't want VRP in early optimizations.

Richard.

 Thanks,
 Dehao

 On Sat, Oct 6, 2012 at 9:38 AM, Jan Hubicka hubi...@ucw.cz wrote:
 ping^2

 Honza, do you think this patch can make into 4.8 stage 1?

 +  if (check_value_one ^ integer_onep (val))

 Probably better as !=
 (especially because GNU coding standard allows predicates to return more than
 just boolean)


 +{
 +  edge e1;
 +  edge_iterator ei;
 +  tree val = gimple_phi_arg_def (phi_stmt, i);
 +  edge e = gimple_phi_arg_edge (phi_stmt, i);
 +
  +  if (!TREE_CONSTANT (val) || !(integer_zerop (val) || integer_onep (val)))
 +   continue;
 +  if (check_value_one ^ integer_onep (val))
 +   continue;
  +  if (VEC_length (edge, e->src->succs) != 1)
 +   {
  + if (!predicted_by_p (exit_edge->src, PRED_LOOP_ITERATIONS_GUESSED)
  +     && !predicted_by_p (exit_edge->src, PRED_LOOP_ITERATIONS)
  +     && !predicted_by_p (exit_edge->src, PRED_LOOP_EXIT))
 +   predict_edge_def (e, PRED_LOOP_EXIT, NOT_TAKEN);
 + continue;
 +   }
 +
  +  FOR_EACH_EDGE (e1, ei, e->src->preds)
  +   if (!predicted_by_p (exit_edge->src, PRED_LOOP_ITERATIONS_GUESSED)
  +       && !predicted_by_p (exit_edge->src, PRED_LOOP_ITERATIONS)
  +       && !predicted_by_p (exit_edge->src, PRED_LOOP_EXIT))
  + predict_edge_def (e1, PRED_LOOP_EXIT, NOT_TAKEN);

 Here you found an edge that you know is going to terminate the loop
 and you want to predict all paths to this edge as unlikely.
 Perhaps you want to use predict_paths_leading_to_edge for the edge?

 You do not need to check PRED_LOOP_ITERATIONS and 
 PRED_LOOP_ITERATIONS_GUESSED
 because those never go to the non-exit edges.

 The nature of predict_paths_for_bb type heuristic is that they are not really
 additive: if the path leads to two different aborts it does not make it more
 sure that it will be unlikely.  So perhaps you can add a !predicted_by_p (e, pred)
 check prior to the predict_edge_def call in the function?

 I wonder, if we did VRP just before branch prediction to jump thread the
 short-circuit conditions into loopback edges, would there still be cases where
 this optimization will match?

 Honza


Re: [PATCH] Fix PR54489 - FRE needing AVAIL_OUT

2012-10-08 Thread Richard Guenther
On Fri, 5 Oct 2012, Steven Bosscher wrote:

 On Fri, Sep 14, 2012 at 2:26 PM, Richard Guenther rguent...@suse.de wrote:
  If you can figure out a better name for the function we should
  probably move it to cfganal.c
 
  It looks like my previous e-mail about this appears to have got lost
  somehow, so retrying:
 
 Your my_rev_post_order_compute is simply inverted_post_order_compute.
 The only difference is that you'll want to ignore EXIT_BLOCK, which is
 always added to the list by inverted_post_order_compute.

Indeed.  inverted_post_order_compute seems to handle a CFG without
infinite-loop and noreturns connected to exit though.  Possibly
that's why it doesn't care for not handling entry/exit.

I'm testing a patch to use inverted_post_order_compute from PRE.
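
For reference, the intended use looks roughly like this (cfganal.c
interface of this period; a sketch only, not the tested patch):

  int *postorder = XNEWVEC (int, n_basic_blocks);
  int n = inverted_post_order_compute (postorder);
  int i;

  for (i = 0; i < n; i++)
    {
      basic_block bb = BASIC_BLOCK (postorder[i]);
      /* inverted_post_order_compute always appends EXIT_BLOCK, which
         PRE wants to ignore.  */
      if (bb == EXIT_BLOCK_PTR || bb == ENTRY_BLOCK_PTR)
        continue;
      /* ... visit bb ... */
    }
  free (postorder);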

Richard.


Re: vec_cond_expr adjustments

2012-10-08 Thread Richard Guenther
On Fri, Oct 5, 2012 at 5:01 PM, Marc Glisse marc.gli...@inria.fr wrote:
 [I am still a little confused, sorry for the long email...]


 On Tue, 2 Oct 2012, Richard Guenther wrote:

  +  if (TREE_CODE (op0) == VECTOR_CST && TREE_CODE (op1) == VECTOR_CST)
 +{
 +  int count = VECTOR_CST_NELTS (op0);
 +  tree *elts =  XALLOCAVEC (tree, count);
 +  gcc_assert (TREE_CODE (type) == VECTOR_TYPE);
 +
  +  for (int i = 0; i < count; i++)
 +   {
 + tree elem_type = TREE_TYPE (type);
 + tree elem0 = VECTOR_CST_ELT (op0, i);
 + tree elem1 = VECTOR_CST_ELT (op1, i);
 +
 + elts[i] = fold_relational_const (code, elem_type,
 +  elem0, elem1);
 +
 + if(elts[i] == NULL_TREE)
 +   return NULL_TREE;
 +
 + elts[i] = fold_negate_const (elts[i], elem_type);



 I think you need to invent something new similar to STORE_FLAG_VALUE
 or use STORE_FLAG_VALUE here.  With the above you try to map
 {0, 1} to {0, -1} which is only true if the operation on the element
 types
 returns {0, 1} (thus, STORE_FLAG_VALUE is 1).


 Er, seems to me that constant folding of a scalar comparison in the
 front/middle-end only returns {0, 1}.

 [and later]

 I'd say adjust your fold-const patch to not negate the scalar result
 but build a proper -1 / 0 value based on integer_zerop().


 I don't mind doing it that way, but I would like to understand first.
 LT_EXPR on scalars is guaranteed (in generic.texi) to be 0 or 1. So negating
 should be the same as testing with integer_zerop to build -1 or 0. Is it
 just a matter of style (then I am ok), or am I missing a reason which makes
 the negation wrong?

Just a matter of style.  Negating is a lot less descriptive for the actual
set of return values we produce.

 The point is we need to define some semantics for vector comparison
 results.


 Yes. I think a documentation patch should come first: generic.texi is
 missing an entry for VEC_COND_EXPR and the entry for LT_EXPR doesn't mention
 vectors. But before that we need to decide what to put there...


 One variant is to make it target independent which in turn
 would inhibit (or make it more difficult) to exploit some target features.
 You for example use {0, -1} for truth values - probably to exploit target
 features -


 Actually it was mostly because that is the meaning in the language. OpenCL
  says that a<b is a vector of 0 and -1, and that ?: only looks at the MSB of
 the elements in the condition. The fact that it matches what some targets do
 is a simple consequence of the fact that OpenCL was based on what hardware
 already did.

Yes, it seems that the {0, -1} choice is most reasonable for GENERIC.  So
let's document that.
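
For the documentation, a small example of those semantics with the generic
vector extension (illustrative only):

  typedef int v4si __attribute__ ((vector_size (16)));

  v4si a = { 1, 2, 3, 4 };
  v4si b = { 4, 3, 2, 1 };
  v4si r = a < b;   /* elementwise; true is -1, false is 0: { -1, -1, 0, 0 } */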


 even though the most natural middle-end way would be to
 use {0, 1} as for everything else


 I agree that it would be natural and convenient in a number of places.


 (caveat: there may be both signed and unsigned bools, we don't allow
 vector components with non-mode precision, thus you could argue that a
 signed bool : 1 is just sign-extended for your solution).


 Not sure how that would translate in the code.


 A different variant is to make it target dependent to leverage
 optimization opportunities


 That's an interesting possibility...


 that's why STORE_FLAG_VALUE exists.


 AFAICS it only appears when we go from gimple to rtl, not before (and there
 is already a VECTOR_STORE_FLAG_VALUE, although no target defines it). Which
 doesn't mean we couldn't make it appear earlier for vectors.


  For example with a vector comparison a < v result, when
 performing bitwise operations on it, you either have to make the target
 expand code to produce {0, -1} even if the natural compare instruction
 would, say, produce {0, 0x8} - or not constrain the possible values
 of its result (like forwprop would do with your patch).  In general we
 want constant folding to yield the same results as if the HW carried
 out the operation to make -O0 code not diverge from -O1.  Thus,

 v4si g;
  int main() { g = { 1, 2, 3, 4 } < { 4, 3, 2, 1 }; }

 should not assign different values to g dependent on constant propagation
 performed or not.


 That one is clear, OpenCL constrains the answer to be {-1,-1,0,0}, whether
 your target likes it or not. Depending on how things are handled,
 comparisons could be constrained internally to only appear (possibly
 indirectly) in the first argument of a vec_cond_expr.

Yes, I realized that later.


 The easiest way out is something like STORE_FLAG_VALUE
 if there does not exist a middle-end choice for vector true / false
 components
 that can be easily generated from what the target produces.

 Like if you perform a FP comparison

  int main () { double x = 1.0; static _Bool b; b = x < 3.0; }

 you get without CCP on x86_64:

ucomisd -8(%rbp), %xmm0
seta%al
movb%al, b.1715(%rip)

 thus the equivalent of

 flag_reg = x < 3.0;
b = flag_reg ? 1 : 0;


 where

Re: patch to fix constant math - second small patch

2012-10-08 Thread Richard Guenther
On Sat, Oct 6, 2012 at 12:48 AM, Kenneth Zadeck
zad...@naturalbridge.com wrote:
 This patch adds machinery to genmodes.c so that largest possible sizes of
 various data structures can be determined at gcc build time.  These
 functions create 3 symbols that are available in insn-modes.h:
 MAX_BITSIZE_MODE_INT - the bitsize of the largest int.
 MAX_BITSIZE_MODE_PARTIAL_INT - the bitsize of the largest partial int.
 MAX_BITSIZE_MODE_ANY_INT - the largest bitsize of any kind of int.

Ok.  Please document these macros in rtl.texi.
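
For the docs, the intended kind of use is e.g. (the buffer sizing below is
just an illustration, not part of the patch):

  /* Enough HOST_WIDE_INTs to hold an integer constant of any integer
     mode the target supports.  */
  HOST_WIDE_INT val[(MAX_BITSIZE_MODE_ANY_INT + HOST_BITS_PER_WIDE_INT - 1)
                    / HOST_BITS_PER_WIDE_INT];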

Richard.


Re: patch to fix constant math - third small patch

2012-10-08 Thread Richard Guenther
On Sat, Oct 6, 2012 at 5:55 PM, Kenneth Zadeck zad...@naturalbridge.com wrote:
 This is the third patch in the series of patches to fix constant math.
 this one changes some predicates at the rtl level to use the new predicate
 CONST_SCALAR_INT_P.
 I did not include a few that were tightly intertwined with other changes.

 Not all of these changes are strictly mechanical.   Richard, when reviewing
 this had me make additional changes to remove what he thought were latent
 bugs at the rtl level.   However, it appears that the bugs were not latent.
 I do not know what is going on here but i am smart enough to not look a gift
 horse in the mouth.

 All of this was done on the same machine with no changes and identical
 configs.  It is an x86-64 with ubuntu 12-4.

 ok for commit?

Patch missing, but if it's just mechanical changes and introduction
of CONST_SCALAR_INT_P consider it pre-approved.

Richard.

 in the logs below, gbBaseline is a trunk from friday and the gbWide is the
 same revision but with my patches.  Some of this like gfortran.dg/pr32627 is
 obviously flutter, but the rest does not appear to be.

 =
 heracles:~/gcc(13) gccBaseline/contrib/compare_tests
 gbBaseline/gcc/testsuite/gcc/gcc.log gbWide/gcc/testsuite/gcc/gcc.log
 New tests that PASS:

 gcc.dg/builtins-85.c scan-assembler mysnprintf
 gcc.dg/builtins-85.c scan-assembler-not __chk_fail
 gcc.dg/builtins-85.c (test for excess errors)


 heracles:~/gcc(14) gccBaseline/contrib/compare_tests
 gbBaseline/gcc/testsuite/gfortran/gfortran.log
 gbWide/gcc/testsuite/gfortran/gfortran.log
 New tests that PASS:

 gfortran.dg/pr32627.f03  -O3 -fomit-frame-pointer -funroll-loops (test for
 excess errors)
 gfortran.dg/pr32627.f03  -O3 -fomit-frame-pointer  (test for excess errors)
 gfortran.dg/pr32627.f03  -Os  (test for excess errors)
 gfortran.dg/pr32635.f  -O0  execution test
 gfortran.dg/pr32635.f  -O0  (test for excess errors)
 gfortran.dg/substr_6.f90  -O2  (test for excess errors)

 Old tests that passed, that have disappeared: (Eeek!)

 gfortran.dg/pr32627.f03  -O1  (test for excess errors)
 gfortran.dg/pr32627.f03  -O3 -fomit-frame-pointer -funroll-all-loops
 -finline-functions  (test for excess errors)
 gfortran.dg/pr32627.f03  -O3 -g  (test for excess errors)
 gfortran.dg/substring_equivalence.f90  -O  (test for excess errors)
 Using /home/zadeck/gcc/gccBaseline/gcc/testsuite/config/default.exp as
 tool-and-target-specific interface file.

 === g++ Summary ===

 # of expected passes            49793
 # of expected failures          284
 # of unsupported tests          601

 runtest completed at Fri Oct  5 16:10:20 2012
 heracles:~/gcc(16) tail gbWide/gcc/testsuite/g++/g++.log Using
 /usr/share/dejagnu/config/unix.exp as generic interface file for target.
 Using /home/zadeck/gcc/gccWide/gcc/testsuite/config/default.exp as
 tool-and-target-specific interface file.

 === g++ Summary ===

 # of expected passes            50472
 # of expected failures          284
 # of unsupported tests          613

 runtest completed at Fri Oct  5 19:51:50 2012







Re: Check that unlinked uses do not contain ssa-names when renaming.

2012-10-08 Thread Richard Guenther
On Sun, Oct 7, 2012 at 12:44 PM, Tom de Vries tom_devr...@mentor.com wrote:
 Richard,

 attached patch checks that unlinked uses do not contain ssa-names when 
 renaming.

 This assert triggers when compiling (without the fix) the PR54735 example.

 AFAIU, it was due to chance that we caught the PR54735 bug by hitting the
 verification failure, because the new vdef introduced by renaming happened to 
 be
 the same name as the ssa name referenced in the invalid unlinked use (in terms
 of maybe_replace_use: rdef == use).

 The assert from this patch catches all cases that an unlinked use contains an
 ssa-name.

 Bootstrapped and reg-tested on x86_64 (Ada inclusive).

 OK for trunk?

I don't think that is exactly what we should assert here ... (I thought about
adding checking myself ...).  What we'd want to assert is that before
any new DEF is registered (which may re-allocate an SSA name) that
no uses with SSA_NAME_IN_FREELIST appear.  Thus, a light verification
pass would be necessary at the beginning of update_ssa
(which I queued onto my TODO list ...).  We'd want that anyway to for
example catch the case where a non-virtual operand is partially renamed.
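
The light verification I have in mind is roughly (names per current
internals; a sketch of the idea, not a tested patch - PHI arguments would
need the same treatment):

  basic_block bb;
  FOR_EACH_BB (bb)
    {
      gimple_stmt_iterator gsi;
      for (gsi = gsi_start_bb (bb); !gsi_end_p (gsi); gsi_next (&gsi))
        {
          ssa_op_iter iter;
          use_operand_p use_p;
          FOR_EACH_SSA_USE_OPERAND (use_p, gsi_stmt (gsi), iter,
                                    SSA_OP_ALL_USES)
            gcc_assert (!SSA_NAME_IN_FREELIST (USE_FROM_PTR (use_p)));
        }
    }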

Thanks,
Richard.

 Thanks,
 - Tom

 2012-10-07  Tom de Vries  t...@codesourcery.com

 * tree-into-ssa.c (maybe_replace_use): Add assert.


Re: patch to fix constant math

2012-10-08 Thread Richard Guenther
On Sun, Oct 7, 2012 at 4:58 PM, Kenneth Zadeck zad...@naturalbridge.com wrote:

 On 10/07/2012 09:19 AM, Richard Guenther wrote:

 In fact, you could argue that the tree level did it wrong (not that i am
 suggesting to change this).   But it makes me think what was going on
  when
 the decision to make TYPE_PRECISION be an INT_CST rather than just a HWI
  was
 made.   For that there is an implication that it could never take more
  than
 a HWI since no place in the code even checks TREE_INT_CST_HIGH for
  these.

 Well - on the tree level we now always have two HWIs for all INTEGER_CSTs.
 If
 we can, based on the size of the underlying mode, reduce that to one
 HWI we already
 win something.  If we add an explicit length to allow a smaller
 encoding for larger modes
 (tree_base conveniently has an available 'int' for this ...) then we'd
 win in more cases.
 Thus, is CONST_INT really necessarily better than optimized CONST_WIDE_INT
 storage?

 I have to admit that looking at these data structures gives me a headache.
 This all looks like something that Rube Goldberg would have invented had he
 done object-oriented design.  (Richard S did not know who Rube Goldberg was when
 I mentioned this name to him a few years ago, since this is an American
 thing, but the British had their own equivalent and I assume the Germans do
 too.)

 i did the first cut of changing the rtl level structure and Richard S threw
 up on it and suggested what is there now, which happily (for me) i was able
 to get mike to implement.

 mike also did the tree level version of the data structures for me.   i will
 make sure he used that left over length field.

 The bottom line is that you most likely just save the length, but that is a
 big percent of something this small.  Both of rtl ints have a mode, so if we
 can make that change later, it will be space neutral.

Yes.

Btw, as for Richards idea of conditionally placing the length field in
rtx_def looks like overkill to me.  These days we'd merely want to
optimize for 64bit hosts, thus unconditionally adding a 32 bit
field to rtx_def looks ok to me (you can wrap that inside a union to
allow both descriptive names and eventual different use - see what
I've done to tree_base)

Richard.


Re: Fixup INTEGER_CST

2012-10-08 Thread Richard Guenther
On Mon, Oct 8, 2012 at 11:18 AM, Jan Hubicka hubi...@ucw.cz wrote:
 On Sun, Oct 7, 2012 at 7:22 PM, Jan Hubicka hubi...@ucw.cz wrote:
  On Sun, Oct 7, 2012 at 5:15 PM, Jan Hubicka hubi...@ucw.cz wrote:
   Hi,
    I added a sanity check that after fixup all types that lost in the 
   merging are
   really dead.  And it turns out we have some zombies around.
  
   INTEGER_CST needs special care because it is special cased by the 
   streamer.  We also
    do not want to do in-place modifications on it because that would 
   corrupt the hashtable
   used by tree.c's sharing code
  
   Bootstrapped/regtested x86_64-linux, OK?
 
  No, I don't think we want to fixup INTEGER_CSTs this way.  Instead we
  want to fixup
  them where they end up used unfixed.
 
  Erm, I think it is what the patch does?

 Ah, indeed.

  It replaces pointers to integer_cst with type that did not survive by 
  pointer
  to new integer_cst. (with the optimization that INTEGER_CST with overflow
  is changed in place because it is allowed to do so).

 Btw ...

   @@ -1526,6 +1549,11 @@ lto_ft_type (tree t)
   LTO_FIXUP_TREE (t->type_non_common.binfo);
  
  LTO_FIXUP_TREE (TYPE_CONTEXT (t));
   +
   +  if (TREE_CODE (t) == METHOD_TYPE)
   +TYPE_METHOD_BASETYPE (t);
   +  if (TREE_CODE (t) == OFFSET_TYPE)
   +TYPE_OFFSET_BASETYPE (t);

 that looks like a no-op to me ... (both are TYPE_MAXVAL which
 is already fixed up).

 Ah, indeed.  They were a result of experimenting with the stale pointers to the
 obsoleted types and field decls.  I now understand where they come from.  The
 reason is twofold.

   1) after merging records we replace field decls in the cache
  by new ones.  This however does not mean that they die, because
  the existing pointers to them are not replaced.
  I have WIP patch for that that however require one extra pass
  over the list of all trees.

Yes, I think this is also why we do

  /* ???  Not sure the above is all relevant in this
 path canonicalizing TYPE_FIELDS to that of the
 main variant.  */
  if (ix  i)
lto_fixup_types (f2);
  streamer_tree_cache_insert_at (cache, f1, ix);

something I dislike as well and something we should try to address in a
more formal way.

   2) As we query the type_hash while we are rewriting the types,
  we run into instability of the hashtable. This manifests itself
  as an ICE when one adds a sanity check that while merging function
  types their arg types are equivalent, too.
  This ICEs when compiling e.g. sqlite, but I have not really managed to
  reduce this.  I tracked it down to the argument type being inserted
  into gimple_type_hash but at the time we query the new argument type,
  the original is no longer found despite their hashes are equivalent.
  The problem is hidden when things fit into the leader cache,
  so one needs rather big testcase.

Ugh.  For reduction you can disable those caches though.  The above
means there is a disconnect between hashing and comparing.
Maybe it's something weird with the early out

  if (TYPE_ARG_TYPES (t1) == TYPE_ARG_TYPES (t2))
goto same_types;
?

 So I tried to register all gimple types first.  Use TREE_VISITED within
 the merging code to mark that type is not a leader and then TREE_CHAIN
 to point to the leader.  This avoids the need to re-query the hashtable
 from the later fixups.  We only look for types with TREE_VISITED
 and replace them by TREE_CHAIN.

TREE_CHAIN is unused for types?  But we probably shouldn't add a new
use ...

 This has two passes.  First we compute the main variants and mark
 field_decls and type_decls for merging and in last pass we finally do
 fixup on what remained in the table.

 This allows me to poison pointers in the removed types in a way
 so the GGC would ICE if they stayed reachable.
 I however need the extra pass because
  1) I can not update the type_decls/field_decls while registering
 types or I run into the hash table problems
  2) I can not merge the second two passes because at the time
 I find type/field decls equivalent there may be earlier pointers
 to them.

You need to merge all trees reachable from the one you start at once
(what I'm working on from time to time - work per tree SCC, in a DFS
walk).

Richard.

 Honza


Re: vec_cond_expr adjustments

2012-10-08 Thread Richard Guenther
On Mon, Oct 8, 2012 at 11:34 AM, Marc Glisse marc.gli...@inria.fr wrote:
 On Mon, 8 Oct 2012, Richard Guenther wrote:

 VEC_COND_EXPR is more complicated. We could for instance require that it
  takes as first argument a vector of -1 and 0 (thus <0, !=0 and the neon
 thing are equivalent). Which would leave to decide what the expansion of
 vec_cond_expr passes to the targets when the first argument is not a
  comparison, between !=0, <0, ==-1 or others (I vote for <0 because of
 opencl). One issue is that targets wouldn't know if it was a dummy
 comparison that can safely be ignored because the other part is the
 result
 of logical operations on comparisons (thus composed of -1 and 0) or a
 genuine comparison with an arbitrary vector, so a new optimization would
 be
 needed (in the back-end I guess or we would need an alternate instruction
 to
 vcond) to detect if a vector is a signed boolean vector.
 We could instead say that vec_cond_expr really follows OpenCL's semantics
 and looks at the MSB of each element. I am not sure that would change
 much,
  it would mostly delay the apparition of <0 to RTL expansion time (and
 thus
 make gimple slightly lighter).


 I think we should delay the decision on how to optimize this.  It's indeed
 not trivial and the GIMPLE middle-end aggressively forwards feeding
 comparisons into the VEC_COND_EXPR expressions already (somewhat
 defeating any CSE that might be possible here) in forwprop.


 Thanks for going through the long email :-)

 What does that imply for the first argument of VEC_COND_EXPR? Currently, the
 expander asserts that it is a comparison, but that is not reflected in the
 gimple checkers.

And I don't think we should reflect that in the gimple checkers rather fixup the
expander (transparently use p != 0 or p < 0).

 If we document that VEC_COND_EXPR takes a vector of -1 and 0 (which is the
 case for a comparison), I don't think it prevents from later relaxing that
 to 0 or !=0. But then I don't know how to handle expansion when the
 argument is neither a comparison (vcond) nor a constant (vec_merge? I
 haven't tried but that should be doable), I would have to pass <0 or !=0 to
 the target.

Yes.

 So is the best choice to document that VEC_COND_EXPR takes as
 first argument a comparison and make gimple checking reflect that? (seems
 sad, but at least that would tell me what I can/can't do)

No, that would just mean that in GIMPLE you'd add this p != 0 or p < 0.
And at some point in the future I really really want to push this embedded
expression to a separate statement so I have a SSA definition for it.

 By the way, since we are documenting comparisons as returning 0 and -1, does
 that bring back the integer_truep predicate?

Not sure, true would still be != 0 or all_onesp (all bits of the
precision are 1), no?

Richard.

 --
 Marc Glisse


Re: [RFC] Make vectorizer to skip loops with small iteration estimate

2012-10-08 Thread Richard Guenther
On Sat, Oct 6, 2012 at 11:34 AM, Jan Hubicka hubi...@ucw.cz wrote:
 Hi,
 I benchmarked the patch moving loop header copying and it is quite a noticeable
 win.

 Some testsuite updating is needed. In many cases it is just because the
 optimizations are now happening earlier.
 There are however a few testsuite failures I have trouble dealing with:
 ./testsuite/gcc/gcc.sum:FAIL: gcc.dg/tree-ssa/pr21559.c scan-tree-dump-times 
 vrp1 Threaded jump 3
 ./testsuite/gcc/gcc.sum:FAIL: gcc.dg/tree-ssa/ssa-dom-thread-2.c 
 scan-tree-dump-times vrp1 Jumps threaded: 1 1
 ./testsuite/gcc/gcc.sum:FAIL: gcc.dg/vect/O3-slp-reduc-10.c 
 scan-tree-dump-times vect vectorized 1 loops 2
 ./testsuite/g++/g++.sum:FAIL: g++.dg/tree-ssa/pr18178.C -std=gnu++98  
 scan-tree-dump-times vrp1 if  1
 ./testsuite/g++/g++.sum:FAIL: g++.dg/tree-ssa/pr18178.C -std=gnu++11  
 scan-tree-dump-times vrp1 if  1

 This is mostly about VRP losing its ability to thread some jumps from the
 duplicated loop header out of the loop across the loopback edge.  This seems 
 to
 be due to loop updating logic.  Do we care about these?

Yes, I think so.  At least we care that the optimized result is the same.

Can you elaborate on due to loop updating logic?

Can you elaborate on the def_split_header_continue_p change?  Which probably
should be tested and installed separately?

Thanks,
Richard.

 Honza

 Index: tree-ssa-threadupdate.c
 ===
 *** tree-ssa-threadupdate.c (revision 192123)
 --- tree-ssa-threadupdate.c (working copy)
 *** static bool
 *** 846,854 
   def_split_header_continue_p (const_basic_block bb, const void *data)
   {
 const_basic_block new_header = (const_basic_block) data;
 !   return (bb != new_header
 !  && (loop_depth (bb->loop_father)
 !      >= loop_depth (new_header->loop_father)));
   }

   /* Thread jumps through the header of LOOP.  Returns true if cfg changes.
 --- 846,860 
   def_split_header_continue_p (const_basic_block bb, const void *data)
   {
 const_basic_block new_header = (const_basic_block) data;
 !   const struct loop *l;
 !
 !   if (bb == new_header
 !       || loop_depth (bb->loop_father) < loop_depth (new_header->loop_father))
 !     return false;
 !   for (l = bb->loop_father; l; l = loop_outer (l))
 !     if (l == new_header->loop_father)
 !   return true;
 !   return false;
   }

   /* Thread jumps through the header of LOOP.  Returns true if cfg changes.
 Index: testsuite/gcc.dg/unroll_2.c
 ===
 *** testsuite/gcc.dg/unroll_2.c (revision 192123)
 --- testsuite/gcc.dg/unroll_2.c (working copy)
 ***
 *** 1,5 
   /* { dg-do compile  { target i?86-*-linux* x86_64-*-linux* } } */
 ! /* { dg-options -O2 -fdump-rtl-loop2_unroll -fno-peel-loops 
 -fdisable-tree-cunroll=foo -fdisable-tree-cunrolli=foo 
 -fenable-rtl-loop2_unroll } */

   unsigned a[100], b[100];
   inline void bar()
 --- 1,5 
   /* { dg-do compile  { target i?86-*-linux* x86_64-*-linux* } } */
 ! /* { dg-options -O2 -fdump-rtl-loop2_unroll -fno-peel-loops 
 -fdisable-tree-cunroll=foo -fdisable-tree-cunrolli=foo 
 -fenable-rtl-loop2_unroll -fno-tree-dominator-opts } */

   unsigned a[100], b[100];
   inline void bar()
 Index: testsuite/gcc.dg/unroll_3.c
 ===
 *** testsuite/gcc.dg/unroll_3.c (revision 192123)
 --- testsuite/gcc.dg/unroll_3.c (working copy)
 ***
 *** 1,5 
   /* { dg-do compile  { target i?86-*-linux* x86_64-*-linux* } } */
 ! /* { dg-options -O2 -fdump-rtl-loop2_unroll -fno-peel-loops 
 -fdisable-tree-cunroll -fdisable-tree-cunrolli -fenable-rtl-loop2_unroll=foo 
 } */

   unsigned a[100], b[100];
   inline void bar()
 --- 1,5 
   /* { dg-do compile  { target i?86-*-linux* x86_64-*-linux* } } */
 ! /* { dg-options -O2 -fdump-rtl-loop2_unroll -fno-peel-loops 
 -fdisable-tree-cunroll -fdisable-tree-cunrolli -fenable-rtl-loop2_unroll=foo 
 -fno-tree-dominator-opts } */

   unsigned a[100], b[100];
   inline void bar()
 Index: testsuite/gcc.dg/torture/pr23821.c
 ===
 *** testsuite/gcc.dg/torture/pr23821.c  (revision 192123)
 --- testsuite/gcc.dg/torture/pr23821.c  (working copy)
 ***
 *** 1,9 
   /* { dg-do compile } */
   /* { dg-skip-if  { *-*-* } { -O0 -fno-fat-lto-objects } {  } } */
 ! /* At -O1 DOM threads a jump in a non-optimal way which leads to
  the bogus propagation.  */
 ! /* { dg-skip-if  { *-*-* } { -O1 } {  } } */
 ! /* { dg-options -fdump-tree-ivcanon-details } */

   int a[199];

 --- 1,8 
   /* { dg-do compile } */
   /* { dg-skip-if  { *-*-* } { -O0 -fno-fat-lto-objects } {  } } */
 ! /* DOM threads a jump in a non-optimal way which leads to
  the bogus propagation.  */
 ! /* { dg-options -fdump-tree-ivcanon-details -fno-tree-dominator-opts } */

   int a[199];

 

Re: [ping patch] Predict for loop exits in short-circuit conditions

2012-10-08 Thread Richard Guenther
On Mon, Oct 8, 2012 at 12:01 PM, Jan Hubicka hubi...@ucw.cz wrote:
 On Mon, Oct 8, 2012 at 11:04 AM, Jan Hubicka hubi...@ucw.cz wrote:
  On Mon, Oct 8, 2012 at 4:50 AM, Dehao Chen de...@google.com wrote:
   Attached is the updated patch. Yes, if we add a VRP pass before
   profile pass, this patch would be unnecessary. Should we add a VRP
   pass?
 
  No, we don't want VRP in early optimizations.
 
  I am not quite sure about that.  VRP
   1) makes branch prediction work better by doing jump threading early

 Well ... but jump threading may need basic-block duplication which may
 increase code size.  Also VRP and FRE have pass ordering issues.

   2) is, after FRE, most effective tree pass on removing code by my profile
  statistics.

 We also don't have DSE in early opts.  I don't want to end up with the
 situation that we do everything in early opts ... we should do _less_ there
 (but eventually iterate properly when processing cycles).

  Yep, I am not quite sure about the most sane variant.  Missed simple jump
  threading in early opts definitely confuses both the profile estimate and
  inline size estimates.  But I am also not thrilled by adding more passes to
  early opts at all.  Also, last time I looked into this, CCP missed a lot of
  CCP opportunities, making VRP artificially look more useful.

Eh .. that shouldn't happen.  Do you have testcases by any chance?
I used to duplicate each SSA propagator pass and checked -fdump-statistics-stats
for that the 2nd pass does nothing (thus chaining CCP doesn't improve results).
But maybe that's not the issue you run into here?

  I have a patch that improves profile updating a bit after jump threading (i.e.
  re-does the profile for simple cases), but jump threading is still the most
  common cause of the profile becoming inconsistent after expand.

  On a related note, with -fprofile-report I can easily track how much code
 each pass in the queue removed.  I was thinking about running this on Mozilla
 and -O1 and removing those passes that did almost nothing.  Those are mostly
  re-run passes, both at the Gimple and RTL level.  Our pass manager is not
  terribly friendly for controlling passes per repetition.

Sure.  You can also more thoroughly instrument passes and use
-fdump-statistics for that (I've done that), but we usually have testcases
that require that each pass that still is there is present ...

 With introduction of -Og pass queue, do you think introducing -O1 pass queue
 for late tree passes (that will be quite short) is sane?

Yes.  I don't like the dump-file naming mess that results though, but if
we want to support optimized attribute switching between -O1 and -O2
then I guess we have to live with that ...

Originally I wanted to base -Og on -O1 (thus have them mostly share the
pass queue) and retain the same pass queue for -O2 and -Os.  Maybe
that's what we eventually want to do.  Thus, add a (off for -Og) loop
optimizer sub-pass to the queue and schedule some scalar cleanups
after it but inside it.

 What about RTL
 level?  I guess we can split the queues for RTL optimizations, too.
 All optimization passes prior to register allocation are sort of optional
 and I guess there are also -Og candidates.

Yes.  Though I first wanted to see actual issues with the RTL optimizers
and -Og.

 I however find the 3-times-duplicated queues a bit uncool, too, but I guess
 it is most compatible with the PM organization.

Indeed ;)  We should at least try to share the queues for -Og and -O1.

 At -O3 the most effective passes on combine.c
 are:

 cfg (because of cfg cleanup) -1.5474%
 Early inlning -0.4991%
 FRE -7.9369%
 VRP -0.9321% (if run early), ccp does -0.2273%

I think VRP has the advantage of taking loop iteration counts into account.
Maybe we can add sth similar to CCP.  It's sad that VRP is too expensive,
it really is a form of CCP so merging both passes would be best (we can
at a single point, add_equivalence, turn off equivalence processing - the most
expensive part of VRP, and call that CCP ...).
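
As an illustration of the point about loop iteration counts (a hypothetical
example, not taken from the thread): with the loop bound known, VRP can derive
a range for i from the number-of-iterations analysis and fold the test below
to false, while CCP, which only tracks single constant values, cannot remove
the branch.

extern void never_taken (void);

void
foo (int *a)
{
  int i;
  for (i = 0; i < 16; i++)
    {
      if (i > 100)      /* provably false once i is known to be in [0, 15] */
        never_taken ();
      a[i] = i;
    }
}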

 tailr -0.5305%

 After IPA
 copyrename -2.2850% (it packs cleanups after inlining)
 forwprop -0.5432%
 VRP -0.9700% (if rerun after early passes, otherwise it is about 2%)
 PRE -2.4123%
 DOM -0.5182%

 RTL passes
 into_cfglayout -3.1400% (i.e. first cleanup_cfg)
 fwprop1 -3.0467%
 cprop -2.7786%
 combine -3.3346%
 IRA -3.4912% (i.e. the cost model prefers hard regs)
 bbro -0.9765%

 The numbers on tramp3d and the LTO cc1 binary are not that different.

Yes.

Richard.

 Honza


Re: [Patch] Fix PR53397

2012-10-08 Thread Richard Guenther
On Mon, Oct 8, 2012 at 12:01 PM, Kumar, Venkataramanan
venkataramanan.ku...@amd.com wrote:
 Hi Richard,

 I have incorporated your comments.

 Yes, call dump_mem_ref then, instead of repeating parts of its body.

 The reference object is not yet created at the place where we check for
 invariance; it is still a tree expression.  I created a common function and
 used it at all places to dump the step, base and delta values of the memory
 reference being analyzed.

 Please find the modified patch attached.

 GCC regression make check -k passes with x86_64-unknown-linux-gnu.

I presume also bootstrapped.

Ok.

Thanks,
Richard.

 Regards,
 Venkat.

 -Original Message-
 From: Richard Guenther [mailto:richard.guent...@gmail.com]
 Sent: Thursday, October 04, 2012 6:26 PM
 To: Kumar, Venkataramanan
 Cc: Richard Guenther; gcc-patches@gcc.gnu.org
 Subject: Re: [Patch] Fix PR53397

 On Tue, Oct 2, 2012 at 6:40 PM, Kumar, Venkataramanan 
 venkataramanan.ku...@amd.com wrote:
 Hi Richi,

 (Snip)
 + && (!cst_and_fits_in_hwi (step))
 +{
 +  if (loop->inner != NULL)
 +{
 +  if (dump_file && (dump_flags & TDF_DETAILS))
 +{
 +  fprintf (dump_file, "Reference %p:\n", (void *) ref);
 +  fprintf (dump_file, "(base ");
 +  print_generic_expr (dump_file, base, TDF_SLIM);
 +  fprintf (dump_file, ", step ");
 +  print_generic_expr (dump_file, step, TDF_TREE);
 +  fprintf (dump_file, ")\n");

 No need to repeat this - all references are dumped when we gather them.
 (Snip)

 The dumping happens at record_ref which is called after these statements 
 to record these references.

 When the step is invariant  we return from the function without recording 
 the references.

  so I thought of dumping the references here.

 Is there a cleaner way to dump the references at one place?

 Yes, call dump_mem_ref then, instead of repeating parts of its body.

 Richard.

 Regards,
 Venkat.



 -Original Message-
 From: Richard Guenther [mailto:rguent...@suse.de]
 Sent: Tuesday, October 02, 2012 5:42 PM
 To: Kumar, Venkataramanan
 Cc: gcc-patches@gcc.gnu.org
 Subject: Re: [Patch] Fix PR53397

 On Mon, 1 Oct 2012, venkataramanan.ku...@amd.com wrote:

 Hi,

 The below patch fixes the FFT/Scimark regression caused by useless
 prefetch generation.

 This fix tries to make prefetch less aggressive by prefetching arrays
 in the inner loop, when the step is invariant in the entire loop nest.

 GCC currently tries to prefetch invariant steps when they are in the
 inner loop. But does not check if the step is variant in outer loops.

  In the SciMark FFT case, the trip count of the inner loop varies by a
  non-constant step, which is invariant in the inner loop.
  But the step variable varies in the outer loop.  This makes the inner loop
  trip count small (at run time it is sometimes as small as 1
  iteration).

  Prefetching x iterations ahead when the inner loop trip count is
  smaller than x leads to useless prefetches.

 Flag used: -O3 -march=amdfam10

 Before
 **  **
 ** SciMark2 Numeric Benchmark, see http://math.nist.gov/scimark **
 ** for details. (Results can be submitted to p...@nist.gov) **
 **  **
  Using   2.00 seconds min time per kernel.
 Composite Score:  550.50
 FFT Mflops:38.66(N=1024)
 SOR Mflops:   617.61(100 x 100)
 MonteCarlo: Mflops:   173.74
 Sparse matmult  Mflops:   675.63(N=1000, nz=5000)
 LU  Mflops:  1246.88(M=100, N=100)


 After
 **  **
 ** SciMark2 Numeric Benchmark, see http://math.nist.gov/scimark **
 ** for details. (Results can be submitted to p...@nist.gov) **
 **  **
  Using   2.00 seconds min time per kernel.
 Composite Score:  639.20
 FFT Mflops:   479.19(N=1024)
 SOR Mflops:   617.61(100 x 100)
 MonteCarlo: Mflops:   173.18
 Sparse matmult  Mflops:   679.13(N=1000, nz=5000)
 LU  Mflops:  1246.88(M=100, N=100)

 GCC regression make check -k passes with x86_64-unknown-linux-gnu
 New tests that PASS:

 gcc.dg/pr53397-1.c scan-assembler prefetcht0 gcc.dg/pr53397-1.c
 scan-tree-dump aprefetch Issued prefetch
 gcc.dg/pr53397-1.c (test for excess errors) gcc.dg/pr53397-2.c
 scan-tree-dump aprefetch loop variant step
 gcc.dg/pr53397-2.c scan-tree-dump aprefetch Not prefetching
 gcc.dg/pr53397-2.c (test for excess errors)


 Checked CPU2006 and polyhedron on latest AMD processor, no regressions 
 noted.

 Ok to commit in trunk?

 regards,
 Venkat

 gcc/ChangeLog
 +2012-10-01  Venkataramanan Kumar  venkataramanan.ku...@amd.com
 +
 +   * tree-ssa-loop-prefetch.c (gather_memory_references_ref):$
 +   Perform non constant step prefetching in inner loop, only

[PATCH] Remove my_rev_post_order_compute

2012-10-08 Thread Richard Guenther

This replaces my_rev_post_order_compute in PRE by the already
existing inverted_post_order_compute, with the necessary adjustments.

Bootstrapped and tested on x86_64-unknown-linux-gnu, applied.

Richard.

2012-10-08  Richard Guenther  rguent...@suse.de

* tree-ssa-pre.c (postorder_num): New global.
(compute_antic): Initialize all blocks and adjust for
generic postorder.
(my_rev_post_order_compute): Remove.
(init_pre): Use inverted_post_order_compute.

Index: gcc/tree-ssa-pre.c
===
--- gcc/tree-ssa-pre.c  (revision 192119)
+++ gcc/tree-ssa-pre.c  (working copy)
@@ -430,6 +430,7 @@ typedef struct bb_bitmap_sets
 
 /* Basic block list in postorder.  */
 static int *postorder;
+static int postorder_num;
 
 /* This structure is used to keep track of statistics on what
optimization PRE was able to perform.  */
@@ -2456,7 +2457,7 @@ compute_antic (void)
   has_abnormal_preds = sbitmap_alloc (last_basic_block);
   sbitmap_zero (has_abnormal_preds);
 
-  FOR_EACH_BB (block)
+  FOR_ALL_BB (block)
 {
   edge_iterator ei;
   edge e;
@@ -2480,9 +2481,7 @@ compute_antic (void)
 }
 
   /* At the exit block we anticipate nothing.  */
-  ANTIC_IN (EXIT_BLOCK_PTR) = bitmap_set_new ();
   BB_VISITED (EXIT_BLOCK_PTR) = 1;
-  PA_IN (EXIT_BLOCK_PTR) = bitmap_set_new ();
 
   changed_blocks = sbitmap_alloc (last_basic_block + 1);
   sbitmap_ones (changed_blocks);
@@ -2496,7 +2495,7 @@ compute_antic (void)
 for PA ANTIC computation.  */
   num_iterations++;
   changed = false;
-  for (i = n_basic_blocks - NUM_FIXED_BLOCKS - 1; i >= 0; i--)
+  for (i = postorder_num - 1; i >= 0; i--)
{
  if (TEST_BIT (changed_blocks, postorder[i]))
{
@@ -2525,7 +2524,7 @@ compute_antic (void)
fprintf (dump_file, "Starting iteration %d\n", num_iterations);
  num_iterations++;
  changed = false;
- for (i = n_basic_blocks - NUM_FIXED_BLOCKS - 1 ; i >= 0; i--)
+ for (i = postorder_num - 1 ; i >= 0; i--)
{
  if (TEST_BIT (changed_blocks, postorder[i]))
{
@@ -4593,78 +4592,6 @@ remove_dead_inserted_code (void)
   BITMAP_FREE (worklist);
 }
 
-/* Compute a reverse post-order in *POST_ORDER.  If INCLUDE_ENTRY_EXIT is
-   true, then then ENTRY_BLOCK and EXIT_BLOCK are included.  Returns
-   the number of visited blocks.  */
-
-static int
-my_rev_post_order_compute (int *post_order, bool include_entry_exit)
-{
-  edge_iterator *stack;
-  int sp;
-  int post_order_num = 0;
-  sbitmap visited;
-
-  if (include_entry_exit)
-post_order[post_order_num++] = EXIT_BLOCK;
-
-  /* Allocate stack for back-tracking up CFG.  */
-  stack = XNEWVEC (edge_iterator, n_basic_blocks + 1);
-  sp = 0;
-
-  /* Allocate bitmap to track nodes that have been visited.  */
-  visited = sbitmap_alloc (last_basic_block);
-
-  /* None of the nodes in the CFG have been visited yet.  */
-  sbitmap_zero (visited);
-
-  /* Push the last edge on to the stack.  */
-  stack[sp++] = ei_start (EXIT_BLOCK_PTR->preds);
-
-  while (sp)
-{
-  edge_iterator ei;
-  basic_block src;
-  basic_block dest;
-
-  /* Look at the edge on the top of the stack.  */
-  ei = stack[sp - 1];
-  src = ei_edge (ei)->src;
-  dest = ei_edge (ei)->dest;
-
-  /* Check if the edge source has been visited yet.  */
-  if (src != ENTRY_BLOCK_PTR && ! TEST_BIT (visited, src->index))
-{
-  /* Mark that we have visited the destination.  */
-  SET_BIT (visited, src->index);
-
-  if (EDGE_COUNT (src->preds) > 0)
-/* Since the SRC node has been visited for the first
-   time, check its predecessors.  */
-stack[sp++] = ei_start (src->preds);
-  else
-post_order[post_order_num++] = src->index;
-}
-  else
-{
-  if (ei_one_before_end_p (ei) && dest != EXIT_BLOCK_PTR)
-post_order[post_order_num++] = dest->index;
-
-  if (!ei_one_before_end_p (ei))
-ei_next (stack[sp - 1]);
-  else
-sp--;
-}
-}
-
-  if (include_entry_exit)
-post_order[post_order_num++] = ENTRY_BLOCK;
-
-  free (stack);
-  sbitmap_free (visited);
-  return post_order_num;
-}
-
 
 /* Initialize data structures used by PRE.  */
 
@@ -4686,9 +4613,8 @@ init_pre (void)
   connect_infinite_loops_to_exit ();
   memset (pre_stats, 0, sizeof (pre_stats));
 
-
-  postorder = XNEWVEC (int, n_basic_blocks - NUM_FIXED_BLOCKS);
-  my_rev_post_order_compute (postorder, false);
+  postorder = XNEWVEC (int, n_basic_blocks);
+  postorder_num = inverted_post_order_compute (postorder);
 
   alloc_aux_for_blocks (sizeof (struct bb_bitmap_sets));
 


[PATCH] Fix PR54825

2012-10-08 Thread Richard Guenther

This fixes PR54825, properly FRE/PRE vector BIT_FIELD_REFs.

Bootstrapped and tested on x86_64-unknown-linux-gnu, applied.

Richard.

2012-10-08  Richard Guenther  rguent...@suse.de

PR tree-optimization/54825
* tree-ssa-sccvn.c (vn_nary_length_from_stmt): Handle BIT_FIELD_REF.
(init_vn_nary_op_from_stmt): Likewise.
* tree-ssa-pre.c (compute_avail): Use vn_nary_op_lookup_stmt.
* tree-ssa-sccvn.h (sizeof_vn_nary_op): Avoid overflow.

Index: gcc/tree-ssa-sccvn.c
===
*** gcc/tree-ssa-sccvn.c(revision 192120)
--- gcc/tree-ssa-sccvn.c(working copy)
*** vn_nary_length_from_stmt (gimple stmt)
*** 2194,2199 
--- 2194,2202 
  case VIEW_CONVERT_EXPR:
return 1;
  
+ case BIT_FIELD_REF:
+   return 3;
+ 
  case CONSTRUCTOR:
return CONSTRUCTOR_NELTS (gimple_assign_rhs1 (stmt));
  
*** init_vn_nary_op_from_stmt (vn_nary_op_t
*** 2220,2225 
--- 2223,2235 
vno->op[0] = TREE_OPERAND (gimple_assign_rhs1 (stmt), 0);
break;
  
+ case BIT_FIELD_REF:
+   vno->length = 3;
+   vno->op[0] = TREE_OPERAND (gimple_assign_rhs1 (stmt), 0);
+   vno->op[1] = TREE_OPERAND (gimple_assign_rhs1 (stmt), 1);
+   vno->op[2] = TREE_OPERAND (gimple_assign_rhs1 (stmt), 2);
+   break;
+ 
  case CONSTRUCTOR:
vno->length = CONSTRUCTOR_NELTS (gimple_assign_rhs1 (stmt));
for (i = 0; i < vno->length; ++i)
*** init_vn_nary_op_from_stmt (vn_nary_op_t
*** 2227,2232 
--- 2237,2243 
break;
  
  default:
+   gcc_checking_assert (!gimple_assign_single_p (stmt));
vno->length = gimple_num_ops (stmt) - 1;
for (i = 0; i < vno->length; ++i)
vno->op[i] = gimple_op (stmt, i + 1);
Index: gcc/tree-ssa-pre.c
===
*** gcc/tree-ssa-pre.c  (revision 192120)
--- gcc/tree-ssa-pre.c  (working copy)
*** compute_avail (void)
*** 3850,3860 
  || code == VEC_COND_EXPR)
continue;
  
! vn_nary_op_lookup_pieces (gimple_num_ops (stmt) - 1,
!   code,
!   gimple_expr_type (stmt),
!   gimple_assign_rhs1_ptr (stmt),
!   &nary);
  if (!nary)
continue;
  
--- 3850,3856 
  || code == VEC_COND_EXPR)
continue;
  
! vn_nary_op_lookup_stmt (stmt, &nary);
  if (!nary)
continue;
  
Index: gcc/tree-ssa-sccvn.h
===
*** gcc/tree-ssa-sccvn.h(revision 192120)
--- gcc/tree-ssa-sccvn.h(working copy)
*** typedef const struct vn_nary_op_s *const
*** 51,57 
  static inline size_t
  sizeof_vn_nary_op (unsigned int length)
  {
!   return sizeof (struct vn_nary_op_s) + sizeof (tree) * (length - 1);
  }
  
  /* Phi nodes in the hashtable consist of their non-VN_TOP phi
--- 51,57 
  static inline size_t
  sizeof_vn_nary_op (unsigned int length)
  {
!   return sizeof (struct vn_nary_op_s) + sizeof (tree) * length - sizeof (tree);
  }
  
  /* Phi nodes in the hashtable consist of their non-VN_TOP phi

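The last hunk above reorders the size computation; a hedged illustration of
why (toy structure below, not the real vn_nary_op_s): with an unsigned length
of 0, multiplying by (length - 1) wraps around before the multiplication,
while adding sizeof (tree) * length and then subtracting one element stays in
range, because the structure type already accounts for a one-element trailing
array.

#include <stdio.h>
#include <stddef.h>

/* Toy stand-in for vn_nary_op_s: a header plus a one-element trailing array.  */
struct nary_sketch { int code; void *op[1]; };

static size_t
size_overflowing (unsigned int length)
{
  /* For length == 0, (length - 1) wraps to UINT_MAX before the multiply.  */
  return sizeof (struct nary_sketch) + sizeof (void *) * (length - 1);
}

static size_t
size_safe (unsigned int length)
{
  /* Same value for length >= 1, but well-behaved for length == 0.  */
  return sizeof (struct nary_sketch) + sizeof (void *) * length - sizeof (void *);
}

int
main (void)
{
  /* With length == 0 the two differ wildly on a 64-bit host.  */
  printf ("%zu vs %zu\n", size_overflowing (0), size_safe (0));
  return 0;
}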

Re: patch to fix constant math - third small patch

2012-10-08 Thread Richard Guenther
On Mon, Oct 8, 2012 at 1:36 PM, Kenneth Zadeck zad...@naturalbridge.com wrote:
 yes, my bad.   here it is with the patches.

Just for the record, ok!

Thanks,
Richard.

 On 10/06/2012 11:55 AM, Kenneth Zadeck wrote:

 This is the third patch in the series of patches to fix constant math.
 this one changes some predicates at the rtl level to use the new predicate
 CONST_SCALAR_INT_P.
 I did not include a few that were tightly intertwined with other changes.

 Not all of these changes are strictly mechanical.   Richard, when
 reviewing this had me make additional changes to remove what he thought were
 latent bugs at the rtl level.   However, it appears that the bugs were not
 latent.I do not know what is going on here but i am smart enough to not
 look a gift horse in the mouth.

  All of this was done on the same machine with no changes and identical
  configs.  It is an x86-64 with Ubuntu 12.04.

 ok for commit?

 in the logs below, gbBaseline is a trunk from friday and the gbWide is the
 same revision but with my patches.  Some of this like gfortran.dg/pr32627 is
 obviously flutter, but the rest does not appear to be.

 =
 heracles:~/gcc(13) gccBaseline/contrib/compare_tests
 gbBaseline/gcc/testsuite/gcc/gcc.log gbWide/gcc/testsuite/gcc/gcc.log
 New tests that PASS:

 gcc.dg/builtins-85.c scan-assembler mysnprintf
 gcc.dg/builtins-85.c scan-assembler-not __chk_fail
 gcc.dg/builtins-85.c (test for excess errors)


 heracles:~/gcc(14) gccBaseline/contrib/compare_tests
 gbBaseline/gcc/testsuite/gfortran/gfortran.log
 gbWide/gcc/testsuite/gfortran/gfortran.log
 New tests that PASS:

 gfortran.dg/pr32627.f03  -O3 -fomit-frame-pointer -funroll-loops (test for
 excess errors)
 gfortran.dg/pr32627.f03  -O3 -fomit-frame-pointer  (test for excess
 errors)
 gfortran.dg/pr32627.f03  -Os  (test for excess errors)
 gfortran.dg/pr32635.f  -O0  execution test
 gfortran.dg/pr32635.f  -O0  (test for excess errors)
 gfortran.dg/substr_6.f90  -O2  (test for excess errors)

 Old tests that passed, that have disappeared: (Eeek!)

 gfortran.dg/pr32627.f03  -O1  (test for excess errors)
 gfortran.dg/pr32627.f03  -O3 -fomit-frame-pointer -funroll-all-loops
 -finline-functions  (test for excess errors)
 gfortran.dg/pr32627.f03  -O3 -g  (test for excess errors)
 gfortran.dg/substring_equivalence.f90  -O  (test for excess errors)
 Using /home/zadeck/gcc/gccBaseline/gcc/testsuite/config/default.exp as
 tool-and-target-specific interface file.

 === g++ Summary ===

  # of expected passes            49793
  # of expected failures          284
  # of unsupported tests          601

 runtest completed at Fri Oct  5 16:10:20 2012
  heracles:~/gcc(16) tail gbWide/gcc/testsuite/g++/g++.log
  Using /usr/share/dejagnu/config/unix.exp as generic interface file for target.
 Using /home/zadeck/gcc/gccWide/gcc/testsuite/config/default.exp as
 tool-and-target-specific interface file.

 === g++ Summary ===

  # of expected passes            50472
  # of expected failures          284
  # of unsupported tests          613

 runtest completed at Fri Oct  5 19:51:50 2012








Re: [RFC] Implement load sinking in loops

2012-10-08 Thread Richard Guenther
On Mon, Oct 8, 2012 at 12:38 PM, Eric Botcazou ebotca...@adacore.com wrote:
 Hi,

 we recently noticed that, even at -O3, the compiler doesn't figure out that
 the following loop is dumb:

 #define SIZE 64

 int foo (int v[])
 {
   int r;

    for (i = 0; i < SIZE; i++)
 r = v[i];

   return r;
 }

 which was a bit of a surprise.  On second thoughts, this isn't entirely
 unexpected, as it probably matters only for (slightly) pathological cases.
 The attached patch nevertheless implements a form of load sinking in loops so
 as to optimize these cases.  It's combined with invariant motion to optimize:

 int foo (int v[], int a)
 {
   int r, i;

    for (i = 0; i < SIZE; i++)
 r = v[i] + a;

   return r;
 }

 and with store sinking to optimize:

 int foo (int v1[], int v2[])
 {
   int r[SIZE];
   int i, j;

  for (j = 0; j < SIZE; j++)
for (i = 0; i < SIZE; i++)
   r[j] = v1[j] + v2[i];

   return r[SIZE - 1];
 }

 The optimization is enabled at -O2 in the patch for measurement purposes but,
 given how rarely it triggers (e.g. exactly 10 occurrences in a GCC bootstrap,
 compiler-only, all languages except Go), it's probably best suited to -O3.
 Or perhaps we don't care and it should simply be dropped...  Thoughts?

Incidentially we have scev-const-prop to deal with the similar case of
scalar computations.  But I realize this doesn't work for expressions that
are dependent on a loop variant load.

@@ -103,6 +103,7 @@ typedef struct mem_ref_loc
 {
   tree *ref;   /* The reference itself.  */
   gimple stmt; /* The statement in that it occurs.  */
+  tree lhs;/* The (ultimate) LHS for a load.  */
 } *mem_ref_loc_p;

isn't that the lhs of stmt?

+static gimple_seq
+copy_load_and_single_use_chain (mem_ref_loc_p loc, tree *new_lhs)
+{
+  tree mem = *loc->ref;
+  tree lhs, tmp_var, ssa_name;
+  gimple_seq seq = NULL;
+  gimple stmt;
+  unsigned n = 0;
+
+  /* First copy the load and create the new LHS for it.  */
+  lhs = gimple_assign_lhs (loc->stmt);
+  tmp_var = create_tmp_reg (TREE_TYPE (lhs), get_lsm_tmp_name (mem, n++));

use make_temp_ssa_name or simply copy_ssa_name (not sure you need
fancy names here).

+  if (gimple_assign_rhs1 (use_stmt) == lhs)
+   {
+ op1 = ssa_name;
+ op2 = gimple_assign_rhs2 (use_stmt);
+   }
+  else
+   {
+ op1 = gimple_assign_rhs1 (use_stmt);
+ op2 = ssa_name;
+   }

this may enlarge lifetime of the other operand?  And it looks like it would
break with unary stmts (accessing out-of-bounds op2).  Also for
is_gimple_min_invariant other operand which may be for example a.b
you need to unshare_expr it.

+  lhs = gimple_assign_lhs (use_stmt);
+  tmp_var = create_tmp_reg (TREE_TYPE (lhs), get_lsm_tmp_name (mem, n++));
+  stmt = gimple_build_assign_with_ops (rhs_code, tmp_var, op1, op2);
+  ssa_name = make_ssa_name (tmp_var, stmt);
+  gimple_assign_set_lhs (stmt, ssa_name);

see above.  This can now be simplified to

   lhs = gimple_assign_lhs (use_stmt);
   ssa_name = copy_ssa_name (lhs, NULL);
   stmt = gimple_build_assign_with_ops (rhs_code, ssa_name, op1, op2);

Btw - isn't this all a bit backward (I mean the analysis in execute_lm?)
What you want is apply this transform to as much of the _DEF_s of
the loop-closed PHI nodes - only values used outside of the loop are
interesting.  Thats (sort-of) what SCEV const-prop does (well, it also
uses SCEV to compute the overall effect of the iterations).  So what
you want to know is whether when walking the DEF chain of the
loop closed PHI you end up at definitions before the loop or at
definitions that are not otherwise used inside the loop.

Which means it is really expression sinking.  Does tree-ssa-sink manage
to sink anything out of a loop?  Even scalar computation parts I mean?  For

 for (..)
   {
 a = x[i];
 y[i] = a;
 b = a * 2;
   }
  ... = b;

it should be able to sink b = a*2.

So I think the more natural place to implement this is either SCEV cprop
or tree-ssa-sink.c.  And process things from the loop-closed PHI use
walking the DEFs (first process all, marking interesting things to also
catch commonly used exprs for two PHI uses).

Again you might simply want to open a bugreport for this unless you
want to implement it yourself.

Thanks,
Richard.

 Tested on x86_64-suse-linux.


 2012-10-08  Eric Botcazou  ebotca...@adacore.com

 * gimple.h (gsi_insert_seq_on_edge_before): Declare.
 * gimple-iterator.c (gsi_insert_seq_on_edge_before): New function.
 * tree-ssa-loop-im.c (struct mem_ref_loc): Add LHS field.
 (mem_ref_in_stmt): Remove gcc_assert.
 (copy_load_and_single_use_chain): New function.
 (execute_lm): Likewise.
 (hoist_memory_references): Hoist the loads after the stores.
 (ref_always_accessed_p): Rename into...
 (ref_always_stored_p): ...this.  Remove STORE_P and add ONCE_P.
 (can_lsm_ref_p): New 

Re: [RFC] Implement load sinking in loops

2012-10-08 Thread Richard Guenther
On Mon, Oct 8, 2012 at 2:32 PM, Richard Guenther
richard.guent...@gmail.com wrote:
 On Mon, Oct 8, 2012 at 12:38 PM, Eric Botcazou ebotca...@adacore.com wrote:
 Hi,

 we recently noticed that, even at -O3, the compiler doesn't figure out that
 the following loop is dumb:

 #define SIZE 64

 int foo (int v[])
 {
   int r;

   for (i = 0; i < SIZE; i++)
 r = v[i];

   return r;
 }

 which was a bit of a surprise.  On second thoughts, this isn't entirely
 unexpected, as it probably matters only for (slightly) pathological cases.
 The attached patch nevertheless implements a form of load sinking in loops so
 as to optimize these cases.  It's combined with invariant motion to optimize:

 int foo (int v[], int a)
 {
   int r, i;

   for (i = 0; i < SIZE; i++)
 r = v[i] + a;

   return r;
 }

 and with store sinking to optimize:

 int foo (int v1[], int v2[])
 {
   int r[SIZE];
   int i, j;

  for (j = 0; j < SIZE; j++)
  for (i = 0; i < SIZE; i++)
   r[j] = v1[j] + v2[i];

   return r[SIZE - 1];
 }

 The optimization is enabled at -O2 in the patch for measurement purposes but,
 given how rarely it triggers (e.g. exactly 10 occurrences in a GCC bootstrap,
 compiler-only, all languages except Go), it's probably best suited to -O3.
 Or perhaps we don't care and it should simply be dropped...  Thoughts?

 Incidentially we have scev-const-prop to deal with the similar case of
 scalar computations.  But I realize this doesn't work for expressions that
 are dependent on a loop variant load.

 @@ -103,6 +103,7 @@ typedef struct mem_ref_loc
  {
tree *ref;   /* The reference itself.  */
gimple stmt; /* The statement in that it occurs.  */
 +  tree lhs;/* The (ultimate) LHS for a load.  */
  } *mem_ref_loc_p;

 isn't that the lhs of stmt?

 +static gimple_seq
 +copy_load_and_single_use_chain (mem_ref_loc_p loc, tree *new_lhs)
 +{
 +  tree mem = *loc->ref;
 +  tree lhs, tmp_var, ssa_name;
 +  gimple_seq seq = NULL;
 +  gimple stmt;
 +  unsigned n = 0;
 +
 +  /* First copy the load and create the new LHS for it.  */
 +  lhs = gimple_assign_lhs (loc->stmt);
 +  tmp_var = create_tmp_reg (TREE_TYPE (lhs), get_lsm_tmp_name (mem, n++));

 use make_temp_ssa_name or simply copy_ssa_name (not sure you need
 fancy names here).

 +  if (gimple_assign_rhs1 (use_stmt) == lhs)
 +   {
 + op1 = ssa_name;
 + op2 = gimple_assign_rhs2 (use_stmt);
 +   }
 +  else
 +   {
 + op1 = gimple_assign_rhs1 (use_stmt);
 + op2 = ssa_name;
 +   }

 this may enlarge lifetime of the other operand?  And it looks like it would
 break with unary stmts (accessing out-of-bounds op2).  Also for
 is_gimple_min_invariant other operand which may be for example a.b
 you need to unshare_expr it.

 +  lhs = gimple_assign_lhs (use_stmt);
 +  tmp_var = create_tmp_reg (TREE_TYPE (lhs), get_lsm_tmp_name (mem, 
 n++));
 +  stmt = gimple_build_assign_with_ops (rhs_code, tmp_var, op1, op2);
 +  ssa_name = make_ssa_name (tmp_var, stmt);
 +  gimple_assign_set_lhs (stmt, ssa_name);

 see above.  This can now be simplified to

lhs = gimple_assign_lhs (use_stmt);
ssa_name = copy_ssa_name (lhs, NULL);
stmt = gimple_build_assign_with_ops (rhs_code, ssa_name, op1, op2);

 Btw - isn't this all a bit backward (I mean the analysis in execute_lm?)
 What you want is apply this transform to as much of the _DEF_s of
 the loop-closed PHI nodes - only values used outside of the loop are
 interesting.  Thats (sort-of) what SCEV const-prop does (well, it also
 uses SCEV to compute the overall effect of the iterations).  So what
 you want to know is whether when walking the DEF chain of the
 loop closed PHI you end up at definitions before the loop or at
 definitions that are not otherwise used inside the loop.

 Which means it is really expression sinking.  Does tree-ssa-sink manage
 to sink anything out of a loop?  Even scalar computation parts I mean?  For

  for (..)
{
  a = x[i];
  y[i] = a;
  b = a * 2;
}
   ... = b;

 it should be able to sink b = a*2.

 So I think the more natural place to implement this is either SCEV cprop
 or tree-ssa-sink.c.  And process things from the loop-closed PHI use
 walking the DEFs (first process all, marking interesting things to also
 catch commonly used exprs for two PHI uses).

 Again you might simply want to open a bugreport for this unless you
 want to implement it yourself.

We indeed sink 2*tem but not a[i] here.  Because tree-ssa-sink.c doesn't
sink loads (IIRC) at all, but I've seen patches to fix that (IIRC).

int a[256];
int foo (int x)
{
  int i, k = 0;
  for (i = 0; i < x; ++i)
{
  int tem = a[i];
  k = 2*tem;
}
  return k;
}

Richard.

 Thanks,
 Richard.

 Tested on x86_64-suse-linux.


 2012-10-08  Eric Botcazou  ebotca...@adacore.com

 * gimple.h (gsi_insert_seq_on_edge_before): Declare.
 * gimple-iterator.c

Re: gcc/lto/lto.c: Free lto_file struct after closing the file

2012-10-08 Thread Richard Guenther
On Mon, Oct 8, 2012 at 2:39 PM, Tobias Burnus bur...@net-b.de wrote:
 lto_obj_file_open allocates:
   lo = XCNEW (struct lto_simple_object);
 However, the data is never freed - neither explicitly nor in
 lto_obj_file_close.

 In the attached patch, I free the memory now after the call to
 lto_obj_file_close.

 Build and regtested on x86-64-gnu-linux.
 OK for the trunk?

Ok.

Thanks,
Richard.

 Tobias


Re: patch to fix constant math

2012-10-08 Thread Richard Guenther
On Mon, Oct 8, 2012 at 5:01 PM, Nathan Froyd froy...@mozilla.com wrote:
 - Original Message -
 Btw, as for Richards idea of conditionally placing the length field
 in
 rtx_def looks like overkill to me.  These days we'd merely want to
 optimize for 64bit hosts, thus unconditionally adding a 32 bit
 field to rtx_def looks ok to me (you can wrap that inside a union to
 allow both descriptive names and eventual different use - see what
 I've done to tree_base)

  IMHO, unconditionally adding that field isn't optimizing for 64-bit
  hosts, but gratuitously making one of the major compiler data
  structures bigger on 32-bit hosts.  Not everybody can cross-compile
 from a 64-bit host.  And even those people who can don't necessarily
 want to.  Please try to consider what's best for all the people who
 use GCC, not just the cases you happen to be working with every day.

The challenge would of course be to have the overhead only for a minority
of all RTX codes.  After all, those 32 bits are free to be used by every one.

And I would not consider RTX a 'major compiler data structure' - of course
that makes the whole issue somewhat moot ;)

Richard.

 -Nathan

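A purely illustrative sketch of the layout idea being discussed (this is not
the real rtx_def; the field names here are made up): add the 32-bit field
unconditionally, but wrap it in a union so that different RTX codes can give
those bits different meanings, the way tree_base does.

struct rtx_sketch
{
  unsigned short code;
  unsigned char mode;
  unsigned char flags;
  union
  {
    int num_elems;     /* e.g. the HWI count for a CONST_WIDE_INT */
    int reserved;      /* free for other codes to reuse */
  } u;
  /* ... operand storage would follow ... */
};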

Re: reverse bitfield patch

2012-10-07 Thread Richard Guenther
On Fri, Oct 5, 2012 at 8:15 PM, DJ Delorie d...@redhat.com wrote:


 Why do you need to change varasm.c at all?  The hunks seem to be
 completely separate of the attribute.

 Because static constructors have fields in the original order, not the
 reversed order.  Otherwise code like this is miscompiled:

Err - the struct also has fields in the original order - only the bit positions
of the fields are different because of the layouting option.

 struct foo a = { 1, 2, 3 };

 because the 1, 2, 3 are in the C layout order, but the underlying data
 needs to be stored in the reversed order.

And you expect no other code looks at fields of a structure and its
initializer?  It's bad to keep this not in-sync.  Thus I don't think it's
viable to re-order fields just because bit allocation is reversed.

 which will severely pessimize bitfield accesses to structs with the
 bitfield-order attribute.

 The typical use-case for this feature is memory-mapped hardware, where
 pessimum access is preferred anyway.

I doubt that, looking at constraints for strict volatile bitfields.

 so you are supporting this as #pragma.  Which ends up tacking
 bit_order to each type.  Rather than this, why not operate similar
 to the packed pragma, thus, adjust a global variable in
 stor-layout.c.

 Because when I first proposed this feature, I was told to do it this
 way.

 I don't see a value in attaching 'native' or 'msb'/'lsb' if it is
 equal to 'native'.  You un-necessarily pessimize code generation (is
 different code even desired for a no-op bit_order attribute?).

 If the attribute corresponds to the native mode, it should be a no-op.
 The pessimizing only happens when the fields are actually in reverse
 order.

No, because you check for the attribute presence.

 So no, I don't like this post-process layouting thing.  It's a
 layouting mode so it should have effects at bitfield layout time.

 The actual reversal happens in stor-layout.c.  Everything else is
 there to compensate for a possible non-linear layout.


Re: patch to fix constant math

2012-10-07 Thread Richard Guenther
On Fri, Oct 5, 2012 at 6:34 PM, Kenneth Zadeck zad...@naturalbridge.com wrote:
 richard s,
 there are two comments that i deferred to you.   that have the word richard
 in them,

 richi,
 thank, i will start doing this now.

 kenny

 On 10/05/2012 09:49 AM, Richard Guenther wrote:

 On Fri, Oct 5, 2012 at 2:39 PM, Richard Guenther
 richard.guent...@gmail.com wrote:

 Ok, I see where you are going.  Let me look at the patch again.

 * The introduction and use of CONST_SCALAR_INT_P could be split out
(obvious and good)

 * DEF_RTL_EXPR(CONST_WIDE_INT, const_wide_int, , RTX_CONST_OBJ)
defining that new RTX is orthogonal to providing the wide_int
 abstraction
for operating on CONST_INT and CONST_DOUBLE, right?

 @@ -144,6 +144,7 @@ static bool
   prefer_and_bit_test (enum machine_mode mode, int bitnum)
   {
 bool speed_p;
 +  wide_int mask = wide_int::set_bit_in_zero (bitnum, mode);

 set_bit_in_zero ... why use this instead of
 wide_int::zero (mode).set_bit (bitnum) that would match the old
 double_int_zero.set_bit interface and be more composed of primitives?

  adding something like this was just based on usage.  We do this operation
  all over the place, so why not make it concise and efficient.  The api is
  very bottom up.   I looked at what the clients were doing all over the place
  and added those functions.

 wide-int has both and_not and or_not.   double-int only has and_not, but
 there are a lot of places in where you do or_nots, so we have or_not also.


 if (and_test == 0)
   {
 @@ -164,8 +165,7 @@ prefer_and_bit_test (enum machine_mode m
   }

 /* Fill in the integers.  */
 -  XEXP (and_test, 1)
 -= immed_double_int_const (double_int_zero.set_bit (bitnum), mode);
 +  XEXP (and_test, 1) = immed_wide_int_const (mask);

 I suppose immed_wide_int_const may create a CONST_INT, a CONST_DOUBLE
 or a CONST_WIDE_INT?

  yes, on converted targets it builds const_ints and const_wide_ints and on
  unconverted targets it builds const_ints and const_doubles.  The reason I
  did not want to convert the targets is that the code that lives in the
  targets generally wants to use the values to create constants, and this code
  is very very very target specific.


 +#if TARGET_SUPPORTS_WIDE_INT
 +/* Returns 1 if OP is an operand that is a CONST_INT or CONST_WIDE_INT.
 */
 +int
 +const_scalar_int_operand (rtx op, enum machine_mode mode)
 +{

 why is it necessary to condition anything on TARGET_SUPPORTS_WIDE_INT?
 It seems testing/compile coverage of this code will be bad ...

 Can't you simply map CONST_WIDE_INT to CONST_DOUBLE for targets

  The accessors for the two are completely different.  const_doubles know
  that there are exactly two HWIs.   For const_wide_ints, you only know that
  the len is greater than 1.   Anything with len 1 would be a CONST_INT.

  In a modern C++ world, const_int would be a subtype of const_wide_int, but that
  is a huge patch.


 not supporting CONST_WIDE_INT (what does it mean to support
 CONST_WIDE_INT?  Does a target that supports CONST_WIDE_INT no
 longer use CONST_DOUBLEs for integers?)

 yes, that is exactly what it means.

 +  if (!CONST_WIDE_INT_P (op))
 +return 0;

 hm, that doesn't match the comment (CONST_INT is supposed to return 1 as
 well).
 The comment doesn't really tell what the function does it seems,

 I need some context here to reply.


 +  int prec = GET_MODE_PRECISION (mode);
 +  int bitsize = GET_MODE_BITSIZE (mode);
 +
  +  if (CONST_WIDE_INT_NUNITS (op) * HOST_BITS_PER_WIDE_INT > bitsize)
 +   return 0;

  its mode argument is not documented.  And this doesn't even try to see if
 the constant would fit the mode anyway (like if it's zero).  Not sure what
 the function is for.

 I will upgrade the comments, they were inherited from the old version with
 the const_double changed to the const_wide_int


 +   {
 + /* Multiword partial int.  */
 + HOST_WIDE_INT x
 +   = CONST_WIDE_INT_ELT (op, CONST_WIDE_INT_NUNITS (op) - 1);
 + return (wide_int::sext (x, prec  (HOST_BITS_PER_WIDE_INT - 1))
 + == x);

 err - so now we have wide_int with a mode but we pass in another mode
 and see if they have the same value?  Same value as-in signed value?
 Which probably is why you need to rule out different size constants above?
 Because you don't know whether to sign or zero extend?

 no it is worse than that.   I made the decision, which i think is correct,
 that we were not going to carry the mode inside const_wide_int.   The
 justification was that we do not do it for const_int.There is a comment
 in the constructor for const_wide_int that says it would be so easy to just
 put this in.

 But, this is an old api that did not change.   only the internals changed to
 support const_wide_int.



 +/* Returns 1 if OP is an operand that is a CONST_WIDE_INT.  */
 +int
 +const_wide_int_operand (rtx op, enum machine_mode mode)
 +{

 similar comments, common code should be factored out

Re: patch to fix constant math

2012-10-07 Thread Richard Guenther
On Sun, Oct 7, 2012 at 3:11 PM, Kenneth Zadeck zad...@naturalbridge.com wrote:

 On 10/07/2012 08:47 AM, Richard Guenther wrote:

 len seems redundant unless you want to optimize encoding.
 len == (precision + HOST_BITS_PER_WIDE_INT - 1) /
  HOST_BITS_PER_WIDE_INT.

 
 that is exactly what we do.   However, we are optimizing time, not
  space.
 the value is always stored in compressed form, i.e the same
  representation
 as is used inside the CONST_WIDE_INTs and INT_CSTs. this makes the
 transformation between them very fast.   And since we do this a lot, it
 needs to be fast.   So the len is the number of HWIs needed to represent
  the
 value which is typically less than what would be implied by the
  precision.

 But doesn't this require a sign?  Otherwise how do you encode TImode
 0xffffffffffffffff?
 Would len be 1 here?  (and talking about CONST_WIDE_INT vs CONST_INT,
 wouldn't CONST_INT be used anyway for all ints we can encode small?  Or
 is it (hopefully) the goal to replace CONST_INT with CONST_WIDE_INT
 everywhere?)  Or do you say 'wide_int is always signed', thus TImode
 0xffffffffffffffff
 needs len == 2 and an explicit zero upper HWI word?

 the compression of this has len ==2 with the explicit upper being 0.


   Or do you say wide_int
 is always unsigned, thus TImode -1 needs len == 2?  Noting that
 double-ints
 were supposed to be twos-complement (thus, 'unsigned') numbers having
 implicit non-zero bits sounds error-prone.

 That said - I don't see any value in optimizing storage for short-lived
 (as you say) wide-ints (apart from limiting it to the mode size).  For
 example mutating operations on wide-ints are not really possible
 if you need to extend storage.

 the compression used is independent of whether it is sign or unsigned.  but
 the compression looks signed.   i.e. you do not represent the upper hwi if
 they would just be a sign extension of the top hwi that is represented.
 However, this does not imply anything about the math that was used to set
 those bits that way, and you can always go back to the full rep independent
 of the sign.

 I do not have any long term plans to merge CONST_INT into CONST_WIDE_INT.
 It would be a huge change in the ports and would be fairly space
 inefficient.  My guess is that in real code, less than 1% of real constants
 will have a length greater than 1 even on a 32 bit host.   CONST_INT is very
 space efficient. This could have been mentioned as part of Richard's
 response to your comment about the way we represent the CONST_WIDE_INTs.
 In practice, there are almost none of them anyway.

 In fact, you could argue that the tree level did it wrong (not that i am
 suggesting to change this).   But it makes me think what was going on when
 the decision to make TYPE_PRECISION be an INT_CST rather than just a HWI was
 made.   For that there is an implication that it could never take more than
 a HWI since no place in the code even checks TREE_INT_CST_HIGH for these.

Well - on the tree level we now always have two HWIs for all INTEGER_CSTs.  If
we can, based on the size of the underlying mode, reduce that to one
HWI we already
win something.  If we add an explicit length to allow a smaller
encoding for larger modes
(tree_base conveniently has an available 'int' for this ...) then we'd
win in more cases.
Thus, is CONST_INT really necessarily better than optimized CONST_WIDE_INT
storage?

Richard.

 +  enum Op {
 +NONE,
 
 we don't have sth like this in double-int.h, why was the double-int
 mechanism
 not viable?

 
 i have chosen to use enums for things rather than passing booleans.

 But it's bad to use an enum with 4 values for a thing that can only
 possibly
 expect two.  You cannot optimize tests as easily.  Please retain the bool
 uns
 parameters instead.

 I am breaking it into two enums.


Re: Fixup INTEGER_CST

2012-10-07 Thread Richard Guenther
On Sun, Oct 7, 2012 at 5:15 PM, Jan Hubicka hubi...@ucw.cz wrote:
 Hi,
 I added a sanity check that after fixup all types that lost in the merging are
 really dead.  And it turns out we have some zombies around.

 INTEGER_CST needs special care because it is special-cased by the streamer.
 We also do not want to do in-place modifications on it because that would
 corrupt the hashtable used by tree.c's sharing code.

 Bootstrapped/regtested x86_64-linux, OK?

No, I don't think we want to fixup INTEGER_CSTs this way.  Instead we
want to fix
them up where they end up being used unfixed.

Richard.


 PR lto/54839
 * lto/lto.c (remember_with_vars): Also fixup INTEGER_CST.
 (fixup_integer_cst): New function.
 (lto_ft_type): Fixup BASETYPE of methods and offsets.
 Index: lto/lto.c
 ===
 --- lto/lto.c   (revision 192164)
 +++ lto/lto.c   (working copy)
 @@ -1408,11 +1408,36 @@ remember_with_vars (tree t)
 (tt) = GIMPLE_REGISTER_TYPE (tt); \
   if (VAR_OR_FUNCTION_DECL_P (tt) && TREE_PUBLIC (tt)) \
 remember_with_vars (t); \
 + if (TREE_CODE (tt) == INTEGER_CST) \
 +   (tt) = fixup_integer_cst (tt); \
 } \
  } while (0)

  static void lto_fixup_types (tree);

 +/* Return integer_cst T with updated type.  */
 +
 +static tree
 +fixup_integer_cst (tree t)
 +{
 +  tree type = GIMPLE_REGISTER_TYPE (TREE_TYPE (t));
 +
 +  if (type == TREE_TYPE (t))
 +return t;
 +
 +  /* If overflow was set, streamer_read_integer_cst
 + produced local copy of T. */
 +  if (TREE_OVERFLOW (t))
 +{
 +  TREE_TYPE (t) = type;
 +  return t;
 +}
 +  else
 +  /* Otherwise produce new shared node for the new type.  */
 +return build_int_cst_wide (type, TREE_INT_CST_LOW (t),
 +  TREE_INT_CST_HIGH (t));
 +}
 +
  /* Fix up fields of a tree_typed T.  */

  static void
 @@ -1526,6 +1549,11 @@ lto_ft_type (tree t)
LTO_FIXUP_TREE (t->type_non_common.binfo);

LTO_FIXUP_TREE (TYPE_CONTEXT (t));
 +
 +  if (TREE_CODE (t) == METHOD_TYPE)
 +TYPE_METHOD_BASETYPE (t);
 +  if (TREE_CODE (t) == OFFSET_TYPE)
 +TYPE_OFFSET_BASETYPE (t);
  }

  /* Fix up fields of a BINFO T.  */


Re: inlined memcpy/memset degradation in gcc 4.6 or later

2012-10-05 Thread Richard Guenther
On Thu, Oct 4, 2012 at 7:44 PM, Joe Buck joe.b...@synopsys.com wrote:

 On Tue, Oct 2, 2012 at 4:19 PM, Walter Lee w...@tilera.com wrote:
  On TILE-Gx, I'm observing a degradation in inlined memcpy/memset in
  gcc 4.6 and later versus gcc 4.4.  Though I find the problem on
  TILE-Gx, I think this is a problem for any architectures with
  SLOW_UNALIGNED_ACCESS set to 1.
 
  Consider the following program:
 
  struct foo {
int x;
  };
 
  void copy(struct foo* f0, struct foo* f1)
  {
memcpy (f0, f1, sizeof(struct foo));
  }
 
  In gcc 4.4, I get the desired inline memcpy: ...
  In gcc 4.7, however, I get inlined byte-by-byte copies: ...

 On Thu, Oct 04, 2012 at 01:58:54PM +0200, Richard Guenther wrote:
 There is no way to fix it.  memcpy does not require aligned arguments
 and the mere presence of a typed pointer contains zero information
 about alignment for the middle-end.  If you want to exercise C's notion
 of alignment requirements then do not use memcpy but

  *f0 = *f1;

 which works equally well.

 Perhaps I'm missing something.  While memcpy is not permitted to assume
 alignment of its arguments, copy is.  Otherwise, if I wrote

 void copy(struct foo* f0, struct foo* f1)
 {
 *f0 = *f1;
 }

 the compiler would also be forbidden from assuming any alignment.  So,
 when compiling copy, do we lack the information that the memcpy call is
 to the standard ISO memcpy function?  If we know that it is the standard
 function we should be able to do the copy assuming everything is properly
 aligned.

If we see the above aggregate copy then we should be able to compile
the function assuming that f0 and f1 are properly aligned for type struct foo.
If we see C source code using memcpy (f0, f1, sizeof (struct foo)) then
we cannot assume anything about the alignment of f0 or f1 based on the
fact that the code uses the ISO memcpy function.

Thus, I do not understand your question.  When we compile copy how
is memcpy relevant?

Richard.



  Btw, the new behavior even fixed bugs.

 Could you point to a PR that was fixed by the change?  There must be some
 way to distinguish this case from those cases.

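To restate the distinction with a self-contained example (illustrative code,
not taken from the thread): the compiler may assume struct alignment for the
aggregate assignment, but not for the memcpy call, because memcpy only sees
void pointers.

#include <string.h>

struct foo { int x; };

void copy_bytes (struct foo *f0, struct foo *f1)
{
  /* memcpy takes void *; the call itself carries no alignment
     information, so the expansion must be safe for unaligned pointers
     on strict-alignment targets.  */
  memcpy (f0, f1, sizeof (struct foo));
}

void copy_aggregate (struct foo *f0, struct foo *f1)
{
  /* An aggregate assignment happens in C's type system, so the
     compiler may assume f0 and f1 are suitably aligned for struct foo.  */
  *f0 = *f1;
}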

Re: Functions that are CSEable but not pure

2012-10-05 Thread Richard Guenther
On Thu, Oct 4, 2012 at 8:02 PM, Jason Merrill ja...@redhat.com wrote:
 On 10/04/2012 01:42 PM, Richard Guenther wrote:

 So I suppose the testcase that would be valid but break with using
 pure would be instead

 int main()
 {
int x = init_count;
int *p = get_me();
if (init_count == x)
  __builtin_abort();
int *q = get_me();
if (init_count == x)
  __builtin_abort();
 }

 here when get_me is pure we CSE init_count over the _first_ call of
 get_me.


 That's OK for C++ thread_local semantics; the initialization is specified to
 happen some time before the first use, so the testcase is making an invalid
 assumption.  This might not be desirable for user-written singleton
 functions, though.

Ok, so then let me make another example to try breaking a valid program
with the thread_local wrapper being pure:

int main()
{
  int p = get_me();
  *p;
  if (init_count == 1)
__builtin_abort();
}

note my using of C++ references (which in this case cannot bind to NULL
and thus *p might not trap(?)).  So the C++ source would be something like

thread_local int init_count;
int foo () { init_count = 1; return 0; }
thread_local int i = foo ();

int main()
{
  i;
  if (init_count != 1)
__builtin_abort ();
}

is that valid?  When lowered to the above we would DCE the load of i
and the call to the initialization wrapper (because it is pure).  Obviously
then init_count is no longer 1.

Thus my question is - is a valid program allowed to access side-effects
of the dynamic TLS initializer by not going through the TLS variable?
I realize that's somewhat ill-formed if the TLS variable was a class with
a static method doing the access to init_count - using that method
would not keep the class instance life.

Preventing DCE but not CSE for const/pure functions can be for
example done by using the looping-const-or-pure flag (but that
also tells the compiler that this function may not return, so it is
very specific about the kind of possible side-effect to be preserved).
The very specific nature of thread_local TLS init semantics
('somewhen before') is hard to make use of, so if we want to
change looping-const-or-pure to something like const-or-pure-with-side-effects
we should constrain things more.

Richard.

 Jason

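A minimal sketch of the looping-const-or-pure alternative mentioned above
(a hypothetical fragment from wherever the front end builds the wrapper;
wrapper_decl stands for the wrapper's FUNCTION_DECL): the pure bit still
allows CSE of repeated calls, while the looping bit keeps DCE from deleting
the last remaining call.

  /* Sketch only: mark the TLS init wrapper pure but "looping", so
     repeated calls may still be CSEd while at least one call survives
     dead code elimination.  wrapper_decl is hypothetical here.  */
  DECL_PURE_P (wrapper_decl) = 1;
  DECL_LOOPING_CONST_OR_PURE_P (wrapper_decl) = 1;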


Re: reverse bitfield patch

2012-10-05 Thread Richard Guenther
On Thu, Oct 4, 2012 at 8:38 PM, DJ Delorie d...@redhat.com wrote:

 ChangeLog missing, new functions need a toplevel comment documenting
 function, argument and return value as per coding conventions.

 Any review of the patch itself?  I know the overhead is not there...

Why do you need to change varasm.c at all?  The hunks seem to be
completely separate of the attribute.

I don't like that you post-process the layout.  How does reversal
inter-operate with all the other features we have, like strict volatile
bitfields?  I for example see

+ /* If the bitfield-order attribute has been used on this
+structure, the fields might not be in bit-order.  In that
+case, we need a separate representative for each
+field.  */
+ else if (DECL_FIELD_OFFSET (field) < DECL_FIELD_OFFSET (repr)
+  || (DECL_FIELD_OFFSET (field) == DECL_FIELD_OFFSET (repr)
+   && DECL_FIELD_BIT_OFFSET (field) <
DECL_FIELD_BIT_OFFSET (repr)))
+   {
+ finish_bitfield_representative (repr, prev);
+ repr = start_bitfield_representative (field);
+   }

which will severely pessimize bitfield accesses to structs with the
bitfield-order attribute.


+
+  type_attr_list = tree_cons (get_identifier (bit_order),
+ build_tree_list (NULL_TREE,
get_identifier (bit_order)),
+ type_attr_list);
+
+  TYPE_ATTRIBUTES (node) = type_attr_list;
+}
+
+void
+rx_note_pragma_bitorder (char *mode)
+{
+  if (mode == NULL)

so you are supporting this as #pragma.  Which ends up tacking bit_order
to each type.  Rather than this, why not operate similar to the packed
pragma, thus, adjust a global variable in stor-layout.c.

+static tree
+handle_bitorder_attribute (tree *ARG_UNUSED (node), tree ARG_UNUSED (name),
+  tree ARG_UNUSED (args),
+  int ARG_UNUSED (flags), bool *no_add_attrs)
+{
+  tree bmode;
+  const char *bname;
+
+  /* Allow no arguments to mean native.  */
+  if (args == NULL_TREE)
+return NULL_TREE;
+
+  bmode = TREE_VALUE (args);
+
+  bname = IDENTIFIER_POINTER (bmode);
+  if (strcmp (bname, "msb")
+   && strcmp (bname, "lsb")
+   && strcmp (bname, "swapped")
+   && strcmp (bname, "native"))
+{
+  error ("%qE is not a valid bit_order - use lsb, msb, native, or
swapped", bmode);
+  *no_add_attrs = true;
+}
+
+  return NULL_TREE;

I don't see a value in attaching 'native' or 'msb'/'lsb' if it is
equal to 'native'.  You
un-necessarily pessimize code generation (is different code even desired
for a no-op bit_order attribute?).

So no, I don't like this post-process layouting thing.  It's a layouting mode
so it should have effects at bitfield layout time.

Richard.

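For reference, a sketch of how the attribute under review would be used (the
bit_order attribute and its msb/lsb/native/swapped arguments come from the
proposed patch; this is not an accepted GCC extension, and the layout
semantics are exactly what is being debated above).  The stated use case is
memory-mapped hardware whose bit numbering is fixed regardless of the
target's native order.

struct hw_reg
{
  unsigned int enable : 1;
  unsigned int mode   : 3;
  unsigned int count  : 12;
  unsigned int unused : 16;
} __attribute__ ((bit_order (msb)));  /* hypothetical use of the proposed attribute */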

Re: libbacktrace and darwin

2012-10-05 Thread Richard Guenther
On Fri, Oct 5, 2012 at 9:15 AM, Tristan Gingold ging...@adacore.com wrote:

 On Oct 4, 2012, at 11:23 PM, Ian Lance Taylor wrote:

 On Thu, Oct 4, 2012 at 1:32 PM, Jack Howarth howa...@bromo.med.uc.edu 
 wrote:
   Is libbacktrace currently functional in gcc trunk and is it expected
 to function on darwin? While I could understand it not working on installed
 binaries of FSF gcc that were stripped, I would think it should work for
 make check in the build tree since all of the debug code should be present
 in the object files. Or doesn't libbacktrace know to look there for the
 dwarf code? Thanks in advance for any clarifications.

 libbacktrace is functional in GCC trunk.  However, it does not yet
 support the Mach-O object file format.  I hope to work on that at some
 point, but it would be great if somebody else tackled it.  It's
 probably straightforward to implement based on code in
 libiberty/simple-object-mach-o.c.

  I doubt it will, as the DWARF info isn't in the executable.  I think it will be
  simpler to support .dSYM (separate debug file) at first.

So I take it that libbacktrace doesn't work with the separate DWARF
debuginfo as shipped by all Linux distributions either?

Richard.

 Tristan.



Re: Variable DECL_SIZE

2012-10-05 Thread Richard Guenther
On Fri, Oct 5, 2012 at 10:28 AM, Ilya Enkovich enkovich@gmail.com wrote:
 2012/10/4 Richard Guenther richard.guent...@gmail.com:
 On Wed, Oct 3, 2012 at 7:05 PM, Ilya Enkovich enkovich@gmail.com wrote:
 Hi,

  I run into an SSA verification failure when trying to pass a field's
  DECL_SIZE as an operand for a CALL_EXPR. The failure occurs if the field's
  size is not a constant. In such a case DECL_SIZE holds a VAR_DECL and I need
  to find its proper SSA_NAME. I thought it may be the default one but
  gimple_default_def returns NULL. What is the right way to obtain the
  correct ssa_name in such a case?

 There is no way.  You have to know that you need it's size at the point of
 gimplification.  Later there is no way to recover it.

 Wow. It means I cannot also get an address of subsequent fields in the
 structure. It looks weird. Is there a way to somehow preserve this
 information during gimplification and use in later passes?

By using it ;)  See how gimplification of COMPONENT_REF works
for example.  The offset becomes an explicit operand and its
computation get put into explicit statements.  There is no 'reverse-mapping'
to lookup or even re-construct this information at later time.

Richard.

 Ilya


 Richard.


 Thanks,
 Ilya


Re: Variable DECL_SIZE

2012-10-05 Thread Richard Guenther
On Fri, Oct 5, 2012 at 2:36 PM, Ilya Enkovich enkovich@gmail.com wrote:
 2012/10/5 Richard Guenther richard.guent...@gmail.com:
 On Fri, Oct 5, 2012 at 10:28 AM, Ilya Enkovich enkovich@gmail.com 
 wrote:
 2012/10/4 Richard Guenther richard.guent...@gmail.com:
 On Wed, Oct 3, 2012 at 7:05 PM, Ilya Enkovich enkovich@gmail.com 
 wrote:
 Hi,

  I run into an SSA verification failure when trying to pass a field's
  DECL_SIZE as an operand for a CALL_EXPR. The failure occurs if the field's
  size is not a constant. In such a case DECL_SIZE holds a VAR_DECL and I need
  to find its proper SSA_NAME. I thought it may be the default one but
  gimple_default_def returns NULL. What is the right way to obtain the
  correct ssa_name in such a case?

 There is no way.  You have to know that you need it's size at the point of
 gimplification.  Later there is no way to recover it.

 Wow. It means I cannot also get an address of subsequent fields in the
 structure. It looks weird. Is there a way to somehow preserve this
 information during gimplification and use in later passes?

 By using it ;)  See how gimplification of COMPONENT_REF works
 for example.  The offset becomes an explicit operand and its
computation gets put into explicit statements.  There is no 'reverse-mapping'
 to lookup or even re-construct this information at later time.

 I think gimplification pass is too early for me to decide if I want to
 use all these offsets and sizes.

 Actually when I a look into GIMPLE dumps I see that all these values
 are computed and corresponding SSA_NAMEs exist in the code.

 I use the following source exapmle:

 int foo(int len)
 {
   struct {
 int a;
 int buf[len];
 int b;
   } s;
   return foo1(s.buf);
 }

 In GIMPLE I see a lot of values computed for all sizes, unit sizes and 
 offsets:

   saved_stack.2_1 = __builtin_stack_save ();
   len.0_3 = len_2(D);
   D.2241_4 = (bitsizetype) len.0_3;
   D.2242_5 = D.2241_4 * 32;
   D.2243_6 = (sizetype) len.0_3;
   D.2244_7 = D.2243_6 * 4;
   D.2245_8 = (long int) len.0_3;
   D.2246_9 = D.2245_8 + -1;
   D.2247_10 = (sizetype) D.2246_9;
   D.2241_11 = (bitsizetype) len.0_3;
   D.2242_12 = D.2241_11 * 32;
   D.2243_13 = (sizetype) len.0_3;
   D.2244_14 = D.2243_13 * 4;
   D.2243_15 = (sizetype) len.0_3;
   D.2248_16 = D.2243_15 + 1;
   D.2249_17 = D.2248_16 * 4;
   D.2243_18 = (sizetype) len.0_3;
   D.2250_19 = D.2243_18 + 2;
   D.2251_20 = D.2250_19 * 32;
   D.2252_21 = D.2251_20;
 ...

 I suppose there is always a sinle SSA_NAME for such vars and therefore
 I may associate them, e.g. via default definition. The only problem
 them is to not kill them until the very end. Currenty all those unused
 SSA_NAMES are killed by early optimizations pass. But even on O0 now
 we have all values computed and available for use until expand but
 cannot access them.

But that's exactly the issue.  Nothing keeps them live if they are not used.
And there is no back-mapping between DECL_SIZE and the SSA name
holding its value at a certain point in the program.

Richard.

 Ilya

 Richard.

 Ilya


 Richard.


 Thanks,
 Ilya


Re: RFC: C++ PATCH to support dynamic initialization and destruction of C++11 and OpenMP TLS variables

2012-10-05 Thread Richard Guenther
On Thu, Oct 4, 2012 at 7:38 PM, Jason Merrill ja...@redhat.com wrote:
 Both C++11 and OpenMP specify that thread_local/threadprivate variables can
 have dynamic initialization and destruction semantics.  This sequence of
 patches implements that.

 The first patch adds the C++11 thread_local keyword and implements the C++11
 parsing rules for thread_local, which differ somewhat from __thread.  It
 also allows dynamic initialization of function-local thread_local variables.

 The second patch adds a __cxa_thread_atexit interface for registering
 cleanups to be run when a thread exits.  The current implementation is not
 fully conformant; on exit, TLS destructors are supposed to run before any
 destructors for non-TLS variables, but I think that will require glibc
 support for __cxa_thread_atexit.

 The third patch implements dynamic initialization of non-function-local TLS
 variables using the init-on-first-use idiom: uses of such a variable are
 replaced with calls to a wrapper function, so that

 int f();
 thread_local int i = f();

 implies

 void i_init() {
   static bool initialized;
   if (!initialized)
 {
   initialized = true;
   i = f();
 }
 }
 inline int i_wrapper() {
   i_init();
   return i;
 }

 Note that if we see

 extern thread_local int i;

 in a header, we don't know whether it has a dynamic initializer in its
 defining translation unit, so we need to make conservative assumptions.  For
 a type that doesn't always get dynamic initialization, we make i_init a
 weakref and only call it if it exists.  In the same translation unit as the
 definition, we optimize appropriately.

 The wrapper function is the function I'm talking about in
   http://gcc.gnu.org/ml/gcc/2012-10/msg00024.html

 Any comments before I check this in?

I wonder if an implementation is conforming that performs non-local TLS
variable inits at thread creation time instead (probably also would require
glibc support)?

Or if we have the extra indirection via a reference anyway, we could
have a pointer TLS variable (NULL initialized) that on the first access
will trap where in a trap handler we could then perform initialization
and setup of that pointer.

Should we document our choice of implementation somewhere in the
manual?  And thus implicitly provide the hint that using non-local TLS
variables with dynamic initialization comes with an abstraction penalty
that we might not be easily able to remove?

Thanks,
Richard.

 Jason


Re: Use conditional casting with symtab_node

2012-10-05 Thread Richard Guenther
On Thu, Oct 4, 2012 at 8:16 PM, Diego Novillo dnovi...@google.com wrote:
 On Thu, Oct 4, 2012 at 2:14 PM, Lawrence Crowl cr...@googlers.com wrote:

 So, Jan Hubicka requested and approved the current spelling.
 What now?

 I don't think we should hold this up.  The names Jan requested seem
 reasonable enough.  We seem to be running in circles here.

I suppose I have your promise that we'll release with consistent names.
Please allocate some work hours on your side for the renaming of
cgraph_node and varpool_node for the case Honza doesn't get to it in time.

I see all these patches with mixed feelings - it puts brakes on all developers
because they need to learn the new interface which does not bring any
immediate benefit.  So I think _your_ development time would be better
spent by fixing open bugs or by tackling some of the existing scalability
issues in GCC (rather than quoting funny '0.001% faster with 99% confidence'
stuff).

Thanks,
Richard.


 Diego.


Re: patch to fix constant math

2012-10-05 Thread Richard Guenther
On Thu, Oct 4, 2012 at 9:27 PM, Richard Sandiford
rdsandif...@googlemail.com wrote:
 Kenneth Zadeck zad...@naturalbridge.com writes:
 On 10/04/2012 12:58 PM, Richard Guenther wrote:
 On Thu, Oct 4, 2012 at 3:55 PM, Kenneth Zadeck zad...@naturalbridge.com 
 wrote:
 Let me talk about the mode here first.

 What this interface/patch provides is a facility where the constant math
 that is done in optimizations is done exactly the way that it would be done
 on the target machine.   What we have now is a compiler that only does this
 if it is convenient to do on the host.   I admit that I care about this more
 than others right now, but if Intel adds a couple more instructions to
 their vector units, other people will start to really care about this 
 issue.
 If you take an OImode value with the current compiler and left shift it by
 250 the middle end will say that the result is 0.   This is just wrong!!!

 What this means is that the bitsize and precision of the operations need to
 be carried along when doing math.  When wide-int checks for overflow on the
 multiply or add, it is not checking if the value overflowed on two HWIs,
 it is checking if the add overflowed in the mode of the types that are
 represented on the target.   When we do shift, we are not doing a shift
 within two HWIs, we are truncating the shift value (if this is appropriate)
 according to the bitsize and shifting according to the precision.

 I think that an argument could be made that storing the mode should be
 changed to an explicit precision and bitsize.  (A possible other option
 would be to store a tree type, but this would make the usage at the rtl
 level very cumbersome since types are rare.) Aside from the work, you would
 not get much push back.

 But the signedness is a different argument.   At the rtl level, the signedness is
 a matter of context.   (You could argue that this is a mistake and I would
 agree, but that is an even bigger change.)   But more to the point, at the
 tree level, there are a surprising number of places where the operation
 desired does not follow the sign of the types that were used to construct
 the constants.   Furthermore, not carrying the sign is more consistent with
 the double int code, which as you point out carries nothing.
 Well, on RTL the signedness is on the operation (you have sdiv and udiv, 
 etc.).
 yes, there is a complete enough set of operations that allow you to
 specify the signedness where this matters.

 double-int tries to present a sign-less twos-complement entity of size
 2 * HOST_BITS_PER_WIDE_INT.  I think that is sensible and for
 obvious reasons should not change.  Both tree and RTL rely on this.

 I disagree, at least from an RTL perspective.  HOST_BITS_PER_WIDE_INT is
 a host property, and as Kenny says, it's not really sensible to tie our
 main target-specific IR to something host-specific.  We've only done
 that for historical reasons.

Oh, I agree - that HOST_WIDE_INT provides the limit for the largest
integer constants we can encode on a target is a bug!

 On a target that's been converted to wide_int, I don't think a pair
 of HWIs (whether or not it's expressed as a double_int) has any
 significance at all at the RTL level.

 As far as the wide_ints recording a mode or precision goes: we're in
 the lucky position of having tried both options.  Trees record the
 type (and thus precision) of all compile-time integer constants.
 RTL doesn't.  And the RTL way is a right pain.  It leads to interfaces
 in which rtx arguments often have to be accompanied by a mode.  And it
 leads to clunky differences like those between simplify_unary_operation
 (which takes two mode arguments) and simplify_binary_operation
 (which with our current set of binary operations requires only one).

But the issue here is not that double_int doesn't record a mode
or precision (or a sign).  The issue is that CONST_DOUBLE and CONST_INT
don't!  The _tree_ INTEGER_CST consists of a type and a double-int.
I see double-int (and the proposed wide-int) as a common building-block
used for kind of arbitrary precision (as far as the target needs) integer
constants on both tree and RTL.  And having a common worker implementation
requires it to operate on something different than tree types or RTL mode
plus signedness.

 To put it another way: every wide_int operation needs to know
 the precision of its arguments and the precision of its result.

That's not true.  Every tree or RTL operation does, not every
wide_int operation.  double_int's are just twos-complement numbers
of precision 2 * HOST_BITS_PER_WIDE_INT.  wide_int's should
be just twos-complement numbers of precision len *
WHATEVER_HOST_COMPONENT_TYPE_IS_SUITABLE_FOR_A_FAST_IMPLEMENTATION.
Operations on double_int and wide_int are bare-metal;
in nearly all cases you should use routines that do
proper sign-/zero-extending to a possibly smaller precision.  When
using bare-metal you are supposed to do that yourself.

Which is why I was suggesting
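
(A sketch of the "bare metal" point above, not GCC's actual
double_int/wide_int API: the value itself is just host words in
two's complement, and any fixed-precision semantics come from an
explicit extension step supplied by the caller.)

 /* Illustrative only.  Sign-extend the low host word to PREC bits,
    1 <= PREC <= 64; this is the kind of routine a caller is supposed
    to apply when a smaller precision is meant.  */
 static inline long long
 sext_hwi (long long x, unsigned int prec)
 {
   unsigned int shift = 64 - prec;   /* assumes a 64-bit host word */
   return (long long) ((unsigned long long) x << shift) >> shift;
 }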

Re: patch to fix constant math

2012-10-05 Thread Richard Guenther
On Fri, Oct 5, 2012 at 11:26 AM, Richard Guenther
richard.guent...@gmail.com wrote:
 Look at RTL users of the double-int routines and provide wrappers
 that take RTXen as inputs.  Enforce that all CONSTs have a mode.

Which would, btw, allow merging CONST_INT, CONST_DOUBLE
and CONST_WIDE by making the storage size variable and its
length specified by the mode of the RTX (I never liked the distinction
of CONST_INT and CONST_DOUBLE, and you are right, the
CONST_DOUBLE paths are seldom exercised).

Richard.


Re: patch to fix constant math

2012-10-05 Thread Richard Guenther
On Fri, Oct 5, 2012 at 11:55 AM, Richard Sandiford
rdsandif...@googlemail.com wrote:
 Richard Guenther richard.guent...@gmail.com writes:
 As far as the wide_ints recording a mode or precision goes: we're in
 the lucky position of having tried both options.  Trees record the
 type (and thus precision) of all compile-time integer constants.
 RTL doesn't.  And the RTL way is a right pain.  It leads to interfaces
 in which rtx arguments often have to be accompanied by a mode.  And it
 leads to clunky differences like those between simplify_unary_operation
 (which takes two mode arguments) and simplify_binary_operation
 (which with our current set of binary operations requires only one).

 But the issue here is not that double_int doesn't record a mode
 or precision (or a sign).  The issue is that CONST_DOUBLE and CONST_INT
 don't!  The _tree_ INTEGER_CST contains of a type and a double-int.
 I see double-int (and the proposed wide-int) as a common building-block
 used for kind of arbitrary precision (as far as the target needs) integer
 constants on both tree and RTL.  And having a common worker implementation
 requires it to operate on something different than tree types or RTL mode
 plus signedness.

 To put it another way: every wide_int operation needs to know
 the precision of its arguments and the precision of its result.

 That's not true.  Every tree or RTL operation does, not every
 wide_int operation.  double_int's are just twos-complement numbers
 of precision 2 * HOST_BITS_PER_WIDE_INT.  wide_int's should
 be just twos-complement numbers of precision len *
 WHATEVER_HOST_COMPONENT_TYPE_IS_SUITABLE_FOR_A_FAST_IMPLEMENTATION.
 Operations on double_int and wide_int are bare-metal,
 in nearly all of the times you should use routines that do
 proper sign-/zero-extending to a possibly smaller precision.  When
 using bare-metal you are supposed to do that yourself.

 Which is why I was suggesting to add double_sint, double_uint,
 double_sintp (with precision), etc., wrappers and operations.

 Not storing the precision in the wide_int is putting the onus
 on the caller to keep track of the precision separately.

 But that's a matter of providing something high-level ontop of
 the bare-metal double-int/wide-int (something shared between
 RTL and trees).  Duplicating information in the bare-metal
 doesn't make sense (and no, INTEGER_CSTs on the tree level
 are _not_ short-lived, and how can a double-int make sense on
 the tree level when you say it doesn't make sense on the RTL level?)

 I think we're talking at cross purposes here.  To the extent that
 I'm not really sure where to begin. :-)  Just in case this is it:
 the idea is that wide_int is the type used to _process_ integers.
 It is not supposed to be the type used to store integers in the IRs.
 The way trees store integers and the way that RTL stores integers
 are up to them.  For RTL the idea is that we continue to store integers
 that happen to fit into a HWI as a CONST_INT (regardless of the size of
 the CONST_INT's context-determined mode).  Wider integers are stored
 as a CONST_DOUBLE (for unconverted targets) or a CONST_WIDE_INT
 (for converted targets).  None of the three use the wide_int type;
 they use more compact representations instead.  And as Kenny says,
 using CONST_INT continues to be an important way of reducing the
 IR footprint.

 Whenever we want to do something interesting with the integers,
 we construct a wide_int for them and use the same processing code
 for all three rtx types.  This avoids having the separate single-HWI
 and double-HWI paths that we have now.  It also copes naturally with
 cases where we start out with single-HWI values but end up with wider
 ones.

 But the operations that we do on these wide_ints will all be to a mode
 or precision.  Shifts of QImode integers are different from shifts of
 HImode integers, etc.

 If you knew all that already (you probably did) and I've completely
 missed the point, please say so. :-)

 I'm not sure what you mean by bare metal.

The issue is that unlike RTL where we construct double-ints from
CONST_INT/CONST_DOUBLE right now, tree has the double-int
_embedded_.  That's why I say that the thing we embed in
a CONST_WIDE_INT or a tree INTEGER_CST needs to be
bare metal, and that's what I would call wide-int.

I think you have two things mixed in this patch which might add to
the confusion (heh).  One is, making the thing you work on in RTL
(which used to be double-ints before this patch) be wide-ints
which carry additional information taken from the IL RTX piece
at construction time.  That's all nice and good.  The other thing
is adding a CONST_WIDE_INT to support wider integer constants
than the weird host-dependent limitation we have right now.
Mixing those things together is probably what made it possible to get
Kenny to work on it ... heh ;)

So to me wide-ints provide the higher-level abstraction on top of
double-ints (which would remain the storage entity).  Such
higher-level

Re: [PATCH, gdc] - Merging gdc (GNU D Compiler) into gcc

2012-10-05 Thread Richard Guenther
On Fri, Oct 5, 2012 at 12:07 PM, Iain Buclaw ibuc...@ubuntu.com wrote:
 On 5 October 2012 01:06, Joseph S. Myers jos...@codesourcery.com wrote:
 On Thu, 4 Oct 2012, Iain Buclaw wrote:

 The only patches to gcc proper are documentation-related and adding
 the D frontend / libphobos to configure and make files.  I would have
 thought that these would typically only be included with the actual
 front-end?

 Looking back at my previous review comments, I suggested that you might
 need to split up c-common.[ch] so that certain parts of attribute handling
 could be shared with D, because duplicate code copied from elsewhere in
 GCC was not an appropriate implementation approach.  Have you then
 eliminated the duplicate code in some other way that does not involve
 splitting up those files so code can be shared?


 Ah, no; thanks for reminding me of this.

 The code duplicated from c-common.[ch] are the handlers for C
 __attributes__,  however gdc doesn't use all of them because some just
 don't have a fitting place, e.g. gnu_inline, artificial.

 Would the best approach be to move all handle_* functions and any
 helper functions into a new source file that can be shared between
 frontends, and define two new frontend hooks,
 LANG_HOOK_ATTRIBUTE_TABLE and LANG_HOOK_FORMAT_ATTRIBUTE_TABLE ?

Btw, the LTO frontend also has most of the stuff duplicated ... (see
lto/lto-lang.c).
Not sure why ...

Richard.


 Regards
 --
 Iain Buclaw

 *(p < e ? p++ : p) = (c & 0x0f) + '0';


Re: Use conditional casting with symtab_node

2012-10-05 Thread Richard Guenther
On Fri, Oct 5, 2012 at 12:50 PM, Nathan Froyd froy...@mozilla.com wrote:
 - Original Message -
 I see all these patches with mixed feelings - it puts brakes on all
 developers
 because they need to learn the new interface which does not bring any
 immediate benefit.  So I think _your_ development time would be
 better
 spent by fixing open bugs or by tackling some of the existing
 scalability
 issues in GCC (rather than quoting funny '0.001% faster with 99%
 confidence'
 stuff).

 This tone is unnecessary.

Sorry, that wasn't intended.  I question these numbers because
unless you bootstrap say 100 times the noise in bootstrap speed
is way too high to make such claims.  Of course critical information
is missing:

The new code bootstraps .616% faster with a 99% confidence of being faster.

99% confidence on what basis?  What's your sample size?

Why does the patch need this kind of marketing?

 I, for one, think that it's excellent that Lawrence is writing these
 cleanup patches and measuring what impact they have on performance.
 Bonus points that they are making the compiler faster.  Speed of the
 compiler *is* a scalability issue, and it's one that GCC doesn't
 appear to have paid all that much attention to over the years.

I just don't believe the 0.5% numbers.

Richard.

 -Nathan


Re: patch to fix constant math

2012-10-05 Thread Richard Guenther
On Fri, Oct 5, 2012 at 1:24 PM, Richard Sandiford
rdsandif...@googlemail.com wrote:
 Richard Guenther richard.guent...@gmail.com writes:
 On Fri, Oct 5, 2012 at 11:55 AM, Richard Sandiford
 rdsandif...@googlemail.com wrote:
 Richard Guenther richard.guent...@gmail.com writes:
 As far as the wide_ints recording a mode or precision goes: we're in
 the lucky position of having tried both options.  Trees record the
 type (and thus precision) of all compile-time integer constants.
 RTL doesn't.  And the RTL way is a right pain.  It leads to interfaces
 in which rtx arguments often have to be accompanied by a mode.  And it
 leads to clunky differences like those between simplify_unary_operation
 (which takes two mode arguments) and simplify_binary_operation
 (which with our current set of binary operations requires only one).

 But the issue here is not that double_int doesn't record a mode
 or precision (or a sign).  The issue is that CONST_DOUBLE and CONST_INT
 don't!  The _tree_ INTEGER_CST contains of a type and a double-int.
 I see double-int (and the proposed wide-int) as a common building-block
 used for kind of arbitrary precision (as far as the target needs) integer
 constants on both tree and RTL.  And having a common worker implementation
 requires it to operate on something different than tree types or RTL mode
 plus signedness.

 To put it another way: every wide_int operation needs to know
 the precision of its arguments and the precision of its result.

 That's not true.  Every tree or RTL operation does, not every
 wide_int operation.  double_int's are just twos-complement numbers
 of precision 2 * HOST_BITS_PER_WIDE_INT.  wide_int's should
 be just twos-complement numbers of precision len *
 WHATEVER_HOST_COMPONENT_TYPE_IS_SUITABLE_FOR_A_FAST_IMPLEMENTATION.
 Operations on double_int and wide_int are bare-metal,
 in nearly all of the times you should use routines that do
 proper sign-/zero-extending to a possibly smaller precision.  When
 using bare-metal you are supposed to do that yourself.

 Which is why I was suggesting to add double_sint, double_uint,
 double_sintp (with precision), etc., wrappers and operations.

 Not storing the precision in the wide_int is putting the onus
 on the caller to keep track of the precision separately.

 But that's a matter of providing something high-level ontop of
 the bare-metal double-int/wide-int (something shared between
 RTL and trees).  Duplicating information in the bare-metal
 doesn't make sense (and no, INTEGER_CSTs on the tree level
 are _not_ short-lived, and how can a double-int make sense on
 the tree level when you say it doesn't make sense on the RTL level?)

 I think we're talking at cross purposes here.  To the extent that
 I'm not really sure where to begin. :-)  Just in case this is it:
 the idea is that wide_int is the type used to _process_ integers.
 It is not supposed to be the type used to store integers in the IRs.
 The way trees store integers and the way that RTL stores integers
 are up to them.  For RTL the idea is that we continue to store integers
 that happen to fit into a HWI as a CONST_INT (regardless of the size of
 the CONST_INT's context-determined mode).  Wider integers are stored
 as a CONST_DOUBLE (for unconverted targets) or a CONST_WIDE_INT
 (for converted targets).  None of the three use the wide_int type;
 they use more compact representations instead.  And as Kenny says,
 using CONST_INT continues to be an important way of reducing the
 IR footprint.

 Whenever we want to do something interesting with the integers,
 we construct a wide_int for them and use the same processing code
 for all three rtx types.  This avoids having the separate single-HWI
 and double-HWI paths that we have now.  It also copes naturally with
 cases where we start out with single-HWI values but end up with wider
 ones.

 But the operations that we do on these wide_ints will all be to a mode
 or precision.  Shifts of QImode integers are different from shifts of
 HImode integers, etc.

 If you knew all that already (you probably did) and I've completely
 missed the point, please say so. :-)

 I'm not sure what you mean by bare metal.

 The issue is that unlike RTL where we construct double-ints from
 CONST_INT/CONST_DOUBLE right now, tree has the double-int
 _embedded_.  That's why I say that the thing we embed in
 a CONST_WIDE_INT or a tree INTEGER_CST needs to be
 bare metal, and that's what I would call wide-int.

 OK, but that's deliberately not what Kenny's patch calls wide int.
 The whole idea is that the bare metal thing we embed in a
 CONST_WIDE_INT or tree isn't (and doesn't need to be) the same
 as the type that we use to operate on integers.  That bare-metal
 thing doesn't even have to have a fixed width.  (CONST_WIDE_INT
 doesn't have a fixed width, it's only as big as the integer
 needs it to be.)

Ok, let's rephrase it this way then: the bare metal thing used
for the storage should ideally be the same in the tree IL and the RTL

Re: [PATCH] Improve debug info for partial inlining (PR debug/54519, take 2)

2012-10-05 Thread Richard Guenther
On Wed, 3 Oct 2012, Jakub Jelinek wrote:

 On Tue, Sep 11, 2012 at 03:59:56PM +0200, Jakub Jelinek wrote:
  As discussed in the PR, right now we do a very bad job for debug info
  of partially inlined functions (both when they are kept only partially
  inlined, or when partial inlining is performed, but doesn't seem to be
  useful and foo.part.N is inlined back, either into the original function, or
  into a function into which foo has been inlined first).
  
  This patch improves that by doing something similar to what ipa-prop.c does,
  in particular for arguments that aren't actually passed to foo.part.N
  we add debug args and corresponding debug bind and debug source bind stmts
  to provide better debug info (if foo.part.N isn't inlined, then
  DW_OP_GNU_parameter_ref is going to be used together with corresponding call
  site arguments).
  
  Bootstrapped/regtested on x86_64-linux and i686-linux, some of the tests
  still fail with some option combinations, am going to file a DF VTA PR for
  that momentarily.  Ok for trunk?
 
 Now that Alex fixed the DF VTA issue, most of the non-LTO FAILs are gone,
 the remaining one on x86_64 looks like a GDB bug, and there is one -Os
 failure on pr54519-3.c which could be either alias oracle or var-tracking
 bug/missing feature - basically there is a non-addressable parameter
 in stack slot, and vt_canon_true_dep - canon_true_dependence thinks an
 argument push insn might alias with it, because it doesn't have a MEM_EXPR
 and ao_ref_from_mem fails.
 
 Posting an updated patch, because last night I found that even when the
 patch should have fixed
 static inline void foo (int x, int y) { asm volatile ("nop"); }
 static inline void bar (int z) { foo (z, 0); foo (z, 1); }
 int main ()
 {
   bar (0);
   bar (1);
   return 0;
 }
 it didn't, there was a confusion when DECL_ORIGIN should be used and when
 not.  This version of the patch fixes that, on this testcase (adjusted
 added as pr54519-6.c) p x, p y and up; p z now work well.
 
 Bootstrapped/regtested on x86_64-linux and i686-linux, ok for trunk?
 
 2012-10-03  Jakub Jelinek  ja...@redhat.com
 
   PR debug/54519
   * ipa-split.c (split_function): Add debug args and
   debug source and normal stmts for args_to_skip which are
   gimple regs.
   * tree-inline.c (copy_debug_stmt): When inlining, adjust
   source debug bind stmts to debug binds of corresponding
   DEBUG_EXPR_DECL.
 
   * gcc.dg/guality/pr54519-1.c: New test.
   * gcc.dg/guality/pr54519-2.c: New test.
   * gcc.dg/guality/pr54519-3.c: New test.
   * gcc.dg/guality/pr54519-4.c: New test.
   * gcc.dg/guality/pr54519-5.c: New test.
   * gcc.dg/guality/pr54519-6.c: New test.
 
 --- gcc/ipa-split.c.jj2012-09-25 14:26:52.612821323 +0200
 +++ gcc/ipa-split.c   2012-10-02 17:41:31.030357922 +0200
 @@ -1059,6 +1059,7 @@ split_function (struct split_point *spli
gimple last_stmt = NULL;
unsigned int i;
tree arg, ddef;
 +  VEC(tree, gc) **debug_args = NULL;
  
if (dump_file)
  {
 @@ -1232,6 +1233,65 @@ split_function (struct split_point *spli
gimple_set_block (call, DECL_INITIAL (current_function_decl));
VEC_free (tree, heap, args_to_pass);
  

The following could use a comment on what you are doing ...

 +  if (args_to_skip)
 +for (parm = DECL_ARGUMENTS (current_function_decl), num = 0;
 +  parm; parm = DECL_CHAIN (parm), num++)
 +  if (bitmap_bit_p (args_to_skip, num)
 +      && is_gimple_reg (parm))
 + {
 +   tree ddecl;
 +   gimple def_temp;
 +
 +   arg = get_or_create_ssa_default_def (cfun, parm);
 +   if (!MAY_HAVE_DEBUG_STMTS)
 + continue;
 +   if (debug_args == NULL)
 + debug_args = decl_debug_args_insert (node->symbol.decl);
 +   ddecl = make_node (DEBUG_EXPR_DECL);
 +   DECL_ARTIFICIAL (ddecl) = 1;
 +   TREE_TYPE (ddecl) = TREE_TYPE (parm);
 +   DECL_MODE (ddecl) = DECL_MODE (parm);
 +   VEC_safe_push (tree, gc, *debug_args, DECL_ORIGIN (parm));
 +   VEC_safe_push (tree, gc, *debug_args, ddecl);
 +   def_temp = gimple_build_debug_bind (ddecl, unshare_expr (arg),
 +   call);
 +   gsi_insert_after (gsi, def_temp, GSI_NEW_STMT);
 + }
 +  if (debug_args != NULL)
 +{
 +  unsigned int i;
 +  tree var, vexpr;
 +  gimple_stmt_iterator cgsi;
 +  gimple def_temp;
 +
 +  push_cfun (DECL_STRUCT_FUNCTION (node->symbol.decl));

What do you need the push/pop_cfun for?  I see ENTRY_BLOCK_PTR
(easily fixable), but else?

 +  var = BLOCK_VARS (DECL_INITIAL (node->symbol.decl));
 +  i = VEC_length (tree, *debug_args);
 +  cgsi = gsi_after_labels (single_succ (ENTRY_BLOCK_PTR));
 +  do
 + {
 +   i -= 2;
 +   while (var != NULL_TREE
 +          && DECL_ABSTRACT_ORIGIN (var)
 + != VEC_index (tree, *debug_args, i))
 + var = TREE_CHAIN (var);
 +   if (var == NULL_TREE)
 + break;
 +

Re: patch to fix constant math

2012-10-05 Thread Richard Guenther
On Fri, Oct 5, 2012 at 2:26 PM, Richard Sandiford
rdsandif...@googlemail.com wrote:
 Richard Guenther richard.guent...@gmail.com writes:
 On Fri, Oct 5, 2012 at 1:24 PM, Richard Sandiford
 rdsandif...@googlemail.com wrote:
 Richard Guenther richard.guent...@gmail.com writes:
 The issue is that unlike RTL where we construct double-ints from
 CONST_INT/CONST_DOUBLE right now, tree has the double-int
 _embedded_.  That's why I say that the thing we embed in
 a CONST_WIDE_INT or a tree INTEGER_CST needs to be
 bare metal, and that's what I would call wide-int.

 OK, but that's deliberately not what Kenny's patch calls wide int.
 The whole idea is that the bare metal thing we embed in a
 CONST_WIDE_INT or tree isn't (and doesn't need to be) the same
 as the type that we use to operate on integers.  That bare-metal
 thing doesn't even have to have a fixed width.  (CONST_WIDE_INT
 doesn't have a fixed width, it's only as big as the integer
 needs it to be.)

 Ok, let's rephrase it this way then: the bare metal thing used
 for the storage should ideally be the same in the tree IL and the RTL
 IL _and_ the higher-level abstract wide-int.

 Hmm, OK, that's a straight disagreement then.

 So to me wide-ints provide the higher-level abstraction ontop of
 double-ints (which would remain the storage entity).

 Such higher-level abstraction is very useful, also for double-ints and
 also on the tree level.  There is no requirement to provide bigger
 double-int (or wide-int) for this.  Let's call this abstraction
 wide-int (as opposed to my suggested more piecemail double_sint,
 double_uint).  You can perfectly model it ontop of the existing
 double_int storage.

 As of providing larger double-ints - there is not much code left
 (ok, quite an overstatement ;)) that relies on the implementation
 detail of double-int containing exactly two HOST_WIDE_INTs.
 The exact purpose of double-ints was to _hide_ this (previously
 we only had routines like mul_double_with_sign which take
 two HOST_WIDE_INT components).  Most code dealing with
 the implementation detail is code interfacing with middle-end
 routines that take a HOST_WIDE_INT argument (thus the
 double_int_fits_hwi_p predicates) - even wide_int has to support
 this kind of interfacing.

 So, after introducing wide_int that just wraps double_int and
 changing all user code (hopefully all, I guess mostly all), we'd
 tackle the issue that the size of double_int's is host dependent.
 A simple solution is to make its size dependent on a target macro
 (yes, macro ...), so on a 32bit HWI host targeting a 64bit 'HWI' target
 you'd simply have four HWIs in the 'double-int' storage (and
 arithmetic needs to be generalized to support this).

 I just don't see why this is a good thing.  The constraints
 are different when storing integers and when operating on them.
 When operating on them, we want something that is easily constructed
 on the stack, so we can create temporary structures very cheaply,
 and can easily pass and return them.  We happen to know at GCC's
 compile time how big the biggest integer will ever be, so it makes
 sense for wide_int to be that wide.

 I'm not arguing against this.  I'm just saying that the internal
 representation will depend on the host - not the number of total
 bits, but the number of pieces.

 Sure, but Kenny already has a macro to specify how many bits we need
 (MAX_BITSIZE_MODE_ANY_INT).  We can certainly wrap:

   HOST_WIDE_INT val[MAX_BITSIZE_MODE_ANY_INT / HOST_BITS_PER_WIDE_INT];

 in a typedef if that's what you prefer.

I'd prefer it to be initially double_int, and later fixed to double_int
with a member like the above.  Possibly renamed as well.
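
(A sketch of what such a wrapped member might look like; the struct and
field names are invented, and only MAX_BITSIZE_MODE_ANY_INT and
HOST_BITS_PER_WIDE_INT are the macros discussed above, stubbed here so
the snippet stands on its own.)

 typedef long long HOST_WIDE_INT;          /* stub for the GCC typedef */
 #define HOST_BITS_PER_WIDE_INT 64         /* stub */
 #define MAX_BITSIZE_MODE_ANY_INT 256      /* illustrative value */

 struct wide_int_storage                   /* name invented */
 {
   unsigned short len;                     /* number of significant HWIs */
   HOST_WIDE_INT val[MAX_BITSIZE_MODE_ANY_INT / HOST_BITS_PER_WIDE_INT];
 };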

 But when storing integers we want something compact.  If your
 machine supports 256-bit integers, but the code you're compiling
 makes heavy use of 128-bit integers, why would you want to waste
 128 of 256 bits on every stored integer?  Which is why even
 CONST_WIDE_INT doesn't have a fixed width.

 You seem to be arguing that the type used to store integers in the IR
 has to be same as the one that we use when performing compile-time
 arithmetic, but I think it's better to have an abstraction between
 the two.

 Well, you don't want to pay the cost dividing 256bit numbers all
 the time when most of your numbers are only 128bit.  So we don't
 really want to perform compile-time arithmetic on the biggest
 possible precision either.

 wide_int doesn't perform them in the biggest possible precision.
 It performs them in the precision of the result.  It just happens
 to have enough storage to cope with all possible precisions (because
 the actual precision of the result is usually only known at GCC's runtime).

 So if you think a pair of HWIs continues to be a useful way of
 storing integers at the tree level, then we can easily continue
 to use a pair of HWIs there.

 How do you store larger ints there then?

 You'd need another tree code for wider integers.  I'm not saying that's
 a good idea

Re: patch to fix constant math

2012-10-05 Thread Richard Guenther
On Fri, Oct 5, 2012 at 2:39 PM, Richard Guenther
richard.guent...@gmail.com wrote:

 Ok, I see where you are going.  Let me look at the patch again.

* The introduction and use of CONST_SCALAR_INT_P could be split out
  (obvious and good)

* DEF_RTL_EXPR(CONST_WIDE_INT, const_wide_int, , RTX_CONST_OBJ)
  defining that new RTX is orthogonal to providing the wide_int abstraction
  for operating on CONST_INT and CONST_DOUBLE, right?

@@ -144,6 +144,7 @@ static bool
 prefer_and_bit_test (enum machine_mode mode, int bitnum)
 {
   bool speed_p;
+  wide_int mask = wide_int::set_bit_in_zero (bitnum, mode);

set_bit_in_zero ... why use this instead of
wide_int::zero (mode).set_bit (bitnum) that would match the old
double_int_zero.set_bit interface and be more composed of primitives?

   if (and_test == 0)
 {
@@ -164,8 +165,7 @@ prefer_and_bit_test (enum machine_mode m
 }

   /* Fill in the integers.  */
-  XEXP (and_test, 1)
-= immed_double_int_const (double_int_zero.set_bit (bitnum), mode);
+  XEXP (and_test, 1) = immed_wide_int_const (mask);

I suppose immed_wide_int_const may create a CONST_INT, a CONST_DOUBLE
or a CONST_WIDE_INT?

+#if TARGET_SUPPORTS_WIDE_INT
+/* Returns 1 if OP is an operand that is a CONST_INT or CONST_WIDE_INT.  */
+int
+const_scalar_int_operand (rtx op, enum machine_mode mode)
+{

why is it necessary to condition anything on TARGET_SUPPORTS_WIDE_INT?
It seems testing/compile coverage of this code will be bad ...

Can't you simply map CONST_WIDE_INT to CONST_DOUBLE for targets
not supporting CONST_WIDE_INT (what does it mean to support
CONST_WIDE_INT?  Does a target that supports CONST_WIDE_INT no
longer use CONST_DOUBLEs for integers?)

+  if (!CONST_WIDE_INT_P (op))
+return 0;

hm, that doesn't match the comment (CONST_INT is supposed to return 1 as well).
The comment doesn't really tell what the function does, it seems.

+  int prec = GET_MODE_PRECISION (mode);
+  int bitsize = GET_MODE_BITSIZE (mode);
+
+  if (CONST_WIDE_INT_NUNITS (op) * HOST_BITS_PER_WIDE_INT > bitsize)
+   return 0;

its mode argument is not documented.  And this doesn't even try to see if
the constant would fit the mode anyway (like if it's zero).  Not sure what
the function is for.

+   {
+ /* Multiword partial int.  */
+ HOST_WIDE_INT x
+   = CONST_WIDE_INT_ELT (op, CONST_WIDE_INT_NUNITS (op) - 1);
+ return (wide_int::sext (x, prec & (HOST_BITS_PER_WIDE_INT - 1))
+ == x);

err - so now we have wide_int with a mode but we pass in another mode
and see if they have the same value?  Same value as-in signed value?
Which probably is why you need to rule out different size constants above?
Because you don't know whether to sign or zero extend?


+/* Returns 1 if OP is an operand that is a CONST_WIDE_INT.  */
+int
+const_wide_int_operand (rtx op, enum machine_mode mode)
+{

similar comments, common code should be factored out at least.
Instead of conditioning a whole set of new function on supports-wide-int
please add cases to the existing functions to avoid diverging in pieces
that both kind of targets support.

@@ -2599,7 +2678,7 @@ constrain_operands (int strict)
break;

  case 'n':
-   if (CONST_INT_P (op) || CONST_DOUBLE_AS_INT_P (op))
+   if (CONST_SCALAR_INT_P (op))
  win = 1;

factoring out these changes would really make the patch less noisy.

skipping to rtl.h bits

+struct GTY((variable_size)) hwivec_def {
+  int num_elem;    /* number of elements */
+  HOST_WIDE_INT elem[1];
+};

num_elem seems redundant and computable from GET_MODE_SIZE.
Or do you want to optimize encoding like for CONST_INT (and unlike
CONST_DOUBLE)?  I doubt the above packs nicely into rtx_def?
A general issue of it though - we waste 32bits on 64bit hosts in
rtx_def between the bits and the union.  Perfect place for num_elem
(if you really need it) and avoids another 4 byte hole.  Similar issue
exists in rtvec_def.

back to where I was

@@ -645,6 +673,10 @@ iterative_hash_rtx (const_rtx x, hashval
   return iterative_hash_object (i, hash);
 case CONST_INT:
   return iterative_hash_object (INTVAL (x), hash);
+case CONST_WIDE_INT:
+  for (i = 0; i < CONST_WIDE_INT_NUNITS (x); i++)
+   hash = iterative_hash_object (CONST_WIDE_INT_ELT (x, i), hash);
+  return hash;

I see CONST_DOUBLE's value is not hashed.  Why hash the wide-int's value?

Seeing

+#define HWI_GET_NUM_ELEM(HWIVEC)   ((HWIVEC)->num_elem)
+#define HWI_PUT_NUM_ELEM(HWIVEC, NUM)  ((HWIVEC)->num_elem = (NUM))

skipping to

+#if TARGET_SUPPORTS_WIDE_INT
+  {
+rtx value = const_wide_int_alloc (len);
+unsigned int i;
+
+/* It is so tempting to just put the mode in here.  Must control
+   myself ... */
+PUT_MODE (value, VOIDmode);
+HWI_PUT_NUM_ELEM (CONST_WIDE_INT_VEC (value), len);

why is NUM_ELEM not set in const_wide_int_alloc, and only there?

+extern rtx rtx_alloc_stat_v

Re: patch to fix constant math

2012-10-05 Thread Richard Guenther
On Fri, Oct 5, 2012 at 3:18 PM, Richard Sandiford
rdsandif...@googlemail.com wrote:
 Richard Sandiford rdsandif...@googlemail.com writes:
 How is CONST_WIDE_INT variable size?

 It's just the usual trailing variable-length array thing.

 Good.  Do you get rid of CONST_DOUBLE (for integers) at the same time?

 Yeah.  I initially thought it might be OK to keep them and have
 CONST_INT, integer CONST_DOUBLEs and CONST_WIDE_INT alongside
 each other.  (The way the patch is structured means that the
 choice of whether to keep integer CONST_DOUBLEs can be changed
 very easily.)  But Kenny convinced me it was a bad idea.

 Sorry to follow up on myself, but to clarify: I was talking about
 converted targets here.  (As in, I originally thought even converted
 targets could continue to use integer CONST_DOUBLEs.)

 Unconverted targets continue to use CONST_DOUBLE.

Why is it that not all targets are converted?  What's the difficulty
with that?
I really do not like partially transitioning there.

Richard.

 Richard


Re: [PATCH] Improve debug info for partial inlining (PR debug/54519, take 2)

2012-10-05 Thread Richard Guenther
On Fri, Oct 5, 2012 at 2:49 PM, Jakub Jelinek ja...@redhat.com wrote:
 On Fri, Oct 05, 2012 at 02:20:13PM +0200, Richard Guenther wrote:
 The following could use a comment on what you are doing ...

 Will add something.

  +  if (args_to_skip)
  +for (parm = DECL_ARGUMENTS (current_function_decl), num = 0;
  +parm; parm = DECL_CHAIN (parm), num++)
  +  if (bitmap_bit_p (args_to_skip, num)
  +      && is_gimple_reg (parm))
  +   {
  + tree ddecl;
  + gimple def_temp;
  +
  + arg = get_or_create_ssa_default_def (cfun, parm);
  + if (!MAY_HAVE_DEBUG_STMTS)
  +   continue;
  + if (debug_args == NULL)
  +   debug_args = decl_debug_args_insert (node->symbol.decl);
  + ddecl = make_node (DEBUG_EXPR_DECL);
  + DECL_ARTIFICIAL (ddecl) = 1;
  + TREE_TYPE (ddecl) = TREE_TYPE (parm);
  + DECL_MODE (ddecl) = DECL_MODE (parm);
  + VEC_safe_push (tree, gc, *debug_args, DECL_ORIGIN (parm));
  + VEC_safe_push (tree, gc, *debug_args, ddecl);
  + def_temp = gimple_build_debug_bind (ddecl, unshare_expr (arg),
  + call);
  + gsi_insert_after (gsi, def_temp, GSI_NEW_STMT);
  +   }
  +  if (debug_args != NULL)
  +{
  +  unsigned int i;
  +  tree var, vexpr;
  +  gimple_stmt_iterator cgsi;
  +  gimple def_temp;
  +
  +  push_cfun (DECL_STRUCT_FUNCTION (node->symbol.decl));

 What do you need the push/pop_cfun for?  I see ENTRY_BLOCK_PTR
 (easily fixable), but else?

 I believe that gsi_insert_before in another function
 isn't going to work well.
 E.g. update_modified_stmt starts with
   if (!ssa_operands_active (cfun))
 return;

Hmm, indeed.

 Or is it ok to use gsi_insert_before_without_update and expect that
 when compilation resumes for the modified callee, it will
 update all modified stmts?  There is not much that needs
 to be updated on the stmts, a source bind has no setters nor uses
 of any SSA_NAMEs or virtual defs/sets, the normal bind has a DEBUG_EXPR_DECL
 on the RHS and so is not a use or def of anything as well.

I don't think we want to rely on that ... so just keep the push/pop_cfun.

Thanks,
Richard.

 If push_cfun/pop_cfun is not needed, guess the two loops could be merged,
 the reason for two of them was just that I didn't want to push_cfun/pop_cfun
 for each optimized away parameter.

  +  var = BLOCK_VARS (DECL_INITIAL (node->symbol.decl));
  +  i = VEC_length (tree, *debug_args);
  +  cgsi = gsi_after_labels (single_succ (ENTRY_BLOCK_PTR));
  +  do
  +   {
  + i -= 2;
  + while (var != NULL_TREE
  +        && DECL_ABSTRACT_ORIGIN (var)
  +   != VEC_index (tree, *debug_args, i))
  +   var = TREE_CHAIN (var);
  + if (var == NULL_TREE)
  +   break;
  + vexpr = make_node (DEBUG_EXPR_DECL);
  + parm = VEC_index (tree, *debug_args, i);
  + DECL_ARTIFICIAL (vexpr) = 1;
  + TREE_TYPE (vexpr) = TREE_TYPE (parm);
  + DECL_MODE (vexpr) = DECL_MODE (parm);
  + def_temp = gimple_build_debug_source_bind (vexpr, parm,
  +NULL);
  + gsi_insert_before (cgsi, def_temp, GSI_SAME_STMT);
  + def_temp = gimple_build_debug_bind (var, vexpr, NULL);
  + gsi_insert_before (cgsi, def_temp, GSI_SAME_STMT);
  +   }
  +  while (i);
  +  pop_cfun ();
  +}
  +
  +   t = VEC_index (tree, *debug_args, i + 1);
  +   gimple_debug_source_bind_set_value (stmt, t);
  +   stmt->gsbase.subcode = GIMPLE_DEBUG_BIND;

 That looks fishy ... shouldn't it be at least the other way around?

  stmt->gsbase.subcode = GIMPLE_DEBUG_BIND;
  gimple_debug_bind_set_value (stmt, t);

 It surely can.  I was considering whether to create a new wrapper
 in gimple.h for the subcode change of a debug stmt, but if there are
 no further places needing this (don't see them right now), that might be
 overkill.
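
(For illustration, the wrapper Jakub mentions could look roughly like
the following; it does not exist in gimple.h, it merely bundles the two
steps above in the order Richard suggested.)

 /* Hypothetical helper: turn a debug source bind into a plain debug
    bind with the given value.  */
 static inline void
 gimple_debug_source_bind_to_bind (gimple stmt, tree value)
 {
   stmt->gsbase.subcode = GIMPLE_DEBUG_BIND;
   gimple_debug_bind_set_value (stmt, value);
 }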

 Otherwise this looks ok.

 Thanks.

 Jakub


Re: patch to fix constant math

2012-10-05 Thread Richard Guenther
On Fri, Oct 5, 2012 at 4:14 PM, Richard Sandiford
rdsandif...@googlemail.com wrote:
 Richard Guenther richard.guent...@gmail.com writes:
 On Fri, Oct 5, 2012 at 3:18 PM, Richard Sandiford
 rdsandif...@googlemail.com wrote:
 Richard Sandiford rdsandif...@googlemail.com writes:
 How is CONST_WIDE_INT variable size?

 It's just the usual trailing variable-length array thing.

 Good.  Do you get rid of CONST_DOUBLE (for integers) at the same time?

 Yeah.  I initially thought it might be OK to keep them and have
 CONST_INT, integer CONST_DOUBLEs and CONST_WIDE_INT alongside
 each other.  (The way the patch is structured means that the
 choice of whether to keep integer CONST_DOUBLEs can be changed
 very easily.)  But Kenny convinced me it was a bad idea.

 Sorry to follow up on myself, but to clarify: I was talking about
 converted targets here.  (As in, I originally thought even converted
 targets could continue to use integer CONST_DOUBLEs.)

 Unconverted targets continue to use CONST_DOUBLE.

 Why is it that not all targets are converted?  What's the difficulty
 with that?
 I really do not like partially transitioning there.

 The problem is that CONST_DOUBLE as it exists today has two meanings:
 a floating-point meaning and an integer meaning.  Ports that handle
 CONST_DOUBLEs are aware of this and expect the two things to have
 the same rtx code.  Whereas in a converted port, the two things
 have different rtx codes, and the integers have a different
 representation from the current low/high pair.

 So to take the first hit in config/alpha/alpha.c,
 namely alpha_rtx_costs:

 case CONST_DOUBLE:
   if (x == CONST0_RTX (mode))
 *total = 0;
    else if ((outer_code == PLUS && add_operand (x, VOIDmode))
             || (outer_code == AND && and_operand (x, VOIDmode)))
 *total = 0;
   else if (add_operand (x, VOIDmode) || and_operand (x, VOIDmode))
 *total = 2;
   else
 *total = COSTS_N_INSNS (2);
   return true;

 What could the patch do to make this work without modification?
 The middle two cases are for integers, but the first and last
 presumably apply to both integers and floats.

I didn't say it does not require changes, just that the transition should be
finished.  Some ports have few CONST_DOUBLE uses (which is what
I am grepping for), and if the max-wide-int-size matches that of the
current CONST_DOUBLE there is little chance for code generation
differences (in fact there should be none, correct?).

In the current patch no target is converted (maybe that's just going to
be 5/n?), I'd like to see at least one commonly tested target converted
and if we don't convert all targets another commonly tested target
to stay unconverted (just check gcc-testresults for which pair of targets
that would work).  Definitely at the end of stage3 we should have zero
unconverted targets (but I doubt converting them is a huge task - I have
converted the VECTOR_CST representation as well and we've had
the location + BLOCK merge and other changes affecting all targets).

Richard.

 Richard


Re: Variable DECL_SIZE

2012-10-04 Thread Richard Guenther
On Wed, Oct 3, 2012 at 7:05 PM, Ilya Enkovich enkovich@gmail.com wrote:
 Hi,

 I run into an SSA verification failure when I try to pass a field's
 DECL_SIZE as an operand for a CALL_EXPR. The failure occurs if the field's
 size is not a constant. In such a case DECL_SIZE holds a VAR_DECL and I
 need to find its proper SSA_NAME. I thought it might be the default one but
 gimple_default_def returns NULL. What is the right way to obtain the
 correct SSA_NAME in such a case?

There is no way.  You have to know that you need its size at the point of
gimplification.  Later there is no way to recover it.

Richard.


 Thanks,
 Ilya


Re: Functions that are CSEable but not pure

2012-10-04 Thread Richard Guenther
On Wed, Oct 3, 2012 at 9:00 PM, Jason Merrill ja...@redhat.com wrote:
 In C++ there is a common idiom called initialize on first use.  In its
 simplest form it looks like

 int lazy_i()
 {
   static int i = init;
   return i;
 }

 If the initialization is expensive or order-sensitive, this is a useful
 alternative to initialization on load
 (http://www.parashift.com/c++-faq/static-init-order-on-first-use.html).

 An interesting property of such functions is that they only have
 side-effects the first time they are called, so subsequent calls can be
 optimized away to just use the return value of the first call.

If the result is not needed, are we allowed to remove a call to this
function?  Basically, is the function allowed to detect if it was
called before?
If we can remove it, and thus the function may not exploit knowledge of whether
it was called before, then we can also perform partial redundancy elimination in

 if ()
   lazy_i ();
 lazy_i ();

so I really hope we can also DCE these calls - otherwise their property would
be really weird, as CSE also involves a DCE piece:

 i = lazy_i ();
 j = lazy_i ();

->

  i = lazy_i ();
  lazy_i ();
  j = i;

->

 i = lazy_i ();
 j = i;

But if we are allowed to DCE a lazy_i call with unused result then it really is
pure.  Basically with this property we consider the static variable
it writes to
as local, as it cannot be inspected in any other way than by calling the
function or inspecting its return value.

So - what's wrong with using pure?  That it doesn't feel correct?
Then I suppose
we should simply re-word its definition ;)

 Currently there is no way to express this in GCC so that the optimizers know
 that multiple calls are redundant.  Marking the function as pure causes
 calls to be CSEd as desired, but also asserts that the first call has no
 side-effects, which is not the case.

 My implementation of dynamic initialization of TLS variables as mandated by
 the C++11 and OpenMP standards uses this idiom implicitly, so I'd really
 like to be able to optimize away multiple calls.

 Right now we have the following ECF flags (among others):

 const = no read, no write, no side-effects, cseable
 pure = no write, no side-effects, cseable
 novops = no read, no write

note that I consider novops a red-herring.  novops is basically
'const' volatile, but we don't have  a volatile flag on calls (we
have it, but that's for the store to the result).  I'd rather get rid
of novops ...

 looping_const_or_pure = wait, actually there are side-effects

 Seems like the difference between novops and looping_const_or_pure is
 whether calls are subject to CSE.  Is that right?

No, novops and looping_const_or_pure tell you whether they are subject
to DCE, all const/pure functions are subject to CSE, novops functions
are not (as said above, novops is really a volatile const).

 Rather than checking these flag bundles, it seems to me that we ought to
 break them up into

 no reads
 no writes
 no side-effects
 cseable side-effects

no side-effects and cseable side-effects doesn't sound like a good
distinction to me - side-effects are not 'cse'-able or they would not
be side-effects.  But I suppose it all boils down to my question above,
where the PRE case is really the most interesting (can the formerly
second call detect it used the initialized path or the non-initialized
path?)

 So const         would be noread|nowrite|noside
    pure          would be nowrite|noside
    const|looping would be noread|nowrite|cseside
    pure|looping  would be nowrite|cseside
    novops        would be noread|nowrite

 and the behavior I want would be just cseside.

 Does this make sense?  Anyone more familiar with this area of the code want
 to implement it for me?  :}
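
(Spelled out as flags, the proposal above could look roughly like this;
the names are illustrative and are not the real ECF_* flags.)

 enum
 {
   FN_NO_READS  = 1 << 0,   /* no reads */
   FN_NO_WRITES = 1 << 1,   /* no writes */
   FN_NO_SIDE   = 1 << 2,   /* no side-effects */
   FN_CSE_SIDE  = 1 << 3    /* cseable side-effects */
 };

 #define FN_CONST          (FN_NO_READS | FN_NO_WRITES | FN_NO_SIDE)
 #define FN_PURE           (FN_NO_WRITES | FN_NO_SIDE)
 #define FN_CONST_LOOPING  (FN_NO_READS | FN_NO_WRITES | FN_CSE_SIDE)
 #define FN_PURE_LOOPING   (FN_NO_WRITES | FN_CSE_SIDE)
 #define FN_NOVOPS         (FN_NO_READS | FN_NO_WRITES)
 /* ... and the lazy-init wrappers would be just FN_CSE_SIDE.  */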

I'd say it's already implemented, just poorly documented.  Use pure!

Richard.

 Jason


Re: reverse bitfield patch

2012-10-04 Thread Richard Guenther
On Wed, Oct 3, 2012 at 12:07 AM, DJ Delorie d...@redhat.com wrote:

 Here's my current patch for the bitfield reversal feature I've been
 working on for a while, with an RX-specific pragma to apply it
 globally.  Could someone please review this?  It would be nice
 to get it in before stage1 closes again...

ChangeLog missing, new functions need a toplevel comment documenting
function, argument and return value as per coding conventions.

Richard.


 Index: gcc/doc/extend.texi
 ===
 --- gcc/doc/extend.texi (revision 192009)
 +++ gcc/doc/extend.texi (working copy)
 @@ -5427,12 +5427,74 @@ Note that the type visibility is applied
  associated with the class (vtable, typeinfo node, etc.).  In
  particular, if a class is thrown as an exception in one shared object
  and caught in another, the class must have default visibility.
  Otherwise the two shared objects will be unable to use the same
  typeinfo node and exception handling will break.

 +@item bit_order
 +Normally, GCC allocates bitfields from either the least significant or
 +most significant bit in the underlying type, such that bitfields
 +happen to be allocated from lowest address to highest address.
 +Specifically, big-endian targets allocate the MSB first, where
 +little-endian targets allocate the LSB first.  The @code{bit_order}
 +attribute overrides this default, allowing you to force allocation to
 +be MSB-first, LSB-first, or the opposite of whatever gcc defaults to.  The
 +@code{bit_order} attribute takes an optional argument:
 +
 +@table @code
 +
 +@item native
 +This is the default, and also the mode when no argument is given.  GCC
 +allocates LSB-first on little endian targets, and MSB-first on big
 +endian targets.
 +
 +@item swapped
 +Bitfield allocation is the opposite of @code{native}.
 +
 +@item lsb
 +Bits are allocated LSB-first.
 +
 +@item msb
 +Bits are allocated MSB-first.
 +
 +@end table
 +
 +A short example demonstrates bitfield allocation:
 +
 +@example
 +struct __attribute__((bit_order(msb))) @{
 +  char a:3;
 +  char b:3;
 +@} foo = @{ 3, 5 @};
 +@end example
 +
 +With LSB-first allocation, @code{foo.a} will be in the 3 least
 +significant bits (mask 0x07) and @code{foo.b} will be in the next 3
 +bits (mask 0x38).  With MSB-first allocation, @code{foo.a} will be in
 +the 3 most significant bits (mask 0xE0) and @code{foo.b} will be in
 +the next 3 bits (mask 0x1C).
 +
 +Note that it is entirely up to the programmer to define bitfields that
 +make sense when swapped.  Consider:
 +
 +@example
 +struct __attribute__((bit_order(msb))) @{
 +  short a:7;
 +  char b:6;
 +@} foo = @{ 3, 5 @};
 +@end example
 +
 +On some targets, or if the struct is @code{packed}, GCC may only use
 +one byte of storage for A despite it being a @code{short} type.
 +Swapping the bit order of A would cause it to overlap B.  Worse, the
 +bitfield for B may span bytes, so ``swapping'' would no longer be
 +defined as there is no ``char'' to swap within.  To avoid such
 +problems, the programmer should either fully-define each underlying
 +type, or ensure that their target's ABI allocates enough space for
 +each underlying type regardless of how much of it is used.
 +
  @end table

  To specify multiple attributes, separate them by commas within the
  double parentheses: for example, @samp{__attribute__ ((aligned (16),
  packed))}.

 Index: gcc/c-family/c-common.c
 ===
 --- gcc/c-family/c-common.c (revision 192009)
 +++ gcc/c-family/c-common.c (working copy)
 @@ -310,12 +310,13 @@ struct visibility_flags visibility_optio

  static tree c_fully_fold_internal (tree expr, bool, bool *, bool *);
  static tree check_case_value (tree);
  static bool check_case_bounds (tree, tree, tree *, tree *);

  static tree handle_packed_attribute (tree *, tree, tree, int, bool *);
 +static tree handle_bitorder_attribute (tree *, tree, tree, int, bool *);
  static tree handle_nocommon_attribute (tree *, tree, tree, int, bool *);
  static tree handle_common_attribute (tree *, tree, tree, int, bool *);
  static tree handle_noreturn_attribute (tree *, tree, tree, int, bool *);
  static tree handle_hot_attribute (tree *, tree, tree, int, bool *);
  static tree handle_cold_attribute (tree *, tree, tree, int, bool *);
  static tree handle_noinline_attribute (tree *, tree, tree, int, bool *);
 @@ -601,12 +602,14 @@ const unsigned int num_c_common_reswords
  const struct attribute_spec c_common_attribute_table[] =
  {
/* { name, min_len, max_len, decl_req, type_req, fn_type_req, handler,
 affects_type_identity } */
  { "packed",               0, 0, false, false, false,
    handle_packed_attribute , false},
 +  { "bit_order",            0, 1, false, true, false,
 +    handle_bitorder_attribute , false},
  { "nocommon",             0, 0, true,  false, false,
   

Re: inlined memcpy/memset degradation in gcc 4.6 or later

2012-10-04 Thread Richard Guenther
On Tue, Oct 2, 2012 at 4:19 PM, Walter Lee w...@tilera.com wrote:

 On TILE-Gx, I'm observing a degradation in inlined memcpy/memset in
 gcc 4.6 and later versus gcc 4.4.  Though I find the problem on
 TILE-Gx, I think this is a problem for any architectures with
 SLOW_UNALIGNED_ACCESS set to 1.

 Consider the following program:

 struct foo {
   int x;
 };

 void copy(struct foo* f0, struct foo* f1)
 {
   memcpy (f0, f1, sizeof(struct foo));
 }

 In gcc 4.4, I get the desired inline memcpy:

 copy:
 ld4s r1, r1
 st4 r0, r1
 jrp lr

 In gcc 4.7, however, I get inlined byte-by-byte copies:

 copy:
 ld1u_add r10, r1, 1
 st1_add  r0, r10, 1
 ld1u_add r10, r1, 1
 st1_add  r0, r10, 1
 ld1u_add r10, r1, 1
 st1_add  r0, r10, 1
 ld1u r10, r1
 st1  r0, r10
 jrp  lr

 The inlining of memcpy is done in expand_builtin_memcpy in builtins.c.
 Tracing through that, I see that the alignment of src_align and
 dest_align, which is computed by get_pointer_alignment, has degraded:
 in gcc 4.4 they are 32 bits, but in gcc 4.7 they are 8 bits.  This
 causes the loads generated by the inlined memcpy to be per-byte
 instead of per-4-byte.

 Looking further, gcc 4.7 uses the align field in struct
 ptr_info_def to compute the alignment.  This field appears to be
 initialized in get_ptr_info in tree-ssanames.c but it is always
 initialized to 1 byte and does not appear to change.  gcc 4.4 computes
 its alignment information differently.

 I get the same byte-copies with gcc 4.8 and gcc 4.6.

 I see a couple related open PRs: 50417, 53535, but no suggested fixes
 for them yet.  Can anyone advise on how this can be fixed?  Should I
 file a new bug, or add this info to one of the existing PRs?

There is no way to fix it.  memcpy does not require aligned arguments,
and the mere presence of a typed pointer carries zero alignment
information for the middle-end.  If you want to exercise C's notion
of alignment requirements then do not use memcpy but

 *f0 = *f1;

which works equally well.  ISTR Tom had various patches improving
alignment info for pointers, but we cannot improve this degenerate
case.  Btw, the new behavior even fixed bugs.
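
Applied to the example from the report, that means writing the copy as
an aggregate assignment (same struct foo as above):

 struct foo { int x; };

 void copy (struct foo *f0, struct foo *f1)
 {
   *f0 = *f1;   /* carries struct foo's alignment, unlike the memcpy call */
 }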

Richard.

 Thanks,

 Walter



Re: Functions that are CSEable but not pure

2012-10-04 Thread Richard Guenther
On Thu, Oct 4, 2012 at 2:56 PM, Jason Merrill ja...@redhat.com wrote:
 On 10/04/2012 08:45 AM, Jakub Jelinek wrote:

 On Thu, Oct 04, 2012 at 10:42:59AM +0200, Richard Guenther wrote:

 On Wed, Oct 3, 2012 at 9:00 PM, Jason Merrill ja...@redhat.com wrote:
 If the result is not needed, are we allowed to remove a call to this
 function?


 No.  Unless you know the same function has been already called.


 Yes, we are.  If we optimize away a use of the variable, we can also remove
 the call to the function.  We can also hoist the call so its side-effects
 occur before other side-effects, as the standard only requires that the
 initialization of the variable occur some time between thread creation and
 the first use of the variable.


 So - what's wrong with using pure?  That it doesn't feel correct?


 No, it can modify memory.


 Right, that's the issue.  The back end assumes that calls to pure functions
 won't clobber memory; calls to these functions can clobber arbitrary memory,
 but only on the first call.

Ugh.  Especially with the above (you can DCE those calls) this is
severely mis-specified ... and any implementation error-prone (look what a
mess our loosely defined 'malloc' attribute opened ...).

I thought of a testcase like

 int *p = get_me ();
 .. = *p;
 int *q = get_me ();
 .. = *q;

and get_me allocating/initializing and returning a singleton.

But you tell me it's more complicated and get_me () needs to
be a barrier for any load/store (as it may modify arbitrary memory,
but only on the first call).

I think that "may modify arbitrary memory" isn't going to fly, and
my answer would be, better don't try to optimize anything here,
at least not in generic terms.  How would you handle this in
the alias oracle?  How would you then make CSE recognize
two functions return the same value and are CSEable?

 I think the plan was for these functions not to return any value,


 No, I'm talking about the wrapper function which returns a reference to the
 variable (as in my example).

Can you come up with a short but complete testcase illustrating the issue
better (preferably C, so I don't need to read-in magic points where construction
happens)?

Thanks,
Richard.

 Jason



Re: Functions that are CSEable but not pure

2012-10-04 Thread Richard Guenther
On Thu, Oct 4, 2012 at 5:22 PM, Jason Merrill ja...@redhat.com wrote:
 On 10/04/2012 09:07 AM, Richard Guenther wrote:

 Ugh.  Especially with the above (you can DCE those calls) makes this
 severly mis-specified ... and any implementation error-prone (look what
 mess our losely defined 'malloc' attribute opened ...).

 I thought of a testcase like

   int *p = get_me ();
   .. = *p;
   int *q = get_me ();
   .. = *q;

 and get_me allocating/initalizing and returning a singleton.


 Right.


 But you tell me it's more complicated and get_me () needs to
 be a barrier for any load/store (as it may modify arbitrary memory,
 but only on the first call).


 Yes, because the initialization is user-written code.


 I think that may modify arbitrary memory isn't going to fly and
 my answer would be, better don't try to optimize anything here,
 at least not in generic terms.  How would you handle this in
 the alias oracle?  How would you then make CSE recognize
 two functions return the same value and are CSEable?


 For aliasing purposes, the call is like a call to a normal function. For CSE
 purposes, we want to recognize identical calls and combine them.  I don't
 know the GCC bits well enough to be any more specific.


 Can you come up with a short but complete testcase illustrating the issue
 better (preferably C, so I don't need to read-in magic points where
 construction
 happens)?


 int init_count;
 int data;

 void init()
 {
   static int initialized;
   if (!initialized)
 {
   data = ++init_count;
   initialized = 1;
 }
 }

 inline int *get_me() __attribute ((pure));
 inline int *get_me()
 {
   init ();
   return data;
 }

 int sink;

 int main()
 {
   sink = init_count;
   int *p = get_me();
   if (init_count != 1)
 __builtin_abort();
   int *q = get_me();
   if (init_count != 1)
 __builtin_abort();
   return *p + *q;
 }

 On this testcase, gcc -O2 doesn't reload init_count after the call to get_me
 because it thinks that the call can't have modified init_count. I want the
 compiler to know that it is safe to discard the redundant assignment, but
 not make assumptions about memory.

But isn't it a fact that it _cannot_ modify init_count?  If the second call
is CSEable then it cannot have side-effects that are observable at
the call site.  Is the following an example you would consider to fall
under your CSEing?

int init_count;
int data;
int initialized;

void init()
{
  if (!initialized)
{
  data = ++init_count;
  initialized = 1;
}
}

inline int *get_me() __attribute ((pure));
inline int *get_me()
{
  init ();
  return data;
}

int sink;

int main()
{
  sink = init_count;
  int *p = get_me();
  if (init_count != 1)
__builtin_abort();
  initialized = 0;
  int *q = get_me();
  if (init_count != 2)
__builtin_abort();
  return *p + *q;
}

?  If so, then why can we assume get_me returns the same pointer even here?

Richard.


 On 10/04/2012 08:59 AM, Jakub Jelinek wrote: On Thu, Oct 04, 2012 at
 08:56:03AM -0400, Jason Merrill wrote:
 Sure, but I thought you want to inline the wrapper function as soon as
 possible.  Or do you want to keep it as some kind of builtin that gets
 expanded during expansion to the test and call?

 Ah, I see your point; if get_me is inlined we end up with two calls to init,
 so it would be good to mark init with the same hypothetical attribute.

 Jason



Re: Functions that are CSEable but not pure

2012-10-04 Thread Richard Guenther
On Thu, Oct 4, 2012 at 7:08 PM, Richard Guenther
richard.guent...@gmail.com wrote:
 On Thu, Oct 4, 2012 at 5:22 PM, Jason Merrill ja...@redhat.com wrote:
 On 10/04/2012 09:07 AM, Richard Guenther wrote:

 Ugh.  Especially with the above (you can DCE those calls) makes this
 severly mis-specified ... and any implementation error-prone (look what
 mess our losely defined 'malloc' attribute opened ...).

 I thought of a testcase like

   int *p = get_me ();
   .. = *p;
   int *q = get_me ();
   .. = *q;

 and get_me allocating/initalizing and returning a singleton.


 Right.


 But you tell me it's more complicated and get_me () needs to
 be a barrier for any load/store (as it may modify arbitrary memory,
 but only on the first call).


 Yes, because the initialization is user-written code.


 I think that may modify arbitrary memory isn't going to fly and
 my answer would be, better don't try to optimize anything here,
 at least not in generic terms.  How would you handle this in
 the alias oracle?  How would you then make CSE recognize
 two functions return the same value and are CSEable?


 For aliasing purposes, the call is like a call to a normal function. For CSE
 purposes, we want to recognize identical calls and combine them.  I don't
 know the GCC bits well enough to be any more specific.


 Can you come up with a short but complete testcase illustrating the issue
 better (preferably C, so I don't need to read-in magic points where
 construction
 happens)?


 int init_count;
 int data;

 void init()
 {
   static int initialized;
   if (!initialized)
 {
   data = ++init_count;
   initialized = 1;
 }
 }

 inline int *get_me() __attribute ((pure));
 inline int *get_me()
 {
   init ();
   return data;
 }

 int sink;

 int main()
 {
   sink = init_count;
   int *p = get_me();
   if (init_count != 1)
 __builtin_abort();
   int *q = get_me();
   if (init_count != 1)
 __builtin_abort();
   return *p + *q;
 }

 On this testcase, gcc -O2 doesn't reload init_count after the call to get_me
 because it thinks that the call can't have modified init_count. I want the
 compiler to know that it is safe to discard the redundant assignment, but
 not make assumptions about memory.

 But isn't it a fact that it _cannot_ modify init_count?  If the second call
 is CSEable then it cannot have side-effects that are observable at
 the call site.  Is the following an example you would consider to fall
 under your CSEing?

 int init_count;
 int data;
 int initialized;

 void init()
 {
   if (!initialized)
 {
   data = ++init_count;
   initialized = 1;
 }
 }

 inline int *get_me() __attribute ((pure));
 inline int *get_me()
 {
   init ();
   return data;
 }

 int sink;

 int main()
 {
   sink = init_count;
   int *p = get_me();
   if (init_count != 1)
 __builtin_abort();
   initialized = 0;
   int *q = get_me();
   if (init_count != 2)
 __builtin_abort();
   return *p + *q;
 }

 ?  If so, then why can we assume get_me returns the same pointer even here?

Or similar:

int main()
{
  sink = init_count;
  int *p = get_me();
  if (init_count != 1)
__builtin_abort();
}

is this required to not abort?  p is unused and with 'pure' you'd DCE the call.

That is, I am confused about the distinction you seem to make between
the static variable 'initialized' and the global 'init_count'.  You seem to
imply that the attribute would mean that we can CSE side-effects to
global memory but that those side-effects may be value-changing!

In practice, for any CSE implementation this hypothetical cse-side-effects
attribute would only allow us to CSE if there are no intermediate
side-effects between two such calls, and as soon as the second get_me
call would be CSEd we'd also CSE the init_count value (which you
didn't want to CSE?!).
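To spell out what I mean (a sketch of the statement sequence only, using
the names from your testcase; this is an illustration, not a proposal):

extern int init_count;
extern int *get_me (void);

int
use (void)
{
  int a = init_count;   /* load 1 */
  int *p = get_me ();   /* may clobber memory on its first execution,
                           so load 2 must not be CSEd with load 1     */
  int b = init_count;   /* load 2 */
  int *q = get_me ();   /* candidate for CSE with the first call      */
  int c = init_count;   /* load 3 - the moment the second call is
                           removed, the same value-numbering walk sees
                           no clobber between load 2 and load 3 and
                           CSEs them, so the side-effects get CSEd
                           together with the return value             */
  return a + b + c + *p + *q;
}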

Still confused ;)

Richard.

 Richard.


 On 10/04/2012 08:59 AM, Jakub Jelinek wrote: On Thu, Oct 04, 2012 at
 08:56:03AM -0400, Jason Merrill wrote:
 Sure, but I thought you want to inline the wrapper function as soon as
 possible.  Or do you want to keep it as some kind of builtin that gets
 expanded during expansion to the test and call?

 Ah, I see your point; if get_me is inlined we end up with two calls to init,
 so it would be good to mark init with the same hypothetical attribute.

 Jason



Re: Functions that are CSEable but not pure

2012-10-04 Thread Richard Guenther
On Thu, Oct 4, 2012 at 7:15 PM, Jakub Jelinek ja...@redhat.com wrote:
 On Thu, Oct 04, 2012 at 07:08:02PM +0200, Richard Guenther wrote:
 But isn't it a fact that it _cannot_ modify init_count?  If the second call
 is CSEable then it cannot have side-effects that are observable at
 the call site.  Is the following an example you would consider to fall
 under your CSEing?

 int init_count;
 int data;
 int initialized;

 void init()
 {
   if (!initialized)
 {
   data = ++init_count;
   initialized = 1;
 }
 }

 inline int *get_me() __attribute ((pure));
 inline int *get_me()
 {
   init ();
   return data;
 }

 int sink;

 int main()
 {
   sink = init_count;
   int *p = get_me();
   if (init_count != 1)
 __builtin_abort();
   initialized = 0;
   int *q = get_me();
   if (init_count != 2)
 __builtin_abort();
   return *p + *q;
 }

 The above isn't a singleton function, as you are clearing initialized,
 therefore it doesn't have the properties the thread_local var get_address
 wrapper has.  The singleton function really is
 void singleton (void)
 {
   static __thread bool initialized;
   if (!initialized) {
 initialized = true;
 call_some_function_that_may_modify_memory ();
   }
 }
 and has side effects just first time in a thread.

So I suppose the testcase that would be valid but would break when using
pure is instead

int init_count;
int data;

void init()
{
  static int initialized;
  if (!initialized)
{
  data = ++init_count;
  initialized = 1;
}
}

inline int *get_me() __attribute ((pure));
inline int *get_me()
{
  init ();
  return data;
}

int main()
{
  int x = init_count;
  int *p = get_me();
  if (init_count == x)
__builtin_abort();
  int *q = get_me();
  if (init_count == x)
__builtin_abort();
}

Here, when get_me is pure, we CSE init_count over the _first_ call of get_me.
Technically we might only CSE it over the second call of get_me.  Thus there
is flow-sensitive information coming out of nowhere when we try to skip
the second get_me () call in looking up the init_count load after it (coming
out of nowhere for the alias-oracle query, which only includes 'init_count'
and the stmt 'int *q = get_me ()').

Thus my answer is, for the alias-oracle the hypothetical 'singleton_init'
attribute provides no useful information.  Thus CSE and/or DCE of
'singleton_init'
calls has to use special code (which I can think of being added to DCE, harder
to existing SCCVN, maybe easier to DOM) which will usually not fit into
the algorithm (so you might be able to teach FRE elimination how to
eliminate the 2nd call to get_me () but you will need a 2nd FRE pass to
then figure out that init_count can be CSEd as well - which makes this
all very undesirable).  Inlining will also make this harder (you lose
the attribute and fall into the lucas situation ...).

At some point this singleton-ness should be auto-detected (there is
inlined code in SPEC 2k lucas that would benefit from constant propagation,
and SPEC 2k6 libquantum uses a related concept - change / restore
of a global in a function, with the net effect that the function should not
be considered to clobber it).

Richard.

 Jakub


Re: RFC: Using DECL_NO_LIMIT_STACK as a backend specific flag

2012-10-04 Thread Richard Guenther
On Wed, Oct 3, 2012 at 5:20 PM, nick clifton ni...@redhat.com wrote:
 Hi Ian,


 Can't you just keep a list of the decls for which you have issued the
 warning?


 Yes - that would work too.  In fact I agree that this would be cleaner
 solution in my particular case.  I'll create a new patch...



 How many decls do you expect to be on that list?


 Not very many.  Maybe two or three at most.

 But I am interested to know if there is a way for targets to add their own
 information to function decls (and decls in general).  If not for this
 particular case, then for problems to come in the future.

Not to the decl.  But I suppose you are only looking at defined functions,
so it should suffice to amend struct function.  It's reasonable to allow
the target to associate extra info with struct function and we already
have a way to do that via the init_machine_status target hook.
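That is, roughly (a sketch only - the struct, field and function names
below are made up; the allocator name is the one gengtype would generate
for this struct):

/* In the backend's .c file.  */
struct GTY(()) machine_function
{
  /* Whatever per-function state the target wants, e.g. whether a
     diagnostic was already issued for this function.  */
  unsigned issued_foo_warning : 1;
};

static struct machine_function *
foo_init_machine_status (void)
{
  return ggc_alloc_cleared_machine_function ();
}

static void
foo_option_override (void)
{
  /* Tell the middle-end how to allocate cfun->machine.  */
  init_machine_status = foo_init_machine_status;
}

/* Then, while compiling a function:

     if (!cfun->machine->issued_foo_warning)
       {
         warning (0, "...");
         cfun->machine->issued_foo_warning = 1;
       }
*/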

Richard.

 Cheers
   Nick




Re: RFA: add lock_length attribute to break branch-shortening cycles

2012-10-04 Thread Richard Guenther
On Wed, Oct 3, 2012 at 8:22 PM, Joern Rennecke
joern.renne...@embecosm.com wrote:
 The ARCompact architecture has some pipelining features that result in
 the vanilla branch shortening not always converging.

 Moreover, there are some short, complex branch instructions with very short
 offsets; replacing them with a multi-insn sequence when the offset doesn't
 reach makes the code significantly longer.  Thus, when starting branch
 shortening with pessimistic assumptions, the short branches are often not
 used because of the pessimistic branch length causing the offsets going out
 of range.
 This problem can be avoided when starting with a low estimate and working
 upwards.  However, that makes the incidence of infinite branch shortening
 cycles higher, and also makes it impossible to just break out after some
 iteration count.

 To address these issues, I've made the generator programs recognize the
 optional lock_length attribute.

 To quote from the documentation added for this feature:

  If you define the `lock_length' attribute, branch shortening will work
 the other way round: it starts out assuming minimum instruction lengths
 and iterates from there.  In addition, the value of the `lock_length'
 attribute does not decrease across iterations, and the value computed
 for the `length' attribute will be no smaller than that of the
 `lock_length' attribute.

I'm missing a few things in this description:
- what is the value of lock_length supposed to be?  From the lock
  prefix it sounds like it is something unchanging, maybe even constant,
  thus a maximum?
- the length attribute still needs to be specified when lock_length is?
  how do they relate?  Is lock_length always smaller / bigger than length?
- what happens if you have patterns with lock_length and patterns without?
- what patterns does lock_length apply to?

In general, optimistically attacking this kind of problem should always be
better - did you try simply switching this on for all targets?  It shouldn't
be slower, and the only thing you need to guarantee is that during iteration
you never make insn lengths smaller again.
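A toy model of what I mean (self-contained, nothing target-specific;
two of the four "insns" are branches whose length depends on the code
they jump over):

#include <stdio.h>

#define N 4

static int min_len[N] = { 2, 2, 2, 2 };
static int max_len[N] = { 2, 6, 2, 6 };
static int length[N];

/* Insns 1 and 3 need their long form once the code they jump over
   exceeds 3 bytes.  */
static int
needed_length (int i)
{
  if (i == 1 && length[2] + length[3] > 3)
    return max_len[i];
  if (i == 3 && length[0] + length[1] > 3)
    return max_len[i];
  return min_len[i];
}

int
main (void)
{
  int changed, i;

  for (i = 0; i < N; i++)
    length[i] = min_len[i];        /* start optimistically */

  do
    {
      changed = 0;
      for (i = 0; i < N; i++)
        {
          int l = needed_length (i);
          if (l > length[i])       /* never shrink a length again */
            {
              length[i] = l;
              changed = 1;
            }
        }
    }
  while (changed);

  for (i = 0; i < N; i++)
    printf ("insn %d: length %d\n", i, length[i]);
  return 0;
}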

Richard.

 bootstrapped and regression tested on i686-pc-linux-gnu

 2012-10-03  Joern Rennecke  joern.renne...@embecosm.com

 * final.c (get_attr_length_1): Use direct recursion rather than
 calling get_attr_length.
 (get_attr_lock_length): New function.
 (INSN_VARIABLE_LENGTH_P): Define.
 (shorten_branches): Take HAVE_ATTR_lock_length into account.
 Don't overwrite non-delay slot insn lengths with the lengths of
 delay slot insns with same uid.
 * genattrtab.c (lock_length_str): New variable.
 (make_length_attrs): New parameter base.
 (main): Initialize lock_length_str.
 Generate lock_lengths attributes.
 * genattr.c (gen_attr): Emit declarations for lock_length attribute
 related functions.
 * doc/md.texi (node Insn Lengths): Document lock_length attribute.

 Index: doc/md.texi
 ===
 --- doc/md.texi (revision 192036)
 +++ doc/md.texi (working copy)
 @@ -8004,6 +8004,20 @@ (define_insn jump
(const_int 6)))])
  @end smallexample

 +@cindex lock_length
 +Usually, branch shortening is done assuming the worst case (i.e. longest)
 +lengths, and then iterating (if optimizing) to smaller lengths till
 +no further changed occur.  This does not work so well for architectures
 +that have very small minimum offsets and considerable jumps in instruction
 +lengths.
 +
 +If you define the @code{lock_length} attribute, branch shortening will
 +work the other way round: it starts out assuming minimum instruction
 +lengths and iterates from there.  In addition, the value of the
 +@code{lock_length} attribute does not decrease across iterations, and
 +the value computed for the @code{length} attribute will be no smaller
 +than that of the @code{lock_length} attribute.
 +
  @end ifset
  @ifset INTERNALS
  @node Constant Attributes
 Index: final.c
 ===
 --- final.c (revision 192036)
 +++ final.c (working copy)
 @@ -312,6 +312,7 @@ dbr_sequence_length (void)
 `insn_current_length'.  */

  static int *insn_lengths;
 +static char *uid_lock_length;

  VEC(int,heap) *insn_addresses_;

 @@ -447,6 +448,20 @@ get_attr_length (rtx insn)
return get_attr_length_1 (insn, insn_default_length);
  }

 +#ifdef HAVE_ATTR_lock_length
 +int
 +get_attr_lock_length (rtx insn)
 +{
 +  if (uid_lock_length  insn_lengths_max_uid  INSN_UID (insn))
 +return uid_lock_length[INSN_UID (insn)];
 +  return get_attr_length_1 (insn, insn_min_lock_length);
 +}
 +#define INSN_VARIABLE_LENGTH_P(INSN) \
 +  (insn_variable_length_p (INSN) || insn_variable_lock_length_p (INSN))
 +#else
 +#define INSN_VARIABLE_LENGTH_P(INSN) (insn_variable_length_p (INSN))
 +#endif
 +
  /* Obtain the current length of an insn. 

Re: opts.c, gcc.c: Plug some memory leaks - and an out-of-bounds memory access

2012-10-04 Thread Richard Guenther
On Wed, Oct 3, 2012 at 11:01 PM, Tobias Burnus bur...@net-b.de wrote:
 Found using http://scan5.coverity.com/

 Build on x86-64-gnu-linux with C/C++/Fortran. I will now do an all-language
 build/regtest.
 OK when it passes?

 (Note to the save_string call: I reduced it by 2: The +1 in the call makes
 it long (out of bounds) and the +1 in temp_filename_length is not needed
 (but also doesn't harm) as tmp is null terminated and save_string adds
 another '\0' after copying len bytes.)

- prefix = concat (target_sysroot_suffix, prefix, NULL);
-  prefix = concat (sysroot_no_trailing_dir_separator, prefix, NULL);
+   {
+ char *tmp;
+ tmp = concat (target_sysroot_suffix, prefix, NULL);
+ prefix = concat (sysroot_no_trailing_dir_separator, tmp, NULL);
+ free (tmp);
+   }

prefix = concat (sysroot_no_trailing_dir_separator,
target_sysroot_suffix, prefix, NULL);

should be equivalent and easier to read, no?

+  else
+   prefix = concat (sysroot_no_trailing_dir_separator, prefix, NULL);
+

Btw, we're not caring too much about memleaks in the driver ...

Otherwise the patch looks ok with the above change.

Thanks,
Richard.

 Tobias


Re: [Ping]FW: [PATCH] Cilk Plus merging to trunk (2 of n)

2012-10-04 Thread Richard Guenther
On Thu, Oct 4, 2012 at 2:22 AM, Iyer, Balaji V balaji.v.i...@intel.com wrote:
 Hi Joseph,
 Did you get a chance to look at this submission? I think I have fixed 
 all the changes you have mentioned. Is it OK for trunk?

 Thanks,

 Balaji V. Iyer.

-Original Message-
From: gcc-patches-ow...@gcc.gnu.org [mailto:gcc-patches-
ow...@gcc.gnu.org] On Behalf Of Iyer, Balaji V
Sent: Wednesday, September 26, 2012 7:16 PM
To: Joseph Myers
Cc: gcc-patches@gcc.gnu.org; al...@redhat.com; r...@redhat.com;
l...@redhat.com
Subject: RE: [PATCH] Cilk Plus merging to trunk (2 of n)

Hello Joseph,
   In my last patch, I forgot to add the change Richard Guenther wanted me
to make. He wanted me to move the ARRAY_NOTATION_REF node from tree.def
to c-family/c-common.def.  Here is a new one that has this change. I am sorry 
for
this.

Here are ChangeLog entries:

gcc/ChangeLog
2012-09-26  Balaji V. Iyer  balaji.v.i...@intel.com

* tree.h (array_notation_reduce_type): Added new enumerator.

This should be moved to c-tree.h then, and ...

* Makefile.in (OBJS): Added array-notation-common.o.
* doc/passes.texi (Cilk Plus Transformation): Documented array
notation and overall transformations for Cilk Plus.
* doc/invoke.texi (C Dialect Options): Documented -fcilkplus flag.
* doc/generic.texi (Storage References): Documented
ARRAY_NOTATION_REF
tree addition.
* tree-pretty-pretty.c (dump_generic_node): Added ARRAY_NOTATION_REF
case.

... this to c-pretty-print.c and

* array-notation-common.c: New file.

... this to the c-common/ directory.

Basically this should be a completely frontend-only patch.

Richard.

gcc/c-family/ChangeLog
2012-09-26  Balaji V. Iyer  balaji.v.i...@intel.com

* c-common.h (build_array_notation_expr): New function declaration.
(ARRAY_NOTATION_ARRAY): Added new #define.
(ARRAY_NOTATION_CHECK): Likewise.
(ARRAY_NOTATION_START): Likewise.
(ARRAY_NOTATION_LENGTH): Likewise.
(ARRAY_NOTATION_STRIDE): Likewise.
(ARRAY_NOTATION_TYPE): Likewise.
* c-common.def: Added new tree ARRAY_NOTATION_REF.
* c-common.c (c_define_builtins): Added a call to initialize array
notation builtin functions.
(c_common_init_ts): Set ARRAY_NOTATION_REF as typed.
* c.opt (-fcilkplus): Define new command line switch.

gcc/c/ChangeLog
2012-09-26  Balaji V. Iyer  balaji.v.i...@intel.com

* c-typeck.c (convert_arguments): Added a check if tree contains
array notation expressions before throwing errors or doing anything.
* Make-lang.in (C_AND_OBJC_OBJS): Added c-array-notation.o.
* c-parser.c (c_parser_compound_statement): Check if array notation 
 code
is used in tree, if so, then transform them into appropriate C code.
(c_parser_expr_no_commas): Check if array notation is used in LHS or
RHS, if so, then build array notation expression instead of regular
modify.
(c_parser_postfix_expression_after_primary): Added a check for 
 colon(s)
after square braces, if so then handle it like an array notation.  
 Also,
break up array notations in unary op if found.
(c_parser_array_notation): New function.
* c-array-notation.c: New file.

gcc/testsuite/ChangeLog
2012-09-26  Balaji V. Iyer  balaji.v.i...@intel.com

* gcc.dg/cilk-plus/array_notation/execute/execute.exp: New script.
* gcc.dg/cilk-plus/array_notation/compile/compile.exp: Likewise.
* gcc.dg/cilk-plus/array_notation/errors/errors.exp: Likewise.
* gcc.dg/cilk-plus/array_notation/execute/sec_implicit_ex.c: New test.
* gcc.dg/cilk-plus/array_notation/execute/if_test.c: Likewise.
* gcc.dg/cilk-plus/array_notation/execute/gather_scatter.c: Likewise.
* gcc.dg/cilk-plus/array_notation/execute/builtin_func_double2.c:
Likewise.
* gcc.dg/cilk-plus/array_notation/execute/builtin_func_double.c:
Likewise.
* gcc.dg/cilk-plus/array_notation/execute/builtin_fn_custom.c: 
 Likewise.
* gcc.dg/cilk-plus/array_notation/execute/builtin_fn_mutating.c:
Likewise.
* gcc.dg/cilk-plus/array_notation/execute/array_test_ND.c: Likewise.
* gcc.dg/cilk-plus/array_notation/execute/array_test2.c: Likewise.
* gcc.dg/cilk-plus/array_notation/execute/array_test1.c: Likewise.
* gcc.dg/cilk-plus/array_notation/compile/sec_implicit_ex.c: Likewise.
* gcc.dg/cilk-plus/array_notation/compile/gather_scatter.c: Likewise.
* gcc.dg/cilk-plus/array_notation/compile/builtin_func_double2.c:
Likewise.
* gcc.dg/cilk-plus/array_notation/compile/array_test_ND.c: Likewise.
* gcc.dg/cilk-plus/array_notation/compile/if_test.c: Likewise.
* gcc.dg/cilk-plus/array_notation/compile/builtin_func_double.c:
Likewise.
* gcc.dg/cilk-plus/array_notation/compile/array_test1.c

Re: Fix -fdump-ada-spec

2012-10-04 Thread Richard Guenther
On Thu, Oct 4, 2012 at 10:26 AM, Arnaud Charlet char...@adacore.com wrote:
 After changes by Sharad (Add option for dumping to stderr (issue6190057)),
 -fdump-ada-spec is broken, and is now a no-op.

 Admittedly, this is because -fdump-ada-spec is handled differently from
 other -fdump-* switches, so this patch fixes support for -fdump-ada-spec
 by using an approach similar to -fdump-go-spec, and use regular switches
 via c/c.opt. I've removed the handling of TDF_RAW, which was a debugging
 option, and never really used, so can be simply deleted.

 Change is mostly trivial/mechanical.

 Tested on x86_64-pc-linux-gnu, OK for trunk?

Much cleaner indeed.

Ok,
Thanks,
Richard.

 gcc/

 2012-10-04  Arnaud Charlet  char...@adacore.com

 * dumpfile.h, dumpfile.c: Remove TDI_ada.

 c-family/

 2012-10-04  Arnaud Charlet  char...@adacore.com

 * c-ada-spec.c (print_ada_declaration): Remove handling of TDF_RAW.
 * c.opt (-fdump-ada-spec, -fdump-ada-spec-slim): Move switch 
 definition
 out of dumpfile.h.

 c/

 2012-10-04  Arnaud Charlet  char...@adacore.com

 * c-decl.c (c_write_global_declarations): Fix handling of
 -fdump-ada-spec*.

 cp/

 2012-10-04  Arnaud Charlet  char...@adacore.com

 * decl2.c (cp_write_global_declarations): Fix handling of
 -fdump-ada-spec*.

 Arno
 --
 Index: c-family/c.opt
 ===
 --- c-family/c.opt  (revision 192062)
 +++ c-family/c.opt  (working copy)
 @@ -799,6 +799,14 @@ fdollars-in-identifiers
  C ObjC C++ ObjC++
  Permit '$' as an identifier character

 +fdump-ada-spec
 +C ObjC C++ ObjC++ RejectNegative Var(flag_dump_ada_spec)
 +Write all declarations as Ada code transitively
 +
 +fdump-ada-spec-slim
 +C ObjC C++ ObjC++ RejectNegative Var(flag_dump_ada_spec_slim)
 +Write all declarations as Ada code for the given file only
 +
  felide-constructors
  C++ ObjC++ Var(flag_elide_constructors) Init(1)

 Index: c-family/c-ada-spec.c
 ===
 --- c-family/c-ada-spec.c   (revision 192062)
 +++ c-family/c-ada-spec.c   (working copy)
 @@ -2535,7 +2535,6 @@ print_ada_declaration (pretty_printer *b
int is_class = false;
tree name = TYPE_NAME (TREE_TYPE (t));
tree decl_name = DECL_NAME (t);
 -  bool dump_internal = get_dump_file_info (TDI_ada)-pflags  TDF_RAW;
tree orig = NULL_TREE;

if (cpp_check  cpp_check (t, IS_TEMPLATE))
 @@ -2705,8 +2704,7 @@ print_ada_declaration (pretty_printer *b
  }
else
  {
 -  if (!dump_internal
 -  TREE_CODE (t) == VAR_DECL
 +  if (TREE_CODE (t) == VAR_DECL
decl_name
*IDENTIFIER_POINTER (decl_name) == '_')
 return 0;
 @@ -2796,8 +2794,7 @@ print_ada_declaration (pretty_printer *b

/* If this function has an entry in the dispatch table, we cannot
  omit it.  */
 -  if (!dump_internal  !DECL_VINDEX (t)
 -  *IDENTIFIER_POINTER (decl_name) == '_')
 +  if (!DECL_VINDEX (t)  *IDENTIFIER_POINTER (decl_name) == '_')
 {
   if (IDENTIFIER_POINTER (decl_name)[1] == '_')
 return 0;
 Index: c/c-decl.c
 ===
 --- c/c-decl.c  (revision 192062)
 +++ c/c-decl.c  (working copy)
 @@ -10079,10 +10079,10 @@ c_write_global_declarations (void)
gcc_assert (!current_scope);

/* Handle -fdump-ada-spec[-slim]. */
 -  if (dump_initialized_p (TDI_ada))
 +  if (flag_dump_ada_spec || flag_dump_ada_spec_slim)
  {
/* Build a table of files to generate specs for */
 -  if (get_dump_file_info (TDI_ada)-pflags  TDF_SLIM)
 +  if (flag_dump_ada_spec_slim)
 collect_source_ref (main_input_filename);
else
 for_each_global_decl (collect_source_ref_cb);
 Index: cp/decl2.c
 ===
 --- cp/decl2.c  (revision 192062)
 +++ cp/decl2.c  (working copy)
 @@ -3698,9 +3698,9 @@ cp_write_global_declarations (void)
cgraph_process_same_body_aliases ();

/* Handle -fdump-ada-spec[-slim] */
 -  if (dump_initialized_p (TDI_ada))
 +  if (flag_dump_ada_spec || flag_dump_ada_spec_slim)
  {
 -  if (get_dump_file_info (TDI_ada)-pflags  TDF_SLIM)
 +  if (flag_dump_ada_spec_slim)
 collect_source_ref (main_input_filename);
else
 collect_source_refs (global_namespace);
 Index: dumpfile.c
 ===
 --- dumpfile.c  (revision 192062)
 +++ dumpfile.c  (working copy)
 @@ -57,8 +57,7 @@ static struct dump_file_info dump_files[
 0, 0, 0, 5},
{.vcg, tree-vcg, NULL, NULL, NULL, NULL, NULL, TDF_TREE,
 0, 0, 0, 6},
 -  {.ads, ada-spec, NULL, NULL, NULL, NULL, NULL, 0, 0, 0, 0, 7},
 -#define FIRST_AUTO_NUMBERED_DUMP 8
 +#define FIRST_AUTO_NUMBERED_DUMP 7

{NULL, tree-all, NULL, NULL, NULL, NULL, NULL, TDF_TREE,

Re: [patch][lra] Improve initial program point density in lra-lives.c (was: Re: RFC: LRA for x86/x86-64 [7/9])

2012-10-04 Thread Richard Guenther
On Thu, Oct 4, 2012 at 11:43 AM, Steven Bosscher stevenb@gmail.com wrote:
 On Wed, Oct 3, 2012 at 5:35 PM, Steven Bosscher stevenb@gmail.com wrote:
 The worst result is this:
 Compressing live ranges: from 726174 to 64496 - 8%, pre_count 40476128, 
 post_count 12483414

 But that's still a lot better than before the patch for the same function:
 Compressing live ranges: from 1742569 to 73069 - 4%, pre_count 40842330, 
 post_count 12479992

 Walking basic blocks with FOR_EACH_BB_REVERSE gives:

 Only FOR_EACH_BB_REVERSE:
 Compressing live ranges: from 1742579 to 429746 - 24% pre_count
 41106212, post_count 34376494
 Compressing live ranges: from 1742569 to 63000 - 3% pre_count
 40835340, post_count 11055747

 FOR_EACH_BB_REVERSE + need_curr_point_incr:
 Compressing live ranges: from 726184 to 416529 - 57% pre_count
 40743516, post_count 34376846
 Compressing live ranges: from 726174 to 61840 - 8% pre_count 40472806,
 post_count 11055747

 The combination of the two changes takes ~20s off the ~180s for LRA
 create live ranges.

Isn't _REVERSE vs. non-_REVERSE still a kind-of random order?  Thus, doesn't
the above show there exists an optimal order for processing which we could use?
(I realize _REVERSE is a simple solution, but might there not exist a
pathological
case where _REVERSE is even worse than non-_REVERSE?)

Richard.

 Ciao!
 Steven


Re: patch to fix

2012-10-04 Thread Richard Guenther
On Wed, Oct 3, 2012 at 7:15 PM, Kenneth Zadeck zad...@naturalbridge.com wrote:
 The enclosed patch is the third of at least four patches that fix the
 problems associated with supporting integers on the target that are
 wider than two HOST_WIDE_INTs.

 While GCC claims to support OI mode, and we have two public ports that
 make minor use of this mode, in practice, compilation that uses OImode
 mode commonly gives the wrong result or ices.  We have a private port
 of GCC for an architecture that is further down the road to needing
 comprehensive OImode and we have discovered that this is unusable. We
 have decided to fix it in a general way that so that it is most
 beneficial to the GCC community.  It is our belief that we are just a
 little ahead of the X86 and the NEON and these patches will shortly be
 essential.

 The first two of these patches were primarily lexigraphical and have
 already been committed.They transformed the uses of CONST_DOUBLE
 so that it is easy to tell what the intended usage is.

 The underlying structures in the next two patches are very general:
 once they are added to the compiler, the compiler will be able to
 support targets with any size of integer from hosts of any size
 integer.

 The patch enclosed deals with the portable RTL parts of the compiler.
 The next patch, which is currently under construction deals with the
 tree level.  However, this patch can be put on the trunk as is, and it
 will eleviate many, but not all of the current limitations in the rtl
 parts of the compiler.

 Some of the patch is conditional, depending on a port defining the
 symbol 'TARGET_SUPPORTS_WIDE_INT' to be non zero.  Defining this
 symbol to be non zero is declaring that the port has been converted to
 use the new form or integer constants.  However, the patch is
 completely backwards compatible to allow ports that do not need this
 immediately to convert at their leasure.  The conversion process is
 not difficult, but it does require some knowledge of the port, so we
 are not volinteering to do this for all ports.

 OVERVIEW OF THE PATCH:

 The patch defines a new datatype, a 'wide_int' (defined in
 wide-int.[ch], and this datatype will be used to perform all of the
 integer constant math in the compiler.  Externally, wide-int is very
 similar to double-int except that it does not have the limitation that
 math must be done on exactly two HOST_WIDE_INTs.

 Internally, a wide_int is a structure that contains a fixed sized
 array of HOST_WIDE_INTs, a length field and a mode.  The size of the

That it has a mode sounds odd to me and makes it subtly different
from HOST_WIDE_INT and double-int.  Maybe the patch will tell
why this is so.

 array is determined at generation time by dividing the number of bits
 of the largest integer supported on the target by the number of bits
 in a HOST_WIDE_INT of the host.  Thus, with this format, any length of
 integer can be supported on any host.

 A new rtx type is created, the CONST_WIDE_INT, which contains a
 garbage collected array of HOST_WIDE_INTS that is large enough to hold
 the constant.  For the targets that define TARGET_SUPPORTS_WIDE_INT to
 be non zero, CONST_DOUBLES are only used to hold floating point
 values.  If the target leaves TARGET_SUPPORTS_WIDE_INT defined as 0,
 CONST_WIDE_INTs are not used and CONST_DOUBLEs are as they were
 before.

 CONST_INT does not change except that it is defined to hold all
 constants that fit in exactly one HOST_WIDE_INT.  Note that is slightly
 different than the current trunk.  Before this patch, the TImode
 constant '5' could either be in a CONST_INT or CONST_DOUBLE depending
 on which code path was used to create it.  This patch changes this so
 that if the constant fits in a CONST_INT then it is represented in a
 CONST_INT no matter how it is created.

 For the array inside a CONST_WIDE_INT, and internally in wide-int, we
 use a compressed form for integers that need more than one
 HOST_WIDE_INT.  Higher elements of the array are not needed if they
 are just a sign extension of the elements below them.  This does not
 imply that constants are signed or are sign extended, this is only a
 compression technique.

 While it might seem to be more esthetically pleasing to have not
 introduced the CONST_WIDE_INT and to have changed the representation
 of the CONST_INT to accomodate larger numbers, this would have both
 used more space and would be a time consuming change for the port
 maintainers.  We believe that most ports can be quickly converted with
 the current scheme because there is just not a lot of code in the back
 ends that cares about large constants.  Furthermore, the CONST_INT is
 very space efficient and even in a program that was heavy in large
 values, most constants would still fit in a CONST_INT.

 All of the parts of the rtl level that deal with CONST_DOUBLE as an
 now conditionally work with CONST_WIDE_INTs depending on the value
 of TARGET_SUPPORTS_WIDE_INT.  We believe that this 

Re: [Patch] Fix PR53397

2012-10-04 Thread Richard Guenther
On Tue, Oct 2, 2012 at 6:40 PM, Kumar, Venkataramanan
venkataramanan.ku...@amd.com wrote:
 Hi Richi,

 (Snip)
 + (!cst_and_fits_in_hwi (step))
 +{
 +  if( loop-inner != NULL)
 +{
 +  if (dump_file  (dump_flags  TDF_DETAILS))
 +{
 +  fprintf (dump_file, Reference %p:\n, (void *) ref);
 +  fprintf (dump_file, (base  );
 +  print_generic_expr (dump_file, base, TDF_SLIM);
 +  fprintf (dump_file, , step );
 +  print_generic_expr (dump_file, step, TDF_TREE);
 +  fprintf (dump_file, )\n);

 No need to repeat this - all references are dumped when we gather them.
 (Snip)

 The dumping happens at record_ref which is called after these statements to 
 record these references.

 When the step is invariant  we return from the function without recording the 
 references.

  so I thought of dumping the references here.

 Is there a cleaner way to dump the references at one place?

Yes, call dump_mem_ref then, instead of repeating parts of its body.
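I.e. factor the printing so it can be shared - something like the sketch
below (dump_mem_ref needs a struct mem_ref, which doesn't exist yet at
that point, so part of its body would have to move into a helper; the
helper name and signature here are hypothetical):

static void
dump_mem_details (FILE *file, tree base, tree step)
{
  fprintf (file, "(base ");
  print_generic_expr (file, base, TDF_SLIM);
  fprintf (file, ", step ");
  print_generic_expr (file, step, TDF_SLIM);
  fprintf (file, ")\n");
}

/* ... and in gather_memory_references_ref, at the early return for a
   loop-variant step:

     if (dump_file && (dump_flags & TDF_DETAILS))
       {
         fprintf (dump_file, "Not prefetching, step is loop variant:\n");
         dump_mem_details (dump_file, base, step);
       }
*/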

Richard.

 Regards,
 Venkat.



 -Original Message-
 From: Richard Guenther [mailto:rguent...@suse.de]
 Sent: Tuesday, October 02, 2012 5:42 PM
 To: Kumar, Venkataramanan
 Cc: gcc-patches@gcc.gnu.org
 Subject: Re: [Patch] Fix PR53397

 On Mon, 1 Oct 2012, venkataramanan.ku...@amd.com wrote:

 Hi,

 The below patch fixes the FFT/Scimark regression caused by useless
 prefetch generation.

 This fix tries to make prefetch less aggressive by prefetching arrays
 in the inner loop, when the step is invariant in the entire loop nest.

 GCC currently tries to prefetch invariant steps when they are in the
 inner loop. But does not check if the step is variant in outer loops.

 In the scimark FFT case, the trip count of the inner loop varies by a
 non constant step, which is invariant in the inner loop.
 But the step variable is varying in outer loop. This makes inner loop
 trip count small (at run time varies sometimes as small as 1
 iteration)

 Prefetching ahead x iteration when the inner loop trip count is
 smaller than x leads to useless prefetches.

 Flag used: -O3 -march=amdfam10

 Before
 **  **
 ** SciMark2 Numeric Benchmark, see http://math.nist.gov/scimark **
 ** for details. (Results can be submitted to p...@nist.gov) **
 **  **
 Using   2.00 seconds min time per kenel.
 Composite Score:  550.50
 FFT Mflops:38.66(N=1024)
 SOR Mflops:   617.61(100 x 100)
 MonteCarlo: Mflops:   173.74
 Sparse matmult  Mflops:   675.63(N=1000, nz=5000)
 LU  Mflops:  1246.88(M=100, N=100)


 After
 **  **
 ** SciMark2 Numeric Benchmark, see http://math.nist.gov/scimark **
 ** for details. (Results can be submitted to p...@nist.gov) **
 **  **
 Using   2.00 seconds min time per kenel.
 Composite Score:  639.20
 FFT Mflops:   479.19(N=1024)
 SOR Mflops:   617.61(100 x 100)
 MonteCarlo: Mflops:   173.18
 Sparse matmult  Mflops:   679.13(N=1000, nz=5000)
 LU  Mflops:  1246.88(M=100, N=100)

 GCC regression make check -k passes with x86_64-unknown-linux-gnu
 New tests that PASS:

 gcc.dg/pr53397-1.c scan-assembler prefetcht0 gcc.dg/pr53397-1.c
 scan-tree-dump aprefetch Issued prefetch
 gcc.dg/pr53397-1.c (test for excess errors) gcc.dg/pr53397-2.c
 scan-tree-dump aprefetch loop variant step
 gcc.dg/pr53397-2.c scan-tree-dump aprefetch Not prefetching
 gcc.dg/pr53397-2.c (test for excess errors)


 Checked CPU2006 and polyhedron on latest AMD processor, no regressions noted.

 Ok to commit in trunk?

 regards,
 Venkat

 gcc/ChangeLog
 +2012-10-01  Venkataramanan Kumar  venkataramanan.ku...@amd.com
 +
 +   * tree-ssa-loop-prefetch.c (gather_memory_references_ref):$
 +   Perform non constant step prefetching in inner loop, only $
 +   when it is invariant in the entire loop nest.  $
 +   * testsuite/gcc.dg/pr53397-1.c: New test case $
 +   Checks we are prefecthing for loop invariant steps$
 +   * testsuite/gcc.dg/pr53397-2.c: New test case$
 +   Checks we are not prefecthing for loop variant steps
 +


 Index: gcc/testsuite/gcc.dg/pr53397-1.c
 ===
 --- gcc/testsuite/gcc.dg/pr53397-1.c  (revision 0)
 +++ gcc/testsuite/gcc.dg/pr53397-1.c  (revision 0)
 @@ -0,0 +1,28 @@
 +/* Prefetching when the step is loop invariant.  */
 +
 +/* { dg-do compile } */
 +/* { dg-options -O3 -fprefetch-loop-arrays
 +-fdump-tree-aprefetch-details --param min-insn-to-prefetch-ratio=3
 +--param simultaneous-prefetches=10 -fdump-tree-aprefetch-details }
 +*/
 +
 +
 +double data[16384];
 +void prefetch_when_non_constant_step_is_invariant(int step, int n) {
 + int

Re: patch to fix constant math

2012-10-04 Thread Richard Guenther
 this last night.   it is an artifact of the way
 i produced the patch and responsible people have been sacked.   However,
 it shows that you read the patch carefully, and i really appreciate that.
 i owe you a beer (not that you need another at this time of year).

You also didn't mention the missing tree bits ... was this just a 1/n patch
or is it at all usable for you in this state?  Where do the large integers
magically come from?

Richard.

 Kenny



 On 10/04/2012 08:48 AM, Richard Guenther wrote:

 On Wed, Oct 3, 2012 at 7:15 PM, Kenneth Zadeck zad...@naturalbridge.com
 wrote:

 The enclosed patch is the third of at least four patches that fix the
 problems associated with supporting integers on the target that are
 wider than two HOST_WIDE_INTs.

 While GCC claims to support OI mode, and we have two public ports that
 make minor use of this mode, in practice, compilation that uses OImode
 mode commonly gives the wrong result or ices.  We have a private port
 of GCC for an architecture that is further down the road to needing
 comprehensive OImode and we have discovered that this is unusable. We
 have decided to fix it in a general way that so that it is most
 beneficial to the GCC community.  It is our belief that we are just a
 little ahead of the X86 and the NEON and these patches will shortly be
 essential.

 The first two of these patches were primarily lexigraphical and have
 already been committed.They transformed the uses of CONST_DOUBLE
 so that it is easy to tell what the intended usage is.

 The underlying structures in the next two patches are very general:
 once they are added to the compiler, the compiler will be able to
 support targets with any size of integer from hosts of any size
 integer.

 The patch enclosed deals with the portable RTL parts of the compiler.
 The next patch, which is currently under construction deals with the
 tree level.  However, this patch can be put on the trunk as is, and it
 will eleviate many, but not all of the current limitations in the rtl
 parts of the compiler.

 Some of the patch is conditional, depending on a port defining the
 symbol 'TARGET_SUPPORTS_WIDE_INT' to be non zero.  Defining this
 symbol to be non zero is declaring that the port has been converted to
 use the new form or integer constants.  However, the patch is
 completely backwards compatible to allow ports that do not need this
 immediately to convert at their leasure.  The conversion process is
 not difficult, but it does require some knowledge of the port, so we
 are not volinteering to do this for all ports.

 OVERVIEW OF THE PATCH:

 The patch defines a new datatype, a 'wide_int' (defined in
 wide-int.[ch], and this datatype will be used to perform all of the
 integer constant math in the compiler.  Externally, wide-int is very
 similar to double-int except that it does not have the limitation that
 math must be done on exactly two HOST_WIDE_INTs.

 Internally, a wide_int is a structure that contains a fixed sized
 array of HOST_WIDE_INTs, a length field and a mode.  The size of the

 That it has a mode sounds odd to me and makes it subtly different
 from HOST_WIDE_INT and double-int.  Maybe the patch will tell
 why this is so.

 array is determined at generation time by dividing the number of bits
 of the largest integer supported on the target by the number of bits
 in a HOST_WIDE_INT of the host.  Thus, with this format, any length of
 integer can be supported on any host.

 A new rtx type is created, the CONST_WIDE_INT, which contains a
 garbage collected array of HOST_WIDE_INTS that is large enough to hold
 the constant.  For the targets that define TARGET_SUPPORTS_WIDE_INT to
 be non zero, CONST_DOUBLES are only used to hold floating point
 values.  If the target leaves TARGET_SUPPORTS_WIDE_INT defined as 0,
 CONST_WIDE_INTs are not used and CONST_DOUBLEs are as they were
 before.

 CONST_INT does not change except that it is defined to hold all
 constants that fit in exactly one HOST_WIDE_INT.  Note that is slightly
 different than the current trunk.  Before this patch, the TImode
 constant '5' could either be in a CONST_INT or CONST_DOUBLE depending
 on which code path was used to create it.  This patch changes this so
 that if the constant fits in a CONST_INT then it is represented in a
 CONST_INT no matter how it is created.

 For the array inside a CONST_WIDE_INT, and internally in wide-int, we
 use a compressed form for integers that need more than one
 HOST_WIDE_INT.  Higher elements of the array are not needed if they
 are just a sign extension of the elements below them.  This does not
 imply that constants are signed or are sign extended, this is only a
 compression technique.

 While it might seem to be more esthetically pleasing to have not
 introduced the CONST_WIDE_INT and to have changed the representation
 of the CONST_INT to accomodate larger numbers, this would have both
 used more space and would be a time consuming change for the port
 maintainers.  We

Re: [lra] patch to solve most scalability problems for LRA

2012-10-04 Thread Richard Guenther
On Thu, Oct 4, 2012 at 7:07 PM, Steven Bosscher stevenb@gmail.com wrote:
 On Thu, Oct 4, 2012 at 5:37 PM, Vladimir Makarov vmaka...@redhat.com wrote:
   The only issue now is PR54146 compilation time for IRA+LRA although it
 was improved significantly.  I will continue work on PR54146.  But now I
 am going to focus on proposals from reviews.

 Right, there still are opportunities to improve things.

 (The real solution may be to stop SRA from creating so many
 simultaneously live pseudos in the first place...)

 +  lra_simple_p
 += (ira_use_lra_p  max_reg_num () = (1  26) / last_basic_block);

 I think you should use n_basic_blocks here instead of
 last_basic_block, in case this runs without compacting the cfg first
 (n_basic_blocks is the real number of basic blocks in the cfg,
 last_basic_block is the highest index, so last_basic_block =
 n_basic_blocks).

I also noticed that switching to IRA_REGION_ONE improves things
when we have a large number of loops (the profile points to some
loop code in IRA).  Note that the magic number above should
be a new --param, and once we have a diagnostic flag that
shows whenever we back off like this it should notify the
user of that fact (and the params we have overflowed) - this
just reminded me of that idea from somebody else ;)
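Roughly like this (a sketch only - the param name, default and warning
wording are invented here):

/* params.def - replacing the (1 << 26) magic number.  */
DEFPARAM (PARAM_LRA_SIMPLE_THRESHOLD,
	  "lra-simple-threshold",
	  "Fall back to the simplified LRA algorithm when the number of "
	  "pseudos multiplied by the number of basic blocks exceeds this",
	  1 << 26, 0, 0)

/* ira.c */
lra_simple_p
  = (ira_use_lra_p
     && max_reg_num ()
	>= PARAM_VALUE (PARAM_LRA_SIMPLE_THRESHOLD) / n_basic_blocks);
if (lra_simple_p && warn_disabled_optimization)
  warning (OPT_Wdisabled_optimization,
	   "using simplified register allocation; adjust via "
	   "--param lra-simple-threshold");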

 Thanks for working on this!

Indeed ;)

It, btw, also applies to IRA + reload ...

Richard.

 Ciao!
 Steven


Re: PATCH trunk: gengtype honoring mark_hook-s inside struct inside union-s

2012-10-04 Thread Richard Guenther
On Thu, Oct 4, 2012 at 7:24 PM, Basile Starynkevitch
bas...@starynkevitch.net wrote:
 On Thu, Oct 04, 2012 at 06:51:35PM +0300, Laurynas Biveinis wrote:
  2012-10-03  Basile Starynkevitch  bas...@starynkevitch.net
 
  * gengtype.c (walk_type): Emit mark_hook when inside a
struct of a union member.

 This is OK.

 thanks, Committed revision 192092 to trunk.


 I believe this patch should be backported into GCC 4.7 and 4.6

I see no reason for this unless it is a regression.

Richard.

 Regards.
 --
 Basile STARYNKEVITCH http://starynkevitch.net/Basile/
 email: basileatstarynkevitchdotnet mobile: +33 6 8501 2359
 8, rue de la Faiencerie, 92340 Bourg La Reine, France
 *** opinions {are only mines, sont seulement les miennes} ***


Re: [PATCH] Teach VRP to handle if ((unsigned_narrowing_cast) x != 0) similarly to if ((x 0xffff) != 0) (PR tree-optimization/54810)

2012-10-04 Thread Richard Guenther
On Thu, Oct 4, 2012 at 6:31 PM, Jakub Jelinek ja...@redhat.com wrote:
 Hi!

 This patch handles unsigned narrowing casts the same as
 BIT_AND_EXPR with the unsigned narrow type's max value.

 Bootstrapped/regtested on x86_64-linux and i686-linux,
 ok for trunk?

Ok.

Thanks,
Richard.

 2012-10-04  Jakub Jelinek  ja...@redhat.com

 PR tree-optimization/54810
 * tree-vrp.c (register_edge_assert_for_2): Handle
 NAME = (unsigned) NAME2; if (NAME cmp CST) for
 narrowing casts to unsigned integral type like
 NAME = NAME2  CST2; if (NAME cmp CST) where CST2
 is the max value of the unsigned integral type.

 --- gcc/tree-vrp.c.jj   2012-09-25 14:45:48.0 +0200
 +++ gcc/tree-vrp.c  2012-10-04 11:43:32.334988401 +0200
 @@ -4712,6 +4712,11 @@ register_edge_assert_for_2 (tree name, e
tree val2 = NULL_TREE;
double_int mask = double_int_zero;
unsigned int prec = TYPE_PRECISION (TREE_TYPE (val));
 +  unsigned int nprec = prec;
 +  enum tree_code rhs_code = ERROR_MARK;
 +
 +  if (is_gimple_assign (def_stmt))
 +   rhs_code = gimple_assign_rhs_code (def_stmt);

/* Add asserts for NAME cmp CST and NAME being defined
  as NAME = (int) NAME2.  */
 @@ -4721,7 +4726,7 @@ register_edge_assert_for_2 (tree name, e
gimple_assign_cast_p (def_stmt))
 {
   name2 = gimple_assign_rhs1 (def_stmt);
 - if (CONVERT_EXPR_CODE_P (gimple_assign_rhs_code (def_stmt))
 + if (CONVERT_EXPR_CODE_P (rhs_code)
INTEGRAL_TYPE_P (TREE_TYPE (name2))
TYPE_UNSIGNED (TREE_TYPE (name2))
prec == TYPE_PRECISION (TREE_TYPE (name2))
 @@ -4767,8 +4772,7 @@ register_edge_assert_for_2 (tree name, e
  NAME = NAME2  CST2.

  Extract CST2 from the right shift.  */
 -  if (is_gimple_assign (def_stmt)
 -  gimple_assign_rhs_code (def_stmt) == RSHIFT_EXPR)
 +  if (rhs_code == RSHIFT_EXPR)
 {
   name2 = gimple_assign_rhs1 (def_stmt);
   cst2 = gimple_assign_rhs2 (def_stmt);
 @@ -4840,21 +4844,37 @@ register_edge_assert_for_2 (tree name, e
/* Add asserts for NAME cmp CST and NAME being defined as
  NAME = NAME2  CST2.

 -Extract CST2 from the and.  */
 +Extract CST2 from the and.
 +
 +Also handle
 +NAME = (unsigned) NAME2;
 +casts where NAME's type is unsigned and has smaller precision
 +than NAME2's type as if it was NAME = NAME2  MASK.  */
names[0] = NULL_TREE;
names[1] = NULL_TREE;
cst2 = NULL_TREE;
 -  if (is_gimple_assign (def_stmt)
 -  gimple_assign_rhs_code (def_stmt) == BIT_AND_EXPR)
 +  if (rhs_code == BIT_AND_EXPR
 + || (CONVERT_EXPR_CODE_P (rhs_code)
 +  TREE_CODE (TREE_TYPE (val)) == INTEGER_TYPE
 +  TYPE_UNSIGNED (TREE_TYPE (val))
 +  TYPE_PRECISION (TREE_TYPE (gimple_assign_rhs1 (def_stmt)))
 + prec
 +  !retval))
 {
   name2 = gimple_assign_rhs1 (def_stmt);
 - cst2 = gimple_assign_rhs2 (def_stmt);
 + if (rhs_code == BIT_AND_EXPR)
 +   cst2 = gimple_assign_rhs2 (def_stmt);
 + else
 +   {
 + cst2 = TYPE_MAX_VALUE (TREE_TYPE (val));
 + nprec = TYPE_PRECISION (TREE_TYPE (name2));
 +   }
   if (TREE_CODE (name2) == SSA_NAME
INTEGRAL_TYPE_P (TREE_TYPE (name2))
TREE_CODE (cst2) == INTEGER_CST
!integer_zerop (cst2)
 -  prec = HOST_BITS_PER_DOUBLE_INT
 -  (prec  1
 +  nprec = HOST_BITS_PER_DOUBLE_INT
 +  (nprec  1
   || TYPE_UNSIGNED (TREE_TYPE (val
 {
   gimple def_stmt2 = SSA_NAME_DEF_STMT (name2);
 @@ -4881,12 +4901,12 @@ register_edge_assert_for_2 (tree name, e
   bool valid_p = false, valn = false, cst2n = false;
   enum tree_code ccode = comp_code;

 - valv = tree_to_double_int (val).zext (prec);
 - cst2v = tree_to_double_int (cst2).zext (prec);
 + valv = tree_to_double_int (val).zext (nprec);
 + cst2v = tree_to_double_int (cst2).zext (nprec);
   if (!TYPE_UNSIGNED (TREE_TYPE (val)))
 {
 - valn = valv.sext (prec).is_negative ();
 - cst2n = cst2v.sext (prec).is_negative ();
 + valn = valv.sext (nprec).is_negative ();
 + cst2n = cst2v.sext (nprec).is_negative ();
 }
   /* If CST2 doesn't have most significant bit set,
  but VAL is negative, we have comparison like
 @@ -4894,7 +4914,7 @@ register_edge_assert_for_2 (tree name, e
   if (!cst2n  valn)
 ccode = ERROR_MARK;
   if (cst2n)
 -   sgnbit = double_int_one.llshift (prec - 1, prec).zext (prec);
 +   sgnbit = double_int_one.llshift (nprec - 1, nprec).zext (nprec);

Re: How much time left till phase 3?

2012-10-02 Thread Richard Guenther
On Tue, Oct 2, 2012 at 10:44 AM, Joern Rennecke
joern.renne...@embecosm.com wrote:
 I'll have to prepare a few more patches to (supposedly) generic
 code to support the ARCompact port, which we (Synopsys and Embecosm)
 would like contribute in time for gcc 4.8.

 How much time is left till we switch from phase 1 to phase 3?

I expect stage1 to close mid to end of October (by which point it will
have lasted for more than 7 months).

Richard.


Re: How much time left till phase 3?

2012-10-02 Thread Richard Guenther
On Tue, Oct 2, 2012 at 11:34 AM, Richard Guenther
richard.guent...@gmail.com wrote:
 On Tue, Oct 2, 2012 at 10:44 AM, Joern Rennecke
 joern.renne...@embecosm.com wrote:
 I'll have to prepare a few more patches to (supposedly) generic
 code to support the ARCompact port, which we (Synopsys and Embecosm)
 would like contribute in time for gcc 4.8.

 How much time is left till we switch from phase 1 to phase 3?

 I expect stage1 to close mid to end of October (after which it lasted
 for more than 7 months).

Btw, I realize that the aarch64 port probably also wants to merge, even though
I didn't see a merge proposal or know whether they have patches to
generic code.

Anybody else with things they want to merge during stage1?

Thanks,
Richard.

 Richard.


Re: [RFC] Make vectorizer to skip loops with small iteration estimate

2012-10-02 Thread Richard Guenther
On Mon, 1 Oct 2012, Jan Hubicka wrote:

   
So the unvectorized cost is
SIC * niters
   
The vectorized path is
SOC + VIC * ((niters-PL_ITERS-EP_ITERS)/VF) + VOC
The scalar path of vectorizer loop is
SIC * niters + SOC
  
  Note that 'th' is used for the runtime profitability check which is
  done at the time the setup cost has already been taken (yes, we
 
 Yes, I understand that.
  probably should make it more conservative but then guard the whole
  set of loops by the check, not only the vectorized path).
  See PR53355 for the general issue.
 
 Yep, we may reduce the cost of SOC by outputting early guard for 
 non-vectorized
 path better than we do now. However...
  Of course this is very simple benchmark, in reality the vectorizatoin 
   can be
  a lot more harmful by complicating more complex control flows.
  
  So I guess we have two options
   1) go with the new formula and try to make cost model a bit more 
   realistic.
   2) stay with original formula that is quite close to reality, but I 
   think
  more by an accident.
  
  I think we need to improve it as whole, thus I'd prefer 2).
 
 ... I do not see why.
 Even if we make the check cheaper we will only distribute part of SOC to 
 vector
 prologues/epilogues.
 
 Still I think the formula is wrong, I.e. accounting SOC where it should not.
 
 The cost of scalar path without vectorization is 
   niters * SIC
 while with vectorization we have scalar path
   niters * SIC + SOC
 and vector path
   SOC + VIC * ((niters-PL_ITERS-EP_ITERS)/VF) + VOC
 
 So SOC cancels out in the runtime check.
 I still think we need two formulas - one determining if vectorization is
 profitable, other specifying the threshold for scalar path at runtime (that
 will generally give lower values).

True, we want two values.  But part of the scalar path right now
is all the computation required for the alias and alignment runtime checks
(because of the way all the conditions are combined).

I'm not much into the details of what we account for in SOC (I suppose
it's everything we insert in the preheader of the vector loop).

+  if (vect_print_dump_info (REPORT_UNVECTORIZED_LOCATIONS))
+    fprintf (vect_dump, "not vectorized: estimated iteration count too small.");
+  if (vect_print_dump_info (REPORT_DETAILS))
+    fprintf (vect_dump, "not vectorized: estimated iteration count smaller than "
+	     "user specified loop bound parameter or minimum "
+	     "profitable iterations (whichever is more conservative).");

this won't work anymore btw - dumping infrastructure changed.
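With the new interface this would look roughly like (untested sketch):

  if (dump_enabled_p ())
    dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
		     "not vectorized: estimated iteration count smaller "
		     "than user specified loop bound parameter or minimum "
		     "profitable iterations (whichever is more "
		     "conservative).");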

I suppose your patch is a step in the right direction, but to really
make progress we need to re-organize the loop and predicate structure
produced by the vectorizer.

So, please update your patch, re-test and then it's ok.

   2) Even when loop iterates 2 times, it is estimated to 4 iterations by
  estimated_stmt_executions_int with the profile feedback.
  The reason is loop_ch pass.  Given a rolled loop with exit probability
  30%, proceeds by duplicating the header with original probabilities.
  This makes the loop to be executed with 60% probability.  Because the
  loop body counts remain the same (and they should), the expected number
  of iterations increase by the decrease of entry edge to the header.
   
  I wonder what to do about this.  Obviously without path profiling
  loop_ch can not really do a good job.  We can artifically make
  header to suceed more likely, that is the reality, but that requires
  non-trivial loop profile updating.
   
  We can also simply record the iteration bound into loop structure 
  and ignore that the profile is not realistic
  
  But we don't preserve loop structure from header copying ...
 
 From what time we keep loop structure? In general I would like to eventualy
 drop value histograms to loop structure specifying number of iterations with
 profile feedback.

We preserve it from the start of the tree loop optimizers (it's easy
to preserve it from earlier points as long as you don't cross inlining,
but to lower the impact of the change I placed it where it was enough
to prevent the excessive unrolling/peeling done by RTL).

  
  Finally we can duplicate loop headers before profilng.  I implemented
  that via early_ch pass executed only with profile generation or 
   feedback.
  I guess it makes sense to do, even if it breaks the assumption that
  we should do strictly -Os generation on paths where
  
  Well, there are CH cases that do not increase code size and I doubt
  that loop header copying is generally bad for -Os ... we are not
  good at handling non-copied loop headers.
 
 There is comment saying 
   /* Loop header copying usually increases size of the code.  This used not to
  be true, since quite often it is possible to verify that the condition is
  satisfied in the first 

Re: Convert more non-GTY htab_t to hash_table.

2012-10-02 Thread Richard Guenther
On Mon, 1 Oct 2012, Lawrence Crowl wrote:

 Change more non-GTY hash tables to use the new type-safe template hash table.
 Constify member function parameters that can be const.
 Correct a couple of expressions in formerly uninstantiated templates.
 
 The new code is 0.362% faster in bootstrap, with a 99.5% confidence of
 being faster.
 
 Tested on x86-64.
 
 Okay for trunk?

You are changing a hashtable used by fold checking, did you test
with fold checking enabled?

+/* Data structures used to maintain mapping between basic blocks and
+   copies.  */
+static hash_table <bb_copy_hasher> bb_original;
+static hash_table <bb_copy_hasher> bb_copy;

note that because hash_table has a constructor we now get global
CTORs for all statics :(  (and mx-protected local inits ...)
Can you please try to remove the constructor from hash_table to
avoid this overhead?  (as a followup - that is, don't initialize htab)
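The cost I mean, sketched on a stand-in class (not the real hash_table;
whether a given compiler promotes the trivial case to static
initialization varies, but in general a user-provided constructor forces
the dynamic-init path):

struct with_ctor
{
  void **entries;
  with_ctor () : entries (0) {}     /* user-provided constructor */
};

struct pod_like
{
  void **entries;                   /* no constructor */
  void create (unsigned n_slots) { entries = new void *[n_slots] (); }
};

/* Typically needs a dynamic initializer run at startup (a global
   constructor), even though all it does is store a null pointer.
   Function-local statics of with_ctor additionally get the guarded
   ("mx-protected") lazy initialization mentioned above.  */
static with_ctor table_a;

/* Zero-initialized in .bss, costs nothing until create () is called.  */
static pod_like table_b;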

The cfg.c, dse.c and hash-table.h parts are ok for trunk, I'll leave the 
rest to 
respective maintainers of the pieces of the compiler.

Thanks,
Richard.

 
 Index: gcc/java/ChangeLog
 
 2012-10-01  Lawrence Crowl  cr...@google.com
 
   * Make-lang.in (JAVA_OBJS): Add dependence on hash-table.o.
   (JCFDUMP_OBJS): Add dependence on hash-table.o.
   (jcf-io.o): Add dependence on hash-table.h.
   * jcf-io.c (memoized_class_lookups): Change to use type-safe hash table.
 
 Index: gcc/c/ChangeLog
 
 2012-10-01  Lawrence Crowl  cr...@google.com
 
   * Make-lang.in (c-decl.o): Add dependence on hash-table.h.
   * c-decl.c (detect_field_duplicates_hash): Change to new type-safe
   hash table.
 
 Index: gcc/objc/ChangeLog
 
 2012-10-01  Lawrence Crowl  cr...@google.com
 
   * Make-lang.in (OBJC_OBJS): Add dependence on hash-table.o.
   (objc-act.o): Add dependence on hash-table.h.
   * objc-act.c (objc_detect_field_duplicates): Change to new type-safe
   hash table.
 
 Index: gcc/ChangeLog
 
 2012-10-01  Lawrence Crowl  cr...@google.com
 
   * Makefile.in (fold-const.o): Add dependence on hash-table.h.
   (dse.o): Likewise.
   (cfg.o): Likewise.
   * fold-const.c (fold_checksum_tree): Change to new type-safe hash table.
   * (print_fold_checksum): Likewise.
   * cfg.c (var bb_original): Likewise.
   * (var bb_copy): Likewise.
   * (var loop_copy): Likewise.
   * hash-table.h (template hash_table): Constify parameters for find...
   and remove_elt... member functions.
 (hash_table::empty) Correct size expression.
 (hash_table::clear_slot) Correct deleted entry assignment.
   * dse.c (var rtx_group_table): Change to new type-safe hash table.
 
 Index: gcc/cp/ChangeLog
 
 2012-10-01  Lawrence Crowl  cr...@google.com
 
   * Make-lang.in (class.o): Add dependence on hash-table.h.
   (tree.o): Likewise.
   (semantics.o): Likewise.
   * class.c (fixed_type_or_null): Change to new type-safe hash table.
   * tree.c (verify_stmt_tree): Likewise.
   (verify_stmt_tree_r): Likewise.
   * semantics.c (struct nrv_data): Likewise.
 
 
 Index: gcc/java/Make-lang.in
 ===
 --- gcc/java/Make-lang.in (revision 191941)
 +++ gcc/java/Make-lang.in (working copy)
 @@ -83,10 +83,10 @@ JAVA_OBJS = java/class.o java/decl.o jav
java/zextract.o java/jcf-io.o java/win32-host.o java/jcf-parse.o
 java/mangle.o \
java/mangle_name.o java/builtins.o java/resource.o \
java/jcf-depend.o \
 -  java/jcf-path.o java/boehm.o java/java-gimplify.o
 +  java/jcf-path.o java/boehm.o java/java-gimplify.o hash-table.o
 
  JCFDUMP_OBJS = java/jcf-dump.o java/jcf-io.o java/jcf-depend.o
 java/jcf-path.o \
 - java/win32-host.o java/zextract.o ggc-none.o
 + java/win32-host.o java/zextract.o ggc-none.o hash-table.o
 
  JVGENMAIN_OBJS = java/jvgenmain.o java/mangle_name.o
 
 @@ -326,7 +326,7 @@ java/java-gimplify.o: java/java-gimplify
  # jcf-io.o needs $(ZLIBINC) added to cflags.
  CFLAGS-java/jcf-io.o += $(ZLIBINC)
  java/jcf-io.o: java/jcf-io.c $(CONFIG_H) $(SYSTEM_H) coretypes.h \
 -  $(JAVA_TREE_H) java/zipfile.h
 +  $(JAVA_TREE_H) java/zipfile.h $(HASH_TABLE_H)
 
  # jcf-path.o needs a -D.
  CFLAGS-java/jcf-path.o += \
 Index: gcc/java/jcf-io.c
 ===
 --- gcc/java/jcf-io.c (revision 191941)
 +++ gcc/java/jcf-io.c (working copy)
 @@ -31,7 +31,7 @@ The Free Software Foundation is independ
  #include jcf.h
  #include tree.h
  #include java-tree.h
 -#include hashtab.h
 +#include hash-table.h
  #include dirent.h
 
  #include zlib.h
 @@ -271,20 +271,34 @@ find_classfile (char *filename, JCF *jcf
return open_class (filename, jcf, fd, dep_name);
  }
 
 -/* Returns 1 if the CLASSNAME (really a char *) matches the name
 -   stored in TABLE_ENTRY (also a char *).  */
 
 -static int
 -memoized_class_lookup_eq (const void *table_entry, const void *classname)
 +/* Hash table 

Re: [PATCH] Fix powerpc breakage, was: Add option for dumping to stderr (issue6190057)

2012-10-02 Thread Richard Guenther
On Tue, Oct 2, 2012 at 1:11 AM, Xinliang David Li davi...@google.com wrote:
 On Mon, Oct 1, 2012 at 4:05 PM, Sharad Singhai sing...@google.com wrote:
 Thanks for tracking down and fixing the powerpc port.

 The dump_kind_p () check is redundant but canonical form here. I
 think blocks of dump code guarded by if dump_kind_p (...) might be
 easier to read/maintain.


 I find it confusing to be honest. The redundant check serves no purpose.

The check should be inlined and avoid the call to the diagnostic routine,
thus speeding up compile time.  We should use this pattern, especially
if it guards multiple calls.
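For example, a sketch of the intended pattern using the dumpfile.h interface
from the patch below (dump_kind_p / dump_printf_loc; dump_printf is assumed
to exist alongside them, and the dumped values are made up):

  if (dump_kind_p (MSG_NOTE))
    {
      dump_printf_loc (MSG_NOTE, vect_location,
                       "density %d%%, cost %d\n", density_pct,
                       vec_cost + not_vec_cost);
      dump_printf (MSG_NOTE, "penalizing loop body cost by %d%%\n",
                   DENSITY_PENALTY);
    }

The inlined dump_kind_p check is cheap when dumping is off, and one check
guards all the dump calls in the block.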

Richard.

 David

 Sharad
 Sharad


 On Mon, Oct 1, 2012 at 3:45 PM, Xinliang David Li davi...@google.com wrote:
 On Mon, Oct 1, 2012 at 2:37 PM, Michael Meissner
 meiss...@linux.vnet.ibm.com wrote:
 I tracked down some of the other code that previously used REPORT_DETAILS, 
 and
 MSG_NOTE is the new way to do the same thing.  This bootstraps and no
 unexpected errors occur during make check.  Is it ok to install?

 2012-10-01  Michael Meissner  meiss...@linux.vnet.ibm.com

 * config/rs6000/rs6000.c (toplevel): Include dumpfile.h.
  (rs6000_density_test): Rework to accommodate 09-30 change by Sharad
 Singhai.

 * config/rs6000/t-rs6000 (rs6000.o): Add dumpfile.h dependency.

 Index: gcc/config/rs6000/rs6000.c
 ===
 --- gcc/config/rs6000/rs6000.c  (revision 191932)
 +++ gcc/config/rs6000/rs6000.c  (working copy)
 @@ -58,6 +58,7 @@
  #include tm-constrs.h
  #include opts.h
  #include tree-vectorizer.h
 +#include dumpfile.h
  #if TARGET_XCOFF
  #include xcoffout.h  /* get declarations of xcoff_*_section_name */
  #endif
 @@ -3518,11 +3519,11 @@ rs6000_density_test (rs6000_cost_data *d
      && vec_cost + not_vec_cost > DENSITY_SIZE_THRESHOLD)
  {
data->cost[vect_body] = vec_cost * (100 + DENSITY_PENALTY) / 100;
 -  if (vect_print_dump_info (REPORT_DETAILS))
 -   fprintf (vect_dump,
 -density %d%%, cost %d exceeds threshold, penalizing 
 -loop body cost by %d%%, density_pct,
 -vec_cost + not_vec_cost, DENSITY_PENALTY);
 +  if (dump_kind_p (MSG_NOTE))

 Is this check needed? Seems redundant.

 David


 +   dump_printf_loc (MSG_NOTE, vect_location,
 +density %d%%, cost %d exceeds threshold, 
 penalizing 
 +loop body cost by %d%%, density_pct,
 +vec_cost + not_vec_cost, DENSITY_PENALTY);
  }
  }

 Index: gcc/config/rs6000/t-rs6000
 ===
 --- gcc/config/rs6000/t-rs6000  (revision 191932)
 +++ gcc/config/rs6000/t-rs6000  (working copy)
 @@ -26,7 +26,7 @@ rs6000.o: $(CONFIG_H) $(SYSTEM_H) corety
$(OBSTACK_H) $(TREE_H) $(EXPR_H) $(OPTABS_H) except.h function.h \
output.h dbxout.h $(BASIC_BLOCK_H) toplev.h $(GGC_H) $(HASHTAB_H) \
$(TM_P_H) $(TARGET_H) $(TARGET_DEF_H) langhooks.h reload.h gt-rs6000.h \
 -  cfgloop.h $(OPTS_H) $(COMMON_TARGET_H)
 +  cfgloop.h $(OPTS_H) $(COMMON_TARGET_H) dumpfile.h

  rs6000-c.o: $(srcdir)/config/rs6000/rs6000-c.c \
  $(srcdir)/config/rs6000/rs6000-protos.h \

 --
 Michael Meissner, IBM
 5 Technology Place Drive, M/S 2757, Westford, MA 01886-3141, USA
 meiss...@linux.vnet.ibm.com fax +1 (978) 399-6899



Re: [PATCH] Fix test breakage, was: Add option for dumping to stderr (issue6190057)

2012-10-02 Thread Richard Guenther
On Tue, Oct 2, 2012 at 1:31 AM, Sharad Singhai sing...@google.com wrote:
 Here is a patch to fix test breakage caused by r191883. Bootstrapped
 on x86_64 and tested with
 make -k check RUNTESTFLAGS=--target_board=unix/\{,-m32\}.

 Okay for trunk?

Ok.

Thanks,
Richard.

 Thanks,
 Sharad

 2012-10-01  Sharad Singhai  sing...@google.com

 * tree-vect-stmts.c (vectorizable_operation): Add missing return.

 testsuite/Changelog

  * gfortran.dg/vect/vect.exp: Change verbose vectorizer dump options
 to fix test failures caused by r191883.
 * gcc.dg/tree-ssa/gen-vect-11.c: Likewise.
 * gcc.dg/tree-ssa/gen-vect-2.c: Likewise.
 * gcc.dg/tree-ssa/gen-vect-32.c: Likewise.
 * gcc.dg/tree-ssa/gen-vect-25.c: Likewise.
 * gcc.dg/tree-ssa/gen-vect-11a.c: Likewise.
 * gcc.dg/tree-ssa/gen-vect-26.c: Likewise.
 * gcc.dg/tree-ssa/gen-vect-11b.c: Likewise.
 * gcc.dg/tree-ssa/gen-vect-11c.c: Likewise.
 * gcc.dg/tree-ssa/gen-vect-28.c: Likewise.
 * testsuite/gcc.target/i386/vect-double-1.c: Fix test. Missing entry
 from r191883.


 Index: testsuite/gfortran.dg/vect/vect.exp
 ===
 --- testsuite/gfortran.dg/vect/vect.exp (revision 191883)
 +++ testsuite/gfortran.dg/vect/vect.exp (working copy)
 @@ -26,7 +26,7 @@ set DEFAULT_VECTCFLAGS 

  # These flags are used for all targets.
  lappend DEFAULT_VECTCFLAGS -O2 -ftree-vectorize -fno-vect-cost-model \
 -  -ftree-vectorizer-verbose=4 -fdump-tree-vect-stats
 +  -fdump-tree-vect-details

  # If the target system supports vector instructions, the default action
  # for a test is 'run', otherwise it's 'compile'.  Save current default.
 Index: testsuite/gcc.dg/tree-ssa/gen-vect-11.c
 ===
 --- testsuite/gcc.dg/tree-ssa/gen-vect-11.c (revision 191883)
 +++ testsuite/gcc.dg/tree-ssa/gen-vect-11.c (working copy)
 @@ -1,6 +1,6 @@
  /* { dg-do run { target vect_cmdline_needed } } */
 -/* { dg-options -O2 -ftree-vectorize -ftree-vectorizer-verbose=3
 -fwrapv -fdump-tree-vect-stats } */
 -/* { dg-options -O2 -ftree-vectorize -ftree-vectorizer-verbose=3
 -fwrapv -fdump-tree-vect-stats -mno-sse { target { i?86-*-*
 x86_64-*-* } } } */
 +/* { dg-options -O2 -ftree-vectorize -fwrapv -fdump-tree-vect-details } */
 +/* { dg-options -O2 -ftree-vectorize -fwrapv
 -fdump-tree-vect-details -mno-sse { target { i?86-*-* x86_64-*-* } }
 } */

  #include stdlib.h

 Index: testsuite/gcc.dg/tree-ssa/gen-vect-2.c
 ===
 --- testsuite/gcc.dg/tree-ssa/gen-vect-2.c (revision 191883)
 +++ testsuite/gcc.dg/tree-ssa/gen-vect-2.c (working copy)
 @@ -1,6 +1,6 @@
  /* { dg-do run { target vect_cmdline_needed } } */
 -/* { dg-options -O2 -ftree-vectorize -ftree-vectorizer-verbose=4
 -fdump-tree-vect-stats } */
 -/* { dg-options -O2 -ftree-vectorize -ftree-vectorizer-verbose=4
 -fdump-tree-vect-stats -mno-sse { target { i?86-*-* x86_64-*-* } } }
 */
 +/* { dg-options -O2 -ftree-vectorize -fdump-tree-vect-details } */
 +/* { dg-options -O2 -ftree-vectorize -fdump-tree-vect-details
 -mno-sse { target { i?86-*-* x86_64-*-* } } } */

  #include stdlib.h

 Index: testsuite/gcc.dg/tree-ssa/gen-vect-32.c
 ===
 --- testsuite/gcc.dg/tree-ssa/gen-vect-32.c (revision 191883)
 +++ testsuite/gcc.dg/tree-ssa/gen-vect-32.c (working copy)
 @@ -1,6 +1,6 @@
  /* { dg-do run { target vect_cmdline_needed } } */
 -/* { dg-options -O2 -ftree-vectorize -ftree-vectorizer-verbose=4
 -fdump-tree-vect-stats } */
 -/* { dg-options -O2 -ftree-vectorize -ftree-vectorizer-verbose=4
 -fdump-tree-vect-stats -mno-sse { target { i?86-*-* x86_64-*-* } } }
 */
 +/* { dg-options -O2 -ftree-vectorize -fdump-tree-vect-details } */
 +/* { dg-options -O2 -ftree-vectorize -fdump-tree-vect-details
 -mno-sse { target { i?86-*-* x86_64-*-* } } } */

  #include stdlib.h

 Index: testsuite/gcc.dg/tree-ssa/gen-vect-25.c
 ===
 --- testsuite/gcc.dg/tree-ssa/gen-vect-25.c (revision 191883)
 +++ testsuite/gcc.dg/tree-ssa/gen-vect-25.c (working copy)
 @@ -1,6 +1,6 @@
  /* { dg-do run { target vect_cmdline_needed } } */
 -/* { dg-options -O2 -ftree-vectorize -ftree-vectorizer-verbose=4
 -fdump-tree-vect-stats } */
 -/* { dg-options -O2 -ftree-vectorize -ftree-vectorizer-verbose=4
 -fdump-tree-vect-stats -mno-sse { target { i?86-*-* x86_64-*-* } } }
 */
 +/* { dg-options -O2 -ftree-vectorize -fdump-tree-vect-details } */
 +/* { dg-options -O2 -ftree-vectorize -fdump-tree-vect-details
 -mno-sse { target { i?86-*-* x86_64-*-* } } } */

  #include stdlib.h

 Index: testsuite/gcc.dg/tree-ssa/gen-vect-11a.c
 ===
 --- testsuite/gcc.dg/tree-ssa/gen-vect-11a.c (revision 191883)
 +++ 

Re: [PATCH] Add option for dumping to stderr (issue6190057)

2012-10-02 Thread Richard Guenther
On Mon, Oct 1, 2012 at 8:39 PM, Gabriel Dos Reis
g...@integrable-solutions.net wrote:
 On Mon, Oct 1, 2012 at 1:27 PM, Michael Meissner
 meiss...@linux.vnet.ibm.com wrote:
 On Mon, Oct 01, 2012 at 02:02:26PM -0400, Michael Meissner wrote:
 Your change on September 30th, breaks the powerpc port because the
 REPORT_DETAILS value in the enumeration is no longer there, and the
 rs6000_density_test function was using that.  Please in the future, when you
 are making global changes, grep for uses of enum values in all of the 
 machine
 dependent directories so we can avoid breakage like this.

 Also, in looking at the changes, given we are already up to 28 TDF_ flags, I
 would recommend immediately adding a new type that is the TDF flagword type.
 Thus it will be a lot simpler when we add 4 more TDF flags and have to change
 the type from int to HOST_WIDE_INT.

 Agreed that we need an abstraction here.

Some TLC as well - the flags have various meanings (some control dumping,
some, like TDF_TREE, seem to be unrelated - the MSG ones probably don't
need the same number-space as well, not all flags are used anymore -
TDF_MEMSYMS?).

But yes, an abstraction is needed.  But I wouldn't suggest HOST_WIDE_INT
but rather int -> uint32_t instead (possibly going to uint64_t).
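Something like the following would do (purely illustrative, not what was
committed; the typedef and function names are made up):

  /* Keep the width of the TDF/MSG flag word in one place so growing past
     32 bits is a one-line change.  */
  typedef uint32_t dump_flags_t;   /* later: uint64_t if needed */

  static inline bool
  dump_flag_set_p (dump_flags_t flags, dump_flags_t flag)
  {
    return (flags & flag) != 0;
  }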

Richard.

 -- Gaby


Re: [Patch] Fix PR53397

2012-10-02 Thread Richard Guenther
On Mon, 1 Oct 2012, venkataramanan.ku...@amd.com wrote:

 Hi, 
 
 The below patch fixes the FFT/Scimark regression caused by useless prefetch
 generation.
 
 This fix tries to make prefetching less aggressive by prefetching arrays in the
 inner loop only when the step is invariant in the entire loop nest.
 
 GCC currently tries to prefetch invariant steps when they are in the inner
 loop, but does not check whether the step is variant in outer loops.
 
 In the scimark FFT case, the trip count of the inner loop varies by a non
 constant step, which is invariant in the inner loop. 
 But the step variable varies in the outer loop.  This makes the
 inner loop trip count small (at run time it sometimes drops to as few as
 1 iteration).
 
 Prefetching x iterations ahead when the inner loop trip count is smaller than x
 leads to useless prefetches. 
 
 Flag used: -O3 -march=amdfam10 
 
 Before 
 **  **
 ** SciMark2 Numeric Benchmark, see http://math.nist.gov/scimark **
 ** for details. (Results can be submitted to p...@nist.gov) **
 **  **
 Using   2.00 seconds min time per kenel.
 Composite Score:  550.50
 FFT Mflops:38.66(N=1024)
 SOR Mflops:   617.61(100 x 100)
 MonteCarlo: Mflops:   173.74
 Sparse matmult  Mflops:   675.63(N=1000, nz=5000)
 LU  Mflops:  1246.88(M=100, N=100)
 
 
 After 
 **  **
 ** SciMark2 Numeric Benchmark, see http://math.nist.gov/scimark **
 ** for details. (Results can be submitted to p...@nist.gov) **
 **  **
 Using   2.00 seconds min time per kenel.
 Composite Score:  639.20
 FFT Mflops:   479.19(N=1024)
 SOR Mflops:   617.61(100 x 100)
 MonteCarlo: Mflops:   173.18
 Sparse matmult  Mflops:   679.13(N=1000, nz=5000)
 LU  Mflops:  1246.88(M=100, N=100)
 
 GCC regression make check -k passes with x86_64-unknown-linux-gnu
 New tests that PASS:
 
 gcc.dg/pr53397-1.c scan-assembler prefetcht0
 gcc.dg/pr53397-1.c scan-tree-dump aprefetch Issued prefetch
 gcc.dg/pr53397-1.c (test for excess errors)
 gcc.dg/pr53397-2.c scan-tree-dump aprefetch loop variant step
 gcc.dg/pr53397-2.c scan-tree-dump aprefetch Not prefetching
 gcc.dg/pr53397-2.c (test for excess errors)
 
 
 Checked CPU2006 and polyhedron on latest AMD processor, no regressions noted.
 
 Ok to commit in trunk?
 
 regards,
 Venkat
 
 gcc/ChangeLog
 +2012-10-01  Venkataramanan Kumar  venkataramanan.ku...@amd.com
 +
 +   * tree-ssa-loop-prefetch.c (gather_memory_references_ref):
 +   Perform non-constant step prefetching in the inner loop, only
 +   when it is invariant in the entire loop nest.
 +   * testsuite/gcc.dg/pr53397-1.c: New test case.
 +   Checks we are prefetching for loop invariant steps.
 +   * testsuite/gcc.dg/pr53397-2.c: New test case.
 +   Checks we are not prefetching for loop variant steps.
 +
 
 
 Index: gcc/testsuite/gcc.dg/pr53397-1.c
 ===
 --- gcc/testsuite/gcc.dg/pr53397-1.c  (revision 0)
 +++ gcc/testsuite/gcc.dg/pr53397-1.c  (revision 0)
 @@ -0,0 +1,28 @@
 +/* Prefetching when the step is loop invariant.  */
 +
 +/* { dg-do compile } */
 +/* { dg-options -O3 -fprefetch-loop-arrays -fdump-tree-aprefetch-details 
 --param min-insn-to-prefetch-ratio=3 --param simultaneous-prefetches=10 
 -fdump-tree-aprefetch-details } */
 +
 +
 +double data[16384];
 +void prefetch_when_non_constant_step_is_invariant(int step, int n)
 +{
 + int a;
 + int b;
 + for (a = 1; a < step; a++) {
 +    for (b = 0; b < n; b += 2 * step) {
 +
 +  int i = 2*(b + a);
 +  int j = 2*(b + a + step);
 +
 +
 +  data[j]   = data[i];
 +  data[j+1] = data[i+1];
 +}
 + }
 +}
 +
 +/* { dg-final { scan-tree-dump Issued prefetch aprefetch } } */
 +/* { dg-final { scan-assembler prefetcht0 } } */

This (and the case below) needs to be adjusted to only run on the
appropriate hardware.  See for example gcc.dg/tree-ssa/prefetch-8.c
for how to do this.

 +/* { dg-final { cleanup-tree-dump aprefetch } } */
 Index: gcc/testsuite/gcc.dg/pr53397-2.c
 ===
 --- gcc/testsuite/gcc.dg/pr53397-2.c  (revision 0)
 +++ gcc/testsuite/gcc.dg/pr53397-2.c  (revision 0)
 @@ -0,0 +1,29 @@
 +/* Not prefetching when the step is loop variant.  */
 +
 +/* { dg-do compile } */
 +/* { dg-options -O3 -fprefetch-loop-arrays -fdump-tree-aprefetch-details 
 --param min-insn-to-prefetch-ratio=3 --param simultaneous-prefetches=10 
 -fdump-tree-aprefetch-details } */
 +
 +
 +double data[16384];
 +void donot_prefetch_when_non_constant_step_is_variant(int step, int n)
 +{ 
 + int a;
 + int b;
 + for (a = 1; a < step; a++,step*=2) {
 

Re: vec_cond_expr adjustments

2012-10-02 Thread Richard Guenther
On Mon, Oct 1, 2012 at 5:57 PM, Marc Glisse marc.gli...@inria.fr wrote:
 [merging both threads, thanks for the answers]


 On Mon, 1 Oct 2012, Richard Guenther wrote:

  optabs should be fixed instead, an is_gimple_val condition is
  implicitly
  val != 0.


  For vectors, I think it should be val < 0 (with an appropriate cast of
  val
  to a signed integer vector type if necessary). Or (val & highbit) != 0,
  but
  that's longer.


 I don't think so.  Throughout the compiler we generally assume false == 0
 and anything else is true.  (yes, for FP there is STORE_FLAG_VALUE, but
 its scope is quite limited - if we want sth similar for vectors we'd have
 to
 invent it).


 See below.


 If we for example have

  predicate = a < b;
 x = predicate ? d : e;
 y = predicate ? f : g;

 we ideally want to re-use the predicate computation on targets where
 that would be optimal (and combine should be able to recover the
 case where it is not).


 That I don't understand. The vcond instruction implemented by targets
 takes
 as arguments d, e, cmp, a, b and emits the comparison itself. I don't see
 how I can avoid sending to the targets both (d,e,<,a,b) and (f,g,<,a,b).
 They will notice eventually that a<b is computed twice and remove one of
 the
 two, but I don't see how to do that in optabs.c. Or I can compute x = a <
 b,
 use x < 0 as the comparison passed to the targets, and expect targets
 (those
 for which it is true) to recognize that < 0 is useless in a vector
 condition
 (PR54700), or is useless on a comparison result.


 But that's a limitation of how vcond works.  ISTR there is/was a vselect
 instruction as well, taking a mask and two vectors to select from.  At
 least
 that's how vcond works internally for some sub-targets.


 vselect seems to only appear in config/. Would it be defined as:
 vselect(m,a,b)=(a&m)|(b&~m) ? I would almost be tempted to just define a
 pattern in .md files and let combine handle it, although it might be one
 instruction too long for that (and if m is x<y, ~m might look like x>=y).
 Or would it match the OpenCL select: For each component of a vector type,
 result[i] = if MSB of c[i] is set ? b[i] : a[i].? Or the pattern with &
 and | but with a precondition that the value of each element of the mask
 must be 0 or ±1?
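 (For reference, a scalar model of that bitwise select, assuming the
 (a&m)|(b&~m) definition above and applied element-wise to the vectors:

   static inline int
   vselect_elt (int m, int a, int b)
   {
     /* With an all-ones mask element this yields a, with all-zeros b;
        anything in between mixes bits from both.  */
     return (a & m) | (b & ~m);
   }
 )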

 I don't find vcond that bad, as long as targets check for trivial
 comparisons in the expansion (what trivial means may depend on the
 platform). It is quite flexible for targets.

Well, ok.


 On Mon, 1 Oct 2012, Richard Guenther wrote:

 tmp = fold_build2_loc (gimple_location (def_stmt),
code,
 -  boolean_type_node,
 +  TREE_TYPE (cond),


 That's obvious.


 Ok, I'll test and commit that line separately.

 +  if (TREE_CODE (op0) == VECTOR_CST  TREE_CODE (op1) == VECTOR_CST)
 +{
 +  int count = VECTOR_CST_NELTS (op0);
 +  tree *elts =  XALLOCAVEC (tree, count);
 +  gcc_assert (TREE_CODE (type) == VECTOR_TYPE);
 +
 +  for (int i = 0; i  count; i++)
 +   {
 + tree elem_type = TREE_TYPE (type);
 + tree elem0 = VECTOR_CST_ELT (op0, i);
 + tree elem1 = VECTOR_CST_ELT (op1, i);
 +
 + elts[i] = fold_relational_const (code, elem_type,
 +  elem0, elem1);
 +
 + if(elts[i] == NULL_TREE)
 +   return NULL_TREE;
 +
 + elts[i] = fold_negate_const (elts[i], elem_type);


 I think you need to invent something new similar to STORE_FLAG_VALUE
 or use STORE_FLAG_VALUE here.  With the above you try to map
 {0, 1} to {0, -1} which is only true if the operation on the element types
 returns {0, 1} (thus, STORE_FLAG_VALUE is 1).


 Er, seems to me that constant folding of a scalar comparison in the
 front/middle-end only returns {0, 1}.

The point is we need to define some semantics for vector comparison
results.  One variant is to make it target independent which in turn
would inhibit (or make it more difficult) to exploit some target features.
You for example use {0, -1} for truth values - probably to exploit target
features - even though the most natural middle-end way would be to
use {0, 1} as for everything else (caveat: there may be both signed
and unsigned bools, we don't allow vector components with non-mode precision,
thus you could argue that a signed bool : 1 is just sign-extended
for your solution).  A different variant is to make it target dependent
to leverage optimization opportunities - that's why STORE_FLAG_VALUE
exists.  For example with a vector comparison a < v result, when
performing bitwise operations on it, you either have to make the target
expand code to produce {0, -1} even if the natural compare instruction
would, say, produce {0, 0x8} - or not constrain the possible values
of its result (like forwprop would do with your patch).  In general we
want constant folding to yield the same results as if the HW carried
out the operation to make -O0 code not diverge from

Re: [PATCH] Vector CONSTRUCTOR verifier

2012-10-02 Thread Richard Guenther
On Tue, Oct 2, 2012 at 3:01 PM, Jakub Jelinek ja...@redhat.com wrote:
 Hi!

 As discussed in the PR and on IRC, this patch verifies that vector
 CONSTRUCTOR in GIMPLE is either empty CONSTRUCTOR, or contains scalar
 elements of type compatible with vector element type (then the verification
 is less strict, allows fewer than TYPE_VECTOR_SUBPARTS elements and allows
 non-NULL indexes if they are consecutive (no holes); this is because
 from FEs often CONSTRUCTORs with those properties leak into the IL, and
 a change in the gimplifier to canonicalize them wasn't enough, they keep
 leaking even from non-gimplified DECL_INITIAL values etc.), or
 contains vector elements (element types must be compatible, the vector
 elements must be of the same type and their number must fill the whole
 wider vector - these are created/used by tree-vect-generic lowering if
 HW supports only shorter vectors than what is requested in source).
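 In source terms the accepted shapes correspond roughly to the following
 (a sketch using the GNU vector extension; the vector-of-vectors form is
 internal to the middle-end and shown only as a comment):

   typedef int v4si __attribute__ ((vector_size (16)));
   typedef int v8si __attribute__ ((vector_size (32)));

   int i = 1, j = 2;
   v4si a = {};        /* empty CONSTRUCTOR (GNU extension at source level) */
   v4si b = { i, j };  /* scalar elements; fewer than 4 allowed, rest zero-filled */
   /* v8si c = { lo, hi };  two v4si elements exactly covering the v8si,
                            as produced by tree-vect-generic lowering */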

 Bootstrapped/regtested on x86_64-linux and i686-linux, ok for trunk?

Ok with ...

 2012-10-02  Jakub Jelinek  ja...@redhat.com

 PR tree-optimization/54713
 * expr.c (categorize_ctor_elements_1): Don't assume purpose is
 non-NULL.
 * tree-cfg.c (verify_gimple_assign_single): Add verification of
 vector CONSTRUCTORs.
 * tree-ssa-sccvn.c (vn_reference_lookup_3): For VECTOR_TYPE
 CONSTRUCTORs, don't do anything if element type is VECTOR_TYPE,
 and don't check index.
 * tree-vect-slp.c (vect_get_constant_vectors): VIEW_CONVERT_EXPR
 ctor elements first if their type isn't compatible with vector
 element type.

 --- gcc/expr.c.jj   2012-09-27 12:45:53.0 +0200
 +++ gcc/expr.c  2012-10-01 18:21:40.885122833 +0200
 @@ -5491,7 +5491,7 @@ categorize_ctor_elements_1 (const_tree c
  {
HOST_WIDE_INT mult = 1;

 -  if (TREE_CODE (purpose) == RANGE_EXPR)
 +  if (purpose  TREE_CODE (purpose) == RANGE_EXPR)
 {
   tree lo_index = TREE_OPERAND (purpose, 0);
   tree hi_index = TREE_OPERAND (purpose, 1);
 --- gcc/tree-cfg.c.jj   2012-10-01 17:28:17.469921927 +0200
 +++ gcc/tree-cfg.c  2012-10-02 11:24:11.686155889 +0200
 @@ -4000,6 +4000,80 @@ verify_gimple_assign_single (gimple stmt
return res;

  case CONSTRUCTOR:
 +  if (TREE_CODE (rhs1_type) == VECTOR_TYPE)
 +   {
 + unsigned int i;
 + tree elt_i, elt_v, elt_t = NULL_TREE;
 +
 + if (CONSTRUCTOR_NELTS (rhs1) == 0)
 +   return res;
 + /* For vector CONSTRUCTORs we require that either it is empty
 +CONSTRUCTOR, or it is a CONSTRUCTOR of smaller vector elements
 +(then the element count must be correct to cover the whole
 +outer vector and index must be NULL on all elements, or it is
 +a CONSTRUCTOR of scalar elements, where we as an exception allow
 +smaller number of elements (assuming zero filling) and
 +consecutive indexes as compared to NULL indexes (such
 +CONSTRUCTORs can appear in the IL from FEs).  */
 + FOR_EACH_CONSTRUCTOR_ELT (CONSTRUCTOR_ELTS (rhs1), i, elt_i, elt_v)
 +   {
 + if (elt_t == NULL_TREE)
 +   {
 + elt_t = TREE_TYPE (elt_v);
 + if (TREE_CODE (elt_t) == VECTOR_TYPE)
 +   {
 + tree elt_t = TREE_TYPE (elt_v);
 + if (!useless_type_conversion_p (TREE_TYPE (rhs1_type),
 + TREE_TYPE (elt_t)))
 +   {
 + error (incorrect type of vector CONSTRUCTOR
 + elements);
 + debug_generic_stmt (rhs1);
 + return true;
 +   }
 + else if (CONSTRUCTOR_NELTS (rhs1)
 +  * TYPE_VECTOR_SUBPARTS (elt_t)
 +  != TYPE_VECTOR_SUBPARTS (rhs1_type))
 +   {
 + error (incorrect number of vector CONSTRUCTOR
 + elements);
 + debug_generic_stmt (rhs1);
 + return true;
 +   }
 +   }
 + else if (!useless_type_conversion_p (TREE_TYPE (rhs1_type),
 +  elt_t))
 +   {
 + error (incorrect type of vector CONSTRUCTOR elements);
 + debug_generic_stmt (rhs1);
 + return true;
 +   }
 + else if (CONSTRUCTOR_NELTS (rhs1)
 +  > TYPE_VECTOR_SUBPARTS (rhs1_type))
 +   {
 + error (incorrect number of vector CONSTRUCTOR 
 elements);
 + debug_generic_stmt (rhs1);
 + return true;
 +   

[PATCH] Fix PR54735

2012-10-02 Thread Richard Guenther

This fixes PR54735 - a bad interaction of non-up-to-date virtual
SSA form, update-SSA and cfg cleanup.  Moral of the story:
cfg cleanup can remove blocks and thus release SSA names - SSA
update is rightfully confused when such a released SSA name is
still used at update time.

The following patch fixes the case in question simply by making
sure to run update-SSA before cfg cleanup.

[eventually release_ssa_name could treat virtual operands the
same as regular SSA names when they are scheduled for a re-write,
that would make this whole mess more robust - I am thinking of this]

Bootstrapped on x86_64-unknown-linux-gnu, testing in progress.

Richard.

2012-10-02  Richard Guenther  rguent...@suse.de

PR middle-end/54735
* tree-ssa-pre.c (do_pre): Make sure to update virtual SSA form before
cleaning up the CFG.

* g++.dg/torture/pr54735.C: New testcase.

Index: gcc/tree-ssa-pre.c
===
--- gcc/tree-ssa-pre.c  (revision 191969)
+++ gcc/tree-ssa-pre.c  (working copy)
@@ -4820,6 +4820,13 @@ do_pre (void)
 
   free_scc_vn ();
 
+  /* Tail merging invalidates the virtual SSA web, together with
+ cfg-cleanup opportunities exposed by PRE this will wreck the
+ SSA updating machinery.  So make sure to run update-ssa
+ manually, before eventually scheduling cfg-cleanup as part of
+ the todo.  */
+  update_ssa (TODO_update_ssa_only_virtuals);
+
   return todo;
 }
 
@@ -4845,8 +4852,7 @@ struct gimple_opt_pass pass_pre =
   0,   /* properties_provided */
   0,   /* properties_destroyed */
   TODO_rebuild_alias,  /* todo_flags_start */
-  TODO_update_ssa_only_virtuals  | TODO_ggc_collect
-  | TODO_verify_ssa /* todo_flags_finish */
+  TODO_ggc_collect | TODO_verify_ssa   /* todo_flags_finish */
  }
 };
 
Index: gcc/testsuite/g++.dg/torture/pr54735.C
===
--- gcc/testsuite/g++.dg/torture/pr54735.C  (revision 0)
+++ gcc/testsuite/g++.dg/torture/pr54735.C  (working copy)
@@ -0,0 +1,179 @@
+// { dg-do compile }
+
+class Gmpfr
+{};
+class M : Gmpfr
+{
+public:
+  Gmpfr infconst;
+  M(int);
+};
+templatetypenamestruct A;
+templatetypename, int, int, int = 0 ? : 0, int = 0, int = 0class N;
+templatetypenameclass O;
+templatetypenamestruct B;
+struct C
+{
+  enum
+  { value };
+};
+class D
+{
+public:
+  enum
+  { ret };
+};
+struct F
+{
+  enum
+  { ret = 0 ? : 0 };
+};
+templatetypename Derivedstruct G
+{
+  typedef ODerivedtype;
+};
+struct H
+{
+  void operator * ();
+};
+struct I
+{
+  enum
+  { RequireInitialization = C::value ? : 0, ReadCost };
+};
+templatetypename Derivedstruct J
+{
+  enum
+  { ret = ADerived::InnerStrideAtCompileTime };
+};
+templatetypename Derivedstruct K
+{
+  enum
+  { ret = ADerived::OuterStrideAtCompileTime };
+};
+templatetypename Derivedclass P : H
+{
+public:
+  using H::operator *;
+  typedef typename ADerived::Scalar Scalar;
+  enum
+  { RowsAtCompileTime=
+  ADerived::RowsAtCompileTime, ColsAtCompileTime   =
+  ADerived::ColsAtCompileTime, SizeAtCompileTime   =
+  F::ret, MaxRowsAtCompileTime   =
+  ADerived::MaxRowsAtCompileTime, MaxColsAtCompileTime =
+  ADerived::MaxColsAtCompileTime, MaxSizeAtCompileTime =
+  F::ret, Flags  =
+  ADerived::Flags ? : 0 ? : 0, CoeffReadCost   =
+  ADerived::CoeffReadCost, InnerStrideAtCompileTime=
+  JDerived::ret, OuterStrideAtCompileTime  = KDerived::ret 
};
+  BDerived operator  (const Scalar);
+};
+
+templatetypename Derivedclass O : public PDerived
+{};
+
+templateint _Colsclass L
+{
+public:
+
+  int cols()
+  {
+return _Cols;
+  }
+};
+templatetypename Derivedclass Q : public GDerived::type
+{
+public:
+  typedef typename GDerived::type   Base;
+  typedef typename ADerived::Index  Index;
+  typedef typename ADerived::Scalar Scalar;
+  LBase::ColsAtCompileTime m_storage;
+  Index cols()
+  {
+return m_storage.cols();
+  }
+
+  Scalar coeffRef(Index,
+   Index);
+};
+
+templatetypename _Scalar, int _Rows, int _Cols, int _Options, int _MaxRows,
+ int _MaxColsstruct AN_Scalar, _Rows, _Cols, _Options, _MaxRows,
+ _MaxCols 
+{
+  typedef _Scalar Scalar;
+  typedef int Index;
+  enum
+  { RowsAtCompileTime, ColsAtCompileTime  =
+  _Cols, MaxRowsAtCompileTime, MaxColsAtCompileTime, Flags=
+  D::ret, CoeffReadCost   =
+  I::ReadCost, InnerStrideAtCompileTime, OuterStrideAtCompileTime =
+  0 ? : 0 };
+};
+templatetypename _Scalar, int, int _Cols, int, int,
+ intclass N : public QN_Scalar, 0, _Cols 
+{
+public:
+  QN Base;
+  templatetypename T0, typename T1N(const T0

Re: [patch] Minor TARGET_MEM_REF cleanup

2012-10-01 Thread Richard Guenther
On Sat, Sep 29, 2012 at 1:17 PM, Eric Botcazou ebotca...@adacore.com wrote:
 Hi,

 for simple loops like:

 extern int a[];
 extern int b[];

 void foo (int l)
 {
   int i;

   for (i = 0; i  l; i++)
 a[i] = b [i];
 }

 you get in the .lim3 dump:

 Unanalyzed memory reference 0: _5 = MEM[symbol: b, index: ivtmp.3_1, step: 4,
 offset: 0B];
 Memory reference 1: MEM[symbol: a, index: ivtmp.3_1, step: 4, offset: 0B]

 so the pass analyzes the store but not the load, which seems an oversight.
 The patch also folds copy_mem_ref_info into its only user and removes it.

 Tested on x86_64-suse-linux, OK for mainline?

Please take the opportunity to clean up simple_mem_ref_in_stmt some more.
Both loads and stores in assigns require gimple_assign_single_p, thus do

 if (!gimple_assign_single_p (stmt))
  return NULL;

before deciding on store/load.  To decide that the stmt is a load then do

  if (TREE_CODE (*lhs) == SSA_NAME
      && gimple_vuse (stmt))

it is a store if

   gimple_vdef (stmt)
   && (TREE_CODE (*rhs) == SSA_NAME
       || is_gimple_min_invariant (*rhs))

else it may still be an aggregate copy but LIM doesn't handle those
(though they may still be interesting for disambiguation ...)
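Putting that together, simple_mem_ref_in_stmt would look roughly like this
(a sketch of the suggested shape, not the final committed patch):

  static tree *
  simple_mem_ref_in_stmt (gimple stmt, bool *is_store)
  {
    tree *lhs, *rhs;

    /* Recognize SSA_NAME loads and stores of SSA names / invariants only;
       both require a single-rhs assignment.  */
    if (!gimple_assign_single_p (stmt))
      return NULL;

    lhs = gimple_assign_lhs_ptr (stmt);
    rhs = gimple_assign_rhs1_ptr (stmt);

    if (TREE_CODE (*lhs) == SSA_NAME && gimple_vuse (stmt))
      {
        *is_store = false;
        return rhs;
      }
    else if (gimple_vdef (stmt)
             && (TREE_CODE (*rhs) == SSA_NAME
                 || is_gimple_min_invariant (*rhs)))
      {
        *is_store = true;
        return lhs;
      }

    /* Aggregate copies and the rest are not handled by LIM.  */
    return NULL;
  }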

The tree-ssa-address parts are ok as-is.

Thanks,
Richard.



 2012-09-29  Eric Botcazou  ebotca...@adacore.com

 * tree.h (copy_mem_ref_info): Delete.
 * tree-ssa-address.c (copy_mem_ref_info): Likewise.
 (maybe_fold_tmr): Copy flags manually.
 * tree-ssa-loop-im.c (simple_mem_ref_in_stmt): Accept TARGET_MEM_REF
 on the RHS as well.


 --
 Eric Botcazou


Re: [PATCH] Fix PR middle-end/54759

2012-10-01 Thread Richard Guenther
On Sun, Sep 30, 2012 at 9:03 PM, Dehao Chen de...@google.com wrote:
 Hi,

 This patch fixes the bug when comparing location to UNKNOWN_LOC.

 Bootstrapped and passed gcc regression test.

 Okay for trunk?

Ok.

Thanks,
Richard.

 Thanks,
 Dehao

 2012-09-30  Dehao Chen  de...@google.com

 PR middle-end/54759
 * gcc/tree-vect-loop-manip.c (slpeel_make_loop_iterate_ntimes): Use
 LOCATION_LOCUS to compare with UNKNOWN_LOCATION.
 (slpeel_tree_peel_loop_to_edge): Likewise.
 * gcc/tree-vectorizer.c (vectorize_loops): Likewise.


Re: RFC: LRA for x86/x86-64 [0/9]

2012-10-01 Thread Richard Guenther
On Mon, Oct 1, 2012 at 7:48 AM, Jakub Jelinek ja...@redhat.com wrote:
 On Sun, Sep 30, 2012 at 06:50:50PM -0400, Vladimir Makarov wrote:
    But I think that LRA cpu time problem for this test can be fixed.
  But I don't think I can fix it within 2 weeks.  So if people believe
  that current LRA behaviour on this PR is a stopper to include it
  into gcc4.8 then we should postpone its inclusion until gcc4.9 when
  I hope to fix it.

 I think this testcase shouldn't be a show stopper for LRA inclusion into
 4.8, but something to look at for stage3.

I agree here.

 I think a lot of GCC passes have scalability issues on that testcase,
 that is why it must be compiled with -O1 and not higher optimization
 options, so perhaps it would be enough to choose a faster algorithm
 generating worse code for the huge functions and -O1.

Yes, we spent quite some time in making basic optimization work for
insane testcases (basically avoid quadratic or bigger complexity in any
IL size variable (number of basic-blocks, edges, instructions, pseudos, etc.)).

And indeed if you use -O2 we do have issues with existing passes (and even
at -O1 points-to analysis can wreck things, or even profile guessing - see
existing bugs for that).  Basically I would tune -O1 towards being able to
compile and optimize insane testcases with memory and compile-time
requirements that are linear in any of the above complexity measures.

Thus, falling back to the -O0 register allocating strategy at certain
thresholds for the above complexity measures is fine (existing IRA
for example has really bad scaling on the number of loops in the function,
but you can tweak it with flags to make it not consider that).

 And I agree it is primarily a bug in the generator that it creates such huge
 functions, that can't perform very well.

Well, not for -O2+, yes, but at least we should try(!) hard.

Thanks,
Richard.

 Jakub


Re: Profile housekeeping 5/n (make RTL loop optimizers to use loop bounds better)

2012-10-01 Thread Richard Guenther
On Mon, Oct 1, 2012 at 11:37 AM, Jan Hubicka hubi...@ucw.cz wrote:
 Hi,
 this patch commonizes the maximal iteration estimate logic between SCEV and
 loop-iv.  Both are now using loop->nb_iterations_upper_bound.  I decided to
 keep the same API for SCEV code as for RTL code, so I made
 estimated_loop_iterations and max_loop_iterations not try to recompute
 bounds and ICE when invoked without SCEV initialized.

 The patch updates RTL optimizers to use the estimated_loop_iterations and
 max_loop_iterations. This has few advantages:
   1) loop unroller can now take into account estimates stored into
  loop structure by earlier pass (I think none exist though)
  It is however better than using expected_loop_iterations since
  profile might get out of date with expansion.

   2) loop peeling code now uses the max iterations bound. This makes it
  possible, e.g., to peel vectorizer prologues/epilogues/scalar loops, so
  -fpeel-loops now improves my low iteration count testcase by about 10%

   3) Same for loop unswitching.

 I am not really a friend of the new double_int API. I copied some existing
 examples but find it ugly.  Why are we missing operators for comparisons and
 division? Why can't from_*/to_* be a cast, at least for basic integer types?

 Regtested/bootstrapped x86_64-linux, seems sane?

Yes.  Can you add a testcase or two?  I tweaked RTL unroll/peel after preserving
loops to not blindly unroll/peel everything 8 times (not sure if _I_
added testcases ...).

 I also wonder if loop vectorizer should not update the estimates after
 loop iteration count is reduced by vectorizing.

Probably yes.

Thanks,
Richard.

 Honza
 * loop-unswitch.c (unswitch_single_loop): Use
 estimated_loop_iterations_int to prevent unswitching when loop
 is known to not roll.
 * tree-ssa-loop-niter.c (estimated_loop_iterations): Do not segfault
 when SCEV is not initialized.
 (max_loop_iterations): Likewise.
 * tree-ssa-loop-unswitch.c (tree_ssa_unswitch_loops): Use
  estimated_loop_iterations_int to prevent unswitching when
 loop is known to not roll.
 * tree-scalar-evolution.c (scev_initialized_p): New function.
 * tree-scalar-evolution.h (scev_initialized_p): Likewise.
 * loop-unroll.c (decide_peel_once_rolling): Use
 max_loop_iterations_int.
 (unroll_loop_constant_iterations): Update
 nb_iterations_upper_bound and nb_iterations_estimate.
 (decide_unroll_runtime_iterations): Use
 estimated_loop_iterations or max_loop_iterations;
 (unroll_loop_runtime_iterations): fix profile updating.
 (decide_peel_simple): Use estimated_loop_iterations
 and max_loop_iterations.
 (decide_unroll_stupid): Use estimated_loop_iterations
 ad max_loop_iterations.
 * loop-doloop.c (doloop_modify): Use max_loop_iterations_int.
 (doloop_optimize): Likewise.
 * loop-iv.c (iv_number_of_iterations): Use record_niter_bound.
 (find_simple_exit): Likewise.
 * cfgloop.h (struct niter_desc): Remove niter_max.

 Index: loop-unswitch.c
 ===
 *** loop-unswitch.c (revision 191867)
 --- loop-unswitch.c (working copy)
 *** unswitch_single_loop (struct loop *loop,
 *** 257,262 
 --- 257,263 
 rtx cond, rcond = NULL_RTX, conds, rconds, acond, cinsn;
 int repeat;
 edge e;
 +   HOST_WIDE_INT iterations;

 /* Do not unswitch too much.  */
 if (num  PARAM_VALUE (PARAM_MAX_UNSWITCH_LEVEL))
 *** unswitch_single_loop (struct loop *loop,
 *** 299,305 
   }

 /* Nor if the loop usually does not roll.  */
 !   if (expected_loop_iterations (loop) < 1)
   {
 if (dump_file)
 fprintf (dump_file, ;; Not unswitching, loop iterations < 1\n);
 --- 300,307 
   }

 /* Nor if the loop usually does not roll.  */
 !   iterations = estimated_loop_iterations_int (loop);
 !   if (iterations >= 0 && iterations <= 1)
   {
 if (dump_file)
 fprintf (dump_file, ;; Not unswitching, loop iterations < 1\n);
 Index: tree-ssa-loop-niter.c
 ===
 *** tree-ssa-loop-niter.c   (revision 191867)
 --- tree-ssa-loop-niter.c   (working copy)
 *** estimate_numbers_of_iterations_loop (str
 *** 3012,3020 
   bool
   estimated_loop_iterations (struct loop *loop, double_int *nit)
   {
 !   estimate_numbers_of_iterations_loop (loop);
 if (!loop->any_estimate)
 ! return false;

 *nit = loop->nb_iterations_estimate;
 return true;
 --- 3012,3034 
   bool
   estimated_loop_iterations (struct loop *loop, double_int *nit)
   {
 !   /* When SCEV information is available, try to update loop iterations
 !  estimate.  Otherwise just return whatever we recorded earlier.  */
 !   if (scev_initialized_p ())
 ! 

Re: Constant-fold vector comparisons

2012-10-01 Thread Richard Guenther
On Sat, Sep 29, 2012 at 3:25 PM, Marc Glisse marc.gli...@inria.fr wrote:
 Hello,

 this patch does 2 things (I should have split it in 2, but the questions go
 together):

 1) it handles constant folding of vector comparisons,

 2) it fixes another place where vectors are not expected (I'll probably wait
 to have front-end support and testcases to do more of those, but there is
 something to discuss).

 I wasn't sure what integer_truep should test exactly. For integer: == 1 or
 != 0? For vectors: == -1 or < 0? I chose the one that worked best for the
 forwprop case where I used it.

 It seems that before this patch, the middle-end didn't know how comparison
 results were encoded (a good reason for VEC_COND_EXPR to require a
 comparison as its first argument). I am using the OpenCL encoding that what
 matters is the high bit of each vector element. I am not quite sure what
 happens for targets (are there any?) that use a different encoding. When
 expanding vcond, they can do the comparison as they like. When expanding an
 isolated comparison, I expect they have to expand it as vcond(a<b,-1,0). So
 it should be ok, but I could easily have missed something.

Comments below


 2012-10-01  Marc Glisse  marc.gli...@inria.fr

 gcc/
 * tree.c (integer_truep): New function.
 * tree.h (integer_truep): Declare.
 * tree-ssa-forwprop.c (forward_propagate_into_cond): Call it.
 Don't use boolean_type_node for vectors.
 * fold-const.c (fold_relational_const): Handle VECTOR_CST.

 gcc/testsuite/
 * gcc.dg/tree-ssa/foldconst-6.c: New testcase.

 --
 Marc Glisse
 Index: gcc/tree.h
 ===
 --- gcc/tree.h  (revision 191850)
 +++ gcc/tree.h  (working copy)
 @@ -5272,20 +5272,25 @@ extern int integer_zerop (const_tree);

  /* integer_onep (tree x) is nonzero if X is an integer constant of value 1.
 */

  extern int integer_onep (const_tree);

  /* integer_all_onesp (tree x) is nonzero if X is an integer constant
 all of whose significant bits are 1.  */

  extern int integer_all_onesp (const_tree);

 +/* integer_truep (tree x) is nonzero if X is an integer constant of value
 1,
 +   or a vector constant of value < 0.  */
 +
 +extern bool integer_truep (const_tree);
 +
  /* integer_pow2p (tree x) is nonzero is X is an integer constant with
 exactly one bit 1.  */

  extern int integer_pow2p (const_tree);

  /* integer_nonzerop (tree x) is nonzero if X is an integer constant
 with a nonzero value.  */

  extern int integer_nonzerop (const_tree);

 Index: gcc/tree-ssa-forwprop.c
 ===
 --- gcc/tree-ssa-forwprop.c (revision 191850)
 +++ gcc/tree-ssa-forwprop.c (working copy)
 @@ -564,46 +564,46 @@ forward_propagate_into_cond (gimple_stmt
enum tree_code code;
tree name = cond;
gimple def_stmt = get_prop_source_stmt (name, true, NULL);
if (!def_stmt || !can_propagate_from (def_stmt))
 return 0;

code = gimple_assign_rhs_code (def_stmt);
if (TREE_CODE_CLASS (code) == tcc_comparison)
 tmp = fold_build2_loc (gimple_location (def_stmt),
code,
 -  boolean_type_node,
 +  TREE_TYPE (cond),

That's obvious.

gimple_assign_rhs1 (def_stmt),
gimple_assign_rhs2 (def_stmt));
else if ((code == BIT_NOT_EXPR
  && TYPE_PRECISION (TREE_TYPE (cond)) == 1)
|| (code == BIT_XOR_EXPR
 -  && integer_onep (gimple_assign_rhs2 (def_stmt
 +  && integer_truep (gimple_assign_rhs2 (def_stmt

See below.

 {
   tmp = gimple_assign_rhs1 (def_stmt);
   swap = true;
 }
  }

if (tmp
 is_gimple_condexpr (tmp))
  {
if (dump_file && tmp)
 {
   fprintf (dump_file,   Replaced ');
   print_generic_expr (dump_file, cond, 0);
   fprintf (dump_file, ' with ');
   print_generic_expr (dump_file, tmp, 0);
   fprintf (dump_file, '\n);
 }

 -  if (integer_onep (tmp))
 +  if (integer_truep (tmp))
 gimple_assign_set_rhs_from_tree (gsi_p, gimple_assign_rhs2 (stmt));
else if (integer_zerop (tmp))
 gimple_assign_set_rhs_from_tree (gsi_p, gimple_assign_rhs3 (stmt));
else
 {
   gimple_assign_set_rhs1 (stmt, unshare_expr (tmp));
   if (swap)
 {
   tree t = gimple_assign_rhs2 (stmt);
   gimple_assign_set_rhs2 (stmt, gimple_assign_rhs3 (stmt));
 Index: gcc/testsuite/gcc.dg/tree-ssa/foldconst-6.c
 ===
 --- gcc/testsuite/gcc.dg/tree-ssa/foldconst-6.c (revision 0)
 +++ gcc/testsuite/gcc.dg/tree-ssa/foldconst-6.c (revision 0)
 @@ -0,0 +1,14 @@
 +/* { 

[PATCH] Fix PR47799 - debug info for early-inlining with LTO

2012-10-01 Thread Richard Guenther

This tries to emit proper debug information for early-inlined
functions from LTO LTRANS phase (thus, emit DW_TAG_inlined_subroutine
and allow gdb to set breakpoints).  We need to avoid confusing
LTO and dwarf2out with the full abstract block tree, so this
patch flattens the abstract block tree by always using the
ultimate origin for BLOCK_ABSTRACT_ORIGIN on blocks which are
inlined_function_outer_scope_p.  Thus, it tries to output the
minimal info dwarf2out.c needs to emit the desired debug information.

As with LTO all abstract inline instances get generated late for
early inlined functions I had to amend the hack for extern inline
functions to always output dies for decls that come their way
through dwarf2out_abstract_function.  And further down not crash
on a NULL DECL_INITIAL (when LTO decided to output the function
body in another LTRANS unit or if it does not get output at all).

Currently LTO-bootstrapping and testing on x86_64-unknown-linux-gnu.

Jason, are the dwarf2out bits ok with you?

I've so far toyed with examples like

int x, y;
static inline int foo (int i) { y = i; return y; }
static inline int bar (int i) { x = i; return foo (x); }
int main ()
{
  int k = 0;
  int res = bar (k);
  return res;
}

and debug information with/without LTO is now reasonably the same
and I can set breakpoints on the inlined instances.

Thanks,
Richard.

2012-10-01  Richard Guenther  rguent...@suse.de

PR lto/47788
* tree-streamer-out.c (write_ts_block_tree_pointers): For
inlined functions outer scopes write the ultimate origin
as BLOCK_ABSTRACT_ORIGIN and BLOCK_SOURCE_LOCATION.
Do not stream the fragment chains.
(lto_input_ts_block_tree_pointers): Likewise.
* dwarf2out.c (gen_subprogram_die): Handle NULL DECL_INITIAL.
(dwarf2out_decl): Always output DECL_ABSTRACT function decls.

Index: gcc/tree-streamer-in.c
===
*** gcc/tree-streamer-in.c  (revision 191824)
--- gcc/tree-streamer-in.c  (working copy)
*** static void
*** 789,810 
  lto_input_ts_block_tree_pointers (struct lto_input_block *ib,
  struct data_in *data_in, tree expr)
  {
-   /* Do not stream BLOCK_SOURCE_LOCATION.  We cannot handle debug information
-  for early inlining so drop it on the floor instead of ICEing in
-  dwarf2out.c.  */
BLOCK_VARS (expr) = streamer_read_chain (ib, data_in);
  
-   /* Do not stream BLOCK_NONLOCALIZED_VARS.  We cannot handle debug 
information
-  for early inlining so drop it on the floor instead of ICEing in
-  dwarf2out.c.  */
- 
BLOCK_SUPERCONTEXT (expr) = stream_read_tree (ib, data_in);
  
!   /* Do not stream BLOCK_ABSTRACT_ORIGIN.  We cannot handle debug information
!  for early inlining so drop it on the floor instead of ICEing in
   dwarf2out.c.  */
!   BLOCK_FRAGMENT_ORIGIN (expr) = stream_read_tree (ib, data_in);
!   BLOCK_FRAGMENT_CHAIN (expr) = stream_read_tree (ib, data_in);
  
/* We re-compute BLOCK_SUBBLOCKS of our parent here instead
   of streaming it.  For non-BLOCK BLOCK_SUPERCONTEXTs we still
--- 789,810 
  lto_input_ts_block_tree_pointers (struct lto_input_block *ib,
  struct data_in *data_in, tree expr)
  {
BLOCK_VARS (expr) = streamer_read_chain (ib, data_in);
  
BLOCK_SUPERCONTEXT (expr) = stream_read_tree (ib, data_in);
  
!   /* Stream BLOCK_ABSTRACT_ORIGIN and BLOCK_SOURCE_LOCATION for
!  the limited cases we can handle - those that represent inlined
!  function scopes.  For the rest drop them on the floor instead of ICEing in
   dwarf2out.c.  */
!   BLOCK_ABSTRACT_ORIGIN (expr) = stream_read_tree (ib, data_in);
!   BLOCK_SOURCE_LOCATION (expr) = lto_input_location (ib, data_in);
!   /* Do not stream BLOCK_NONLOCALIZED_VARS.  We cannot handle debug 
information
!  for early inlined BLOCKs so drop it on the floor instead of ICEing in
!  dwarf2out.c.  */
! 
!   /* BLOCK_FRAGMENT_ORIGIN and BLOCK_FRAGMENT_CHAIN is not live at LTO
!  streaming time.  */
  
/* We re-compute BLOCK_SUBBLOCKS of our parent here instead
   of streaming it.  For non-BLOCK BLOCK_SUPERCONTEXTs we still
Index: gcc/tree-streamer-out.c
===
*** gcc/tree-streamer-out.c (revision 191824)
--- gcc/tree-streamer-out.c (working copy)
*** write_ts_exp_tree_pointers (struct outpu
*** 682,702 
  static void
  write_ts_block_tree_pointers (struct output_block *ob, tree expr, bool ref_p)
  {
-   /* Do not stream BLOCK_SOURCE_LOCATION.  We cannot handle debug information
-  for early inlining so drop it on the floor instead of ICEing in
-  dwarf2out.c.  */
streamer_write_chain (ob, BLOCK_VARS (expr), ref_p);
  
/* Do not stream BLOCK_NONLOCALIZED_VARS.  We cannot handle debug 
information
!  for early inlining so drop it on the floor

Re: [RFC] Make vectorizer to skip loops with small iteration estimate

2012-10-01 Thread Richard Guenther
On Sun, 30 Sep 2012, Jan Hubicka wrote:

 Hi,
 the point of the following patch is to make vectorizer to not vectorize the
 following testcase with profile feedback:
 
 int a[1];
 int i=5;
 int k=2;
 int val;
 __attribute__ ((noinline,noclone))
 test()
 {
   int j;
   for(j=0;jk;j++)
 a[j]=val;
 }
 main()
 {
   while (i)
 {
   test ();
   i--;
 }
 }
 
 Here the compiler should work out that the second loop iterates 2 times on
 average and thus is not a good candidate for vectorizing.
 
 In my first attempt I added the following:
 @@ -1474,6 +1478,18 @@ vect_analyze_loop_operations (loop_vec_i
return false;
  }
  
 +  if ((estimated_niter = estimated_stmt_executions_int (loop)) != -1
 +      && (unsigned HOST_WIDE_INT) estimated_niter <= th)
 +{
 +  if (vect_print_dump_info (REPORT_UNVECTORIZED_LOCATIONS))
 +fprintf (vect_dump, not vectorized: estimated iteration count too 
 small.);
 +  if (vect_print_dump_info (REPORT_DETAILS))
 +fprintf (vect_dump, not vectorized: estimated iteration count 
 smaller than 
 + user specified loop bound parameter or minimum 
 + profitable iterations (whichever is more conservative).);
 +  return false;
 +}
 +
 
 But to my surprise it does not help.  There are two things:
 
 1) the value of TH is a bit low.  The way the cost model works is that
    it finds the minimal niters where the vectorized loop with all the setup costs
    is cheaper than the scalar loop.  I.e.
 
   /* Calculate number of iterations required to make the vector version
  profitable, relative to the loop bodies only.  The following condition
  must hold true:
   SIC * niters + SOC > VIC * ((niters-PL_ITERS-EP_ITERS)/VF) + VOC   (A)
  where
  SIC = scalar iteration cost, VIC = vector iteration cost,
  VOC = vector outside cost, VF = vectorization factor,
  PL_ITERS = prologue iterations, EP_ITERS= epilogue iterations
  SOC = scalar outside cost for run time cost model check.  */
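 (Rearranging (A) - ignoring the flooring in the division and assuming
 SIC > VIC/VF - gives the break-even iteration count the cost model is
 solving for:

   SIC * niters + SOC > (VIC/VF) * niters - (VIC/VF) * (PL_ITERS + EP_ITERS) + VOC
   niters * (SIC - VIC/VF) > VOC - SOC - (VIC/VF) * (PL_ITERS + EP_ITERS)
   niters > (VOC - SOC - (VIC/VF) * (PL_ITERS + EP_ITERS)) / (SIC - VIC/VF)

 This is only a back-of-the-envelope rearrangement, not the exact code.)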
 
 This value is used for both
 1) decision if number of iterations is too low (max iterations is known)
 2) decision on runtime whether we want to take the vectorized path
 or the scalar path.
 
 The vectoried loop looks like:
   k.1_10 = k;
    if (k.1_10 > 0)
    {
  pretmp_2 = val;
  niters.8_4 = (unsigned int) k.1_10;
  bnd.9_13 = niters.8_4 >> 2;
  ratio_mult_vf.10_1 = bnd.9_13 << 2;
  _18 = niters.8_4 <= 3;
 _19 = ratio_mult_vf.10_1 == 0;
 _20 = _19 | _18;
 if (_20 != 0)
   scalar loop
 else
   vector prologue
   }
 
  So the unvectorized cost is
  SIC * niters
 
  The vectorized path is
  SOC + VIC * ((niters-PL_ITERS-EP_ITERS)/VF) + VOC
  The scalar path of vectorizer loop is
  SIC * niters + SOC

Note that 'th' is used for the runtime profitability check which is
done at the time the setup cost has already been taken (yes, we
probably should make it more conservative but then guard the whole
set of loops by the check, not only the vectorized path).
See PR53355 for the general issue.

It makes sense to vectorize if
SIC * niters > SOC + VIC * ((niters-PL_ITERS-EP_ITERS)/VF) + VOC   (B)
That is, in the optimal case where we actually vectorize, the overall
speed of the vectorized loop including the runtime check is better.
 
It makes sense to take the vector loop if
SIC * niters > VIC * ((niters-PL_ITERS-EP_ITERS)/VF) + VOC (C)
Because the scalar loop is taken.
 
 The attached patch implements the formula (C) and uses it to determine the
 decision based on the number of iterations estimate (that is usually
 provided by the feedback).
 
As a reality check, I tried my testcase.
 
9: Cost model analysis:
  Vector inside of loop cost: 1
  Vector prologue cost: 7
  Vector epilogue cost: 2
  Scalar iteration cost: 1
  Scalar outside cost: 6
  Vector outside cost: 9
  prologue iterations: 0
  epilogue iterations: 2
  Calculated minimum iters for profitability: 4
 
9:   Profitability threshold = 3
 
9:   Profitability estimated iterations threshold = 20
 
 This is overrated.  The loop starts to be beneficial at about 4 iterations in
reality.  I guess the values are kind of wrong.
 
 Vector inside of loop cost and Scalar iteration cost seem to ignore the
 fact that the loops do contain some control flow that should account for at
 least
 one extra cycle.
 
 Vector prologue cost seems a bit overrated for one pack operation.
 
 Of course this is a very simple benchmark; in reality the vectorization can
 be
 a lot more harmful by complicating more complex control flows.

So I guess we have two options
 1) go with the new formula and try to make the cost model a bit more
 realistic.
 2) stay with the original formula that is quite close to 

[PATCH] Fix -frounding-math builtins

2012-10-01 Thread Richard Guenther

I noticed that we attach the no-vops attribute to -frounding-math
math functions.  That's bogus as can be seen from the testcase

int fesetround(int);
double asinh(double x);

double foo (double x, int b)
{
  double y = 0.0, z;
  if (b)
y = asinh (x);
  fesetround (0x400 /*FE_DOWNWARD*/);
  z = asinh (x);
  return y + z;
}

where PRE, rightfully so, removes a seeming partial redundancy by
inserting an asinh call into the else block.  That's because
it exactly does _not_ get to see the rounding-mode-clobbering
fesetround call, as asinh does not have a virtual operand.
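In other words, after PRE the function effectively becomes (a sketch of the
transformed code, not actual compiler output):

  double foo (double x, int b)
  {
    double y = 0.0, z, tem;
    if (b)
      {
        tem = asinh (x);
        y = tem;
      }
    else
      tem = asinh (x);     /* inserted by PRE, before fesetround */
    fesetround (0x400);
    z = tem;               /* the second asinh call is now fully redundant */
    return y + z;
  }

so z is computed under the wrong rounding mode.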

Fixed as follows.

Bootstrapped and tested on x86_64-unknown-linux-gnu.

Richard.

2012-10-01  Richard Guenther  rguent...@suse.de

* builtins.def (ATTR_MATHFN_FPROUNDING): Do not use no-vops
with -frounding-math.
* builtin-attrs.def (ATTR_PURE_NOTHROW_NOVOPS_LIST): Remove.
(ATTR_PURE_NOTHROW_NOVOPS_LEAF_LIST): Likewise.

Index: gcc/builtins.def
===
*** gcc/builtins.def(revision 191917)
--- gcc/builtins.def(working copy)
*** along with GCC; see the file COPYING3.
*** 163,169 
 memory.  */
  #undef ATTR_MATHFN_FPROUNDING
  #define ATTR_MATHFN_FPROUNDING (flag_rounding_math ? \
!   ATTR_PURE_NOTHROW_NOVOPS_LEAF_LIST : ATTR_CONST_NOTHROW_LEAF_LIST)
  
  /* Define an attribute list for math functions that are normally
 impure because some of them may write into global memory for
--- 163,169 
 memory.  */
  #undef ATTR_MATHFN_FPROUNDING
  #define ATTR_MATHFN_FPROUNDING (flag_rounding_math ? \
!   ATTR_PURE_NOTHROW_LEAF_LIST : ATTR_CONST_NOTHROW_LEAF_LIST)
  
  /* Define an attribute list for math functions that are normally
 impure because some of them may write into global memory for
Index: gcc/builtin-attrs.def
===
*** gcc/builtin-attrs.def   (revision 191917)
--- gcc/builtin-attrs.def   (working copy)
*** DEF_ATTR_TREE_LIST (ATTR_PURE_NOTHROW_LI
*** 127,136 
ATTR_NULL, ATTR_NOTHROW_LIST)
  DEF_ATTR_TREE_LIST (ATTR_PURE_NOTHROW_LEAF_LIST, ATTR_PURE,   \
ATTR_NULL, ATTR_NOTHROW_LEAF_LIST)
- DEF_ATTR_TREE_LIST (ATTR_PURE_NOTHROW_NOVOPS_LIST, ATTR_NOVOPS,   \
-   ATTR_NULL, ATTR_PURE_NOTHROW_LIST)
- DEF_ATTR_TREE_LIST (ATTR_PURE_NOTHROW_NOVOPS_LEAF_LIST, ATTR_NOVOPS,\
-   ATTR_NULL, ATTR_PURE_NOTHROW_LEAF_LIST)
  DEF_ATTR_TREE_LIST (ATTR_NORETURN_NOTHROW_LIST, ATTR_NORETURN,\
ATTR_NULL, ATTR_NOTHROW_LIST)
  DEF_ATTR_TREE_LIST (ATTR_NORETURN_NOTHROW_LEAF_LIST, ATTR_NORETURN,\
--- 127,132 


Re: [PATCH] Add option for dumping to stderr (issue6190057)

2012-10-01 Thread Richard Guenther
On Mon, Oct 1, 2012 at 3:55 PM, Sharad Singhai sing...@google.com wrote:
 On Mon, Oct 1, 2012 at 6:52 AM, H.J. Lu hjl.to...@gmail.com wrote:
 On Mon, Oct 1, 2012 at 6:49 AM, Sharad Singhai sing...@google.com wrote:
 I am sorry, I didn't enable all the languages. Will fix the fortran
 test breakage shortly.

 It is not just Fortran.  There are some failures in C testcases.

 I checked and those files looked like generator files for Fortran
 tests and thus were not exercised in my configuration. I am really
 sorry about that. I am fixing it.

As I said, you should not enable/disable anything special but
configure with all default languages enabled (no --enable-languages)
and do toplevel make -k check, preferably also exercising
multilibs with RUNTESTFLAGS=--target_board=unix/\{,-m32\}

Richard.

 UNSUPPORTED: gcc.dg/tree-ssa/gen-vect-11.c
 UNSUPPORTED: gcc.dg/tree-ssa/gen-vect-11a.c
 UNSUPPORTED: gcc.dg/tree-ssa/gen-vect-11b.c
 UNSUPPORTED: gcc.dg/tree-ssa/gen-vect-11c.c
 UNSUPPORTED: gcc.dg/tree-ssa/gen-vect-2.c
 UNSUPPORTED: gcc.dg/tree-ssa/gen-vect-25.c
 UNSUPPORTED: gcc.dg/tree-ssa/gen-vect-26.c
 UNSUPPORTED: gcc.dg/tree-ssa/gen-vect-28.c
 UNSUPPORTED: gcc.dg/tree-ssa/gen-vect-32.c

 Thanks,
 Sharad


 Thanks,
 Sharad
 Sharad


 On Mon, Oct 1, 2012 at 4:50 AM, H.J. Lu hjl.to...@gmail.com wrote:
 On Sun, Sep 30, 2012 at 11:36 PM, Sharad Singhai sing...@google.com 
 wrote:
 Resend to gcc-patches

 I have addressed the comments by fixing all the minor issues,
 bootstrapped and tested on x86_64. I did the recommended reshuffling
 by moving non-tree code from tree-dump.c into a new file dumpfile.c.

 I committed two successive revisions:
 r191883: the main patch with the dump infrastructure changes. However, I
 accidentally left out a new file, dumpfile.c.
 r191884: added dumpfile.c, and did the renaming of dump_* functions
 from gimple_pretty_print.[ch].

 As things stand right now, r191883 is broken because of the missing
 file 'dumpfile.c', which the very next commit fixes. Anyone who got
 the broken revision r191883, please svn update. I am really very sorry
 about that.

 I have a couple more minor patches which deal with renaming; I plan to
 address those later.


 It caused:

 FAIL: gcc.dg/tree-ssa/gen-vect-11.c (test for excess errors)
 FAIL: gcc.dg/tree-ssa/gen-vect-11.c scan-tree-dump-times vect
 vectorized 1 loops 1
 FAIL: gcc.dg/tree-ssa/gen-vect-11a.c (test for excess errors)
 FAIL: gcc.dg/tree-ssa/gen-vect-11a.c scan-tree-dump-times vect
 vectorized 1 loops 1
 FAIL: gcc.dg/tree-ssa/gen-vect-11b.c (test for excess errors)
 FAIL: gcc.dg/tree-ssa/gen-vect-11b.c scan-tree-dump-times vect
 vectorized 0 loops 1
 FAIL: gcc.dg/tree-ssa/gen-vect-11c.c (test for excess errors)
 FAIL: gcc.dg/tree-ssa/gen-vect-11c.c scan-tree-dump-times vect
 vectorized 0 loops 1
 FAIL: gcc.dg/tree-ssa/gen-vect-2.c (test for excess errors)
 FAIL: gcc.dg/tree-ssa/gen-vect-2.c scan-tree-dump-times vect
 vectorized 1 loops 1
 FAIL: gcc.dg/tree-ssa/gen-vect-25.c (test for excess errors)
 FAIL: gcc.dg/tree-ssa/gen-vect-25.c scan-tree-dump-times vect
 vectorized 2 loops 1
 FAIL: gcc.dg/tree-ssa/gen-vect-26.c (test for excess errors)
 FAIL: gcc.dg/tree-ssa/gen-vect-26.c scan-tree-dump-times vect
 Alignment of access forced using peeling 1
 FAIL: gcc.dg/tree-ssa/gen-vect-26.c scan-tree-dump-times vect
 vectorized 1 loops 1
 FAIL: gcc.dg/tree-ssa/gen-vect-28.c (test for excess errors)
 FAIL: gcc.dg/tree-ssa/gen-vect-28.c scan-tree-dump-times vect
 Alignment of access forced using peeling 1
 FAIL: gcc.dg/tree-ssa/gen-vect-28.c scan-tree-dump-times vect
 vectorized 1 loops 1
 FAIL: gcc.dg/tree-ssa/gen-vect-32.c (test for excess errors)
 FAIL: gcc.dg/tree-ssa/gen-vect-32.c scan-tree-dump-times vect
 vectorized 1 loops 1
 FAIL: gfortran.dg/vect/O3-pr36119.f90 (test for excess errors)
 FAIL: gfortran.dg/vect/O3-pr39595.f (test for excess errors)
 FAIL: gfortran.dg/vect/Ofast-pr50414.f90 (test for excess errors)
 FAIL: gfortran.dg/vect/cost-model-pr34445.f (test for excess errors)
 FAIL: gfortran.dg/vect/cost-model-pr34445a.f (test for excess errors)
 FAIL: gfortran.dg/vect/fast-math-pr38968.f90 (test for excess errors)
 FAIL: gfortran.dg/vect/fast-math-pr38968.f90 scan-tree-dump vect
 vectorized 1 loops
 FAIL: gfortran.dg/vect/fast-math-real8-pr40801.f90 (test for excess errors)
 FAIL: gfortran.dg/vect/fast-math-real8-pr40801.f90 (test for excess errors)
 FAIL: gfortran.dg/vect/fast-math-vect-8.f90 (test for excess errors)
 FAIL: gfortran.dg/vect/fast-math-vect-8.f90 scan-tree-dump-times vect
 vectorized 1 loops 1
 FAIL: gfortran.dg/vect/no-fre-no-copy-prop-O3-pr51704.f90 (test for
 excess errors)
 FAIL: gfortran.dg/vect/no-vfa-pr32377.f90 (test for excess errors)
 FAIL: gfortran.dg/vect/no-vfa-pr32377.f90 scan-tree-dump-times vect
 vectorized 2 loops 1
 FAIL: gfortran.dg/vect/no-vfa-pr32457.f90 (test for excess errors)
 FAIL: gfortran.dg/vect/no-vfa-pr32457.f90 scan-tree-dump-times vect
 vectorized 0 loops 1
 FAIL: gfortran.dg/vect/pr19049.f90  

Re: RFC: LRA for x86/x86-64 [0/9]

2012-09-30 Thread Richard Guenther
On Sun, Sep 30, 2012 at 6:52 PM, Steven Bosscher stevenb@gmail.com wrote:
 Hi,


 To look at it in yet another way:

  integrated RA           : 189.34 (16%) usr
  LRA non-specific        :  59.82 ( 5%) usr
  LRA virtuals elimination:  56.79 ( 5%) usr
  LRA create live ranges  : 175.30 (15%) usr
  LRA hard reg assignment : 130.85 (11%) usr

 The IRA pass is slower than the next-slowest pass (tree PRE) by almost
 a factor of 2.5.  Each of the individually-measured *phases* of LRA is
 slower than the complete IRA *pass*.  These 5 timevars together make up
 52% of all compile time.

That figure indeed makes IRA + LRA look bad.  Did you by chance identify
anything obvious that can be done to improve the situation?

Thanks,
Richard.

 IRA already has scalability problems; let's not add more of that with LRA.

 Ciao!
 Steven


Re: RFC: LRA for x86/x86-64 [0/9]

2012-09-28 Thread Richard Guenther
On Fri, Sep 28, 2012 at 12:56 AM, Vladimir Makarov vmaka...@redhat.com wrote:
   Originally I was going to submit LRA at the very beginning of stage1 for
 gcc4.9, as discussed at this summer's GNU Tools Cauldron.  After
 some thinking, I've decided to submit LRA now, but switched on only for the
 *x86/x86-64* target.  The reasons for that are:
   o I am already pretty confident in LRA for this target in terms of
 reliability, performance, code size, and compiler speed.
   o I am confident that I can fix LRA bugs and pitfalls that might be
 recognized and reported during stage2 and 3 of gcc4.8.
   o Wider LRA testing for x86/x86-64 will smooth the hard transition of
 other targets to LRA during gcc4.9 development.

   During development of gcc4.9, I'd like to switch major targets to
 LRA as was planned before.  I hope that all targets will be
 switched for the next release after gcc4.9 (although that will depend
 mostly on the target maintainers).  When/if that is done,
 reload and the reload-oriented machine-dependent code can be removed.

   LRA project was reported on 2012 GNU Tools Cauldron
 (http://gcc.gnu.org/wiki/cauldron2012).  The presentation contains a
 high-level description of LRA and the project status.

   The following patches make LRA work for x86/x86-64.  Individually the
 patches mostly do nothing until the last patch switches on LRA for
 x86/x86-64.  Although the compiler bootstraps after applying each
 patch in the given order, the division is only for review convenience.

   Any comments and proposals are appreciated.  Even if the GCC community
 decides that it is too late to submit it for gcc4.8, earlier reviews
 are always useful.

From a release-manager point of view the patch is in time for 4.8, in that
it is during stage1 (which I expect to last another two to four weeks).  Note
that there is no such thing as stage2 anymore; we go straight to
stage3 (bugfixing mode, no new features) from stage1.  After three months
of stage3 we go into regression-fixes-only mode for as long as there are
release-blocking bugs (regressions with priority P1).  You will have roughly
half a year to fix LRA for 4.8.0 after stage1 closes.

Thanks,
Richard.

   The patches were successfully bootstrapped and tested for x86/x86-64.



Re: vec_cond_expr adjustments

2012-09-28 Thread Richard Guenther
On Fri, Sep 28, 2012 at 12:42 AM, Marc Glisse marc.gli...@inria.fr wrote:
 Hello,

 I have been experimenting with generating VEC_COND_EXPR from the front-end,
 and these are just a couple of things I noticed.

 1) optabs.c requires that the first argument of vec_cond_expr be a
 comparison, but verify_gimple_assign_ternary only checks is_gimple_condexpr,
 like for COND_EXPR. In the long term, it seems better to also allow ssa_name
 and vector_cst (thus match the is_gimple_condexpr condition), but for now I
 just want to know early if I created an invalid vec_cond_expr.

optabs should be fixed instead; an is_gimple_val condition is implicitly
val != 0.
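
As a rough illustration of that reading (a sketch with GCC's generic vector
extensions, not code from the patch), a plain mask value used as a vector
condition behaves like an implicit comparison against zero, which is the
comparison optabs would have to synthesize when the VEC_COND_EXPR condition
is an SSA_NAME rather than a comparison:

typedef int v4si __attribute__ ((vector_size (16)));

/* Equivalent in effect to VEC_COND_EXPR <mask != 0, d, e>.  */
v4si
select_by_mask (v4si mask, v4si d, v4si e)
{
  v4si zero = { 0, 0, 0, 0 };
  v4si cmp = mask != zero;        /* per-element 0 / -1 mask */
  return (d & cmp) | (e & ~cmp);  /* open-coded element-wise select */
}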

 2) a little refactoring of the code to find a suitable vector type for
 comparison results, and one more place where it should be used (no testcase
 yet because I don't know if that path can be taken without front-end changes
 first).

Yes, it looks fine to me.

 I did wonder, for tree-ssa-forwprop, about using TREE_TYPE (cond)
 directly, without truth_type_for.

Yes, that should work.

 Hmm, now I am wondering whether I should have waited until I had front-end
 vec_cond_expr support to submit everything at once...

;)

The tree.[ch] and gimple-fold.c hunks are ok if tested properly, as is the
tree-ssa-forwprop.c idea of using TREE_TYPE (cond).

I don't like the tree-cfg.c change; instead, re-factor optabs.c to
get a decomposed cond for vector_compare_rtx and appropriately
decompose a non-comparison-class cond in expand_vec_cond_expr.

If we, for example, have

 predicate = a < b;
 x = predicate ? d : e;
 y = predicate ? f : g;

we ideally want to re-use the predicate computation on targets where
that would be optimal (and combine should be able to recover the
case where it is not).
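
Concretely, the shape above looks like this with GCC's generic vector
extensions (names invented; the open-coded selects merely stand in for the
two VEC_COND_EXPRs):

typedef int v4si __attribute__ ((vector_size (16)));

void
two_selects (v4si a, v4si b, v4si d, v4si e, v4si f, v4si g,
             v4si *x, v4si *y)
{
  v4si predicate = a < b;                    /* computed once, 0 / -1 per lane */
  *x = (d & predicate) | (e & ~predicate);   /* x = predicate ? d : e */
  *y = (f & predicate) | (g & ~predicate);   /* y = predicate ? f : g */
}

On a target whose vec_cond pattern consumes a mask register, both selects
can then reuse the same compare result instead of re-emitting the comparison.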

Thanks,
Richard.

 2012-09-27  Marc Glisse  marc.gli...@inria.fr

 * tree-cfg.c (verify_gimple_assign_ternary): Stricter check on
 first argument of VEC_COND_EXPR.
 * tree.c (truth_type_for): New function.
 * tree.h (truth_type_for): Declare.
 * gimple-fold.c (and_comparisons_1): Call it.
 (or_comparisons_1): Likewise.
 * tree-ssa-forwprop.c (forward_propagate_into_cond): Likewise.

 --
 Marc Glisse
 Index: gcc/tree-ssa-forwprop.c
 ===================================================================
 --- gcc/tree-ssa-forwprop.c (revision 191810)
 +++ gcc/tree-ssa-forwprop.c (working copy)
 @@ -549,21 +549,22 @@ static bool
  forward_propagate_into_cond (gimple_stmt_iterator *gsi_p)
  {
gimple stmt = gsi_stmt (*gsi_p);
tree tmp = NULL_TREE;
tree cond = gimple_assign_rhs1 (stmt);
bool swap = false;

/* We can do tree combining on SSA_NAME and comparison expressions.  */
if (COMPARISON_CLASS_P (cond))
  tmp = forward_propagate_into_comparison_1 (stmt, TREE_CODE (cond),
 -  boolean_type_node,
 +  truth_type_for
 +(TREE_TYPE (cond)),
TREE_OPERAND (cond, 0),
TREE_OPERAND (cond, 1));
else if (TREE_CODE (cond) == SSA_NAME)
  {
enum tree_code code;
tree name = cond;
gimple def_stmt = get_prop_source_stmt (name, true, NULL);
if (!def_stmt || !can_propagate_from (def_stmt))
 return 0;

 Index: gcc/tree-cfg.c
 ===================================================================
 --- gcc/tree-cfg.c  (revision 191810)
 +++ gcc/tree-cfg.c  (working copy)
 @@ -3758,22 +3758,24 @@ verify_gimple_assign_ternary (gimple stm
tree rhs2_type = TREE_TYPE (rhs2);
tree rhs3 = gimple_assign_rhs3 (stmt);
tree rhs3_type = TREE_TYPE (rhs3);

if (!is_gimple_reg (lhs))
  {
error ("non-register as LHS of ternary operation");
return true;
  }

 -  if (((rhs_code == VEC_COND_EXPR || rhs_code == COND_EXPR)
 -   ? !is_gimple_condexpr (rhs1) : !is_gimple_val (rhs1))
 +  if (((rhs_code == COND_EXPR) ? !is_gimple_condexpr (rhs1)
 +   : (rhs_code == VEC_COND_EXPR) ? (!is_gimple_condexpr (rhs1)
 +   || is_gimple_val (rhs1))
 +   : !is_gimple_val (rhs1))
|| !is_gimple_val (rhs2)
|| !is_gimple_val (rhs3))
  {
error ("invalid operands in ternary operation");
return true;
  }

/* First handle operations that involve different types.  */
switch (rhs_code)
  {
 Index: gcc/gimple-fold.c
 ===================================================================
 --- gcc/gimple-fold.c   (revision 191810)
 +++ gcc/gimple-fold.c   (working copy)
 @@ -23,21 +23,20 @@ along with GCC; see the file COPYING3.
  #include "coretypes.h"
  #include "tm.h"
  #include "tree.h"
  #include "flags.h"
  #include "function.h"
  #include "dumpfile.h"
  #include "tree-flow.h"
  #include "tree-ssa-propagate.h"
  #include "target.h"
  #include "gimple-fold.h"
 -#include 

[PATCH][LTO] properly stream PARM_DECLs

2012-09-28 Thread Richard Guenther

This avoids streaming PARM_DECLs both in the global type/decl section and
in the local function sections.  They are needed at the global level,
so properly stream them there.  This also fixes part of the issues we have
with debug information for early inlining.

LTO Bootstrapped and tested on x86_64-unknown-linux-gnu, applied.

Richard.

2012-09-28  Richard Guenther  rguent...@suse.de

PR lto/47799
* lto-streamer-out.c (tree_is_indexable): Make PARM_DECLs global.
(lto_output_tree_ref): Handle references to them.
(output_function): Do not output function arguments again.
* lto-streamer-in.c (input_function): Do not input arguments
again, nor overwrite them.
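
For readers unfamiliar with the streamer, a tiny stand-alone sketch (plain
C, not GCC code, all names invented) of the difference between writing a
decl inline into each function section and writing it once into a global
table and referring to it by index, which is what making PARM_DECLs
indexable amounts to:

#include <stdio.h>
#include <string.h>

/* Toy global decl table: indexable decls are emitted once and then
   referenced by position, so every section that mentions them resolves
   to the very same decl after reading.  */
static const char *globals[16];
static int n_globals;

static int
index_of (const char *decl)
{
  for (int i = 0; i < n_globals; i++)
    if (strcmp (globals[i], decl) == 0)
      return i;
  globals[n_globals] = decl;
  return n_globals++;
}

static void
stream_ref (const char *section, const char *decl, int indexable)
{
  if (indexable)
    printf ("%s: ref #%d -> %s\n", section, index_of (decl), decl);
  else
    printf ("%s: inline copy of %s\n", section, decl);  /* old behaviour */
}

int
main (void)
{
  /* Both the body of f and an early-inlined copy in g mention the same
     PARM_DECL; by-index streaming keeps them identical.  */
  stream_ref ("section f", "PARM_DECL x of f", 1);
  stream_ref ("section g", "PARM_DECL x of f", 1);
  return 0;
}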

Index: gcc/lto-streamer-out.c
===================================================================
*** gcc/lto-streamer-out.c.orig 2012-09-26 16:47:18.0 +0200
--- gcc/lto-streamer-out.c  2012-09-28 11:18:35.438055184 +0200
*** static bool
*** 125,131 
  tree_is_indexable (tree t)
  {
if (TREE_CODE (t) == PARM_DECL)
! return false;
else if (TREE_CODE (t) == VAR_DECL && decl_function_context (t)
	 && !TREE_STATIC (t))
  return false;
--- 125,131 
  tree_is_indexable (tree t)
  {
if (TREE_CODE (t) == PARM_DECL)
! return true;
else if (TREE_CODE (t) == VAR_DECL && decl_function_context (t)
	 && !TREE_STATIC (t))
  return false;
*** lto_output_tree_ref (struct output_block
*** 237,242 
--- 237,243 
  case VAR_DECL:
  case DEBUG_EXPR_DECL:
gcc_assert (decl_function_context (expr) == NULL || TREE_STATIC (expr));
+ case PARM_DECL:
streamer_write_record_start (ob, LTO_global_decl_ref);
lto_output_var_decl_index (ob->decl_state, ob->main_stream, expr);
break;
*** output_function (struct cgraph_node *nod
*** 806,814 
  
output_struct_function_base (ob, fn);
  
-   /* Output the head of the arguments list.  */
-   stream_write_tree (ob, DECL_ARGUMENTS (function), true);
- 
/* Output all the SSA names used in the function.  */
output_ssa_names (ob, fn);
  
--- 807,812 
Index: gcc/lto-streamer-in.c
===================================================================
*** gcc/lto-streamer-in.c.orig  2012-09-21 10:59:45.0 +0200
--- gcc/lto-streamer-in.c   2012-09-28 11:14:53.835068419 +0200
*** input_function (tree fn_decl, struct dat
*** 823,829 
gimple *stmts;
basic_block bb;
struct cgraph_node *node;
-   tree args, narg, oarg;
  
fn = DECL_STRUCT_FUNCTION (fn_decl);
tag = streamer_read_record_start (ib);
--- 823,828 
*** input_function (tree fn_decl, struct dat
*** 834,855 
  
input_struct_function_base (fn, data_in, ib);
  
-   /* Read all function arguments.  We need to re-map them here to the
-  arguments of the merged function declaration.  */
-   args = stream_read_tree (ib, data_in);
-   for (oarg = args, narg = DECL_ARGUMENTS (fn_decl);
-	oarg && narg;
-oarg = TREE_CHAIN (oarg), narg = TREE_CHAIN (narg))
- {
-   unsigned ix;
-   bool res;
-   res = streamer_tree_cache_lookup (data_in->reader_cache, oarg, &ix);
-   gcc_assert (res);
-   /* Replace the argument in the streamer cache.  */
-   streamer_tree_cache_insert_at (data_in->reader_cache, narg, ix);
- }
-   gcc_assert (!oarg && !narg);
- 
/* Read all the SSA names.  */
input_ssa_names (ib, data_in, fn);
  
--- 833,838 

