Re: RFA: Speedup expand_used_vars by 30 times (PR38474)

2012-07-03 Thread Michael Matz
Hi,

On Mon, 2 Jul 2012, Mike Stump wrote:

 On May 26, 2012, at 8:03 PM, Michael Matz wrote:
  ---
  PR middle-end/38474
  * cfgexpand.c (struct stack_var): Add slot_type member.
  (add_stack_var): Initialize it.
  (add_alias_set_conflicts): Remove.
  (merge_stack_vars_p, more_specific_type): New functions.
  (nonconflicting_type): New static data.
  (union_stack_vars): Use more_specific_type to update slot_type.
  (partition_stack_vars): Call merge_stack_vars_p ...
 
 So this patch seems to have killed 
 'RUNTESTFLAGS=dg.exp=class_array_3.f03' check-fortran for me.  Any 
 insight?

Hmm, no.  Works for me just fine :-/  So something is shared on the stack, 
but then reordered so that both entities are live.  If you find which file 
is miscompiled you can look at the -fdump-rtl-expand-details dumps for 
entries like:
  Partition 1: size 32 align 16
dest
  Partition 0: size 32 align 16
src

If something is shared there will be multiple decls mentioned in the 
lines.  (Ignore the lines mentioning Partition without a ':', that the 
out-of-ssa partitions, not the stack slots).

 Unfortunately the bug I think manifests as generated code in 
 the fortran runtime libraries, so I need to go track down the code.
 
 Also, git-svn-id: svn+ssh://gcc.gnu.org/svn/gcc/trunk@188667 seems to 
 not match this patch, nor any other patch I can find, but it is the 
 closest match.

The patch that went in is 
  http://gcc.gnu.org/ml/gcc-patches/2012-06/msg00367.html


Ciao,
Michael.


Re: RFA: Speedup expand_used_vars by 30 times (PR38474)

2012-07-02 Thread Mike Stump
On May 26, 2012, at 8:03 PM, Michael Matz wrote:
 ---
   PR middle-end/38474
   * cfgexpand.c (struct stack_var): Add slot_type member.
   (add_stack_var): Initialize it.
   (add_alias_set_conflicts): Remove.
   (merge_stack_vars_p, more_specific_type): New functions.
   (nonconflicting_type): New static data.
   (union_stack_vars): Use more_specific_type to update slot_type.
   (partition_stack_vars): Call merge_stack_vars_p ...

So this patch seems to have killed 'RUNTESTFLAGS=dg.exp=class_array_3.f03' 
check-fortran for me.  Any insight?  Unfortunately the bug I think manifests as 
generated code in the fortran runtime libraries, so I need to go track down the 
code.

Also, git-svn-id: svn+ssh://gcc.gnu.org/svn/gcc/trunk@188667 seems to not match 
this patch, nor any other patch I can find, but it is the closest match.


Re: RFA: Speedup expand_used_vars by 30 times (PR38474)

2012-05-29 Thread Richard Guenther
On Sun, May 27, 2012 at 5:03 AM, Michael Matz m...@suse.de wrote:
 Hi,

 [for certain test cases :-) ]

 the temp slot cleanups I just sent where actually motivated by PR38474.
 It exposes many slownesses in the compiler, but at -O0 the only remaining
 one is the expand phase.  expanding variables is the one thing (on my
 machine here, with -O0 -g f951): 30 seconds for the reduced testcase.
 expand itself (the statements) then needs 5.5 seconds and is the other
 thing.  Overall 77 seconds compile time.

 The problem with expanding variables in this case is the
 add_alias_set_conflicts routine.  It walks all NxN/2 stack variable pairs,
 looks if pair (i,j) can share the same stack slot from a type perspective
 (canshare(i,j)) and adds a conflict if not so.  Before asking that
 question it doesn't look if pair (i,j) maybe already conflict for other
 reasons.

 Now, we could simply add the conflicts(i,j) test in that loop, before
 asking canshare(i,j) and adding a new conflict.  But adding conflicts
 comes with a cost, so we can do better.  The answer to canshare(i,j)
 depends on only all components already merged into [i] and the type of j.
 With carefully choosing a type to represent all components of [i] and
 updating it in union_stack_vars we can simply ask canshare(i,j) right
 before we actually want to merge j into [i].  That's the least amount of
 work possible as long as we want to have the same results.

 This change brings the 30 seconds for var expansion down to 0.8 seconds.

 The other change in function.c further improves the temp slot machinery.
 While expanding free_temp_slots is called after each statement, and if it
 is able to free a slot it needs to update the RTX-slot mapping (a
 htab_t).  The freeing of slots was done incorrectly, it overwrote the slot
 in question with NULL, but it should have used htab_clear_slot.

That looks like a bugfix applicable to branches?

 This
 meant that the hashtable kept growing and growing even with all elements
 being empty because the size statistics aren't correct.  Additionally it
 breaks hash chains if there is one striding the nullified slot.

 And then a microoptimization: if we know there aren't any slots in use at
 all (happens often in normal circumstances) we can as well use htab_empty
 instead of traversing the hash table for the right slots.

 This brings the expansion time (i.e. for expanding the statements) down
 from 5.5 to 2.8 seconds.  All changes together reduce the compile time
 for the reduced testcase from ~ 77 seconds to ~ 44.

 A variant of this regstrapped on x86_64-linux, all languages+Ada+objc++,
 but after some changes I'm redoing that regstrap.  Okay for trunk if that
 passes?

The volatile handling is very odd - the objects_must_conflict_p code basically
says that two volatile vars may always share stack slots?!  What's the reasoning
for this, or handling volatiles special in any way?  I'd rather remove
this fishy
code (and if at all, ensure that volatile automatics are never coalesced with
any other stack slot).

Thanks,
Richard.


 Ciao,
 Michael.
 ---
        PR middle-end/38474
        * cfgexpand.c (struct stack_var): Add slot_type member.
        (add_stack_var): Initialize it.
        (add_alias_set_conflicts): Remove.
        (merge_stack_vars_p, more_specific_type): New functions.
        (nonconflicting_type): New static data.
        (union_stack_vars): Use more_specific_type to update slot_type.
        (partition_stack_vars): Call merge_stack_vars_p ...
        (expand_used_vars): ... instead of add_alias_set_conflicts.
        (fini_vars_expansion): Clear nonconflicting_type.
        * function.c (n_temp_slots_in_use): New static data.
        (make_slot_available, assign_stack_temp_for_type): Update it.
        (init_temp_slots): Zero it.
        (remove_unused_temp_slot_addresses): Use it for quicker removal.
        (remove_unused_temp_slot_addresses_1): Use htab_clear_slot.

 Index: cfgexpand.c
 ===
 --- cfgexpand.c.orig    2012-05-27 01:33:55.0 +0200
 +++ cfgexpand.c 2012-05-27 04:37:18.0 +0200
 @@ -177,6 +177,11 @@ struct stack_var
   /* The next stack variable in the partition, or EOC.  */
   size_t next;

 +  /* The most specific type of all variables merged with the slot
 +     if this is a representative.  Initially this is simply the
 +     TREE_TYPE for the variable.  */
 +  tree slot_type;
 +
   /* The numbers of conflicting stack variables.  */
   bitmap conflicts;
  };
 @@ -285,6 +290,7 @@ add_stack_var (tree decl)
   /* All variables are initially in their own partition.  */
   v-representative = stack_vars_num;
   v-next = EOC;
 +  v-slot_type = TREE_TYPE (decl);

   /* All variables initially conflict with no other.  */
   v-conflicts = NULL;
 @@ -353,45 +359,40 @@ aggregate_contains_union_type (tree type
   return false;
  }

 -/* A subroutine of expand_used_vars.  If two variables X and Y have 

Re: RFA: Speedup expand_used_vars by 30 times (PR38474)

2012-05-29 Thread Michael Matz
Hi,

On Tue, 29 May 2012, Richard Guenther wrote:

  The other change in function.c further improves the temp slot 
  machinery. While expanding free_temp_slots is called after each 
  statement, and if it is able to free a slot it needs to update the 
  RTX-slot mapping (a htab_t).  The freeing of slots was done 
  incorrectly, it overwrote the slot in question with NULL, but it 
  should have used htab_clear_slot.
 
 That looks like a bugfix applicable to branches?

Perhaps, though the results of this bug are only worse code generation.  
When the chains are broken due to the bug some RTX-slot mappings aren't 
found anymore, but the code deals conservatively with that.

  A variant of this regstrapped on x86_64-linux, all 
  languages+Ada+objc++, but after some changes I'm redoing that 
  regstrap.  Okay for trunk if that passes?
 
 The volatile handling is very odd - the objects_must_conflict_p code 
 basically says that two volatile vars may always share stack slots?!  

That's correct.  Basically everything can share a stack slot that 
conflicts already for other reasons than address-based considerations.  
Two volatile accesses always conflict and hence they'll never be swapped 
(which is what creates the problems with stack slot sharing).

 What's the reasoning for this, or handling volatiles special in any way?  
 I'd rather remove this fishy code (and if at all, ensure that volatile 
 automatics are never coalesced with any other stack slot).

Volatility is no reason for not sharing stack slots, it's not fishy code.


Ciao,
Michael.

RFA: Speedup expand_used_vars by 30 times (PR38474)

2012-05-26 Thread Michael Matz
Hi,

[for certain test cases :-) ]

the temp slot cleanups I just sent where actually motivated by PR38474.  
It exposes many slownesses in the compiler, but at -O0 the only remaining 
one is the expand phase.  expanding variables is the one thing (on my 
machine here, with -O0 -g f951): 30 seconds for the reduced testcase.  
expand itself (the statements) then needs 5.5 seconds and is the other 
thing.  Overall 77 seconds compile time.

The problem with expanding variables in this case is the 
add_alias_set_conflicts routine.  It walks all NxN/2 stack variable pairs, 
looks if pair (i,j) can share the same stack slot from a type perspective 
(canshare(i,j)) and adds a conflict if not so.  Before asking that 
question it doesn't look if pair (i,j) maybe already conflict for other 
reasons.

Now, we could simply add the conflicts(i,j) test in that loop, before 
asking canshare(i,j) and adding a new conflict.  But adding conflicts 
comes with a cost, so we can do better.  The answer to canshare(i,j) 
depends on only all components already merged into [i] and the type of j.  
With carefully choosing a type to represent all components of [i] and 
updating it in union_stack_vars we can simply ask canshare(i,j) right 
before we actually want to merge j into [i].  That's the least amount of 
work possible as long as we want to have the same results.

This change brings the 30 seconds for var expansion down to 0.8 seconds.

The other change in function.c further improves the temp slot machinery.  
While expanding free_temp_slots is called after each statement, and if it 
is able to free a slot it needs to update the RTX-slot mapping (a 
htab_t).  The freeing of slots was done incorrectly, it overwrote the slot 
in question with NULL, but it should have used htab_clear_slot.  This 
meant that the hashtable kept growing and growing even with all elements 
being empty because the size statistics aren't correct.  Additionally it 
breaks hash chains if there is one striding the nullified slot.

And then a microoptimization: if we know there aren't any slots in use at 
all (happens often in normal circumstances) we can as well use htab_empty 
instead of traversing the hash table for the right slots.

This brings the expansion time (i.e. for expanding the statements) down 
from 5.5 to 2.8 seconds.  All changes together reduce the compile time 
for the reduced testcase from ~ 77 seconds to ~ 44.

A variant of this regstrapped on x86_64-linux, all languages+Ada+objc++, 
but after some changes I'm redoing that regstrap.  Okay for trunk if that 
passes?


Ciao,
Michael.
---
PR middle-end/38474
* cfgexpand.c (struct stack_var): Add slot_type member.
(add_stack_var): Initialize it.
(add_alias_set_conflicts): Remove.
(merge_stack_vars_p, more_specific_type): New functions.
(nonconflicting_type): New static data.
(union_stack_vars): Use more_specific_type to update slot_type.
(partition_stack_vars): Call merge_stack_vars_p ...
(expand_used_vars): ... instead of add_alias_set_conflicts.
(fini_vars_expansion): Clear nonconflicting_type.
* function.c (n_temp_slots_in_use): New static data.
(make_slot_available, assign_stack_temp_for_type): Update it.
(init_temp_slots): Zero it.
(remove_unused_temp_slot_addresses): Use it for quicker removal.
(remove_unused_temp_slot_addresses_1): Use htab_clear_slot.

Index: cfgexpand.c
===
--- cfgexpand.c.orig2012-05-27 01:33:55.0 +0200
+++ cfgexpand.c 2012-05-27 04:37:18.0 +0200
@@ -177,6 +177,11 @@ struct stack_var
   /* The next stack variable in the partition, or EOC.  */
   size_t next;
 
+  /* The most specific type of all variables merged with the slot
+ if this is a representative.  Initially this is simply the
+ TREE_TYPE for the variable.  */
+  tree slot_type;
+
   /* The numbers of conflicting stack variables.  */
   bitmap conflicts;
 };
@@ -285,6 +290,7 @@ add_stack_var (tree decl)
   /* All variables are initially in their own partition.  */
   v-representative = stack_vars_num;
   v-next = EOC;
+  v-slot_type = TREE_TYPE (decl);
 
   /* All variables initially conflict with no other.  */
   v-conflicts = NULL;
@@ -353,45 +359,40 @@ aggregate_contains_union_type (tree type
   return false;
 }
 
-/* A subroutine of expand_used_vars.  If two variables X and Y have alias
-   sets that do not conflict, then do add a conflict for these variables
-   in the interference graph.  We also need to make sure to add conflicts
-   for union containing structures.  Else RTL alias analysis comes along
-   and due to type based aliasing rules decides that for two overlapping
-   union temporaries { short s; int i; } accesses to the same mem through
-   different types may not alias and happily reorders stores across
-   life-time boundaries of the temporaries (See