On Tue, Jul 12, 2011 at 2:12 PM, Tom de Vries <vr...@codesourcery.com> wrote:
> Hi Richard,
>
> here's a new version of the pass. I attempted to address your comments as
> much as possible. The pass was bootstrapped and reg-tested on x86_64.
>
> On 06/14/2011 05:01 PM, Richard Guenther wrote:
>> On Fri, Jun 10, 2011 at 6:54 PM, Tom de Vries <vr...@codesourcery.com> wrote:
>>> Hi Richard,
>>>
>>> thanks for the review.
>>>
>>> On 06/08/2011 11:55 AM, Richard Guenther wrote:
>>>> On Wed, Jun 8, 2011 at 11:42 AM, Tom de Vries <vr...@codesourcery.com>
>>>> wrote:
>>>>> Hi Richard,
>>>>>
>>>>> I have a patch for PR43864. The patch adds a gimple-level duplicate
>>>>> block cleanup. The patch has been bootstrapped and reg-tested on
>>>>> x86_64, and reg-tested on ARM. The size impact on ARM for spec2000 is
>>>>> shown in the following table (%, lower is better).
>>>>>
>>>>>                   none             pic
>>>>>               thumb1  thumb2  thumb1  thumb2
>>>>> spec2000      99.9    99.9    99.8    99.8
>>>>>
>>>>> PR43864 is currently marked as a duplicate of PR20070, but I'm not sure
>>>>> that the optimizations proposed in PR20070 would fix this PR.
>>>>>
>>>>> The problem in this PR is that when compiling with -O2, the example
>>>>> below should only have one call to free. The original problem is
>>>>> formulated in terms of -Os; currently we do generate one call to free
>>>>> with -Os, although still not the smallest code possible. I'll show
>>>>> here the -O2 case, since that's similar to the original PR.
>>>>>
>>>
>>> Example A. (naming it for reference below)
>>>
>>>>> #include <stdio.h>
>>>>> void foo (char*, FILE*);
>>>>> char* hprofStartupp(char *outputFileName, char *ctx)
>>>>> {
>>>>>   char fileName[1000];
>>>>>   FILE *fp;
>>>>>   sprintf(fileName, outputFileName);
>>>>>   if (access(fileName, 1) == 0) {
>>>>>     free(ctx);
>>>>>     return 0;
>>>>>   }
>>>>>
>>>>>   fp = fopen(fileName, 0);
>>>>>   if (fp == 0) {
>>>>>     free(ctx);
>>>>>     return 0;
>>>>>   }
>>>>>
>>>>>   foo(outputFileName, fp);
>>>>>
>>>>>   return ctx;
>>>>> }
>>>>>
>>>>> AFAIU, there are 2 complementary methods of rtl optimization proposed
>>>>> in PR20070:
>>>>> - Merging 2 blocks which are identical except for input registers, by
>>>>>   using a conditional move to choose between the different input
>>>>>   registers.
>>>>> - Merging 2 blocks which have different local registers, by ignoring
>>>>>   those differences.
>>>>>
>>>>> Blocks .L6 and .L7 have no difference in local registers, but they
>>>>> have a difference in input registers: r3 and r1. Replacing the move to
>>>>> r5 by a conditional move would probably be beneficial in terms of
>>>>> size, but it's not clear what condition the conditional move should be
>>>>> using. Calculating such a condition would add in size and lengthen the
>>>>> execution path.
>>>>>
>>>>> gcc -O2 -march=armv7-a -mthumb pr43864.c -S:
>>>>> ...
>>>>>         push    {r4, r5, lr}
>>>>>         mov     r4, r0
>>>>>         sub     sp, sp, #1004
>>>>>         mov     r5, r1
>>>>>         mov     r0, sp
>>>>>         mov     r1, r4
>>>>>         bl      sprintf
>>>>>         mov     r0, sp
>>>>>         movs    r1, #1
>>>>>         bl      access
>>>>>         mov     r3, r0
>>>>>         cbz     r0, .L6
>>>>>         movs    r1, #0
>>>>>         mov     r0, sp
>>>>>         bl      fopen
>>>>>         mov     r1, r0
>>>>>         cbz     r0, .L7
>>>>>         mov     r0, r4
>>>>>         bl      foo
>>>>> .L3:
>>>>>         mov     r0, r5
>>>>>         add     sp, sp, #1004
>>>>>         pop     {r4, r5, pc}
>>>>> .L6:
>>>>>         mov     r0, r5
>>>>>         mov     r5, r3
>>>>>         bl      free
>>>>>         b       .L3
>>>>> .L7:
>>>>>         mov     r0, r5
>>>>>         mov     r5, r1
>>>>>         bl      free
>>>>>         b       .L3
>>>>> ...
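For reference, the duplicate tails in question are the two
"free (ctx); return 0;" blocks (.L6 and .L7 above). At the source level,
the transformation the patch aims for is equivalent to sharing that tail by
hand, roughly as below (a hand-written illustration of the intended effect,
not compiler output):

#include <stdio.h>

void foo (char*, FILE*);

char* hprofStartupp(char *outputFileName, char *ctx)
{
  char fileName[1000];
  FILE *fp;

  sprintf(fileName, outputFileName);
  /* Both early-return paths now share a single "free; return 0" tail.  */
  if (access(fileName, 1) == 0
      || (fp = fopen(fileName, 0)) == 0)
    {
      free(ctx);
      return 0;
    }

  foo(outputFileName, fp);
  return ctx;
}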
>>>>>
>>>>> The proposed patch solves the problem by dealing with the 2 blocks at
>>>>> a level where they are still identical: at gimple level. It detects
>>>>> that the 2 blocks are identical, and removes one of them.
>>>>>
>>>>> The following table shows the impact of the patch on the example in
>>>>> terms of size for -march=armv7-a:
>>>>>
>>>>>             without  with  delta
>>>>> Os       :     108   104     -4
>>>>> O2       :     120   104    -16
>>>>> Os thumb :      68    64     -4
>>>>> O2 thumb :      76    64    -12
>>>>>
>>>>> The gain in size for -O2 is that of removing the entire block, plus
>>>>> the replacement of 2 moves by a constant set, which also shortens the
>>>>> execution path. The patch ensures optimal code for both -O2 and -Os.
>>>>>
>>>>> By keeping track of equivalent definitions in the 2 blocks, we can
>>>>> ignore those differences in the comparison. Without this feature, we
>>>>> would only match blocks with resultless operations, due to the SSA
>>>>> nature of gimple. For example, with this feature, we reduce the
>>>>> following function to its minimum at gimple level, rather than at rtl
>>>>> level.
>>>>>
>>>
>>> Example B. (naming it for reference below)
>>>
>>>>> int f(int c, int b, int d)
>>>>> {
>>>>>   int r, e;
>>>>>
>>>>>   if (c)
>>>>>     r = b + d;
>>>>>   else
>>>>>     {
>>>>>       e = b + d;
>>>>>       r = e;
>>>>>     }
>>>>>
>>>>>   return r;
>>>>> }
>>>>>
>>>>> ;; Function f (f)
>>>>>
>>>>> f (int c, int b, int d)
>>>>> {
>>>>>   int e;
>>>>>
>>>>> <bb 2>:
>>>>>   e_6 = b_3(D) + d_4(D);
>>>>>   return e_6;
>>>>>
>>>>> }
>>>>>
>>>>> I'll send the patch with the testcases in a separate email.
>>>>>
>>>>> OK for trunk?
>>>>
>>>> I don't like that you hook this into cleanup_tree_cfg - that is called
>>>> _way_ too often.
>>>>
>>>
>>> Here is a reworked patch that addresses several concerns, particularly
>>> the compile time overhead.
>>>
>>> Changes:
>>> - The optimization is now in a separate file.
>>> - The optimization is now a pass rather than a cleanup. That allowed me
>>>   to remove the test for pass-local flags.
>>>   New is the pass driver tail_merge_optimize, based on
>>>   tree-cfgcleanup.c:cleanup_tree_cfg_1.
>>> - The pass is run once, on SSA. Before, the patch would fix example A
>>>   only before SSA and example B only on SSA. In order to fix example A
>>>   on SSA, I added these changes:
>>>   - handle the vop state at entry of bb1 and bb2 as equal
>>>     (gimple_equal_p)
>>>   - insert a vop phi in bb2, and use that one (update_vuses)
>>>   - complete pt_solutions_equal_p.
>>>
>>> Passed x86_64 bootstrapping and regression testing, currently
>>> regtesting on ARM.
>>>
>>> I placed the pass at the earliest point where it fixes example B: after
>>> copy propagation and dead code elimination, specifically, after the
>>> first invocation of pass_cd_dce. Do you know (any other points) where
>>> the pass should be scheduled?
>>
>> It's probably reasonable to run it after IPA inlining has taken place,
>> which means inserting it somewhere after the second pass_fre (I'd
>> suggest after pass_merge_phi).
>>
>
> I placed it there, but I ran into some interaction with
> pass_late_warn_uninitialized. Addition of the pass makes test
> gcc.dg/uninit-pred-2_c.c fail.
>
> FAIL: gcc.dg/uninit-pred-2_c.c bogus uninitialized var warning
>       (test for bogus messages, line 43)
> FAIL: gcc.dg/uninit-pred-2_c.c real uninitialized var warning
>       (test for warnings, line 45)
>
> int foo_2 (int n, int m, int r)
> {
>   int flag = 0;
>   int v;
>
>   if (n)
>     {
>       v = r;
>       flag = 1;
>     }
>
>   if (m) g++;
>   else bar ();
>
>   if (flag)
>     blah (v); /* { dg-bogus "uninitialized" "bogus uninitialized var warning" } */
>   else
>     blah (v); /* { dg-warning "uninitialized" "real uninitialized var warning" } */
>
>   return 0;
> }
>
> The pass replaces the second call to blah with the first one, and
> eliminates the if. After that, the uninitialized warning is issued for
> the line number of the first call to blah, while at source level the
> warning only makes sense for the second call to blah.
>
> Shall I try putting the pass after pass_late_warn_uninitialized?
No, simply pass -fno-tree-tail-merge in the testcase.

>> But my general comment still applies - I don't like the structural
>> comparison code at all and you should really use the value-numbering
>> machineries we have
>
> I now use sccvn.

Good.

>> or even better, merge this pass with FRE itself (or DOM if that suits
>> you more). For FRE you'd want to hook into tree-ssa-pre.c:eliminate().
>>
>
> If we need to do the transformation after pass_late_warn_uninitialized,
> it needs to stay on its own, I suppose.

I suppose part of the high cost of the pass is running SCCVN, so it makes
sense to share that with the existing FRE run. Any reason you use
VN_NOWALK?

>>>> This also duplicates the literal matching done on the RTL level -
>>>> instead I think this optimization would be more related to
>>>> value-numbering (either that of SCCVN/FRE/PRE or that of DOM which
>>>> also does jump-threading).
>>>
>>> The pass currently does just duplicate block elimination, not
>>> cross-jumping. If we would like to extend this to cross-jumping, I
>>> think we need to do the reverse of value numbering: walk backwards over
>>> the bb, and keep track of the way values are used rather than defined.
>>> This will allow us to make a cut halfway through a basic block.
>>
>> I don't understand - I propose to do literal matching but using
>> value-numbering for tracking equivalences, to avoid literal matching for
>> stuff we know is equivalent. In fact I think it will be mostly calls and
>> stores where we need to do literal matching, but never intermediate
>> computations on registers.
>>
>
> I tried to implement that scheme now.
>
>> But maybe I miss something here.
>>
>>> In general, we cannot cut halfway through a basic block in the current
>>> implementation (of value numbering and forward matching), since we
>>> assume equivalence of the incoming vops at bb entry. This assumption is
>>> in general only valid if we indeed replace the entire block by another
>>> entire block.
>>
>> Why are VOPs of concern at all?
>>
>
> In the previous version, I inserted the phis for the vops manually. In
> the current version of the pass, I let TODO_update_ssa_only_virtuals deal
> with vops, so it's not relevant anymore.
>
>>> I imagine that a cross-jumping heuristic would be based on the length
>>> of the match and the amount of non-vop phis it would introduce. Then
>>> value numbering would be something orthogonal to this optimization,
>>> which would reduce the amount of phis needed for a cross-jump. I think
>>> it would make sense to use SCCVN value numbering at the point that we
>>> have this backward matching.
>>>
>>> I'm not sure whether it's a good idea to try to replace the current
>>> forward local value numbering with SCCVN value numbering, since we
>>> currently declare vops equal which are, in the global sense, not equal.
>>> And once we go to backward matching, we'll need something to keep track
>>> of the uses, and we can reuse the current infrastructure for that, but
>>> not the SCCVN value numbering.
>>>
>>> Does that make any sense?
>>
>> Ok, let me think about this a bit.
>>
>
> I tried to be clearer about this in the header comment of the pass.
>
>> For now about the patch in general. The functions need renaming to
>> something more sensible now that this isn't cfg-cleanup anymore.
>>
>> I miss a general overview of the pass - it's hard to reverse engineer
>> its working for me.
>
> I added a header comment.
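Concretely, the -fno-tree-tail-merge fix suggested at the top of this
reply would amount to a one-line change to the testcase's options, along
these lines (a sketch; the other options shown are assumptions about the
testcase, not quoted from it):

/* { dg-options "-O2 -Wuninitialized -fno-tree-tail-merge" } */

That keeps the warning locations stable for this particular test while
leaving the new pass enabled by default everywhere else.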
>
>> Like (working backwards), you are detecting duplicate predecessors -
>> that obviously doesn't work for duplicates without any successors, like
>> those ending in noreturn calls.
>>
>
> Merging of blocks without successors works now.
>
>> +  n = EDGE_COUNT (bb->preds);
>> +
>> +  for (i = 0; i < n; ++i)
>> +    {
>> +      e1 = EDGE_PRED (bb, i);
>> +      if (e1->flags & EDGE_COMPLEX)
>> +        continue;
>> +      for (j = i + 1; j < n; ++j)
>> +        {
>>
>> that's quadratic in the number of predecessors.
>>
>
> The quadratic comparison is now limited by
> PARAM_TAIL_MERGE_MAX_COMPARISONS: each bb is compared to at most
> PARAM_TAIL_MERGE_MAX_COMPARISONS similar bbs per worklist iteration.
>
>> +  /* Block e1->src might be deleted. If bb and e1->src are the same
>> +     block, delete e2->src instead, by swapping e1 and e2. */
>> +  e1_swapped = (bb == e1->src) ? e2 : e1;
>> +  e2_swapped = (bb == e1->src) ? e1 : e2;
>>
>> is that because you incrementally merge preds two at a time? As you are
>> deleting blocks, don't you need to adjust the quadratic walking? Thus,
>> with say four equivalent preds, won't your code crash anyway?
>>
>
> I think it was to make the calculation of dominator info easier, but I
> now use functions from dominance.c for that, so this piece of code is
> gone.
>
>> I think the code needs to delay the CFG manipulation to the end of this
>> function.
>>
>
> I now delay the cfg manipulation till after each analysis phase.
>
>> +/* Returns whether for all phis in E1->dest the phi alternatives for E1
>> +   and E2 are either:
>> +   - equal, or
>> +   - defined locally in E1->src and E2->src.
>> +   In the latter case, register the alternatives in *PHI_EQUIV. */
>> +
>> +static bool
>> +same_or_local_phi_alternatives (equiv_t *phi_equiv, edge e1, edge e2)
>> +{
>> +  int n1 = e1->dest_idx;
>> +  int n2 = e2->dest_idx;
>> +  gimple_stmt_iterator gsi;
>> +  basic_block dest = e1->dest;
>> +  gcc_assert (dest == e2->dest);
>>
>> too many asserts in general - I'd say for this case pass in the
>> destination block as argument.
>>
>> +  gcc_assert (val1 != NULL_TREE);
>> +  gcc_assert (val2 != NULL_TREE);
>>
>> superfluous.
>>
>> +static bool
>> +cleanup_duplicate_preds_1 (equiv_t phi_equiv, edge e1, edge e2)
>> ...
>> +  VEC (edge,heap) *redirected_edges;
>> +  gcc_assert (bb == e2->dest);
>>
>> same.
>>
>> +  if (e1->flags != e2->flags)
>> +    return false;
>>
>> that's bad - it should handle EDGE_TRUE/FALSE_VALUE mismatches by
>> swapping edges in the preds.
>>
>
> That's handled now.
>
>> +  /* TODO: We could allow multiple successor edges here, as long as bb1
>> +     and bb2 have the same successors. */
>> +  if (EDGE_COUNT (bb1->succs) != 1 || EDGE_COUNT (bb2->succs) != 1)
>> +    return false;
>>
>> hm, ok - that would need fixing, too. Same or mergeable successors of
>> course, which makes me wonder if doing this whole transformation
>> incrementally and locally is a good idea ;)  Also
>>
>
> Also handled now.
>
>> +  /* Calculate the changes to be made to the dominator info.
>> +     Calculate bb2_dom. */
>> ...
>>
>> wouldn't be necessary I suppose (just throw away dom info after the
>> pass).
>>
>> That is, I'd globally record BB equivalences (thus, "value-number" BBs)
>> and apply the CFG manipulations at a single point.
>>
>
> I delay the cfg manipulation till after each analysis phase. Delaying the
> cfg manipulation till the end of the pass instead might make the analysis
> code more convoluted.
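To make the new limit concrete, the bounded candidate walk can be sketched
like this (a hypothetical helper with a simplified bb_gimple_equal_p
signature; only PARAM_VALUE, PARAM_TAIL_MERGE_MAX_COMPARISONS,
EXECUTE_IF_SET_IN_BITMAP and BASIC_BLOCK are existing interfaces):

static basic_block
find_duplicate_capped (basic_block bb, bitmap candidates)
{
  int max_comp = PARAM_VALUE (PARAM_TAIL_MERGE_MAX_COMPARISONS);
  int tried = 0;
  unsigned int i;
  bitmap_iterator bi;

  EXECUTE_IF_SET_IN_BITMAP (candidates, 0, i, bi)
    {
      basic_block other = BASIC_BLOCK (i);

      if (other == bb)
        continue;
      /* Bound the otherwise quadratic walk.  */
      if (tried++ >= max_comp)
        break;
      /* The patch's bb_gimple_equal_p takes more arguments; this
         two-argument form is a simplification.  */
      if (bb_gimple_equal_p (bb, other))
        return other;
    }

  return NULL;
}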
>
>> Btw, I miss where you insert PHI nodes for all uses that flow in from
>> the preds' preds - you do that for VOPs but not for real operands?
>>
>
> Indeed, inserting phis for non-vops is a todo.
>
>> +      /* Replace uses of vuse2 with uses of the phi. */
>> +      for (gsi = gsi_start_bb (bb2); !gsi_end_p (gsi); gsi_next (&gsi))
>> +        {
>>
>> why not walk immediate uses of the old PHI and SET_USE to the new one
>> instead (for those uses in the duplicate BB, of course)?
>>
>
> And I no longer insert VOP phis, but let a TODO handle that, so this code
> is gone.

Ok. Somewhat costly in comparison though.

>> +    case GSS_CALL:
>> +      if (!pt_solution_equal_p (gimple_call_use_set (s1),
>> +                                gimple_call_use_set (s2))
>>
>> I don't understand why you are concerned about equality of points-to
>> information. Why not simply ior it (pt_solution_ior_into - note they are
>> shared, so you need to unshare them first).
>>
>
> I let a todo handle the alias info now.

Hmm, that's not going to work if it's needed for correctness.

>> +/* Return true if p1 and p2 can be considered equal. */
>> +
>> +static bool
>> +pt_solution_equal_p (struct pt_solution *p1, struct pt_solution *p2)
>>
>> would go into tree-ssa-structalias.c instead.
>>
>> +static bool
>> +gimple_base_equal_p (gimple s1, gimple s2)
>> +{
>> ...
>> +  if (gimple_modified_p (s1) || gimple_modified_p (s2))
>> +    return false;
>>
>> that shouldn't be of concern.
>>
>> +  if (s1->gsbase.subcode != s2->gsbase.subcode)
>> +    return false;
>>
>> for assigns that are of class GIMPLE_SINGLE_RHS we do not update the
>> subcode during transformations, so it can differ for now-equal
>> statements.
>>
>
> handled properly now.
>
>> I'm not sure if a splay tree for the SSA name version equivalency map is
>> the best representation - I would have used a simple array of
>> num_ssa_names size and assigned value-numbers (the lesser version, for
>> example).
>>
>> Thus equiv_insert would do
>>
>>   value = MIN (SSA_NAME_VERSION (val1), SSA_NAME_VERSION (val2));
>>   values[SSA_NAME_VERSION (val1)] = value;
>>   values[SSA_NAME_VERSION (val2)] = value;
>>
>> if the names are not defined in bb1 resp. bb2 we would have to insert a
>> PHI node in the merged block - that would be a cost thingy for doing
>> this value-numbering in a more global way.
>>
>
> The local value numbering code has been removed.
>
>> You don't seem to be concerned about the SSA names' points-to
>> information, but it surely has the same issues as that of the calls (so
>> either they should be equal or they should be conservatively merged).
>> But as-is you don't allow any values to flow into the merged blocks that
>> are not equal for both edges, no?
>>
>
> Correct, that's still a todo.
>
>> +  TV_TREE_CFG, /* tv_id */
>>
>> add a new timevar. We want to be able to turn the pass off, so also add
>> a new option (I can see it might make debugging harder in some cases).
>>
>
> I added -ftree-tail-merge and TV_TREE_TAIL_MERGE.
>
>> Can you assess the effect of the patch on GCC itself (for example when
>> building cc1)? What's the size benefit and the compile-time overhead?
>>
>
> Effect on building cc1:
>
>               real         user        sys
> without: 19m50.158s  19m 2.090s  0m20.860s
> with:    19m59.456s  19m17.170s  0m20.350s
>                      ----------
>                        +15.080s
>                          +1.31%

That's quite a lot of time.

> $ size without/cc1 with/cc1
>     text    data     bss      dec     hex filename
> 17515986   41320 1364352 18921658 120b8ba without/cc1
> 17399226   41320 1364352 18804898 11ef0a2 with/cc1
> --------
>  -116760
>   -0.67%
>
> OK for trunk, provided build & reg-testing on ARM is ok?
I miss additions to the testsuite.

+static bool
+bb_dominated_by_p (basic_block bb1, basic_block bb2)

please use

+  if (TREE_CODE (val1) == SSA_NAME)
+    {
+      if (!same_preds
+          && !SSA_NAME_IS_DEFAULT_DEF (val1)
+          && !dominated_by_p (bb2,
+                              gimple_bb (SSA_NAME_DEF_STMT (val1))))
+        return false;

instead. All stmts should have a BB, apart from def stmts of default defs
(which are gimple_nops).

+/* Return the canonical scc_vn tree for X, if we can use scc_vn_info.
+   Otherwise, return X. */
+
+static tree
+gvn_val (tree x)
+{
+  return ((scc_vn_ok && x != NULL && TREE_CODE (x) == SSA_NAME)
+          ? VN_INFO ((x))->valnum : x);
+}

I suppose we want to export vn_valueize from tree-ssa-sccvn.c instead,
which seems to perform the same. Do you ever call the above when scc_vn_ok
is false or x is NULL?

+static bool
+gvn_uses_equal (tree val1, tree val2, basic_block bb1,
+                basic_block bb2, bool same_preds)
+{
+  gimple def1, def2;
+  basic_block def1_bb, def2_bb;
+
+  if (val1 == NULL_TREE || val2 == NULL_TREE)
+    return false;

does this ever happen?

+  if (gvn_val (val1) != gvn_val (val2))
+    return false;

I suppose a shortcut

  if (val1 == val2)
    return true;

is possible?

+static int *bb_size;
+
+/* Init bb_size administration. */
+
+static void
+init_bb_size (void)
+{

if you need more per-BB info you can hook it into bb->aux. What's the size
used for (I guess I'll see below ...)?

+  for (gsi = gsi_start_nondebug_bb (bb);
+       !gsi_end_p (gsi); gsi_next_nondebug (&gsi))
+    size++;

is pretty rough. I guess it's for a quick false result when comparing BBs
(which means you could initialize the info lazily?)

+struct same_succ
+{
+  /* The bbs that have the same successor bbs. */
+  bitmap bbs;
+  /* The successor bbs. */
+  bitmap succs;
+  /* Indicates whether the EDGE_TRUE/FALSE_VALUEs of succ_flags are
+     swapped for bb. */
+  bitmap inverse;
+  /* The edge flags for each of the successor bbs. */
+  VEC (int, heap) *succ_flags;
+  /* Indicates whether the struct is in the worklist. */
+  bool in_worklist;
+};

looks somewhat odd at first sight - maybe an overall comment about what
this is used for is missing. Well, let's see.

+static hashval_t
+same_succ_hash (const void *ve)
+{
+  const_same_succ_t e = (const_same_succ_t) ve;
+  hashval_t hashval = bitmap_hash (e->succs);
+  int flags;
+  unsigned int i;
+  unsigned int first = bitmap_first_set_bit (e->bbs);
+  int size = bb_size[first];
+  gimple_stmt_iterator gsi;
+  gimple stmt;
+  basic_block bb = BASIC_BLOCK (first);
+
+  hashval = iterative_hash_hashval_t (size, hashval);
+  for (gsi = gsi_start_nondebug_bb (bb);
+       !gsi_end_p (gsi); gsi_next_nondebug (&gsi))
+    {
+      stmt = gsi_stmt (gsi);
+      hashval = iterative_hash_hashval_t (gimple_code (stmt), hashval);
+      if (!is_gimple_call (stmt))
+        continue;
+      if (gimple_call_internal_p (stmt))
+        hashval = iterative_hash_hashval_t
+          ((hashval_t) gimple_call_internal_fn (stmt), hashval);
+      else
+        hashval = iterative_hash_expr (gimple_call_fn (stmt), hashval);

you could also keep a cache of the BB hash, as you keep a cache of the
size (if this function is called multiple times per BB).

The hash looks relatively weak - for all assignments it will hash in
GIMPLE_ASSIGN only ... I'd at least hash in gimple_assign_rhs_code. The
call handling OTOH looks overly complicated to me ;)

The hash will be dependent on stmt ordering even if that doesn't matter,
like

  i = i + 1;
  j = j - 1;

vs. the swapped variant. Similarly, the successor edges are not sorted, so
true/false edges may be in a different order. Not sure yet if your
comparison function would make those BBs unequal anyway.
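To make the "hash in gimple_assign_rhs_code" suggestion concrete, the loop
body quoted above could be strengthened roughly as follows (a sketch
reusing the quoted stmt/hashval variables; iterative_hash_hashval_t,
is_gimple_assign and gimple_assign_rhs_code are existing interfaces):

      stmt = gsi_stmt (gsi);
      hashval = iterative_hash_hashval_t (gimple_code (stmt), hashval);
      /* Distinguish e.g. a PLUS_EXPR assignment from a MINUS_EXPR
         assignment, instead of hashing in GIMPLE_ASSIGN only.  */
      if (is_gimple_assign (stmt))
        hashval = iterative_hash_hashval_t
          ((hashval_t) gimple_assign_rhs_code (stmt), hashval);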
+static bool
+inverse_flags (const_same_succ_t e1, const_same_succ_t e2)
+{
+  int f1a, f1b, f2a, f2b;
+  int mask = ~(EDGE_TRUE_VALUE | EDGE_FALSE_VALUE);
+
+  if (VEC_length (int, e1->succ_flags) != 2)
+    return false;

...

I wonder why you keep a VEC of successor edge flags in same_succ_t instead
of using the embedded successor edge vector in the basic_block structure?

+  bb_to_same_succ[bb->index] = *slot;

looks like a candidate for per-BB info in bb->aux, too.

+static void
+find_same_succ (void)
+{
+  int i;
+  same_succ_t same = same_succ_alloc ();
+
+  for (i = 0; i < last_basic_block; ++i)
+    {
+      find_same_succ_bb (BASIC_BLOCK (i), &same);
+      if (same == NULL)
+        same = same_succ_alloc ();
+    }

I suppose you want FOR_EACH_BB (excluding the entry/exit blocks) or
FOR_ALL_BB (including them). The above can also have BASIC_BLOCK (i) ==
NULL. Similarly in other places.

+  for (i = 0; i < n1; ++i)
+    {
+      ei = EDGE_PRED (bb1, i);
+      for (j = 0; j < n2; ++j)
+        {
+          ej = EDGE_PRED (bb2, j);
+          if (ei->src != ej->src)
+            continue;
+          nr_matches++;
+          break;
+        }
+    }

  FOR_EACH_EDGE (ei, iterator, bb1->preds)
    if (!find_edge (ei->src, bb2))
      return false;

is easier to parse.

+static bool
+gimple_subcode_equal_p (gimple s1, gimple s2, bool inv_cond)
+{
+  tree var, var_type;
+  bool honor_nans;
+
+  if (is_gimple_assign (s1)
+      && gimple_assign_rhs_class (s1) == GIMPLE_SINGLE_RHS)
+    return true;

the subcode for GIMPLE_SINGLE_RHS is gimple_assign_rhs_code (the TREE_CODE
of gimple_assign_rhs1, actually).

+static bool
+gimple_base_equal_p (gimple s1, gimple s2, bool inv_cond)

I wonder if you still need this, given ..

+static bool
+gimple_equal_p (gimple s1, gimple s2, bool same_preds, bool inv_cond)
+{
+  unsigned int i;
+  enum gimple_statement_structure_enum gss;
+  tree lhs1, lhs2;
+  basic_block bb1 = gimple_bb (s1), bb2 = gimple_bb (s2);
+
+  /* Handle omp gimples conservatively. */
+  if (is_gimple_omp (s1) || is_gimple_omp (s2))
+    return false;
+
+  /* Handle lhs. */
+  lhs1 = gimple_get_lhs (s1);
+  lhs2 = gimple_get_lhs (s2);
+  if (lhs1 != NULL_TREE && lhs2 != NULL_TREE)
+    return (same_preds && TREE_CODE (lhs1) == SSA_NAME
+            && TREE_CODE (lhs2) == SSA_NAME
+            && gvn_val (lhs1) == gvn_val (lhs2));
+  else if (!(lhs1 == NULL_TREE && lhs2 == NULL_TREE))
+    return false;

all lhs equivalency is deferred to GVN (which means all GIMPLE_ASSIGN and
GIMPLE_CALL stmts with a lhs). That leaves the case of calls without a
lhs. I'd rather structure this function like

  if (gimple_code (s1) != gimple_code (s2))
    return false;
  switch (gimple_code (s1))
    {
    case GIMPLE_CALL:
      ... compare arguments ...
      if equal, ok; if not and we have a lhs, use GVN.
    case GIMPLE_ASSIGN:
      ... compare GVN of the LHS ...
    case GIMPLE_COND:
      ... compare operands ...
    default:
      return false;
    }

+static bool
+bb_gimple_equal_p (basic_block bb1, basic_block bb2, bool same_preds,
+                   bool inv_cond)
+
+{

you don't do an early out by comparing the pre-computed sizes. Mind, you
can have hashtable collisions where they still differ (did you check
hashtable stats on it? how is the collision rate?)
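The missing early out could be a short prelude at the top of
bb_gimple_equal_p (a sketch; bb_size is the patch's per-BB statement-count
cache quoted earlier):

  /* Blocks with different statement counts can never match; reject
     hashtable collisions cheaply before comparing the blocks statement
     by statement.  */
  if (bb_size[bb1->index] != bb_size[bb2->index])
    return false;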
+static bool
+bb_has_non_vop_phi (basic_block bb)
+{
+  gimple_seq phis = phi_nodes (bb);
+  gimple phi;
+
+  if (phis == NULL)
+    return false;
+
+  if (!gimple_seq_singleton_p (phis))
+    return true;
+
+  phi = gimple_seq_first_stmt (phis);
+  return !VOID_TYPE_P (TREE_TYPE (gimple_phi_result (phi)));

  return is_gimple_reg (gimple_phi_result (phi));

+static void
+update_debug_stmts (void)
+{
+  int i;
+  basic_block bb;
+
+  for (i = 0; i < last_basic_block; ++i)
+    {
+      gimple stmt;
+      gimple_stmt_iterator gsi;
+
+      bb = BASIC_BLOCK (i);

FOR_EACH_BB

it must be possible to avoid scanning basic-blocks that are not affected
by the transform, no? In fact the only affected basic-blocks should be
those that were merged with another block?

+  /* Mark vops for updating. Without this, TODO_update_ssa_only_virtuals
+     won't do anything. */
+  mark_sym_for_renaming (gimple_vop (cfun));

it won't insert any PHIs, that's correct. Still somewhat ugly; a manual
update of PHI nodes should be possible.

+  if (dump_file)
+    {
+      fprintf (dump_file, "Before TODOs.\n");

with TDF_DETAILS only, please.

+  free_dominance_info (CDI_DOMINATORS);

if you keep dominance info up-to-date there is no need to free it.

+  TODO_verify_ssa | TODO_verify_stmts
+  | TODO_verify_flow | TODO_update_ssa_only_virtuals
+  | TODO_rebuild_alias

please no TODO_rebuild_alias, simply remove it - alias info in merged
paths should be compatible enough if there is value-equivalence between
SSA names. At least you can't rely on TODO_rebuild_alias for correctness -
it is skipped if IPA PTA was run, for example.

+  | TODO_cleanup_cfg

is that needed? If so, return it from your execute function only if you
changed anything. But I doubt your transformation introduces cleanup
opportunities?

New options and params need documentation in doc/invoke.texi.

Thanks,
Richard.

> Thanks,
> - Tom
>
> 2011-07-12  Tom de Vries  <t...@codesourcery.com>
>
>       PR middle-end/43864
>       * tree-ssa-tail-merge.c: New file.
>       (bb_dominated_by_p): New function.
>       (scc_vn_ok): New var.
>       (init_gvn, delete_gvn, gvn_val, gvn_uses_equal): New function.
>       (bb_size): New var.
>       (init_bb_size, delete_bb_size): New function.
>       (struct same_succ): Define.
>       (same_succ_t, const_same_succ_t): New typedef.
>       (same_succ_print, same_succ_print_traverse, same_succ_hash)
>       (inverse_flags, same_succ_equal, same_succ_alloc, same_succ_delete)
>       (same_succ_reset): New function.
>       (same_succ_htab, bb_to_same_succ, same_succ_edge_flags)
>       (bitmap deleted_bbs, deleted_bb_preds): New vars.
>       (debug_same_succ): New function.
>       (worklist): New var.
>       (print_worklist, add_to_worklist, find_same_succ_bb, find_same_succ)
>       (init_worklist, delete_worklist, delete_basic_block_same_succ)
>       (update_worklist): New function.
>       (struct bb_cluster): Define.
>       (bb_cluster_t, const_bb_cluster_t): New typedef.
>       (print_cluster, debug_cluster, same_predecessors)
>       (add_bb_to_cluster, new_cluster, delete_cluster): New function.
>       (merge_cluster, all_clusters): New var.
>       (alloc_cluster_vectors, reset_cluster_vectors, delete_cluster_vectors)
>       (merge_clusters, set_cluster): New function.
>       (gimple_subcode_equal_p, gimple_base_equal_p, gimple_equal_p)
>       (bb_gimple_equal_p): New function.
>       (find_duplicate, same_phi_alternatives_1, same_phi_alternatives)
>       (bb_has_non_vop_phi, find_clusters_1, find_clusters): New function.
>       (replace_block_by, apply_clusters): New function.
>       (update_debug_stmt, update_debug_stmts): New function.
>       (tail_merge_optimize, gate_tail_merge): New function.
>       (pass_tail_merge): New gimple pass.
>       * tree-pass.h (pass_tail_merge): Declare new pass.
>       * passes.c (init_optimization_passes): Use new pass.
>       * Makefile.in (OBJS-common): Add tree-ssa-tail-merge.o.
>       (tree-ssa-tail-merge.o): New rule.
>       * opts.c (default_options_table): Set OPT_ftree_tail_merge by
>       default at OPT_LEVELS_2_PLUS.
>       * timevar.def (TV_TREE_TAIL_MERGE): New timevar.
>       * common.opt (ftree-tail-merge): New switch.
>       * params.def (PARAM_TAIL_MERGE_MAX_COMPARISONS): New parameter.
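For completeness, the params.def entry would presumably look something
like this (a sketch using the existing DEFPARAM macro; the help text,
default value and bounds shown here are assumptions, not taken from the
patch):

DEFPARAM (PARAM_TAIL_MERGE_MAX_COMPARISONS,
          "tail-merge-max-comparisons",
          "Maximum number of similar bbs to compare a bb with",
          10, 0, 0)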