On Tue, Jul 12, 2011 at 2:12 PM, Tom de Vries <vr...@codesourcery.com> wrote:
> Hi Richard,
>
> here's a new version of the pass. I attempted to address your comments as
> much as possible. The pass was bootstrapped and reg-tested on x86_64.
>
> On 06/14/2011 05:01 PM, Richard Guenther wrote:
>> On Fri, Jun 10, 2011 at 6:54 PM, Tom de Vries <vr...@codesourcery.com> wrote:
>>> Hi Richard,
>>>
>>> thanks for the review.
>>>
>>> On 06/08/2011 11:55 AM, Richard Guenther wrote:
>>>> On Wed, Jun 8, 2011 at 11:42 AM, Tom de Vries <vr...@codesourcery.com>
>>>> wrote:
>>>>> Hi Richard,
>>>>>
>>>>> I have a patch for PR43864. The patch adds a gimple-level duplicate
>>>>> block cleanup. The patch has been bootstrapped and reg-tested on
>>>>> x86_64, and reg-tested on ARM. The size impact on ARM for spec2000 is
>>>>> shown in the following table (%, lower is better).
>>>>>
>>>>>                   none             pic
>>>>>               thumb1  thumb2  thumb1  thumb2
>>>>> spec2000      99.9    99.9    99.8    99.8
>>>>>
>>>>> PR43864 is currently marked as a duplicate of PR20070, but I'm not sure
>>>>> that the optimizations proposed in PR20070 would fix this PR.
>>>>>
>>>>> The problem in this PR is that when compiling with -O2, the example
>>>>> below should only have one call to free. The original problem is
>>>>> formulated in terms of -Os; currently we do generate one call to free
>>>>> with -Os, although still not the smallest code possible. I'll show
>>>>> here the -O2 case, since that's similar to the original PR.
>>>>>
>>>
>>> Example A. (naming it for reference below)
>>>
>>>>> #include <stdio.h>
>>>>> void foo (char*, FILE*);
>>>>> char* hprofStartupp(char *outputFileName, char *ctx)
>>>>> {
>>>>>   char fileName[1000];
>>>>>   FILE *fp;
>>>>>   sprintf(fileName, outputFileName);
>>>>>   if (access(fileName, 1) == 0) {
>>>>>     free(ctx);
>>>>>     return 0;
>>>>>   }
>>>>>
>>>>>   fp = fopen(fileName, 0);
>>>>>   if (fp == 0) {
>>>>>     free(ctx);
>>>>>     return 0;
>>>>>   }
>>>>>
>>>>>   foo(outputFileName, fp);
>>>>>
>>>>>   return ctx;
>>>>> }
>>>>>
>>>>> AFAIU, there are 2 complementary methods of rtl optimization proposed
>>>>> in PR20070:
>>>>> - Merging 2 blocks which are identical except for input registers, by
>>>>>   using a conditional move to choose between the different input
>>>>>   registers.
>>>>> - Merging 2 blocks which have different local registers, by ignoring
>>>>>   those differences.
>>>>>
>>>>> Blocks .L6 and .L7 have no difference in local registers, but they
>>>>> have a difference in input registers: r3 and r1. Replacing the move to
>>>>> r5 by a conditional move would probably be beneficial in terms of
>>>>> size, but it's not clear what condition the conditional move should be
>>>>> using. Calculating such a condition would add in size and lengthen the
>>>>> execution path.
>>>>>
>>>>> gcc -O2 -march=armv7-a -mthumb pr43864.c -S:
>>>>> ...
>>>>>         push    {r4, r5, lr}
>>>>>         mov     r4, r0
>>>>>         sub     sp, sp, #1004
>>>>>         mov     r5, r1
>>>>>         mov     r0, sp
>>>>>         mov     r1, r4
>>>>>         bl      sprintf
>>>>>         mov     r0, sp
>>>>>         movs    r1, #1
>>>>>         bl      access
>>>>>         mov     r3, r0
>>>>>         cbz     r0, .L6
>>>>>         movs    r1, #0
>>>>>         mov     r0, sp
>>>>>         bl      fopen
>>>>>         mov     r1, r0
>>>>>         cbz     r0, .L7
>>>>>         mov     r0, r4
>>>>>         bl      foo
>>>>> .L3:
>>>>>         mov     r0, r5
>>>>>         add     sp, sp, #1004
>>>>>         pop     {r4, r5, pc}
>>>>> .L6:
>>>>>         mov     r0, r5
>>>>>         mov     r5, r3
>>>>>         bl      free
>>>>>         b       .L3
>>>>> .L7:
>>>>>         mov     r0, r5
>>>>>         mov     r5, r1
>>>>>         bl      free
>>>>>         b       .L3
>>>>> ...
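For reference, the duplicate tails in question are the two
"free (ctx); return 0;" blocks (.L6 and .L7 above). At the source level,
the transformation the patch aims for is equivalent to sharing that tail by
hand, roughly as below (a hand-written illustration of the intended effect,
not compiler output):

#include <stdio.h>

void foo (char*, FILE*);

char* hprofStartupp(char *outputFileName, char *ctx)
{
  char fileName[1000];
  FILE *fp;

  sprintf(fileName, outputFileName);
  /* Both early-return paths now share a single "free; return 0" tail.  */
  if (access(fileName, 1) == 0
      || (fp = fopen(fileName, 0)) == 0)
    {
      free(ctx);
      return 0;
    }

  foo(outputFileName, fp);
  return ctx;
}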
>>>>>
>>>>> The proposed patch solves the problem by dealing with the 2 blocks at
>>>>> a level where they are still identical: at gimple level. It detects
>>>>> that the 2 blocks are identical, and removes one of them.
>>>>>
>>>>> The following table shows the impact of the patch on the example in
>>>>> terms of size for -march=armv7-a:
>>>>>
>>>>>             without  with  delta
>>>>> Os       :     108   104     -4
>>>>> O2       :     120   104    -16
>>>>> Os thumb :      68    64     -4
>>>>> O2 thumb :      76    64    -12
>>>>>
>>>>> The gain in size for -O2 is that of removing the entire block, plus
>>>>> the replacement of 2 moves by a constant set, which also shortens the
>>>>> execution path. The patch ensures optimal code for both -O2 and -Os.
>>>>>
>>>>> By keeping track of equivalent definitions in the 2 blocks, we can
>>>>> ignore those differences in the comparison. Without this feature, we
>>>>> would only match blocks with resultless operations, due to the SSA
>>>>> nature of gimple. For example, with this feature, we reduce the
>>>>> following function to its minimum at gimple level, rather than at rtl
>>>>> level.
>>>>>
>>>
>>> Example B. (naming it for reference below)
>>>
>>>>> int f(int c, int b, int d)
>>>>> {
>>>>>   int r, e;
>>>>>
>>>>>   if (c)
>>>>>     r = b + d;
>>>>>   else
>>>>>     {
>>>>>       e = b + d;
>>>>>       r = e;
>>>>>     }
>>>>>
>>>>>   return r;
>>>>> }
>>>>>
>>>>> ;; Function f (f)
>>>>>
>>>>> f (int c, int b, int d)
>>>>> {
>>>>>   int e;
>>>>>
>>>>> <bb 2>:
>>>>>   e_6 = b_3(D) + d_4(D);
>>>>>   return e_6;
>>>>>
>>>>> }
>>>>>
>>>>> I'll send the patch with the testcases in a separate email.
>>>>>
>>>>> OK for trunk?
>>>>
>>>> I don't like that you hook this into cleanup_tree_cfg - that is called
>>>> _way_ too often.
>>>>
>>>
>>> Here is a reworked patch that addresses several concerns, particularly
>>> the compile time overhead.
>>>
>>> Changes:
>>> - The optimization is now in a separate file.
>>> - The optimization is now a pass rather than a cleanup. That allowed me
>>>   to remove the test for pass-local flags.
>>>   New is the pass driver tail_merge_optimize, based on
>>>   tree-cfgcleanup.c:cleanup_tree_cfg_1.
>>> - The pass is run once, on SSA. Before, the patch would fix example A
>>>   only before SSA and example B only on SSA. In order to fix example A
>>>   on SSA, I added these changes:
>>>   - handle the vop state at entry of bb1 and bb2 as equal
>>>     (gimple_equal_p)
>>>   - insert a vop phi in bb2, and use that one (update_vuses)
>>>   - complete pt_solutions_equal_p.
>>>
>>> Passed x86_64 bootstrapping and regression testing, currently
>>> regtesting on ARM.
>>>
>>> I placed the pass at the earliest point where it fixes example B: after
>>> copy propagation and dead code elimination, specifically, after the
>>> first invocation of pass_cd_dce. Do you know (any other points) where
>>> the pass should be scheduled?
>>
>> It's probably reasonable to run it after IPA inlining has taken place,
>> which means inserting it somewhere after the second pass_fre (I'd
>> suggest after pass_merge_phi).
>>
>
> I placed it there, but I ran into some interaction with
> pass_late_warn_uninitialized. Addition of the pass makes test
> gcc.dg/uninit-pred-2_c.c fail.
>
> FAIL: gcc.dg/uninit-pred-2_c.c bogus uninitialized var warning
>       (test for bogus messages, line 43)
> FAIL: gcc.dg/uninit-pred-2_c.c real uninitialized var warning
>       (test for warnings, line 45)
>
> int foo_2 (int n, int m, int r)
> {
>   int flag = 0;
>   int v;
>
>   if (n)
>     {
>       v = r;
>       flag = 1;
>     }
>
>   if (m) g++;
>   else bar ();
>
>   if (flag)
>     blah (v); /* { dg-bogus "uninitialized" "bogus uninitialized var warning" } */
>   else
>     blah (v); /* { dg-warning "uninitialized" "real uninitialized var warning" } */
>
>   return 0;
> }
>
> The pass replaces the second call to blah with the first one, and
> eliminates the if. After that, the uninitialized warning is issued for
> the line number of the first call to blah, while at source level the
> warning only makes sense for the second call to blah.
>
> Shall I try putting the pass after pass_late_warn_uninitialized?
No, simply pass -fno-tree-tail-merge in the testcase.

>> But my general comment still applies - I don't like the structural
>> comparison code at all and you should really use the value-numbering
>> machineries we have
>
> I now use sccvn.

Good.

>> or even better, merge this pass with FRE itself (or DOM if that suits
>> you more). For FRE you'd want to hook into tree-ssa-pre.c:eliminate().
>>
>
> If we need to do the transformation after pass_late_warn_uninitialized,
> it needs to stay on its own, I suppose.

I suppose part of the high cost of the pass is running SCCVN, so it makes
sense to share that with the existing FRE run. Any reason you use
VN_NOWALK?

>>>> This also duplicates the literal matching done on the RTL level -
>>>> instead I think this optimization would be more related to
>>>> value-numbering (either that of SCCVN/FRE/PRE or that of DOM which
>>>> also does jump-threading).
>>>
>>> The pass currently does just duplicate block elimination, not
>>> cross-jumping. If we would like to extend this to cross-jumping, I
>>> think we need to do the reverse of value numbering: walk backwards over
>>> the bb, and keep track of the way values are used rather than defined.
>>> This will allow us to make a cut halfway through a basic block.
>>
>> I don't understand - I propose to do literal matching but using
>> value-numbering for tracking equivalences, to avoid literal matching for
>> stuff we know is equivalent. In fact I think it will be mostly calls and
>> stores where we need to do literal matching, but never intermediate
>> computations on registers.
>>
>
> I tried to implement that scheme now.
>
>> But maybe I miss something here.
>>
>>> In general, we cannot cut halfway through a basic block in the current
>>> implementation (of value numbering and forward matching), since we
>>> assume equivalence of the incoming vops at bb entry. This assumption is
>>> in general only valid if we indeed replace the entire block by another
>>> entire block.
>>
>> Why are VOPs of concern at all?
>>
>
> In the previous version, I inserted the phis for the vops manually. In
> the current version of the pass, I let TODO_update_ssa_only_virtuals deal
> with vops, so it's not relevant anymore.
>
>>> I imagine that a cross-jumping heuristic would be based on the length
>>> of the match and the amount of non-vop phis it would introduce. Then
>>> value numbering would be something orthogonal to this optimization,
>>> which would reduce the amount of phis needed for a cross-jump. I think
>>> it would make sense to use SCCVN value numbering at the point that we
>>> have this backward matching.
>>>
>>> I'm not sure whether it's a good idea to try to replace the current
>>> forward local value numbering with SCCVN value numbering, since we
>>> currently declare vops equal which are, in the global sense, not equal.
>>> And once we go to backward matching, we'll need something to keep track
>>> of the uses, and we can reuse the current infrastructure for that, but
>>> not the SCCVN value numbering.
>>>
>>> Does that make any sense?
>>
>> Ok, let me think about this a bit.
>>
>
> I tried to be clearer about this in the header comment of the pass.
>
>> For now about the patch in general. The functions need renaming to
>> something more sensible now that this isn't cfg-cleanup anymore.
>>
>> I miss a general overview of the pass - it's hard to reverse engineer
>> its working for me.
>
> I added a header comment.
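Concretely, the -fno-tree-tail-merge fix suggested at the top of this
reply would amount to a one-line change to the testcase's options, along
these lines (a sketch; the other options shown are assumptions about the
testcase, not quoted from it):

/* { dg-options "-O2 -Wuninitialized -fno-tree-tail-merge" } */

That keeps the warning locations stable for this particular test while
leaving the new pass enabled by default everywhere else.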
>
>> Like (working backwards), you are detecting duplicate predecessors -
>> that obviously doesn't work for duplicates without any successors, like
>> those ending in noreturn calls.
>>
>
> Merging of blocks without successors works now.
>
>> +  n = EDGE_COUNT (bb->preds);
>> +
>> +  for (i = 0; i < n; ++i)
>> +    {
>> +      e1 = EDGE_PRED (bb, i);
>> +      if (e1->flags & EDGE_COMPLEX)
>> +        continue;
>> +      for (j = i + 1; j < n; ++j)
>> +        {
>>
>> that's quadratic in the number of predecessors.
>>
>
> The quadratic comparison is now limited by
> PARAM_TAIL_MERGE_MAX_COMPARISONS: each bb is compared to at most
> PARAM_TAIL_MERGE_MAX_COMPARISONS similar bbs per worklist iteration.
>
>> +  /* Block e1->src might be deleted. If bb and e1->src are the same
>> +     block, delete e2->src instead, by swapping e1 and e2. */
>> +  e1_swapped = (bb == e1->src) ? e2 : e1;
>> +  e2_swapped = (bb == e1->src) ? e1 : e2;
>>
>> is that because you incrementally merge preds two at a time? As you are
>> deleting blocks, don't you need to adjust the quadratic walking? Thus,
>> with say four equivalent preds, won't your code crash anyway?
>>
>
> I think it was to make the calculation of dominator info easier, but I
> now use functions from dominance.c for that, so this piece of code is
> gone.
>
>> I think the code needs to delay the CFG manipulation to the end of this
>> function.
>>
>
> I now delay the cfg manipulation till after each analysis phase.
>
>> +/* Returns whether for all phis in E1->dest the phi alternatives for E1
>> +   and E2 are either:
>> +   - equal, or
>> +   - defined locally in E1->src and E2->src.
>> +   In the latter case, register the alternatives in *PHI_EQUIV. */
>> +
>> +static bool
>> +same_or_local_phi_alternatives (equiv_t *phi_equiv, edge e1, edge e2)
>> +{
>> +  int n1 = e1->dest_idx;
>> +  int n2 = e2->dest_idx;
>> +  gimple_stmt_iterator gsi;
>> +  basic_block dest = e1->dest;
>> +  gcc_assert (dest == e2->dest);
>>
>> too many asserts in general - I'd say for this case pass in the
>> destination block as argument.
>>
>> +  gcc_assert (val1 != NULL_TREE);
>> +  gcc_assert (val2 != NULL_TREE);
>>
>> superfluous.
>>
>> +static bool
>> +cleanup_duplicate_preds_1 (equiv_t phi_equiv, edge e1, edge e2)
>> ...
>> +  VEC (edge,heap) *redirected_edges;
>> +  gcc_assert (bb == e2->dest);
>>
>> same.
>>
>> +  if (e1->flags != e2->flags)
>> +    return false;
>>
>> that's bad - it should handle EDGE_TRUE/FALSE_VALUE mismatches by
>> swapping edges in the preds.
>>
>
> That's handled now.
>
>> +  /* TODO: We could allow multiple successor edges here, as long as bb1
>> +     and bb2 have the same successors. */
>> +  if (EDGE_COUNT (bb1->succs) != 1 || EDGE_COUNT (bb2->succs) != 1)
>> +    return false;
>>
>> hm, ok - that would need fixing, too. Same or mergeable successors of
>> course, which makes me wonder if doing this whole transformation
>> incrementally and locally is a good idea ;)  Also
>>
>
> Also handled now.
>
>> +  /* Calculate the changes to be made to the dominator info.
>> +     Calculate bb2_dom. */
>> ...
>>
>> wouldn't be necessary I suppose (just throw away dom info after the
>> pass).
>>
>> That is, I'd globally record BB equivalences (thus, "value-number" BBs)
>> and apply the CFG manipulations at a single point.
>>
>
> I delay the cfg manipulation till after each analysis phase. Delaying the
> cfg manipulation till the end of the pass instead might make the analysis
> code more convoluted.
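To make the new limit concrete, the bounded candidate walk can be sketched
like this (a hypothetical helper with a simplified bb_gimple_equal_p
signature; only PARAM_VALUE, PARAM_TAIL_MERGE_MAX_COMPARISONS,
EXECUTE_IF_SET_IN_BITMAP and BASIC_BLOCK are existing interfaces):

static basic_block
find_duplicate_capped (basic_block bb, bitmap candidates)
{
  int max_comp = PARAM_VALUE (PARAM_TAIL_MERGE_MAX_COMPARISONS);
  int tried = 0;
  unsigned int i;
  bitmap_iterator bi;

  EXECUTE_IF_SET_IN_BITMAP (candidates, 0, i, bi)
    {
      basic_block other = BASIC_BLOCK (i);

      if (other == bb)
        continue;
      /* Bound the otherwise quadratic walk.  */
      if (tried++ >= max_comp)
        break;
      /* The patch's bb_gimple_equal_p takes more arguments; this
         two-argument form is a simplification.  */
      if (bb_gimple_equal_p (bb, other))
        return other;
    }

  return NULL;
}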
>
>> Btw, I miss where you insert PHI nodes for all uses that flow in from
>> the preds' preds - you do that for VOPs but not for real operands?
>>
>
> Indeed, inserting phis for non-vops is a todo.
>
>> +      /* Replace uses of vuse2 with uses of the phi. */
>> +      for (gsi = gsi_start_bb (bb2); !gsi_end_p (gsi); gsi_next (&gsi))
>> +        {
>>
>> why not walk immediate uses of the old PHI and SET_USE to the new one
>> instead (for those uses in the duplicate BB, of course)?
>>
>
> And I no longer insert VOP phis, but let a TODO handle that, so this code
> is gone.

Ok. Somewhat costly in comparison though.

>> +    case GSS_CALL:
>> +      if (!pt_solution_equal_p (gimple_call_use_set (s1),
>> +                                gimple_call_use_set (s2))
>>
>> I don't understand why you are concerned about equality of points-to
>> information. Why not simply ior it (pt_solution_ior_into - note they are
>> shared, so you need to unshare them first).
>>
>
> I let a todo handle the alias info now.

Hmm, that's not going to work if it's needed for correctness.

>> +/* Return true if p1 and p2 can be considered equal. */
>> +
>> +static bool
>> +pt_solution_equal_p (struct pt_solution *p1, struct pt_solution *p2)
>>
>> would go into tree-ssa-structalias.c instead.
>>
>> +static bool
>> +gimple_base_equal_p (gimple s1, gimple s2)
>> +{
>> ...
>> +  if (gimple_modified_p (s1) || gimple_modified_p (s2))
>> +    return false;
>>
>> that shouldn't be of concern.
>>
>> +  if (s1->gsbase.subcode != s2->gsbase.subcode)
>> +    return false;
>>
>> for assigns that are of class GIMPLE_SINGLE_RHS we do not update the
>> subcode during transformations, so it can differ for now-equal
>> statements.
>>
>
> handled properly now.
>
>> I'm not sure if a splay tree for the SSA name version equivalency map is
>> the best representation - I would have used a simple array of
>> num_ssa_names size and assigned value-numbers (the lesser version, for
>> example).
>>
>> Thus equiv_insert would do
>>
>>   value = MIN (SSA_NAME_VERSION (val1), SSA_NAME_VERSION (val2));
>>   values[SSA_NAME_VERSION (val1)] = value;
>>   values[SSA_NAME_VERSION (val2)] = value;
>>
>> if the names are not defined in bb1 resp. bb2 we would have to insert a
>> PHI node in the merged block - that would be a cost thingy for doing
>> this value-numbering in a more global way.
>>
>
> The local value numbering code has been removed.
>
>> You don't seem to be concerned about the SSA names' points-to
>> information, but it surely has the same issues as that of the calls (so
>> either they should be equal or they should be conservatively merged).
>> But as-is you don't allow any values to flow into the merged blocks that
>> are not equal for both edges, no?
>>
>
> Correct, that's still a todo.
>
>> +  TV_TREE_CFG, /* tv_id */
>>
>> add a new timevar. We want to be able to turn the pass off, so also add
>> a new option (I can see it might make debugging harder in some cases).
>>
>
> I added -ftree-tail-merge and TV_TREE_TAIL_MERGE.
>
>> Can you assess the effect of the patch on GCC itself (for example when
>> building cc1)? What's the size benefit and the compile-time overhead?
>>
>
> Effect on building cc1:
>
>               real         user        sys
> without: 19m50.158s  19m 2.090s  0m20.860s
> with:    19m59.456s  19m17.170s  0m20.350s
>                      ----------
>                        +15.080s
>                          +1.31%

That's quite a lot of time.

> $ size without/cc1 with/cc1
>     text    data     bss      dec     hex filename
> 17515986   41320 1364352 18921658 120b8ba without/cc1
> 17399226   41320 1364352 18804898 11ef0a2 with/cc1
> --------
>  -116760
>   -0.67%
>
> OK for trunk, provided build & reg-testing on ARM is ok?
I miss additions to the testsuite.

+static bool
+bb_dominated_by_p (basic_block bb1, basic_block bb2)

please use

+  if (TREE_CODE (val1) == SSA_NAME)
+    {
+      if (!same_preds
+          && !SSA_NAME_IS_DEFAULT_DEF (val1)
+          && !dominated_by_p (bb2,
+                              gimple_bb (SSA_NAME_DEF_STMT (val1))))
+        return false;

instead. All stmts should have a BB, apart from def stmts of default defs
(which are gimple_nops).

+/* Return the canonical scc_vn tree for X, if we can use scc_vn_info.
+   Otherwise, return X. */
+
+static tree
+gvn_val (tree x)
+{
+  return ((scc_vn_ok && x != NULL && TREE_CODE (x) == SSA_NAME)
+          ? VN_INFO ((x))->valnum : x);
+}

I suppose we want to export vn_valueize from tree-ssa-sccvn.c instead,
which seems to perform the same. Do you ever call the above when scc_vn_ok
is false or x is NULL?

+static bool
+gvn_uses_equal (tree val1, tree val2, basic_block bb1,
+                basic_block bb2, bool same_preds)
+{
+  gimple def1, def2;
+  basic_block def1_bb, def2_bb;
+
+  if (val1 == NULL_TREE || val2 == NULL_TREE)
+    return false;

does this ever happen?

+  if (gvn_val (val1) != gvn_val (val2))
+    return false;

I suppose a shortcut

  if (val1 == val2)
    return true;

is possible?

+static int *bb_size;
+
+/* Init bb_size administration. */
+
+static void
+init_bb_size (void)
+{

if you need more per-BB info you can hook it into bb->aux. What's the size
used for (I guess I'll see below ...)?

+  for (gsi = gsi_start_nondebug_bb (bb);
+       !gsi_end_p (gsi); gsi_next_nondebug (&gsi))
+    size++;

is pretty rough. I guess it's for a quick false result when comparing BBs
(which means you could initialize the info lazily?)

+struct same_succ
+{
+  /* The bbs that have the same successor bbs. */
+  bitmap bbs;
+  /* The successor bbs. */
+  bitmap succs;
+  /* Indicates whether the EDGE_TRUE/FALSE_VALUEs of succ_flags are
+     swapped for bb. */
+  bitmap inverse;
+  /* The edge flags for each of the successor bbs. */
+  VEC (int, heap) *succ_flags;
+  /* Indicates whether the struct is in the worklist. */
+  bool in_worklist;
+};

looks somewhat odd at first sight - maybe an overall comment about what
this is used for is missing. Well, let's see.

+static hashval_t
+same_succ_hash (const void *ve)
+{
+  const_same_succ_t e = (const_same_succ_t) ve;
+  hashval_t hashval = bitmap_hash (e->succs);
+  int flags;
+  unsigned int i;
+  unsigned int first = bitmap_first_set_bit (e->bbs);
+  int size = bb_size[first];
+  gimple_stmt_iterator gsi;
+  gimple stmt;
+  basic_block bb = BASIC_BLOCK (first);
+
+  hashval = iterative_hash_hashval_t (size, hashval);
+  for (gsi = gsi_start_nondebug_bb (bb);
+       !gsi_end_p (gsi); gsi_next_nondebug (&gsi))
+    {
+      stmt = gsi_stmt (gsi);
+      hashval = iterative_hash_hashval_t (gimple_code (stmt), hashval);
+      if (!is_gimple_call (stmt))
+        continue;
+      if (gimple_call_internal_p (stmt))
+        hashval = iterative_hash_hashval_t
+          ((hashval_t) gimple_call_internal_fn (stmt), hashval);
+      else
+        hashval = iterative_hash_expr (gimple_call_fn (stmt), hashval);

you could also keep a cache of the BB hash, as you keep a cache of the
size (if this function is called multiple times per BB).

The hash looks relatively weak - for all assignments it will hash in
GIMPLE_ASSIGN only ... I'd at least hash in gimple_assign_rhs_code. The
call handling OTOH looks overly complicated to me ;)

The hash will be dependent on stmt ordering even if that doesn't matter,
like

  i = i + 1;
  j = j - 1;

vs. the swapped variant. Similarly, the successor edges are not sorted, so
true/false edges may be in a different order. Not sure yet if your
comparison function would make those BBs unequal anyway.
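To make the "hash in gimple_assign_rhs_code" suggestion concrete, the loop
body quoted above could be strengthened roughly as follows (a sketch
reusing the quoted stmt/hashval variables; iterative_hash_hashval_t,
is_gimple_assign and gimple_assign_rhs_code are existing interfaces):

      stmt = gsi_stmt (gsi);
      hashval = iterative_hash_hashval_t (gimple_code (stmt), hashval);
      /* Distinguish e.g. a PLUS_EXPR assignment from a MINUS_EXPR
         assignment, instead of hashing in GIMPLE_ASSIGN only.  */
      if (is_gimple_assign (stmt))
        hashval = iterative_hash_hashval_t
          ((hashval_t) gimple_assign_rhs_code (stmt), hashval);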
+static bool
+inverse_flags (const_same_succ_t e1, const_same_succ_t e2)
+{
+  int f1a, f1b, f2a, f2b;
+  int mask = ~(EDGE_TRUE_VALUE | EDGE_FALSE_VALUE);
+
+  if (VEC_length (int, e1->succ_flags) != 2)
+    return false;

...

I wonder why you keep a VEC of successor edge flags in same_succ_t instead
of using the embedded successor edge vector in the basic_block structure?

+  bb_to_same_succ[bb->index] = *slot;

looks like a candidate for per-BB info in bb->aux, too.

+static void
+find_same_succ (void)
+{
+  int i;
+  same_succ_t same = same_succ_alloc ();
+
+  for (i = 0; i < last_basic_block; ++i)
+    {
+      find_same_succ_bb (BASIC_BLOCK (i), &same);
+      if (same == NULL)
+        same = same_succ_alloc ();
+    }

I suppose you want FOR_EACH_BB (excluding the entry/exit blocks) or
FOR_ALL_BB (including them). The above can also have BASIC_BLOCK (i) ==
NULL. Similarly in other places.

+  for (i = 0; i < n1; ++i)
+    {
+      ei = EDGE_PRED (bb1, i);
+      for (j = 0; j < n2; ++j)
+        {
+          ej = EDGE_PRED (bb2, j);
+          if (ei->src != ej->src)
+            continue;
+          nr_matches++;
+          break;
+        }
+    }

  FOR_EACH_EDGE (ei, iterator, bb1->preds)
    if (!find_edge (ei->src, bb2))
      return false;

is easier to parse.

+static bool
+gimple_subcode_equal_p (gimple s1, gimple s2, bool inv_cond)
+{
+  tree var, var_type;
+  bool honor_nans;
+
+  if (is_gimple_assign (s1)
+      && gimple_assign_rhs_class (s1) == GIMPLE_SINGLE_RHS)
+    return true;

the subcode for GIMPLE_SINGLE_RHS is gimple_assign_rhs_code (the TREE_CODE
of gimple_assign_rhs1, actually).

+static bool
+gimple_base_equal_p (gimple s1, gimple s2, bool inv_cond)

I wonder if you still need this, given ..

+static bool
+gimple_equal_p (gimple s1, gimple s2, bool same_preds, bool inv_cond)
+{
+  unsigned int i;
+  enum gimple_statement_structure_enum gss;
+  tree lhs1, lhs2;
+  basic_block bb1 = gimple_bb (s1), bb2 = gimple_bb (s2);
+
+  /* Handle omp gimples conservatively. */
+  if (is_gimple_omp (s1) || is_gimple_omp (s2))
+    return false;
+
+  /* Handle lhs. */
+  lhs1 = gimple_get_lhs (s1);
+  lhs2 = gimple_get_lhs (s2);
+  if (lhs1 != NULL_TREE && lhs2 != NULL_TREE)
+    return (same_preds && TREE_CODE (lhs1) == SSA_NAME
+            && TREE_CODE (lhs2) == SSA_NAME
+            && gvn_val (lhs1) == gvn_val (lhs2));
+  else if (!(lhs1 == NULL_TREE && lhs2 == NULL_TREE))
+    return false;

all lhs equivalency is deferred to GVN (which means all GIMPLE_ASSIGN and
GIMPLE_CALL stmts with a lhs). That leaves the case of calls without a
lhs. I'd rather structure this function like

  if (gimple_code (s1) != gimple_code (s2))
    return false;
  switch (gimple_code (s1))
    {
    case GIMPLE_CALL:
      ... compare arguments ...
      if equal, ok; if not and we have a lhs, use GVN.
    case GIMPLE_ASSIGN:
      ... compare GVN of the LHS ...
    case GIMPLE_COND:
      ... compare operands ...
    default:
      return false;
    }

+static bool
+bb_gimple_equal_p (basic_block bb1, basic_block bb2, bool same_preds,
+                   bool inv_cond)
+
+{

you don't do an early out by comparing the pre-computed sizes. Mind, you
can have hashtable collisions where they still differ (did you check
hashtable stats on it? how is the collision rate?)
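The missing early out could be a short prelude at the top of
bb_gimple_equal_p (a sketch; bb_size is the patch's per-BB statement-count
cache quoted earlier):

  /* Blocks with different statement counts can never match; reject
     hashtable collisions cheaply before comparing the blocks statement
     by statement.  */
  if (bb_size[bb1->index] != bb_size[bb2->index])
    return false;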
+static bool
+bb_has_non_vop_phi (basic_block bb)
+{
+  gimple_seq phis = phi_nodes (bb);
+  gimple phi;
+
+  if (phis == NULL)
+    return false;
+
+  if (!gimple_seq_singleton_p (phis))
+    return true;
+
+  phi = gimple_seq_first_stmt (phis);
+  return !VOID_TYPE_P (TREE_TYPE (gimple_phi_result (phi)));

  return is_gimple_reg (gimple_phi_result (phi));

+static void
+update_debug_stmts (void)
+{
+  int i;
+  basic_block bb;
+
+  for (i = 0; i < last_basic_block; ++i)
+    {
+      gimple stmt;
+      gimple_stmt_iterator gsi;
+
+      bb = BASIC_BLOCK (i);

FOR_EACH_BB

it must be possible to avoid scanning basic-blocks that are not affected
by the transform, no? In fact the only affected basic-blocks should be
those that were merged with another block?

+  /* Mark vops for updating. Without this, TODO_update_ssa_only_virtuals
+     won't do anything. */
+  mark_sym_for_renaming (gimple_vop (cfun));

it won't insert any PHIs, that's correct. Still somewhat ugly; a manual
update of PHI nodes should be possible.

+  if (dump_file)
+    {
+      fprintf (dump_file, "Before TODOs.\n");

with TDF_DETAILS only, please.

+  free_dominance_info (CDI_DOMINATORS);

if you keep dominance info up-to-date there is no need to free it.

+  TODO_verify_ssa | TODO_verify_stmts
+  | TODO_verify_flow | TODO_update_ssa_only_virtuals
+  | TODO_rebuild_alias

please no TODO_rebuild_alias, simply remove it - alias info in merged
paths should be compatible enough if there is value-equivalence between
SSA names. At least you can't rely on TODO_rebuild_alias for correctness -
it is skipped if IPA PTA was run, for example.

+  | TODO_cleanup_cfg

is that needed? If so, return it from your execute function only if you
changed anything. But I doubt your transformation introduces cleanup
opportunities?

New options and params need documentation in doc/invoke.texi.

Thanks,
Richard.

> Thanks,
> - Tom
>
> 2011-07-12  Tom de Vries  <t...@codesourcery.com>
>
>       PR middle-end/43864
>       * tree-ssa-tail-merge.c: New file.
>       (bb_dominated_by_p): New function.
>       (scc_vn_ok): New var.
>       (init_gvn, delete_gvn, gvn_val, gvn_uses_equal): New function.
>       (bb_size): New var.
>       (init_bb_size, delete_bb_size): New function.
>       (struct same_succ): Define.
>       (same_succ_t, const_same_succ_t): New typedef.
>       (same_succ_print, same_succ_print_traverse, same_succ_hash)
>       (inverse_flags, same_succ_equal, same_succ_alloc, same_succ_delete)
>       (same_succ_reset): New function.
>       (same_succ_htab, bb_to_same_succ, same_succ_edge_flags)
>       (bitmap deleted_bbs, deleted_bb_preds): New vars.
>       (debug_same_succ): New function.
>       (worklist): New var.
>       (print_worklist, add_to_worklist, find_same_succ_bb, find_same_succ)
>       (init_worklist, delete_worklist, delete_basic_block_same_succ)
>       (update_worklist): New function.
>       (struct bb_cluster): Define.
>       (bb_cluster_t, const_bb_cluster_t): New typedef.
>       (print_cluster, debug_cluster, same_predecessors)
>       (add_bb_to_cluster, new_cluster, delete_cluster): New function.
>       (merge_cluster, all_clusters): New var.
>       (alloc_cluster_vectors, reset_cluster_vectors, delete_cluster_vectors)
>       (merge_clusters, set_cluster): New function.
>       (gimple_subcode_equal_p, gimple_base_equal_p, gimple_equal_p)
>       (bb_gimple_equal_p): New function.
>       (find_duplicate, same_phi_alternatives_1, same_phi_alternatives)
>       (bb_has_non_vop_phi, find_clusters_1, find_clusters): New function.
>       (replace_block_by, apply_clusters): New function.
>       (update_debug_stmt, update_debug_stmts): New function.
>       (tail_merge_optimize, gate_tail_merge): New function.
>       (pass_tail_merge): New gimple pass.
>       * tree-pass.h (pass_tail_merge): Declare new pass.
>       * passes.c (init_optimization_passes): Use new pass.
>       * Makefile.in (OBJS-common): Add tree-ssa-tail-merge.o.
>       (tree-ssa-tail-merge.o): New rule.
>       * opts.c (default_options_table): Set OPT_ftree_tail_merge by
>       default at OPT_LEVELS_2_PLUS.
>       * timevar.def (TV_TREE_TAIL_MERGE): New timevar.
>       * common.opt (ftree-tail-merge): New switch.
>       * params.def (PARAM_TAIL_MERGE_MAX_COMPARISONS): New parameter.
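For completeness, the params.def entry would presumably look something
like this (a sketch using the existing DEFPARAM macro; the help text,
default value and bounds shown here are assumptions, not taken from the
patch):

DEFPARAM (PARAM_TAIL_MERGE_MAX_COMPARISONS,
          "tail-merge-max-comparisons",
          "Maximum number of similar bbs to compare a bb with",
          10, 0, 0)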