Ping? Teresa On Fri, May 4, 2012 at 3:41 PM, Teresa Johnson <[email protected]> wrote: > > On David's suggestion, I have removed the changes that rename niter_desc > to > loop_desc from this patch to focus the patch on the unrolling changes. I > can > submit a cleanup patch to do the renaming as soon as this one goes in. > > Bootstrapped and tested on x86_64-unknown-linux-gnu. Ok for trunk? > > Thanks, > Teresa > > Here is the new description of improvements from the original patch: > > Improved patch based on feedback. Main changes are: > > 1) Improve efficiency by caching loop analysis results in the loop > auxiliary > info structure hanging off the loop structure. Added a new routine, > analyze_loop_insns, to fill in information about the average and total > number > of branches, as well as whether there are any floating point set and call > instructions in the loop. The new routine is invoked when we first create > a > loop's niter_desc struct, and the caller (get_simple_loop_desc) has been > modified to handle creating a niter_desc for the fake outermost loop. > > 2) Improvements to max_unroll_with_branches: > - Treat the fake outermost loop (the procedure body) as we would a hot > outer > loop, i.e. compute the max unroll looking at its nested branches, instead > of > shutting off unrolling when we reach the fake outermost loop. > - Pull the checks previously done in the caller into the routine (e.g. > whether the loop iterates frequently or contains fp instructions). > - Fix a bug in the previous version that sometimes caused overflow in the > new unroll factor. > > 3) Remove float variables, and use integer computation to compute the > average number of branches in the loop. > > 4) Detect more types of floating point computations in the loop by walking > all set instructions, not just single sets. > > 2012-05-04 Teresa Johnson <[email protected]> > > * doc/invoke.texi: Update the documentation with new params. > * loop-unroll.c (max_unroll_with_branches): New function. > (decide_unroll_constant_iterations, > decide_unroll_runtime_iterations): > Add heuristic to avoid increasing branch mispredicts when > unrolling. > (decide_peel_simple, decide_unroll_stupid): Retrieve number of > branches from niter_desc instead of via function that walks loop. > * loop-iv.c (get_simple_loop_desc): Invoke new analyze_loop_insns > function, and add guards to enable this function to work for the > outermost loop. > * cfgloop.c (insn_has_fp_set, analyze_loop_insns): New functions. > (num_loop_branches): Remove. > * cfgloop.h (struct loop_desc): Added new fields to cache > additional > loop analysis information. > (num_loop_branches): Remove. > (analyze_loop_insns): Declare. > * params.def (PARAM_MIN_ITER_UNROLL_WITH_BRANCHES): New param. > (PARAM_UNROLL_OUTER_LOOP_BRANCH_BUDGET): Ditto. > > Index: doc/invoke.texi > =================================================================== > --- doc/invoke.texi (revision 187013) > +++ doc/invoke.texi (working copy) > @@ -8842,6 +8842,12 @@ The maximum number of insns of an unswitched loop. > @item max-unswitch-level > The maximum number of branches unswitched in a single loop. > > +@item min-iter-unroll-with-branches > +Minimum iteration count to ignore branch effects when unrolling. > + > +@item unroll-outer-loop-branch-budget > +Maximum number of branches allowed in hot outer loop region after unroll. > + > @item lim-expensive > The minimum cost of an expensive expression in the loop invariant motion. > > Index: loop-unroll.c > =================================================================== > --- loop-unroll.c (revision 187013) > +++ loop-unroll.c (working copy) > @@ -152,6 +152,99 @@ static void combine_var_copies_in_loop_exit (struc > basic_block); > static rtx get_expansion (struct var_to_expand *); > > +/* Compute the maximum number of times LOOP can be unrolled without > exceeding > + a branch budget, which can increase branch mispredictions. The number > of > + branches is computed by weighting each branch with its expected > execution > + probability through the loop based on profile data. If no profile > feedback > + data exists, simply return the current NUNROLL factor. */ > + > +static unsigned > +max_unroll_with_branches(struct loop *loop, unsigned nunroll) > +{ > + struct loop *outer; > + struct niter_desc *outer_desc = 0; > + int outer_niters = 1; > + int frequent_iteration_threshold; > + unsigned branch_budget; > + struct niter_desc *desc = get_simple_loop_desc (loop); > + > + /* Ignore loops with FP computation as these tend to benefit much more > + consistently from unrolling. */ > + if (desc->has_fp) > + return nunroll; > + > + frequent_iteration_threshold = PARAM_VALUE > (PARAM_MIN_ITER_UNROLL_WITH_BRANCHES); > + if (expected_loop_iterations (loop) >= (unsigned) > frequent_iteration_threshold) > + return nunroll; > + > + /* If there was no profile feedback data, av_num_branches will be 0 > + and we won't limit unrolling. If the av_num_branches is at most 1, > + also don't limit unrolling as the back-edge branch will not be > duplicated. */ > + if (desc->av_num_branches <= 1) > + return nunroll; > + > + /* Walk up the loop tree until we find a hot outer loop in which the > current > + loop is nested. At that point we will compute the number of times > the > + current loop can be unrolled based on the number of branches in the > hot > + outer loop. */ > + outer = loop_outer (loop); > + /* The loop structure contains a fake outermost loop, so this should > always > + be non-NULL for our current loop. */ > + gcc_assert (outer); > + > + /* Walk up the loop tree until we either find a hot outer loop or hit > the > + fake outermost loop at the root. */ > + while (true) > + { > + outer_desc = get_simple_loop_desc (outer); > + > + /* Stop if we hit the fake outermost loop at the root of the tree, > + which includes the whole procedure. */ > + if (!loop_outer (outer)) > + break; > + > + if (outer_desc->const_iter) > + outer_niters *= outer_desc->niter; > + else if (outer->header->count) > + outer_niters *= expected_loop_iterations (outer); > + > + /* If the outer loop has enough iterations to be considered hot, > then > + we can stop our upwards loop tree traversal and examine the > current > + outer loop. */ > + if (outer_niters >= frequent_iteration_threshold) > + break; > + > + outer = loop_outer (outer); > + } > + > + gcc_assert(outer); > + > + /* Assume that any call will cause the branch budget to be exceeded, > + and that we can't unroll the current loop without increasing > + mispredicts. */ > + if (outer_desc->has_call) > + return 0; > + > + /* Otherwise, compute the maximum number of times current loop can be > + unrolled without exceeding our branch budget. First we subtract > + off the outer loop's average branch count from the budget. Note > + that this includes the branches in the current loop. This yields > + the number of branches left in the budget for the unrolled copies. > + We divide this by the number of branches in the current loop that > + must be duplicated when we unroll, which is the total average > + number of branches minus the back-edge branch. This yields the > + number of new loop body copies that can be created by unrolling > + without exceeding the budget, to which we add 1 to get the unroll > + factor. Note that the "outermost loop" may be the whole procedure > + if we did not find a hot enough enclosing loop. */ > + branch_budget = PARAM_VALUE (PARAM_UNROLL_OUTER_LOOP_BRANCH_BUDGET); > + if (outer_desc->av_num_branches > branch_budget) > + return 0; > + /* We already returned early if desc->av_num_branches <= 1. */ > + return (branch_budget - outer_desc->av_num_branches) > + / (desc->av_num_branches - 1) + 1; > +} > + > /* Unroll and/or peel (depending on FLAGS) LOOPS. */ > void > unroll_and_peel_loops (int flags) > @@ -522,6 +615,7 @@ static void > decide_unroll_constant_iterations (struct loop *loop, int flags) > { > unsigned nunroll, nunroll_by_av, best_copies, best_unroll = 0, n_copies, > i; > + unsigned nunroll_branches; > struct niter_desc *desc; > > if (!(flags & UAP_UNROLL)) > @@ -565,6 +659,21 @@ decide_unroll_constant_iterations (struct loop *lo > return; > } > > + /* Be careful when unrolling loops with branches inside -- it can > increase > + the number of mispredicts. */ > + if (desc->num_branches > 1) > + { > + nunroll_branches = max_unroll_with_branches (loop, nunroll); > + if (nunroll > nunroll_branches) > + nunroll = nunroll_branches; > + if (nunroll <= 1) > + { > + if (dump_file) > + fprintf (dump_file, ";; Not unrolling, contains branches\n"); > + return; > + } > + } > + > /* Check whether the loop rolls enough to consider. */ > if (desc->niter < 2 * nunroll) > { > @@ -802,7 +911,7 @@ unroll_loop_constant_iterations (struct loop *loop > static void > decide_unroll_runtime_iterations (struct loop *loop, int flags) > { > - unsigned nunroll, nunroll_by_av, i; > + unsigned nunroll, nunroll_by_av, nunroll_branches, i; > struct niter_desc *desc; > > if (!(flags & UAP_UNROLL)) > @@ -856,6 +965,21 @@ decide_unroll_runtime_iterations (struct loop *loo > return; > } > > + /* Be careful when unrolling loops with branches inside -- it can > increase > + the number of mispredicts. */ > + if (desc->num_branches > 1) > + { > + nunroll_branches = max_unroll_with_branches (loop, nunroll); > + if (nunroll > nunroll_branches) > + nunroll = nunroll_branches; > + if (nunroll <= 1) > + { > + if (dump_file) > + fprintf (dump_file, ";; Not unrolling, contains branches\n"); > + return; > + } > + } > + > /* If we have profile feedback, check whether the loop rolls. */ > if ((loop->header->count > && expected_loop_iterations (loop) < 2 * nunroll) > @@ -1233,7 +1357,7 @@ decide_peel_simple (struct loop *loop, int flags) > > /* Do not simply peel loops with branches inside -- it increases number > of mispredicts. */ > - if (num_loop_branches (loop) > 1) > + if (desc->num_branches > 1) > { > if (dump_file) > fprintf (dump_file, ";; Not peeling, contains branches\n"); > @@ -1394,7 +1518,7 @@ decide_unroll_stupid (struct loop *loop, int flags > > /* Do not unroll loops with branches inside -- it increases number > of mispredicts. */ > - if (num_loop_branches (loop) > 1) > + if (desc->num_branches > 1) > { > if (dump_file) > fprintf (dump_file, ";; Not unrolling, contains branches\n"); > Index: loop-iv.c > =================================================================== > --- loop-iv.c (revision 187013) > +++ loop-iv.c (working copy) > @@ -2948,8 +2948,12 @@ get_simple_loop_desc (struct loop *loop) > /* At least desc->infinite is not always initialized by > find_simple_loop_exit. */ > desc = XCNEW (struct niter_desc); > - iv_analysis_loop_init (loop); > - find_simple_exit (loop, desc); > + if (loop->latch != EXIT_BLOCK_PTR) > + { > + iv_analysis_loop_init (loop); > + find_simple_exit (loop, desc); > + } > + analyze_loop_insns (loop, desc); > loop->aux = desc; > > if (desc->simple_p && (desc->assumptions || desc->infinite)) > Index: cfgloop.c > =================================================================== > --- cfgloop.c (revision 187013) > +++ cfgloop.c (working copy) > @@ -1156,24 +1156,98 @@ get_loop_exit_edges (const struct loop *loop) > return edges; > } > > -/* Counts the number of conditional branches inside LOOP. */ > +/* Determine if INSN is a floating point set. */ > > -unsigned > -num_loop_branches (const struct loop *loop) > +static bool > +insn_has_fp_set(rtx insn) > { > - unsigned i, n; > - basic_block * body; > + int i; > + rtx pat = PATTERN(insn); > + if (GET_CODE (pat) == SET) > + return (FLOAT_MODE_P (GET_MODE (SET_DEST (pat)))); > + else if (GET_CODE (pat) == PARALLEL) > + { > + for (i = 0; i < XVECLEN (pat, 0); i++) > + { > + rtx sub = XVECEXP (pat, 0, i); > + if (GET_CODE (sub) == SET) > + return (FLOAT_MODE_P (GET_MODE (SET_DEST (sub)))); > + } > + } > + return false; > +} > > - gcc_assert (loop->latch != EXIT_BLOCK_PTR); > +/* Analyzes the instructions inside LOOP, updating the DESC. Currently > counts > + the number of conditional branch instructions, calls and fp > instructions, > + as well as the average number of branches executed per iteration. */ > > +void > +analyze_loop_insns (const struct loop *loop, struct niter_desc *desc) > +{ > + unsigned i, nbranch; > + gcov_type weighted_nbranch; > + bool has_call, has_fp; > + basic_block * body, bb; > + rtx insn; > + gcov_type header_count = loop->header->count; > + > + nbranch = weighted_nbranch = 0; > + has_call = has_fp = false; > + > body = get_loop_body (loop); > - n = 0; > for (i = 0; i < loop->num_nodes; i++) > - if (EDGE_COUNT (body[i]->succs) >= 2) > - n++; > + { > + bb = body[i]; > + > + if (EDGE_COUNT (bb->succs) >= 2) > + { > + nbranch++; > + > + /* If this block is executed less frequently than the header > (loop > + entry), then it is weighted based on its execution count, > which > + will be turned into a ratio compared to the loop header > below. */ > + if (bb->count < header_count) > + weighted_nbranch += bb->count; > + > + /* When it is executed more frequently than the header (i.e. it > is > + in a nested inner loop), simply weight the branch the same > as the > + header execution count, so that it will contribute 1 branch > to > + the ratio computed below. */ > + else > + weighted_nbranch += header_count; > + } > + > + /* No need to iterate through the instructions below if > + both flags have already been set. */ > + if (has_call && has_fp) > + continue; > + > + FOR_BB_INSNS (bb, insn) > + { > + if (!INSN_P (insn)) > + continue; > + > + if (!has_call) > + has_call = CALL_P (insn); > + > + if (!has_fp) > + has_fp = insn_has_fp_set (insn); > + } > + } > free (body); > > - return n; > + desc->num_branches = nbranch; > + /* Now divide the weights computed above by the loop header execution > count, > + to compute the average number of branches through the loop. By > adding > + header_count/2 to the numerator we round to nearest with integer > + division. */ > + if (header_count != 0) > + desc->av_num_branches > + = (weighted_nbranch + header_count/2) / header_count; > + else > + desc->av_num_branches = 0; > + desc->has_call = has_call; > + desc->has_fp = has_fp; > } > > /* Adds basic block BB to LOOP. */ > Index: cfgloop.h > =================================================================== > --- cfgloop.h (revision 187013) > +++ cfgloop.h (working copy) > @@ -249,7 +249,6 @@ extern basic_block *get_loop_body_in_custom_order > > extern VEC (edge, heap) *get_loop_exit_edges (const struct loop *); > edge single_exit (const struct loop *); > -extern unsigned num_loop_branches (const struct loop *); > > extern edge loop_preheader_edge (const struct loop *); > extern edge loop_latch_edge (const struct loop *); > @@ -359,7 +358,8 @@ struct rtx_iv > }; > > /* The description of an exit from the loop and of the number of > iterations > - till we take the exit. */ > + till we take the exit. Also includes other information used primarily > + by the loop unroller. */ > > struct niter_desc > { > @@ -400,6 +400,18 @@ struct niter_desc > > /* The number of iterations of the loop. */ > rtx niter_expr; > + > + /* The number of branches in the loop. */ > + unsigned num_branches; > + > + /* The number of executed branches per iteration. */ > + unsigned av_num_branches; > + > + /* Whether the loop contains a call instruction. */ > + bool has_call; > + > + /* Whether the loop contains fp instructions. */ > + bool has_fp; > }; > > extern void iv_analysis_loop_init (struct loop *); > @@ -413,6 +425,7 @@ extern void iv_analysis_done (void); > > extern struct niter_desc *get_simple_loop_desc (struct loop *loop); > extern void free_simple_loop_desc (struct loop *loop); > +void analyze_loop_insns (const struct loop *, struct niter_desc *desc); > > static inline struct niter_desc * > simple_loop_desc (struct loop *loop) > Index: params.def > =================================================================== > --- params.def (revision 187013) > +++ params.def (working copy) > @@ -312,6 +312,15 @@ DEFPARAM(PARAM_MAX_UNROLL_ITERATIONS, > "The maximum depth of a loop nest we completely peel", > 8, 0, 0) > > +DEFPARAM(PARAM_MIN_ITER_UNROLL_WITH_BRANCHES, > + "min-iter-unroll-with-branches", > + "Minimum iteration count to ignore branch effects when unrolling", > + 50, 0, 0) > +DEFPARAM(PARAM_UNROLL_OUTER_LOOP_BRANCH_BUDGET, > + "unroll-outer-loop-branch-budget", > + "Maximum number of branches allowed in hot outer loop region after > unroll", > + 25, 0, 0) > + > /* The maximum number of insns of an unswitched loop. */ > DEFPARAM(PARAM_MAX_UNSWITCH_INSNS, > "max-unswitch-insns", > > -- > This patch is available for review at > http://codereview.appspot.com/6099055
-- Teresa Johnson | Software Engineer | [email protected] | 408-460-2413
