Ping? Teresa On Fri, May 11, 2012 at 6:11 AM, Teresa Johnson <tejohn...@google.com> wrote: > Ping? > Teresa > > On Fri, May 4, 2012 at 3:41 PM, Teresa Johnson <tejohn...@google.com> wrote: >> >> On David's suggestion, I have removed the changes that rename niter_desc >> to >> loop_desc from this patch to focus the patch on the unrolling changes. I >> can >> submit a cleanup patch to do the renaming as soon as this one goes in. >> >> Bootstrapped and tested on x86_64-unknown-linux-gnu. Ok for trunk? >> >> Thanks, >> Teresa >> >> Here is the new description of improvements from the original patch: >> >> Improved patch based on feedback. Main changes are: >> >> 1) Improve efficiency by caching loop analysis results in the loop >> auxiliary >> info structure hanging off the loop structure. Added a new routine, >> analyze_loop_insns, to fill in information about the average and total >> number >> of branches, as well as whether there are any floating point set and call >> instructions in the loop. The new routine is invoked when we first create >> a >> loop's niter_desc struct, and the caller (get_simple_loop_desc) has been >> modified to handle creating a niter_desc for the fake outermost loop. >> >> 2) Improvements to max_unroll_with_branches: >> - Treat the fake outermost loop (the procedure body) as we would a hot >> outer >> loop, i.e. compute the max unroll looking at its nested branches, instead >> of >> shutting off unrolling when we reach the fake outermost loop. >> - Pull the checks previously done in the caller into the routine (e.g. >> whether the loop iterates frequently or contains fp instructions). >> - Fix a bug in the previous version that sometimes caused overflow in the >> new unroll factor. >> >> 3) Remove float variables, and use integer computation to compute the >> average number of branches in the loop. >> >> 4) Detect more types of floating point computations in the loop by walking >> all set instructions, not just single sets. >> >> 2012-05-04 Teresa Johnson <tejohn...@google.com> >> >> * doc/invoke.texi: Update the documentation with new params. >> * loop-unroll.c (max_unroll_with_branches): New function. >> (decide_unroll_constant_iterations, >> decide_unroll_runtime_iterations): >> Add heuristic to avoid increasing branch mispredicts when >> unrolling. >> (decide_peel_simple, decide_unroll_stupid): Retrieve number of >> branches from niter_desc instead of via function that walks loop. >> * loop-iv.c (get_simple_loop_desc): Invoke new analyze_loop_insns >> function, and add guards to enable this function to work for the >> outermost loop. >> * cfgloop.c (insn_has_fp_set, analyze_loop_insns): New functions. >> (num_loop_branches): Remove. >> * cfgloop.h (struct loop_desc): Added new fields to cache >> additional >> loop analysis information. >> (num_loop_branches): Remove. >> (analyze_loop_insns): Declare. >> * params.def (PARAM_MIN_ITER_UNROLL_WITH_BRANCHES): New param. >> (PARAM_UNROLL_OUTER_LOOP_BRANCH_BUDGET): Ditto. >> >> Index: doc/invoke.texi >> =================================================================== >> --- doc/invoke.texi (revision 187013) >> +++ doc/invoke.texi (working copy) >> @@ -8842,6 +8842,12 @@ The maximum number of insns of an unswitched loop. >> @item max-unswitch-level >> The maximum number of branches unswitched in a single loop. >> >> +@item min-iter-unroll-with-branches >> +Minimum iteration count to ignore branch effects when unrolling. >> + >> +@item unroll-outer-loop-branch-budget >> +Maximum number of branches allowed in hot outer loop region after unroll. >> + >> @item lim-expensive >> The minimum cost of an expensive expression in the loop invariant motion. >> >> Index: loop-unroll.c >> =================================================================== >> --- loop-unroll.c (revision 187013) >> +++ loop-unroll.c (working copy) >> @@ -152,6 +152,99 @@ static void combine_var_copies_in_loop_exit (struc >> basic_block); >> static rtx get_expansion (struct var_to_expand *); >> >> +/* Compute the maximum number of times LOOP can be unrolled without >> exceeding >> + a branch budget, which can increase branch mispredictions. The number >> of >> + branches is computed by weighting each branch with its expected >> execution >> + probability through the loop based on profile data. If no profile >> feedback >> + data exists, simply return the current NUNROLL factor. */ >> + >> +static unsigned >> +max_unroll_with_branches(struct loop *loop, unsigned nunroll) >> +{ >> + struct loop *outer; >> + struct niter_desc *outer_desc = 0; >> + int outer_niters = 1; >> + int frequent_iteration_threshold; >> + unsigned branch_budget; >> + struct niter_desc *desc = get_simple_loop_desc (loop); >> + >> + /* Ignore loops with FP computation as these tend to benefit much more >> + consistently from unrolling. */ >> + if (desc->has_fp) >> + return nunroll; >> + >> + frequent_iteration_threshold = PARAM_VALUE >> (PARAM_MIN_ITER_UNROLL_WITH_BRANCHES); >> + if (expected_loop_iterations (loop) >= (unsigned) >> frequent_iteration_threshold) >> + return nunroll; >> + >> + /* If there was no profile feedback data, av_num_branches will be 0 >> + and we won't limit unrolling. If the av_num_branches is at most 1, >> + also don't limit unrolling as the back-edge branch will not be >> duplicated. */ >> + if (desc->av_num_branches <= 1) >> + return nunroll; >> + >> + /* Walk up the loop tree until we find a hot outer loop in which the >> current >> + loop is nested. At that point we will compute the number of times >> the >> + current loop can be unrolled based on the number of branches in the >> hot >> + outer loop. */ >> + outer = loop_outer (loop); >> + /* The loop structure contains a fake outermost loop, so this should >> always >> + be non-NULL for our current loop. */ >> + gcc_assert (outer); >> + >> + /* Walk up the loop tree until we either find a hot outer loop or hit >> the >> + fake outermost loop at the root. */ >> + while (true) >> + { >> + outer_desc = get_simple_loop_desc (outer); >> + >> + /* Stop if we hit the fake outermost loop at the root of the tree, >> + which includes the whole procedure. */ >> + if (!loop_outer (outer)) >> + break; >> + >> + if (outer_desc->const_iter) >> + outer_niters *= outer_desc->niter; >> + else if (outer->header->count) >> + outer_niters *= expected_loop_iterations (outer); >> + >> + /* If the outer loop has enough iterations to be considered hot, >> then >> + we can stop our upwards loop tree traversal and examine the >> current >> + outer loop. */ >> + if (outer_niters >= frequent_iteration_threshold) >> + break; >> + >> + outer = loop_outer (outer); >> + } >> + >> + gcc_assert(outer); >> + >> + /* Assume that any call will cause the branch budget to be exceeded, >> + and that we can't unroll the current loop without increasing >> + mispredicts. */ >> + if (outer_desc->has_call) >> + return 0; >> + >> + /* Otherwise, compute the maximum number of times current loop can be >> + unrolled without exceeding our branch budget. First we subtract >> + off the outer loop's average branch count from the budget. Note >> + that this includes the branches in the current loop. This yields >> + the number of branches left in the budget for the unrolled copies. >> + We divide this by the number of branches in the current loop that >> + must be duplicated when we unroll, which is the total average >> + number of branches minus the back-edge branch. This yields the >> + number of new loop body copies that can be created by unrolling >> + without exceeding the budget, to which we add 1 to get the unroll >> + factor. Note that the "outermost loop" may be the whole procedure >> + if we did not find a hot enough enclosing loop. */ >> + branch_budget = PARAM_VALUE (PARAM_UNROLL_OUTER_LOOP_BRANCH_BUDGET); >> + if (outer_desc->av_num_branches > branch_budget) >> + return 0; >> + /* We already returned early if desc->av_num_branches <= 1. */ >> + return (branch_budget - outer_desc->av_num_branches) >> + / (desc->av_num_branches - 1) + 1; >> +} >> + >> /* Unroll and/or peel (depending on FLAGS) LOOPS. */ >> void >> unroll_and_peel_loops (int flags) >> @@ -522,6 +615,7 @@ static void >> decide_unroll_constant_iterations (struct loop *loop, int flags) >> { >> unsigned nunroll, nunroll_by_av, best_copies, best_unroll = 0, n_copies, >> i; >> + unsigned nunroll_branches; >> struct niter_desc *desc; >> >> if (!(flags & UAP_UNROLL)) >> @@ -565,6 +659,21 @@ decide_unroll_constant_iterations (struct loop *lo >> return; >> } >> >> + /* Be careful when unrolling loops with branches inside -- it can >> increase >> + the number of mispredicts. */ >> + if (desc->num_branches > 1) >> + { >> + nunroll_branches = max_unroll_with_branches (loop, nunroll); >> + if (nunroll > nunroll_branches) >> + nunroll = nunroll_branches; >> + if (nunroll <= 1) >> + { >> + if (dump_file) >> + fprintf (dump_file, ";; Not unrolling, contains branches\n"); >> + return; >> + } >> + } >> + >> /* Check whether the loop rolls enough to consider. */ >> if (desc->niter < 2 * nunroll) >> { >> @@ -802,7 +911,7 @@ unroll_loop_constant_iterations (struct loop *loop >> static void >> decide_unroll_runtime_iterations (struct loop *loop, int flags) >> { >> - unsigned nunroll, nunroll_by_av, i; >> + unsigned nunroll, nunroll_by_av, nunroll_branches, i; >> struct niter_desc *desc; >> >> if (!(flags & UAP_UNROLL)) >> @@ -856,6 +965,21 @@ decide_unroll_runtime_iterations (struct loop *loo >> return; >> } >> >> + /* Be careful when unrolling loops with branches inside -- it can >> increase >> + the number of mispredicts. */ >> + if (desc->num_branches > 1) >> + { >> + nunroll_branches = max_unroll_with_branches (loop, nunroll); >> + if (nunroll > nunroll_branches) >> + nunroll = nunroll_branches; >> + if (nunroll <= 1) >> + { >> + if (dump_file) >> + fprintf (dump_file, ";; Not unrolling, contains branches\n"); >> + return; >> + } >> + } >> + >> /* If we have profile feedback, check whether the loop rolls. */ >> if ((loop->header->count >> && expected_loop_iterations (loop) < 2 * nunroll) >> @@ -1233,7 +1357,7 @@ decide_peel_simple (struct loop *loop, int flags) >> >> /* Do not simply peel loops with branches inside -- it increases number >> of mispredicts. */ >> - if (num_loop_branches (loop) > 1) >> + if (desc->num_branches > 1) >> { >> if (dump_file) >> fprintf (dump_file, ";; Not peeling, contains branches\n"); >> @@ -1394,7 +1518,7 @@ decide_unroll_stupid (struct loop *loop, int flags >> >> /* Do not unroll loops with branches inside -- it increases number >> of mispredicts. */ >> - if (num_loop_branches (loop) > 1) >> + if (desc->num_branches > 1) >> { >> if (dump_file) >> fprintf (dump_file, ";; Not unrolling, contains branches\n"); >> Index: loop-iv.c >> =================================================================== >> --- loop-iv.c (revision 187013) >> +++ loop-iv.c (working copy) >> @@ -2948,8 +2948,12 @@ get_simple_loop_desc (struct loop *loop) >> /* At least desc->infinite is not always initialized by >> find_simple_loop_exit. */ >> desc = XCNEW (struct niter_desc); >> - iv_analysis_loop_init (loop); >> - find_simple_exit (loop, desc); >> + if (loop->latch != EXIT_BLOCK_PTR) >> + { >> + iv_analysis_loop_init (loop); >> + find_simple_exit (loop, desc); >> + } >> + analyze_loop_insns (loop, desc); >> loop->aux = desc; >> >> if (desc->simple_p && (desc->assumptions || desc->infinite)) >> Index: cfgloop.c >> =================================================================== >> --- cfgloop.c (revision 187013) >> +++ cfgloop.c (working copy) >> @@ -1156,24 +1156,98 @@ get_loop_exit_edges (const struct loop *loop) >> return edges; >> } >> >> -/* Counts the number of conditional branches inside LOOP. */ >> +/* Determine if INSN is a floating point set. */ >> >> -unsigned >> -num_loop_branches (const struct loop *loop) >> +static bool >> +insn_has_fp_set(rtx insn) >> { >> - unsigned i, n; >> - basic_block * body; >> + int i; >> + rtx pat = PATTERN(insn); >> + if (GET_CODE (pat) == SET) >> + return (FLOAT_MODE_P (GET_MODE (SET_DEST (pat)))); >> + else if (GET_CODE (pat) == PARALLEL) >> + { >> + for (i = 0; i < XVECLEN (pat, 0); i++) >> + { >> + rtx sub = XVECEXP (pat, 0, i); >> + if (GET_CODE (sub) == SET) >> + return (FLOAT_MODE_P (GET_MODE (SET_DEST (sub)))); >> + } >> + } >> + return false; >> +} >> >> - gcc_assert (loop->latch != EXIT_BLOCK_PTR); >> +/* Analyzes the instructions inside LOOP, updating the DESC. Currently >> counts >> + the number of conditional branch instructions, calls and fp >> instructions, >> + as well as the average number of branches executed per iteration. */ >> >> +void >> +analyze_loop_insns (const struct loop *loop, struct niter_desc *desc) >> +{ >> + unsigned i, nbranch; >> + gcov_type weighted_nbranch; >> + bool has_call, has_fp; >> + basic_block * body, bb; >> + rtx insn; >> + gcov_type header_count = loop->header->count; >> + >> + nbranch = weighted_nbranch = 0; >> + has_call = has_fp = false; >> + >> body = get_loop_body (loop); >> - n = 0; >> for (i = 0; i < loop->num_nodes; i++) >> - if (EDGE_COUNT (body[i]->succs) >= 2) >> - n++; >> + { >> + bb = body[i]; >> + >> + if (EDGE_COUNT (bb->succs) >= 2) >> + { >> + nbranch++; >> + >> + /* If this block is executed less frequently than the header >> (loop >> + entry), then it is weighted based on its execution count, >> which >> + will be turned into a ratio compared to the loop header >> below. */ >> + if (bb->count < header_count) >> + weighted_nbranch += bb->count; >> + >> + /* When it is executed more frequently than the header (i.e. it >> is >> + in a nested inner loop), simply weight the branch the same >> as the >> + header execution count, so that it will contribute 1 branch >> to >> + the ratio computed below. */ >> + else >> + weighted_nbranch += header_count; >> + } >> + >> + /* No need to iterate through the instructions below if >> + both flags have already been set. */ >> + if (has_call && has_fp) >> + continue; >> + >> + FOR_BB_INSNS (bb, insn) >> + { >> + if (!INSN_P (insn)) >> + continue; >> + >> + if (!has_call) >> + has_call = CALL_P (insn); >> + >> + if (!has_fp) >> + has_fp = insn_has_fp_set (insn); >> + } >> + } >> free (body); >> >> - return n; >> + desc->num_branches = nbranch; >> + /* Now divide the weights computed above by the loop header execution >> count, >> + to compute the average number of branches through the loop. By >> adding >> + header_count/2 to the numerator we round to nearest with integer >> + division. */ >> + if (header_count != 0) >> + desc->av_num_branches >> + = (weighted_nbranch + header_count/2) / header_count; >> + else >> + desc->av_num_branches = 0; >> + desc->has_call = has_call; >> + desc->has_fp = has_fp; >> } >> >> /* Adds basic block BB to LOOP. */ >> Index: cfgloop.h >> =================================================================== >> --- cfgloop.h (revision 187013) >> +++ cfgloop.h (working copy) >> @@ -249,7 +249,6 @@ extern basic_block *get_loop_body_in_custom_order >> >> extern VEC (edge, heap) *get_loop_exit_edges (const struct loop *); >> edge single_exit (const struct loop *); >> -extern unsigned num_loop_branches (const struct loop *); >> >> extern edge loop_preheader_edge (const struct loop *); >> extern edge loop_latch_edge (const struct loop *); >> @@ -359,7 +358,8 @@ struct rtx_iv >> }; >> >> /* The description of an exit from the loop and of the number of >> iterations >> - till we take the exit. */ >> + till we take the exit. Also includes other information used primarily >> + by the loop unroller. */ >> >> struct niter_desc >> { >> @@ -400,6 +400,18 @@ struct niter_desc >> >> /* The number of iterations of the loop. */ >> rtx niter_expr; >> + >> + /* The number of branches in the loop. */ >> + unsigned num_branches; >> + >> + /* The number of executed branches per iteration. */ >> + unsigned av_num_branches; >> + >> + /* Whether the loop contains a call instruction. */ >> + bool has_call; >> + >> + /* Whether the loop contains fp instructions. */ >> + bool has_fp; >> }; >> >> extern void iv_analysis_loop_init (struct loop *); >> @@ -413,6 +425,7 @@ extern void iv_analysis_done (void); >> >> extern struct niter_desc *get_simple_loop_desc (struct loop *loop); >> extern void free_simple_loop_desc (struct loop *loop); >> +void analyze_loop_insns (const struct loop *, struct niter_desc *desc); >> >> static inline struct niter_desc * >> simple_loop_desc (struct loop *loop) >> Index: params.def >> =================================================================== >> --- params.def (revision 187013) >> +++ params.def (working copy) >> @@ -312,6 +312,15 @@ DEFPARAM(PARAM_MAX_UNROLL_ITERATIONS, >> "The maximum depth of a loop nest we completely peel", >> 8, 0, 0) >> >> +DEFPARAM(PARAM_MIN_ITER_UNROLL_WITH_BRANCHES, >> + "min-iter-unroll-with-branches", >> + "Minimum iteration count to ignore branch effects when unrolling", >> + 50, 0, 0) >> +DEFPARAM(PARAM_UNROLL_OUTER_LOOP_BRANCH_BUDGET, >> + "unroll-outer-loop-branch-budget", >> + "Maximum number of branches allowed in hot outer loop region after >> unroll", >> + 25, 0, 0) >> + >> /* The maximum number of insns of an unswitched loop. */ >> DEFPARAM(PARAM_MAX_UNSWITCH_INSNS, >> "max-unswitch-insns", >> >> -- >> This patch is available for review at >> http://codereview.appspot.com/6099055 > > > > > -- > Teresa Johnson | Software Engineer | tejohn...@google.com | 408-460-2413
-- Teresa Johnson | Software Engineer | tejohn...@google.com | 408-460-2413