> -----Original Message-----
> From: Richard Guenther [mailto:richard.guent...@gmail.com]
> Sent: Tuesday, September 04, 2012 6:31 PM
> To: Zhenqiang Chen
> Cc: Steven Bosscher; gcc-patches@gcc.gnu.org
> Subject: Re: Ping: [PATCH] Enable bbro for -Os
>
> On Wed, Aug 29, 2012 at 10:42 AM, Zhenqiang Chen
> <zhenqiang.c...@arm.com> wrote:
> >> -----Original Message-----
> >> From: Steven Bosscher [mailto:stevenb....@gmail.com]
> >> Sent: Friday, August 24, 2012 8:17 PM
> >> To: Zhenqiang Chen
> >> Cc: gcc-patches@gcc.gnu.org
> >> Subject: Re: Ping: [PATCH] Enable bbro for -Os
> >>
> >> On Wed, Aug 22, 2012 at 8:49 AM, Zhenqiang Chen
> >> <zhenqiang.c...@arm.com> wrote:
> >> >> The patch is to enable bbro for -Os. When optimizing for size, it
> >> >>   * avoids duplicating blocks;
> >> >>   * keeps the original order if there is no chance to fall through;
> >> >>   * ignores edge frequency and probability;
> >> >>   * handles a predecessor first if its index is smaller, to break
> >> >>     long traces.
> >>
> >> You do this by inserting the index as a key. I don't fully understand
> >> this change. You're assuming that a block with a lower index has a
> >> lower pre-order number in the CFG's DFS spanning tree, IIUC (i.e.
> >> the blocks are numbered sequentially)? I'm not sure that's always
> >> true. I think you should add an explanation for this heuristic.
> >
> > Thank you for the comments.
> >
> > cleanup_cfg is called at the end of cfg_layout_initialize, before
> > reorder_basic_blocks. cleanup_cfg does a lot of optimization on the
> > CFG and renumbers the basic blocks. After cleanup_cfg, the blocks are
> > roughly numbered sequentially.
>
> Well, sequentially in their current "order", which is not in any way
> flow-controlled.
Yip. The order is not flow-controlled. But during debugging I found the
order is quite good for code size: logs show a code size improvement
with cleanup_cfg alone, i.e. without reorder_basic_blocks. That order is
not the final result, though; it is changed by cfg_layout_finalize. The
patch tries to keep the order and connect some fall-through edges.

> @@ -530,10 +544,11 @@ find_traces_1_round (int branch_th, int exec_th,
> gcov_type count_th,
>        }
>
>        /* Edge that cannot be fallthru or improbable or infrequent
> -         successor (i.e. it is unsuitable successor).  */
> +         successor (i.e. it is unsuitable successor).
> +         For size, ignore the frequency and probability.  */
>        if (!(e->flags & EDGE_CAN_FALLTHRU) || (e->flags & EDGE_COMPLEX)
> -          || prob < branch_th || EDGE_FREQUENCY (e) < exec_th
> -          || e->count < count_th)
> +          || (prob < branch_th || EDGE_FREQUENCY (e) < exec_th
> +              || e->count < count_th) && !for_size)
>          continue;
>
> why that change? It seems you do re-orderings that would not be done
> with -Os even though your goal was to preserve the original ordering.

The change just ignores the frequency and probability for -Os.
Connecting EDGE_CAN_FALLTHRU edges gives more chances to reduce size;
there is no reason to skip them.

> +      /* Wait for the predecessors.  */
> +      if ((e == best_edge) && for_size
> +          && (EDGE_COUNT (best_edge->dest->succs) > 1
> +              || EDGE_COUNT (best_edge->dest->preds) > 1))
> +        {
> +          best_edge = NULL;
> +        }
>
> I don't understand this (well, I'm not very familiar with bb-reorder),
> doesn't that mean you rather want to push this block to the next round?

The change is to break long traces to reduce long jumps. It does not
connect the best_edge immediately; the block is pushed to the next
round, which gives the predecessor whose index is smaller a chance to
be handled first.

Take an if-then-else as an example:

      A
     / \
    B   C
     \ /
      D

Without this change, the final order might be

  * ABD ... C   // C is at the end of the program; C-->D might need a long jump
  * ACD ... B   // B is at the end of the program; B-->D might need a long jump

But from a code size view, ABCD/ACBD is better since it reduces the
possibility of long jumps. The change is to generate such code: in this
example we ignore best_edge B-->D and C-->D, put D into the next round,
and then the block with the lower index is selected. (I'm not familiar
with cfg-cleanup; my logs show D's index is always greater than B's and
C's.)

> Overall I think this patch looks like adding hacks into the heuristics
> to make it sane for -Os instead of doing what the comment suggests:

I think I follow the comments:

  * "minimize the combined size": in the old heuristics there is a long,
    performance-sensitive trace from entry to exit. With my patch it is
    broken into small ones. Logs show that with the patch most traces
    have only one or two blocks; few have three.
  * "more or less automatically remove extra jumps": by ignoring the
    frequency and probability, the patch has more chances to fall
    through. When there is no chance, it just keeps the original order.
  * "long jumps": as mentioned above, the patch has less chance to
    introduce long jumps.

> -  /* Don't reorder blocks when optimizing for size because extra jump
> -     insns may be created; also barrier may create extra padding.
>
> -     More correctly we should have a block reordering mode that tried to
> -     minimize the combined size of all the jumps.  This would more or less
> -     automatically remove extra jumps, but would also try to use more short
> -     jumps instead of long jumps.  */
> -  if (!optimize_function_for_speed_p (cfun))
> -    return false;
>
> So, can you categorize the reorderings that you see are profitable? How
> much gain do you get when you try to minimize the number of jumps
> (thus, maximize the number of fallthrus?)

cfg_cleanup does the key optimizations. The patch improves
cfg_cleanup's result by
  * connecting fall-through edges to keep the better result of
    cfg_cleanup, and
  * reducing jumps to loop headers.
To reuse the current framework, the patch adds code to break long
traces in order to reduce long jumps.

Here are the CSiBE code size benchmark results:
  * For ARM, code size is reduced by 0.21%.
  * For MIPS, code size is reduced by 0.25%.
  * For PPC, code size is reduced by 0.33%.
  * For X86, code size is reduced by 0.22%.

Thanks!
-Zhenqiang