> -----Original Message-----
> From: Richard Guenther [mailto:richard.guent...@gmail.com]
> Sent: Tuesday, September 04, 2012 6:31 PM
> To: Zhenqiang Chen
> Cc: Steven Bosscher; gcc-patches@gcc.gnu.org
> Subject: Re: Ping: [PATCH] Enable bbro for -Os
> 
> On Wed, Aug 29, 2012 at 10:42 AM, Zhenqiang Chen
> <zhenqiang.c...@arm.com> wrote:
> >> -----Original Message-----
> >> From: Steven Bosscher [mailto:stevenb....@gmail.com]
> >> Sent: Friday, August 24, 2012 8:17 PM
> >> To: Zhenqiang Chen
> >> Cc: gcc-patches@gcc.gnu.org
> >> Subject: Re: Ping: [PATCH] Enable bbro for -Os
> >>
> >> On Wed, Aug 22, 2012 at 8:49 AM, Zhenqiang Chen
> >> <zhenqiang.c...@arm.com> wrote:
> >> >> The patch is to enable bbro for -Os. When optimizing for size, it
> >> >> * avoid duplicating block.
> >> >> * keep its original order if there is no chance to fall through.
> >> >> * ignore edge frequency and probability.
> >> >> * handle predecessor first if its index is smaller to break long trace.
> >>
> >> You do this by inserting the index as a key. I don't fully understand
> >> this change. You're assuming that a block with a lower index has a
> >> lower pre- order number in the CFG's DFS spanning tree, IIUC (i.e.
> >> the blocks are numbered sequentially)? I'm not sure that's always
> >> true. I think you
> > should
> >> add an explanation for this heuristic.
> >
> > Thank you for the comments.
> >
> > cleanup_cfg is called at the end cfg_layout_initialize before
> > reorder_basic_blocks. cleanup_cfg does lots of optimization on cfg and
> > renumber the basic blocks. After cleanup_cfg, the blocks are roughly
> > numbered sequentially.
> 
> Well, sequentially in their current "order" which is not in any way flow-
> controlled.

Yip. The order is not flow-controlled, but during debugging I found it is
quite good for code size: logs show a code size improvement with cleanup_cfg
alone, without reorder_basic_blocks. That order is not the final result,
though; it is changed after cfg_layout_finalize. The patch tries to keep the
order and connect some fall-through edges.

> @@ -530,10 +544,11 @@ find_traces_1_round (int branch_th, int exec_th, gcov_type count_th,
>                 }
> 
>               /* Edge that cannot be fallthru or improbable or infrequent
> -                successor (i.e. it is unsuitable successor).  */
> +                successor (i.e. it is unsuitable successor).
> +                For size, ignore the frequency and probability.  */
>               if (!(e->flags & EDGE_CAN_FALLTHRU) || (e->flags & EDGE_COMPLEX)
> -                 || prob < branch_th || EDGE_FREQUENCY (e) < exec_th
> -                 || e->count < count_th)
> +                 || (prob < branch_th || EDGE_FREQUENCY (e) < exec_th
> +                     || e->count < count_th) && !for_size)
>                 continue;
> 
> why that change?  It seems you do re-orderings that would not be done with
> -Os even though your goal was to preserve the original ordering.

The change just ignores the frequency and probability for -Os. Connecting
EDGE_CAN_FALLTHRU edges gives more chances to reduce size, so there is no
reason to skip them.

> +             /* Wait for the predecessors.  */
> +             if ((e == best_edge) && for_size
> +                 && (EDGE_COUNT (best_edge->dest->succs) > 1
> +                     || EDGE_COUNT (best_edge->dest->preds) > 1))
> +               {
> +                 best_edge = NULL;
> +               }
> 
> I don't understand this (well, I'm not very familiar with bb-reorder), doesn't
> that mean you rather want to push this block to the next round?

The change is to break long traces to reduce long jumps. It does not connect
the best_edge immediately; instead it pushes the block to the next round so
that the predecessor with the smaller index gets a chance to be handled first.
Take an if-then-else as an example:
      A
     / \
    B   C
     \ /
      D
Without this change, the final order might be
* ABD ... C  // C is at the end of the program and may need a long jump C-->D
* ACD ... B  // B is at the end of the program and may need a long jump B-->D

From a code size point of view, ABCD/ACBD is better since it reduces the
possibility of long jumps, and the change is there to generate such code. In
this example we ignore the best_edges B-->D and C-->D and put D into the next
round, so the block with the smaller index is selected first. (I'm not
familiar with cfg-cleanup, but my logs show D's index is always greater than
B's and C's.)

> Overall I think this patch looks like adding hacks into the heuristics to make it
> sane for -Os instead of doing what the comment suggests:

I think I follow the comments:
* "minimize the combined size": With the old heuristics there is one long,
performance-sensitive trace from entry to exit. With my patch it is broken
into small ones; logs show that most traces have only one or two blocks, and
few have three.
* "more or less automatically remove extra jumps": By ignoring the frequency
and probability, the patch has more chances to fall through. When there is no
chance, it just keeps the original order.
* "long jumps": As mentioned above, the patch is less likely to introduce
long jumps.

> -  /* Don't reorder blocks when optimizing for size because extra jump insns may
> -     be created; also barrier may create extra padding.
> 
> -     More correctly we should have a block reordering mode that tried to
> -     minimize the combined size of all the jumps.  This would more or less
> -     automatically remove extra jumps, but would also try to use more short
> -     jumps instead of long jumps.  */
> -  if (!optimize_function_for_speed_p (cfun))
> -    return false;
> 
> So, can you categorize the reorderings that you see are profitable?  How
> much gain do you get when you try to minimize the number of jumps (thus,
> maximize the number of fallthrus?)

cfg_cleanup does the key optimizations. The patch improves cfg_cleanup's
result by
* connecting fall-through edges to preserve cfg_cleanup's better ordering, and
* reducing jumps to loop headers (see the sketch below).
To reuse the current framework, the patch adds code to break long traces and
so reduce long jumps.
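
As an illustration only (the source below is made up, not part of the patch),
whether a jump to the loop header is emitted depends purely on block
placement: if the layout keeps the loop header right after its preheader, the
preheader falls through into the loop and no extra jump is needed; if the
reordering pass moves the loop elsewhere, an unconditional jump into the loop
(and possibly one out of it) is emitted instead.

/* Hypothetical do-while loop: its header is also its only body block,
   so keeping it adjacent to the code before it makes the entry edge a
   fall-through rather than a jump to the loop header.  */
int
count_down (int n)
{
  int steps = 0;
  do                  /* loop header == loop body block             */
    {
      steps++;
      n--;
    }
  while (n > 0);      /* conditional back edge to the header        */
  return steps;       /* exit block, ideally a fall-through as well */
}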

Here are the CSiBE code size benchmark results:
* For ARM, code size is reduced by 0.21%.
* For MIPS, code size is reduced by 0.25%.
* For PPC, code size is reduced by 0.33%.
* For X86, code size is reduced by 0.22%.

Thanks!
-Zhenqiang


