On Wed, Jul 22, 2020 at 3:10 AM Andrea Corallo <andrea.cora...@arm.com> wrote:
>
> Hi all,
>
> this second patch implements the AArch64 specific back-end pass
> 'branch-dilution' controllable by the followings command line options:
>
> -mbranch-dilution
>
> --param=aarch64-branch-dilution-granularity={num}
>
> --param=aarch64-branch-dilution-max-branches={num}
>
> Some cores known to be able to benefit from this pass have been given
> default tuning values for their granularity and max-branches.  Each
> affected core has a very specific granule size and associated max-branch
> limit.  This is a microarchitecture specific optimization.  Typical
> usage should be -mbranch-dilution with a specified -mcpu.  Cores with a
> granularity tuned to 0 will be ignored. Options are provided for
> experimentation.

Can you give a simple example of what this patch does?
Also your testcases seem too sensitive to other optimizations which
could happen.  E.g. the call to "branch (i)" could be pulled out of
the switch statement.  Or even the "*i += N;" could be moved to one
Basic block and the switch becomes just one if statement.

> Observed performance improvements on Neoverse N1 SPEC CPU 2006 where
> up to ~+3% (xalancbmk) and ~+1.5% (sjeng).  Average code size increase
> for all the testsuite proved to be ~0.4%.

Also does this improve any non-SPEC benchmarks or has it only been
benchmarked with SPEC?

A few comments about the patch itself:
> --- a/gcc/config.gcc
> +++ b/gcc/config.gcc
> @@ -321,7 +321,7 @@ aarch64*-*-*)
>  c_target_objs="aarch64-c.o"
>  cxx_target_objs="aarch64-c.o"
>  d_target_objs="aarch64-d.o"
> - extra_objs="aarch64-builtins.o aarch-common.o aarch64-sve-builtins.o 
> aarch64-sve-builtins-shapes.o aarch64-sve-builtins-base.o 
> aarch64-sve-builtins-sve2.o cortex-a57-fma-steering.o aarch64-speculation.o 
> falkor-tag-collision-avoidance.o aarch64-bti-insert.o"
> + extra_objs="aarch64-builtins.o aarch-common.o aarch64-sve-builtins.o 
> aarch64-sve-builtins-shapes.o aarch64-sve-builtins-base.o 
> aarch64-sve-builtins-sve2.o cortex-a57-fma-steering.o aarch64-speculation.o 
> falkor-tag-collision-avoidance.o aarch64-bti-insert.o 
> aarch64-branch-dilution.o"
>  target_gtfiles="\$(srcdir)/config/aarch64/aarch64-builtins.c 
> \$(srcdir)/config/aarch64/aarch64-sve-builtins.h 
> \$(srcdir)/config/aarch64/aarch64-sve-builtins.cc"
>  target_has_targetm_common=yes
>  ;;

I think it is time to change how extra_objs is done and split it
across a few lines; this should be done in a different patch, it will
simplify future additions later on.

+unsigned max_branch = 0;
+unsigned granule_size = 0;

These two really should be part of the insn_granule class.  Having
global variables is not a good idea with respect to threading of GCC.

+      dump_printf (MSG_NOTE,
+    "%d. %s%s%s  > NEXT = (%d), PREV = (%d) -- UID: %d\n",
+    insn->index, GET_RTX_NAME (GET_CODE (insn->rtx)),
+    any_uncondjump_p (insn->rtx) ? " (ubranch)" : "",
+    insn->is_nop ? " (nop)" : "",
+    insn->next ? insn->next->index : -1,
+    insn->prev ? insn->prev->index : -1, INSN_UID (insn->rtx));

This part should really be of a method of insn_info instead.

is_branch (insn->rtx)

Why not:
insn->is_branch ()

This simplifies the code and really shows what is being done.

+  if (is_branch (prev_real_nondebug_insn (insn->rtx))
+      && is_branch (next_real_nondebug_insn (insn->rtx)))

Maybe:
+  if (is_branch (insn->prev_real_nondebug_insn ())
+      && is_branch (insn->next_real_nondebug_insn ()))

+  while (current_insn && !current_insn->is_branch)
+    {
+      current_insn = current_insn->next;
+    }

Why not:
current_insn = current_insn->next_branch_insn ();

There are many more examples of where you can improve like the above;
that is the way you define insn_info can be improved and push some of
the implementation back into the insn_info definition.

Thanks,
Andrew

>
> * Algorithm and Heuristic
>
> The pass takes a very simple 'sliding window' approach to the problem.
> We crawl through each instruction (starting at the first branch) and
> keep track of the number of branches within the current "granule" (or
> window).  When this exceeds the max-branch value, the pass will dilute
> the current granule, inserting nops to push out some of the branches.
> The heuristic will favor unconditonal branches (for performance
> reasons), or branches that are between two other branches (in order to
> decrease the likelihood of another dilution call being needed).
>
> Each branch type required a different method for nop insertion due to
> RTL/basic_block restrictions:
>
> - Returning calls do not end a basic block so can be handled by
>   emitting a generic nop.
>
> - Unconditional branches must be the end of a basic block, and nops
>   cannot be outside of a basic block.  Thus the need for FILLER_INSN,
>   which allows placement outside of a basic block
>
> - and translates to a nop.
>
> - For most conditional branches we've taken a simple approach and only
>   handle the fallthru edge for simplicity, which we do by inserting a
>   "nop block" of nops on the fallthru edge, mapping that back to the
>   original destination block.
>
> - asm gotos and pcsets are going to be tricky to analyze from a
>   dilution perspective so are ignored at present.
>
> * Testing
>
> The two patches has been tested together on top of current master on
> aarch64-unknown-linux-gnu as follow:
>
> - Successful compilation of 3 stage bootstrap with the
>   pass forced on (for stage 2, 3)
>
> - No additional compilation failures (SPEC CPU 2006 and SPEC CPU 2017)
>
> - No 'make check' regressions
>
>
> Regards
>
>   Andrea
>
> gcc/ChangeLog
>
> 2020-07-17  Andrea Corallo  <andrea.cora...@arm.com>
>             Carey Williams  <carey.willi...@arm.com>
>
>         * config.gcc (extra_objs): Add aarch64-branch-dilution.o.
>         * config/aarch64/aarch64-branch-dilution.c: New file.
>         * config/aarch64/aarch64-passes.def (branch-dilution): Register
>         pass.
>         * config/aarch64/aarch64-protos.h (struct tune_params): Declare
>         tuning parameters bdilution_gsize and bdilution_maxb.
>         (make_pass_branch_dilution): New declaration.
>         * config/aarch64/aarch64.c (generic_tunings, cortexa35_tunings)
>         (cortexa53_tunings, cortexa57_tunings, cortexa72_tunings)
>         (cortexa73_tunings, exynosm1_tunings, thunderxt88_tunings)
>         (thunderx_tunings, tsv110_tunings, xgene1_tunings)
>         (qdf24xx_tunings, saphira_tunings, thunderx2t99_tunings)
>         (neoversen1_tunings): Provide default tunings for bdilution_gsize
>         and bdilution_maxb.
>         * config/aarch64/aarch64.md (filler_insn): Define new insn.
>         * config/aarch64/aarch64.opt (-mbranch-dilution)
>         (--param=aarch64-branch-dilution-granularity)
>         (--param=aarch64-branch-dilution-max-branches): Add new options.
>         * config/aarch64/t-aarch64 (aarch64-branch-dilution.c): New rule
>         for aarch64-branch-dilution.c.
>         * doc/invoke.texi (-mbranch-dilution)
>         (--param=aarch64-branch-dilution-granularity)
>         (--param=aarch64-branch-dilution-max-branches): Document branch
>         dilution options.
>
> gcc/testsuite/ChangeLog
>
> 2020-07-17  Andrea Corallo  <andrea.cora...@arm.com>
>             Carey Williams  <carey.willi...@arm.com>
>
>         * gcc.target/aarch64/branch-dilution-off.c: New file.
>         * gcc.target/aarch64/branch-dilution-on.c: New file.

Reply via email to