On Fri, Nov 9, 2018 at 6:23 PM Sudakshina Das <sudi....@arm.com> wrote: > > Hi > > I am posting this patch on behalf of Carey (cc'ed). I also have some > review comments that I will make as a reply to this later. > > > This implements a new AArch64 specific back-end pass that helps optimize > branch-dense code, which can be a bottleneck for performance on some Arm > cores. This is achieved by padding out the branch-dense sections of the > instruction stream with nops.
Wouldn't this be more suitable for implementing inside the assembler? > This has proven to show up to a 2.61%~ improvement on the Cortex A-72 > (SPEC CPU 2006: sjeng). > > The implementation includes the addition of a new RTX instruction class > FILLER_INSN, which has been white listed to allow placement of NOPs > outside of a basic block. This is to allow padding after unconditional > branches. This is favorable so that any performance gained from > diluting branches is not paid straight back via excessive eating of nops. > > It was deemed that a new RTX class was less invasive than modifying > behavior in regards to standard UNSPEC nops. > > ## Command Line Options > > Three new target-specific options are provided: > - mbranch-dilution > - mbranch-dilution-granularity={num} > - mbranch-dilution-max-branches={num} > > A number of cores known to be able to benefit from this pass have been > given default tuning values for their granularity and max-branches. > Each affected core has a very specific granule size and associated > max-branch limit. This is a microarchitecture specific optimization. > Typical usage should be -mdilute-branches with a specificed -mcpu. Cores > with a granularity tuned to 0 will be ignored. Options are provided for > experimentation. > > ## Algorithm and Heuristic > > The pass takes a very simple 'sliding window' approach to the problem. > We crawl through each instruction (starting at the first branch) and > keep track of the number of branches within the current "granule" (or > window). When this exceeds the max-branch value, the pass will dilute > the current granule, inserting nops to push out some of the branches. > The heuristic will favour unconditonal branches (for performance > reasons), or branches that are between two other branches (in order to > decrease the likelihood of another dilution call being needed). > > Each branch type required a different method for nop insertion due to > RTL/basic_block restrictions: > > - Returning calls do not end a basic block so can be handled by emitting > a generic nop. > - Unconditional branches must be the end of a basic block, and nops > cannot be outside of a basic block. > Thus the need for FILLER_INSN, which allows placement outside of a > basic block - and translates to a nop. > - For most conditional branches we've taken a simple approach and only > handle the fallthru edge for simplicity, > which we do by inserting a "nop block" of nops on the fallthru edge, > mapping that back to the original destination block. > - asm gotos and pcsets are going to be tricky to analyse from a dilution > perspective so are ignored at present. > > > ## Changelog > > gcc/testsuite/ChangeLog: > > 2018-11-09 Carey Williams <carey.willi...@arm.com> > > * gcc.target/aarch64/branch-dilution-off.c: New test. > * gcc.target/aarch64/branch-dilution-on.c: New test. > > > gcc/ChangeLog: > > 2018-11-09 Carey Williams <carey.willi...@arm.com> > > * cfgbuild.c (inside_basic_block_p): Add FILLER_INSN case. > * cfgrtl.c (rtl_verify_bb_layout): Whitelist FILLER_INSN outside > basic blocks. > * config.gcc (extra_objs): Add aarch64-branch-dilution.o. > * config/aarch64/aarch64-branch-dilution.c: New file. > * config/aarch64/aarch64-passes.def (branch-dilution): Register > pass. > * config/aarch64/aarch64-protos.h (struct tune_params): Declare > tuning parameters bdilution_gsize and bdilution_maxb. > (make_pass_branch_dilution): New declaration. > * config/aarch64/aarch64.c (generic_tunings,cortexa35_tunings, > cortexa53_tunings,cortexa57_tunings,cortexa72_tunings, > cortexa73_tunings,exynosm1_tunings,thunderxt88_tunings, > thunderx_tunings,tsv110_tunings,xgene1_tunings, > qdf24xx_tunings,saphira_tunings,thunderx2t99_tunings): > Provide default tunings for bdilution_gsize and bdilution_maxb. > * config/aarch64/aarch64.md (filler_insn): Define new insn. > * config/aarch64/aarch64.opt (mbranch-dilution, > mbranch-dilution-granularity, > mbranch-dilution-max-branches): Define new branch dilution > options. > * config/aarch64/t-aarch64 (aarch64-branch-dilution.c): New rule > for aarch64-branch-dilution.c. > * coretypes.h (rtx_filler_insn): New rtx class. > * doc/invoke.texi (mbranch-dilution, > mbranch-dilution-granularity, > mbranch-dilution-max-branches): Document branch dilution > options. > * emit-rtl.c (emit_filler_after): New emit function. > * rtl.def (FILLER_INSN): New RTL EXPR of type RTX_INSN. > * rtl.h (class GTY): New class for rtx_filler_insn. > (is_a_helper ::test): New test helper for rtx_filler_insn. > (macro FILLER_INSN_P(X)): New predicate. > * target-insns.def (filler_insn): Add target insn def. > > ### Testing > - Successful compilation of 3 stage bootstrap with the pass forced on > (for stage 2, 3) > - No additional compilation failures (SPEC CPU 2006 and SPEC CPU 2017) > - No 'make check' regressions > > Is this ok for trunk? > > Thanks > Sudi