Re: [PATCH, GCC, AArch64] Branch Dilution Pass

Kyrill Tkachov Mon, 12 Nov 2018 07:14:15 -0800

Hi Richard,

On 12/11/18 14:13, Richard Biener wrote:

On Fri, Nov 9, 2018 at 6:23 PM Sudakshina Das <sudi....@arm.com> wrote:
>
> Hi
>
> I am posting this patch on behalf of Carey (cc'ed). I also have some
> review comments that I will make as a reply to this later.
>
>
> This implements a new AArch64 specific back-end pass that helps optimize
> branch-dense code, which can be a bottleneck for performance on some Arm
> cores. This is achieved by padding out the branch-dense sections of the
> instruction stream with nops.


Wouldn't this be more suitable for implementing inside the assembler?


The number of NOPs to insert to get the performance benefits varies from core 
to core,
I don't think we want to add such CPU-specific optimisation logic to the 
assembler.

Thanks,
Kyrill

> This has proven to show up to a 2.61%~ improvement on the Cortex A-72
> (SPEC CPU 2006: sjeng).
>
> The implementation includes the addition of a new RTX instruction class
> FILLER_INSN, which has been white listed to allow placement of NOPs
> outside of a basic block. This is to allow padding after unconditional
> branches. This is favorable so that any performance gained from
> diluting branches is not paid straight back via excessive eating of nops.
>
> It was deemed that a new RTX class was less invasive than modifying
> behavior in regards to standard UNSPEC nops.
>
> ## Command Line Options
>
> Three new target-specific options are provided:
> - mbranch-dilution
> - mbranch-dilution-granularity={num}
> - mbranch-dilution-max-branches={num}
>
> A number of cores known to be able to benefit from this pass have been
> given default tuning values for their granularity and max-branches.
> Each affected core has a very specific granule size and associated
> max-branch limit. This is a microarchitecture specific optimization.
> Typical usage should be -mdilute-branches with a specificed -mcpu. Cores
> with a granularity tuned to 0 will be ignored. Options are provided for
> experimentation.
>
> ## Algorithm and Heuristic
>
> The pass takes a very simple 'sliding window' approach to the problem.
> We crawl through each instruction (starting at the first branch) and
> keep track of the number of branches within the current "granule" (or
> window). When this exceeds the max-branch value, the pass will dilute
> the current granule, inserting nops to push out some of the branches.
> The heuristic will favour unconditonal branches (for performance
> reasons), or branches that are between two other branches (in order to
> decrease the likelihood of another dilution call being needed).
>
> Each branch type required a different method for nop insertion due to
> RTL/basic_block restrictions:
>
> - Returning calls do not end a basic block so can be handled by emitting
> a generic nop.
> - Unconditional branches must be the end of a basic block, and nops
> cannot be outside of a basic block.
>    Thus the need for FILLER_INSN, which allows placement outside of a
> basic block - and translates to a nop.
> - For most conditional branches we've taken a simple approach and only
> handle the fallthru edge for simplicity,
>    which we do by inserting a "nop block" of nops on the fallthru edge,
> mapping that back to the original destination block.
> - asm gotos and pcsets are going to be tricky to analyse from a dilution
> perspective so are ignored at present.
>
>
> ## Changelog
>
> gcc/testsuite/ChangeLog:
>
> 2018-11-09  Carey Williams <carey.willi...@arm.com>
>
>         * gcc.target/aarch64/branch-dilution-off.c: New test.
>         * gcc.target/aarch64/branch-dilution-on.c: New test.
>
>
> gcc/ChangeLog:
>
> 2018-11-09  Carey Williams <carey.willi...@arm.com>
>
>         * cfgbuild.c (inside_basic_block_p): Add FILLER_INSN case.
>         * cfgrtl.c (rtl_verify_bb_layout): Whitelist FILLER_INSN outside
>         basic blocks.
>         * config.gcc (extra_objs): Add aarch64-branch-dilution.o.
>         * config/aarch64/aarch64-branch-dilution.c: New file.
>         * config/aarch64/aarch64-passes.def (branch-dilution): Register
>         pass.
>         * config/aarch64/aarch64-protos.h (struct tune_params): Declare
>         tuning parameters bdilution_gsize and bdilution_maxb.
>         (make_pass_branch_dilution): New declaration.
>         * config/aarch64/aarch64.c (generic_tunings,cortexa35_tunings,
> cortexa53_tunings,cortexa57_tunings,cortexa72_tunings,
> cortexa73_tunings,exynosm1_tunings,thunderxt88_tunings,
>         thunderx_tunings,tsv110_tunings,xgene1_tunings,
> qdf24xx_tunings,saphira_tunings,thunderx2t99_tunings):
>         Provide default tunings for bdilution_gsize and bdilution_maxb.
>         * config/aarch64/aarch64.md (filler_insn): Define new insn.
>         * config/aarch64/aarch64.opt (mbranch-dilution,
>         mbranch-dilution-granularity,
>         mbranch-dilution-max-branches): Define new branch dilution
>         options.
>         * config/aarch64/t-aarch64 (aarch64-branch-dilution.c): New rule
>         for aarch64-branch-dilution.c.
>         * coretypes.h (rtx_filler_insn): New rtx class.
>         * doc/invoke.texi (mbranch-dilution,
>         mbranch-dilution-granularity,
>         mbranch-dilution-max-branches): Document branch dilution
>         options.
>         * emit-rtl.c (emit_filler_after): New emit function.
>         * rtl.def (FILLER_INSN): New RTL EXPR of type RTX_INSN.
>         * rtl.h (class GTY): New class for rtx_filler_insn.
>         (is_a_helper ::test): New test helper for rtx_filler_insn.
>         (macro FILLER_INSN_P(X)): New predicate.
>         * target-insns.def (filler_insn): Add target insn def.
>
> ### Testing
> - Successful compilation of 3 stage bootstrap with the pass forced on
> (for stage 2, 3)
> - No additional compilation failures (SPEC CPU 2006 and SPEC CPU 2017)
> - No 'make check' regressions
>
> Is this ok for trunk?
>
> Thanks
> Sudi

Re: [PATCH, GCC, AArch64] Branch Dilution Pass

Reply via email to