Hi Richard,
On 12/11/18 14:13, Richard Biener wrote:
On Fri, Nov 9, 2018 at 6:23 PM Sudakshina Das <sudi....@arm.com> wrote:
>
> Hi
>
> I am posting this patch on behalf of Carey (cc'ed). I also have some
> review comments that I will make as a reply to this later.
>
>
> This implements a new AArch64 specific back-end pass that helps optimize
> branch-dense code, which can be a bottleneck for performance on some Arm
> cores. This is achieved by padding out the branch-dense sections of the
> instruction stream with nops.
Wouldn't this be more suitable for implementing inside the assembler?
The number of NOPs to insert to get the performance benefits varies from core
to core,
I don't think we want to add such CPU-specific optimisation logic to the
assembler.
Thanks,
Kyrill
> This has proven to show up to a 2.61%~ improvement on the Cortex A-72
> (SPEC CPU 2006: sjeng).
>
> The implementation includes the addition of a new RTX instruction class
> FILLER_INSN, which has been white listed to allow placement of NOPs
> outside of a basic block. This is to allow padding after unconditional
> branches. This is favorable so that any performance gained from
> diluting branches is not paid straight back via excessive eating of nops.
>
> It was deemed that a new RTX class was less invasive than modifying
> behavior in regards to standard UNSPEC nops.
>
> ## Command Line Options
>
> Three new target-specific options are provided:
> - mbranch-dilution
> - mbranch-dilution-granularity={num}
> - mbranch-dilution-max-branches={num}
>
> A number of cores known to be able to benefit from this pass have been
> given default tuning values for their granularity and max-branches.
> Each affected core has a very specific granule size and associated
> max-branch limit. This is a microarchitecture specific optimization.
> Typical usage should be -mdilute-branches with a specificed -mcpu. Cores
> with a granularity tuned to 0 will be ignored. Options are provided for
> experimentation.
>
> ## Algorithm and Heuristic
>
> The pass takes a very simple 'sliding window' approach to the problem.
> We crawl through each instruction (starting at the first branch) and
> keep track of the number of branches within the current "granule" (or
> window). When this exceeds the max-branch value, the pass will dilute
> the current granule, inserting nops to push out some of the branches.
> The heuristic will favour unconditonal branches (for performance
> reasons), or branches that are between two other branches (in order to
> decrease the likelihood of another dilution call being needed).
>
> Each branch type required a different method for nop insertion due to
> RTL/basic_block restrictions:
>
> - Returning calls do not end a basic block so can be handled by emitting
> a generic nop.
> - Unconditional branches must be the end of a basic block, and nops
> cannot be outside of a basic block.
> Thus the need for FILLER_INSN, which allows placement outside of a
> basic block - and translates to a nop.
> - For most conditional branches we've taken a simple approach and only
> handle the fallthru edge for simplicity,
> which we do by inserting a "nop block" of nops on the fallthru edge,
> mapping that back to the original destination block.
> - asm gotos and pcsets are going to be tricky to analyse from a dilution
> perspective so are ignored at present.
>
>
> ## Changelog
>
> gcc/testsuite/ChangeLog:
>
> 2018-11-09 Carey Williams <carey.willi...@arm.com>
>
> * gcc.target/aarch64/branch-dilution-off.c: New test.
> * gcc.target/aarch64/branch-dilution-on.c: New test.
>
>
> gcc/ChangeLog:
>
> 2018-11-09 Carey Williams <carey.willi...@arm.com>
>
> * cfgbuild.c (inside_basic_block_p): Add FILLER_INSN case.
> * cfgrtl.c (rtl_verify_bb_layout): Whitelist FILLER_INSN outside
> basic blocks.
> * config.gcc (extra_objs): Add aarch64-branch-dilution.o.
> * config/aarch64/aarch64-branch-dilution.c: New file.
> * config/aarch64/aarch64-passes.def (branch-dilution): Register
> pass.
> * config/aarch64/aarch64-protos.h (struct tune_params): Declare
> tuning parameters bdilution_gsize and bdilution_maxb.
> (make_pass_branch_dilution): New declaration.
> * config/aarch64/aarch64.c (generic_tunings,cortexa35_tunings,
> cortexa53_tunings,cortexa57_tunings,cortexa72_tunings,
> cortexa73_tunings,exynosm1_tunings,thunderxt88_tunings,
> thunderx_tunings,tsv110_tunings,xgene1_tunings,
> qdf24xx_tunings,saphira_tunings,thunderx2t99_tunings):
> Provide default tunings for bdilution_gsize and bdilution_maxb.
> * config/aarch64/aarch64.md (filler_insn): Define new insn.
> * config/aarch64/aarch64.opt (mbranch-dilution,
> mbranch-dilution-granularity,
> mbranch-dilution-max-branches): Define new branch dilution
> options.
> * config/aarch64/t-aarch64 (aarch64-branch-dilution.c): New rule
> for aarch64-branch-dilution.c.
> * coretypes.h (rtx_filler_insn): New rtx class.
> * doc/invoke.texi (mbranch-dilution,
> mbranch-dilution-granularity,
> mbranch-dilution-max-branches): Document branch dilution
> options.
> * emit-rtl.c (emit_filler_after): New emit function.
> * rtl.def (FILLER_INSN): New RTL EXPR of type RTX_INSN.
> * rtl.h (class GTY): New class for rtx_filler_insn.
> (is_a_helper ::test): New test helper for rtx_filler_insn.
> (macro FILLER_INSN_P(X)): New predicate.
> * target-insns.def (filler_insn): Add target insn def.
>
> ### Testing
> - Successful compilation of 3 stage bootstrap with the pass forced on
> (for stage 2, 3)
> - No additional compilation failures (SPEC CPU 2006 and SPEC CPU 2017)
> - No 'make check' regressions
>
> Is this ok for trunk?
>
> Thanks
> Sudi