Re: [PATCH, GCC, AArch64] Branch Dilution Pass

2018-11-23 Thread Carey Williams
Hi all,

Thank you for the comments. For the same reasons as stated by Kyrill and 
Richard E., I do believe it has to be the compiler that implements this. I will 
look to split the patch and address the points made by Andrew.


From: Richard Earnshaw (lists) 
Sent: 12 November 2018 15:55:58
To: Kyrill Tkachov; Richard Biener; Sudakshina Das
Cc: GCC Patches; nd; Carey Williams
Subject: Re: [PATCH, GCC, AArch64] Branch Dilution Pass

On 12/11/2018 15:13, Kyrill Tkachov wrote:
> Hi Richard,
>
> On 12/11/18 14:13, Richard Biener wrote:
>> On Fri, Nov 9, 2018 at 6:23 PM Sudakshina Das  wrote:
>> >
>> > Hi
>> >
>> > I am posting this patch on behalf of Carey (cc'ed). I also have some
>> > review comments that I will make as a reply to this later.
>> >
>> >
>> > This implements a new AArch64 specific back-end pass that helps
>> optimize
>> > branch-dense code, which can be a bottleneck for performance on some
>> Arm
>> > cores. This is achieved by padding out the branch-dense sections of the
>> > instruction stream with nops.
>>
>> Wouldn't this be more suitable for implementing inside the assembler?
>>
>
> The number of NOPs to insert to get the performance benefits varies from
> core to core,
> I don't think we want to add such CPU-specific optimisation logic to the
> assembler.

Additionally, the compiler has to keep track of branch ranges.  It can't
do this properly if the assembler is emitting more instructions than the
compiler thinks it is.

R.

>
> Thanks,
> Kyrill
>
>> > This has proven to show up to a 2.61%~ improvement on the Cortex A-72
>> > (SPEC CPU 2006: sjeng).
>> >
>> > The implementation includes the addition of a new RTX instruction class
>> > FILLER_INSN, which has been white listed to allow placement of NOPs
>> > outside of a basic block. This is to allow padding after unconditional
>> > branches. This is favorable so that any performance gained from
>> > diluting branches is not paid straight back via excessive eating of
>> nops.
>> >
>> > It was deemed that a new RTX class was less invasive than modifying
>> > behavior in regards to standard UNSPEC nops.
>> >
>> > ## Command Line Options
>> >
>> > Three new target-specific options are provided:
>> > - mbranch-dilution
>> > - mbranch-dilution-granularity={num}
>> > - mbranch-dilution-max-branches={num}
>> >
>> > A number of cores known to be able to benefit from this pass have been
>> > given default tuning values for their granularity and max-branches.
>> > Each affected core has a very specific granule size and associated
>> > max-branch limit. This is a microarchitecture specific optimization.
>> > Typical usage should be -mdilute-branches with a specificed -mcpu.
>> Cores
>> > with a granularity tuned to 0 will be ignored. Options are provided for
>> > experimentation.
>> >
>> > ## Algorithm and Heuristic
>> >
>> > The pass takes a very simple 'sliding window' approach to the problem.
>> > We crawl through each instruction (starting at the first branch) and
>> > keep track of the number of branches within the current "granule" (or
>> > window). When this exceeds the max-branch value, the pass will dilute
>> > the current granule, inserting nops to push out some of the branches.
>> > The heuristic will favour unconditonal branches (for performance
>> > reasons), or branches that are between two other branches (in order to
>> > decrease the likelihood of another dilution call being needed).
>> >
>> > Each branch type required a different method for nop insertion due to
>> > RTL/basic_block restrictions:
>> >
>> > - Returning calls do not end a basic block so can be handled by
>> emitting
>> > a generic nop.
>> > - Unconditional branches must be the end of a basic block, and nops
>> > cannot be outside of a basic block.
>> >Thus the need for FILLER_INSN, which allows placement outside of a
>> > basic block - and translates to a nop.
>> > - For most conditional branches we've taken a simple approach and only
>> > handle the fallthru edge for simplicity,
>> >which we do by inserting a "nop block" of nops on the fallthru edge,
>> > mapping that back to the original destination block.
>> > - asm gotos and pcsets are going to be tricky to analyse from a
>> dilution
>> > perspective so are ignored at present.
>> >

Re: [PATCH, GCC, AArch64] Branch Dilution Pass

2018-11-12 Thread Richard Earnshaw (lists)
On 12/11/2018 15:13, Kyrill Tkachov wrote:
> Hi Richard,
> 
> On 12/11/18 14:13, Richard Biener wrote:
>> On Fri, Nov 9, 2018 at 6:23 PM Sudakshina Das  wrote:
>> >
>> > Hi
>> >
>> > I am posting this patch on behalf of Carey (cc'ed). I also have some
>> > review comments that I will make as a reply to this later.
>> >
>> >
>> > This implements a new AArch64 specific back-end pass that helps
>> optimize
>> > branch-dense code, which can be a bottleneck for performance on some
>> Arm
>> > cores. This is achieved by padding out the branch-dense sections of the
>> > instruction stream with nops.
>>
>> Wouldn't this be more suitable for implementing inside the assembler?
>>
> 
> The number of NOPs to insert to get the performance benefits varies from
> core to core,
> I don't think we want to add such CPU-specific optimisation logic to the
> assembler.

Additionally, the compiler has to keep track of branch ranges.  It can't
do this properly if the assembler is emitting more instructions than the
compiler thinks it is.

R.

> 
> Thanks,
> Kyrill
> 
>> > This has proven to show up to a 2.61%~ improvement on the Cortex A-72
>> > (SPEC CPU 2006: sjeng).
>> >
>> > The implementation includes the addition of a new RTX instruction class
>> > FILLER_INSN, which has been white listed to allow placement of NOPs
>> > outside of a basic block. This is to allow padding after unconditional
>> > branches. This is favorable so that any performance gained from
>> > diluting branches is not paid straight back via excessive eating of
>> nops.
>> >
>> > It was deemed that a new RTX class was less invasive than modifying
>> > behavior in regards to standard UNSPEC nops.
>> >
>> > ## Command Line Options
>> >
>> > Three new target-specific options are provided:
>> > - mbranch-dilution
>> > - mbranch-dilution-granularity={num}
>> > - mbranch-dilution-max-branches={num}
>> >
>> > A number of cores known to be able to benefit from this pass have been
>> > given default tuning values for their granularity and max-branches.
>> > Each affected core has a very specific granule size and associated
>> > max-branch limit. This is a microarchitecture specific optimization.
>> > Typical usage should be -mdilute-branches with a specificed -mcpu.
>> Cores
>> > with a granularity tuned to 0 will be ignored. Options are provided for
>> > experimentation.
>> >
>> > ## Algorithm and Heuristic
>> >
>> > The pass takes a very simple 'sliding window' approach to the problem.
>> > We crawl through each instruction (starting at the first branch) and
>> > keep track of the number of branches within the current "granule" (or
>> > window). When this exceeds the max-branch value, the pass will dilute
>> > the current granule, inserting nops to push out some of the branches.
>> > The heuristic will favour unconditonal branches (for performance
>> > reasons), or branches that are between two other branches (in order to
>> > decrease the likelihood of another dilution call being needed).
>> >
>> > Each branch type required a different method for nop insertion due to
>> > RTL/basic_block restrictions:
>> >
>> > - Returning calls do not end a basic block so can be handled by
>> emitting
>> > a generic nop.
>> > - Unconditional branches must be the end of a basic block, and nops
>> > cannot be outside of a basic block.
>> >    Thus the need for FILLER_INSN, which allows placement outside of a
>> > basic block - and translates to a nop.
>> > - For most conditional branches we've taken a simple approach and only
>> > handle the fallthru edge for simplicity,
>> >    which we do by inserting a "nop block" of nops on the fallthru edge,
>> > mapping that back to the original destination block.
>> > - asm gotos and pcsets are going to be tricky to analyse from a
>> dilution
>> > perspective so are ignored at present.
>> >
>> >
>> > ## Changelog
>> >
>> > gcc/testsuite/ChangeLog:
>> >
>> > 2018-11-09  Carey Williams 
>> >
>> > * gcc.target/aarch64/branch-dilution-off.c: New test.
>> > * gcc.target/aarch64/branch-dilution-on.c: New test.
>> >
>> >
>> > gcc/ChangeLog:
>> >
>> > 2018-11-09  Carey Williams 
>> >
>> > * cfgbuild.c (inside_basic_block_p): Add FILLER_INSN case.
>> > * cfgrtl.c (rtl_verify_bb_layout): Whitelist FILLER_INSN
>> outside
>> > basic blocks.
>> > * config.gcc (extra_objs): Add aarch64-branch-dilution.o.
>> > * config/aarch64/aarch64-branch-dilution.c: New file.
>> > * config/aarch64/aarch64-passes.def (branch-dilution): Register
>> > pass.
>> > * config/aarch64/aarch64-protos.h (struct tune_params): Declare
>> > tuning parameters bdilution_gsize and bdilution_maxb.
>> > (make_pass_branch_dilution): New declaration.
>> > * config/aarch64/aarch64.c (generic_tunings,cortexa35_tunings,
>> > cortexa53_tunings,cortexa57_tunings,cortexa72_tunings,
>> > cortexa73_tunings,exynosm1_tunings,thunderxt88_tunings,
>> > 

Re: [PATCH, GCC, AArch64] Branch Dilution Pass

2018-11-12 Thread Kyrill Tkachov

Hi Richard,

On 12/11/18 14:13, Richard Biener wrote:

On Fri, Nov 9, 2018 at 6:23 PM Sudakshina Das  wrote:
>
> Hi
>
> I am posting this patch on behalf of Carey (cc'ed). I also have some
> review comments that I will make as a reply to this later.
>
>
> This implements a new AArch64 specific back-end pass that helps optimize
> branch-dense code, which can be a bottleneck for performance on some Arm
> cores. This is achieved by padding out the branch-dense sections of the
> instruction stream with nops.

Wouldn't this be more suitable for implementing inside the assembler?



The number of NOPs to insert to get the performance benefits varies from core 
to core,
I don't think we want to add such CPU-specific optimisation logic to the 
assembler.

Thanks,
Kyrill


> This has proven to show up to a 2.61%~ improvement on the Cortex A-72
> (SPEC CPU 2006: sjeng).
>
> The implementation includes the addition of a new RTX instruction class
> FILLER_INSN, which has been white listed to allow placement of NOPs
> outside of a basic block. This is to allow padding after unconditional
> branches. This is favorable so that any performance gained from
> diluting branches is not paid straight back via excessive eating of nops.
>
> It was deemed that a new RTX class was less invasive than modifying
> behavior in regards to standard UNSPEC nops.
>
> ## Command Line Options
>
> Three new target-specific options are provided:
> - mbranch-dilution
> - mbranch-dilution-granularity={num}
> - mbranch-dilution-max-branches={num}
>
> A number of cores known to be able to benefit from this pass have been
> given default tuning values for their granularity and max-branches.
> Each affected core has a very specific granule size and associated
> max-branch limit. This is a microarchitecture specific optimization.
> Typical usage should be -mdilute-branches with a specificed -mcpu. Cores
> with a granularity tuned to 0 will be ignored. Options are provided for
> experimentation.
>
> ## Algorithm and Heuristic
>
> The pass takes a very simple 'sliding window' approach to the problem.
> We crawl through each instruction (starting at the first branch) and
> keep track of the number of branches within the current "granule" (or
> window). When this exceeds the max-branch value, the pass will dilute
> the current granule, inserting nops to push out some of the branches.
> The heuristic will favour unconditonal branches (for performance
> reasons), or branches that are between two other branches (in order to
> decrease the likelihood of another dilution call being needed).
>
> Each branch type required a different method for nop insertion due to
> RTL/basic_block restrictions:
>
> - Returning calls do not end a basic block so can be handled by emitting
> a generic nop.
> - Unconditional branches must be the end of a basic block, and nops
> cannot be outside of a basic block.
>Thus the need for FILLER_INSN, which allows placement outside of a
> basic block - and translates to a nop.
> - For most conditional branches we've taken a simple approach and only
> handle the fallthru edge for simplicity,
>which we do by inserting a "nop block" of nops on the fallthru edge,
> mapping that back to the original destination block.
> - asm gotos and pcsets are going to be tricky to analyse from a dilution
> perspective so are ignored at present.
>
>
> ## Changelog
>
> gcc/testsuite/ChangeLog:
>
> 2018-11-09  Carey Williams 
>
> * gcc.target/aarch64/branch-dilution-off.c: New test.
> * gcc.target/aarch64/branch-dilution-on.c: New test.
>
>
> gcc/ChangeLog:
>
> 2018-11-09  Carey Williams 
>
> * cfgbuild.c (inside_basic_block_p): Add FILLER_INSN case.
> * cfgrtl.c (rtl_verify_bb_layout): Whitelist FILLER_INSN outside
> basic blocks.
> * config.gcc (extra_objs): Add aarch64-branch-dilution.o.
> * config/aarch64/aarch64-branch-dilution.c: New file.
> * config/aarch64/aarch64-passes.def (branch-dilution): Register
> pass.
> * config/aarch64/aarch64-protos.h (struct tune_params): Declare
> tuning parameters bdilution_gsize and bdilution_maxb.
> (make_pass_branch_dilution): New declaration.
> * config/aarch64/aarch64.c (generic_tunings,cortexa35_tunings,
> cortexa53_tunings,cortexa57_tunings,cortexa72_tunings,
> cortexa73_tunings,exynosm1_tunings,thunderxt88_tunings,
> thunderx_tunings,tsv110_tunings,xgene1_tunings,
> qdf24xx_tunings,saphira_tunings,thunderx2t99_tunings):
> Provide default tunings for bdilution_gsize and bdilution_maxb.
> * config/aarch64/aarch64.md (filler_insn): Define new insn.
> * config/aarch64/aarch64.opt (mbranch-dilution,
> mbranch-dilution-granularity,
> mbranch-dilution-max-branches): Define new branch dilution
> options.
> * config/aarch64/t-aarch64 (aarch64-branch-dilution.c): New rule
> for aarch64-branch-dilution.c.
> * coretypes.h 

Re: [PATCH, GCC, AArch64] Branch Dilution Pass

2018-11-12 Thread Richard Biener
On Fri, Nov 9, 2018 at 6:23 PM Sudakshina Das  wrote:
>
> Hi
>
> I am posting this patch on behalf of Carey (cc'ed). I also have some
> review comments that I will make as a reply to this later.
>
>
> This implements a new AArch64 specific back-end pass that helps optimize
> branch-dense code, which can be a bottleneck for performance on some Arm
> cores. This is achieved by padding out the branch-dense sections of the
> instruction stream with nops.

Wouldn't this be more suitable for implementing inside the assembler?

> This has proven to show up to a 2.61%~ improvement on the Cortex A-72
> (SPEC CPU 2006: sjeng).
>
> The implementation includes the addition of a new RTX instruction class
> FILLER_INSN, which has been white listed to allow placement of NOPs
> outside of a basic block. This is to allow padding after unconditional
> branches. This is favorable so that any performance gained from
> diluting branches is not paid straight back via excessive eating of nops.
>
> It was deemed that a new RTX class was less invasive than modifying
> behavior in regards to standard UNSPEC nops.
>
> ## Command Line Options
>
> Three new target-specific options are provided:
> - mbranch-dilution
> - mbranch-dilution-granularity={num}
> - mbranch-dilution-max-branches={num}
>
> A number of cores known to be able to benefit from this pass have been
> given default tuning values for their granularity and max-branches.
> Each affected core has a very specific granule size and associated
> max-branch limit. This is a microarchitecture specific optimization.
> Typical usage should be -mdilute-branches with a specificed -mcpu. Cores
> with a granularity tuned to 0 will be ignored. Options are provided for
> experimentation.
>
> ## Algorithm and Heuristic
>
> The pass takes a very simple 'sliding window' approach to the problem.
> We crawl through each instruction (starting at the first branch) and
> keep track of the number of branches within the current "granule" (or
> window). When this exceeds the max-branch value, the pass will dilute
> the current granule, inserting nops to push out some of the branches.
> The heuristic will favour unconditonal branches (for performance
> reasons), or branches that are between two other branches (in order to
> decrease the likelihood of another dilution call being needed).
>
> Each branch type required a different method for nop insertion due to
> RTL/basic_block restrictions:
>
> - Returning calls do not end a basic block so can be handled by emitting
> a generic nop.
> - Unconditional branches must be the end of a basic block, and nops
> cannot be outside of a basic block.
>Thus the need for FILLER_INSN, which allows placement outside of a
> basic block - and translates to a nop.
> - For most conditional branches we've taken a simple approach and only
> handle the fallthru edge for simplicity,
>which we do by inserting a "nop block" of nops on the fallthru edge,
> mapping that back to the original destination block.
> - asm gotos and pcsets are going to be tricky to analyse from a dilution
> perspective so are ignored at present.
>
>
> ## Changelog
>
> gcc/testsuite/ChangeLog:
>
> 2018-11-09  Carey Williams  
>
> * gcc.target/aarch64/branch-dilution-off.c: New test.
> * gcc.target/aarch64/branch-dilution-on.c: New test.
>
>
> gcc/ChangeLog:
>
> 2018-11-09  Carey Williams  
>
> * cfgbuild.c (inside_basic_block_p): Add FILLER_INSN case.
> * cfgrtl.c (rtl_verify_bb_layout): Whitelist FILLER_INSN outside
> basic blocks.
> * config.gcc (extra_objs): Add aarch64-branch-dilution.o.
> * config/aarch64/aarch64-branch-dilution.c: New file.
> * config/aarch64/aarch64-passes.def (branch-dilution): Register
> pass.
> * config/aarch64/aarch64-protos.h (struct tune_params): Declare
> tuning parameters bdilution_gsize and bdilution_maxb.
> (make_pass_branch_dilution): New declaration.
> * config/aarch64/aarch64.c (generic_tunings,cortexa35_tunings,
> cortexa53_tunings,cortexa57_tunings,cortexa72_tunings,
> cortexa73_tunings,exynosm1_tunings,thunderxt88_tunings,
> thunderx_tunings,tsv110_tunings,xgene1_tunings,
> qdf24xx_tunings,saphira_tunings,thunderx2t99_tunings):
> Provide default tunings for bdilution_gsize and bdilution_maxb.
> * config/aarch64/aarch64.md (filler_insn): Define new insn.
> * config/aarch64/aarch64.opt (mbranch-dilution,
> mbranch-dilution-granularity,
> mbranch-dilution-max-branches): Define new branch dilution
> options.
> * config/aarch64/t-aarch64 (aarch64-branch-dilution.c): New rule
> for aarch64-branch-dilution.c.
> * coretypes.h (rtx_filler_insn): New rtx class.
> * doc/invoke.texi (mbranch-dilution,
> mbranch-dilution-granularity,
> mbranch-dilution-max-branches): Document branch dilution
> options.
> * emit-rtl.c 

Re: [PATCH, GCC, AArch64] Branch Dilution Pass

2018-11-09 Thread Andrew Pinski
On Fri, Nov 9, 2018 at 9:23 AM Sudakshina Das  wrote:
>
> Hi
>
> I am posting this patch on behalf of Carey (cc'ed). I also have some
> review comments that I will make as a reply to this later.
>
>
> This implements a new AArch64 specific back-end pass that helps optimize
> branch-dense code, which can be a bottleneck for performance on some Arm
> cores. This is achieved by padding out the branch-dense sections of the
> instruction stream with nops.
>
> This has proven to show up to a 2.61%~ improvement on the Cortex A-72
> (SPEC CPU 2006: sjeng).
>
> The implementation includes the addition of a new RTX instruction class
> FILLER_INSN, which has been white listed to allow placement of NOPs
> outside of a basic block. This is to allow padding after unconditional
> branches. This is favorable so that any performance gained from
> diluting branches is not paid straight back via excessive eating of nops.
>
> It was deemed that a new RTX class was less invasive than modifying
> behavior in regards to standard UNSPEC nops.

Maybe you should split this up into two patches, one of the
FILLER_INSN part and one for the aarch64 parts for easier review for
the maintainers.

- extra_objs="aarch64-builtins.o aarch-common.o
cortex-a57-fma-steering.o aarch64-speculation.o
falkor-tag-collision-avoidance.o"
+ extra_objs="aarch64-builtins.o aarch-common.o
cortex-a57-fma-steering.o aarch64-speculation.o
falkor-tag-collision-avoidance.o aarch64-branch-dilution.o"
Also this seems like the line is already long but that does not mean
it should get even longer; maybe split it up like:
extra_objs="aarch64-builtins.o aarch-common.o cortex-a57-fma-steering.o"
extra_objs="${extra_objs} aarch64-speculation.o
falkor-tag-collision-avoidance.o aarch64-branch-dilution.o"

I think you should be using some anonymous namespaces for insn_info,
insn_granule and maybe a few more things too.  Since those are all
local to the pass only (someone could make the same mistake as you and
use those names and you run into ODR violations then).

You might want to use the new dump_printf_loc functions instead of
fprintf where you are printing into dump_file.

+unsigned MAX_BRANCH = 0;
+unsigned GRANULE_SIZE = 0;
Most likely these should not be captilized at all and definitely
either static or in anonymous namespaces.

+inline bool
+is_branch (rtx_insn *insn)

Likewise about static or in an anonymouse namespace.
Maybe it could be rewritten to be easier to understand:
{
  if (insn == NULL)
  return false;
return JUMP_P (insn) || CALL_P (insn) || ANY_RETURN_P (insn);
}

That is swap the order there.

+  /* Pointers to the first/last instructions in granule.  */
+  insn_info *m_first = NULL;
+  insn_info *m_last = NULL;

Even though C++14 is the default compiler in newer versions, we
support C++98 to compile GCC with.  So please fix that so we don't
depend on C++11 features.  Simple way is to test native bootstrapping
with GCC 4.8 on CentOS 7.

+  if (insn->is_unconditional)
+{
+  value += 2;
+}

I noticed in some places you use {} around one statement and others
not.  The style should without except when nested and it confusing
which else belongs to it.


+void
+insn_granule::update_indexes ()
This is missing a comment before the function.  Also this seems more
like fixup indexes rather than update.


+  if (MAX_BRANCH < 1)
+error ("branch dilution: max branch must be greater than zero");
+  else if (MAX_BRANCH >= GRANULE_SIZE)
+{
+  error ("branch dilution: max branches (%d) must be \
+   less than granule size (%d)", MAX_BRANCH, GRANULE_SIZE);
+}

You should almost never use error in backend (there are a few
exceptions to that rule).  Either use sorry or internal_error.  Also
the way you wrapped the last error message is incorrect as the tabs
will show up in the error message.

Thanks,
Andrew Pinski

>
> ## Command Line Options
>
> Three new target-specific options are provided:
> - mbranch-dilution
> - mbranch-dilution-granularity={num}
> - mbranch-dilution-max-branches={num}
>
> A number of cores known to be able to benefit from this pass have been
> given default tuning values for their granularity and max-branches.
> Each affected core has a very specific granule size and associated
> max-branch limit. This is a microarchitecture specific optimization.
> Typical usage should be -mdilute-branches with a specificed -mcpu. Cores
> with a granularity tuned to 0 will be ignored. Options are provided for
> experimentation.
>
> ## Algorithm and Heuristic
>
> The pass takes a very simple 'sliding window' approach to the problem.
> We crawl through each instruction (starting at the first branch) and
> keep track of the number of branches within the current "granule" (or
> window). When this exceeds the max-branch value, the pass will dilute
> the current granule, inserting nops to push out some of the branches.
> The heuristic will favour unconditonal branches (for performance
> reasons), or branches that are between two other 

[PATCH, GCC, AArch64] Branch Dilution Pass

2018-11-09 Thread Sudakshina Das
Hi

I am posting this patch on behalf of Carey (cc'ed). I also have some
review comments that I will make as a reply to this later.


This implements a new AArch64 specific back-end pass that helps optimize
branch-dense code, which can be a bottleneck for performance on some Arm
cores. This is achieved by padding out the branch-dense sections of the
instruction stream with nops.

This has proven to show up to a 2.61%~ improvement on the Cortex A-72
(SPEC CPU 2006: sjeng).

The implementation includes the addition of a new RTX instruction class
FILLER_INSN, which has been white listed to allow placement of NOPs
outside of a basic block. This is to allow padding after unconditional
branches. This is favorable so that any performance gained from
diluting branches is not paid straight back via excessive eating of nops.

It was deemed that a new RTX class was less invasive than modifying
behavior in regards to standard UNSPEC nops.

## Command Line Options

Three new target-specific options are provided:
- mbranch-dilution
- mbranch-dilution-granularity={num}
- mbranch-dilution-max-branches={num}

A number of cores known to be able to benefit from this pass have been
given default tuning values for their granularity and max-branches.
Each affected core has a very specific granule size and associated
max-branch limit. This is a microarchitecture specific optimization. 
Typical usage should be -mdilute-branches with a specificed -mcpu. Cores 
with a granularity tuned to 0 will be ignored. Options are provided for 
experimentation.

## Algorithm and Heuristic

The pass takes a very simple 'sliding window' approach to the problem. 
We crawl through each instruction (starting at the first branch) and 
keep track of the number of branches within the current "granule" (or 
window). When this exceeds the max-branch value, the pass will dilute 
the current granule, inserting nops to push out some of the branches. 
The heuristic will favour unconditonal branches (for performance 
reasons), or branches that are between two other branches (in order to 
decrease the likelihood of another dilution call being needed).

Each branch type required a different method for nop insertion due to 
RTL/basic_block restrictions:

- Returning calls do not end a basic block so can be handled by emitting
a generic nop.
- Unconditional branches must be the end of a basic block, and nops 
cannot be outside of a basic block.
   Thus the need for FILLER_INSN, which allows placement outside of a 
basic block - and translates to a nop.
- For most conditional branches we've taken a simple approach and only 
handle the fallthru edge for simplicity,
   which we do by inserting a "nop block" of nops on the fallthru edge, 
mapping that back to the original destination block.
- asm gotos and pcsets are going to be tricky to analyse from a dilution 
perspective so are ignored at present.


## Changelog

gcc/testsuite/ChangeLog:

2018-11-09  Carey Williams  

* gcc.target/aarch64/branch-dilution-off.c: New test.
* gcc.target/aarch64/branch-dilution-on.c: New test.


gcc/ChangeLog:

2018-11-09  Carey Williams  

* cfgbuild.c (inside_basic_block_p): Add FILLER_INSN case.
* cfgrtl.c (rtl_verify_bb_layout): Whitelist FILLER_INSN outside
basic blocks.
* config.gcc (extra_objs): Add aarch64-branch-dilution.o.
* config/aarch64/aarch64-branch-dilution.c: New file.
* config/aarch64/aarch64-passes.def (branch-dilution): Register
pass.
* config/aarch64/aarch64-protos.h (struct tune_params): Declare
tuning parameters bdilution_gsize and bdilution_maxb.
(make_pass_branch_dilution): New declaration.
* config/aarch64/aarch64.c (generic_tunings,cortexa35_tunings,
cortexa53_tunings,cortexa57_tunings,cortexa72_tunings,
cortexa73_tunings,exynosm1_tunings,thunderxt88_tunings,
thunderx_tunings,tsv110_tunings,xgene1_tunings,
qdf24xx_tunings,saphira_tunings,thunderx2t99_tunings):
Provide default tunings for bdilution_gsize and bdilution_maxb.
* config/aarch64/aarch64.md (filler_insn): Define new insn.
* config/aarch64/aarch64.opt (mbranch-dilution,
mbranch-dilution-granularity,
mbranch-dilution-max-branches): Define new branch dilution
options.
* config/aarch64/t-aarch64 (aarch64-branch-dilution.c): New rule
for aarch64-branch-dilution.c.
* coretypes.h (rtx_filler_insn): New rtx class.
* doc/invoke.texi (mbranch-dilution,
mbranch-dilution-granularity,
mbranch-dilution-max-branches): Document branch dilution
options.
* emit-rtl.c (emit_filler_after): New emit function.
* rtl.def (FILLER_INSN): New RTL EXPR of type RTX_INSN.
* rtl.h (class GTY): New class for rtx_filler_insn.
(is_a_helper ::test): New test helper for rtx_filler_insn.
(macro FILLER_INSN_P(X)): New predicate.
* target-insns.def