Hi Andrew,

thanks for reviewing, I'll work on your comments.  For now I'm just
replying to the high-level questions.

Andrew Pinski <pins...@gmail.com> writes:

> On Wed, Jul 22, 2020 at 3:10 AM Andrea Corallo <andrea.cora...@arm.com> wrote:
>>
>> Hi all,
>>
>> this second patch implements the AArch64 specific back-end pass
>> 'branch-dilution' controllable by the following command line options:
>>
>> -mbranch-dilution
>>
>> --param=aarch64-branch-dilution-granularity={num}
>>
>> --param=aarch64-branch-dilution-max-branches={num}
>>
>> Some cores known to be able to benefit from this pass have been given
>> default tuning values for their granularity and max-branches.  Each
>> affected core has a very specific granule size and associated max-branch
>> limit.  This is a microarchitecture specific optimization.  Typical
>> usage should be -mbranch-dilution with a specified -mcpu.  Cores with a
>> granularity tuned to 0 will be ignored. Options are provided for
>> experimentation.
>
> Can you give a simple example of what this patch does?

Sure, this pass simply moves a sliding window over the insns, trying to
make sure that we never have more than 'max_branch' branches for every
'granule_size' insns.

If too many branches are detected, nops are added where they are
considered less harmful to correct that.

There are obviously many scenarios where the compiler can generate
branch-dense pieces of code, but say we have the equivalent of:

====
.L389:
        bl      foo
        b       .L43
.L388:
        bl      foo
        b       .L42
.L387:
        bl      foo
        b       .L41
.L386:
        bl      foo
        b       .L40
====

Assuming granule size 4 and max branches 2, this will be transformed
into the equivalent of:

====
.L389:
        bl      foo
        b       .L43
        nop
        nop
.L388:
        bl      foo
        b       .L42
        nop
        nop
.L387:
        bl      foo
        b       .L41
        nop
        nop
.L386:
        bl      foo
        b       .L40
        nop
        nop
====
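
To make the windowing concrete, here is a rough Python model of the idea.
It is only a sketch and not the code in the patch: the names 'dilute',
'is_branch' and 'BRANCHES' are made up for illustration, the set of
mnemonics counted as branches is an assumption, and nops are simply
placed right after the previous insn (before any label), which is just
one placement that happens to match the example above.

```python
# Hypothetical model of the windowing idea; this is NOT the GCC pass itself.
# Assumption: these mnemonics (plus conditional "b.<cond>") count as branches.
BRANCHES = {"b", "bl", "br", "blr", "ret", "cbz", "cbnz", "tbz", "tbnz"}
NOP = "        nop"


def is_label(line):
    return line.strip().endswith(":")


def is_branch(line):
    fields = line.split()
    return bool(fields) and (fields[0] in BRANCHES or fields[0].startswith("b."))


def dilute(lines, granule_size, max_branches):
    """Pad with nops so that no run of granule_size consecutive insns
    contains more than max_branches branches.  Labels occupy no slot."""
    if granule_size == 0:
        # Granularity 0 means the pass ignores the code.
        return list(lines)
    if max_branches < 1:
        raise ValueError("max_branches must be at least 1")

    out = []             # everything emitted so far, labels included
    insns = []           # emitted instructions only (what the window sees)
    pending_labels = []  # labels held back so nops land before them

    for line in lines:
        if is_label(line):
            pending_labels.append(line)
            continue
        if is_branch(line):
            # The window is the previous granule_size - 1 insns plus this
            # branch; add nops until the branch fits within the limit.
            while True:
                start = max(0, len(insns) - (granule_size - 1))
                if sum(map(is_branch, insns[start:])) < max_branches:
                    break
                out.append(NOP)
                insns.append(NOP)
        out.extend(pending_labels)
        pending_labels.clear()
        out.append(line)
        insns.append(line)

    out.extend(pending_labels)
    return out


if __name__ == "__main__":
    example = [
        ".L389:", "        bl      foo", "        b       .L43",
        ".L388:", "        bl      foo", "        b       .L42",
    ]
    print("\n".join(dilute(example, granule_size=4, max_branches=2)))
```

Note that this model only pads when a later branch would overflow the
window, so unlike the output above it adds nothing after the final
'b .L40' when nothing follows it.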

> Also your testcases seem too sensitive to other optimizations which
> could happen.  E.g. the call to "branch (i)" could be pulled out of
> the switch statement.  Or even the "*i += N;" could be moved to one
> Basic block and the switch becomes just one if statement.
>
>> Observed performance improvements on Neoverse N1 SPEC CPU 2006 were
>> up to ~+3% (xalancbmk) and ~+1.5% (sjeng).  Average code size increase
>> for all the testsuite proved to be ~0.4%.
>
> Also does this improve any non-SPEC benchmarks or has it only been
> benchmarked with SPEC?

So far I have tried it only on SPEC 2006.  The transformation is not
benchmark specific though, so other code may benefit from it.

Thanks

  Andrea
