Re: [PATCH][AArch64] Use LDP/STP in shrinkwrapping

2018-02-06 Thread James Greenhalgh
On Fri, Jan 05, 2018 at 12:22:44PM +, Wilco Dijkstra wrote:
> The shrinkwrap optimization added late in GCC 7 allows each callee-save to
> be delayed and done only across blocks which need a particular callee-save.
> Although this reduces unnecessary memory traffic on code paths that need
> few callee-saves, it typically uses LDR/STR rather than LDP/STP.  The
> number of LDP/STP instructions is reduced by ~7%. This means more memory
> accesses and increased codesize, ~1.0% on average.
> 
> To improve this, if a particular callee-save must be saved/restored, also
> add the adjacent callee-save to allow use of LDP/STP.  This significantly
> reduces codesize (for example gcc_r, povray_r, parest_r, xalancbmk_r are
> 1% smaller).  This is a simple fix which can be backported.  A more advanced
> approach would scan blocks for pairs of callee-saves, but that requires a
> rewrite of all the callee-save code which is too late at this stage.
> 
> An example epilog in a shrinkwrapped function before:
> 
> ldp    x21, x22, [sp,#16]
> ldr    x23, [sp,#32]
> ldr    x24, [sp,#40]
> ldp    x25, x26, [sp,#48]
> ldr    x27, [sp,#64]
> ldr    x28, [sp,#72]
> ldr    x30, [sp,#80]
> ldr    d8, [sp,#88]
> ldp    x19, x20, [sp],#96
> ret
> 
> And after this patch:
> 
> ldr    d8, [sp,#88]
> ldp    x21, x22, [sp,#16]
> ldp    x23, x24, [sp,#32]
> ldp    x25, x26, [sp,#48]
> ldp    x27, x28, [sp,#64]
> ldr    x30, [sp,#80]
> ldp    x19, x20, [sp],#96
> ret
> 
> Passes bootstrap, OK for commit (and backport to GCC7)?

OK.

Thanks,
James

> 
> ChangeLog:
> 2018-01-05  Wilco Dijkstra  
> 
>   * config/aarch64/aarch64.c (aarch64_components_for_bb):
>   Increase LDP/STP opportunities by adding adjacent callee-saves.


Re: [PATCH][AArch64] Use LDP/STP in shrinkwrapping

2018-02-03 Thread Richard Sandiford
Segher Boessenkool  writes:
> On Thu, Jan 11, 2018 at 03:35:37PM +, Wilco Dijkstra wrote:
>> Segher Boessenkool wrote:
>>   
>> > Of course I see that ldp is useful.  I don't think that this particular
>> > way of forcing more pairs is a good idea.  Needs testing / benchmarking /
>> > instrumentation, and we haven't seen any of that.
>> 
>> I wouldn't propose a patch if it caused slowdowns. In fact I am seeing
>> speedups, particularly on benchmarks which benefit from shrinkwrapping
>> (e.g. povray). Int is flat, and there is an overall gain on fp.
>> 
>> > Forcing pairs before separate shrink-wrapping reduces the effectiveness
>> > of the latter by a lot.
>> 
>> That doesn't appear to be the case - or at least any reduction in
>> effectiveness is more than mitigated by the lower number of memory
>> accesses and I-cache misses.  To do better still you'd need to compute
>> an optimal set of pairs, and that is quite difficult in the current
>> infrastructure. I could dynamically create pairs just in the backend
>> but that won't be optimal either.
>
> Right, I certainly believe forming more pairs before sws (as you do)
> improves the code -- but I think forming the pairs only after sws will
> work even better.
>
> But yes that is more work to implement, and the benefit (if any) is
> unknown.  I hoped I could convince you to try ;-)

Where do we stand with this patch?  I think it's legitimately a
regression fix; like Wilco, I'm seeing benchmarks in which epilogue
restores are no longer paired for important functions.

Even if things could be improved elsewhere, the patch seems to make
sense on its own and is also nicely localised.

Thanks,
Richard


Re: [PATCH][AArch64] Use LDP/STP in shrinkwrapping

2018-01-11 Thread Segher Boessenkool
On Thu, Jan 11, 2018 at 03:35:37PM +, Wilco Dijkstra wrote:
> Segher Boessenkool wrote:
>   
> > Of course I see that ldp is useful.  I don't think that this particular
> > way of forcing more pairs is a good idea.  Needs testing / benchmarking /
> > instrumentation, and we haven't seen any of that.
> 
> I wouldn't propose a patch if it caused slowdowns. In fact I am seeing
> speedups, particularly on benchmarks which benefit from shrinkwrapping
> (e.g. povray). Int is flat, and there is an overall gain on fp.
> 
> > Forcing pairs before separate shrink-wrapping reduces the effectiveness
> > of the latter by a lot.
> 
> That doesn't appear to be the case - or at least any reduction in
> effectiveness is more than mitigated by the lower number of memory
> accesses and I-cache misses.  To do better still you'd need to compute
> an optimal set of pairs, and that is quite difficult in the current
> infrastructure. I could dynamically create pairs just in the backend
> but that won't be optimal either.

Right, I certainly believe forming more pairs before sws (as you do)
improves the code -- but I think forming the pairs only after sws will
work even better.

But yes that is more work to implement, and the benefit (if any) is
unknown.  I hoped I could convince you to try ;-)


Segher


Re: [PATCH][AArch64] Use LDP/STP in shrinkwrapping

2018-01-11 Thread Wilco Dijkstra
Segher Boessenkool wrote:
  
> Of course I see that ldp is useful.  I don't think that this particular
> way of forcing more pairs is a good idea.  Needs testing / benchmarking /
> instrumentation, and we haven't seen any of that.

I wouldn't propose a patch if it caused slowdowns. In fact I am seeing
speedups, particularly on benchmarks which benefit from shrinkwrapping
(e.g. povray). Int is flat, and there is an overall gain on fp.

> Forcing pairs before separate shrink-wrapping reduces the effectiveness
> of the latter by a lot.

That doesn't appear to be the case - or at least any reduction in
effectiveness is more than mitigated by the lower number of memory
accesses and I-cache misses.  To do better still you'd need to compute
an optimal set of pairs, and that is quite difficult in the current
infrastructure. I could dynamically create pairs just in the backend
but that won't be optimal either.

Wilco



Re: [PATCH][AArch64] Use LDP/STP in shrinkwrapping

2018-01-10 Thread Segher Boessenkool
On Tue, Jan 09, 2018 at 09:13:23PM -0800, Andrew Pinski wrote:
> On Tue, Jan 9, 2018 at 6:54 AM, Segher Boessenkool
>  wrote:
> > On Tue, Jan 09, 2018 at 12:23:42PM +, Wilco Dijkstra wrote:
> >> Segher Boessenkool wrote:
> >> > On Mon, Jan 08, 2018 at 08:25:47PM +, Wilco Dijkstra wrote:
> >> >> > Always pairing two registers together *also* degrades code quality.
> >> >>
> >> >> No, while it's not optimal, it means smaller code and fewer memory 
> >> >> accesses.
> >> >
> >> > It means you execute *more* memory accesses.  Always.  This may be
> >> > sometimes hidden, sure.  I'm not saying you do not want more ldp's;
> >> > I'm saying this particular strategy is very far from ideal.
> >>
> >> No it means less since the number of memory accesses reduces (memory
> >> bandwidth may increase but that's not an issue).
> >
> > The problem is *more* memory accesses are executed at runtime.  Which is
> > why separate shrink-wrapping does what it does: to have *fewer* executed.
> > (It's not just the direct execution cost why that helps: more important
> > are latencies to dependent ops, microarchitectural traps, etc.).
> 
> On most AArch64 micro-architectures, one LDP/STP takes just as long as
> one LDR/STR, as long as both registers are on the same cache line.
> So one LDP/STP compared to two LDR/STR is much better.  An LDP/STP is
> really considered one memory access, and that is where the confusion
> is coming from.  We are reducing the overall number of memory accesses
> on that path, or keeping it the same.
> Hope this explanation allows you to understand why pairing does not
> degrade the code quality but improves it overall.

Of course I see that ldp is useful.  I don't think that this particular
way of forcing more pairs is a good idea.  Needs testing / benchmarking /
instrumentation, and we haven't seen any of that.

Forcing pairs before separate shrink-wrapping reduces the effectiveness
of the latter by a lot.


Segher


Re: [PATCH][AArch64] Use LDP/STP in shrinkwrapping

2018-01-09 Thread Andrew Pinski
On Tue, Jan 9, 2018 at 6:54 AM, Segher Boessenkool
 wrote:
> On Tue, Jan 09, 2018 at 12:23:42PM +, Wilco Dijkstra wrote:
>> Segher Boessenkool wrote:
>> > On Mon, Jan 08, 2018 at 08:25:47PM +, Wilco Dijkstra wrote:
>> >> > Always pairing two registers together *also* degrades code quality.
>> >>
>> >> No, while it's not optimal, it means smaller code and fewer memory 
>> >> accesses.
>> >
>> > It means you execute *more* memory accesses.  Always.  This may be
>> > sometimes hidden, sure.  I'm not saying you do not want more ldp's;
>> > I'm saying this particular strategy is very far from ideal.
>>
>> No it means less since the number of memory accesses reduces (memory
>> bandwidth may increase but that's not an issue).
>
> The problem is *more* memory accesses are executed at runtime.  Which is
> why separate shrink-wrapping does what it does: to have *fewer* executed.
> (It's not just the direct execution cost why that helps: more important
> are latencies to dependent ops, microarchitectural traps, etc.).

On most AArch64 micro-architectures, one LDP/STP takes just as long as
one LDR/STR, as long as both registers are on the same cache line.
So one LDP/STP compared to two LDR/STR is much better.  An LDP/STP is
really considered one memory access, and that is where the confusion
is coming from.  We are reducing the overall number of memory accesses
on that path, or keeping it the same.
Hope this explanation allows you to understand why pairing does not
degrade the code quality but improves it overall.

Thanks,
Andrew

>
> If you make A always stored whenever B is, and the other way around, the
> optimal place to do it will always store at least as often as either A
> or B, _but can also store more often than either_.
>
>> >> That may well be the problem. So if there are N predecessors, of which N-1
>> >> need to restore the same set of callee saves, but one was shrinkwrapped,
>> >> N-1 copies of the same restores might be emitted. N could be the number
>> >> of blocks in a function - I really hope it doesn't work out like that...
>> >
>> > In the worst case it would.  OTOH, joining every combo into blocks costs
>> > O(2**C) (where C is the # components) bb's worst case.
>> >
>> > It isn't a simple problem.  The current tuning works pretty well for us,
>> > but no doubt it can be improved!
>>
>> Well if there are C components, we could limit the total number of
>> saves/restores inserted to say 4C. Similarly common cases could easily
>> share the restores without increasing the number of branches.
>
> It is common to see many saves/restores generated for the exceptional cases.
>
>
> Segher


Re: [PATCH][AArch64] Use LDP/STP in shrinkwrapping

2018-01-09 Thread Segher Boessenkool
On Tue, Jan 09, 2018 at 12:23:42PM +, Wilco Dijkstra wrote:
> Segher Boessenkool wrote:
> > On Mon, Jan 08, 2018 at 08:25:47PM +, Wilco Dijkstra wrote:
> >> > Always pairing two registers together *also* degrades code quality.
> >> 
> >> No, while it's not optimal, it means smaller code and fewer memory 
> >> accesses.
> >
> > It means you execute *more* memory accesses.  Always.  This may be
> > sometimes hidden, sure.  I'm not saying you do not want more ldp's;
> > I'm saying this particular strategy is very far from ideal.
> 
> No it means less since the number of memory accesses reduces (memory
> bandwidth may increase but that's not an issue).

The problem is *more* memory accesses are executed at runtime.  Which is
why separate shrink-wrapping does what it does: to have *fewer* executed.
(It's not just the direct execution cost why that helps: more important
are latencies to dependent ops, microarchitectural traps, etc.).

If you make A always stored whenever B is, and the other way around, the
optimal place to do it will always store at least as often as either A
or B, _but can also store more often than either_.

> >> That may well be the problem. So if there are N predecessors, of which N-1
> >> need to restore the same set of callee saves, but one was shrinkwrapped,
> >> N-1 copies of the same restores might be emitted. N could be the number
> >> of blocks in a function - I really hope it doesn't work out like that...
> >
> > In the worst case it would.  OTOH, joining every combo into blocks costs
> > O(2**C) (where C is the # components) bb's worst case.
> >
> > It isn't a simple problem.  The current tuning works pretty well for us,
> > but no doubt it can be improved!
> 
> Well if there are C components, we could limit the total number of
> saves/restores inserted to say 4C. Similarly common cases could easily
> share the restores without increasing the number of branches.

It is common to see many saves/restores generated for the exceptional cases.


Segher


Re: [PATCH][AArch64] Use LDP/STP in shrinkwrapping

2018-01-09 Thread Wilco Dijkstra
Segher Boessenkool wrote:
> On Mon, Jan 08, 2018 at 08:25:47PM +, Wilco Dijkstra wrote:
>> > Always pairing two registers together *also* degrades code quality.
>> 
>> No, while it's not optimal, it means smaller code and fewer memory accesses.
>
> It means you execute *more* memory accesses.  Always.  This may be
> sometimes hidden, sure.  I'm not saying you do not want more ldp's;
> I'm saying this particular strategy is very far from ideal.

No it means less since the number of memory accesses reduces (memory
bandwidth may increase but that's not an issue). You can argue that different
pairings might work better as you may be able to push some LDP/STPs into
less frequently executed blocks, but finding optimal pairings is a hard problem
since the offsets need to be consecutive.

>> That may well be the problem. So if there are N predecessors, of which N-1
>> need to restore the same set of callee saves, but one was shrinkwrapped,
>> N-1 copies of the same restores might be emitted. N could be the number
>> of blocks in a function - I really hope it doesn't work out like that...
>
> In the worst case it would.  OTOH, joining every combo into blocks costs
> O(2**C) (where C is the # components) bb's worst case.
>
> It isn't a simple problem.  The current tuning works pretty well for us,
> but no doubt it can be improved!

Well if there are C components, we could limit the total number of
saves/restores inserted to say 4C. Similarly common cases could easily
share the restores without increasing the number of branches.

Wilco


Re: [PATCH][AArch64] Use LDP/STP in shrinkwrapping

2018-01-08 Thread Segher Boessenkool
On Mon, Jan 08, 2018 at 08:25:47PM +, Wilco Dijkstra wrote:
> > Always pairing two registers together *also* degrades code quality.
> 
> No, while it's not optimal, it means smaller code and fewer memory accesses.

It means you execute *more* memory accesses.  Always.  This may be
sometimes hidden, sure.  I'm not saying you do not want more ldp's;
I'm saying this particular strategy is very far from ideal.

> >> Another issue is that after pro_and_epilogue pass I see multiple restores
> >> of the same registers and then a branch to the same block. We should try
> >> to avoid the unnecessary duplication.
> >
> > It already does that if *all* predecessors of that block do that.  If you
> > want to do it in other cases, you end up with more jumps.  That may be
> > beneficial in some cases, of course, but it is not an obvious win (and in
> > the general case it is, hrm let's use nice words, "terrible").
> 
> That may well be the problem. So if there are N predecessors, of which N-1
> need to restore the same set of callee saves, but one was shrinkwrapped,
> N-1 copies of the same restores might be emitted. N could be the number
> of blocks in a function - I really hope it doesn't work out like that...

In the worst case it would.  OTOH, joining every combo into blocks costs
O(2**C) (where C is the # components) bb's worst case.

It isn't a simple problem.  The current tuning works pretty well for us,
but no doubt it can be improved!


Segher


Re: [PATCH][AArch64] Use LDP/STP in shrinkwrapping

2018-01-08 Thread Wilco Dijkstra
Segher Boessenkool wrote:
> On Mon, Jan 08, 2018 at 01:27:24PM +, Wilco Dijkstra wrote:
>
>> Peepholing is very conservative about instructions using SP and won't touch
>> anything frame related. If this was working better then the backend could 
>> just
>> emit single loads/stores and let peepholing generate LDP/STP.
>
> How unfortunate; that should definitely be improved then.

Improving that helps indeed but won't fix the issue. The epilog may not
always be duplicated and merged like in my example. Any subset of saves
and restores may not be pairable, so the worst case still has twice as many
memory accesses.

> Always pairing two registers together *also* degrades code quality.

No, while it's not optimal, it means smaller code and fewer memory accesses.

>> Another issue is that after pro_and_epilogue pass I see multiple restores
>> of the same registers and then a branch to the same block. We should try
>> to avoid the unnecessary duplication.
>
> It already does that if *all* predecessors of that block do that.  If you
> want to do it in other cases, you end up with more jumps.  That may be
> beneficial in some cases, of course, but it is not an obvious win (and in
> the general case it is, hrm let's use nice words, "terrible").

That may well be the problem. So if there are N predecessors, of which N-1
need to restore the same set of callee saves, but one was shrinkwrapped,
N-1 copies of the same restores might be emitted. N could be the number
of blocks in a function - I really hope it doesn't work out like that...

Wilco

Re: [PATCH][AArch64] Use LDP/STP in shrinkwrapping

2018-01-08 Thread Segher Boessenkool
On Mon, Jan 08, 2018 at 01:27:24PM +, Wilco Dijkstra wrote:
> Segher Boessenkool wrote:
> > On Fri, Jan 05, 2018 at 12:22:44PM +, Wilco Dijkstra wrote:
> >> An example epilog in a shrinkwrapped function before:
> >> 
> >> ldp    x21, x22, [sp,#16]
> >> ldr    x23, [sp,#32]
> >> ldr    x24, [sp,#40]
> >> ldp    x25, x26, [sp,#48]
> >> ldr    x27, [sp,#64]
> >> ldr    x28, [sp,#72]
> >> ldr    x30, [sp,#80]
> >> ldr    d8, [sp,#88]
> >> ldp    x19, x20, [sp],#96
> >> ret
> >
> > In this example, the compiler already can make a ldp for both x23/x24 and
> > x27/x28 just fine (if not in emit_epilogue_components, then simply in a
> > peephole); why did that not work?  Or is this not the actual generated
> > machine code (and there are labels between the insns, for example)?
> 
> This block originally had a label in it, 2 blocks emitted identical
> restores and then branched to the final epilog. The final epilogue was
> then duplicated so we end up with 2 almost identical epilogs of 10
> instructions (almost since there were 1-2 unrelated instructions in
> both blocks).
> 
> Peepholing is very conservative about instructions using SP and won't touch
> anything frame related. If this was working better then the backend could just
> emit single loads/stores and let peepholing generate LDP/STP.

How unfortunate; that should definitely be improved then.

Always pairing two registers together *also* degrades code quality.

> Another issue is that after pro_and_epilogue pass I see multiple restores
> of the same registers and then a branch to the same block. We should try
> to avoid the unnecessary duplication.

It already does that if *all* predecessors of that block do that.  If you
want to do it in other cases, you end up with more jumps.  That may be
beneficial in some cases, of course, but it is not an obvious win (and in
the general case it is, hrm let's use nice words, "terrible").


Segher


Re: [PATCH][AArch64] Use LDP/STP in shrinkwrapping

2018-01-08 Thread Wilco Dijkstra
Segher Boessenkool wrote:
> On Fri, Jan 05, 2018 at 12:22:44PM +, Wilco Dijkstra wrote:
>> An example epilog in a shrinkwrapped function before:
>> 
>> ldp    x21, x22, [sp,#16]
>> ldr    x23, [sp,#32]
>> ldr    x24, [sp,#40]
>> ldp    x25, x26, [sp,#48]
>> ldr    x27, [sp,#64]
>> ldr    x28, [sp,#72]
>> ldr    x30, [sp,#80]
>> ldr    d8, [sp,#88]
>> ldp    x19, x20, [sp],#96
>> ret
>
> In this example, the compiler already can make a ldp for both x23/x24 and
> x27/x28 just fine (if not in emit_epilogue_components, then simply in a
> peephole); why did that not work?  Or is this not the actual generated
> machine code (and there are labels between the insns, for example)?

This block originally had a label in it, 2 blocks emitted identical restores and
then branched to the final epilog. The final epilogue was then duplicated so
we end up with 2 almost identical epilogs of 10 instructions (almost since
there were 1-2 unrelated instructions in both blocks).

Peepholing is very conservative about instructions using SP and won't touch
anything frame related. If this was working better then the backend could just
emit single loads/stores and let peepholing generate LDP/STP.

However this is not the real issue. In the worst case the current code may
only emit LDR and STR. If there are multiple callee-saves in a block, we
want to use LDP/STP, and if there is an odd number of registers, we want
to add a callee-save from an inner block.

Another issue is that after pro_and_epilogue pass I see multiple restores
of the same registers and then a branch to the same block. We should try
to avoid the unnecessary duplication.

Wilco

Re: [PATCH][AArch64] Use LDP/STP in shrinkwrapping

2018-01-07 Thread Segher Boessenkool
Hi Wilco,

On Fri, Jan 05, 2018 at 12:22:44PM +, Wilco Dijkstra wrote:
> An example epilog in a shrinkwrapped function before:
> 
> ldp    x21, x22, [sp,#16]
> ldr    x23, [sp,#32]
> ldr    x24, [sp,#40]
> ldp    x25, x26, [sp,#48]
> ldr    x27, [sp,#64]
> ldr    x28, [sp,#72]
> ldr    x30, [sp,#80]
> ldr    d8, [sp,#88]
> ldp    x19, x20, [sp],#96
> ret

In this example, the compiler already can make a ldp for both x23/x24 and
x27/x28 just fine (if not in emit_epilogue_components, then simply in a
peephole); why did that not work?  Or is this not the actual generated
machine code (and there are labels between the insns, for example)?


Segher


[PATCH][AArch64] Use LDP/STP in shrinkwrapping

2018-01-05 Thread Wilco Dijkstra
The shrinkwrap optimization added late in GCC 7 allows each callee-save to
be delayed and done only across blocks which need a particular callee-save.
Although this reduces unnecessary memory traffic on code paths that need
few callee-saves, it typically uses LDR/STR rather than LDP/STP.  The
number of LDP/STP instructions is reduced by ~7%. This means more memory
accesses and increased codesize, ~1.0% on average.

To improve this, if a particular callee-save must be saved/restored, also
add the adjacent callee-save to allow use of LDP/STP.  This significantly
reduces codesize (for example gcc_r, povray_r, parest_r, xalancbmk_r are
1% smaller).  This is a simple fix which can be backported.  A more advanced
approach would scan blocks for pairs of callee-saves, but that requires a
rewrite of all the callee-save code which is too late at this stage.

An example epilog in a shrinkwrapped function before:

ldp    x21, x22, [sp,#16]
ldr    x23, [sp,#32]
ldr    x24, [sp,#40]
ldp    x25, x26, [sp,#48]
ldr    x27, [sp,#64]
ldr    x28, [sp,#72]
ldr    x30, [sp,#80]
ldr    d8, [sp,#88]
ldp    x19, x20, [sp],#96
ret

And after this patch:

ldr    d8, [sp,#88]
ldp    x21, x22, [sp,#16]
ldp    x23, x24, [sp,#32]
ldp    x25, x26, [sp,#48]
ldp    x27, x28, [sp,#64]
ldr    x30, [sp,#80]
ldp    x19, x20, [sp],#96
ret

Passes bootstrap, OK for commit (and backport to GCC7)?

ChangeLog:
2018-01-05  Wilco Dijkstra  

* config/aarch64/aarch64.c (aarch64_components_for_bb):
Increase LDP/STP opportunities by adding adjacent callee-saves.
--

diff --git a/gcc/config/aarch64/aarch64.c b/gcc/config/aarch64/aarch64.c
index 9735fc18402dd8fe2fa4022eef4c0522814a0552..da21032b19413d0361b8d30b51a31124eaaa31a1 100644
--- a/gcc/config/aarch64/aarch64.c
+++ b/gcc/config/aarch64/aarch64.c
@@ -3503,7 +3503,22 @@ aarch64_components_for_bb (basic_block bb)
 	&& (bitmap_bit_p (in, regno)
 	    || bitmap_bit_p (gen, regno)
 	    || bitmap_bit_p (kill, regno)))
-      bitmap_set_bit (components, regno);
+      {
+	unsigned regno2, offset, offset2;
+	bitmap_set_bit (components, regno);
+
+	/* If there is a callee-save at an adjacent offset, add it too
+	   to increase the use of LDP/STP.  */
+	offset = cfun->machine->frame.reg_offset[regno];
+	regno2 = ((offset & 8) == 0) ? regno + 1 : regno - 1;
+
+	if (regno2 <= LAST_SAVED_REGNUM)
+	  {
+	    offset2 = cfun->machine->frame.reg_offset[regno2];
+	    if ((offset & ~8) == (offset2 & ~8))
+	      bitmap_set_bit (components, regno2);
+	  }
+      }
 
   return components;
 }