Re: [PATCH, AArch64] atomics: prefetch the destination for write prior to ldxr/stxr loops

2016-05-27 Thread James Greenhalgh
On Tue, Mar 15, 2016 at 03:31:30PM +, James Greenhalgh wrote:
> On Mon, Mar 07, 2016 at 10:54:25PM -0800, Andrew Pinski wrote:
> > On Mon, Mar 7, 2016 at 8:12 PM, Yangfei (Felix)  
> > wrote:
> > >> On Mon, Mar 7, 2016 at 7:27 PM, Yangfei (Felix)  
> > >> wrote:
> > >> > Hi,
> > >> >
> > >> > As discussed in LKML:
> > >> http://lists.infradead.org/pipermail/linux-arm-kernel/2015-July/355996.html,
> > >>  the
> > >> cost of changing a cache line
> > >> > from shared to exclusive state can be significant on aarch64 cores,
> > >> especially when this is triggered by an exclusive store, since it may
> > >> > result in having to retry the transaction.
> > >> > This patch makes use of the "prfm PSTL1STRM" instruction to 
> > >> > prefetch
> > >> cache lines for write prior to ldxr/stxr loops generated by the ll/sc 
> > >> atomic
> > >> routines.
> > >> > Bootstrapped on AArch64 server, is it OK?
> > >>
> > >>
> > >> I don't think this is a good thing in general.  For an example on 
> > >> ThunderX, the
> > >> prefetch just adds a cycle for no benefit.  This really depends on the
> > >> micro-architecture of the core and how LDXR/STXR are
> > >> implemented.   So after this patch, it will slow down ThunderX.
> > >>
> > >> Thanks,
> > >> Andrew Pinski
> > >>
> > >
> > > Hi Andrew,
> > >
> > >I am not quite clear about the ThunderX micro-arch.  But, Yes, I agree
> > >it depends on the micro-architecture of the core.  As the mentioned
> > >kernel patch is merged upstream, I think the added prefetch instruction
> > >in atomic routines is good for most of AArch64 cores in the market.  If
> > >it does nothing good for ThunderX, then how about adding some checking
> > >here?  I mean disabling the the generation of the prfm if we are tuning
> > >for ThunderX.
> > 
> > No it is not just not do any good, it actually causes worse
> > performance for ThunderX.  How about only doing it for the
> > micro-architecture where it helps and also not do it for generic since
> > it hurts ThunderX so much.
> 
> This should be a GCC 7 patch at this point, which should give us some time
> to talk through whether we want this patch or not.
> 
> How bad is this for ThunderX - upthread you said one cycle penalty, but here
> you suggest it hurts ThunderX more? Note that the prefetch is outside of
> the LDXR/STXR loop.

Hi Andrew,

Did you have any further thoughts on the magnitude of the penalty you
would face on ThunderX? 

Felix, I think if this is going to be expensive for some AArch64 platforms,
then the best way to progress this patch would be to add a new flag to
tune_flags. Something like AARCH64_EXTRA_TUNE_PREFETCH_LDREX. This would
allow targets which want the explicit prefetch to enable it.

Thanks,
James



Re: [PATCH, AArch64] atomics: prefetch the destination for write prior to ldxr/stxr loops

2016-03-15 Thread James Greenhalgh
On Mon, Mar 07, 2016 at 10:54:25PM -0800, Andrew Pinski wrote:
> On Mon, Mar 7, 2016 at 8:12 PM, Yangfei (Felix)  wrote:
> >> On Mon, Mar 7, 2016 at 7:27 PM, Yangfei (Felix)  
> >> wrote:
> >> > Hi,
> >> >
> >> > As discussed in LKML:
> >> http://lists.infradead.org/pipermail/linux-arm-kernel/2015-July/355996.html,
> >>  the
> >> cost of changing a cache line
> >> > from shared to exclusive state can be significant on aarch64 cores,
> >> especially when this is triggered by an exclusive store, since it may
> >> > result in having to retry the transaction.
> >> > This patch makes use of the "prfm PSTL1STRM" instruction to prefetch
> >> cache lines for write prior to ldxr/stxr loops generated by the ll/sc 
> >> atomic
> >> routines.
> >> > Bootstrapped on AArch64 server, is it OK?
> >>
> >>
> >> I don't think this is a good thing in general.  For an example on 
> >> ThunderX, the
> >> prefetch just adds a cycle for no benefit.  This really depends on the
> >> micro-architecture of the core and how LDXR/STXR are
> >> implemented.   So after this patch, it will slow down ThunderX.
> >>
> >> Thanks,
> >> Andrew Pinski
> >>
> >
> > Hi Andrew,
> >
> >I am not quite clear about the ThunderX micro-arch.  But, Yes, I agree
> >it depends on the micro-architecture of the core.  As the mentioned
> >kernel patch is merged upstream, I think the added prefetch instruction
> >in atomic routines is good for most of AArch64 cores in the market.  If
> >it does nothing good for ThunderX, then how about adding some checking
> >here?  I mean disabling the the generation of the prfm if we are tuning
> >for ThunderX.
> 
> No it is not just not do any good, it actually causes worse
> performance for ThunderX.  How about only doing it for the
> micro-architecture where it helps and also not do it for generic since
> it hurts ThunderX so much.

This should be a GCC 7 patch at this point, which should give us some time
to talk through whether we want this patch or not.

How bad is this for ThunderX - upthread you said one cycle penalty, but here
you suggest it hurts ThunderX more? Note that the prefetch is outside of
the LDXR/STXR loop.

Thanks,
James



Re: [PATCH, AArch64] atomics: prefetch the destination for write prior to ldxr/stxr loops

2016-03-07 Thread Andrew Pinski
On Mon, Mar 7, 2016 at 8:12 PM, Yangfei (Felix)  wrote:
>> On Mon, Mar 7, 2016 at 7:27 PM, Yangfei (Felix)  
>> wrote:
>> > Hi,
>> >
>> > As discussed in LKML:
>> http://lists.infradead.org/pipermail/linux-arm-kernel/2015-July/355996.html, 
>> the
>> cost of changing a cache line
>> > from shared to exclusive state can be significant on aarch64 cores,
>> especially when this is triggered by an exclusive store, since it may
>> > result in having to retry the transaction.
>> > This patch makes use of the "prfm PSTL1STRM" instruction to prefetch
>> cache lines for write prior to ldxr/stxr loops generated by the ll/sc atomic
>> routines.
>> > Bootstrapped on AArch64 server, is it OK?
>>
>>
>> I don't think this is a good thing in general.  For an example on ThunderX, 
>> the
>> prefetch just adds a cycle for no benefit.  This really depends on the
>> micro-architecture of the core and how LDXR/STXR are
>> implemented.   So after this patch, it will slow down ThunderX.
>>
>> Thanks,
>> Andrew Pinski
>>
>
> Hi Andrew,
>
>I am not quite clear about the ThunderX micro-arch.  But, Yes, I agree it 
> depends on the micro-architecture of the core.
>As the mentioned kernel patch is merged upstream, I think the added 
> prefetch instruction in atomic routines is good for most of AArch64 cores in 
> the market.
>If it does nothing good for ThunderX, then how about adding some checking 
> here?  I mean disabling the the generation of the prfm if we are tuning for 
> ThunderX.

No it is not just not do any good, it actually causes worse
performance for ThunderX.  How about only doing it for the
micro-architecture where it helps and also not do it for generic since
it hurts ThunderX so much.

Thanks,
Andrew

>
> Thanks,
> Felix


Re: [PATCH, AArch64] atomics: prefetch the destination for write prior to ldxr/stxr loops

2016-03-07 Thread Yangfei (Felix)
> On Mon, Mar 7, 2016 at 7:27 PM, Yangfei (Felix)  wrote:
> > Hi,
> >
> > As discussed in LKML:
> http://lists.infradead.org/pipermail/linux-arm-kernel/2015-July/355996.html, 
> the
> cost of changing a cache line
> > from shared to exclusive state can be significant on aarch64 cores,
> especially when this is triggered by an exclusive store, since it may
> > result in having to retry the transaction.
> > This patch makes use of the "prfm PSTL1STRM" instruction to prefetch
> cache lines for write prior to ldxr/stxr loops generated by the ll/sc atomic
> routines.
> > Bootstrapped on AArch64 server, is it OK?
> 
> 
> I don't think this is a good thing in general.  For an example on ThunderX, 
> the
> prefetch just adds a cycle for no benefit.  This really depends on the
> micro-architecture of the core and how LDXR/STXR are
> implemented.   So after this patch, it will slow down ThunderX.
> 
> Thanks,
> Andrew Pinski
> 

Hi Andrew,

   I am not quite clear about the ThunderX micro-arch.  But, Yes, I agree it 
depends on the micro-architecture of the core.  
   As the mentioned kernel patch is merged upstream, I think the added prefetch 
instruction in atomic routines is good for most of AArch64 cores in the market. 
 
   If it does nothing good for ThunderX, then how about adding some checking 
here?  I mean disabling the the generation of the prfm if we are tuning for 
ThunderX.  
   
Thanks,
Felix


Re: [PATCH, AArch64] atomics: prefetch the destination for write prior to ldxr/stxr loops

2016-03-07 Thread Andrew Pinski
On Mon, Mar 7, 2016 at 7:27 PM, Yangfei (Felix)  wrote:
> Hi,
>
> As discussed in LKML: 
> http://lists.infradead.org/pipermail/linux-arm-kernel/2015-July/355996.html, 
> the cost of changing a cache line
> from shared to exclusive state can be significant on aarch64 cores, 
> especially when this is triggered by an exclusive store, since it may
> result in having to retry the transaction.
> This patch makes use of the "prfm PSTL1STRM" instruction to prefetch 
> cache lines for write prior to ldxr/stxr loops generated by the ll/sc atomic 
> routines.
> Bootstrapped on AArch64 server, is it OK?


I don't think this is a good thing in general.  For an example on
ThunderX, the prefetch just adds a cycle for no benefit.  This really
depends on the micro-architecture of the core and how LDXR/STXR are
implemented.   So after this patch, it will slow down ThunderX.

Thanks,
Andrew Pinski

>
> Thanks,
> Felix


[PATCH, AArch64] atomics: prefetch the destination for write prior to ldxr/stxr loops

2016-03-07 Thread Yangfei (Felix)
Hi,

As discussed in LKML: 
http://lists.infradead.org/pipermail/linux-arm-kernel/2015-July/355996.html, 
the cost of changing a cache line
from shared to exclusive state can be significant on aarch64 cores, 
especially when this is triggered by an exclusive store, since it may
result in having to retry the transaction. 
This patch makes use of the "prfm PSTL1STRM" instruction to prefetch cache 
lines for write prior to ldxr/stxr loops generated by the ll/sc atomic 
routines. 
Bootstrapped on AArch64 server, is it OK? 

Thanks,
Felix


aarch64-atomics-v0.diff
Description: aarch64-atomics-v0.diff