Re: [PATCH, AArch64] atomics: prefetch the destination for write prior to ldxr/stxr loops
On Tue, Mar 15, 2016 at 03:31:30PM +, James Greenhalgh wrote: > On Mon, Mar 07, 2016 at 10:54:25PM -0800, Andrew Pinski wrote: > > On Mon, Mar 7, 2016 at 8:12 PM, Yangfei (Felix)> > wrote: > > >> On Mon, Mar 7, 2016 at 7:27 PM, Yangfei (Felix) > > >> wrote: > > >> > Hi, > > >> > > > >> > As discussed in LKML: > > >> http://lists.infradead.org/pipermail/linux-arm-kernel/2015-July/355996.html, > > >> the > > >> cost of changing a cache line > > >> > from shared to exclusive state can be significant on aarch64 cores, > > >> especially when this is triggered by an exclusive store, since it may > > >> > result in having to retry the transaction. > > >> > This patch makes use of the "prfm PSTL1STRM" instruction to > > >> > prefetch > > >> cache lines for write prior to ldxr/stxr loops generated by the ll/sc > > >> atomic > > >> routines. > > >> > Bootstrapped on AArch64 server, is it OK? > > >> > > >> > > >> I don't think this is a good thing in general. For an example on > > >> ThunderX, the > > >> prefetch just adds a cycle for no benefit. This really depends on the > > >> micro-architecture of the core and how LDXR/STXR are > > >> implemented. So after this patch, it will slow down ThunderX. > > >> > > >> Thanks, > > >> Andrew Pinski > > >> > > > > > > Hi Andrew, > > > > > >I am not quite clear about the ThunderX micro-arch. But, Yes, I agree > > >it depends on the micro-architecture of the core. As the mentioned > > >kernel patch is merged upstream, I think the added prefetch instruction > > >in atomic routines is good for most of AArch64 cores in the market. If > > >it does nothing good for ThunderX, then how about adding some checking > > >here? I mean disabling the the generation of the prfm if we are tuning > > >for ThunderX. > > > > No it is not just not do any good, it actually causes worse > > performance for ThunderX. How about only doing it for the > > micro-architecture where it helps and also not do it for generic since > > it hurts ThunderX so much. > > This should be a GCC 7 patch at this point, which should give us some time > to talk through whether we want this patch or not. > > How bad is this for ThunderX - upthread you said one cycle penalty, but here > you suggest it hurts ThunderX more? Note that the prefetch is outside of > the LDXR/STXR loop. Hi Andrew, Did you have any further thoughts on the magnitude of the penalty you would face on ThunderX? Felix, I think if this is going to be expensive for some AArch64 platforms, then the best way to progress this patch would be to add a new flag to tune_flags. Something like AARCH64_EXTRA_TUNE_PREFETCH_LDREX. This would allow targets which want the explicit prefetch to enable it. Thanks, James
Re: [PATCH, AArch64] atomics: prefetch the destination for write prior to ldxr/stxr loops
On Mon, Mar 07, 2016 at 10:54:25PM -0800, Andrew Pinski wrote: > On Mon, Mar 7, 2016 at 8:12 PM, Yangfei (Felix)wrote: > >> On Mon, Mar 7, 2016 at 7:27 PM, Yangfei (Felix) > >> wrote: > >> > Hi, > >> > > >> > As discussed in LKML: > >> http://lists.infradead.org/pipermail/linux-arm-kernel/2015-July/355996.html, > >> the > >> cost of changing a cache line > >> > from shared to exclusive state can be significant on aarch64 cores, > >> especially when this is triggered by an exclusive store, since it may > >> > result in having to retry the transaction. > >> > This patch makes use of the "prfm PSTL1STRM" instruction to prefetch > >> cache lines for write prior to ldxr/stxr loops generated by the ll/sc > >> atomic > >> routines. > >> > Bootstrapped on AArch64 server, is it OK? > >> > >> > >> I don't think this is a good thing in general. For an example on > >> ThunderX, the > >> prefetch just adds a cycle for no benefit. This really depends on the > >> micro-architecture of the core and how LDXR/STXR are > >> implemented. So after this patch, it will slow down ThunderX. > >> > >> Thanks, > >> Andrew Pinski > >> > > > > Hi Andrew, > > > >I am not quite clear about the ThunderX micro-arch. But, Yes, I agree > >it depends on the micro-architecture of the core. As the mentioned > >kernel patch is merged upstream, I think the added prefetch instruction > >in atomic routines is good for most of AArch64 cores in the market. If > >it does nothing good for ThunderX, then how about adding some checking > >here? I mean disabling the the generation of the prfm if we are tuning > >for ThunderX. > > No it is not just not do any good, it actually causes worse > performance for ThunderX. How about only doing it for the > micro-architecture where it helps and also not do it for generic since > it hurts ThunderX so much. This should be a GCC 7 patch at this point, which should give us some time to talk through whether we want this patch or not. How bad is this for ThunderX - upthread you said one cycle penalty, but here you suggest it hurts ThunderX more? Note that the prefetch is outside of the LDXR/STXR loop. Thanks, James
Re: [PATCH, AArch64] atomics: prefetch the destination for write prior to ldxr/stxr loops
On Mon, Mar 7, 2016 at 8:12 PM, Yangfei (Felix)wrote: >> On Mon, Mar 7, 2016 at 7:27 PM, Yangfei (Felix) >> wrote: >> > Hi, >> > >> > As discussed in LKML: >> http://lists.infradead.org/pipermail/linux-arm-kernel/2015-July/355996.html, >> the >> cost of changing a cache line >> > from shared to exclusive state can be significant on aarch64 cores, >> especially when this is triggered by an exclusive store, since it may >> > result in having to retry the transaction. >> > This patch makes use of the "prfm PSTL1STRM" instruction to prefetch >> cache lines for write prior to ldxr/stxr loops generated by the ll/sc atomic >> routines. >> > Bootstrapped on AArch64 server, is it OK? >> >> >> I don't think this is a good thing in general. For an example on ThunderX, >> the >> prefetch just adds a cycle for no benefit. This really depends on the >> micro-architecture of the core and how LDXR/STXR are >> implemented. So after this patch, it will slow down ThunderX. >> >> Thanks, >> Andrew Pinski >> > > Hi Andrew, > >I am not quite clear about the ThunderX micro-arch. But, Yes, I agree it > depends on the micro-architecture of the core. >As the mentioned kernel patch is merged upstream, I think the added > prefetch instruction in atomic routines is good for most of AArch64 cores in > the market. >If it does nothing good for ThunderX, then how about adding some checking > here? I mean disabling the the generation of the prfm if we are tuning for > ThunderX. No it is not just not do any good, it actually causes worse performance for ThunderX. How about only doing it for the micro-architecture where it helps and also not do it for generic since it hurts ThunderX so much. Thanks, Andrew > > Thanks, > Felix
Re: [PATCH, AArch64] atomics: prefetch the destination for write prior to ldxr/stxr loops
> On Mon, Mar 7, 2016 at 7:27 PM, Yangfei (Felix)wrote: > > Hi, > > > > As discussed in LKML: > http://lists.infradead.org/pipermail/linux-arm-kernel/2015-July/355996.html, > the > cost of changing a cache line > > from shared to exclusive state can be significant on aarch64 cores, > especially when this is triggered by an exclusive store, since it may > > result in having to retry the transaction. > > This patch makes use of the "prfm PSTL1STRM" instruction to prefetch > cache lines for write prior to ldxr/stxr loops generated by the ll/sc atomic > routines. > > Bootstrapped on AArch64 server, is it OK? > > > I don't think this is a good thing in general. For an example on ThunderX, > the > prefetch just adds a cycle for no benefit. This really depends on the > micro-architecture of the core and how LDXR/STXR are > implemented. So after this patch, it will slow down ThunderX. > > Thanks, > Andrew Pinski > Hi Andrew, I am not quite clear about the ThunderX micro-arch. But, Yes, I agree it depends on the micro-architecture of the core. As the mentioned kernel patch is merged upstream, I think the added prefetch instruction in atomic routines is good for most of AArch64 cores in the market. If it does nothing good for ThunderX, then how about adding some checking here? I mean disabling the the generation of the prfm if we are tuning for ThunderX. Thanks, Felix
Re: [PATCH, AArch64] atomics: prefetch the destination for write prior to ldxr/stxr loops
On Mon, Mar 7, 2016 at 7:27 PM, Yangfei (Felix)wrote: > Hi, > > As discussed in LKML: > http://lists.infradead.org/pipermail/linux-arm-kernel/2015-July/355996.html, > the cost of changing a cache line > from shared to exclusive state can be significant on aarch64 cores, > especially when this is triggered by an exclusive store, since it may > result in having to retry the transaction. > This patch makes use of the "prfm PSTL1STRM" instruction to prefetch > cache lines for write prior to ldxr/stxr loops generated by the ll/sc atomic > routines. > Bootstrapped on AArch64 server, is it OK? I don't think this is a good thing in general. For an example on ThunderX, the prefetch just adds a cycle for no benefit. This really depends on the micro-architecture of the core and how LDXR/STXR are implemented. So after this patch, it will slow down ThunderX. Thanks, Andrew Pinski > > Thanks, > Felix
[PATCH, AArch64] atomics: prefetch the destination for write prior to ldxr/stxr loops
Hi, As discussed in LKML: http://lists.infradead.org/pipermail/linux-arm-kernel/2015-July/355996.html, the cost of changing a cache line from shared to exclusive state can be significant on aarch64 cores, especially when this is triggered by an exclusive store, since it may result in having to retry the transaction. This patch makes use of the "prfm PSTL1STRM" instruction to prefetch cache lines for write prior to ldxr/stxr loops generated by the ll/sc atomic routines. Bootstrapped on AArch64 server, is it OK? Thanks, Felix aarch64-atomics-v0.diff Description: aarch64-atomics-v0.diff