Re: [lttng-dev] Xeon Phi memory barriers

Paul E. McKenney Mon, 09 Dec 2013 13:48:36 -0800

On Sat, Dec 07, 2013 at 05:58:54AM +0000, Mathieu Desnoyers wrote:
> ----- Original Message -----
> > From: "Paul E. McKenney" <[email protected]>
> > To: "Mathieu Desnoyers" <[email protected]>
> > Cc: "Simon Marchi" <[email protected]>, [email protected]
> > Sent: Friday, December 6, 2013 10:40:45 PM
> > Subject: Re: [lttng-dev] Xeon Phi memory barriers
> > 
> > On Fri, Dec 06, 2013 at 08:15:38PM +0000, Mathieu Desnoyers wrote:
> > > ----- Original Message -----
> > > > From: "Simon Marchi" <[email protected]>
> > > > To: [email protected]
> > > > Sent: Tuesday, November 19, 2013 4:26:06 PM
> > > > Subject: [lttng-dev] Xeon Phi memory barriers
> > > > 
> > > > Hello there,
> > > 
> > > Hi Simon,
> > > 
> > > While reading this reply, please keep in mind that I'm in a
> > > mindset where I've been in a full week of meetings, and it's late on
> > > Friday evening here. So YMMV ;-) I'm CCing Paul E. McKenney, so he can
> > > debunk my answer :)
> > > 
> > > > 
> > > > liburcu does not build on the Intel Xeon Phi, because the chip is
> > > > recognized as x86_64, but lacks the {s,l,m}fence instructions found on
> > > > usual x86_64 processors. The following is taken from the Xeon Phi dev
> > > > guide:
> > > 
> > > Let's have a look:
> > > 
> > > > 
> > > > The Intel® Xeon Phi™ coprocessor memory model is the same as that of
> > > > the Intel® Pentium processor. The reads and writes always appear in
> > > > programmed order at the system bus (or the ring interconnect in the
> > > > case of the Intel® Xeon Phi™ coprocessor); the exception being that
> > > > read misses are permitted to go ahead of buffered writes on the system
> > > > bus when all the buffered writes are cached hits and are, therefore,
> > > > not directed to the same address being accessed by the read miss.
> > > 
> > > OK, so reads can be reordered with respect to following writes.
> > 
> > That would be -preceding- writes, correct?
> 
> Oh, yes, I got it reversed.
> 
> > 
> > > > As a consequence of its stricter memory ordering model, the Intel®
> > > > Xeon Phi™ coprocessor does not support the SFENCE, LFENCE, and MFENCE
> > > > instructions that provide a more efficient way of controlling memory
> > > > ordering on other Intel processors.
> > > 
> > > I guess sfence and lfence are indeed completely useless, because we can
> > > only ever care about ordering reads vs writes (mfence). But even mfence
> > > is not there.
> > 
> > The usual approach is an atomic operation to a dummy location on the
> > stack.  Is that the recommendation for Xeon Phi?
> 
> Yes, see below,
> 
> > 
> > Either way, what should userspace RCU do to detect that it is being built
> > on a Xeon Phi?  I am sure that Mathieu would welcome the relevant patches
> > for this.  ;-)
> > 
> > > > While reads and writes from an Intel® Xeon Phi™ coprocessor appear in
> > > > program order on the system bus,
> > > 
> > > This part of the sentence seems misleading to me. Didn't the first
> > > sentence state the opposite ? "the exception being that
> > > read misses are permitted to go ahead of buffered writes on the system
> > > bus when all the buffered writes are cached hits and are, therefore,
> > > not directed to the same address being accessed by the read miss."
> > > 
> > > I'm probably missing something.
> > 
> > The trick might be that read misses are only allowed to pass write
> > -hits-, which would mean that the system bus would have already seen
> > the invalidate corresponding to the delayed write, and thus would
> > have no evidence of any misordering.
> > 
> > > > the compiler can still reorder
> > > > unrelated memory operations while maintaining program order on a
> > > > single Intel® Xeon Phi™ coprocessor (hardware thread). If software
> > > > running on an Intel® Xeon Phi™ coprocessor is dependent on the order
> > > > of memory operations on another Intel® Xeon Phi™ coprocessor then a
> > > > serializing instruction (e.g., CPUID, instruction with a LOCK prefix)
> > > > between the memory operations is required to guarantee completion of
> > > > all memory accesses issued prior to the serializing instruction before
> > > > any subsequent memory operations are started.
> > 
> > OK, sounds like my guess of atomic instruction to dummy stack location
> > is correct, or perhaps carrying out a nearby assignment using an
> > xchg instruction.
> 
> Yes, or CPUID instruction seems OK too. We already use lock; addl on stack
> in URCU for cases where fence instructions may not be available (x86-32).
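
A minimal sketch of the two approaches mentioned above: a locked add of
zero to the word at the stack pointer, and an assignment carried out with
xchg (which is implicitly locked when one operand is in memory). The
helper names dummy_locked_add() and store_with_xchg() are made up for
illustration and are not liburcu APIs. The sketch assumes GCC-style
inline asm and uses %rsp on 64-bit targets where the x86-32 code quoted
in this thread uses %esp. The non-x86 branches are only there so the
sketch builds elsewhere.

```c
/* Illustrative only: hypothetical helper names, GCC-style inline asm. */

/* Full barrier via a locked no-op add to the word at the stack pointer. */
static inline void dummy_locked_add(void)
{
#if defined(__x86_64__)
	__asm__ __volatile__ ("lock; addl $0,0(%%rsp)" ::: "memory", "cc");
#elif defined(__i386__)
	__asm__ __volatile__ ("lock; addl $0,0(%%esp)" ::: "memory", "cc");
#else
	__sync_synchronize();	/* fallback so the sketch builds on non-x86 */
#endif
}

/* Store val to *addr with xchg; a memory operand implies LOCK on x86. */
static inline void store_with_xchg(int *addr, int val)
{
#if defined(__x86_64__) || defined(__i386__)
	__asm__ __volatile__ ("xchgl %0, %1"
			      : "+r" (val), "+m" (*addr)
			      :
			      : "memory");
#else
	(void) __atomic_exchange_n(addr, val, __ATOMIC_SEQ_CST);
#endif
}
```

Adding zero leaves the stack word unchanged, so the locked add is only
there for its ordering side effect. The xchg variant combines the store
and the barrier in one instruction.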
> 
> > 
> > > > (end of quote)
> > > > 
> > > > From what I understand, it is safe to leave out any run-time memory
> > > > barriers, but we still need barriers that prevent the compiler from
> > > > reordering (using __asm__ __volatile__ ("":::"memory")). In
> > > > urcu/arch/x86.h, I see that when CONFIG_RCU_HAVE_FENCE is false,
> > > > memory barriers result in both compile-time and run-time memory
> > > > barriers:  __asm__ __volatile__ ("lock; addl $0,0(%%esp)":::"memory").
> > > > I guess this would work for the Phi, but the lock instruction does not
> > > > seem necessary.
> > > 
> > > Actually, either a cpuid (core serializing) instruction or a lock-prefixed
> > > instruction (which serializes memory accesses as a side effect) seems
> > > required.
> > 
> > It would certainly be safe.  One approach would be to keep it that way
> > unless/until someone showed it to be unnecessary.
> > 
> > > > So, should we just set CONFIG_RCU_HAVE_FENCE to false when compiling
> > > > for the Phi and go on with our lives, or should we add a specific
> > > > config for this case?
> > > 
> > > I _think_ we could get away with this mapping:
> > > 
> > > smp_wmb() -> barrier()
> > >   reasoning: write vs write are not reordered by the processor.
> > > 
> > > smp_rmb() -> barrier()
> > >   reasoning: read vs read not reordered by processor.
> > > 
> > > smp_mb() -> __asm__ __volatile__ ("lock; addl $0,0(%%esp)":::"memory")
> > >    or a cpuid instruction
> > >   reasoning: cpu can reorder reads vs earlier writes.
> > > 
> > > smp_read_barrier_depends() -> nothing at all (not needed at any level).
> > 
> > This should be safe, though I would argue for do { } while (0) for
> > smp_read_barrier_depends().
> 
> Indeed.
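
The mapping proposed above could be sketched roughly as follows. This is
an assumption-laden illustration, not the contents of urcu/arch/x86.h:
the macro bodies, the publish()/consume() helpers and the payload/ready
variables are hypothetical, %rsp replaces %esp on 64-bit targets, and the
non-x86 branch exists only so the sketch compiles elsewhere.

```c
/* Sketch of the proposed mapping; not liburcu's actual header. */
#define barrier()	__asm__ __volatile__ ("" ::: "memory")

#if defined(__x86_64__)
#define smp_mb()	__asm__ __volatile__ ("lock; addl $0,0(%%rsp)" ::: "memory", "cc")
#define smp_rmb()	barrier()	/* read vs read not reordered by the cpu */
#define smp_wmb()	barrier()	/* write vs write not reordered by the cpu */
#elif defined(__i386__)
#define smp_mb()	__asm__ __volatile__ ("lock; addl $0,0(%%esp)" ::: "memory", "cc")
#define smp_rmb()	barrier()
#define smp_wmb()	barrier()
#else
/* Non-x86 fallback so the sketch still builds; the mapping is x86-only. */
#define smp_mb()	__sync_synchronize()
#define smp_rmb()	__sync_synchronize()
#define smp_wmb()	__sync_synchronize()
#endif

#define smp_read_barrier_depends()	do { } while (0)

static volatile int payload;	/* hypothetical shared data */
static volatile int ready;	/* hypothetical "data is valid" flag */

/* Writer: store the payload, then set the flag (store/store ordering). */
static void publish(int value)
{
	payload = value;
	smp_wmb();
	ready = 1;
}

/* Reader: check the flag, then load the payload (load/load ordering). */
static int consume(int *value)
{
	if (!ready)
		return 0;
	smp_rmb();
	*value = payload;
	return 1;
}
```

Only smp_mb() emits an instruction here; smp_rmb() and smp_wmb() merely
stop the compiler from reordering, which is the point of the mapping.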
> 
> > 
> > > Interestingly enough, AFAIU, this seems to map to x86-TSO. Maybe,
> > > instead of defining a compile option specifically for Xeon Phi, we
> > > could define an x86-tso.h header variant in userspace RCU and use it
> > > on all Intel processors that map to TSO (hint: the vast majority). The
> > > only exceptions seem to be Pentium Pro (needing smp_rmb() -> lfence)
> > > and some WinChip processors, which could reorder stores (thus needing
> > > smp_wmb() -> sfence).
> > > 
> > > Thoughts ?
> > 
> > As long as there is some reasonable way of detecting them.
> 
> The issue here is that I don't see any easy way to detect PPro and WinChip.
> AFAIU it needs to be done dynamically (e.g. by reading /proc/cpuinfo), and
> this would require code patching. We unfortunately don't have the
> infrastructure for this yet.
> 
> > 
> > Actually, why not use the locked add of zero for all x86 systems for
> > smp_mb()?
> 
> I suspect that perhaps on NUMA x86-64 systems, using locked add might have
> more severe performance impact than mfence. Also, AFAIU, when compiling for
> Xeon Phi, the compiler is targeting a specific sub-architecture
> 
>   "x$host_vendor" == "xk1om"
> 
> So we might not need to find the minimum common denominator between x86-64
> (generic) and Xeon Phi at all, since it has its own instruction set.


The Linux kernel does a locked add for 32 bit and mfence for 64 bit.
Xeon Phi appears to be a bit strange in that it does 64 bit, but not
mfence.

I guess I don't have an objection to a separate Xeon Phi build target,
though.
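
A rough sketch of what a three-way selection could look like, assuming
the build system defines a macro when targeting Xeon Phi. TARGET_XEON_PHI,
full_barrier() and full_barrier_name() are hypothetical names invented
for this illustration; nothing here is taken from the kernel or from
liburcu.

```c
/* TARGET_XEON_PHI is a hypothetical macro a build system could define. */
#if defined(TARGET_XEON_PHI)
/* 64-bit, but no mfence: fall back to a locked add on the stack. */
#define FULL_BARRIER_NAME	"lock addl (rsp)"
#define full_barrier()	__asm__ __volatile__ ("lock; addl $0,0(%%rsp)" ::: "memory", "cc")
#elif defined(__x86_64__)
#define FULL_BARRIER_NAME	"mfence"
#define full_barrier()	__asm__ __volatile__ ("mfence" ::: "memory")
#elif defined(__i386__)
#define FULL_BARRIER_NAME	"lock addl (esp)"
#define full_barrier()	__asm__ __volatile__ ("lock; addl $0,0(%%esp)" ::: "memory", "cc")
#else
/* Non-x86 fallback so the sketch builds elsewhere. */
#define FULL_BARRIER_NAME	"__sync_synchronize"
#define full_barrier()	__sync_synchronize()
#endif

/* Report which full-barrier flavor was selected at compile time. */
static const char *full_barrier_name(void)
{
	return FULL_BARRIER_NAME;
}
```

The Xeon Phi branch has to be tested first, since such a target would
otherwise match the generic 64-bit branch and pick mfence.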

                                                        Thanx, Paul

> Thoughts ?
> 
> Thanks,
> 
> Mathieu
> 
> > 
> >                                                     Thanx, Paul
> > 
> > > Thanks,
> > > 
> > > Mathieu
> > > 
> > > > 
> > > > Simon
> > > > 
> > > > _______________________________________________
> > > > lttng-dev mailing list
> > > > [email protected]
> > > > http://lists.lttng.org/cgi-bin/mailman/listinfo/lttng-dev
> > > > 
> > > 
> > > --
> > > Mathieu Desnoyers
> > > EfficiOS Inc.
> > > http://www.efficios.com
> > > 
> > 
> > 
> 
> -- 
> Mathieu Desnoyers
> EfficiOS Inc.
> http://www.efficios.com
> 


_______________________________________________
lttng-dev mailing list
[email protected]
http://lists.lttng.org/cgi-bin/mailman/listinfo/lttng-dev
