----- Original Message -----
> From: "Paul E. McKenney" <paul...@linux.vnet.ibm.com>
> To: "Mathieu Desnoyers" <mathieu.desnoy...@efficios.com>
> Cc: "Simon Marchi" <simon.mar...@polymtl.ca>, lttng-dev@lists.lttng.org
> Sent: Friday, December 6, 2013 10:40:45 PM
> Subject: Re: [lttng-dev] Xeon Phi memory barriers
>
> On Fri, Dec 06, 2013 at 08:15:38PM +0000, Mathieu Desnoyers wrote:
> > ----- Original Message -----
> > > From: "Simon Marchi" <simon.mar...@polymtl.ca>
> > > To: lttng-dev@lists.lttng.org
> > > Sent: Tuesday, November 19, 2013 4:26:06 PM
> > > Subject: [lttng-dev] Xeon Phi memory barriers
> > >
> > > Hello there,
> >
> > Hi Simon,
> >
> > While reading this reply, please keep in mind that I'm in a mindset
> > where I've been in a full week of meetings, and it's late on Friday
> > evening here. So YMMV ;-) I'm CCing Paul E. McKenney, so he can
> > debunk my answer :)
> >
> > > liburcu does not build on the Intel Xeon Phi, because the chip is
> > > recognized as x86_64, but lacks the {s,l,m}fence instructions found on
> > > usual x86_64 processors. The following is taken from the Xeon Phi dev
> > > guide:
> >
> > Let's have a look:
> >
> > > The Intel® Xeon Phi™ coprocessor memory model is the same as that of
> > > the Intel® Pentium processor. The reads and writes always appear in
> > > programmed order at the system bus (or the ring interconnect in the
> > > case of the Intel® Xeon Phi™ coprocessor); the exception being that
> > > read misses are permitted to go ahead of buffered writes on the system
> > > bus when all the buffered writes are cached hits and are, therefore,
> > > not directed to the same address being accessed by the read miss.
> >
> > OK, so reads can be reordered with respect to following writes.
>
> That would be -preceding- writes, correct?
Oh, yes, I got it reversed.

> > > As a consequence of its stricter memory ordering model, the Intel®
> > > Xeon Phi™ coprocessor does not support the SFENCE, LFENCE, and MFENCE
> > > instructions that provide a more efficient way of controlling memory
> > > ordering on other Intel processors.
> >
> > I guess sfence and lfence are indeed completely useless, because we only
> > can ever care about ordering reads vs writes (mfence). But even the
> > mfence is not there.
>
> The usual approach is an atomic operation to a dummy location on the
> stack. Is that the recommendation for Xeon Phi?

Yes, see below,

> Either way, what should userspace RCU do to detect that it is being built
> on a Xeon Phi? I am sure that Mathieu would welcome the relevant patches
> for this. ;-)
>
> > > While reads and writes from an Intel® Xeon Phi™ coprocessor appear in
> > > program order on the system bus,
> >
> > This part of the sentence seems misleading to me. Didn't the first
> > sentence state the opposite? "the exception being that read misses are
> > permitted to go ahead of buffered writes on the system bus when all the
> > buffered writes are cached hits and are, therefore, not directed to the
> > same address being accessed by the read miss."
> >
> > I'm probably missing something.
>
> The trick might be that read misses are only allowed to pass write
> -hits-, which would mean that the system bus would have already seen
> the invalidate corresponding to the delayed write, and thus would
> have no evidence of any misordering.
>
> > > the compiler can still reorder unrelated memory operations while
> > > maintaining program order on a single Intel® Xeon Phi™ coprocessor
> > > (hardware thread).
> > > If software running on an Intel® Xeon Phi™ coprocessor is dependent
> > > on the order of memory operations on another Intel® Xeon Phi™
> > > coprocessor then a serializing instruction (e.g., CPUID, instruction
> > > with a LOCK prefix) between the memory operations is required to
> > > guarantee completion of all memory accesses issued prior to the
> > > serializing instruction before any subsequent memory operations are
> > > started.
>
> OK, sounds like my guess of atomic instruction to dummy stack location
> is correct, or perhaps carrying out a nearby assignment using an
> xchg instruction.

Yes, or a CPUID instruction seems OK too. We already use lock; addl on the
stack in URCU for cases where fence instructions may not be available
(x86-32).

> > > (end of quote)
> > >
> > > From what I understand, it is safe to leave out any run-time memory
> > > barriers, but we still need barriers that prevent the compiler from
> > > reordering (using __asm__ __volatile__ ("":::"memory")). In
> > > urcu/arch/x86.h, I see that when CONFIG_RCU_HAVE_FENCE is false,
> > > memory barriers result in both compile-time and run-time memory
> > > barriers: __asm__ __volatile__ ("lock; addl $0,0(%%esp)":::"memory").
> > > I guess this would work for the Phi, but the lock instruction does not
> > > seem necessary.
> >
> > Actually, either a cpuid (core-serializing) instruction or a
> > lock-prefixed instruction (which serializes memory accesses as a
> > side-effect) seems required.
>
> It would certainly be safe. One approach would be to keep it that way
> unless/until someone showed it to be unnecessary.
>
> > > So, should we just set CONFIG_RCU_HAVE_FENCE to false when compiling
> > > for the Phi and go on with our lives, or should we add a specific
> > > config for this case?
> >
> > I _think_ we could get away with this mapping:
> >
> > smp_wmb() -> barrier()
> > reasoning: write vs write are not reordered by the processor.
> >
> > smp_rmb() -> barrier()
> > reasoning: read vs read not reordered by processor.
> >
> > smp_mb() -> __asm__ __volatile__ ("lock; addl $0,0(%%esp)":::"memory")
> > or a cpuid instruction
> > reasoning: the CPU can reorder reads vs earlier writes.
> >
> > smp_read_barrier_depends() -> nothing at all (not needed at any level).
>
> This should be safe, though I would argue for do { } while (0) for
> smp_read_barrier_depends().

Indeed.

> > Interestingly enough, AFAIU, this seems to map to x86-TSO. Maybe,
> > instead of defining a compile option specifically for Xeon Phi, we
> > could define an x86-tso.h header variant in userspace RCU and use it
> > on all Intel processors that map to TSO (hint: the vast majority). The
> > only exceptions seem to be the Pentium Pro (needing smp_rmb() ->
> > lfence) and some WinChip processors which could reorder stores (thus
> > needing smp_wmb() -> sfence).
> >
> > Thoughts ?
>
> As long as there is some reasonable way of detecting them.

The issue here is that I don't see any easy way to detect PPro and WinChip.
AFAIU it needs to be done dynamically (e.g. by reading /proc/cpuinfo), and
this would require code patching. We unfortunately don't have the
infrastructure for this yet.

> Actually, why not use the locked add of zero for all x86 systems for
> smp_mb()?

I suspect that perhaps on NUMA x86-64 systems, using a locked add might
have a more severe performance impact than mfence.

Also, AFAIU, when compiling for Xeon Phi, the compiler is targeting a
specific sub-architecture:

  "x$host_vendor" == "xk1om"

So we might not need to find the minimum common denominator between x86-64
(generic) and Xeon Phi at all, since it has its own instruction set.

Thoughts ?
Thanks,

Mathieu

> 							Thanx, Paul
>
> > Thanks,
> >
> > Mathieu
> >
> > > Simon
> > >
> > > _______________________________________________
> > > lttng-dev mailing list
> > > lttng-dev@lists.lttng.org
> > > http://lists.lttng.org/cgi-bin/mailman/listinfo/lttng-dev
> >
> > --
> > Mathieu Desnoyers
> > EfficiOS Inc.
> > http://www.efficios.com

--
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com

_______________________________________________
lttng-dev mailing list
lttng-dev@lists.lttng.org
http://lists.lttng.org/cgi-bin/mailman/listinfo/lttng-dev