On Mon, Dec 8, 2014 at 8:35 PM, David Holmes <david.hol...@oracle.com> wrote:
>> So (as you say) with TSO you don't have a total order of stores if you
>> read your own writes out of your own CPU's write buffer. However, my
>> interpretation of "multiple-copy atomic" is that the initial
>> publishing thread can choose to use an instruction with a sufficiently
>> strong memory barrier attached (e.g. LOCK;XXX on x86) to write to
>> memory, so that the write buffer is flushed, and then use plain relaxed
>> loads everywhere else to read those memory locations. This explains
>> the situation on x86 and sparc, where volatile writes are expensive,
>> volatile reads are "free", and you get sequential consistency for Java
>> volatiles.
>
> We don't use lock'd instructions for volatile stores on x86, but the
> trailing mfence achieves the "flushing".
>
> However, this still raised some questions for me. Using an mfence on x86,
> or the equivalent on sparc, is no different to issuing a "DMB SYNC" on
> ARM, or a SYNC on PowerPC. They each ensure TSO for volatile stores with
> global visibility. So when such fences are used, the resulting system
> should be multiple-copy atomic - no? (No!**) And there seems to be an
> equivalence between being multiple-copy atomic and providing the IRIW
> property. Yet we know that on ARM/Power, as per the paper, TSO with global
> visibility is not

ARM/Power don't have TSO.

> sufficient to achieve IRIW. So what is it that x86 and sparc have in
> addition to TSO that provides for IRIW?

We have both been learning... to think in new ways. I found section 2 of
Peter Sewell's tutorial, "From Sequential Consistency to Relaxed Memory
Models", to be most useful, especially the diagrams.

> I pondered this for quite a while before realizing that the mfence on x86
> (or the equivalent on sparc) is not in fact playing the same role as the
> DMB/SYNC on ARM/PPC. The key property that x86 and sparc have (and we can
> ignore the store buffers) is that stores become globally visible - if any
> other thread sees a store, then all other threads see the same store.
> Whereas on ARM/PPC you can imagine a store casually making its way through
> the system, gradually becoming visible to more and more threads - unless
> there is a DMB/SYNC to force a globally consistent memory view. Hence for
> IRIW, placing the DMB/SYNC after the store does not suffice, because prior
> to the DMB/SYNC the store may be visible to an arbitrary subset of
> threads. Consequently IRIW requires the DMB/SYNC between the loads - to
> ensure that each thread, on its second load, must see the value that the
> other thread saw on its first load (ref Section 6.1 of the paper).
>
> ** So using DMB/SYNC does not achieve multiple-copy atomicity, because
> until the DMB/SYNC happens, different threads can have different views of
> memory.

To me, the most desirable property of x86-style TSO is that barriers are
necessary only on stores to achieve sequential consistency - the publisher
gets to decide. Volatile reads can then be close to free.

> All of which reinforces to me that IRIW is an undesirable property to have
> to implement. YMMV. (And I also need to re-examine the PPC64
> implementation to see exactly where they add/remove barriers when IRIW is
> enabled.)

I believe you get a full sync between volatile reads:

#define GET_FIELD_VOLATILE(obj, offset, type_name, v) \
  oop p = JNIHandles::resolve(obj); \
  if (support_IRIW_for_not_multiple_copy_atomic_cpu) { \
    OrderAccess::fence(); \
  } \

> Cheers,
> David
>
>> http://www.cl.cam.ac.uk/~pes20/ppc-supplemental/test7.pdf
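
P.S. For anyone following along at home, here is the standard IRIW litmus
test written out with Java volatiles - a minimal sketch of mine (the class
and variable names are invented, not from the JDK or the paper):

// IRIW: two writers store to independent volatile variables; two
// readers load them in opposite orders.  (Illustrative sketch only.)
class IRIW {
    static volatile int x, y;
    static int r1, r2, r3, r4;

    public static void main(String[] args) throws InterruptedException {
        Thread w1 = new Thread(() -> x = 1);
        Thread w2 = new Thread(() -> y = 1);
        Thread rd1 = new Thread(() -> { r1 = x; r2 = y; });
        Thread rd2 = new Thread(() -> { r3 = y; r4 = x; });
        w1.start(); w2.start(); rd1.start(); rd2.start();
        w1.join(); w2.join(); rd1.join(); rd2.join();
        // Forbidden by the JMM for volatiles:
        //   r1 == 1 && r2 == 0 && r3 == 1 && r4 == 0
        // i.e. the two readers disagreeing on the order of the two
        // independent writes.  A single run proves nothing, of course;
        // a harness like jcstress is needed to hunt for the outcome.
        System.out.printf("r1=%d r2=%d r3=%d r4=%d%n", r1, r2, r3, r4);
    }
}

On x86/sparc the readers need no barriers because stores are multiple-copy
atomic; on ARM/Power that outcome is ruled out only by the DMB/SYNC between
each reader's two loads, which is exactly what the GET_FIELD_VOLATILE path
above pays for on PPC64.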
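
And to make "a full sync between volatile reads" concrete, here is a rough
Java-level emulation of that fence placement, using the JDK 8
sun.misc.Unsafe fence intrinsics. To be clear, this is my own conceptual
sketch, not HotSpot's actual code generation, and fencing around a plain
load is not a JMM-correct substitute for a volatile field:

import java.lang.reflect.Field;
import sun.misc.Unsafe;

class VolatileReadOnPower {
    private static final Unsafe U;
    static {
        try {
            Field f = Unsafe.class.getDeclaredField("theUnsafe");
            f.setAccessible(true);
            U = (Unsafe) f.get(null);
        } catch (ReflectiveOperationException e) {
            throw new ExceptionInInitializerError(e);
        }
    }

    static int shared;  // stand-in for a volatile field

    static int volatileLikeRead() {
        U.fullFence();   // the full sync issued before the load when
                         // support_IRIW_for_not_multiple_copy_atomic_cpu
                         // is set - this is what IRIW costs readers
        int v = shared;  // the load itself
        U.loadFence();   // acquire: keep subsequent accesses after the load
        return v;
    }
}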