On 10/12/2014 4:15 AM, Martin Buchholz wrote:
On Mon, Dec 8, 2014 at 8:35 PM, David Holmes <david.hol...@oracle.com> wrote:

So (as you say) with TSO you don't have a total order of stores if you
read your own writes out of your own CPU's write buffer.  However, my
interpretation of "multiple-copy atomic" is that the initial publishing
thread can choose to write to memory with an instruction that has a
sufficiently strong memory barrier attached (e.g. LOCK;XXX on x86), so
that the write buffer is flushed, and then use plain relaxed loads
everywhere else to read those memory locations. This explains the
situation on x86 and sparc, where volatile writes are expensive,
volatile reads are "free", and you get sequential consistency for Java
volatiles.
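
As a purely illustrative Java sketch of that pattern (made-up names, just
the shape being described - the cost sits on the volatile store, and the
volatile loads stay cheap on x86/sparc):

   class Publication {
       static int payload;                 // plain data, written before publication
       static volatile boolean published;  // the volatile flag carries the ordering

       static void publish() {
           payload = 42;
           published = true;  // volatile store: the expensive side - on x86 the
                              // JIT emits the trailing barrier (mfence) here
       }

       static int consume() {
           // volatile load: on x86/sparc just a plain load plus compiler
           // ordering, hence "close to free"
           return published ? payload : -1;
       }
   }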


We don't use lock'd instructions for volatile stores on x86, but the
trailing mfence achieves the "flushing".

However this still raised some questions for me. Using an mfence on x86,
or the equivalent on sparc, is no different from issuing a "DMB SYNC" on
ARM or a SYNC on PowerPC. Each of them ensures TSO for volatile stores,
with global visibility. So when such fences are used the resulting system
should be multiple-copy atomic - no? (No!**) And there seems to be an
equivalence between being multiple-copy atomic and providing the IRIW
property. Yet we know that on ARM/Power, as per the paper, TSO with
global visibility is not

ARM/Power don't have TSO.

Yes we all know that. Please re-read what I wrote.

sufficient to achieve IRIW. So what is it that x86 and sparc have in
addition to TSO that provides for IRIW?

We have both been learning ... to think in new ways. I found the second
section of Peter Sewell's tutorial, "2 From Sequential Consistency to
Relaxed Memory Models", to be most useful, especially the diagrams.

I pondered this for quite a while before realizing that the mfence on x86
(or equivalent on sparc) is not in fact playing the same role as the
DMB/SYNC on ARM/PPC. The key property that x86 and sparc have (and we can
ignore the store buffers) is that stores become globally visible - if any
other thread sees a store then all other threads see the same store. Whereas
on ARM/PPC you can imagine a store casually making its way through the
system, gradually becoming visible to more and more threads - unless there
is a DMB/SYNC to force a globally consistent memory view. Hence for IRIW,
placing the DMB/SYNC after the store does not suffice, because prior to
the DMB/SYNC the store may be visible to an arbitrary subset of threads.
Consequently IRIW requires the DMB/SYNC between the loads - to ensure
that each thread, on its second load, sees the value that the other
thread saw on its first load (ref Section 6.1 of the paper).

** So using DMB/SYNC does not achieve multiple-copy atomicity, because until
the DMB/SYNC happens different threads can have different views of memory.
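
For reference, the IRIW shape in Java terms (an illustrative sketch only;
the paper's litmus tests are given in PowerPC/ARM assembly):

   class IRIW {
       // the two independently written locations
       static volatile int x, y;

       // results observed by the readers
       static int r1, r2, r3, r4;

       // two writer threads, each publishing one independent write
       static void writer1() { x = 1; }
       static void writer2() { y = 1; }

       // two reader threads, reading the locations in opposite orders
       static void reader1() { r1 = x; r2 = y; }
       static void reader2() { r3 = y; r4 = x; }
   }

Sequential consistency for volatiles forbids the outcome r1 == 1, r2 == 0,
r3 == 1, r4 == 0, i.e. the two readers must not disagree about the order
in which the writes to x and y became visible; that is exactly the outcome
the DMB/SYNC between the two loads rules out on ARM/PPC.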

To me, the most desirable property of x86-style TSO is that barriers
are only necessary on stores to achieve sequential consistency - the
publisher gets to decide.  Volatile reads can then be close to free.

TSO doesn't need store barriers for sequential consistency.

It is somewhat amusing, I think, that the free-ness of volatile reads on
TSO comes from the fact that all writes cause global memory
synchronization. But because we can't turn that off, we can't actually
measure the cost we pay for those synchronizing writes. In contrast, on
non-TSO we have to explicitly issue synchronizing writes, and so
potentially require synchronizing reads - and then we complain because
the "hidden costs" are no longer hidden :)

All of which reinforces to me that IRIW is an undesirable property to have
to implement. YMMV. (And I also need to re-examine the PPC64 implementation
to see exactly where they add/remove barriers when IRIW is enabled.)

I believe you get a full sync between volatile reads.

#define GET_FIELD_VOLATILE(obj, offset, type_name, v) \
   oop p = JNIHandles::resolve(obj); \
   if (support_IRIW_for_not_multiple_copy_atomic_cpu) { \
     OrderAccess::fence(); \
   } \

Yes, it was more the "remove" part whose details I was unsure of - I
think they simply remove the trailing fence (i.e. the PPC sync) from the
volatile writes.
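
Putting that together with the excerpt above, the two mappings being
compared would look roughly like this - an annotated sketch only (Java
accesses with the barriers as comments), my reading of it rather than the
actual PPC64 port code:

   class Ppc64VolatileSketch {
       static volatile int v;

       static void volatileStore(int x) {
           // lwsync          -- the usual Power mapping: order prior
           //                    accesses before the volatile store
           v = x;
           // sync            -- only when IRIW support is OFF; this is the
           //                    trailing fence that gets removed when it is ON
       }

       static int volatileLoad() {
           // sync            -- only when IRIW support is ON, i.e. the
           //                    OrderAccess::fence() in the excerpt above
           int r = v;
           // lwsync / isync  -- keep subsequent accesses after the load
           return r;
       }
   }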

Thanks,
David


Cheers,
David

http://www.cl.cam.ac.uk/~pes20/ppc-supplemental/test7.pdf
