Hi Martin,

On 9/12/2014 5:17 AM, Martin Buchholz wrote:
On Mon, Dec 8, 2014 at 12:46 AM, David Holmes <davidchol...@aapt.net.au> wrote:
Martin,

The paper you cite is about ARM and Power architectures - why do you think the 
lack of mention of x86/sparc implies those architectures are 
multiple-copy-atomic?

Reading some more in the same paper, I see:

Yes - mea culpa - I should have re-read the paper (I wouldn't have had to read very far).

"""Returning to the two properties above, in TSO a thread can see its
own writes before they become visible to other
threads (by reading them from its write buffer), but any write becomes
visible to all other threads simultaneously: TSO
is a multiple-copy atomic model, in the terminology of Collier
[Col92]. One can also see the possibility of reading
from the local write buffer as allowing a specific kind of local
reordering. A program that writes one location x then
reads another location y might execute by adding the write to x to the
thread’s buffer, then reading y from memory,
before finally making the write to x visible to other threads by
flushing it from the buffer. In this case the thread reads
the value of y that was in the memory before the new write of x hits memory."""

So I learnt two things from this:

1. The ARM architecture manual definition of "multi-copy atomicity" is not the same as the paper's definition of "multiple-copy atomicity". The distinction is that the paper allows a thread to read from its own store buffer, provided the store becomes visible to all other threads at the same time. That is quite an important difference when it comes to classifying systems.

2. I had thought that the store buffer might be shared - if not across cores, then at least across the different hardware threads on the same core. But it seems that is not the case either (based on the paper and my own reading of the SPARC architecture info - stores attain global visibility at the L2 cache).

So given that, yes I agree that sparc and x86 are multiple-copy atomic as defined by the paper.
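
To make the paper's example concrete, here is a minimal Java sketch of that store-buffering shape (the class and field names are just for illustration, and a proper litmus-test harness such as jcstress is the way to actually observe the relaxed outcome - this naive loop will rarely, if ever, hit it):

// Store-buffering sketch: each thread writes one plain field, then reads
// the other. On TSO hardware the write can sit in the local store buffer
// while the read goes straight to memory, so the outcome r1 == 0 && r2 == 0
// is architecturally allowed even though no interleaving of the four
// statements produces it.
public class StoreBufferingSketch {
    static int x, y;      // plain (non-volatile) fields
    static int r1, r2;

    public static void main(String[] args) throws Exception {
        int observed = 0;
        for (int i = 0; i < 100_000; i++) {
            x = 0; y = 0;
            Thread t1 = new Thread(() -> { x = 1; r1 = y; });
            Thread t2 = new Thread(() -> { y = 1; r2 = x; });
            t1.start(); t2.start();
            t1.join();  t2.join();
            if (r1 == 0 && r2 == 0)
                observed++;   // both reads missed the other thread's write
        }
        System.out.println("r1==0 && r2==0 observed " + observed + " times");
    }
}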

So (as you say) with TSO you don't have a total order of stores if you
read your own writes out of your own CPU's write buffer. However, my
interpretation of "multiple-copy atomic" is that the initial
publishing thread can choose to use an instruction with a sufficiently
strong memory barrier attached (e.g. LOCK;XXX on x86) to write to
memory, so that the write buffer is flushed, and then use plain relaxed
loads everywhere else to read those memory locations. This explains
the situation on x86 and sparc, where volatile writes are expensive,
volatile reads are "free", and you get sequential consistency for Java
volatiles.

We don't use lock'd instructions for volatile stores on x86, but the trailing mfence achieves the "flushing".
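
For illustration, a minimal Java sketch of the asymmetry being discussed (the names are mine; the comments describe the usual HotSpot-on-TSO code generation, not anything the spec promises):

// On x86/sparc the volatile store side is the expensive one: the store is
// followed by a full fence. The volatile load side needs no fence at all -
// it compiles to a plain load.
class Publisher {
    int data;                    // plain field, published via the flag
    volatile boolean ready;      // volatile flag carries the ordering

    void publish() {
        data = 42;
        ready = true;            // volatile store: store + trailing full fence
    }

    Integer reader() {
        if (ready) {             // volatile load: just a plain load on TSO
            return data;         // must see 42 if ready was seen as true
        }
        return null;
    }
}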

However this still raised some questions for me. Using an mfence on x86, or its equivalent on sparc, is no different from issuing a "DMB SYNC" on ARM, or a SYNC on PowerPC: they each ensure TSO for volatile stores, with global visibility. So when such fences are used the resulting system should be multiple-copy atomic - no? (No!**) And there seems to be an equivalence between being multiple-copy atomic and providing the IRIW property. Yet we know that on ARM/Power, as per the paper, TSO with global visibility is not sufficient to achieve IRIW. So what is it that x86 and sparc have, in addition to TSO, that provides for IRIW?

I pondered this for quite a while before realizing that the mfence on x86 (or its equivalent on sparc) is not in fact playing the same role as the DMB/SYNC on ARM/PPC. The key property that x86 and sparc have (and we can ignore the store buffers) is that stores become globally visible at a single point - if any other thread sees a store then all other threads see that store. Whereas on ARM/PPC you can imagine a store casually making its way through the system, gradually becoming visible to more and more threads - unless there is a DMB/SYNC to force a globally consistent memory view. Hence for IRIW, placing the DMB/SYNC after the store does not suffice, because prior to the DMB/SYNC the store may be visible to an arbitrary subset of threads. Consequently IRIW requires the DMB/SYNC between the loads - to ensure that each reading thread, on its second load, sees the value that the other reading thread saw on its first load (ref Section 6.1 of the paper).

** So using DMB/SYNC does not achieve multiple-copy atomicity, because until the DMB/SYNC happens different threads can have different views of memory.
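
For reference, the IRIW shape as a minimal Java sketch (names are illustrative; a real test would use a litmus-test harness). With x and y volatile, the JMM forbids the two readers disagreeing about which write happened first, and on ARM/PPC that is exactly what forces the heavyweight barrier between the two loads in each reader:

// IRIW: two independent writers, two readers reading the locations in
// opposite orders. The outcome r1==1, r2==0, r3==1, r4==0 would mean the
// readers saw the two writes in different orders; for volatiles it is forbidden.
class IRIW {
    volatile int x, y;

    void writer1() { x = 1; }
    void writer2() { y = 1; }

    void reader1() {
        int r1 = x;   // suppose this sees the write of x...
        int r2 = y;   // ...then this load must not miss a write of y that
                      // the other reader has already observed
    }

    void reader2() {
        int r3 = y;
        int r4 = x;
    }
}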

All of which reinforces to me that IRIW is an undesirable property to have to implement. YMMV. (And I also need to re-examine the PPC64 implementation to see exactly where they add/remove barriers when IRIW is enabled.)

Cheers,
David

http://www.cl.cam.ac.uk/~pes20/ppc-supplemental/test7.pdf
