Hi Martin,

On 9/12/2014 5:17 AM, Martin Buchholz wrote:
On Mon, Dec 8, 2014 at 12:46 AM, David Holmes <davidchol...@aapt.net.au> wrote:
Martin,

The paper you cite is about ARM and Power architectures - why do you think the 
lack of mention of x86/sparc implies those architectures are 
multiple-copy-atomic?

Reading some more in the same paper, I see:

Yes - mea culpa - I should have re-read the paper (I wouldn't have had to read very far).

"""Returning to the two properties above, in TSO a thread can see its
own writes before they become visible to other
threads (by reading them from its write buffer), but any write becomes
visible to all other threads simultaneously: TSO
is a multiple-copy atomic model, in the terminology of Collier
[Col92]. One can also see the possibility of reading
from the local write buffer as allowing a specific kind of local
reordering. A program that writes one location x then
reads another location y might execute by adding the write to x to the
thread’s buffer, then reading y from memory,
before finally making the write to x visible to other threads by
flushing it from the buffer. In this case the thread reads
the value of y that was in the memory before the new write of x hits memory."""

So I learnt two things from this:

1. The ARM architecture manual definition of "multi-copy atomicity" is not the same as the paper's definition of "multiple-copy atomicity". The distinction is that the paper allows a thread to read from its own store buffer, provided the store becomes visible to all other threads at the same time. That is quite an important difference when it comes to classifying systems.

2. I had thought that the store buffer might be shared - if not across cores, then at least across the different hardware threads on the same core. But it seems that is not the case either (based on the paper and my own reading of the SPARC architecture info - stores attain global visibility at the L2 cache).

So given that, yes I agree that sparc and x86 are multiple-copy atomic as defined by the paper.
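
To make the paper's example concrete, here is a minimal Java sketch of that store-buffering shape (the class and field names are just for illustration, and a proper litmus-test harness such as jcstress is the way to actually observe the relaxed outcome - this naive loop will rarely, if ever, hit it):

// Store-buffering sketch: each thread writes one plain field, then reads
// the other. On TSO hardware the write can sit in the local store buffer
// while the read goes straight to memory, so the outcome r1 == 0 && r2 == 0
// is architecturally allowed even though no interleaving of the four
// statements produces it.
public class StoreBufferingSketch {
    static int x, y;      // plain (non-volatile) fields
    static int r1, r2;

    public static void main(String[] args) throws Exception {
        int observed = 0;
        for (int i = 0; i < 100_000; i++) {
            x = 0; y = 0;
            Thread t1 = new Thread(() -> { x = 1; r1 = y; });
            Thread t2 = new Thread(() -> { y = 1; r2 = x; });
            t1.start(); t2.start();
            t1.join();  t2.join();
            if (r1 == 0 && r2 == 0)
                observed++;   // both reads missed the other thread's write
        }
        System.out.println("r1==0 && r2==0 observed " + observed + " times");
    }
}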

So (as you say) with TSO you don't have a total order of stores if you
read your own writes out of your own CPU's write buffer. However, my
interpretation of "multiple-copy atomic" is that the initial
publishing thread can choose to use an instruction with a sufficiently
strong memory barrier attached (e.g. LOCK;XXX on x86) to write to
memory, so that the write buffer is flushed, and then use plain relaxed
loads everywhere else to read those memory locations. This explains
the situation on x86 and sparc, where volatile writes are expensive,
volatile reads are "free", and you get sequential consistency for Java
volatiles.

We don't use lock'd instructions for volatile stores on x86, but the trailing mfence achieves the "flushing".
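
For illustration, a minimal Java sketch of the asymmetry being discussed (the names are mine; the comments describe the usual HotSpot-on-TSO code generation, not anything the spec promises):

// On x86/sparc the volatile store side is the expensive one: the store is
// followed by a full fence. The volatile load side needs no fence at all -
// it compiles to a plain load.
class Publisher {
    int data;                    // plain field, published via the flag
    volatile boolean ready;      // volatile flag carries the ordering

    void publish() {
        data = 42;
        ready = true;            // volatile store: store + trailing full fence
    }

    Integer reader() {
        if (ready) {             // volatile load: just a plain load on TSO
            return data;         // must see 42 if ready was seen as true
        }
        return null;
    }
}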

However this still raised some questions for me. Using an mfence on x86, or its equivalent on sparc, is no different from issuing a "DMB SYNC" on ARM, or a SYNC on PowerPC: they each ensure TSO for volatile stores, with global visibility. So when such fences are used the resulting system should be multiple-copy atomic - no? (No!**) And there seems to be an equivalence between being multiple-copy atomic and providing the IRIW property. Yet we know that on ARM/Power, as per the paper, TSO with global visibility is not sufficient to achieve IRIW. So what is it that x86 and sparc have, in addition to TSO, that provides for IRIW?

I pondered this for quite a while before realizing that the mfence on x86 (or its equivalent on sparc) is not in fact playing the same role as the DMB/SYNC on ARM/PPC. The key property that x86 and sparc have (and we can ignore the store buffers) is that stores become globally visible at a single point - if any other thread sees a store then all other threads see that store. Whereas on ARM/PPC you can imagine a store casually making its way through the system, gradually becoming visible to more and more threads - unless there is a DMB/SYNC to force a globally consistent memory view. Hence for IRIW, placing the DMB/SYNC after the store does not suffice, because prior to the DMB/SYNC the store may be visible to an arbitrary subset of threads. Consequently IRIW requires the DMB/SYNC between the loads - to ensure that each reading thread, on its second load, sees the value that the other reading thread saw on its first load (ref Section 6.1 of the paper).

** So using DMB/SYNC does not achieve multiple-copy atomicity, because until the DMB/SYNC happens different threads can have different views of memory.
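
For reference, the IRIW shape as a minimal Java sketch (names are illustrative; a real test would use a litmus-test harness). With x and y volatile, the JMM forbids the two readers disagreeing about which write happened first, and on ARM/PPC that is exactly what forces the heavyweight barrier between the two loads in each reader:

// IRIW: two independent writers, two readers reading the locations in
// opposite orders. The outcome r1==1, r2==0, r3==1, r4==0 would mean the
// readers saw the two writes in different orders; for volatiles it is forbidden.
class IRIW {
    volatile int x, y;

    void writer1() { x = 1; }
    void writer2() { y = 1; }

    void reader1() {
        int r1 = x;   // suppose this sees the write of x...
        int r2 = y;   // ...then this load must not miss a write of y that
                      // the other reader has already observed
    }

    void reader2() {
        int r3 = y;
        int r4 = x;
    }
}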

All of which reinforces to me that IRIW is an undesirable property to have to implement. YMMV. (And I also need to re-examine the PPC64 implementation to see exactly where they add/remove barriers when IRIW is enabled.)

Cheers,
David

http://www.cl.cam.ac.uk/~pes20/ppc-supplemental/test7.pdf
