Re: C++0x Memory model and gcc

Andrew MacLeod Mon, 17 May 2010 09:36:41 -0700

Michael Matz wrote:

Hi,
On Mon, 17 May 2010, Andrew MacLeod wrote:
The guarantees you seem to want to establish by the proposed memory model. Possibly I misunderstood.
I'm not 100% sure on the guarantees you want to establish. The proposed model seems to merge multiple concepts together, all related to memory access ordering and atomicity, but with different scope and difficulty to guarantee.

I think the standard is excessively confusing, and overly academic. I even find the term memory model adds to the confusion. Some effort was clearly involved in defining behaviour for hardware which does not yet exist, but which the language is "prepared" for. I was particularly unhappy that they merged the whole synchronization thing into an atomic load or store, at least originally. I would hazard a guess that it evolved to this state based on an observation that synchronization is almost inevitably required when an atomic is being accessed. That's just a guess, however.


However, there is some fundamental goodness in it once you sort through it.

Let's see if I can paraphrase normal uses and map them to the standard :-)

The normal case would be when you have a system-wide lock, and when you acquire the lock, you expect everything which occurred before the lock to be completed.

i.e.
process1:   otherglob = 2;  global = 10;   set atomic_lock(1);
process2:   wait (atomic_lock() == 1);    print (global)

You expect 'global' in process 2 to always be 10. You are in effect using the lock as a ready flag for global.

In order for that to happen in a consistent manner, there is more involved than just waiting for the lock. If process 1 and 2 are running on different machines, process 1 will have to flush its cache all the way to memory, and process 2 will have to wait for that to complete and become visible before it can proceed with allowing the proper value of global to be loaded. Otherwise the results will not be as expected.

That's the synchronization model which maps to the default or 'sequentially consistent' C++ model. The cache flushing and whatever else is required is built into the library routines for performing atomic loads and stores. There is no mechanism to specify that this lock is for the value of 'global', so the standard extends the definition of the lock to say it applies to *all* shared memory written before the atomic lock value is set. So


process3:   wait (atomic_lock() == 1);    print (otherglob);

will also work properly. This memory model will always involve some form of synchronization instructions, and potentially waiting on other hardware to complete. I don't know much about this, but I'm told machines are starting to provide instructions to accomplish this type of synchronization. The obvious conclusion is that once the hardware starts to be able to do this synchronization with a few instructions, the entire library call to set or read an atomic and perform synchronization may be inlinable without having a call of any kind, just straight-line instructions. At this point, the optimizer will need to understand that those instructions are barriers.
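The ready-flag pattern above can be sketched in C++ roughly as follows. This is a minimal illustration, not code from the thread: the helper names (wait_and_read, run_ready_flag_demo) are hypothetical, threads stand in for the "processes", and the default ordering of std::atomic loads and stores is assumed to be the sequentially consistent one.

```cpp
#include <atomic>
#include <thread>
#include <utility>

// Names mirror the pseudocode; the helper functions are hypothetical.
int global = 0;                   // plain (non-atomic) shared data
int otherglob = 0;                // more plain shared data
std::atomic<int> atomic_lock{0};  // the "ready flag"

// process1: write the data, then set the flag.
void process1() {
    otherglob = 2;
    global = 10;
    atomic_lock.store(1);         // default ordering: memory_order_seq_cst
}

// process2 / process3: wait for the flag, then read one shared variable.
int wait_and_read(const int& shared) {
    while (atomic_lock.load() != 1) {   // default ordering: seq_cst
        std::this_thread::yield();
    }
    return shared;                // only read after the flag was observed
}

// Runs one writer and two readers; returns what the readers observed
// as (global, otherglob).
std::pair<int, int> run_ready_flag_demo() {
    atomic_lock.store(0);         // reset so the demo can be run repeatedly
    int seen_global = 0;
    int seen_other = 0;
    std::thread p2([&] { seen_global = wait_and_read(global); });
    std::thread p3([&] { seen_other = wait_and_read(otherglob); });
    std::thread p1(process1);
    p1.join();
    p2.join();
    p3.join();
    return {seen_global, seen_other};
}
```

Because the flag is written after both plain stores and read before both plain loads, the readers are guaranteed to see the values written by process1, for 'otherglob' as much as for 'global'.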

If you are using an atomic variable simply as a variable, and don't care about the synchronization aspects (i.e., you just want to always see a valid value for the variable), then that maps to the 'relaxed' mode. There may be some academic babble about certain provisions, but this is effectively what it boils down to. The relaxed mode is what you use when you don't care about all that memory flushing and just want to see the values of the atomic itself. So this is the fastest model, but don't depend on the values of other shared variables. This is also what you get when you use the basic atomic store and load macros in C.
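As a rough sketch of the relaxed case, assuming a shared counter where only the counter's own value matters: the names hits and count_hits are hypothetical, and the example deliberately publishes no other data through the atomic.

```cpp
#include <atomic>
#include <thread>
#include <vector>

// Hypothetical counter: only its own value matters, nothing else is
// published through it, so relaxed ordering is enough.
std::atomic<long> hits{0};

// Starts `threads` workers that each increment the counter `per_thread`
// times, then returns the final count.
long count_hits(int threads, int per_thread) {
    hits.store(0, std::memory_order_relaxed);
    std::vector<std::thread> pool;
    for (int t = 0; t < threads; ++t) {
        pool.emplace_back([per_thread] {
            for (int i = 0; i < per_thread; ++i) {
                // Atomic increment with no ordering imposed on other memory.
                hits.fetch_add(1, std::memory_order_relaxed);
            }
        });
    }
    for (auto& th : pool) {
        th.join();   // join() itself synchronizes with the finished thread
    }
    return hits.load(std::memory_order_relaxed);
}
```

No increment is ever lost or torn, but a thread that reads the counter with relaxed ordering may not rely on the state of any other shared variable.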

The sequential mode has the possibility of being VERY slow if you have a widely distributed system. That's where the third mode comes in, the release/acquire model. Proper utilization of it can remove many of the waits present in the sequential model since different processes don't have to wait for *all* cache flushes, just ones directly related to a specific atomic variable in a specific other process. The model is provided to allow code to run more efficiently, but requires a better understanding of the subtleties of multi-processor side effects in the code you write. I still don't really get it completely, but I'm not implementing the synchronization parts, so I only need to understand some of it :-) It is possible to optimize these operations, i.e. you can do CSE and dead store elimination, which can also help the code run faster. That comes later tho.
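A minimal sketch of the release/acquire pairing, assuming one producer and one consumer thread. The names payload, ready, producer, consumer and run_release_acquire_demo are hypothetical and exist only for this illustration.

```cpp
#include <atomic>
#include <thread>

int payload = 0;                  // plain shared data (hypothetical name)
std::atomic<bool> ready{false};   // flag pairing the two threads

// Writer: the release store makes earlier writes in this thread visible
// to any thread whose acquire load reads the stored value.
void producer() {
    payload = 42;
    ready.store(true, std::memory_order_release);
}

// Reader: the acquire load pairs with the release store above.
int consumer() {
    while (!ready.load(std::memory_order_acquire)) {
        std::this_thread::yield();
    }
    return payload;               // sees the producer's write
}

// Runs one producer and one consumer; returns what the consumer saw.
int run_release_acquire_demo() {
    ready.store(false);           // reset so the demo can be run repeatedly
    int seen = 0;
    std::thread c([&] { seen = consumer(); });
    std::thread p(producer);
    p.join();
    c.join();
    return seen;
}
```

The ordering guarantee here is tied to this one atomic and the two threads that touch it, rather than to a single global order over all atomic operations.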

The optimization flags I'm currently working on are orthogonal to all this, even though they use the term memory-model. When a program is written for multi-processing, the programmer usually attempts to write it such that there are no data races, otherwise there may be inconsistencies during execution. If a program has been developed and is data race free, the flags are meant to guarantee that the resulting code will also be data race free, regardless of whether optimizations are on or off.

Does that make anything clearer? It's true that a bunch of these things are all intertwined, and that's one of the reasons it comes across as being so complicated.
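Going back to the optimization flags: a classic illustration of an optimization that could introduce a data race is a speculative store. The sketch below is hand-written and is not taken from the thread or from GCC's output; the names shared_count, bump_if and bump_if_speculative are hypothetical. Both functions behave identically in a single thread, but the second writes shared_count even when enabled is false, which could race with another thread that legitimately owns the variable at that moment.

```cpp
// Hypothetical shared variable; assume other threads may write it whenever
// `enabled` is false for the current thread.
int shared_count = 0;

// As written by the programmer: no store at all when `enabled` is false.
void bump_if(bool enabled) {
    if (enabled) {
        shared_count += 1;
    }
}

// Hand-written stand-in for what a store-introducing transformation could
// effectively produce: an unconditional store, undone when not enabled.
// Single-threaded results match bump_if, but the extra writes are a data
// race if another thread accesses shared_count concurrently.
void bump_if_speculative(bool enabled) {
    int saved = shared_count;
    shared_count = saved + 1;     // store happens on every path
    if (!enabled) {
        shared_count = saved;     // second store restores the old value
    }
}
```

A data race free source program only stays data race free if the compiler refrains from rewrites of the second kind, which is the sort of guarantee the flags are described as providing.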

It's up to the library guys to make whatever process synchronization is required happen; I leave that to them. They say they have a handle on it, we'll see. When they do, then we might get to inline it and do some interesting things.


Andrew
