Michael Matz wrote:
Hi,

On Mon, 17 May 2010, Andrew MacLeod wrote:
The guarantees you seem to want to establish by the proposed memory model. Possibly I misunderstood.

I'm not 100% sure on the guarantees you want to establish. The proposed model seems to merge multiple concepts together, all related to memory access ordering and atomicity, but with different scope and difficulty to guarantee.

I think the standard is excessively confusing, and overly academic. I even find that the term "memory model" adds to the confusion. Some effort was clearly involved in defining behaviour for hardware which does not yet exist, but which the language is "prepared" for. I was particularly unhappy that they merged the whole synchronization thing into an atomic load or store, at least originally. I would hazard a guess that it evolved to this state based on an observation that synchronization is almost inevitably required when an atomic is being accessed. That's just a guess, however.

However, there is some fundamental goodness in it once you sort through it.

Let's see if I can paraphrase normal uses and map them to the standard :-)

The normal case would be when you have a system-wide lock, and when you acquire the lock, you expect everything which occurred before the lock was set to be completed.
i.e.
process1:  otherglob = 2;  global = 10;  set atomic_lock(1);
process2:  wait (atomic_lock() == 1);  print (global);

You expect 'global' in process 2 to always be 10. You are in effect using the lock as a ready flag for global.

In order for that to happen in a consistent manner, there is more involved than just waiting for the lock. If process 1 and 2 are running on different machines, process 1 will have to flush its cache all the way to memory, and process 2 will have to wait for that to complete and become visible before it can proceed with loading the proper value of global. Otherwise the results will not be as expected.

That's the synchronization model which maps to the default or 'sequentially consistent' C++ model. The cache flushing and whatever else is required is built into the library routines for performing atomic loads and stores. There is no mechanism to specify that this lock is for the value of 'global', so the standard extends the definition of the lock to say it applies to *all* shared memory written before the atomic lock value is set. So

process3:  wait (atomic_lock() == 1);  print (otherglob);

will also work properly. This memory model will always involve some form of synchronization instructions, and potentially waiting on other hardware to complete. I don't know much about this, but I'm told machines are starting to provide instructions to accomplish this type of synchronization. The obvious conclusion is that once the hardware can do this synchronization with a few instructions, the entire library call to set or read an atomic and perform synchronization may be inlinable without a call of any kind, just straight-line instructions. At that point, the optimizer will need to understand that those instructions are barriers.
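
For what it's worth, the ready-flag pseudocode above might look roughly like this with the C++0x `std::atomic` interface. This is only a sketch: the `Shared` struct and `run_seq_cst_demo` are made-up names, and the "processes" are modelled as threads in one program rather than separate machines.

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <utility>

// Hypothetical stand-ins for the variables in the pseudocode.
struct Shared {
    int otherglob = 0;                // ordinary shared data
    int global = 0;                   // ordinary shared data
    std::atomic<int> atomic_lock{0};  // used as a "ready" flag
};

// Runs process1 and a reader as threads; returns what the reader saw
// as (global, otherglob).
std::pair<int, int> run_seq_cst_demo() {
    Shared s;
    int seen_global = -1;
    int seen_other = -1;

    std::thread process1([&s] {
        s.otherglob = 2;
        s.global = 10;
        s.atomic_lock.store(1);  // no order given: memory_order_seq_cst
    });

    std::thread reader([&s, &seen_global, &seen_other] {
        // Spin until the flag is set; load() is also seq_cst by default.
        while (s.atomic_lock.load() != 1) {
            std::this_thread::yield();
        }
        // Both plain writes made before the store are visible here.
        seen_global = s.global;
        seen_other = s.otherglob;
    });

    process1.join();
    reader.join();
    assert(seen_global != -1 && seen_other != -1);
    return std::make_pair(seen_global, seen_other);
}
```

Calling `run_seq_cst_demo()` should always yield the pair (10, 2): the flag covers all earlier writes, not just the one to `global`.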

If you are using an atomic variable simply as a variable, and don't care about the synchronization aspects (i.e., you just want to always see a valid value for the variable), then that maps to the 'relaxed' mode. There may be some academic babble about certain provisions, but this is effectively what it boils down to. The relaxed mode is what you use when you don't care about all that memory flushing and just want to see the values of the atomic itself. So this is the fastest model, but you can't depend on the values of other shared variables. This is also what you get when you use the basic atomic store and load macros in C.
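
A small sketch of the relaxed case, assuming the C++0x `std::atomic` interface: a counter where only the counter's own value matters. The function name `count_relaxed` and its parameters are invented for the example.

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <vector>

// Each thread bumps the shared counter per_thread times. Relaxed ordering
// keeps every increment atomic but makes no promise about other memory.
int count_relaxed(int threads, int per_thread) {
    assert(threads > 0 && per_thread >= 0);
    std::atomic<int> counter{0};
    std::vector<std::thread> pool;

    for (int t = 0; t < threads; ++t) {
        pool.emplace_back([&counter, per_thread] {
            for (int i = 0; i < per_thread; ++i) {
                counter.fetch_add(1, std::memory_order_relaxed);
            }
        });
    }
    for (std::thread& th : pool) {
        th.join();  // joining makes the final value safe to read
    }
    return counter.load(std::memory_order_relaxed);
}
```

No increment is lost, so `count_relaxed(4, 1000)` returns 4000. What you must not do is use the counter as a ready flag for some other, non-atomic variable.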

The sequential mode has the possibility of being VERY slow if you have a widely distributed system. That's where the third mode comes in, the release/acquire model. Proper utilization of it can remove many of the waits present in the sequential model, since different processes don't have to wait for *all* cache flushes, just the ones directly related to a specific atomic variable in a specific other process. The model is provided to allow code to run more efficiently, but requires a better understanding of the subtleties of multi-processor side effects in the code you write. I still don't really get it completely, but I'm not implementing the synchronization parts, so I only need to understand some of it :-) It is possible to optimize these operations, i.e. you can do CSE and dead store elimination, which can also help the code run faster. That comes later though.
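
As a rough illustration of release/acquire, again assuming the C++0x `std::atomic` interface and with invented names (`run_release_acquire_demo`, `payload`, `ready`): the ordering is only between the thread that does the release store and the thread whose acquire load observes it.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Returns the payload value observed by the consumer thread.
int run_release_acquire_demo() {
    int payload = 0;                 // ordinary shared data
    std::atomic<bool> ready{false};  // the one atomic the two threads pair on
    int seen = -1;

    std::thread producer([&payload, &ready] {
        payload = 10;
        // Release: writes made before this store are published with it.
        ready.store(true, std::memory_order_release);
    });

    std::thread consumer([&payload, &ready, &seen] {
        // Acquire: once true is observed, the producer's earlier writes
        // are visible to this thread.
        while (!ready.load(std::memory_order_acquire)) {
            std::this_thread::yield();
        }
        seen = payload;
    });

    producer.join();
    consumer.join();
    assert(seen != -1);
    return seen;
}
```

`run_release_acquire_demo()` returns 10. Unlike the sequentially consistent version, nothing here asks for a single global order across unrelated atomics and threads.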

The optimization flags I'm currently working on are orthogonal to all this, even though they use the term memory-model. When a program is written for multi-processing, the programmer usually attempts to write it such that there are no data races; otherwise there may be inconsistencies during execution. If a program has been developed and is data race free, the flags are meant to guarantee that the resulting code will also be data race free, regardless of whether optimizations are on or off. Does that make anything clearer? It's true that a bunch of these things are all intertwined, and that's one of the reasons it comes across as being so complicated.
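
To make the concern concrete, here is one well-known kind of transformation that can introduce a race, written out by hand. It is an assumed illustration, not a description of what any particular pass or flag does; the names `hits`, `count_negatives` and `count_negatives_promoted` are made up.

```cpp
// Imagine this is shared with other threads under some locking protocol.
int hits = 0;

// As written by the programmer: hits is stored to only when a negative
// element is actually found.
void count_negatives(const int* data, int n) {
    for (int i = 0; i < n; ++i) {
        if (data[i] < 0) {
            ++hits;
        }
    }
}

// Hand-written picture of register promotion: the loop works on a local
// copy and writes it back at the end.
void count_negatives_promoted(const int* data, int n) {
    int tmp = hits;  // load that the original always performed at most lazily
    for (int i = 0; i < n; ++i) {
        if (data[i] < 0) {
            ++tmp;
        }
    }
    hits = tmp;  // unconditional store, even if nothing was negative
}
```

Single-threaded, both versions leave the same value in `hits`. The difference is that the second one writes `hits` on every call, so a program that was race free because it never called this with negative data while another thread owned `hits` now has a race that the source did not contain.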

It's up to the library guys to make whatever process synchronization is required happen; I leave that to them. They say they have a handle on it; we'll see. When they do, we might get to inline it and do some interesting things.

Andrew
