On 12 November 2015 at 11:03, Savolainen, Petri (Nokia - FI/Espoo)
<petri.savolai...@nokia.com> wrote:
>> > if (msg == UPDATE_FOO) {
>> >         // Make sure that load of "foo" is not moved before schedule()
>> >         odp_sync_loads();
>>
>> odp_sync_acquire() would be a better name.
>>
>> >         foo++;
>> >         // Make sure that store of "foo" is not moved after queue_enq()
>> >         odp_sync_stores();
>>
>> odp_sync_release() would be a better name.
>>
>> I am somewhat OK with odp_sync_acquire/odp_sync_release (they should also
>> take a vector of memory segments, per the suggestion below) as long as it
>> is required that producer and consumer use the same ODP calls.
>
> Naming could be acq/rel instead of loads/stores, if those are considered
> well-established terms. The problem with the acq/rel names is that this
> call does not perform the acq or rel operation. Also the synchronization
> functionality is a bit different:
>
> "While an atomic store-release operation prevents all preceding writes
> from moving past the *store-release*, an atomic_thread_fence with
> memory_order_release ordering prevents *all preceding writes from moving
> past all subsequent stores*".
>
> A thread fence synchronizes between *all stores* before and *all stores*
> after the fence - thus the name odp_sync_stores() fits better.
>
> A store-release synchronizes stores before the operation with the
> operation itself, but not with *all other* stores following the operation.

Yes, release ordering is associated with a specific store-release
instruction, which is not explicit when using atomic_thread_fence. The
reason why I still thought odp_sync_release is a better name is that it
also ought to prevent loads from before the sync operation from moving down
past the fence. If loads are not ordered, you could be reading data written
by another CPU after the store fence. I don't know how likely this is in
practice; it depends on actual CPU implementations.
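To make the distinction above concrete, here is a minimal C11 sketch (the function names are mine for illustration, not part of any ODP API) contrasting a release *operation*, where the ordering is attached to one specific store, with a release *fence*, which orders all preceding memory accesses against all subsequent stores:

```c
#include <stdatomic.h>

atomic_int flag = ATOMIC_VAR_INIT(0);
int foo = 0; /* plain data, published via the flag */

/* Release *operation*: the ordering is attached to this single store. */
void publish_with_store_release(void)
{
	foo = 0xbeef;
	atomic_store_explicit(&flag, 1, memory_order_release);
}

/* Release *fence*: orders all preceding writes against *all*
 * subsequent stores, not just the one store of "flag" below. */
void publish_with_release_fence(void)
{
	foo = 0xbeef;
	atomic_thread_fence(memory_order_release);
	atomic_store_explicit(&flag, 1, memory_order_relaxed);
}
```

Both variants make a consumer that acquire-reads `flag == 1` see `foo == 0xbeef`; the fence variant additionally constrains every store that follows it.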
The GCC inline asm uses a memory clobber that will prevent the compiler
from moving loads, but it does not prevent the HW from completing a load
after the store fence, which could happen if you load something (perhaps
for later use) before the fence and miss in the cache. The definition ought
to be clear that loads before the fence are also ordered (have completed)
before all stores after the fence, and the implementation(s) need to
implement this behaviour. ARM DMB without the ST option orders both loads
and stores. And as I wrote, if synchronisation between producer and
consumer is not done using (coherent) shared memory but through some other
mechanism (e.g. a device write which generates an interrupt), DSB is needed
instead of DMB.

>> >         msg = FOO_UPDATED;
>> >         queue_enq(msg);
>> > }
>> >
>> > ev  = schedule();
>> > msg = event_to_message(ev);
>> >
>> > if (msg == FOO_UPDATED) {
>> >         // Make sure that load of "foo" is not moved before schedule()
>> >         odp_sync_loads();
>> >         printf("foo is now %d\n", foo);
>> > }
>> >
>> > As you can see, it would be a lot of waste if the application must
>> > always sync for stores and loads explicitly.
>> > I see your point.
>> > I re-read the documentation carefully and maybe we can work around it.
>> > The sync_stores explicitly states that all store operations are
>> > globally visible. So the cache needs to be flushed.
>>
>> If the Kalray HW isn't cache coherent between CPU's, neither old-style
>> (volatile + barriers) nor C11-style shared memory programming will work.
>> Probably some platform-specific code is necessary in order to share data
>> between CPU's and to properly synchronise the access. Perhaps the acquire
>> and release operations should take a vector of addresses and sizes,
>> specifying which data is being shared, so as to minimise cache flushing
>> (in the producer) and invalidation (in the consumer).
>
> Vector / non-coherent versions would be another set of functions.
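A vector / non-coherent variant could look roughly like this. This is a sketch only: `odp_sync_region_t`, `odp_sync_release_v()` and `plat_cache_flush()` are hypothetical names, not from any ODP API, and the flush call is a no-op stand-in for platform-specific cache maintenance:

```c
#include <stddef.h>
#include <stdatomic.h>

/* Hypothetical descriptor for one shared region. */
typedef struct {
	void  *addr;
	size_t len;
} odp_sync_region_t;

/* Platform hook: write the given range back to the point of
 * coherency. A no-op stub here; a real non-coherent target would
 * flush its data cache for [addr, addr + len). */
static void plat_cache_flush(void *addr, size_t len)
{
	(void)addr;
	(void)len;
}

/* Hypothetical release sync over a vector of regions: flush only the
 * named regions, then fence, instead of flushing the whole cache. */
void odp_sync_release_v(const odp_sync_region_t *vec, int num)
{
	int i;

	for (i = 0; i < num; i++)
		plat_cache_flush(vec[i].addr, vec[i].len);
	atomic_thread_fence(memory_order_release);
}
```

The point of the vector is exactly the one raised above: the producer flushes, and the consumer invalidates, only the ranges actually shared.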
These two should be just regular memory fences for coherent memory systems.

>> > The load part only states that previous memory operations are complete
>> > and cannot be reordered around it.
>> > It does not explicitly state that the cache needs to be up-to-date.
>> > We internally implement the "read memory barrier" as DINVAL + FENCE,
>> > but it is not necessarily the case.
>> >
>> > If my understanding of the function fits with yours, I'm good with
>> > these patches. If not, we need to discuss things some more.
>> >
>> > Nicolas
>> >
>> > The above highlights how it should *not* be used (== not needed with
>> > ready-made ODP sync mechanisms). It's intended to be used with non-ODP
>> > sync mechanisms like the one below. Store and load syncs are usually a
>> > pair (sync_stores -> sync_loads).
>>
>> If either or both of the producer and consumer cannot use some common ODP
>> mechanism, the mechanism they actually use must be well defined (e.g.
>> must use DMB or DSB on ARM) so that both sides pair properly. So for one
>> side to use e.g. odp_sync_stores() (without knowing the actual
>> implementation) and the other side to use whatever mechanism (not
>> odp_sync_loads) is a potential cause of problems. It is difficult for me
>> to see a generic implementation (in ODP) that can always pair with
>> whatever undefined mechanism is used by the peer. I think the code in an
>> ODP application that synchronises with a non-ODP producer or consumer
>> must use the specific mechanism agreed on, and ODP cannot provide any
>> generic function that will work regardless.
>
> It should be enough that we specify this to function the same way as the
> C11 atomic_thread_fence. That spec is what the other side needs in order
> to understand what to do. This is what my patch defines in practice.

The description at
http://en.cppreference.com/w/cpp/atomic/atomic_thread_fence is not
authoritative, it is just a simplified description.
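As a side note, the earlier point about the GCC memory clobber can be illustrated. The clobber constrains only the compiler; it emits no instruction, so a real fence is still needed for the hardware. A sketch (GCC/Clang-specific; the macro names are mine):

```c
#include <stdatomic.h>

/* Compiler-only barrier: forbids the compiler from reordering or
 * caching memory accesses across this point, but emits no CPU
 * instruction, so the hardware may still reorder loads and stores. */
#define COMPILER_BARRIER() __asm__ __volatile__("" ::: "memory")

/* Real hardware fence: on ARM this compiles to a DMB (a DSB would be
 * needed when synchronising with non-coherent observers, e.g. devices). */
#define HW_FENCE() atomic_thread_fence(memory_order_seq_cst)

int shared_data;
atomic_int ready;

void producer(int value)
{
	shared_data = value;
	COMPILER_BARRIER(); /* not sufficient on weakly ordered CPUs */
	HW_FENCE();         /* orders the data store before the flag */
	atomic_store_explicit(&ready, 1, memory_order_relaxed);
}
```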
The C++11 standard (chapter 29.8 Fences; chapter 7.17.4 in the C11
standard, the text is the same) is almost impossible to understand:

*"A release fence A synchronizes with an acquire fence B if there exist
atomic operations X and Y, both operating on some atomic object M, such
that A is sequenced before X, X modifies M, Y is sequenced before B, and Y
reads the value written by X or a value written by any side effect in the
hypothetical release sequence X would head if it were a release
operation."*

But in our potential use cases, we don't know if there is some atomic
object M that is operated on using atomic operations (X and Y). So
depending on the atomic_thread_fence() mechanism from C11/C++11 is a
little suspicious. This guy I trust:
http://preshing.com/20130922/acquire-and-release-fences/

*A release fence prevents the memory reordering of any read or write which
precedes it in program order with any write which follows it in program
order.*

When programming in C++11, you can invoke them using the following
functions:

#include <atomic>
std::atomic_thread_fence(std::memory_order_acquire);
std::atomic_thread_fence(std::memory_order_release);

Our functions might better be called odp_release_fence() and
odp_acquire_fence(). Acquire and release fences are different from acquire
and release operations (those are associated with a specific load or store
operation), as you also have noted. The implementation(s) should probably
use some suitable architecture-specific mechanism rather than depend on
the mechanisms defined by C11/C++11, although I don't think the GCC
__atomic_thread_fence builtin differs in semantics here (being modelled
after C++11).

> Did you see the actual text in the patch? It's clipped out in this
> thread.
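The fence pairing Preshing describes can be exercised end to end. Here is a sketch using POSIX threads and C11 relaxed atomics, with a release fence in the producer pairing with an acquire fence in the consumer (the `run_fence_demo()` harness is mine, for illustration only):

```c
#include <pthread.h>
#include <stdatomic.h>

static int payload;      /* plain shared data */
static atomic_int guard; /* guards the payload */

static void *producer(void *arg)
{
	(void)arg;
	payload = 0xbeef;
	/* Release fence: the write to "payload" cannot be reordered
	 * past the relaxed store of "guard" below. */
	atomic_thread_fence(memory_order_release);
	atomic_store_explicit(&guard, 1, memory_order_relaxed);
	return NULL;
}

static void *consumer(void *arg)
{
	while (atomic_load_explicit(&guard, memory_order_relaxed) == 0)
		; /* spin until the guard is seen set */
	/* Acquire fence: the load of "payload" cannot be satisfied
	 * before the load of "guard" that observed the value 1. */
	atomic_thread_fence(memory_order_acquire);
	*(int *)arg = payload;
	return NULL;
}

int run_fence_demo(void)
{
	pthread_t p, c;
	int seen = 0;

	payload = 0;
	atomic_store(&guard, 0);
	pthread_create(&c, NULL, consumer, &seen);
	pthread_create(&p, NULL, producer, NULL);
	pthread_join(p, NULL);
	pthread_join(c, NULL);
	return seen;
}
```

The fence pair guarantees that once the consumer sees `guard == 1`, it also sees `payload == 0xbeef`.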
>> > -Petri
>> >
>> > // data and flag in shared memory
>> > volatile int flag = 0;
>>
>> int flag = 0; is enough
>>
>> > int foo = 0;
>> >
>> > Thread A                        Thread B
>> >
>> > foo = 0xbeef;
>> >
>> > // Make sure that store
>> > // of "foo" is visible before
>> > // store of "flag"
>> > odp_sync_stores();
>> > flag = 1;
>>
>> Must be implemented as:
>> __atomic_store_n(&flag, 1, __ATOMIC_RELEASE);
>> All loads and stores before the write to the flag that releases the
>> shared data must be observable before the store to the flag is
>> observable.
>
> This example illustrates any non-ODP synchronization method (which may be
> more complex than a single store). Sure, under ODP the simple release
> store should use atomic_store_rel(), but this use case is for non-ODP
> provided synchronization (a non-ODP algorithm, 3rd party lib, etc). This
> sync may be heavier than store_release, but that's what you need to do if
> you cannot use store_release ...
>
> -Petri

>> >                                 while (flag == 0)
>>
>> while (__atomic_load_n(&flag, __ATOMIC_ACQUIRE) == 0)
>> No load or store after the load of the flag must be observable by other
>> threads before the flag has been "acquired". Speculative stores are not
>> so common, but loads could be reordered, resulting in reading the
>> (stale) shared data before the flag has been set.
>>
>> >                                         spin_and_wait();
>> >
>> >                                 // Make sure that load of
>> >                                 // "foo" does not happen before
>> >                                 // we see the flag changing
>> >                                 odp_sync_loads();
>> >
>> >                                 printf("foo is now %d\n", foo);
>> >
>> > _______________________________________________
>> > lng-odp mailing list
>> > lng-odp@lists.linaro.org
>> > https://lists.linaro.org/mailman/listinfo/lng-odp
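Putting the two suggested fixes together, the Thread A / Thread B example above would look like this with the GCC __atomic builtins (the pthread harness and `run_flag_demo()` name are mine, for illustration):

```c
#include <pthread.h>

static int foo;  /* shared data; no volatile needed */
static int flag; /* plain int; ordering comes from the builtins */

/* Thread A: publish foo, then release-store the flag. */
static void *thread_a(void *arg)
{
	(void)arg;
	foo = 0xbeef;
	/* Everything above is observable before flag is seen as 1. */
	__atomic_store_n(&flag, 1, __ATOMIC_RELEASE);
	return NULL;
}

/* Thread B: acquire-load the flag, then read foo. */
static void *thread_b(void *arg)
{
	/* No later load/store may be satisfied before the flag
	 * has been seen set. */
	while (__atomic_load_n(&flag, __ATOMIC_ACQUIRE) == 0)
		; /* spin_and_wait() */
	*(int *)arg = foo;
	return NULL;
}

int run_flag_demo(void)
{
	pthread_t a, b;
	int seen = 0;

	foo = 0;
	flag = 0;
	pthread_create(&b, NULL, thread_b, &seen);
	pthread_create(&a, NULL, thread_a, NULL);
	pthread_join(a, NULL);
	pthread_join(b, NULL);
	return seen;
}
```

With the release/acquire pair, thread B is guaranteed to read `foo == 0xbeef` once it observes `flag == 1`, with no explicit fence calls needed.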
_______________________________________________
lng-odp mailing list
lng-odp@lists.linaro.org
https://lists.linaro.org/mailman/listinfo/lng-odp