On 12 November 2015 at 11:03, Savolainen, Petri (Nokia - FI/Espoo) <
petri.savolai...@nokia.com> wrote:
>> > if (msg == UPDATE_FOO) {
>> >     // Make sure that load of “foo” is not moved before schedule()
>> >     odp_sync_loads();
>>
>> odp_sync_acquire() would be a better name.
>>
>> >
>> >     foo++;
>> >     // Make sure that store of “foo” is not moved after queue_enq()
>> >    odp_sync_stores();
>>
>> odp_sync_release() would be a better name.
>>
>> I am somewhat OK with odp_sync_acquire/odp_sync_release (should also
>> take vector of memory segments per suggestion below) as long as it is
>> required that producer and consumer use the same ODP calls.
>
> Naming could be acq/rel instead of loads/stores, if those are considered
well-established terms. The problem with the acq/rel names is that this
call does not do the acq or rel operation. Also, the synchronization
functionality is a bit different:
>
> "While an atomic store-release operation prevents all preceding writes
from moving past the *store-release*, an atomic_thread_fence with
memory_order_release ordering prevents *all preceding writes from moving
past all subsequent stores*".
>
> Thread fence synchronizes between *all stores* before and *all stores*
after the fence - thus the name odp_sync_stores() fits better.
>
> Store-release synchronizes stores before the operation with the
operation, but not with *all other* stores following the operation.
Yes, release ordering is associated with a specific store-release
instruction, which is not explicit when using atomic_thread_fence. The
reason why I still thought odp_sync_release is a better name is that it
also ought to prevent loads from before the sync operation from moving down
past the fence. If loads are not ordered, you could be reading data written
by another CPU after the store fence. I don't know how likely this is in
practice; it depends on actual CPU implementations.
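
To make the distinction concrete, here is a minimal C11 sketch (plain
stdatomic.h, not ODP API; the function names are only for illustration)
contrasting a release operation with a release fence:

#include <stdatomic.h>

atomic_int flag;
int foo;    /* plain shared data */

/* Release *operation*: ordering is tied to this one store. */
void publish_with_store_release(void)
{
    foo = 42;
    atomic_store_explicit(&flag, 1, memory_order_release);
}

/* Release *fence*: orders all preceding loads and stores against all
 * subsequent stores, not just the single relaxed store below. */
void publish_with_release_fence(void)
{
    foo = 42;
    atomic_thread_fence(memory_order_release);
    atomic_store_explicit(&flag, 1, memory_order_relaxed);
}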

The GCC inline asm uses a memory clobber, which prevents the compiler
from moving loads, but it does not prevent the HW from completing a load
after the store fence, which could happen if you load something (perhaps for
later use) before the fence and miss in the cache.
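
For illustration only (a sketch, not ODP code), the difference between a
pure compiler barrier and a hardware fence:

/* Compiler barrier only: stops the compiler from reordering memory
 * accesses across it, but emits no instruction, so the CPU may still
 * complete an earlier load after a later store. */
#define compiler_barrier() __asm__ volatile("" ::: "memory")

/* Full hardware fence (GCC builtin): also constrains the CPU. */
#define hw_fence()         __atomic_thread_fence(__ATOMIC_SEQ_CST)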

The definition ought to make clear that loads before the fence are also
ordered (have completed) before all stores after the fence, and the
implementation(s) need to implement this behaviour. ARM DMB without the ST
option orders both loads and stores. And as I wrote, if synchronisation
between producer and consumer is not done using (coherent) shared memory
but through some other mechanism (e.g. a device write which generates an
interrupt), DSB is needed instead of DMB.
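
A hedged sketch of what this could look like on ARM (the function names are
made up, and the exact barrier choice depends on the platform and on what
the fence must pair with):

/* Normal case: both sides are CPUs observing coherent shared memory.
 * DMB without the ST option orders both loads and stores. */
static inline void sync_release_dmb(void)
{
    __asm__ volatile("dmb ish" ::: "memory");
}

/* Non-CPU consumer, e.g. a device write that generates an interrupt:
 * a DSB is needed instead of a DMB. */
static inline void sync_release_dsb(void)
{
    __asm__ volatile("dsb sy" ::: "memory");
}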

>
>
>
>>
>> >     msg = FOO_UPDATED;
>> >     queue_enq(msg);
>> > }
>> >
>> >                                  ev = schedule();
>> >                                  msg = event_to_message(ev);
>> >
>> >                                  if (msg == FOO_UPDATED) {
>> >                                      // Make sure that load of "foo" is
>> >                                      // not moved before schedule()
>> >                                      odp_sync_loads();
>> >                                      printf("foo is now %d\n", foo);
>> >                                  }
>> >
>> >
>> > As you can see, it would be a lot of waste if the application must
>> > always sync stores and loads explicitly.
>> > I see your point.
>> > I re-read the documentation carefully and maybe we can work around it.
>> > The sync_stores explicitly states that all store operations are
>> > globally visible, so the cache needs to be flushed.
>>
>> If the Kalray HW isn't cache coherent between CPU's, neither old-style
>> (volatile + barriers) nor C11-style shared memory programming will
>> work. Probably some platform-specific code is necessary in order to
>> share data between CPU's and to properly synchronise the access.
>> Perhaps the acquire and release operations should take a vector of
>> addresses and sizes, specifying which data is being shared, so as to
>> minimise cache flushing (in the producer) and invalidation (in the
>> consumer).
>
> Vector / non-coherent versions would be another set of functions. These two
should be just regular memory fences for coherent memory systems.
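
Purely as an illustration of the vector idea (hypothetical names and types,
not existing or proposed ODP API), such non-coherent variants could look
something like this:

typedef struct {
    void  *addr;
    size_t len;
} shm_range_t;

/* Producer: write back (flush) the listed ranges from the local cache,
 * then fence, before signalling the consumer. */
void shm_sync_release(const shm_range_t *ranges, int num);

/* Consumer: after seeing the signal, invalidate the listed ranges so
 * that subsequent loads fetch the producer's data. */
void shm_sync_acquire(const shm_range_t *ranges, int num);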
>
>
>>
>> >
>> > The load part only states that previous memory operations are
>> > complete and cannot be reordered around it.
>> > It does not explicitly state that the cache needs to be up to date.
>> > We internally use "read memory barrier" as DINVAL + FENCE,
>> > but that is not necessarily the case.
>> >
>> > If my understanding of the function fits with yours, I'm good with
>> > these patches. If not, we need to discuss things some more.
>> >
>> > Nicolas
>> >
>> >
>> >
>> > The above highlights how it should *not* be used (== not needed with
>> > ready-made ODP sync mechanisms). It's intended to be used with non-ODP
>> > sync mechanisms like the one below. Store and load syncs are usually a
>> > pair (sync_stores -> sync_loads).
>>
>> If either or both of the producer and consumer cannot use some common
>> ODP mechanism, the mechanism they actually use must be well defined
>> (e.g. must use DMB or DSB on ARM) so that both sides pair properly. So
>> having one side use e.g. odp_sync_stores() (without knowing the actual
>> implementation) and the other side use some arbitrary mechanism (not
>> odp_sync_loads) is a potential cause of problems. It is difficult for
>> me to see a generic implementation (in ODP) that can always pair with
>> whatever undefined mechanism is used by the peer. I think the code in
>> an ODP application that synchronises with a non-ODP producer or
>> consumer must use the specific mechanism agreed on and ODP cannot
>> provide any generic function that will work regardless.
>
> It should be enough that we specify this to function the same way as C11
atomic_thread_fence. Then there is a spec the other side can use to understand
what to do. This is what my patch defines in practice.
The description at
http://en.cppreference.com/w/cpp/atomic/atomic_thread_fence is not
authoritative; it is just a simplified description. The C++11 standard
(chapter 29.8 Fences; chapter 7.17.4 in the C11 standard, the text is the
same) is almost impossible to understand.

*"A release fence A synchronizes with an acquire fence B if there exist
atomic operations X and Y, both operating on some atomic object M, such
that A is sequenced before X, X modifies M, Y is sequenced before B, and Y
reads the value written by X or a value written by any side effect in the
hypothetical release sequence X would head if it were a release operation."*

But in our potential use cases, we don't know if there is some atomic
object M that is operated on using atomic operations (X and Y). So
depending on the atomic_thread_fence() mechanism from C11/C++11 is a little
suspicious.
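
To make the standard's wording concrete, here is a sketch (C11, not ODP API)
where such an atomic object M does exist: "ready" plays the role of M, the
relaxed store is X and the relaxed load is Y, and only because of them do the
two fences synchronize:

#include <stdatomic.h>
#include <stdio.h>

int data;                 /* plain (non-atomic) shared data */
atomic_int ready;         /* the atomic object M */

void producer(void)
{
    data = 42;
    atomic_thread_fence(memory_order_release);               /* fence A */
    atomic_store_explicit(&ready, 1, memory_order_relaxed);  /* X */
}

void consumer(void)
{
    while (atomic_load_explicit(&ready, memory_order_relaxed) == 0) /* Y */
        ;
    atomic_thread_fence(memory_order_acquire);               /* fence B */
    printf("data is now %d\n", data);   /* guaranteed to print 42 */
}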

This guy I trust: http://preshing.com/20130922/acquire-and-release-fences/
*A release fence prevents the memory reordering of any read or write which
precedes it in program order with any write which follows it in program
order.*

When programming in C++11, you can invoke them using the following
functions:

#include <atomic>
std::atomic_thread_fence(std::memory_order_acquire);
std::atomic_thread_fence(std::memory_order_release);

Our functions might better be called odp_release_fence() and
odp_acquire_fence(). Acquire and release fences are different from acquire
and release operations (those are associated with a specific load or store
operation), as you also have noted. The implementation(s) should probably
use some suitable architecture-specific mechanism and not depend on
mechanisms defined by C11/C++11; I don't think the
GCC __atomic_thread_fence builtin differs in semantics here (being modelled
after C++11).
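
A minimal sketch of the suggested names (these are proposals, not existing
ODP API), here simply mapped onto the GCC builtins; an actual implementation
might prefer architecture-specific barriers as argued above:

static inline void odp_release_fence(void)
{
    __atomic_thread_fence(__ATOMIC_RELEASE);
}

static inline void odp_acquire_fence(void)
{
    __atomic_thread_fence(__ATOMIC_ACQUIRE);
}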

>
> Did you see the actual text in the patch? It's clipped out in this thread.
>
>
>>
>> >
>> >
>> > -Petri
>> >
>> >
>> > // data and flag in shared memory
>> > volatile int flag = 0;
>>
>> int flag = 0; is enough
>>
>> >
>> > int foo = 0;
>> >
>> > Thread A                         Thread B
>> >
>> > foo = 0xbeef;
>> >
>> > // Make sure that store
>> > // of "foo" is visible before
>> > // store of "flag"
>> > odp_sync_stores();
>> > flag = 1;
>>
>> Must be implemented as:
>> __atomic_store_n(&flag, 1, __ATOMIC_RELEASE);
>> All loads and stores before writing to the flag that releases the
>> shared data must be observable before the store to flag is observable.
>>
>
> This example illustrates any non-ODP synchronization method (which may be
more complex than a single store). Sure, under ODP the simple release
store should use atomic_store_rel(), but this use case is for non-ODP
provided synchronization (non-ODP algorithm, 3rd-party lib, etc). This
sync may be heavier than store_release, but that's what you need to do if
you cannot use store_release ...
>
>
> -Petri
>
>
>>
>> >
>> >                                  while (flag == 0)
>>
>> while (__atomic_load_n(&flag, __ATOMIC_ACQUIRE) == 0)
>> No load or store after the load of the flag may become observable by
>> other threads before the flag has been "acquired". Speculative
>> stores are not so common, but loads could be reordered, so the
>> (stale) shared data could be read before the flag has been set.
>>
>> >                                      spin_and_wait();
>> >
>> >                                  // Make sure that load of
>> >                                  // "foo" does not happen before
>> >                                  // we see the flag changing
>> >                                  odp_sync_loads();
>> >
>> >                                  printf("foo is now %d\n", foo);
>> >
>> >
>> >
>> >
>> >
>> >
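
Putting the inline comments above together, a hedged sketch of the flag/foo
example rewritten with the suggested __atomic builtins, so that both sides
pair through the same, well-defined mechanism:

#include <stdio.h>

/* data and flag in (coherent) shared memory; no volatile needed */
int flag = 0;
int foo  = 0;

/* Thread A (producer) */
void thread_a(void)
{
    foo = 0xbeef;
    /* Release store: all earlier loads and stores are observable
     * before the store to flag is observable. */
    __atomic_store_n(&flag, 1, __ATOMIC_RELEASE);
}

/* Thread B (consumer) */
void thread_b(void)
{
    /* Acquire load: no later load or store may be satisfied before
     * the flag has been observed as set. */
    while (__atomic_load_n(&flag, __ATOMIC_ACQUIRE) == 0)
        ; /* spin */
    printf("foo is now %x\n", foo);
}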
_______________________________________________
lng-odp mailing list
lng-odp@lists.linaro.org
https://lists.linaro.org/mailman/listinfo/lng-odp
