I was wrong about the Intel architecture: it has a pretty strong
memory-ordering guarantee. But other architectures, like PowerPC and ARM,
have weak memory-ordering guarantees. Here is a good summary:
https://en.wikipedia.org/wiki/Memory_ordering. The document linked from
footnote 3 there is also worth reading.
I'm not sure we should include memory barriers inside the apr_atomic_read32
and apr_atomic_set32 functions, because you don't necessarily need a full
memory barrier on each read or set. Instead, perhaps we should add three
functions to APR:
1. apr_atomic_rmb(): read memory barrier.
2. apr_atomic_wmb(): write memory barrier.
3. apr_atomic_mb(): full memory barrier.
As a use-case, the new util_time.c algorithm could be changed to be:
For the read-from-cache portion:
    const apr_uint32_t key = apr_atomic_read32(&cache_element->key);
    /* Above is done speculatively, no memory barrier used. */
    if (seconds == key && seconds != 0) {
        /* seconds == 0 may mean the cache is uninitialized, so don't
         * use the cache. */
        *xt = cache_element->xt;
        /* After copying xt, make sure cache_element was not marked
         * invalid by another thread beginning an update, and that
         * cache_element really contained data for our second.
         * Requires a memory barrier. */
        apr_atomic_rmb();
        if (apr_atomic_read32(&cache_element->key) == seconds) {
            xt->tm_usec = (int)apr_time_usec(t);
            return APR_SUCCESS;
        }
    }
And the write-to-cache portion would be:
    if (key == apr_atomic_cas32(&cache_element->key, ~seconds, key)) {
        /* We won the race to update this cache_element.
         * The CAS above marks cache_element as invalid by storing
         * ~seconds, because we are starting an update: it's the start
         * of a transaction. */
        cache_element->xt = *xt;
        /* Finished copying; now update key with our key, ending the
         * transaction. Requires a memory barrier. */
        apr_atomic_wmb();
        apr_atomic_set32(&cache_element->key, seconds);
    }
On Sat, Jan 5, 2013 at 8:40 AM, Daniel Lescohier
<[email protected]>wrote:
> I'd have to go back and review the Intel documentation to be sure, but for
> this particular algorithm, an explicit memory barrier may be required on
> Intel architecture, also. If I remember correctly, without a memory
> barrier, Intel arch only guarantees total memory ordering within one cache
> line. For this algorithm, we have an array of 16 cache_elements of 48
> bytes each, so half of the cache_elements cross 64-byte cache lines. When
> reading the cache_element->key after the copy of the cache_element value,
> we need to make sure that the cache_element value read is ordered before
> the read of the cache_element->key, so one needs a memory barrier just
> before the read of the cache_element->key to guarantee the ordering.
>
>
> On Sat, Jan 5, 2013 at 5:08 AM, Igor Galić <[email protected]> wrote:
>
>>
>> > > Sigh. I was too much focused on x86. There the compiler barrier
>> > > caused
>> > > by the function call is enough. But you are right, on other
>> > > architectures these functions may also require cpu memory barriers.
>> >
>> > " on x86 … is enough" - would it harm x86 to add CPU barriers, or
>> > do we want to have # define distinctions per arch?
>>
>> ignore me, I just realized it's going to be different
>> calls per arch anyway!
>>
>> --
>> Igor Galić
>>
>> Tel: +43 (0) 664 886 22 883
>> Mail: [email protected]
>> URL: http://brainsware.org/
>> GPG: 6880 4155 74BD FD7C B515 2EA5 4B1D 9E08 A097 C9AE
>>
>>
>