I was wrong about the Intel architecture: it has a pretty strong
memory-ordering guarantee.  But other architectures, like PowerPC and ARM,
have weak memory-ordering guarantees.  Here is a good summary:
https://en.wikipedia.org/wiki/Memory_ordering.  The document cited in
footnote 3 there is also worth reading.

I'm not sure we should include memory barriers inside the apr_atomic_read32
and apr_atomic_set32 functions, because you don't necessarily need a full
memory barrier on each read or set.  Instead, perhaps we should add three
functions to APR (sketched just after the list):

   1. apr_atomic_rmb(): read memory barrier.
   2. apr_atomic_wmb(): write memory barrier.
   3. apr_atomic_mb(): full memory barrier.
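
These functions don't exist in APR today, so the following is only a rough
sketch of what they might look like (it assumes apr.h is included for
APR_INLINE).  A conservative portable prototype could map all three onto
the GCC __sync_synchronize() full-barrier builtin, and then specialize
apr_atomic_rmb()/apr_atomic_wmb() per architecture (e.g. lwsync on PowerPC,
dmb on ARM, lfence/sfence on x86) once the API is agreed on:

    /* Hypothetical sketch only: apr_atomic_rmb/wmb/mb are not in APR yet.
     * Conservative fallback: treat every barrier as a full barrier via the
     * GCC __sync_synchronize() builtin.  Real implementations would use
     * lighter per-architecture instructions for rmb/wmb. */
    #if defined(__GNUC__)

    static APR_INLINE void apr_atomic_rmb(void)
    {
        __sync_synchronize();   /* could relax to lwsync/dmb/lfence */
    }

    static APR_INLINE void apr_atomic_wmb(void)
    {
        __sync_synchronize();   /* could relax to lwsync/dmb/sfence */
    }

    static APR_INLINE void apr_atomic_mb(void)
    {
        __sync_synchronize();   /* full barrier (sync/dmb/mfence) */
    }

    #else
    #error "No barrier implementation for this compiler (sketch only)"
    #endif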

As a use-case, the new util_time.c algorithm could be changed to be:

For the read-from-cache portion:

    const apr_uint32_t key = apr_atomic_read32(&cache_element->key);
    /* Above is done speculatively; no memory barrier used. */
    if (seconds == key && seconds != 0) {
        /* seconds == 0 may mean the cache is uninitialized, so don't use
         * the cache */
        *xt = cache_element->xt;
        /* After copying xt, make sure cache_element was not marked invalid
         * by another thread beginning an update, and that cache_element
         * really contained data for our second.
         * Requires a memory barrier. */
        apr_atomic_rmb();
        if (apr_atomic_read32(&cache_element->key) == seconds) {
            xt->tm_usec = (int)apr_time_usec(t);
            return APR_SUCCESS;
        }
    }

And the write-to-cache portion would be:

        if (key == apr_atomic_cas32(&cache_element->key, ~seconds, key))
        {   /* We won the race to update this cache_element.
             * Above marks cache_element as invalid by using ~seconds,
             * because we are starting an update: it's the start of a
             * transaction. */
            cache_element->xt = *xt;
            /* Finished copying, now update key with our key,
             * ending the transaction. Need to use a
             * memory barrier.
             */
            apr_atomic_wmb();
            apr_atomic_set32(&cache_element->key, seconds);
        }
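
For reference, those two snippets assume a cache layout roughly along these
lines (names and sizes here are illustrative only, not the exact
util_time.c definitions; the mail quoted below mentions an array of 16
cache_elements of 48 bytes each):

    /* Illustrative only; see util_time.c for the real definitions. */
    struct exploded_time_cache_element {
        apr_uint32_t key;    /* seconds value; ~seconds while an update is
                              * in progress; 0 if never initialized */
        apr_time_exp_t xt;   /* cached exploded time for that second */
    };

    #define TIME_CACHE_SIZE 16
    #define TIME_CACHE_MASK (TIME_CACHE_SIZE - 1)

    static struct exploded_time_cache_element cache[TIME_CACHE_SIZE];

    /* Both portions above would pick their element with something like:
     *     cache_element = &cache[seconds & TIME_CACHE_MASK];
     */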

On Sat, Jan 5, 2013 at 8:40 AM, Daniel Lescohier
<daniel.lescoh...@cbsi.com> wrote:

> I'd have to go back and review the Intel documentation to be sure, but for
> this particular algorithm, an explicit memory barrier may be required on
> Intel architecture, also.  If I remember correctly, without a memory
> barrier, Intel arch only guarantees total memory ordering within one cache
> line.  For this algorithm, we have an array of 16 cache_elements of 48
> bytes each, so half of the cache_elements cross 64-byte cache lines.  When
> reading the cache_element->key after the copy of the cache_element value,
> we need to make sure that the cache_element value read is ordered before
> the read of the cache_element->key, so one needs a memory barrier just
> before the read of the cache_element->key to guarantee the ordering.
>
>
> On Sat, Jan 5, 2013 at 5:08 AM, Igor Galić <i.ga...@brainsware.org> wrote:
>
>>
>> > > Sigh. I was too much focused on x86. There the compiler barrier
>> > > caused
>> > > by the function call is enough. But you are right, on other
>> > > architectures these functions may also require cpu memory barriers.
>> >
>> > " on x86 … is enough" - would it harm x86 to add CPU barriers, or
>> > do we want to have # define distinctions per arch?
>>
>> ignore me, I just realized it's going to be different
>> calls per arch anyway!
>>
>> --
>> Igor Galić
>>
>> Tel: +43 (0) 664 886 22 883
>> Mail: i.ga...@brainsware.org
>> URL: http://brainsware.org/
>> GPG: 6880 4155 74BD FD7C B515  2EA5 4B1D 9E08 A097 C9AE
>>
>>
>
