I'd have to go back and review the Intel documentation to be sure, but for this particular algorithm an explicit memory barrier may be required on Intel architecture as well. If I remember correctly, without a memory barrier, Intel arch only guarantees total memory ordering within one cache line. For this algorithm, we have an array of 16 cache_elements of 48 bytes each, so half of the cache_elements cross 64-byte cache-line boundaries. When reading cache_element->key after copying the cache_element value, we need to make sure that the read of the cache_element value is ordered before the read of cache_element->key. That means a memory barrier is needed just before the read of cache_element->key to guarantee the ordering.
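A quick way to check the "half of the cache_elements" figure is to compute which elements straddle a line boundary. This is a minimal sketch that assumes the array starts on a 64-byte boundary and that each cache_element is exactly 48 bytes; the macro and function names are made up for illustration.

```c
#include <stddef.h>

#define CACHE_LINE 64  /* bytes per cache line (assumed) */
#define ELEM_SIZE  48  /* bytes per cache_element, as in the message */
#define NUM_ELEMS  16  /* elements in the array */

/* Returns 1 if element i spans two cache lines, assuming the array
   base address is itself 64-byte aligned. */
static int crosses_line(size_t i)
{
    size_t first = i * ELEM_SIZE;          /* offset of the element's first byte */
    size_t last  = first + ELEM_SIZE - 1;  /* offset of the element's last byte */
    return first / CACHE_LINE != last / CACHE_LINE;
}

/* Counts how many of the NUM_ELEMS elements straddle a line boundary. */
static size_t count_crossing(void)
{
    size_t n = 0;
    for (size_t i = 0; i < NUM_ELEMS; i++)
        n += (size_t)crosses_line(i);
    return n;
}
```

Because lcm(48, 64) = 192 bytes, the pattern repeats every four elements: elements 1, 2, 5, 6, 9, 10, 13 and 14 span two lines, giving 8 of 16. The count depends on the base alignment, though; with the array starting 8 bytes past a line boundary, three of every four elements would cross.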
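To make the ordering requirement concrete, here is a rough sketch of the read side using C11 <stdatomic.h> as a portable stand-in for whatever barrier macros the real code uses. Everything in it is hypothetical: the layout (an 8-byte key plus 40 bytes of value), cache_store(), cache_lookup() and KEY_EMPTY are my own names. The key doubles as the validation check, much like the sequence counter in a seqlock, and a single writer is assumed.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define VALUE_WORDS 5          /* 5 x 8 bytes = 40 bytes of value */
#define KEY_EMPTY   UINT64_MAX /* hypothetical "no valid entry" marker */

/* Hypothetical 48-byte element. Fields are C11 atomics so that the
   concurrent reads below are not data races under the C11 model. */
typedef struct {
    _Atomic uint64_t key;
    _Atomic uint64_t value[VALUE_WORDS];
} cache_element;

/* Single writer: invalidate the key, rewrite the value, publish the key.
   Assumes a key is never re-stored with a different value (otherwise a
   reader could see the same key before and after a torn copy). */
static void cache_store(cache_element *e, uint64_t key, const uint64_t *in)
{
    atomic_store_explicit(&e->key, KEY_EMPTY, memory_order_relaxed);
    /* Order the invalidation before the value stores that follow. */
    atomic_thread_fence(memory_order_release);
    for (int i = 0; i < VALUE_WORDS; i++)
        atomic_store_explicit(&e->value[i], in[i], memory_order_relaxed);
    /* Release: the value stores become visible before the new key. */
    atomic_store_explicit(&e->key, key, memory_order_release);
}

/* Reader: check the key, copy the value, then re-read the key.
   Returns true only if the copy in `out` belongs to `key`. */
static bool cache_lookup(cache_element *e, uint64_t key, uint64_t *out)
{
    if (atomic_load_explicit(&e->key, memory_order_acquire) != key)
        return false;
    for (int i = 0; i < VALUE_WORDS; i++)
        out[i] = atomic_load_explicit(&e->value[i], memory_order_relaxed);
    /* The barrier discussed above: the value reads must be ordered before
       the key re-read, even when the element spans two cache lines. */
    atomic_thread_fence(memory_order_acquire);
    return atomic_load_explicit(&e->key, memory_order_relaxed) == key;
}
```

The acquire fence sits exactly where the barrier "just before the read of cache_element->key" would go. Writing it as atomic_thread_fence() leaves it to the compiler to emit whatever each architecture needs, instead of maintaining separate barrier definitions per arch.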
On Sat, Jan 5, 2013 at 5:08 AM, Igor Galić <i.ga...@brainsware.org> wrote:
>
> > > Sigh. I was too much focused on x86. There the compiler barrier caused
> > > by the function call is enough. But you are right, on other
> > > architectures these functions may also require cpu memory barriers.
> >
> > " on x86 … is enough" - would it harm x86 to add CPU barriers, or
> > do we want to have # define distinctions per arch?
>
> ignore me, I just realized it's going to be different
> calls per arch anyway!
>
> --
> Igor Galić
>
> Tel: +43 (0) 664 886 22 883
> Mail: i.ga...@brainsware.org
> URL: http://brainsware.org/
> GPG: 6880 4155 74BD FD7C B515 2EA5 4B1D 9E08 A097 C9AE