"On x86 loads can only be reordered with store operations to a different 
memory location, since it's checking the store buffer first. Also, loads 
are not reordered with other loads."

This isn't completely true. Even though the reordering of loads can't be 
observed architecturally, that doesn't mean it doesn't happen: loads can be 
speculatively executed out of order. When a potential reordering of loads is 
detected (for example, another core writes to the cache line between the 
early load and its place in program order), a machine clear is issued, which 
nukes the pipeline, and the loads are re-issued. Hopefully this time the 
speculation pays off.
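
Because the hardware cleans up after itself like this, the usual C++ 
acquire/release mapping on x86 needs no fence instructions at all. A minimal 
message-passing sketch (my own names, not from the thread) where both the 
release store and the acquire load compile to plain MOVs on x86:

```cpp
#include <atomic>
#include <thread>

std::atomic<int> data{0};
std::atomic<int> flag{0};

// Producer publishes data, then raises the flag; consumer spins on the flag
// and then reads data. On x86 both atomics are plain MOVs: the machine-clear
// mechanism above is what keeps the speculative load reordering invisible.
int run_once() {
    data.store(0, std::memory_order_relaxed);
    flag.store(0, std::memory_order_relaxed);
    std::thread producer([] {
        data.store(42, std::memory_order_relaxed);
        flag.store(1, std::memory_order_release);   // plain MOV on x86
    });
    while (flag.load(std::memory_order_acquire) == 0) { /* spin */ }
    int observed = data.load(std::memory_order_relaxed);  // guaranteed 42
    producer.join();
    return observed;
}
```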

For more info see:

https://fgiesen.wordpress.com/2013/01/31/cores-dont-like-to-share/

It can also happen that a load falsely matches a store in the store buffer, 
because the disambiguation check only compares the low 12 bits of the 
address (so-called 4K aliasing), and the load needs to be re-issued:

https://richardstartin.github.io/posts/4k-aliasing
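
A rough sketch of the pattern that post benchmarks (function and variable 
names are mine): the load of b[i] executes while the store to a[i] is still 
in the store buffer, and the two addresses differ by exactly 4096 bytes, so 
their low 12 bits agree. The check falsely matches and the load is 
re-issued. This only costs throughput, never correctness:

```cpp
#include <vector>

// b sits exactly 4096 bytes past a, so a[i] and b[i] share address bits
// 0..11. Each iteration stores to a[i] (buffered) and immediately loads
// b[i], triggering a false store-buffer match and a re-issued load.
int aliased_sum() {
    std::vector<int> buf(2048, 0);      // two 4096-byte halves
    int *a = buf.data();
    int *b = buf.data() + 1024;         // +4096 bytes: same low 12 bits
    for (int i = 0; i < 1024; ++i) b[i] = 1;
    int sum = 0;
    for (int i = 0; i < 1024; ++i) {
        a[i] = i;                       // store, still in the store buffer
        sum += b[i];                    // load that 4K-aliases the store
    }
    return sum;                          // values are still correct: 1024
}
```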

"Since loads could be reordered with earlier stores, we need `mfence` to 
force the core to serialize instructions and flush the store buffer before 
proceeding."

Some terminology:
SB = store buffer
ROB = reorder buffer
LSU = load store unit

For an instruction to be fully serializing, the core needs to stop issuing 
instructions into the ROB until both the ROB and the SB have been drained.
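
The classic example of a fully serializing instruction is cpuid: the front 
end stops issuing until everything already in the ROB has retired and all 
buffered stores have drained. A sketch for x86-64 with GCC/Clang inline asm 
(my own wrapper name; leaf 0 returns the vendor string in ebx:edx:ecx):

```cpp
#include <cstdint>

// cpuid is architecturally defined as fully serializing: it drains both
// the ROB and the store buffer before later instructions are fetched.
static uint32_t serialize_with_cpuid() {
    uint32_t eax = 0, ebx = 0, ecx = 0, edx = 0;
    asm volatile("cpuid"
                 : "+a"(eax), "=b"(ebx), "=c"(ecx), "=d"(edx)
                 :
                 : "memory");
    return ebx;   // first 4 bytes of the vendor string, always nonzero
}
```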

What an mfence does is stop the LSU from executing loads until the SB is 
drained to the coherent cache, and that draining is already happening as 
fast as possible anyway. So mfence orders later loads after earlier stores, 
but it isn't a fully serializing instruction, because it:
1) doesn't stop instructions from issuing into the ROB (it only stops the 
LSU from executing loads), and
2) doesn't wait for the ROB to be drained (it only waits for the SB to be 
drained).
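
The case where that StoreLoad ordering actually matters is the Dekker-style 
litmus test: each thread stores to its own flag and then loads the other's. 
Without a barrier, both loads can execute while both stores still sit in 
the store buffers, so both threads read 0. A sketch (names are mine); on 
x86, atomic_thread_fence(seq_cst) compiles to mfence or a lock-prefixed 
add, which forbids that outcome:

```cpp
#include <atomic>
#include <thread>

std::atomic<int> x{0}, y{0};
int r1, r2;

// With the seq_cst fences between each store and the following load,
// r1 == 0 && r2 == 0 is impossible: at least one thread must observe
// the other's store.
bool litmus_once() {
    x.store(0); y.store(0);
    std::thread t1([] {
        x.store(1, std::memory_order_relaxed);
        std::atomic_thread_fence(std::memory_order_seq_cst);  // SB drains first
        r1 = y.load(std::memory_order_relaxed);
    });
    std::thread t2([] {
        y.store(1, std::memory_order_relaxed);
        std::atomic_thread_fence(std::memory_order_seq_cst);
        r2 = x.load(std::memory_order_relaxed);
    });
    t1.join(); t2.join();
    return r1 + r2 >= 1;
}
```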

The lfence, AFAIK, has no meaning with respect to ordinary loads and 
stores, only weakly ordered ones (streaming reads, as Avi already pointed 
out). What an lfence does is stop issuing instructions into the ROB until 
the ROB is drained, but it will not wait for the SB to be drained, so it is 
only a partly serializing instruction. Both lfence and sfence are pretty 
useless for normal loads/stores.
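
The one place sfence earns its keep is publishing data written with 
non-temporal (streaming) stores, which are weakly ordered: without the 
fence, a later plain store to a flag could become visible before the 
streamed payload. An x86-only sketch (the function name is mine, not from 
the thread):

```cpp
#include <immintrin.h>

// _mm_stream_si32 is a non-temporal store: it bypasses the cache and is
// weakly ordered. sfence orders all earlier stores, including streaming
// ones, before all later stores, so the flag cannot overtake the payload.
void publish_streamed(int *payload, volatile int *ready) {
    _mm_stream_si32(payload, 42);  // NT store, weakly ordered
    _mm_sfence();                  // drain write-combining buffers
    *ready = 1;                    // payload is guaranteed visible first
}
```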

On Friday, August 18, 2023 at 8:28:02 PM UTC+3 [email protected] wrote:

> Hello everyone :wave:
>
> I'm trying to understand the intricacies of low-level concurrent 
> programming, focusing on x86 for the time being. Specifically I'd like to 
> write atomics that would work correctly for x86, but the more I dig into 
> this, the more confused I'm getting given all the parts involved.
>
> Here's my current code, I was wondering if people with a better knowledge 
> of the architecture could point me to the issues in my reasoning how this 
> should be done.
>
> enum struct Memory_Order: u32 {
>   Whatever,
>   Acquire,
>   Release,
>   Acquire_Release,
>   Sequential
> };
>
> template <typename T>
> struct Atomic {
>   using Value_Type = T;
>   volatile T value;
> };
>
> template <typename T>
> using Atomic_Value = typename Atomic<T>::Value_Type;
>
> #define compiler_barrier() do { asm volatile ("" ::: "memory"); } while (0)
> #define full_fence() do { asm volatile ("mfence" ::: "memory"); } while (0)
>
> template <Memory_Order order = Memory_Order::Whatever, typename T>
> static T atomic_load (const Atomic<T> *atomic) {
>   using enum Memory_Order;
>
>   static_assert(sizeof(T) <= sizeof(void*));
>   static_assert((order == Whatever) || (order == Acquire) || (order == 
> Sequential));
>
>   if constexpr (order == Sequential) full_fence();
>   auto result = atomic->value;
>   if constexpr (order != Whatever) compiler_barrier();
>
>   return result;
> }
>
> template <Memory_Order order = Memory_Order::Whatever, typename T>
> static void atomic_store (Atomic<T> *atomic, Atomic_Value<T> value) {
>   using enum Memory_Order;
>
>   static_assert(sizeof(T) <= sizeof(void*));
>   static_assert((order == Whatever) || (order == Release) || (order == 
> Sequential));
>
>   if constexpr (order == Whatever) {
>     atomic->value = value;
>   }
>   else if constexpr (order == Release) {
>     compiler_barrier();
>     atomic->value = value;
>   }
>   else {
>     asm volatile (
>       "lock xchg %1, %0"
>       : "+r"(value), "+m"(atomic->value)
>       :
>       : "memory"
>     );
>   }
> }
>  
> On x86 loads can only be reordered with store operations to a different 
> memory location, since it's checking the store buffer first. Also loads are 
> not reordered with other loads.
>
> This effectively guarantees an acquire semantics by default for all loads 
> on x86, thus, we don't need to have any explicit memory barrier and only 
> prevent the compiler from reordering instructions.
>
> Since loads could be reordered with earlier stores, we need `mfence` to 
> force the core to serialize instructions and flush the store buffer before 
> proceeding.
>
> In case of atomic_store, my understanding of locked instructions, that 
> they guarantee sequential consistency, thus xchg is good enough for that 
> case. For the Release semantics having a compiler barrier is also enough. 
>
> My uncertainty is with speculative execution of loads and if that requires 
> the use of `lfence` for the Acquire case before the load?
>
> Kind regards,
> Aleksandr.
>
>

-- 
You received this message because you are subscribed to the Google Groups 
"mechanical-sympathy" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to [email protected].
To view this discussion on the web, visit 
https://groups.google.com/d/msgid/mechanical-sympathy/5cd61907-8866-4ec7-b74b-6fc933279edan%40googlegroups.com.
