On Feb 13, 2007, at 7:37 PM, Brian W. Barrett wrote:

On Feb 13, 2007, at 5:16 PM, Jeff Squyres wrote:

On Feb 13, 2007, at 7:10 PM, George Bosilca wrote:

It's already in the 1.2!!! I don't know how much you care about
performance, but I do. This patch increases the latency by 10%. It
might be correct for the PathScale compiler, but it doesn't look like
a hard requirement for all the other compilers. A memory barrier for
an initialization and an unlock definitely looks like killing an ant
with a nuclear strike.

Can we roll this back and find some other way?

Yes, we can.

It's not actually the memory barrier we need; what we need is to tell
the compiler not to do anything stupid, because we expect memory to be
invalidated. I'll commit a new, different fix tonight.

Upon further review, I'm wrong again. The original patch was wrong (not sure what I was thinking this afternoon), and my statement above is wrong as well. The problem starts with code like this:

a = 1
mylock->lock = 0
b = 2

Which is essentially what you have after inlining the atomic unlock, as it occurred today. It's not at all unreasonable for a compiler to reorder that (we have seen this in practice, including with GCC on LA-MPI, and it is likely happening now without us realizing it) into:

a = 1
b = 2
mylock->lock = 0

or

mylock->lock = 0
a = 1
b = 2

After all, there are no memory dependencies among those three lines of code. When we had the compare-and-swap for unlock, there was a memory dependency: either the compare-and-swap inline assembly hinted to the compiler that memory was changed by the op, so it wouldn't reorder memory accesses across that boundary, or the compare-and-swap wasn't inlined at all. Compilers are pretty much not going to reorder memory accesses across a function call unless it's 100% clear that there is no side effect that might matter, which is basically never the case in C.
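For reference, here's roughly the kind of compare-and-swap we had, written as GCC inline assembly for x86. The names and exact constraints here are mine, just for illustration, not the actual Open MPI sources; the point is the "memory" clobber, which is what kept the compiler from moving other stores across the unlock:

#include <stdint.h>

/* Illustrative only: an x86 compare-and-swap with a "memory" clobber.
 * The clobber tells GCC that memory may have changed, so it won't
 * reorder loads/stores across this statement. */
static inline int
example_atomic_cmpset_32(volatile int32_t *addr,
                         int32_t oldval, int32_t newval)
{
    unsigned char ret;
    __asm__ __volatile__("lock; cmpxchgl %3, %1 \n\t"
                         "sete %0"
                         : "=q" (ret), "+m" (*addr), "+a" (oldval)
                         : "r" (newval)
                         : "memory", "cc");
    return (int) ret;
}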

Ok, so with a little care we can tell the compiler not to reorder memory accesses, either with compiler hints (inline assembly statements that include the "memory" clobber hint) or by making atomic_unlock a real, non-inlined function.
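For the first option, the hint can be as small as an empty asm statement; something along these lines (the macro name is made up for illustration) compiles to no instructions at all but still acts as a compiler-level fence:

/* Compiler-only barrier: emits no instructions, but the "memory"
 * clobber forbids the compiler from reordering memory accesses
 * across it. It does nothing about hardware reordering. */
#define example_compiler_barrier() __asm__ __volatile__("" : : : "memory")

Note that this only constrains the compiler; it says nothing about what the CPU or memory controller do at run time, which is the next problem.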

But now we start running on hardware, and the CPU / memory controller is free to start reordering the stores. We don't have any instructions telling the hardware not to reorder our original writes, so it can still produce either of the two bad cases above. That's still not good for us and could definitely lead to incorrect programs. So we need a memory barrier, or we have potentially invalid code.

The full memory barrier is totally overkill for this situation, but some memory barrier is needed. While perhaps not quite correct, I believe that something like:

static inline void
opal_atomic_unlock(opal_atomic_lock_t *lock)
{
  /* order all prior writes before the store that releases the lock */
  opal_atomic_wmb();
  lock->u.lock = OPAL_ATOMIC_UNLOCKED;
}

would be more correct than having the barrier after the write, and would perform slightly better than the full atomic barrier. On x86 and x86_64, write memory barriers are essentially "free": all they do is limit the compiler's reordering of memory accesses. But on PPC, SPARC, and Alpha, the barrier would have a performance cost. I don't know what that cost is, but I do know we need to pay it for correctness.
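To make that concrete, here's roughly what I'd expect opal_atomic_wmb() to boil down to on a couple of architectures (a sketch under my assumptions, with an illustrative name, not the exact code in our tree):

#if defined(__x86_64__) || defined(__i386__)
/* x86 doesn't reorder stores with other stores, so the write barrier
 * only needs to be a compiler barrier. */
static inline void example_wmb(void)
{
    __asm__ __volatile__("" : : : "memory");
}
#elif defined(__powerpc__)
/* PPC needs a real instruction to order the stores; lwsync (or eieio)
 * is lighter weight than a full sync. */
static inline void example_wmb(void)
{
    __asm__ __volatile__("lwsync" : : : "memory");
}
#endif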

Long term, we should probably implement the spinlocks entirely in inline assembly. That wouldn't make a huge performance difference, but at least I could make sure the memory barrier is in the right place and keep the compiler from doing something stupid.
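Purely as a sketch of what I have in mind, for x86 only (the type and function names are made up here; the real thing would sit behind our existing opal_atomic_* interface):

#include <stdint.h>

typedef struct { volatile int32_t lock; } example_spinlock_t;

static inline void example_spin_lock(example_spinlock_t *l)
{
    int32_t tmp = 1;
    do {
        /* xchg with memory on x86 is implicitly locked and acts as
         * a full barrier */
        __asm__ __volatile__("xchgl %0, %1"
                             : "+r" (tmp), "+m" (l->lock)
                             :
                             : "memory");
    } while (tmp != 0);
}

static inline void example_spin_unlock(example_spinlock_t *l)
{
    /* On x86 a compiler barrier plus a plain store is enough to
     * release the lock; other architectures would need a real write
     * barrier (like the wmb above) before the store. */
    __asm__ __volatile__("" : : : "memory");
    l->lock = 0;
}

This way the barrier and the store that releases the lock are pinned together, and the compiler can't separate or reorder them.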

By the way, this is what the Linux kernel does, adding credence to my argument, I hope ;).


Brian

--
  Brian Barrett
  Open MPI Team, CCS-1
  Los Alamos National Laboratory

