On Feb 13, 2007, at 7:37 PM, Brian W. Barrett wrote:
On Feb 13, 2007, at 5:16 PM, Jeff Squyres wrote:
On Feb 13, 2007, at 7:10 PM, George Bosilca wrote:
It's already in the 1.2!!! I don't know how much you care about
performance, but I do. This patch increases the latency by 10%. It
might be correct for the PathScale compiler, but it didn't look like a
huge requirement for all the other compilers. A memory barrier for an
initialization and an unlock definitely looks like killing an ant
with a nuclear strike.
Can we roll this back and find some other way?
Yes, we can.
It's not actually the memory barrier we need; what we need is to tell
the compiler not to do anything stupid, because we expect memory to be
invalidated. I'll commit a new, different fix tonight.
Upon further review, I'm wrong again. The original patch was wrong
(not sure what I was thinking this afternoon) and my statement above
is wrong. So the problem starts with the code:
a = 1
mylock->lock = 0
b = 2
That is essentially what you have after inlining the atomic unlock
as it occurred today. It's not totally unreasonable for a compiler
(and we have seen this in practice, including with GCC on LA-MPI,
and it is likely happening now, we just don't realize it) to reorder
that to:
a = 1
b = 2
mylock->lock = 0
or
mylock->lock = 0
a = 1
b = 2
After all, there are no memory dependencies among the three lines of
code. In particular, if the store to a slips below the unlock, another
thread can acquire the lock and read a stale value of a. When we had
the compare and swap for unlock, there was a memory dependency: either
the compare-and-swap inline assembly hinted to the compiler that memory
was changed by the op, so it shouldn't reorder memory accesses across
that boundary, or the compare and swap wasn't inlined at all. Compilers
are pretty much not going to reorder memory accesses across a function
call unless it's 100% clear that there is no side effect that might be
important, which is basically never the case in C.
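For reference, here's a minimal sketch (GCC-style x86-64 inline
assembly with made-up names, not the actual Open MPI atomics) of the
kind of compare and swap I mean; the "memory" clobber at the end is
exactly the hint that keeps the compiler from moving loads and stores
across the operation:
#include <stdint.h>
/* Sketch only: compare *addr with oldval and, if equal, store newval.
 * The "memory" clobber tells the compiler that memory may have changed,
 * so it must not reorder memory accesses across this asm block. */
static inline int
cas_32(volatile int32_t *addr, int32_t oldval, int32_t newval)
{
    unsigned char ret;
    __asm__ __volatile__("lock; cmpxchgl %3, %1 \n\t"
                         "sete %0"
                         : "=q" (ret), "+m" (*addr), "+a" (oldval)
                         : "r" (newval)
                         : "memory", "cc");
    return (int) ret;
}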
Ok, so we can tell the compiler not to reorder memory accesses with a
little care, either by using inline assembly statements that include
the "memory" clobber (invalidation) hint, or by making atomic_unlock a
real, non-inlined function.
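The first option can be as cheap as an empty asm statement; something
like this (a sketch with a made-up macro name) generates no
instructions at all but still stops the compiler from moving memory
accesses across it:
/* Compiler-only barrier: emits no instruction, but the "memory" clobber
 * forbids the compiler from reordering memory accesses across it. */
#define COMPILER_BARRIER() __asm__ __volatile__("" : : : "memory")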
But now we start running on hardware, and the CPU / memory controller
is free to reorder the memory accesses themselves. We don't have any
instructions telling the CPU / memory controller not to reorder our
original stores, so it can still produce either of the two bad cases.
That's still not good for us and could definitely lead to incorrect
programs. So we need a memory barrier, or we have potentially invalid
code.
The full memory barrier is totally overkill for this situation, but
some memory barrier is needed. While not quite correct, I believe
that something like:
static inline void
opal_atomic_unlock(opal_atomic_lock_t *lock)
{
    opal_atomic_wmb();
    lock->u.lock = OPAL_ATOMIC_UNLOCKED;
}
would be more correct than having the barrier after the write, and
give slightly better performance than the full atomic barrier. On x86
and x86_64, memory barriers are "free", in that all they do is limit
the compiler's reordering of memory accesses. But on PPC, Sparc, and
Alpha, there would be a performance cost. I don't know what that cost
is, but I know that we need to pay it for correctness.
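To give a feel for where the cost comes from, here is roughly what a
write memory barrier could expand to on a few platforms (a sketch with
a made-up macro name, not the real opal_atomic_wmb() definitions):
#if defined(__x86_64__) || defined(__i386__)
/* x86 does not reorder stores with other stores, so only the compiler
 * needs to be restrained. */
#define MY_WMB() __asm__ __volatile__("" : : : "memory")
#elif defined(__powerpc__)
/* PPC needs a real instruction; eieio orders stores among themselves. */
#define MY_WMB() __asm__ __volatile__("eieio" : : : "memory")
#elif defined(__sparc__)
/* Sparc V9: order earlier stores before later stores. */
#define MY_WMB() __asm__ __volatile__("membar #StoreStore" : : : "memory")
#endif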
Long term, we should probably try to implement spinlocks as inline
assembly. This wouldn't provide a whole lot of performance
difference, but at least I could make sure the memory barrier is in
the right place and help the compiler not be stupid.
By the way, this is what the Linux kernel does, adding credence to my
argument, I hope ;).
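To make that concrete, here's a rough sketch of what an
inline-assembly spinlock could look like on x86 (made-up names,
GCC-style asm; not a proposed patch): the lock uses an atomic
exchange, and the unlock is exactly the compiler barrier plus plain
store discussed above.
#include <stdint.h>
typedef struct { volatile int32_t lock; } my_spinlock_t;  /* 0 = unlocked */
static inline void
my_spin_lock(my_spinlock_t *l)
{
    int32_t val;
    do {
        val = 1;
        /* xchg with a memory operand is implicitly locked on x86 and
         * acts as a full memory barrier. */
        __asm__ __volatile__("xchgl %0, %1"
                             : "+r" (val), "+m" (l->lock)
                             :
                             : "memory");
    } while (val != 0);  /* got back 1 => someone else held the lock */
}
static inline void
my_spin_unlock(my_spinlock_t *l)
{
    /* x86 does not reorder stores with earlier stores, so a compiler
     * barrier followed by a plain store is enough to release the lock. */
    __asm__ __volatile__("" : : : "memory");
    l->lock = 0;
}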
Brian
--
Brian Barrett
Open MPI Team, CCS-1
Los Alamos National Laboratory