On Mar 11, 2009, at 12:19 PM, Eugene Loh wrote:
I don't understand what's going on, but I guess each process is calling
sm_btl_first_time_init(), during which it initializes its own shm_bases
value, FIFOs, and shm_fifo pointer. If a remote process saw those memory
operations in that order, then initialization of the shm_fifo pointer
would be an indicator that the rest of the data structures had been
initialized. But there are no memory barriers between those operations
to order them. So, perhaps testing the shm_fifo pointer doesn't really
mean much. I don't know enough about memory coherency to know.
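
To make the ordering concern concrete, here is a minimal sketch of the
"initialize everything, then publish the pointer" pattern, with the
barrier made explicit. It uses C11 <stdatomic.h> as a stand-in and is
not the actual sm BTL code: demo_fifo_t, demo_shared_t, demo_publish(),
and demo_try_acquire() are hypothetical names, and the sketch assumes
writer and reader see the shared structure at the same address.

```c
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical stand-in for a per-process FIFO; not the real sm BTL layout. */
typedef struct {
    int head;
    int tail;
    int capacity;
} demo_fifo_t;

/* Hypothetical shared slot that a peer polls to see whether init is done. */
typedef struct {
    demo_fifo_t fifo_storage;
    _Atomic(demo_fifo_t *) fifo_ptr;   /* NULL until the owner publishes it */
} demo_shared_t;

/* Owner side: initialize the data first, then publish the pointer.
 * The release store keeps the plain writes above it from being
 * reordered after the pointer becomes visible. */
void demo_publish(demo_shared_t *s)
{
    s->fifo_storage.head = 0;
    s->fifo_storage.tail = 0;
    s->fifo_storage.capacity = 64;
    atomic_store_explicit(&s->fifo_ptr, &s->fifo_storage,
                          memory_order_release);
}

/* Peer side: returns NULL if not yet published. The acquire load pairs
 * with the release store, so a non-NULL result implies the FIFO fields
 * written before publication are visible to the caller. */
demo_fifo_t *demo_try_acquire(demo_shared_t *s)
{
    return atomic_load_explicit(&s->fifo_ptr, memory_order_acquire);
}
```

Without the release/acquire pair (or equivalent write and read
barriers), a non-NULL pointer alone would not guarantee that the other
fields had been initialized from the peer's point of view.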
FWIW, George and I puzzled through some of this code yesterday. We
didn't see anything that was obviously wrong, even though we were
actively trying to think of wacky race conditions that could be
happening. :-(
George said he'd continue to investigate.
--
Jeff Squyres
Cisco Systems