Ralph Castain wrote:
Could be nobody is saying anything... but I would be surprised if -nobody- barked at a segfault during startup.
Well, if it segfaulted during startup, someone's first reaction would probably be, "Oh really?" They would try again, have success, attribute it to cosmic rays, and move on. But, yes, it is presumably rare (reasonably measured in parts per million), and the failure is early and obvious. And it is in code that is due to change very soon.
I don't fully understand what's going on, but I guess each process is calling sm_btl_first_time_init(), during which it initializes its own shm_bases value, its FIFOs, and its shm_fifo pointer. If a remote process saw those memory operations in that order, then a non-NULL shm_fifo pointer would indicate that the rest of the data structures had been initialized. But there are no memory barriers between those operations to order them, so the remote process could see the pointer update before the others. So, perhaps testing the shm_fifo pointer doesn't really mean much. I don't know enough about memory coherency to say.
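To make the ordering concern concrete, here is a minimal sketch of the "initialize, then publish a pointer" pattern with explicit ordering. It is only an illustration under assumptions: peer_slot, slot_init, slot_publish and slot_try_read are hypothetical names, not anything in btl_sm.c, and it uses C11 atomics rather than whatever barrier primitives the sm BTL would actually use. The release store keeps the earlier plain writes from being seen after the pointer; the acquire load keeps the later plain reads from being satisfied before it.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for one process's slot in the shared segment. */
typedef struct {
    uintptr_t base;        /* plain data, written before publishing */
    int fifo_data[4];      /* plain data, written before publishing */
    _Atomic(int *) fifo;   /* publication flag: NULL until the slot is ready */
} peer_slot;

/* Mark the slot as "not yet initialized". */
void slot_init(peer_slot *slot)
{
    assert(slot != NULL);
    atomic_init(&slot->fifo, NULL);
}

/* Writer side: fill in the data first, then publish the pointer.
 * The release store orders the plain writes above before the pointer store. */
void slot_publish(peer_slot *slot, uintptr_t base)
{
    assert(slot != NULL);
    slot->base = base;
    for (int i = 0; i < 4; i++) {
        slot->fifo_data[i] = 0;
    }
    atomic_store_explicit(&slot->fifo, slot->fifo_data, memory_order_release);
}

/* Reader side: returns 1 and stores the peer's base in *base_out if the
 * slot has been published, otherwise returns 0 and leaves *base_out alone.
 * The acquire load orders the read of slot->base after the pointer check. */
int slot_try_read(peer_slot *slot, uintptr_t *base_out)
{
    assert(slot != NULL && base_out != NULL);
    int *fifo = atomic_load_explicit(&slot->fifo, memory_order_acquire);
    if (fifo == NULL) {
        return 0;
    }
    *base_out = slot->base;
    return 1;
}
```

In real use the two sides would run in different processes that map the same segment. With plain stores and loads in place of the release/acquire pair, a reader on a weakly ordered machine could in principle see a non-NULL fifo pointer while still reading a stale base.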
I think Terry has seen https://svn.open-mpi.org/source/xref/ompi_1.3/ompi/mca/btl/sm/btl_sm.c?r=20298#520 produce a wild "diff" value (between local and remote "bases"), even though it was supposed to be 0. I could see this happening if one saw the updates to the remote bases and shm_fifo values in the "wrong" order.
Jeff said he saw a problem at https://svn.open-mpi.org/source/xref/ompi_1.3/ompi/mca/btl/sm/btl_sm.c?r=20298#529 . He says he sees reasonable values for .fifo[j][...], so this would be harder to explain.