Sylvain Jeaugey wrote:
Hi Ralph,
I managed to get a deadlock after a whole night, but not the same one you
have: after a quick analysis, process 0 seems to be blocked in the
very first send through shared memory. It may still be a bug, but not the
same as yours IMO.
Yes, that's the one Terry and I have tried to hunt down. Kind of
elusive. Apparently, there is a race condition in sm start-up. It
*appears* as though a process (the lowest rank on a node?) computes
offsets into shared memory using bad values and ends up with a FIFO
pointer to the wrong spot. Up through 1.3.1, this meant that OMPI would
fail in add_procs()... Jeff and Terry have seen a couple of these. With
changes to sm in 1.3.2, the failure expresses itself differently... not
until the first send (namely, first use of a remote FIFO). At least
that's my understanding. George added some sync to the code to make it
bulletproof, but that doesn't seem to have fixed the problem. Sigh.
Anyhow, I think you ran into a different problem, one that is known but
not yet understood.
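
To make the suspected failure mode concrete, here is a minimal sketch of
that class of start-up race. It is NOT the actual OMPI sm code, and the
real root cause is still unknown. All names (seg_header_t, seg_publish,
seg_fifo, SEG_NPROCS) are made up for illustration. The idea: one process
fills in FIFO offsets in the shared segment, and every other process must
not compute a FIFO pointer from those offsets until they have been
published. The sketch assumes the segment starts out zero-filled, as a
freshly created mmap'ed file would be.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

#define SEG_NPROCS 4  /* hypothetical number of local processes */

/* Hypothetical header at the start of the shared segment.
 * This is an illustration only, not the real sm BTL layout. */
typedef struct {
    atomic_int ready;                /* 0 until the offsets below are valid */
    size_t fifo_offset[SEG_NPROCS];  /* byte offset of each rank's FIFO from the segment base */
} seg_header_t;

/* Run by the initializing process: write all offsets first, then publish.
 * The release store orders the offset writes before the flag becomes visible. */
static void seg_publish(seg_header_t *h, size_t fifo_bytes)
{
    for (size_t r = 0; r < SEG_NPROCS; r++) {
        h->fifo_offset[r] = sizeof(seg_header_t) + r * fifo_bytes;
    }
    atomic_store_explicit(&h->ready, 1, memory_order_release);
}

/* Run by any process: returns the FIFO address for 'rank', or NULL if the
 * header has not been published yet. Without this check, a reader could
 * pick up a stale offset and end up with a FIFO pointer to the wrong spot. */
static void *seg_fifo(seg_header_t *h, int rank)
{
    assert(rank >= 0 && rank < SEG_NPROCS);
    if (atomic_load_explicit(&h->ready, memory_order_acquire) == 0) {
        return NULL;  /* too early: caller should retry or wait */
    }
    return (uint8_t *)h + h->fifo_offset[rank];
}
```

A caller would spin or back off while seg_fifo() returns NULL. Since the
sync that was already added has not made the problem go away, treat this
only as a picture of the symptom, not as a proposed fix.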