Sylvain Jeaugey wrote:

Hi Ralph,

I managed to get a deadlock after a whole night, but not the same one you have: after a quick analysis, process 0 seems to be blocked in the very first send through shared memory. It may still be a bug, but not the same as yours IMO.

Yes, that's the one Terry and I have tried to hunt down. It's kind of elusive. Apparently, there is a race condition in sm start-up. It *appears* as though a process (the lowest rank on a node?) computes offsets into shared memory using bad values and ends up with a FIFO pointer to the wrong spot. Up through 1.3.1, this meant that OMPI would fail in add_procs()... Jeff and Terry have seen a couple of these.

With the changes to sm in 1.3.2, the failure shows up differently: not until the first send (namely, the first use of a remote FIFO). At least that's my understanding. George added some synchronization to the code to make it bulletproof, but that doesn't seem to have fixed the problem. Sigh.
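To make the suspected failure mode concrete, here is a minimal sketch of that kind of start-up race and the usual guard against it. This is not the actual sm code: seg_header_t, publish_layout() and try_resolve_fifo() are hypothetical names, and the layout (one FIFO per rank at a fixed stride) is an assumption for illustration only. The point is that a process which reads the offset fields before the creator has published them would compute a FIFO pointer to the wrong spot, so the reader checks a flag with acquire semantics first.

```c
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical segment header placed at the start of the shared region.
 * This is NOT Open MPI's actual layout; it only illustrates the race. */
typedef struct {
    atomic_int ready;        /* 0 until the layout fields below are valid */
    size_t     fifo_offset;  /* byte offset of rank 0's FIFO from the base */
    size_t     fifo_stride;  /* bytes between consecutive per-rank FIFOs */
} seg_header_t;

/* Run by the single process that creates the segment. */
void publish_layout(seg_header_t *hdr, size_t offset, size_t stride)
{
    hdr->fifo_offset = offset;
    hdr->fifo_stride = stride;
    /* Release store: the two writes above become visible to other
     * processes before they can observe ready == 1. */
    atomic_store_explicit(&hdr->ready, 1, memory_order_release);
}

/* Run by every process that attaches to the segment.
 * Returns NULL while the layout has not been published yet, instead of
 * computing a pointer from uninitialized offsets. */
void *try_resolve_fifo(seg_header_t *hdr, int rank)
{
    if (atomic_load_explicit(&hdr->ready, memory_order_acquire) == 0) {
        return NULL;
    }
    return (unsigned char *)hdr + hdr->fifo_offset
           + (size_t)rank * hdr->fifo_stride;
}
```

A caller would retry try_resolve_fifo() until it returns non-NULL. Without the flag check, the same pointer arithmetic would silently succeed on whatever values happened to be in the header at the time.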

Anyhow, I think you ran into a different problem, one that is known but not yet understood.
