Folks,

here is the description of a hang i briefly mentioned a few days ago.

with the trunk (i did not check 1.8 ...) simply run on one node :
mpirun -np 2 --mca btl sm,self ./abort

(the abort test is taken from the ibm test suite : process 0 calls
MPI_Abort while process 1 enters an infinite loop)

there is a race condition : sometimes it hangs, sometimes it aborts
nicely as expected.
when the hang occurs, both abort processes have exited and mpirun waits
forever

i investigated a bit and now have a better idea of what happens
(but i am still clueless on how to fix this)

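when process 0 aborts, the following happens :
- the tcp socket connected to mpirun is closed
- the pipe connected to mpirun is closed
- SIGCHLD is delivered to mpirun (by the kernel, since mpirun is the
parent of the exiting process)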
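
to illustrate what the parent sees when a child exits, here is a minimal
sketch in plain POSIX C. it is not Open MPI code : the function name
observe_child_exit is made up for this example, and a plain pipe stands in
for the pipe between mpirun and the application process. the parent gets
end-of-file on the pipe and can then reap the exit status with waitpid.

```c
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper, not part of Open MPI.
 * Returns 1 if the parent saw EOF on the pipe and reaped exit status 1,
 * 0 if it saw something else, and -1 on a setup error. */
int observe_child_exit(void)
{
    int fds[2];
    int status;
    char buf;
    pid_t pid;
    ssize_t n;

    if (pipe(fds) < 0)
        return -1;

    pid = fork();
    if (pid < 0)
        return -1;

    if (pid == 0) {
        /* child: exit at once; the kernel closes its copy of fds[1] */
        close(fds[0]);
        _exit(1);
    }

    /* parent: keep only the read end, so that the child's exit closes
     * the last writer and read() returns 0 (end-of-file) */
    close(fds[1]);
    n = read(fds[0], &buf, 1);
    close(fds[0]);

    /* the exit status stays available until the parent reaps it */
    if (waitpid(pid, &status, 0) != pid)
        return -1;

    return n == 0 && WIFEXITED(status) && WEXITSTATUS(status) == 1;
}
```

the point is that the pipe closure and the child-status notification are
separate events, so nothing forces the parent to handle them in a fixed
order.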

then on mpirun :
when SIGCHLD is received, the handler basically writes 17 (the signal
number) to a socketpair.
then libevent will return from a poll and here is the race condition,
basically :
if revents is non zero for the three fds (socket, pipe and socketpair)
then the program will abort nicely
if revents is non zero for both socket and pipe but is zero for the
socketpair, then mpirun will hang

i dug a bit deeper and found that when the event on the socketpair is
processed, it will end up calling
odls_base_default_wait_local_proc.
if proc->state is 5 (aka ORTE_PROC_STATE_REGISTERED), then the program
will abort nicely
*but* if proc->state is 6 (aka ORTE_PROC_STATE_IOF_COMPLETE), then the
program will hang

another way to put this is that
when the program aborts nicely, the call sequence is
odls_base_default_wait_local_proc
proc_errors(vpid=0)
proc_errors(vpid=0)
proc_errors(vpid=1)
proc_errors(vpid=1)

when the program hangs, the call sequence is
proc_errors(vpid=0)
odls_base_default_wait_local_proc
proc_errors(vpid=0)
proc_errors(vpid=1)
proc_errors(vpid=1)

i will resume this on Monday unless someone can fix this in the
meantime :-)

Cheers,

Gilles
