Folks, here is the description of a hang I briefly mentioned a few days ago.

With the trunk (I did not check 1.8 ...), simply run on one node:

    mpirun -np 2 --mca btl sm,self ./abort

(The abort test is taken from the ibm test suite: process 0 calls MPI_Abort while process 1 enters an infinite loop.)

There is a race condition: sometimes it hangs, sometimes it aborts nicely as expected. When the hang occurs, both abort processes have exited and mpirun waits forever.

I did some investigation and now have a better idea of what happens (but I am still clueless about how to fix it).

When process 0 aborts, it:
- closes the tcp socket connected to mpirun
- closes the pipe connected to mpirun
- sends SIGCHLD to mpirun

Then on mpirun: when SIGCHLD is received, the handler basically writes 17 (the signal number) to a socketpair. libevent then returns from a poll, and here is the race condition, basically:
- if revents is non zero for all three fds (socket, pipe and socketpair), then the program aborts nicely
- if revents is non zero for both the socket and the pipe but is zero for the socketpair, then mpirun hangs

I dug a bit deeper and found that when the event on the socketpair is processed, it ends up calling odls_base_default_wait_local_proc:
- if proc->state is 5 (aka ORTE_PROC_STATE_REGISTERED), then the program aborts nicely
- *but* if proc->state is 6 (aka ORTE_PROC_STATE_IOF_COMPLETE), then the program hangs

Another way to put this: when the program aborts nicely, the call sequence is

    odls_base_default_wait_local_proc
    proc_errors(vpid=0)
    proc_errors(vpid=0)
    proc_errors(vpid=1)
    proc_errors(vpid=1)

When the program hangs, the call sequence is

    proc_errors(vpid=0)
    odls_base_default_wait_local_proc
    proc_errors(vpid=0)
    proc_errors(vpid=1)
    proc_errors(vpid=1)

I will resume this on Monday unless someone can fix it in the meantime :-)

Cheers,

Gilles
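
For readers unfamiliar with the mechanism described above (a signal handler forwarding the signal number through a socketpair that is polled together with a pipe), here is a minimal stand-alone sketch of it. It is not the ORTE or libevent code, and it is written in Python rather than C for brevity; the function name observe_child_exit and all variable names are hypothetical. It only illustrates that the pipe close and the SIGCHLD notification are independent events, so a given poll wakeup may report both descriptors or only one of them.

```python
import os
import select
import signal
import socket


def observe_child_exit(timeout_ms=5000):
    """Fork a child that exits at once and record, wakeup by wakeup, which
    descriptors poll() reports: the closed pipe, the signal socketpair, or both."""
    sig_r, sig_w = socket.socketpair()
    sig_w.setblocking(False)          # a handler must never block
    pipe_r, pipe_w = os.pipe()

    def on_sigchld(signum, frame):
        sig_w.send(bytes([signum]))   # forward the signal number as one byte

    previous = signal.signal(signal.SIGCHLD, on_sigchld)
    try:
        pid = os.fork()
        if pid == 0:
            os._exit(0)               # child: exiting closes its copy of pipe_w
        os.close(pipe_w)              # parent keeps only the read end

        poller = select.poll()
        poller.register(pipe_r, select.POLLIN)
        poller.register(sig_r.fileno(), select.POLLIN)
        names = {pipe_r: "pipe", sig_r.fileno(): "sigchld"}

        wakeups, signums = [], []
        while names:
            ready = poller.poll(timeout_ms)
            if not ready:
                break                 # timed out: give up rather than hang
            batch = []
            for fd, _revents in ready:
                if fd == sig_r.fileno():
                    signums.extend(sig_r.recv(16))
                poller.unregister(fd)
                batch.append(names.pop(fd))
            wakeups.append(sorted(batch))
        os.waitpid(pid, 0)            # reap the child
        return wakeups, signums
    finally:
        signal.signal(signal.SIGCHLD, previous)
        os.close(pipe_r)
        sig_r.close()
        sig_w.close()
```

Calling observe_child_exit() returns the list of wakeups and the signal numbers read from the socketpair. Depending on timing, the pipe and the signal can show up in the same wakeup or in two separate ones, which is the kind of ordering difference that separates the two call sequences above.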