I may have an idea of what’s going on here - I just need to finish something else first and then I’ll take a look.

> On Jun 4, 2016, at 4:20 PM, George Bosilca <bosi...@icl.utk.edu> wrote:
>
>> On Jun 5, 2016, at 07:53 , Ralph Castain <r...@open-mpi.org> wrote:
>>
>>> On Jun 4, 2016, at 1:11 PM, George Bosilca <bosi...@icl.utk.edu> wrote:
>>>
>>> On Sat, Jun 4, 2016 at 11:05 PM, Ralph Castain <r...@open-mpi.org> wrote:
>>> He can try adding "-mca state_base_verbose 5", but if we are failing to
>>> catch sigchld, I’m not sure what debugging info is going to help resolve
>>> that problem. These aren’t even fast-running apps, so there was plenty of
>>> time to register for the signal prior to termination.
>>>
>>> I vaguely recollect that we have occasionally seen this on Mac before and
>>> it had something to do with oddness in sigchld handling…
>>>
>>> Assuming sigchld has some oddness on OSX, why then is mpirun deadlocking
>>> instead of quitting, which would allow the OS to clean up all children?
>>
>> I don’t think mpirun is actually “deadlocked” - I think it may just be
>> waiting for sigchld to tell it that the local processes have terminated.
>>
>> However, that wouldn't explain why you see what looks like libraries being
>> unloaded. That implies mpirun is actually finalizing, but failing to fully
>> exit - which would indeed be more of a deadlock.
>>
>> So the question is: are you truly seeing us missing sigchld (as was
>> suggested earlier in this thread),
>
> In theory the processes remain in zombie state until the parent calls
> waitpid on them, at which moment they are supposed to disappear. Based on
> this, as the processes are still in zombie state, I assumed that mpirun was
> not calling waitpid. One could also assume we are again hit by the fork race
> condition we had a while back, but as all local processes are in zombie mode,
> this is hardly believable.
>
>> or did mpirun correctly see all the child processes terminate and is
>> actually hanging while trying to exit (as was also suggested earlier)?
>
> One way or another the stack of the main thread looks busted. While the
> discussion about this was going on I was able to replicate the bug with only
> ORTE involved. Simply running
>
>   for i in `seq 1 1 1000`; do echo "$i"; mpirun -np 4 hostname; done
>
> ‘deadlocks’ (or whatever name we want to call this) reliably before hitting
> the 300th iteration. Unfortunately, adding the verbose option alters the
> behavior enough that the issue does not reproduce.
>
> George.
>
>> Adding the state verbosity should tell us which of those two is true,
>> assuming it doesn’t affect the timing so much that everything works :-/
>>
>>> George.
>>>
>>> > On Jun 4, 2016, at 7:01 AM, Jeff Squyres (jsquyres) <jsquy...@cisco.com> wrote:
>>> >
>>> > Meh. Ok. Should George run with some verbose level to get more info?
>>> >
>>> >> On Jun 4, 2016, at 6:43 AM, Ralph Castain <r...@open-mpi.org> wrote:
>>> >>
>>> >> Neither of those threads have anything to do with catching the sigchld -
>>> >> threads 4-5 are listening for OOB and PMIx connection requests. It looks
>>> >> more like mpirun thought it had picked everything up and has begun
>>> >> shutting down, but I can’t really tell for certain.
>>> >>
>>> >>> On Jun 4, 2016, at 6:29 AM, Jeff Squyres (jsquyres) <jsquy...@cisco.com> wrote:
>>> >>>
>>> >>> On Jun 3, 2016, at 11:07 PM, George Bosilca <bosi...@icl.utk.edu> wrote:
>>> >>>>
>>> >>>> After finalize. As I said in my original email I see all the output the
>>> >>>> application is generating, and all processes (which are local as this
>>> >>>> happens on my laptop) are in zombie mode (Z+). This basically means
>>> >>>> whoever was supposed to get the SIGCHLD didn't do its job of
>>> >>>> cleaning them up.
>>> >>>
>>> >>> Ah -- so perhaps threads 1,2,3 are red herrings: the real problem here
>>> >>> is that the parent didn't catch the child exits (which presumably
>>> >>> should have been caught in threads 4 or 5).
>>> >>>
>>> >>> Ralph: is there any state from threads 4 or 5 that would be helpful to
>>> >>> examine to see if they somehow missed catching children exits?
>>> >>>
>>> >>> --
>>> >>> Jeff Squyres
>>> >>> jsquy...@cisco.com
>>> >>> For corporate legal information go to:
>>> >>> http://www.cisco.com/web/about/doing_business/legal/cri/
>>> >>>
>>> >>> _______________________________________________
>>> >>> devel mailing list
>>> >>> de...@open-mpi.org
>>> >>> Subscription: https://www.open-mpi.org/mailman/listinfo.cgi/devel
>>> >>> Link to this post:
>>> >>> http://www.open-mpi.org/community/lists/devel/2016/06/19070.php
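
For anyone following the zombie-state argument in the quoted thread: a child that has exited stays a zombie until its parent collects it with waitpid, and because several child exits can be delivered as a single SIGCHLD, a common pattern is to loop on a non-blocking waitpid until nothing is left to collect. The sketch below illustrates that generic POSIX pattern in Python. It is not ORTE's code and says nothing about where the actual bug is; the helper names (spawn_children, reap_all, reap_until) are made up for illustration.

```python
import os
import time


def spawn_children(n):
    """Fork n children that exit immediately; return the list of their pids."""
    pids = []
    for i in range(n):
        pid = os.fork()
        if pid == 0:
            os._exit(i)  # child: exit at once with a distinct status
        pids.append(pid)
    return pids


def reap_all():
    """Collect every already-terminated child without blocking.

    Returns {pid: exit_code}. This is the loop a SIGCHLD handler would
    typically run, since one signal may stand for several exited children.
    """
    reaped = {}
    while True:
        try:
            pid, status = os.waitpid(-1, os.WNOHANG)
        except ChildProcessError:
            break  # this process has no children left
        if pid == 0:
            break  # children exist, but none has exited yet
        reaped[pid] = os.WEXITSTATUS(status)
    return reaped


def reap_until(pids, timeout=5.0):
    """Poll reap_all() until every pid in pids is collected or timeout expires."""
    pending = set(pids)
    collected = {}
    deadline = time.monotonic() + timeout
    while pending and time.monotonic() < deadline:
        done = reap_all()
        collected.update(done)
        pending -= set(done)
        if pending:
            time.sleep(0.01)
    return collected


if __name__ == "__main__":
    children = spawn_children(4)
    # Until reap_until() runs, exited children sit in the process table as zombies.
    print(reap_until(children))
```

If the parent only called waitpid once per signal instead of looping, any exits folded into the same SIGCHLD would be left behind as zombies, which is one generic way to end up with the Z+ entries George describes.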