On Sat, Jun 4, 2016 at 11:05 PM, Ralph Castain <r...@open-mpi.org> wrote:
> He can try adding "-mca state_base_verbose 5", but if we are failing to
> catch sigchld, I'm not sure what debugging info is going to help resolve
> that problem. These aren't even fast-running apps, so there was plenty of
> time to register for the signal prior to termination.
>
> I vaguely recollect that we have occasionally seen this on Mac before and
> it had something to do with oddness in sigchld handling…
>

Assuming sigchld has some oddness on OS X, why is mpirun then deadlocking
instead of quitting, which would allow the OS to clean up all the children?

  George.

>
> > On Jun 4, 2016, at 7:01 AM, Jeff Squyres (jsquyres) <jsquy...@cisco.com> wrote:
> >
> > Meh. Ok. Should George run with some verbose level to get more info?
> >
> >> On Jun 4, 2016, at 6:43 AM, Ralph Castain <r...@open-mpi.org> wrote:
> >>
> >> Neither of those threads has anything to do with catching the sigchld -
> >> threads 4-5 are listening for OOB and PMIx connection requests. It looks
> >> more like mpirun thought it had picked everything up and has begun
> >> shutting down, but I can't really tell for certain.
> >>
> >>> On Jun 4, 2016, at 6:29 AM, Jeff Squyres (jsquyres) <jsquy...@cisco.com> wrote:
> >>>
> >>> On Jun 3, 2016, at 11:07 PM, George Bosilca <bosi...@icl.utk.edu> wrote:
> >>>>
> >>>> After finalize. As I said in my original email, I see all the output
> >>>> the application is generating, and all processes (which are local, as
> >>>> this happens on my laptop) are in zombie mode (Z+). This basically
> >>>> means whoever was supposed to get the SIGCHLD didn't do its job of
> >>>> cleaning them up.
> >>>
> >>> Ah -- so perhaps threads 1,2,3 are red herrings: the real problem here
> >>> is that the parent didn't catch the child exits (which presumably should
> >>> have been caught in threads 4 or 5).
> >>>
> >>> Ralph: is there any state from threads 4 or 5 that would be helpful to
> >>> examine to see if they somehow missed catching children exits?
> >>>
> >>> --
> >>> Jeff Squyres
> >>> jsquy...@cisco.com
> >>> For corporate legal information go to: http://www.cisco.com/web/about/doing_business/legal/cri/
> >>>
> >>> _______________________________________________
> >>> devel mailing list
> >>> de...@open-mpi.org
> >>> Subscription: https://www.open-mpi.org/mailman/listinfo.cgi/devel
> >>> Link to this post: http://www.open-mpi.org/community/lists/devel/2016/06/19070.php