On Sat, Jun 4, 2016 at 11:05 PM, Ralph Castain <r...@open-mpi.org> wrote:

> He can try adding "-mca state_base_verbose 5", but if we are failing to
> catch sigchld, I’m not sure what debugging info is going to help resolve
> that problem. These aren’t even fast-running apps, so there was plenty of
> time to register for the signal prior to termination.
>
> I vaguely recollect that we have occasionally seen this on Mac before and
> it had something to do with oddness in sigchld handling…
>

Assuming sigchld has some oddness on OSX, why is mpirun then deadlocking
instead of quitting, which would allow the OS to clean up all the children?
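
For reference, a minimal sketch of the usual POSIX reaping pattern under
discussion. This is not Open MPI's code: the names on_sigchld,
reap_exited_children and spawn_and_reap are hypothetical, and the 100 ms
polling fallback is only an assumption about how a parent could avoid
hanging if a SIGCHLD is never delivered. Because several child exits can
collapse into one SIGCHLD, the reaper loops on waitpid() with WNOHANG
rather than reaping once per signal.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L
#endif

#include <signal.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <time.h>
#include <unistd.h>

/* Set by the handler; the actual reaping happens outside signal context. */
static volatile sig_atomic_t got_sigchld = 0;

static void on_sigchld(int signo)
{
    (void)signo;
    got_sigchld = 1;
}

/* Reap every child that has already exited; returns how many were reaped.
 * The loop ends when waitpid() returns 0 (children still running) or -1
 * (no children left, or another error). */
static int reap_exited_children(void)
{
    int reaped = 0;
    int status;

    while (waitpid(-1, &status, WNOHANG) > 0) {
        reaped++;
    }
    return reaped;
}

/* Hypothetical driver: fork n children that exit immediately, then reap
 * them all. Returns the number reaped, or -1 if setup fails. */
static int spawn_and_reap(int n)
{
    struct sigaction sa;
    int total = 0;
    int idle_ticks = 0;

    memset(&sa, 0, sizeof(sa));
    sa.sa_handler = on_sigchld;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = SA_RESTART;
    if (sigaction(SIGCHLD, &sa, NULL) != 0) {
        return -1;
    }

    for (int i = 0; i < n; i++) {
        pid_t pid = fork();
        if (pid < 0) {
            return -1;
        }
        if (pid == 0) {
            _exit(0); /* child: exit right away */
        }
    }

    while (total < n) {
        /* Reap when signalled, or after ~100 ms without a signal
         * (assumed fallback so a missed SIGCHLD cannot hang the parent). */
        if (got_sigchld || idle_ticks >= 100) {
            got_sigchld = 0;
            idle_ticks = 0;
            total += reap_exited_children();
        } else {
            struct timespec ts = {0, 1000000}; /* 1 ms */
            nanosleep(&ts, NULL);
            idle_ticks++;
        }
    }
    return total;
}
```

Calling spawn_and_reap(3) from a test program should leave no Z+ entries
behind, whether or not the signal handler ever fires.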

  George.



>
>
> > On Jun 4, 2016, at 7:01 AM, Jeff Squyres (jsquyres) <jsquy...@cisco.com>
> wrote:
> >
> > Meh.  Ok.  Should George run with some verbose level to get more info?
> >
> >> On Jun 4, 2016, at 6:43 AM, Ralph Castain <r...@open-mpi.org> wrote:
> >>
> >> Neither of those threads has anything to do with catching the sigchld
> - threads 4-5 are listening for OOB and PMIx connection requests. It looks
> more like mpirun thought it had picked everything up and has begun shutting
> down, but I can’t really tell for certain.
> >>
> >>> On Jun 4, 2016, at 6:29 AM, Jeff Squyres (jsquyres) <
> jsquy...@cisco.com> wrote:
> >>>
> >>> On Jun 3, 2016, at 11:07 PM, George Bosilca <bosi...@icl.utk.edu>
> wrote:
> >>>>
> >>>> After finalize. As I said in my original email, I see all the output
> the application is generating, and all processes (which are local as this
> happens on my laptop) are in zombie mode (Z+). This basically means whoever
> was supposed to get the SIGCHLD didn't do its job of cleaning them up.
> >>>
> >>> Ah -- so perhaps threads 1,2,3 are red herrings: the real problem here
> is that the parent didn't catch the child exits (which presumably should
> have been caught in threads 4 or 5).
> >>>
> >>> Ralph: is there any state from threads 4 or 5 that would be helpful to
> examine to see if they somehow missed catching children exits?
> >>>
> >>> --
> >>> Jeff Squyres
> >>> jsquy...@cisco.com
> >>> For corporate legal information go to:
> http://www.cisco.com/web/about/doing_business/legal/cri/
> >>>
> >>> _______________________________________________
> >>> devel mailing list
> >>> de...@open-mpi.org
> >>> Subscription: https://www.open-mpi.org/mailman/listinfo.cgi/devel
> >>> Link to this post:
> http://www.open-mpi.org/community/lists/devel/2016/06/19070.php
> >>
> >
> >
>
