> On Jun 4, 2016, at 1:11 PM, George Bosilca <bosi...@icl.utk.edu> wrote:
> 
> 
> 
> On Sat, Jun 4, 2016 at 11:05 PM, Ralph Castain <r...@open-mpi.org> wrote:
> He can try adding "-mca state_base_verbose 5", but if we are failing to catch 
> sigchld, I’m not sure what debugging info is going to help resolve that 
> problem. These aren’t even fast-running apps, so there was plenty of time to 
> register for the signal prior to termination.
> 
> I vaguely recollect that we have occasionally seen this on Mac before and it 
> had something to do with oddness in sigchld handling…
> 
> Assuming sigchld has some oddness on OS X, why is mpirun then deadlocking 
> instead of quitting, which would allow the OS to clean up all the children?

I don’t think mpirun is actually “deadlocked” - I think it may just be waiting 
for sigchld to tell it that the local processes have terminated.

However, that wouldn't explain why you see what looks like libraries being 
unloaded. That implies mpirun is actually finalizing, but failing to fully exit 
- which would indeed be more of a deadlock.

So the question is: are you truly seeing us missing sigchld (as was suggested 
earlier in this thread), or did mpirun correctly see all the child processes 
terminate and is actually hanging while trying to exit (as was also suggested 
earlier)?

Adding the state verbosity should tell us which of those two is true, assuming 
it doesn’t affect the timing so much that everything works :-/


> 
>   George.
> 
>  
> 
> 
> > On Jun 4, 2016, at 7:01 AM, Jeff Squyres (jsquyres) <jsquy...@cisco.com> wrote:
> >
> > Meh.  Ok.  Should George run with some verbose level to get more info?
> >
> >> On Jun 4, 2016, at 6:43 AM, Ralph Castain <r...@open-mpi.org> wrote:
> >>
> >> Neither of those threads has anything to do with catching the sigchld - 
> >> threads 4-5 are listening for OOB and PMIx connection requests. It looks 
> >> more like mpirun thought it had picked everything up and has begun 
> >> shutting down, but I can’t really tell for certain.
> >>
> >>> On Jun 4, 2016, at 6:29 AM, Jeff Squyres (jsquyres) <jsquy...@cisco.com> wrote:
> >>>
> >>> On Jun 3, 2016, at 11:07 PM, George Bosilca <bosi...@icl.utk.edu> wrote:
> >>>>
> >>>> After finalize. As I said in my original email, I see all the output the 
> >>>> application is generating, and all processes (which are local, as this 
> >>>> happens on my laptop) are in zombie mode (Z+). This basically means 
> >>>> whoever was supposed to get the SIGCHLD didn't do its job of cleaning 
> >>>> them up.
> >>>
> >>> Ah -- so perhaps threads 1,2,3 are red herrings: the real problem here is 
> >>> that the parent didn't catch the child exits (which presumably should 
> >>> have been caught in threads 4 or 5).
> >>>
> >>> Ralph: is there any state from threads 4 or 5 that would be helpful to 
> >>> examine to see if they somehow missed catching children exits?
> >>>
> >>> --
> >>> Jeff Squyres
> >>> jsquy...@cisco.com <mailto:jsquy...@cisco.com>
> >>> For corporate legal information go to: 
> >>> http://www.cisco.com/web/about/doing_business/legal/cri/ 
> >>> <http://www.cisco.com/web/about/doing_business/legal/cri/>
> >>>
> >>> _______________________________________________
> >>> devel mailing list
> >>> de...@open-mpi.org <mailto:de...@open-mpi.org>
> >>> Subscription: https://www.open-mpi.org/mailman/listinfo.cgi/devel 
> >>> <https://www.open-mpi.org/mailman/listinfo.cgi/devel>
> >>> Link to this post: 
> >>> http://www.open-mpi.org/community/lists/devel/2016/06/19070.php 
> >>> <http://www.open-mpi.org/community/lists/devel/2016/06/19070.php>
> >>
> >
> >
