I may have an idea of what’s going on here - I just need to finish something 
else first and then I’ll take a look.


> On Jun 4, 2016, at 4:20 PM, George Bosilca <bosi...@icl.utk.edu> wrote:
> 
>> 
>> On Jun 5, 2016, at 07:53, Ralph Castain <r...@open-mpi.org> wrote:
>> 
>>> 
>>> On Jun 4, 2016, at 1:11 PM, George Bosilca <bosi...@icl.utk.edu> wrote:
>>> 
>>> 
>>> 
>>> On Sat, Jun 4, 2016 at 11:05 PM, Ralph Castain <r...@open-mpi.org> wrote:
>>> He can try adding "-mca state_base_verbose 5", but if we are failing to 
>>> catch SIGCHLD, I’m not sure what debugging info is going to help resolve 
>>> that problem. These aren’t even fast-running apps, so there was plenty of 
>>> time to register for the signal prior to termination.
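>>> 
>>> For reference, a minimal sketch of that kind of registration (plain POSIX 
>>> sigaction/waitpid, not Open MPI’s actual code). It reaps children from the 
>>> handler and blocks SIGCHLD across fork() so a fast-exiting child cannot 
>>> deliver the signal before the parent is waiting for it:
>>> 
>>>   #include <signal.h>
>>>   #include <sys/wait.h>
>>>   #include <unistd.h>
>>> 
>>>   /* Reap every exited child; WNOHANG also covers coalesced signals. */
>>>   static void on_sigchld(int sig)
>>>   {
>>>       (void)sig;
>>>       while (waitpid(-1, NULL, WNOHANG) > 0)
>>>           ;
>>>   }
>>> 
>>>   int main(void)
>>>   {
>>>       struct sigaction sa;
>>>       sigset_t mask, orig;
>>> 
>>>       sa.sa_handler = on_sigchld;
>>>       sigemptyset(&sa.sa_mask);
>>>       sa.sa_flags = SA_RESTART;      /* restart interrupted syscalls */
>>>       sigaction(SIGCHLD, &sa, NULL);
>>> 
>>>       /* Block SIGCHLD until the parent is ready to wait for it. */
>>>       sigemptyset(&mask);
>>>       sigaddset(&mask, SIGCHLD);
>>>       sigprocmask(SIG_BLOCK, &mask, &orig);
>>> 
>>>       if (fork() == 0)
>>>           _exit(0);                  /* child terminates immediately */
>>> 
>>>       sigsuspend(&orig);             /* atomically unblock and wait  */
>>>       return 0;
>>>   }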
>>> 
>>> I vaguely recollect that we have occasionally seen this on Mac before, and 
>>> it had something to do with oddness in SIGCHLD handling…
>>> 
>>> Assuming SIGCHLD has some oddness on OS X: why, then, is mpirun deadlocking 
>>> instead of quitting, which would allow the OS to clean up all the children?
>> 
>> I don’t think mpirun is actually “deadlocked” - I think it may just be 
>> waiting for SIGCHLD to tell it that the local processes have terminated.
>> 
>> However, that wouldn't explain why you see what looks like libraries being 
>> unloaded. That implies mpirun is actually finalizing, but failing to fully 
>> exit - which would indeed be more of a deadlock.
>> 
>> So the question is: are you truly seeing us missing SIGCHLD (as was 
>> suggested earlier in this thread),
> 
> In theory the processes remain in the zombie state until the parent calls 
> waitpid on them, at which point they are supposed to disappear. Based on 
> this, since the processes are still in the zombie state, I assumed that 
> mpirun was not calling waitpid. One could also suspect we are again being 
> hit by the fork race condition we had a while back, but as all the local 
> processes are zombies, that is hard to believe.
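> 
> For reference, a minimal sketch of those semantics (plain POSIX, not Open 
> MPI code): while the parent sleeps, "ps -o pid,stat,command" shows the 
> child in the Z state, and the waitpid call is what makes it disappear.
> 
>   #include <stdio.h>
>   #include <sys/wait.h>
>   #include <unistd.h>
> 
>   int main(void)
>   {
>       pid_t pid = fork();
>       if (pid == 0)
>           _exit(0);                 /* child terminates immediately  */
> 
>       /* The child is now a zombie (Z) until the parent reaps it.   */
>       sleep(10);                    /* observe it with ps meanwhile  */
> 
>       if (waitpid(pid, NULL, 0) == pid)   /* reaping removes the zombie */
>           printf("reaped %d\n", (int)pid);
>       return 0;
>   }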
> 
>> or did mpirun correctly see all the child processes terminate, and is it 
>> actually hanging while trying to exit (as was also suggested earlier)?
> 
> One way or another, the stack of the main thread looks busted. While the 
> discussion about this was going on, I was able to replicate the bug with 
> only ORTE involved. Simply running 
> 
> for i in `seq 1 1 1000`; do echo "$i"; mpirun -np 4 hostname; done
> 
> ‘deadlocks’ (or whatever we want to call it) reliably before hitting the 
> 300th iteration. Unfortunately, adding the verbose option alters the 
> behavior enough that the issue does not reproduce.
> 
>   George.
> 
>> 
>> Adding the state verbosity should tell us which of those two is true, 
>> assuming it doesn’t affect the timing so much that everything works :-/
>> 
>> 
>>> 
>>>   George.
>>> 
>>> > On Jun 4, 2016, at 7:01 AM, Jeff Squyres (jsquyres) <jsquy...@cisco.com> wrote:
>>> >
>>> > Meh.  Ok.  Should George run with some verbose level to get more info?
>>> >
>>> >> On Jun 4, 2016, at 6:43 AM, Ralph Castain <r...@open-mpi.org> wrote:
>>> >>
>>> >> Neither of those threads has anything to do with catching the SIGCHLD - 
>>> >> threads 4-5 are listening for OOB and PMIx connection requests. It looks 
>>> >> more like mpirun thought it had picked everything up and has begun 
>>> >> shutting down, but I can’t really tell for certain.
>>> >>
>>> >>> On Jun 4, 2016, at 6:29 AM, Jeff Squyres (jsquyres) <jsquy...@cisco.com> wrote:
>>> >>>
>>> >>> On Jun 3, 2016, at 11:07 PM, George Bosilca <bosi...@icl.utk.edu> wrote:
>>> >>>>
>>> >>>> After finalize. As I said in my original email, I see all the output the 
>>> >>>> application is generating, and all the processes (which are local, as 
>>> >>>> this happens on my laptop) are in the zombie state (Z+). This basically 
>>> >>>> means whoever was supposed to get the SIGCHLD didn’t do its job of 
>>> >>>> cleaning them up.
>>> >>>
>>> >>> Ah -- so perhaps threads 1,2,3 are red herrings: the real problem here 
>>> >>> is that the parent didn't catch the child exits (which presumably 
>>> >>> should have been caught in threads 4 or 5).
>>> >>>
>>> >>> Ralph: is there any state from threads 4 or 5 that would be helpful to 
>>> >>> examine to see if they somehow missed catching children exits?
