Huh - okay, it must be a difference in our race conditions. I can run it for more than 1k cycles without hitting it. I’ll poke at it some more later.


> On Jun 6, 2016, at 8:06 AM, George Bosilca <bosi...@icl.utk.edu> wrote:
> 
> Ralph,
> 
> Things got better, in the sense that before I was getting about 1 deadlock for every 300 runs; now it is more like 1 out of every 500.
> 
>   George.
> 
> 
> On Tue, Jun 7, 2016 at 12:04 AM, George Bosilca <bosi...@icl.utk.edu> wrote:
> Ralph,
> 
> Not there yet. I got similar deadlocks, but the stack now looks slightly different. I only have a single thread doing something useful (i.e., sitting in listen_thread_fn); every other thread has a similar stack:
> 
>   * frame #0: 0x00007fff93306de6 libsystem_kernel.dylib`__psynch_mutexwait + 10
>     frame #1: 0x00007fff9a000e4a libsystem_pthread.dylib`_pthread_mutex_lock_wait + 89
>     frame #2: 0x00007fff99ffe5f5 libsystem_pthread.dylib`_pthread_mutex_lock_slow + 300
>     frame #3: 0x00007fff8c2a00f8 libdyld.dylib`dyldGlobalLockAcquire() + 16
>     frame #4: 0x00007fff6bbc3177 dyld`ImageLoaderMachOCompressed::doBindFastLazySymbol(unsigned int, ImageLoader::LinkContext const&, void (*)(), void (*)()) + 55
>     frame #5: 0x00007fff6bbad063 dyld`dyld::fastBindLazySymbol(ImageLoader**, unsigned long) + 90
>     frame #6: 0x00007fff8c2a0262 libdyld.dylib`dyld_stub_binder + 282
>     frame #7: 0x00000001019d39b0 libopen-pal.0.dylib`obj_order_type + 3776
> 
>   George.
> 
> 
> On Mon, Jun 6, 2016 at 1:41 PM, Ralph Castain <r...@open-mpi.org> wrote:
> I think I have this fixed here: https://github.com/open-mpi/ompi/pull/1756
> 
> George - can you please try it on your system?
> 
> 
>> On Jun 5, 2016, at 4:18 PM, Ralph Castain <r...@open-mpi.org> wrote:
>> 
>> Yeah, I can reproduce on my box. What is happening is that we aren’t properly protected during finalize: we tear down some component that is registered for a callback, and then the callback fires. So we just need to ensure that we finalize in the right order.
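>> 
>> To illustrate, here is a minimal sketch of that ordering hazard (all the names here are hypothetical, not the actual ORTE/OPAL API):
>> 
>> /* finalize_order.c -- sketch of the tear-down ordering hazard.
>>  * Hypothetical names; not actual Open MPI code. */
>> #include <stdio.h>
>> #include <stdlib.h>
>> 
>> typedef void (*cb_fn)(void *ctx);
>> 
>> static cb_fn  registered_cb;   /* callback the event engine would fire */
>> static void  *registered_ctx;
>> 
>> static void component_cb(void *ctx) {
>>     /* dereferences component state: a use-after-free if the
>>      * component was already torn down */
>>     printf("callback sees state = %d\n", *(int *)ctx);
>> }
>> 
>> int main(void) {
>>     int *state = malloc(sizeof(int));   /* component state */
>>     *state = 42;
>>     registered_cb  = component_cb;      /* component registers a callback */
>>     registered_ctx = state;
>> 
>>     /* WRONG order: free(state) here, while still registered -- a late
>>      * event would then fire component_cb on freed memory. */
>> 
>>     /* RIGHT order: deregister first, then tear the component down. */
>>     registered_cb = NULL;
>>     free(state);
>> 
>>     if (registered_cb != NULL)          /* a late event is now harmless */
>>         registered_cb(registered_ctx);
>>     return 0;
>> }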
>> 
>>> On Jun 5, 2016, at 3:51 PM, Jeff Squyres (jsquyres) <jsquy...@cisco.com> wrote:
>>> 
>>> Ok, good.
>>> 
>>> FWIW: I tried all Friday afternoon to reproduce on my OS X laptop (i.e., I 
>>> did something like George's shell script loop), and just now I ran George's 
>>> exact loop, but I haven't been able to reproduce.  In this case, I'm 
>>> falling on the wrong side of whatever race condition is happening...
>>> 
>>> 
>>>> On Jun 4, 2016, at 7:57 PM, Ralph Castain <r...@open-mpi.org> wrote:
>>>> 
>>>> I may have an idea of what’s going on here - I just need to finish 
>>>> something else first and then I’ll take a look.
>>>> 
>>>> 
>>>>> On Jun 4, 2016, at 4:20 PM, George Bosilca <bosi...@icl.utk.edu> wrote:
>>>>> 
>>>>>> 
>>>>>> On Jun 5, 2016, at 07:53, Ralph Castain <r...@open-mpi.org> wrote:
>>>>>> 
>>>>>>> 
>>>>>>> On Jun 4, 2016, at 1:11 PM, George Bosilca <bosi...@icl.utk.edu> wrote:
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> On Sat, Jun 4, 2016 at 11:05 PM, Ralph Castain <r...@open-mpi.org> wrote:
>>>>>>> He can try adding "-mca state_base_verbose 5", but if we are failing to catch sigchld, I’m not sure what debugging info is going to help resolve that problem. These aren’t even fast-running apps, so there was plenty of time to register for the signal prior to termination.
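>>>>>>> 
>>>>>>> For reference, with the reproducer from this thread that would look something like:
>>>>>>> 
>>>>>>>   mpirun -mca state_base_verbose 5 -np 4 hostname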
>>>>>>> 
>>>>>>> I vaguely recollect that we have occasionally seen this on Mac before 
>>>>>>> and it had something to do with oddness in sigchld handling…
>>>>>>> 
>>>>>>> Assuming sigchld has some oddness on OSX: why is mpirun then deadlocking instead of quitting, which would allow the OS to clean up all the children?
>>>>>> 
>>>>>> I don’t think mpirun is actually “deadlocked” - I think it may just be 
>>>>>> waiting for sigchld to tell it that the local processes have terminated.
>>>>>> 
>>>>>> However, that wouldn't explain why you see what looks like libraries 
>>>>>> being unloaded. That implies mpirun is actually finalizing, but failing 
>>>>>> to fully exit - which would indeed be more of a deadlock.
>>>>>> 
>>>>>> So the question is: are you truly seeing us missing sigchld (as was 
>>>>>> suggested earlier in this thread),
>>>>> 
>>>>> In theory the processes remain in zombie state until the parent calls waitpid on them, at which point they are supposed to disappear. Based on this, and since the processes are still in zombie state, I assumed that mpirun was not calling waitpid. One could also suspect that we are again being hit by the fork race condition we had a while back, but as all local processes are in zombie mode, that is hard to believe.
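>>>>> 
>>>>> As a plain-POSIX sketch of the reaping I have in mind (generic code, not the actual ORTE child handling): a SIGCHLD handler that drains every exited child with waitpid, so none linger in Z state:
>>>>> 
>>>>> /* reap_children.c -- generic POSIX sketch, not ORTE code. */
>>>>> #include <signal.h>
>>>>> #include <sys/wait.h>
>>>>> #include <unistd.h>
>>>>> 
>>>>> static void sigchld_handler(int sig) {
>>>>>     (void)sig;
>>>>>     int status;
>>>>>     /* One SIGCHLD may coalesce several child exits, so loop with
>>>>>      * WNOHANG until there is nothing left to reap. */
>>>>>     while (waitpid(-1, &status, WNOHANG) > 0)
>>>>>         ;   /* each reaped child leaves the process table (no Z state) */
>>>>> }
>>>>> 
>>>>> int main(void) {
>>>>>     struct sigaction sa = {0};
>>>>>     sa.sa_handler = sigchld_handler;
>>>>>     sigemptyset(&sa.sa_mask);
>>>>>     sa.sa_flags = SA_RESTART;
>>>>>     sigaction(SIGCHLD, &sa, NULL);
>>>>> 
>>>>>     for (int i = 0; i < 4; i++)     /* a few short-lived children */
>>>>>         if (fork() == 0) _exit(0);
>>>>>     sleep(1);   /* handler reaps them; without it they would sit in Z state */
>>>>>     return 0;
>>>>> }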
>>>>> 
>>>>>> or did mpirun correctly see all the child processes terminate and is 
>>>>>> actually hanging while trying to exit (as was also suggested earlier)?
>>>>> 
>>>>> One way or another, the stack of the main thread looks busted. While the discussion about this was going on I was able to replicate the bug with only ORTE involved. Simply running
>>>>> 
>>>>> for i in `seq 1 1 1000`; do echo "$i"; mpirun -np 4 hostname; done
>>>>> 
>>>>> ‘deadlocks’ (or whatever name we want to call this) reliably before hitting the 300th iteration. Unfortunately, adding the verbose option alters the behavior enough that the issue does not reproduce.
>>>>> 
>>>>> George.
>>>>> 
>>>>>> 
>>>>>> Adding the state verbosity should tell us which of those two is true, 
>>>>>> assuming it doesn’t affect the timing so much that everything works :-/
>>>>>> 
>>>>>> 
>>>>>>> 
>>>>>>> George.
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>>> On Jun 4, 2016, at 7:01 AM, Jeff Squyres (jsquyres) <jsquy...@cisco.com> wrote:
>>>>>>>> 
>>>>>>>> Meh.  Ok.  Should George run with some verbose level to get more info?
>>>>>>>> 
>>>>>>>>> On Jun 4, 2016, at 6:43 AM, Ralph Castain <r...@open-mpi.org> wrote:
>>>>>>>>> 
>>>>>>>>> Neither of those threads has anything to do with catching the sigchld - threads 4-5 are listening for OOB and PMIx connection requests. It looks more like mpirun thought it had picked everything up and has begun shutting down, but I can’t really tell for certain.
>>>>>>>>> 
>>>>>>>>>> On Jun 4, 2016, at 6:29 AM, Jeff Squyres (jsquyres) <jsquy...@cisco.com> wrote:
>>>>>>>>>> 
>>>>>>>>>> On Jun 3, 2016, at 11:07 PM, George Bosilca <bosi...@icl.utk.edu> wrote:
>>>>>>>>>>> 
>>>>>>>>>>> After finalize. As I said in my original email, I see all the output the application is generating, and all processes (which are local, as this happens on my laptop) are in zombie mode (Z+). This basically means whoever was supposed to get the SIGCHLD didn't do its job of cleaning them up.
>>>>>>>>>> 
>>>>>>>>>> Ah -- so perhaps threads 1,2,3 are red herrings: the real problem 
>>>>>>>>>> here is that the parent didn't catch the child exits (which 
>>>>>>>>>> presumably should have been caught in threads 4 or 5).
>>>>>>>>>> 
>>>>>>>>>> Ralph: is there any state from threads 4 or 5 that would be helpful 
>>>>>>>>>> to examine to see if they somehow missed catching children exits?
>>>>>>>>>> 