Ralph,

Things got better, in the sense that before I was getting about 1 deadlock
for every 300 runs; now it is more like 1 out of every 500.

  George.


On Tue, Jun 7, 2016 at 12:04 AM, George Bosilca <bosi...@icl.utk.edu> wrote:

> Ralph,
>
> Not there yet. I got similar deadlocks, but the stack now looks slightly
> different. Only one thread is doing something useful (i.e., it is in
> listen_thread_fn); every other thread has a stack similar to this:
>
>   * frame #0: 0x00007fff93306de6 libsystem_kernel.dylib`__psynch_mutexwait + 10
>     frame #1: 0x00007fff9a000e4a libsystem_pthread.dylib`_pthread_mutex_lock_wait + 89
>     frame #2: 0x00007fff99ffe5f5 libsystem_pthread.dylib`_pthread_mutex_lock_slow + 300
>     frame #3: 0x00007fff8c2a00f8 libdyld.dylib`dyldGlobalLockAcquire() + 16
>     frame #4: 0x00007fff6bbc3177 dyld`ImageLoaderMachOCompressed::doBindFastLazySymbol(unsigned int, ImageLoader::LinkContext const&, void (*)(), void (*)()) + 55
>     frame #5: 0x00007fff6bbad063 dyld`dyld::fastBindLazySymbol(ImageLoader**, unsigned long) + 90
>     frame #6: 0x00007fff8c2a0262 libdyld.dylib`dyld_stub_binder + 282
>     frame #7: 0x00000001019d39b0 libopen-pal.0.dylib`obj_order_type + 3776
>
>   George.
>
>
> On Mon, Jun 6, 2016 at 1:41 PM, Ralph Castain <r...@open-mpi.org> wrote:
>
>> I think I have this fixed here:
>> https://github.com/open-mpi/ompi/pull/1756
>>
>> George - can you please try it on your system?
>>
>>
>> On Jun 5, 2016, at 4:18 PM, Ralph Castain <r...@open-mpi.org> wrote:
>>
>> Yeah, I can reproduce on my box. What is happening is that we aren’t
>> properly protected during finalize: we tear down a component that is still
>> registered for a callback, and then the callback fires anyway. So we just
>> need to ensure that we finalize things in the right order.
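>>
>> In generic terms, the safe ordering looks like the sketch below. This is a
>> plain libevent illustration with an invented fake_component structure, not
>> the actual ORTE/OPAL finalize code; the point is only that the callback
>> registration must be removed before the state it references is freed.
>>
>> /* Minimal libevent sketch of the ordering issue (illustrative only). */
>> #include <event2/event.h>
>> #include <stdio.h>
>> #include <stdlib.h>
>> #include <sys/time.h>
>>
>> struct fake_component {           /* stand-in for a component's state */
>>     struct event *ev;             /* callback registered with the event base */
>>     int state;
>> };
>>
>> static void cb(evutil_socket_t fd, short what, void *arg)
>> {
>>     struct fake_component *c = arg;
>>     (void)fd; (void)what;
>>     /* If the component were freed before event_del(), 'c' would point at
>>      * freed memory by the time this ran. */
>>     printf("callback sees state %d\n", c->state);
>> }
>>
>> int main(void)
>> {
>>     struct event_base *base = event_base_new();
>>     struct fake_component *c = calloc(1, sizeof(*c));
>>     struct timeval tv = { 0, 1000 };
>>
>>     c->state = 42;
>>     c->ev = event_new(base, -1, EV_TIMEOUT, cb, c);
>>     event_add(c->ev, &tv);
>>
>>     /* Safe finalize order: remove the registration first, then free the
>>      * component it references, then tear down the event base itself. */
>>     event_del(c->ev);
>>     event_free(c->ev);
>>     free(c);
>>     event_base_free(base);
>>     return 0;
>> }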
>>
>> On Jun 5, 2016, at 3:51 PM, Jeff Squyres (jsquyres) <jsquy...@cisco.com>
>> wrote:
>>
>> Ok, good.
>>
>> FWIW: I tried all Friday afternoon to reproduce on my OS X laptop (i.e.,
>> I did something like George's shell script loop), and just now I ran
>> George's exact loop, but I haven't been able to reproduce.  In this case,
>> I'm falling on the wrong side of whatever race condition is happening...
>>
>>
>> On Jun 4, 2016, at 7:57 PM, Ralph Castain <r...@open-mpi.org> wrote:
>>
>> I may have an idea of what’s going on here - I just need to finish
>> something else first and then I’ll take a look.
>>
>>
>> On Jun 4, 2016, at 4:20 PM, George Bosilca <bosi...@icl.utk.edu> wrote:
>>
>>
>> On Jun 5, 2016, at 07:53 , Ralph Castain <r...@open-mpi.org> wrote:
>>
>>
>> On Jun 4, 2016, at 1:11 PM, George Bosilca <bosi...@icl.utk.edu> wrote:
>>
>>
>>
>> On Sat, Jun 4, 2016 at 11:05 PM, Ralph Castain <r...@open-mpi.org> wrote:
>> He can try adding “-mca state_base_verbose 5”, but if we are failing to
>> catch sigchld, I’m not sure what debugging info is going to help resolve
>> that problem. These aren’t even fast-running apps, so there was plenty of
>> time to register for the signal prior to termination.
>>
>> I vaguely recollect that we have occasionally seen this on Mac before and
>> it had something to do with oddness in sigchld handling…
>>
>> Assuming sigchld handling has some oddness on OS X, why is mpirun then
>> deadlocking instead of quitting, which would allow the OS to clean up all
>> the children?
>>
>>
>> I don’t think mpirun is actually “deadlocked” - I think it may just be
>> waiting for sigchld to tell it that the local processes have terminated.
>>
>> However, that wouldn't explain why you see what looks like libraries
>> being unloaded. That implies mpirun is actually finalizing, but failing to
>> fully exit - which would indeed be more of a deadlock.
>>
>> So the question is: are you truly seeing us missing sigchld (as was
>> suggested earlier in this thread),
>>
>>
>> In theory the processes remain in a zombie state until the parent calls
>> waitpid on them, at which point they are supposed to disappear. Based on
>> that, and since the processes are still in a zombie state, I assumed that
>> mpirun was not calling waitpid. One could also assume we are again being
>> hit by the fork race condition we had a while back, but since all the local
>> processes are zombies, that is hard to believe.
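>>
>> For reference, the reaping pattern at issue is the usual POSIX one shown in
>> the sketch below (a self-contained illustration, not ORTE's actual
>> child-handling code): a SIGCHLD handler that loops over waitpid() with
>> WNOHANG, so every exited child is collected and nothing is left in the Z
>> state.
>>
>> /* Generic POSIX sketch (illustrative only, not ORTE code): fork a few
>>  * children that exit immediately and reap them from a SIGCHLD handler so
>>  * none of them remain zombies. */
>> #include <errno.h>
>> #include <signal.h>
>> #include <string.h>
>> #include <sys/wait.h>
>> #include <unistd.h>
>>
>> static void sigchld_handler(int sig)
>> {
>>     int saved_errno = errno;
>>     (void)sig;
>>     /* Reap everything that has exited; WNOHANG keeps the handler from
>>      * blocking if some child is still running. */
>>     while (waitpid(-1, NULL, WNOHANG) > 0)
>>         ;
>>     errno = saved_errno;
>> }
>>
>> int main(void)
>> {
>>     struct sigaction sa;
>>     memset(&sa, 0, sizeof(sa));
>>     sa.sa_handler = sigchld_handler;
>>     sigemptyset(&sa.sa_mask);
>>     sa.sa_flags = SA_RESTART | SA_NOCLDSTOP;
>>     sigaction(SIGCHLD, &sa, NULL);   /* registered before any child can exit */
>>
>>     for (int i = 0; i < 4; i++) {
>>         if (fork() == 0) {
>>             _exit(0);                /* child exits right away */
>>         }
>>     }
>>     sleep(1);                        /* give the handler time to run */
>>     /* If the handler were missed, these children would show up as Z+ in ps,
>>      * exactly as described above. */
>>     return 0;
>> }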
>>
>> or did mpirun correctly see all the child processes terminate and is
>> actually hanging while trying to exit (as was also suggested earlier)?
>>
>>
>> One way or another, the stack of the main thread looks busted. While this
>> discussion was going on, I was able to replicate the bug with only ORTE
>> involved. Simply running
>>
>> for i in `seq 1 1 1000`; do echo "$i"; mpirun -np 4 hostname; done
>>
>> ‘deadlocks’ (or whatever name we want to give this) reliably before hitting
>> the 300th iteration. Unfortunately, adding the verbose option alters the
>> behavior enough that the issue no longer reproduces.
>>
>> George.
>>
>>
>> Adding the state verbosity should tell us which of those two is true,
>> assuming it doesn’t affect the timing so much that everything works :-/
>>
>>
>>
>> George.
>>
>>
>>
>>
>> On Jun 4, 2016, at 7:01 AM, Jeff Squyres (jsquyres) <jsquy...@cisco.com>
>> wrote:
>>
>> Meh.  Ok.  Should George run with some verbose level to get more info?
>>
>> On Jun 4, 2016, at 6:43 AM, Ralph Castain <r...@open-mpi.org> wrote:
>>
>> Neither of those threads has anything to do with catching the sigchld -
>> threads 4-5 are listening for OOB and PMIx connection requests. It looks
>> more like mpirun thought it had picked everything up and has begun shutting
>> down, but I can’t really tell for certain.
>>
>> On Jun 4, 2016, at 6:29 AM, Jeff Squyres (jsquyres) <jsquy...@cisco.com>
>> wrote:
>>
>> On Jun 3, 2016, at 11:07 PM, George Bosilca <bosi...@icl.utk.edu> wrote:
>>
>>
>> After finalize. As I said in my original email, I see all the output the
>> application is generating, and all processes (which are local, as this
>> happens on my laptop) are in zombie mode (Z+). This basically means that
>> whoever was supposed to get the SIGCHLD didn't do its job of cleaning them
>> up.
>>
>>
>> Ah -- so perhaps threads 1,2,3 are red herrings: the real problem here is
>> that the parent didn't catch the child exits (which presumably should have
>> been caught in threads 4 or 5).
>>
>> Ralph: is there any state from threads 4 or 5 that would be helpful to
>> examine to see if they somehow missed catching children exits?
>>