On Fri, Jun 3, 2016 at 11:10 PM, Jeff Squyres (jsquyres) <jsquy...@cisco.com> wrote:

> That's disappointing / puzzling.
>
> Threads 4 and 5 look like they're in the PMIX / ORTE progress threads,
> respectively.
>
> But I don't see any obvious signs of what threads 1, 2, 3 are for.  Huh.
>
> When is this hang happening -- during init?  Middle of the program?
> During finalize?

After finalize. As I said in my original email, I see all the output the
application is generating, and all processes (which are local, as this
happens on my laptop) are in zombie mode (Z+). This basically means that
whoever was supposed to get the SIGCHLD didn't do its job of cleaning
them up.

  George.
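To make the "nobody reaped them" part concrete: whichever process forked
the children (mpirun itself, in this local run) has to catch SIGCHLD and
waitpid() each exited child, otherwise the kernel keeps them in the Z
state that ps shows as Z+. A minimal, purely illustrative sketch of that
pattern -- not ORTE's actual code:

    /* Illustrative only.  If the launcher never gets around to the
     * waitpid() below, its exited children linger as zombies (Z+). */
    #include <signal.h>
    #include <string.h>
    #include <sys/wait.h>
    #include <unistd.h>

    static void reap_children(int sig)
    {
        (void)sig;
        /* WNOHANG: collect every already-exited child without blocking. */
        while (waitpid(-1, NULL, WNOHANG) > 0) {
            /* a real launcher would drive its "proc terminated"
             * state machine here */
        }
    }

    int main(void)
    {
        struct sigaction sa;
        memset(&sa, 0, sizeof(sa));
        sa.sa_handler = reap_children;
        sigemptyset(&sa.sa_mask);
        sa.sa_flags = SA_RESTART;
        sigaction(SIGCHLD, &sa, NULL);

        if (0 == fork()) {       /* child: exit immediately */
            _exit(0);
        }
        sleep(1);                /* parent: handler reaps the child here */
        return 0;
    }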
> > On Jun 2, 2016, at 6:00 PM, George Bosilca <bosi...@icl.utk.edu> wrote:
> >
> > Sure, but they mostly look similar.
> >
> >   George.
> >
> >
> > (lldb) thread list
> > Process 76811 stopped
> >   thread #1: tid = 0x272b40e, 0x00007fff93306de6 libsystem_kernel.dylib`__psynch_mutexwait + 10, queue = 'com.apple.main-thread', stop reason = signal SIGSTOP
> >   thread #2: tid = 0x272b40f, 0x00007fff93306de6 libsystem_kernel.dylib`__psynch_mutexwait + 10
> >   thread #3: tid = 0x272b410, 0x00007fff93306de6 libsystem_kernel.dylib`__psynch_mutexwait + 10
> >   thread #4: tid = 0x272b411, 0x00007fff9330707a libsystem_kernel.dylib`__select + 10
> > * thread #5: tid = 0x272b412, 0x00007fff9330707a libsystem_kernel.dylib`__select + 10
> > (lldb)
> >
> >
> > (lldb) thread select 1
> > * thread #1: tid = 0x272b40e, 0x00007fff93306de6 libsystem_kernel.dylib`__psynch_mutexwait + 10, queue = 'com.apple.main-thread', stop reason = signal SIGSTOP
> >     frame #0: 0x00007fff93306de6 libsystem_kernel.dylib`__psynch_mutexwait + 10
> > libsystem_kernel.dylib`__psynch_mutexwait:
> > ->  0x7fff93306de6 <+10>: jae    0x7fff93306df0   ; <+20>
> >     0x7fff93306de8 <+12>: movq   %rax, %rdi
> >     0x7fff93306deb <+15>: jmp    0x7fff933017cd   ; cerror_nocancel
> >     0x7fff93306df0 <+20>: retq
> > (lldb) bt
> > * thread #1: tid = 0x272b40e, 0x00007fff93306de6 libsystem_kernel.dylib`__psynch_mutexwait + 10, queue = 'com.apple.main-thread', stop reason = signal SIGSTOP
> >   * frame #0: 0x00007fff93306de6 libsystem_kernel.dylib`__psynch_mutexwait + 10
> >     frame #1: 0x00007fff9a000e4a libsystem_pthread.dylib`_pthread_mutex_lock_wait + 89
> >     frame #2: 0x00007fff99ffe5f5 libsystem_pthread.dylib`_pthread_mutex_lock_slow + 300
> >     frame #3: 0x00007fff8c2a00f8 libdyld.dylib`dyldGlobalLockAcquire() + 16
> >     frame #4: 0x00007fff6ca8e177 dyld`ImageLoaderMachOCompressed::doBindFastLazySymbol(unsigned int, ImageLoader::LinkContext const&, void (*)(), void (*)()) + 55
> >     frame #5: 0x00007fff6ca78063 dyld`dyld::fastBindLazySymbol(ImageLoader**, unsigned long) + 90
> >     frame #6: 0x00007fff8c2a0262 libdyld.dylib`dyld_stub_binder + 282
> >     frame #7: 0x000000010a5b29b0 libopen-pal.0.dylib`obj_order_type + 3776
> >
> >
> > (lldb) thread select 2
> > * thread #2: tid = 0x272b40f, 0x00007fff93306de6 libsystem_kernel.dylib`__psynch_mutexwait + 10
> >     frame #0: 0x00007fff93306de6 libsystem_kernel.dylib`__psynch_mutexwait + 10
> > libsystem_kernel.dylib`__psynch_mutexwait:
> > ->  0x7fff93306de6 <+10>: jae    0x7fff93306df0   ; <+20>
> >     0x7fff93306de8 <+12>: movq   %rax, %rdi
> >     0x7fff93306deb <+15>: jmp    0x7fff933017cd   ; cerror_nocancel
> >     0x7fff93306df0 <+20>: retq
> > (lldb) bt
> > * thread #2: tid = 0x272b40f, 0x00007fff93306de6 libsystem_kernel.dylib`__psynch_mutexwait + 10
> >   * frame #0: 0x00007fff93306de6 libsystem_kernel.dylib`__psynch_mutexwait + 10
> >     frame #1: 0x00007fff9a000e4a libsystem_pthread.dylib`_pthread_mutex_lock_wait + 89
> >     frame #2: 0x00007fff99ffe5f5 libsystem_pthread.dylib`_pthread_mutex_lock_slow + 300
> >     frame #3: 0x00007fff8c2a00f8 libdyld.dylib`dyldGlobalLockAcquire() + 16
> >     frame #4: 0x00007fff6ca8e177 dyld`ImageLoaderMachOCompressed::doBindFastLazySymbol(unsigned int, ImageLoader::LinkContext const&, void (*)(), void (*)()) + 55
> >     frame #5: 0x00007fff6ca78063 dyld`dyld::fastBindLazySymbol(ImageLoader**, unsigned long) + 90
> >     frame #6: 0x00007fff8c2a0262 libdyld.dylib`dyld_stub_binder + 282
> >     frame #7: 0x000000010a5b29b0 libopen-pal.0.dylib`obj_order_type + 3776
> >
> >
> > (lldb) thread select 3
> > * thread #3: tid = 0x272b410, 0x00007fff93306de6 libsystem_kernel.dylib`__psynch_mutexwait + 10
> >     frame #0: 0x00007fff93306de6 libsystem_kernel.dylib`__psynch_mutexwait + 10
> > libsystem_kernel.dylib`__psynch_mutexwait:
> > ->  0x7fff93306de6 <+10>: jae    0x7fff93306df0   ; <+20>
> >     0x7fff93306de8 <+12>: movq   %rax, %rdi
> >     0x7fff93306deb <+15>: jmp    0x7fff933017cd   ; cerror_nocancel
> >     0x7fff93306df0 <+20>: retq
> > (lldb) bt
> > * thread #3: tid = 0x272b410, 0x00007fff93306de6 libsystem_kernel.dylib`__psynch_mutexwait + 10
> >   * frame #0: 0x00007fff93306de6 libsystem_kernel.dylib`__psynch_mutexwait + 10
> >     frame #1: 0x00007fff9a000e4a libsystem_pthread.dylib`_pthread_mutex_lock_wait + 89
> >     frame #2: 0x00007fff99ffe5f5 libsystem_pthread.dylib`_pthread_mutex_lock_slow + 300
> >     frame #3: 0x00007fff8c2a00f8 libdyld.dylib`dyldGlobalLockAcquire() + 16
> >     frame #4: 0x00007fff6ca8e177 dyld`ImageLoaderMachOCompressed::doBindFastLazySymbol(unsigned int, ImageLoader::LinkContext const&, void (*)(), void (*)()) + 55
> >     frame #5: 0x00007fff6ca78063 dyld`dyld::fastBindLazySymbol(ImageLoader**, unsigned long) + 90
> >     frame #6: 0x00007fff8c2a0262 libdyld.dylib`dyld_stub_binder + 282
> >     frame #7: 0x000000010a5b29b0 libopen-pal.0.dylib`obj_order_type + 3776
> >
> >
> > (lldb) thread select 4
> > * thread #4: tid = 0x272b411, 0x00007fff9330707a libsystem_kernel.dylib`__select + 10
> >     frame #0: 0x00007fff9330707a libsystem_kernel.dylib`__select + 10
> > libsystem_kernel.dylib`__select:
> > ->  0x7fff9330707a <+10>: jae    0x7fff93307084   ; <+20>
> >     0x7fff9330707c <+12>: movq   %rax, %rdi
> >     0x7fff9330707f <+15>: jmp    0x7fff933017f2   ; cerror
> >     0x7fff93307084 <+20>: retq
> > (lldb) bt
> > * thread #4: tid = 0x272b411, 0x00007fff9330707a libsystem_kernel.dylib`__select + 10
> >   * frame #0: 0x00007fff9330707a libsystem_kernel.dylib`__select + 10
> >     frame #1: 0x000000010a9b1273 mca_pmix_pmix114.so`listen_thread(obj=0x0000000000000000) + 371 at pmix_server_listener.c:226
> >     frame #2: 0x00007fff9a00099d libsystem_pthread.dylib`_pthread_body + 131
> >     frame #3: 0x00007fff9a00091a libsystem_pthread.dylib`_pthread_start + 168
> >     frame #4: 0x00007fff99ffe351 libsystem_pthread.dylib`thread_start + 13
> >
> >
> > (lldb) thread select 5
> > * thread #5: tid = 0x272b412, 0x00007fff9330707a libsystem_kernel.dylib`__select + 10
> >     frame #0: 0x00007fff9330707a libsystem_kernel.dylib`__select + 10
> > libsystem_kernel.dylib`__select:
> > ->  0x7fff9330707a <+10>: jae    0x7fff93307084   ; <+20>
> >     0x7fff9330707c <+12>: movq   %rax, %rdi
> >     0x7fff9330707f <+15>: jmp    0x7fff933017f2   ; cerror
> >     0x7fff93307084 <+20>: retq
> > (lldb) bt
> > * thread #5: tid = 0x272b412, 0x00007fff9330707a libsystem_kernel.dylib`__select + 10
> >   * frame #0: 0x00007fff9330707a libsystem_kernel.dylib`__select + 10
> >     frame #1: 0x000000010a3c13cc libopen-rte.0.dylib`listen_thread_fn(obj=0x000000010a46e8c0) + 428 at listener.c:261
> >     frame #2: 0x00007fff9a00099d libsystem_pthread.dylib`_pthread_body + 131
> >     frame #3: 0x00007fff9a00091a libsystem_pthread.dylib`_pthread_start + 168
> >     frame #4: 0x00007fff99ffe351 libsystem_pthread.dylib`thread_start + 13
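A note on threads 4 and 5 above: their frames sit in listen_thread
(pmix_server_listener.c:226) and listen_thread_fn (listener.c:261), i.e.
the PMIx and ORTE listener threads, both parked in select() waiting for
incoming connections -- presumably their normal idle state. (lldb's
"thread backtrace all" dumps all of these in one shot, by the way.) For
readers who have not looked at that code, such a listener thread is
roughly shaped like the following sketch; it is illustrative only, not
the actual pmix/orte source:

    /* Illustrative select()-based listener thread, similar in spirit to
     * the threads shown in the backtraces above. */
    #include <arpa/inet.h>
    #include <netinet/in.h>
    #include <pthread.h>
    #include <string.h>
    #include <sys/select.h>
    #include <sys/socket.h>
    #include <sys/time.h>
    #include <unistd.h>

    static volatile int listener_active = 1;

    static void *listen_thread(void *arg)
    {
        int listen_fd = *(int *)arg;        /* bound + listening socket */

        while (listener_active) {
            fd_set readfds;
            struct timeval tv = { 2, 0 };   /* wake up periodically */

            FD_ZERO(&readfds);
            FD_SET(listen_fd, &readfds);

            /* This select() is where threads #4 / #5 are parked. */
            if (select(listen_fd + 1, &readfds, NULL, NULL, &tv) > 0) {
                int sd = accept(listen_fd, NULL, NULL);
                if (sd >= 0) {
                    /* hand the connection off to the progress engine ... */
                    close(sd);
                }
            }
        }
        return NULL;
    }

    int main(void)
    {
        struct sockaddr_in addr;
        int fd = socket(AF_INET, SOCK_STREAM, 0);
        pthread_t tid;

        memset(&addr, 0, sizeof(addr));
        addr.sin_family = AF_INET;
        addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
        addr.sin_port = 0;                  /* any free port */
        bind(fd, (struct sockaddr *)&addr, sizeof(addr));
        listen(fd, 8);

        pthread_create(&tid, NULL, listen_thread, &fd);
        sleep(1);
        listener_active = 0;                /* select() times out, loop exits */
        pthread_join(tid, NULL);
        close(fd);
        return 0;
    }

The 2-second timeout in the sketch is only there so the thread can notice
a shutdown flag; the real code hands accepted connections to the event
loop instead of closing them.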
> > On Fri, Jun 3, 2016 at 9:50 AM, Jeff Squyres (jsquyres) <jsquy...@cisco.com> wrote:
> > George --
> >
> > You might want to get bt's from *all* the threads...?
> >
> >
> > > On Jun 2, 2016, at 5:31 PM, George Bosilca <bosi...@icl.utk.edu> wrote:
> > >
> > > The timeout never triggers and when I attach to the mpirun process I
> > > see an extremely strange stack:
> > >
> > > (lldb) bt
> > > * thread #1: tid = 0x272b40e, 0x00007fff93306de6 libsystem_kernel.dylib`__psynch_mutexwait + 10, queue = 'com.apple.main-thread', stop reason = signal SIGSTOP
> > >   * frame #0: 0x00007fff93306de6 libsystem_kernel.dylib`__psynch_mutexwait + 10
> > >     frame #1: 0x00007fff9a000e4a libsystem_pthread.dylib`_pthread_mutex_lock_wait + 89
> > >     frame #2: 0x00007fff99ffe5f5 libsystem_pthread.dylib`_pthread_mutex_lock_slow + 300
> > >     frame #3: 0x00007fff8c2a00f8 libdyld.dylib`dyldGlobalLockAcquire() + 16
> > >     frame #4: 0x00007fff6ca8e177 dyld`ImageLoaderMachOCompressed::doBindFastLazySymbol(unsigned int, ImageLoader::LinkContext const&, void (*)(), void (*)()) + 55
> > >     frame #5: 0x00007fff6ca78063 dyld`dyld::fastBindLazySymbol(ImageLoader**, unsigned long) + 90
> > >     frame #6: 0x00007fff8c2a0262 libdyld.dylib`dyld_stub_binder + 282
> > >     frame #7: 0x000000010a5b29b0 libopen-pal.0.dylib`obj_order_type + 3776
> > >
> > > This seems to indicate that we are trying to access a function from a
> > > dylib that has been or is in the process of being unloaded.
> > >
> > >   George.
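For context on that last remark: the stack just above (and threads 1-3 in
the fuller dump earlier) shows the main thread blocked acquiring dyld's
global lock from inside lazy symbol binding (dyld_stub_binder ->
fastBindLazySymbol -> dyldGlobalLockAcquire), which is what you would
expect if some other thread holds that lock at the same time, e.g.
somewhere inside a dlclose()/image-teardown path. The plain (non-lazy)
form of the hazard George describes -- calling into a dylib after it has
been unloaded -- looks like this tiny example; the library and symbol
names below are made up for illustration, not anything in Open MPI:

    /* Illustration only: "libexample.dylib" and "example_fn" are
     * hypothetical names. */
    #include <dlfcn.h>
    #include <stdio.h>

    int main(void)
    {
        void *handle = dlopen("libexample.dylib", RTLD_LAZY);
        if (NULL == handle) {
            fprintf(stderr, "dlopen: %s\n", dlerror());
            return 1;
        }

        /* Keep a pointer to a symbol, then unload the library behind it. */
        void (*fn)(void) = (void (*)(void))dlsym(handle, "example_fn");
        dlclose(handle);

        /* Undefined behavior from here on: the code backing fn may have
         * been unmapped, so the call can crash, hang, or appear to work. */
        if (NULL != fn) {
            fn();
        }
        return 0;
    }

The mpirun stack is the lazy-binding flavor of the same problem: a first
call through a not-yet-bound stub enters dyld_stub_binder and then blocks
on dyld's global lock, presumably held by whichever thread is loading or
unloading images at that moment.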
> > > On Thu, Jun 2, 2016 at 8:34 AM, Nathan Hjelm <hje...@me.com> wrote:
> > >
> > > The osc hang is fixed by a PR to fix bugs in start in cm and ob1.
> > > See #1729.
> > >
> > > -Nathan
> > >
> > > On Jun 2, 2016, at 5:17 AM, Gilles Gouaillardet <gilles.gouaillar...@gmail.com> wrote:
> > >
> > >> fwiw,
> > >>
> > >> the onesided/c_fence_lock test from the ibm test suite hangs
> > >>
> > >> (mpirun -np 2 ./c_fence_lock)
> > >>
> > >> I ran a git bisect and it incriminates commit b90c83840f472de3219b87cd7e1a364eec5c5a29
> > >>
> > >> commit b90c83840f472de3219b87cd7e1a364eec5c5a29
> > >> Author: bosilca <bosi...@users.noreply.github.com>
> > >> Date:   Tue May 24 18:20:51 2016 -0500
> > >>
> > >>     Refactor the request completion (#1422)
> > >>
> > >>     * Remodel the request.
> > >>       Added the wait sync primitive and integrate it into the PML and MTL
> > >>       infrastructure. The multi-threaded requests are now significantly
> > >>       less heavy and less noisy (only the threads associated with completed
> > >>       requests are signaled).
> > >>
> > >>     * Fix the condition to release the request.
> > >>
> > >> I also noted a warning is emitted when running with only one task
> > >>
> > >> ./c_fence_lock
> > >>
> > >> but I did not git bisect, so that might not be related
> > >>
> > >> Cheers,
> > >>
> > >> Gilles
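For anyone who has not read #1422: the "wait sync primitive" in that
commit message is, roughly, a small per-waiter object (counter plus
condition variable) attached to the requests a thread is waiting on, so
completing a request wakes only the thread that actually cares about it
instead of broadcasting to every waiter. A simplified, self-contained
model of that idea -- not Open MPI's actual implementation -- might look
like:

    /* Simplified sketch of a per-waiter "wait sync" object. */
    #include <pthread.h>
    #include <stdio.h>
    #include <unistd.h>

    typedef struct wait_sync {
        pthread_mutex_t lock;
        pthread_cond_t  cond;
        int             count;   /* requests this waiter still needs */
    } wait_sync_t;

    void wait_sync_init(wait_sync_t *sync, int count)
    {
        pthread_mutex_init(&sync->lock, NULL);
        pthread_cond_init(&sync->cond, NULL);
        sync->count = count;
    }

    /* Waiting side (think MPI_Waitall): block until all attached
     * requests have completed. */
    void wait_sync_wait(wait_sync_t *sync)
    {
        pthread_mutex_lock(&sync->lock);
        while (sync->count > 0) {
            pthread_cond_wait(&sync->cond, &sync->lock);
        }
        pthread_mutex_unlock(&sync->lock);
    }

    /* Completion side (think the progress engine finishing one request):
     * only the waiter attached to that request is signaled. */
    void wait_sync_update(wait_sync_t *sync)
    {
        pthread_mutex_lock(&sync->lock);
        if (0 == --sync->count) {
            pthread_cond_signal(&sync->cond);
        }
        pthread_mutex_unlock(&sync->lock);
    }

    static void *progress_thread(void *arg)
    {
        wait_sync_t *sync = (wait_sync_t *)arg;
        sleep(1);
        wait_sync_update(sync);   /* "request 1 completed" */
        wait_sync_update(sync);   /* "request 2 completed" -> waiter wakes */
        return NULL;
    }

    int main(void)
    {
        wait_sync_t sync;
        pthread_t tid;

        wait_sync_init(&sync, 2);
        pthread_create(&tid, NULL, progress_thread, &sync);
        wait_sync_wait(&sync);
        printf("both requests completed\n");
        pthread_join(tid, NULL);
        return 0;
    }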
> > >> On Thursday, June 2, 2016, Ralph Castain <r...@open-mpi.org> wrote:
> > >>
> > >> Yes, please!  I'd like to know what mpirun thinks is happening - if
> > >> you like, just set the --timeout N --report-state-on-timeout flags
> > >> and tell me what comes out
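Concretely, that would be an invocation along the lines of

    mpirun --timeout 300 --report-state-on-timeout -np 2 ./c_fence_lock

where 300 seconds is just an example value and ./c_fence_lock stands in
for whichever reproducer is being run.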
> > >>> On Jun 1, 2016, at 7:57 PM, George Bosilca <bosi...@icl.utk.edu> wrote:
> > >>>
> > >>> I don't think it matters. I was running the IBM collective and pt2pt
> > >>> tests, but each time it deadlocked it was in a different test. If you
> > >>> are interested in some particular values, I would be happy to attach
> > >>> a debugger next time it happens.
> > >>>
> > >>>   George.
> > >>>
> > >>> On Wed, Jun 1, 2016 at 10:47 PM, Ralph Castain <r...@open-mpi.org> wrote:
> > >>>
> > >>> What kind of apps are they? Or does it matter what you are running?
> > >>>
> > >>> > On Jun 1, 2016, at 7:37 PM, George Bosilca <bosi...@icl.utk.edu> wrote:
> > >>> >
> > >>> > I have a seldom-occurring deadlock on an OS X laptop if I use more
> > >>> > than 2 processes. It is coming up once every 200 runs or so.
> > >>> >
> > >>> > Here is what I could gather from my experiments: all the MPI
> > >>> > processes seem to have correctly completed (I get all the expected
> > >>> > output and the MPI processes are in a waiting state), but somehow
> > >>> > mpirun does not detect their completion. As a result, mpirun never
> > >>> > returns.
> > >>> >
> > >>> >   George.