Huh - okay, must be a difference in our race conditions. I can run it for more than 1k cycles without hitting it. I’ll poke at it some more later.
> On Jun 6, 2016, at 8:06 AM, George Bosilca <bosi...@icl.utk.edu> wrote:
>
> Ralph,
>
> Things got better, in the sense that before I was getting about 1 deadlock
> for every 300 runs; now it is more like 1 out of every 500.
>
> George.
>
>
> On Tue, Jun 7, 2016 at 12:04 AM, George Bosilca <bosi...@icl.utk.edu> wrote:
> Ralph,
>
> Not there yet. I got similar deadlocks, but the stack now looks slightly
> different. I have only a single thread doing something useful (i.e., sitting in
> listen_thread_fn); every other thread has a similar stack:
>
> * frame #0: 0x00007fff93306de6 libsystem_kernel.dylib`__psynch_mutexwait + 10
>   frame #1: 0x00007fff9a000e4a libsystem_pthread.dylib`_pthread_mutex_lock_wait + 89
>   frame #2: 0x00007fff99ffe5f5 libsystem_pthread.dylib`_pthread_mutex_lock_slow + 300
>   frame #3: 0x00007fff8c2a00f8 libdyld.dylib`dyldGlobalLockAcquire() + 16
>   frame #4: 0x00007fff6bbc3177 dyld`ImageLoaderMachOCompressed::doBindFastLazySymbol(unsigned int, ImageLoader::LinkContext const&, void (*)(), void (*)()) + 55
>   frame #5: 0x00007fff6bbad063 dyld`dyld::fastBindLazySymbol(ImageLoader**, unsigned long) + 90
>   frame #6: 0x00007fff8c2a0262 libdyld.dylib`dyld_stub_binder + 282
>   frame #7: 0x00000001019d39b0 libopen-pal.0.dylib`obj_order_type + 3776
>
> George.
>
>
> On Mon, Jun 6, 2016 at 1:41 PM, Ralph Castain <r...@open-mpi.org> wrote:
> I think I have this fixed here: https://github.com/open-mpi/ompi/pull/1756
>
> George - can you please try it on your system?
>
>
>> On Jun 5, 2016, at 4:18 PM, Ralph Castain <r...@open-mpi.org> wrote:
>>
>> Yeah, I can reproduce on my box. What is happening is that we aren’t
>> properly protected during finalize, and so we tear down some component that
>> is registered for a callback, and then the callback occurs.
>> So we just need to ensure that we finalize in the right order.
>>
>>> On Jun 5, 2016, at 3:51 PM, Jeff Squyres (jsquyres) <jsquy...@cisco.com> wrote:
>>>
>>> Ok, good.
>>>
>>> FWIW: I tried all Friday afternoon to reproduce on my OS X laptop (i.e., I
>>> did something like George's shell script loop), and just now I ran George's
>>> exact loop, but I haven't been able to reproduce. In this case, I'm
>>> falling on the wrong side of whatever race condition is happening...
>>>
>>>
>>>> On Jun 4, 2016, at 7:57 PM, Ralph Castain <r...@open-mpi.org> wrote:
>>>>
>>>> I may have an idea of what’s going on here - I just need to finish
>>>> something else first and then I’ll take a look.
>>>>
>>>>
>>>>> On Jun 4, 2016, at 4:20 PM, George Bosilca <bosi...@icl.utk.edu> wrote:
>>>>>
>>>>>> On Jun 5, 2016, at 07:53, Ralph Castain <r...@open-mpi.org> wrote:
>>>>>>
>>>>>>> On Jun 4, 2016, at 1:11 PM, George Bosilca <bosi...@icl.utk.edu> wrote:
>>>>>>>
>>>>>>> On Sat, Jun 4, 2016 at 11:05 PM, Ralph Castain <r...@open-mpi.org> wrote:
>>>>>>> He can try adding "-mca state_base_verbose 5", but if we are failing to
>>>>>>> catch SIGCHLD, I’m not sure what debugging info is going to help
>>>>>>> resolve that problem. These aren’t even fast-running apps, so there was
>>>>>>> plenty of time to register for the signal prior to termination.
>>>>>>>
>>>>>>> I vaguely recollect that we have occasionally seen this on Mac before,
>>>>>>> and it had something to do with oddness in SIGCHLD handling…
>>>>>>>
>>>>>>> Assuming SIGCHLD has some oddness on OS X, why is mpirun then
>>>>>>> deadlocking instead of quitting, which would allow the OS to clean up
>>>>>>> all the children?
>>>>>>
>>>>>> I don’t think mpirun is actually “deadlocked” - I think it may just be
>>>>>> waiting for SIGCHLD to tell it that the local processes have terminated.
>>>>>>
>>>>>> However, that wouldn't explain why you see what looks like libraries
>>>>>> being unloaded. That implies mpirun is actually finalizing, but failing
>>>>>> to fully exit - which would indeed be more of a deadlock.
>>>>>>
>>>>>> So the question is: are you truly seeing us miss SIGCHLD (as was
>>>>>> suggested earlier in this thread),
>>>>>
>>>>> In theory the processes remain in the zombie state until the parent calls
>>>>> waitpid on them, at which moment they are supposed to disappear. Based on
>>>>> this, as the processes are still in the zombie state, I assumed that mpirun
>>>>> was not calling waitpid. One could also assume we are again being hit by the
>>>>> fork race condition we had a while back, but as all local processes are
>>>>> zombies, that is hard to believe.
>>>>>
>>>>>> or did mpirun correctly see all the child processes terminate and is it
>>>>>> actually hanging while trying to exit (as was also suggested earlier)?
>>>>>
>>>>> One way or another, the stack of the main thread looks busted. While the
>>>>> discussion about this was going on, I was able to replicate the bug with
>>>>> only ORTE involved. Simply running
>>>>>
>>>>> for i in `seq 1 1 1000`; do echo "$i"; mpirun -np 4 hostname; done
>>>>>
>>>>> "deadlocks" (or whatever we want to call this) reliably before hitting
>>>>> the 300th iteration. Unfortunately, adding the verbose option alters the
>>>>> behavior enough that the issue does not reproduce.
>>>>>
>>>>> George.
>>>>>
>>>>>>
>>>>>> Adding the state verbosity should tell us which of those two is true,
>>>>>> assuming it doesn’t affect the timing so much that everything works :-/
>>>>>>
>>>>>>
>>>>>>>
>>>>>>> George.
>>>>>>>
>>>>>>>> On Jun 4, 2016, at 7:01 AM, Jeff Squyres (jsquyres) <jsquy...@cisco.com> wrote:
>>>>>>>>
>>>>>>>> Meh. Ok. Should George run with some verbose level to get more info?
>>>>>>>>
>>>>>>>>> On Jun 4, 2016, at 6:43 AM, Ralph Castain <r...@open-mpi.org> wrote:
>>>>>>>>>
>>>>>>>>> Neither of those threads has anything to do with catching SIGCHLD -
>>>>>>>>> threads 4-5 are listening for OOB and PMIx connection requests.
>>>>>>>>> It looks more like mpirun thought it had picked everything
>>>>>>>>> up and has begun shutting down, but I can’t really tell for certain.
>>>>>>>>>
>>>>>>>>>> On Jun 4, 2016, at 6:29 AM, Jeff Squyres (jsquyres) <jsquy...@cisco.com> wrote:
>>>>>>>>>>
>>>>>>>>>> On Jun 3, 2016, at 11:07 PM, George Bosilca <bosi...@icl.utk.edu> wrote:
>>>>>>>>>>>
>>>>>>>>>>> After finalize. As I said in my original email, I see all the output
>>>>>>>>>>> the application is generating, and all processes (which are local,
>>>>>>>>>>> as this happens on my laptop) are in zombie mode (Z+). This
>>>>>>>>>>> basically means whoever was supposed to get the SIGCHLD didn't do
>>>>>>>>>>> its job of cleaning them up.
>>>>>>>>>>
>>>>>>>>>> Ah -- so perhaps threads 1, 2, and 3 are red herrings: the real problem
>>>>>>>>>> here is that the parent didn't catch the child exits (which
>>>>>>>>>> presumably should have been caught in threads 4 or 5).
>>>>>>>>>>
>>>>>>>>>> Ralph: is there any state from threads 4 or 5 that would be helpful
>>>>>>>>>> to examine to see if they somehow missed catching child exits?
>>>>>>>>>>
>>>>>>>>>> --
>>>>>>>>>> Jeff Squyres
>>>>>>>>>> jsquy...@cisco.com
>>>>>>>>>> For corporate legal information go to:
>>>>>>>>>> http://www.cisco.com/web/about/doing_business/legal/cri/
>>>>>>>>>>
>>>>>>>>>> _______________________________________________
>>>>>>>>>> devel mailing list
>>>>>>>>>> de...@open-mpi.org
>>>>>>>>>> Subscription: https://www.open-mpi.org/mailman/listinfo.cgi/devel
>>>>>>>>>> Link to this post:
>>>>>>>>>> http://www.open-mpi.org/community/lists/devel/2016/06/19070.php