George --

You might want to get bt's from *all* the threads...?
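In lldb, assuming you are still attached to the hung mpirun, something like

    (lldb) thread backtrace all

(or the shorthand "bt all") should dump every thread rather than only the main one.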


> On Jun 2, 2016, at 5:31 PM, George Bosilca <bosi...@icl.utk.edu> wrote:
> 
> The timeout never triggers, and when I attach to the mpirun process I see an 
> extremely strange stack:
> 
> (lldb) bt
> * thread #1: tid = 0x272b40e, 0x00007fff93306de6 libsystem_kernel.dylib`__psynch_mutexwait + 10, queue = 'com.apple.main-thread', stop reason = signal SIGSTOP
>   * frame #0: 0x00007fff93306de6 libsystem_kernel.dylib`__psynch_mutexwait + 10
>     frame #1: 0x00007fff9a000e4a libsystem_pthread.dylib`_pthread_mutex_lock_wait + 89
>     frame #2: 0x00007fff99ffe5f5 libsystem_pthread.dylib`_pthread_mutex_lock_slow + 300
>     frame #3: 0x00007fff8c2a00f8 libdyld.dylib`dyldGlobalLockAcquire() + 16
>     frame #4: 0x00007fff6ca8e177 dyld`ImageLoaderMachOCompressed::doBindFastLazySymbol(unsigned int, ImageLoader::LinkContext const&, void (*)(), void (*)()) + 55
>     frame #5: 0x00007fff6ca78063 dyld`dyld::fastBindLazySymbol(ImageLoader**, unsigned long) + 90
>     frame #6: 0x00007fff8c2a0262 libdyld.dylib`dyld_stub_binder + 282
>     frame #7: 0x000000010a5b29b0 libopen-pal.0.dylib`obj_order_type + 3776
> 
> This seems to indicate that we are trying to access a function from a dylib 
> that has been or is in the process of being unloaded.
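> 
> (One way to confirm that, assuming the standard dyld environment variables on 
> OS X, would be to rerun whichever test reproduces the hang with dyld's binding 
> diagnostics turned on, e.g.
> 
>   DYLD_PRINT_BINDINGS=1 mpirun -np 4 ./whatever_test_hangs
> 
> or to force all lazy bindings to be resolved at startup with 
> DYLD_BIND_AT_LAUNCH=1 and see whether the hang still occurs.)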
> 
>   George.
> 
> 
> On Thu, Jun 2, 2016 at 8:34 AM, Nathan Hjelm <hje...@me.com> wrote:
> The osc hang is fixed by a PR that fixes bugs in start in cm and ob1; see #1729.
> 
> -Nathan
> 
> On Jun 2, 2016, at 5:17 AM, Gilles Gouaillardet 
> <gilles.gouaillar...@gmail.com> wrote:
> 
>> fwiw,
>> 
>> the onesided/c_fence_lock test from the IBM test suite hangs
>> 
>> (mpirun -np 2 ./c_fence_lock)
>> 
>> I ran a git bisect and it implicates commit 
>> b90c83840f472de3219b87cd7e1a364eec5c5a29 (a rough sketch of the bisect 
>> commands follows the commit message below):
>> 
>> commit b90c83840f472de3219b87cd7e1a364eec5c5a29
>> Author: bosilca <bosi...@users.noreply.github.com>
>> Date:   Tue May 24 18:20:51 2016 -0500
>> 
>>     Refactor the request completion (#1422)
>>     
>>     * Remodel the request.
>>     Added the wait sync primitive and integrate it into the PML and MTL
>>     infrastructure. The multi-threaded requests are now significantly
>>     less heavy and less noisy (only the threads associated with completed
>>     requests are signaled).
>>     
>>     * Fix the condition to release the request.
>> 
>> 
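>> (For reference, the bisect was roughly along these lines; the endpoints shown 
>> here are placeholders rather than the actual good/bad commits:
>> 
>>   git bisect start
>>   git bisect bad            # current HEAD hangs
>>   git bisect good <last-known-good-commit>
>>   # at each step: rebuild, run "mpirun -np 2 ./c_fence_lock",
>>   # then mark it with "git bisect good" or "git bisect bad"
>>   git bisect reset
>> )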
>> 
>> 
>> I also noted that a warning is emitted when running with only one task
>> 
>> (./c_fence_lock)
>> 
>> but I did not git bisect that, so it might not be related.
>> 
>> Cheers,
>> 
>> 
>> 
>> Gilles
>> 
>> 
>> On Thursday, June 2, 2016, Ralph Castain <r...@open-mpi.org> wrote:
>> Yes, please! I’d like to know what mpirun thinks is happening - if you like, 
>> just set the --timeout N --report-state-on-timeout flags and tell me what 
>> comes out.
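>> 
>> For example (the timeout value here is arbitrary; substitute whatever test 
>> you are running):
>> 
>>   mpirun --timeout 60 --report-state-on-timeout -np 4 ./your_test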
>> 
>>> On Jun 1, 2016, at 7:57 PM, George Bosilca <bosi...@icl.utk.edu> wrote:
>>> 
>>> I don't think it matters. I was running the IBM collective and pt2pt tests, 
>>> but each time it deadlocked it was in a different test. If you are interested 
>>> in some particular values, I would be happy to attach a debugger the next 
>>> time it happens.
>>> 
>>>   George.
>>> 
>>> 
>>> On Wed, Jun 1, 2016 at 10:47 PM, Ralph Castain <r...@open-mpi.org> wrote:
>>> What kind of apps are they? Or does it matter what you are running?
>>> 
>>> 
>>> > On Jun 1, 2016, at 7:37 PM, George Bosilca <bosi...@icl.utk.edu> wrote:
>>> >
>>> > I have a seldom-occurring deadlock on an OS X laptop if I use more than 
>>> > 2 processes. It comes up once every 200 runs or so.
>>> >
>>> > Here is what I could gather from my experiments: all the MPI processes 
>>> > seem to have completed correctly (I get all the expected output and the 
>>> > MPI processes are in a waiting state), but somehow mpirun does not 
>>> > detect their completion. As a result, mpirun never returns.
>>> >
>>> >   George.
>>> >


-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: 
http://www.cisco.com/web/about/doing_business/legal/cri/
