I haven’t been able to replicate this when using the branch in this PR:

https://github.com/open-mpi/ompi/pull/1073

Would you mind giving it a try? It fixes some other race conditions and might 
pick this one up too.


> On Oct 27, 2015, at 10:04 AM, Ralph Castain <r...@open-mpi.org> wrote:
> 
> Okay, I’ll take a look. I’ve been chasing a race condition that might be 
> related.
> 
>> On Oct 27, 2015, at 9:54 AM, George Bosilca <bosi...@icl.utk.edu> wrote:
>> 
>> No, it's using 2 nodes.
>>   George.
>> 
>> 
>> On Tue, Oct 27, 2015 at 12:35 PM, Ralph Castain <r...@open-mpi.org> wrote:
>> Is this on a single node?
>> 
>>> On Oct 27, 2015, at 9:25 AM, George Bosilca <bosi...@icl.utk.edu> wrote:
>>> 
>>> I get intermittent deadlocks with the latest trunk. The smallest 
>>> reproducer is a shell for loop around a small (2-process), short 
>>> (20-second) MPI application. After a few tens of iterations, MPI_Init 
>>> deadlocks with the following backtrace:
>>> 
>>> #0  0x00007fa94b5d9aed in nanosleep () from /lib64/libc.so.6
>>> #1  0x00007fa94b60ec94 in usleep () from /lib64/libc.so.6
>>> #2  0x00007fa94960ba08 in OPAL_PMIX_PMIX1XX_PMIx_Fence (procs=0x0, 
>>> nprocs=0, info=0x7ffd7934fb90, 
>>>     ninfo=1) at src/client/pmix_client_fence.c:100
>>> #3  0x00007fa9498376a2 in pmix1_fence (procs=0x0, collect_data=1) at 
>>> pmix1_client.c:305
>>> #4  0x00007fa94bb39ba4 in ompi_mpi_init (argc=3, argv=0x7ffd793500a8, 
>>> requested=3, 
>>>     provided=0x7ffd7934ff94) at runtime/ompi_mpi_init.c:645
>>> #5  0x00007fa94bb77281 in PMPI_Init_thread (argc=0x7ffd7934ff8c, 
>>> argv=0x7ffd7934ff80, required=3, 
>>>     provided=0x7ffd7934ff94) at pinit_thread.c:69
>>> #6  0x000000000040150f in main (argc=3, argv=0x7ffd793500a8) at 
>>> osu_mbw_mr.c:86
>>> 
>>> On my machines this is reproducible at 100% after anywhere between 50 and 
>>> 100 iterations.
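
A minimal sketch of the kind of stress loop described above (the loop bounds, the mpirun command line, and the binary name are assumptions, not taken from the thread; wrapping the run in `timeout` is one way to turn a hung MPI_Init into a detectable failure):

```shell
# Stress-loop sketch for reproducing the intermittent MPI_Init deadlock.
# The commented mpirun invocation is hypothetical; substitute your real
# 2-process application.  `true` is a stand-in so the sketch runs as-is.
for i in $(seq 1 100); do
    echo "iteration $i"
    # timeout 60 mpirun -np 2 ./osu_mbw_mr || { echo "hang at iteration $i"; break; }
    true || break
done
```

With the real command substituted in, a run that deadlocks in MPI_Init exceeds the 60-second budget, `timeout` kills it with a nonzero exit status, and the loop stops on the offending iteration.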
>>> 
>>>   Thanks,
>>>     George.
>>> 
>>> _______________________________________________
>>> devel mailing list
>>> de...@open-mpi.org
>>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
>>> Link to this post: http://www.open-mpi.org/community/lists/devel/2015/10/18280.php
> 
