It appears the branch solves the problem only partially. I asked one of
my students to hammer it pretty hard, and he reported that the deadlocks
still occur. He also graciously provided some stack traces:

#0  0x00007f4bd5274aed in nanosleep () from /lib64/libc.so.6
#1  0x00007f4bd52a9c94 in usleep () from /lib64/libc.so.6
#2  0x00007f4bd2e42b00 in OPAL_PMIX_PMIX1XX_PMIx_Fence (procs=0x0, nprocs=0, info=0x7fff3c561960, ninfo=1) at src/client/pmix_client_fence.c:100
#3  0x00007f4bd306e6d2 in pmix1_fence (procs=0x0, collect_data=1) at pmix1_client.c:306
#4  0x00007f4bd57d5cc3 in ompi_mpi_init (argc=3, argv=0x7fff3c561ea8, requested=3, provided=0x7fff3c561d84) at runtime/ompi_mpi_init.c:644
#5  0x00007f4bd5813399 in PMPI_Init_thread (argc=0x7fff3c561d7c, argv=0x7fff3c561d70, required=3, provided=0x7fff3c561d84) at pinit_thread.c:69
#6  0x0000000000401516 in main (argc=3, argv=0x7fff3c561ea8) at osu_mbw_mr.c:86

And another process:

#0  0x00007f7b9d7d8bdc in recv () from /lib64/libpthread.so.0
#1  0x00007f7b9b0aa42d in opal_pmix_pmix1xx_pmix_usock_recv_blocking (sd=13, data=0x7ffd62139004 "", size=4) at src/usock/usock.c:168
#2  0x00007f7b9b0af5d9 in recv_connect_ack (sd=13) at src/client/pmix_client.c:844
#3  0x00007f7b9b0b085e in usock_connect (addr=0x7ffd62139330) at src/client/pmix_client.c:1110
#4  0x00007f7b9b0acc24 in connect_to_server (address=0x7ffd62139330, cbdata=0x7ffd621390e0) at src/client/pmix_client.c:181
#5  0x00007f7b9b0ad569 in OPAL_PMIX_PMIX1XX_PMIx_Init (proc=0x7f7b9b4e9b60) at src/client/pmix_client.c:362
#6  0x00007f7b9b2dbd9d in pmix1_client_init () at pmix1_client.c:99
#7  0x00007f7b9b4eb95f in pmi_component_query (module=0x7ffd62139490, priority=0x7ffd6213948c) at ess_pmi_component.c:90
#8  0x00007f7b9ce70ec5 in mca_base_select (type_name=0x7f7b9d20e059 "ess", output_id=-1, components_available=0x7f7b9d431eb0, best_module=0x7ffd621394d0, best_component=0x7ffd621394d8, priority_out=0x0) at mca_base_components_select.c:77
#9  0x00007f7b9d1a956b in orte_ess_base_select () at base/ess_base_select.c:40
#10 0x00007f7b9d160449 in orte_init (pargc=0x0, pargv=0x0, flags=32) at runtime/orte_init.c:219
#11 0x00007f7b9da4377a in ompi_mpi_init (argc=3, argv=0x7ffd621397f8, requested=3, provided=0x7ffd621396d4) at runtime/ompi_mpi_init.c:488
#12 0x00007f7b9da81399 in PMPI_Init_thread (argc=0x7ffd621396cc, argv=0x7ffd621396c0, required=3, provided=0x7ffd621396d4) at pinit_thread.c:69
#13 0x0000000000401516 in main (argc=3, argv=0x7ffd621397f8) at osu_mbw_mr.c:86
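
Both traces hang inside MPI_Init_thread called with required=3
(MPI_THREAD_MULTIPLE): the first process is stuck in the PMIx fence
reached from ompi_mpi_init, the second is still blocked trying to
connect to the PMIx server. For anyone who wants to poke at this
without building the OSU benchmark, here is a minimal stand-in that
exercises the same init path; it is only a sketch, not the osu_mbw_mr
source, and the build/launch commands in the comment are assumptions:

/*
 * Minimal stand-in for the init path shown in the traces above (not
 * the actual osu_mbw_mr source).  required=3 in the traces corresponds
 * to MPI_THREAD_MULTIPLE.
 *
 * Assumed build/launch commands:
 *   mpicc -o init_test init_test.c
 *   mpirun -np 2 ./init_test
 */
#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
    int provided = -1;

    /* Both backtraces show the hang somewhere inside this call: either
     * in the PMIx_Fence reached from ompi_mpi_init, or earlier while
     * the client is connecting to the PMIx server. */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);

    printf("MPI initialized, provided thread level = %d\n", provided);

    MPI_Finalize();
    return 0;
}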

  George.



On Tue, Oct 27, 2015 at 2:36 PM, Ralph Castain <r...@open-mpi.org> wrote:

> I haven’t been able to replicate this when using the branch in this PR:
>
> https://github.com/open-mpi/ompi/pull/1073
>
> Would you mind giving it a try? It fixes some other race conditions and
> might pick this one up too.
>
>
> On Oct 27, 2015, at 10:04 AM, Ralph Castain <r...@open-mpi.org> wrote:
>
> Okay, I’ll take a look - I’ve been chasing a race condition that might be
> related
>
> On Oct 27, 2015, at 9:54 AM, George Bosilca <bosi...@icl.utk.edu> wrote:
>
> No, it's using 2 nodes.
>   George.
>
>
> On Tue, Oct 27, 2015 at 12:35 PM, Ralph Castain <r...@open-mpi.org> wrote:
>
>> Is this on a single node?
>>
>> On Oct 27, 2015, at 9:25 AM, George Bosilca <bosi...@icl.utk.edu> wrote:
>>
>> I get intermittent deadlocks with the latest trunk. The smallest
>> reproducer is a shell for loop around a small (2-process), short
>> (20-second) MPI application; a C sketch of the driving loop is included
>> below, after the trace. After a few tens of iterations, MPI_Init
>> deadlocks with the following backtrace:
>>
>> #0  0x00007fa94b5d9aed in nanosleep () from /lib64/libc.so.6
>> #1  0x00007fa94b60ec94 in usleep () from /lib64/libc.so.6
>> #2  0x00007fa94960ba08 in OPAL_PMIX_PMIX1XX_PMIx_Fence (procs=0x0, nprocs=0, info=0x7ffd7934fb90, ninfo=1) at src/client/pmix_client_fence.c:100
>> #3  0x00007fa9498376a2 in pmix1_fence (procs=0x0, collect_data=1) at pmix1_client.c:305
>> #4  0x00007fa94bb39ba4 in ompi_mpi_init (argc=3, argv=0x7ffd793500a8, requested=3, provided=0x7ffd7934ff94) at runtime/ompi_mpi_init.c:645
>> #5  0x00007fa94bb77281 in PMPI_Init_thread (argc=0x7ffd7934ff8c, argv=0x7ffd7934ff80, required=3, provided=0x7ffd7934ff94) at pinit_thread.c:69
>> #6  0x000000000040150f in main (argc=3, argv=0x7ffd793500a8) at osu_mbw_mr.c:86
>>
>> On my machines this is reproducible at 100% after anywhere between 50 and
>> 100 iterations.
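>>
>> For reference, here is an equivalent driver sketched in C rather than
>> shell; the mpirun command line, host names, and iteration count are
>> placeholders, not the ones actually used:
>>
>> /*
>>  * Hypothetical rendition of the "shell for loop" reproducer described
>>  * above: repeatedly launch a short 2-process MPI run and watch for an
>>  * iteration that never completes.
>>  */
>> #include <stdio.h>
>> #include <stdlib.h>
>>
>> int main(void)
>> {
>>     int i, rc;
>>
>>     for (i = 1; i <= 100; i++) {
>>         printf("iteration %d\n", i);
>>         fflush(stdout);
>>         /* Placeholder command line; when the deadlock hits, this call
>>          * simply never returns because the job hangs in MPI_Init. */
>>         rc = system("mpirun -np 2 -host node01,node02 ./osu_mbw_mr");
>>         if (rc != 0) {
>>             fprintf(stderr, "iteration %d failed (rc=%d)\n", i, rc);
>>             return 1;
>>         }
>>     }
>>     return 0;
>> }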
>>
>>   Thanks,
>>     George.
>>