It appears the branch solves the problem at least partially. I asked one of my students to hammer on it pretty hard, and he reported that the deadlocks still occur. He also graciously provided some stack traces (a sketch of the shell-loop reproducer from the quoted thread below is appended at the end of this message):
#0  0x00007f4bd5274aed in nanosleep () from /lib64/libc.so.6
#1  0x00007f4bd52a9c94 in usleep () from /lib64/libc.so.6
#2  0x00007f4bd2e42b00 in OPAL_PMIX_PMIX1XX_PMIx_Fence (procs=0x0, nprocs=0, info=0x7fff3c561960, ninfo=1) at src/client/pmix_client_fence.c:100
#3  0x00007f4bd306e6d2 in pmix1_fence (procs=0x0, collect_data=1) at pmix1_client.c:306
#4  0x00007f4bd57d5cc3 in ompi_mpi_init (argc=3, argv=0x7fff3c561ea8, requested=3, provided=0x7fff3c561d84) at runtime/ompi_mpi_init.c:644
#5  0x00007f4bd5813399 in PMPI_Init_thread (argc=0x7fff3c561d7c, argv=0x7fff3c561d70, required=3, provided=0x7fff3c561d84) at pinit_thread.c:69
#6  0x0000000000401516 in main (argc=3, argv=0x7fff3c561ea8) at osu_mbw_mr.c:86

And another process:

#0  0x00007f7b9d7d8bdc in recv () from /lib64/libpthread.so.0
#1  0x00007f7b9b0aa42d in opal_pmix_pmix1xx_pmix_usock_recv_blocking (sd=13, data=0x7ffd62139004 "", size=4) at src/usock/usock.c:168
#2  0x00007f7b9b0af5d9 in recv_connect_ack (sd=13) at src/client/pmix_client.c:844
#3  0x00007f7b9b0b085e in usock_connect (addr=0x7ffd62139330) at src/client/pmix_client.c:1110
#4  0x00007f7b9b0acc24 in connect_to_server (address=0x7ffd62139330, cbdata=0x7ffd621390e0) at src/client/pmix_client.c:181
#5  0x00007f7b9b0ad569 in OPAL_PMIX_PMIX1XX_PMIx_Init (proc=0x7f7b9b4e9b60) at src/client/pmix_client.c:362
#6  0x00007f7b9b2dbd9d in pmix1_client_init () at pmix1_client.c:99
#7  0x00007f7b9b4eb95f in pmi_component_query (module=0x7ffd62139490, priority=0x7ffd6213948c) at ess_pmi_component.c:90
#8  0x00007f7b9ce70ec5 in mca_base_select (type_name=0x7f7b9d20e059 "ess", output_id=-1, components_available=0x7f7b9d431eb0, best_module=0x7ffd621394d0, best_component=0x7ffd621394d8, priority_out=0x0) at mca_base_components_select.c:77
#9  0x00007f7b9d1a956b in orte_ess_base_select () at base/ess_base_select.c:40
#10 0x00007f7b9d160449 in orte_init (pargc=0x0, pargv=0x0, flags=32) at runtime/orte_init.c:219
#11 0x00007f7b9da4377a in ompi_mpi_init (argc=3, argv=0x7ffd621397f8, requested=3, provided=0x7ffd621396d4) at runtime/ompi_mpi_init.c:488
#12 0x00007f7b9da81399 in PMPI_Init_thread (argc=0x7ffd621396cc, argv=0x7ffd621396c0, required=3, provided=0x7ffd621396d4) at pinit_thread.c:69
#13 0x0000000000401516 in main (argc=3, argv=0x7ffd621397f8) at osu_mbw_mr.c:86

George.

On Tue, Oct 27, 2015 at 2:36 PM, Ralph Castain <r...@open-mpi.org> wrote:

> I haven’t been able to replicate this when using the branch in this PR:
>
> https://github.com/open-mpi/ompi/pull/1073
>
> Would you mind giving it a try? It fixes some other race conditions and
> might pick this one up too.
>
>
> On Oct 27, 2015, at 10:04 AM, Ralph Castain <r...@open-mpi.org> wrote:
>
> Okay, I’ll take a look - I’ve been chasing a race condition that might be
> related
>
> On Oct 27, 2015, at 9:54 AM, George Bosilca <bosi...@icl.utk.edu> wrote:
>
> No, it's using 2 nodes.
> George.
>
>
> On Tue, Oct 27, 2015 at 12:35 PM, Ralph Castain <r...@open-mpi.org> wrote:
>
>> Is this on a single node?
>>
>> On Oct 27, 2015, at 9:25 AM, George Bosilca <bosi...@icl.utk.edu> wrote:
>>
>> I get intermittent deadlocks with the latest trunk. The smallest
>> reproducer is a shell for loop around a small (2 processes), short
>> (20 seconds) MPI application.
>> After a few tens of iterations, MPI_Init will
>> deadlock with the following backtrace:
>>
>> #0 0x00007fa94b5d9aed in nanosleep () from /lib64/libc.so.6
>> #1 0x00007fa94b60ec94 in usleep () from /lib64/libc.so.6
>> #2 0x00007fa94960ba08 in OPAL_PMIX_PMIX1XX_PMIx_Fence (procs=0x0,
>> nprocs=0, info=0x7ffd7934fb90,
>> ninfo=1) at src/client/pmix_client_fence.c:100
>> #3 0x00007fa9498376a2 in pmix1_fence (procs=0x0, collect_data=1) at
>> pmix1_client.c:305
>> #4 0x00007fa94bb39ba4 in ompi_mpi_init (argc=3, argv=0x7ffd793500a8,
>> requested=3,
>> provided=0x7ffd7934ff94) at runtime/ompi_mpi_init.c:645
>> #5 0x00007fa94bb77281 in PMPI_Init_thread (argc=0x7ffd7934ff8c,
>> argv=0x7ffd7934ff80, required=3,
>> provided=0x7ffd7934ff94) at pinit_thread.c:69
>> #6 0x000000000040150f in main (argc=3, argv=0x7ffd793500a8) at
>> osu_mbw_mr.c:86
>>
>> On my machines this is reproducible at 100% after anywhere between 50 and
>> 100 iterations.
>>
>> Thanks,
>> George.
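
For anyone who wants to try this, here is a minimal sketch of the kind of reproducer George describes in the quoted thread: a tiny two-process MPI program launched repeatedly from a shell loop. The host names, iteration count, and program name in the comments are illustrative assumptions, not details from the thread; the only detail taken from the backtraces is that MPI_Init_thread is called with MPI_THREAD_MULTIPLE (required=3).

/*
 * Hypothetical minimal reproducer, sketched from the description in the
 * quoted thread: a tiny two-process MPI job launched over and over from a
 * shell loop, e.g.
 *
 *     for i in $(seq 1 100); do mpirun -np 2 --host nodeA,nodeB ./init_hang; done
 *
 * Host names, iteration count, and the program name "init_hang" are
 * illustrative placeholders. The reported deadlock happens inside
 * MPI_Init_thread itself, so the body of the program barely matters.
 */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    int provided, rank;

    /* MPI_THREAD_MULTIPLE matches the required=3 argument visible in the
     * PMPI_Init_thread frames of the backtraces above. */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);

    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    if (rank == 0) {
        printf("MPI_Init_thread completed, provided=%d\n", provided);
    }

    MPI_Barrier(MPI_COMM_WORLD);
    MPI_Finalize();
    return 0;
}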