I haven’t been able to replicate this when using the branch in this PR: https://github.com/open-mpi/ompi/pull/1073
Would you mind giving it a try? It fixes some other race conditions and might pick this one up too. (A rough stand-in for the reproducer is sketched after the quoted thread below.)

> On Oct 27, 2015, at 10:04 AM, Ralph Castain <r...@open-mpi.org> wrote:
>
> Okay, I’ll take a look - I’ve been chasing a race condition that might be related
>
>> On Oct 27, 2015, at 9:54 AM, George Bosilca <bosi...@icl.utk.edu> wrote:
>>
>> No, it's using 2 nodes.
>>
>> George.
>>
>> On Tue, Oct 27, 2015 at 12:35 PM, Ralph Castain <r...@open-mpi.org> wrote:
>> Is this on a single node?
>>
>>> On Oct 27, 2015, at 9:25 AM, George Bosilca <bosi...@icl.utk.edu> wrote:
>>>
>>> I get intermittent deadlocks with the latest trunk. The smallest reproducer
>>> is a shell for loop around a small (2 processes), short (20 seconds) MPI
>>> application. After a few tens of iterations, MPI_Init will deadlock with
>>> the following backtrace:
>>>
>>> #0  0x00007fa94b5d9aed in nanosleep () from /lib64/libc.so.6
>>> #1  0x00007fa94b60ec94 in usleep () from /lib64/libc.so.6
>>> #2  0x00007fa94960ba08 in OPAL_PMIX_PMIX1XX_PMIx_Fence (procs=0x0, nprocs=0, info=0x7ffd7934fb90,
>>>     ninfo=1) at src/client/pmix_client_fence.c:100
>>> #3  0x00007fa9498376a2 in pmix1_fence (procs=0x0, collect_data=1) at pmix1_client.c:305
>>> #4  0x00007fa94bb39ba4 in ompi_mpi_init (argc=3, argv=0x7ffd793500a8, requested=3,
>>>     provided=0x7ffd7934ff94) at runtime/ompi_mpi_init.c:645
>>> #5  0x00007fa94bb77281 in PMPI_Init_thread (argc=0x7ffd7934ff8c, argv=0x7ffd7934ff80, required=3,
>>>     provided=0x7ffd7934ff94) at pinit_thread.c:69
>>> #6  0x000000000040150f in main (argc=3, argv=0x7ffd793500a8) at osu_mbw_mr.c:86
>>>
>>> On my machines this is reproducible at 100% after anywhere between 50 and
>>> 100 iterations.
>>>
>>> Thanks,
>>> George.
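For anyone trying to reproduce this without the OSU benchmarks, here is a minimal stand-in sketch. It is hypothetical (the actual reproducer was osu_mbw_mr), but the backtrace only shows MPI_Init_thread being called with required=3, i.e. MPI_THREAD_MULTIPLE, so a tiny two-process program initialized the same way and looped from the shell across two nodes might exercise the same PMIx_Fence path:

/* Hypothetical stand-in reproducer; the real test was osu_mbw_mr from the
 * OSU micro-benchmarks. This only mirrors the entry point visible in the
 * backtrace: MPI_Init_thread with required=3 (MPI_THREAD_MULTIPLE). */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int provided, rank, size;

    /* required=3 in frame #5 corresponds to MPI_THREAD_MULTIPLE */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);

    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    printf("rank %d of %d initialized (provided=%d)\n", rank, size, provided);

    /* The reported hang happens before this point, inside MPI_Init itself
     * (PMIx_Fence); the rest is just a clean shutdown. */
    MPI_Barrier(MPI_COMM_WORLD);
    MPI_Finalize();
    return 0;
}

Looping it from the shell with something like "mpirun -np 2 --host node1,node2 ./init_test" a hundred or so times should be enough to hit the window, if the race is the same one described above.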