Looking at it, I think I see what was happening. The listener thread would start, immediately see that the active flag was still false, and exit. This left the server without any listening thread, and it had no way of detecting that this had happened. It was therefore a race between the thread checking the flag and the server setting it.
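For reference, here is a minimal sketch of the reordering Nysal suggests below: set the flag before pthread_create() and mark it volatile, rolling it back if thread creation fails. The field names mirror the PMIx code quoted later in the thread, but the surrounding structure (start_listening(), the placeholder loop body, main()) is simplified for illustration and is not the actual PMIx source.

    /* build: cc -pthread listener_flag_sketch.c */
    #include <pthread.h>
    #include <stdbool.h>
    #include <stdio.h>
    #include <unistd.h>

    /* Simplified sketch - not the actual PMIx code. The flag is set *before*
     * pthread_create(), so the new thread can never observe it as false, and
     * it is rolled back if the thread could not be created. volatile (or a
     * real atomic) keeps the compiler from caching the flag across the loop. */
    static volatile bool listen_thread_active = false;

    static void *listen_thread(void *arg)
    {
        (void)arg;
        while (listen_thread_active) {
            /* select() on the listen socket and accept() new connections
             * would go here - this placeholder just idles */
            sleep(1);
        }
        return NULL;
    }

    static int start_listening(pthread_t *engine)
    {
        listen_thread_active = true;                 /* set the flag first ... */
        if (0 != pthread_create(engine, NULL, listen_thread, NULL)) {
            listen_thread_active = false;            /* ... roll back on failure */
            return -1;
        }
        return 0;
    }

    int main(void)
    {
        pthread_t engine;
        if (0 != start_listening(&engine)) {
            return 1;
        }
        sleep(2);                      /* the listener is guaranteed to see the flag as true */
        listen_thread_active = false;  /* tell it to stop, then reap it */
        pthread_join(engine, NULL);
        printf("listener exited cleanly\n");
        return 0;
    }

Setting the flag only after pthread_create() leaves exactly the window described above: the new thread can run, test the flag, and exit before the parent ever stores true.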
Thanks Nysal - I believe this should indeed fix the problem!

> On Nov 9, 2015, at 9:04 AM, George Bosilca <bosi...@icl.utk.edu> wrote:
>
> Clearly Nysal has a valid point there. I launched a stress test with Nysal's
> suggestion in the code, and so far it is up to a few hundred iterations
> without a deadlock. I would not claim victory yet; I launched a 10k cycle to
> see where we stand (btw this never passed before).
> I'll let you know the outcome.
>
> George.
>
> On Mon, Nov 9, 2015 at 11:55 AM, Artem Polyakov <artpo...@gmail.com> wrote:
>
> 2015-11-09 22:42 GMT+06:00 Artem Polyakov <artpo...@gmail.com>:
> This is a very good point, Nysal!
>
> This is definitely a problem, and I can say even more: on average 3 out of
> every 10 tasks were affected by this bug. Once the PR
> (https://github.com/pmix/master/pull/8) was applied, I was able to run 100
> testing tasks without any hangs.
>
> Here is some more information on my symptoms. I was observing this without
> OMPI, just running the pmix_client test binary from the PMIx test suite with
> the SLURM PMIx plugin.
> Periodically the application was hanging. Investigation shows that not all
> processes are able to initialize correctly.
> Here is how such a client's backtrace looks:
>
> P.S. I think that this backtrace may be relevant to George's problem as well.
> In my case not all of the processes were hanging in connect_to_server; most
> of them were able to move forward and reach Fence.
> George, was the backtrace that you posted the same on both processes, or was
> it a "random" one from one of them?
>
> (gdb) bt
> #0  0x00007f1448f1b7eb in recv () from /lib/x86_64-linux-gnu/libpthread.so.0
> #1  0x00007f144914c191 in pmix_usock_recv_blocking (sd=9, data=0x7fff367f7c64 "", size=4) at src/usock/usock.c:166
> #2  0x00007f1449152d18 in recv_connect_ack (sd=9) at src/client/pmix_client.c:837
> #3  0x00007f14491546bf in usock_connect (addr=0x7fff367f7d60) at src/client/pmix_client.c:1103
> #4  0x00007f144914f94c in connect_to_server (address=0x7fff367f7d60, cbdata=0x7fff367f7dd0) at src/client/pmix_client.c:179
> #5  0x00007f1449150421 in PMIx_Init (proc=0x7fff367f81d0) at src/client/pmix_client.c:355
> #6  0x0000000000401b97 in main (argc=9, argv=0x7fff367f83d8) at pmix_client.c:62
>
> The server-side debug has the following lines at the end of the file:
> [cn33:00482] pmix:server register client slurm.pmix.22.0:10
> [cn33:00482] pmix:server _register_client for nspace slurm.pmix.22.0 rank 10
> [cn33:00482] pmix:server setup_fork for nspace slurm.pmix.22.0 rank 10
>
> In normal operation the following lines should appear after the lines above:
> ....
> [cn33:00188] listen_thread: new connection: (26, 0)
> [cn33:00188] connection_handler: new connection: 26
> [cn33:00188] RECV CONNECT ACK FROM PEER ON SOCKET 26
> [cn33:00188] waiting for blocking recv of 16 bytes
> [cn33:00188] blocking receive complete from remote
> ....
> At the client side I see the following lines:
> [cn33:00491] usock_peer_try_connect: attempting to connect to server
> [cn33:00491] usock_peer_try_connect: attempting to connect to server on socket 10
> [cn33:00491] pmix: SEND CONNECT ACK
> [cn33:00491] sec: native create_cred
> [cn33:00491] sec: using credential 1000:1000
> [cn33:00491] send blocking of 54 bytes to socket 10
> [cn33:00491] blocking send complete to socket 10
> [cn33:00491] pmix: RECV CONNECT ACK FROM SERVER
> [cn33:00491] waiting for blocking recv of 4 bytes
> [cn33:00491] blocking_recv received error 11:Resource temporarily unavailable from remote - cycling
> [cn33:00491] blocking_recv received error 11:Resource temporarily unavailable from remote - cycling
> [... repeated many times ...]
>
> With the fix for the problem highlighted by Nysal, everything runs cleanly.
>
> 2015-11-09 10:53 GMT+06:00 Nysal Jan K A <jny...@gmail.com>:
> In listen_thread():
> 194         while (pmix_server_globals.listen_thread_active) {
> 195             FD_ZERO(&readfds);
> 196             FD_SET(pmix_server_globals.listen_socket, &readfds);
> 197             max = pmix_server_globals.listen_socket;
>
> Is it possible that pmix_server_globals.listen_thread_active can be false, in
> which case the thread just exits and will never call accept()?
>
> In pmix_start_listening():
> 147     /* fork off the listener thread */
> 148     if (0 > pthread_create(&engine, NULL, listen_thread, NULL)) {
> 149         return PMIX_ERROR;
> 150     }
> 151     pmix_server_globals.listen_thread_active = true;
>
> pmix_server_globals.listen_thread_active is set to true after the thread is
> created - could this cause a race?
> listen_thread_active might also need to be declared as volatile.
>
> Regards
> --Nysal
>
> On Sun, Nov 8, 2015 at 10:38 PM, George Bosilca <bosi...@icl.utk.edu> wrote:
> We had a power outage last week and the local disks on our cluster were wiped
> out. My tester was in there. But I can rewrite it after SC.
>
> George.
>
> On Sat, Nov 7, 2015 at 12:04 PM, Ralph Castain <r...@open-mpi.org> wrote:
> Could you send me your stress test? I'm wondering if it is just something
> about how we set socket options.
>
>> On Nov 7, 2015, at 8:58 AM, George Bosilca <bosi...@icl.utk.edu> wrote:
>>
>> I had to postpone this until after SC. However, I ran a UDS stress test for
>> 3 days, reproducing the opening and sending of data (what Ralph described in
>> his email), and I could never get a deadlock.
>>
>> George.
>>
>> On Sat, Nov 7, 2015 at 11:26 AM, Ralph Castain <r...@open-mpi.org> wrote:
>> George was looking into it, but I don't know if he has had time recently to
>> continue the investigation. We understand "what" is happening (accept
>> sometimes ignores the connection), but we don't yet know "why". I've done
>> some digging around the web, and found that sometimes you can try to talk to
>> a Unix Domain Socket too quickly - i.e., you open it and then send to it,
>> but the OS hasn't yet set it up. In those cases, you can hang the socket.
>> However, I've tried adding some artificial delay, and while it helped, it
>> didn't completely solve the problem.
>>
>> I have an idea for a workaround (set a timer and retry after a while), but
>> would obviously prefer a real solution. I'm not even sure it will work, as it
>> is unclear that the server (which is the one hung in accept) will break free
>> if the client closes the socket and retries.
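To make the retry workaround described just above a bit more concrete, here is a rough illustration in plain POSIX terms: wait for the server's reply with a timeout and, if nothing arrives, close the socket and reconnect from scratch. The connect_with_retry() helper, the 2-second timeout, and the rendezvous-path argument are all invented for this sketch; this is not the PMIx client code.

    #include <poll.h>
    #include <string.h>
    #include <sys/socket.h>
    #include <sys/un.h>
    #include <unistd.h>

    /* Hypothetical illustration of "connect, wait with a timeout, retry":
     * if the server never accept()s and the ack never arrives, give up on
     * this socket after a couple of seconds and try again from scratch. */
    static int connect_with_retry(const char *path, int max_tries)
    {
        for (int attempt = 0; attempt < max_tries; attempt++) {
            int sd = socket(AF_UNIX, SOCK_STREAM, 0);
            if (sd < 0) {
                return -1;
            }
            struct sockaddr_un addr;
            memset(&addr, 0, sizeof(addr));
            addr.sun_family = AF_UNIX;
            strncpy(addr.sun_path, path, sizeof(addr.sun_path) - 1);

            if (0 == connect(sd, (struct sockaddr *)&addr, sizeof(addr))) {
                /* ... send the connect ack here, then wait for the reply ... */
                struct pollfd pfd = { .fd = sd, .events = POLLIN };
                if (poll(&pfd, 1, 2000) > 0) {   /* reply arrived within 2 s */
                    return sd;                    /* caller does the blocking recv */
                }
            }
            close(sd);                            /* timed out or failed: retry */
        }
        return -1;
    }

As noted above, a retry like this would only paper over a lost accept(); the flag-ordering fix removes the root cause.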
>>> On Nov 6, 2015, at 10:53 PM, Artem Polyakov <artpo...@gmail.com> wrote:
>>>
>>> Hello, is there any progress on this topic? This affects our PMIx
>>> measurements.
>>>
>>> 2015-10-30 21:21 GMT+06:00 Ralph Castain <r...@open-mpi.org>:
>>> I've verified that the orte/util/listener thread is not being started, so I
>>> don't think it should be involved in this problem.
>>>
>>> HTH
>>> Ralph
>>>
>>>> On Oct 30, 2015, at 8:07 AM, Ralph Castain <r...@open-mpi.org> wrote:
>>>>
>>>> Hmmm…there is a hook that would allow the PMIx server to utilize that
>>>> listener thread, but we aren't currently using it. Each daemon plus mpirun
>>>> will call orte_start_listener, but nothing is currently registering, and so
>>>> the listener in that code is supposed to just return without starting the
>>>> thread.
>>>>
>>>> So the only listener thread that should exist is the one inside the PMIx
>>>> server itself. If something else is happening, then that would be a bug. I
>>>> can look at the orte listener code to ensure that the thread isn't
>>>> incorrectly starting.
>>>>
>>>>> On Oct 29, 2015, at 10:03 PM, George Bosilca <bosi...@icl.utk.edu> wrote:
>>>>>
>>>>> Some progress that puzzles me, but it might help you understand. Once the
>>>>> deadlock appears, if I manually kill the MPI process on the node where
>>>>> the deadlock was created, the local orte daemon doesn't notice and will
>>>>> just keep waiting.
>>>>>
>>>>> Quick question: I am under the impression that the issue is not in the
>>>>> PMIx server but somewhere around the listener_thread_fn in
>>>>> orte/util/listener.c. Possible?
>>>>>
>>>>> George.
>>>>>
>>>>> On Wed, Oct 28, 2015 at 3:56 AM, Ralph Castain <r...@open-mpi.org> wrote:
>>>>> Should have also clarified: the prior fixes are indeed in the current
>>>>> master.
>>>>>
>>>>>> On Oct 28, 2015, at 12:42 AM, Ralph Castain <r...@open-mpi.org> wrote:
>>>>>>
>>>>>> Nope - I was wrong. The correction on the client side consisted of
>>>>>> attempting to time out if the blocking recv failed. We then modified the
>>>>>> blocking send/recv so they would handle errors.
>>>>>>
>>>>>> So that problem occurred -after- the server had correctly called accept.
>>>>>> The listener code is in
>>>>>> opal/mca/pmix/pmix1xx/pmix/src/server/pmix_server_listener.c
>>>>>>
>>>>>> It looks to me like the only way we could drop the accept (assuming the
>>>>>> OS doesn't lose it) is if the file descriptor lies outside the expected
>>>>>> range once we fall out of select:
>>>>>>
>>>>>>     /* Spin accepting connections until all active listen sockets
>>>>>>      * do not have any incoming connections, pushing each connection
>>>>>>      * onto the event queue for processing
>>>>>>      */
>>>>>>     do {
>>>>>>         accepted_connections = 0;
>>>>>>         /* according to the man pages, select replaces the given descriptor
>>>>>>          * set with a subset consisting of those descriptors that are ready
>>>>>>          * for the specified operation - in this case, a read. So we need to
>>>>>>          * first check to see if this file descriptor is included in the
>>>>>>          * returned subset
>>>>>>          */
>>>>>>         if (0 == FD_ISSET(pmix_server_globals.listen_socket, &readfds)) {
>>>>>>             /* this descriptor is not included */
>>>>>>             continue;
>>>>>>         }
>>>>>>
>>>>>>         /* this descriptor is ready to be read, which means a connection
>>>>>>          * request has been received - so harvest it. All we want to do
>>>>>>          * here is accept the connection and push the info onto the event
>>>>>>          * library for subsequent processing - we don't want to actually
>>>>>>          * process the connection here as it takes too long, and so the
>>>>>>          * OS might start rejecting connections due to timeout.
>>>>>>          */
>>>>>>         pending_connection = PMIX_NEW(pmix_pending_connection_t);
>>>>>>         event_assign(&pending_connection->ev, pmix_globals.evbase, -1,
>>>>>>                      EV_WRITE, connection_handler, pending_connection);
>>>>>>         pending_connection->sd = accept(pmix_server_globals.listen_socket,
>>>>>>                                         (struct sockaddr*)&(pending_connection->addr),
>>>>>>                                         &addrlen);
>>>>>>         if (pending_connection->sd < 0) {
>>>>>>             PMIX_RELEASE(pending_connection);
>>>>>>             if (pmix_socket_errno != EAGAIN ||
>>>>>>                 pmix_socket_errno != EWOULDBLOCK) {
>>>>>>                 if (EMFILE == pmix_socket_errno) {
>>>>>>                     PMIX_ERROR_LOG(PMIX_ERR_OUT_OF_RESOURCE);
>>>>>>                 } else {
>>>>>>                     pmix_output(0, "listen_thread: accept() failed: %s (%d).",
>>>>>>                                 strerror(pmix_socket_errno), pmix_socket_errno);
>>>>>>                 }
>>>>>>                 goto done;
>>>>>>             }
>>>>>>             continue;
>>>>>>         }
>>>>>>
>>>>>>         pmix_output_verbose(8, pmix_globals.debug_output,
>>>>>>                             "listen_thread: new connection: (%d, %d)",
>>>>>>                             pending_connection->sd, pmix_socket_errno);
>>>>>>         /* activate the event */
>>>>>>         event_active(&pending_connection->ev, EV_WRITE, 1);
>>>>>>         accepted_connections++;
>>>>>>     } while (accepted_connections > 0);
>>>>>>
>>>>>>> On Oct 28, 2015, at 12:25 AM, Ralph Castain <r...@open-mpi.org> wrote:
>>>>>>>
>>>>>>> Looking at the code, it appears that a fix was committed for this
>>>>>>> problem, and that we correctly resolved the issue found by Paul. The
>>>>>>> problem is that the fix didn't get upstreamed, and so it was lost the
>>>>>>> next time we refreshed PMIx. Sigh.
>>>>>>>
>>>>>>> Let me try to recreate the fix and have you take a gander at it.
>>>>>>>
>>>>>>>> On Oct 28, 2015, at 12:22 AM, Ralph Castain <r...@open-mpi.org> wrote:
>>>>>>>>
>>>>>>>> Here is the discussion - afraid it is fairly lengthy. Ignore the hwloc
>>>>>>>> references in it as that was a separate issue:
>>>>>>>>
>>>>>>>> http://www.open-mpi.org/community/lists/devel/2015/09/18074.php
>>>>>>>>
>>>>>>>> It definitely sounds like the same issue creeping in again. I'd
>>>>>>>> appreciate any thoughts on how to correct it. If it helps, you could
>>>>>>>> look at the PMIx master - there are standalone tests in the
>>>>>>>> test/simple directory that fork/exec a child and just do the
>>>>>>>> connection.
>>>>>>>>
>>>>>>>> https://github.com/pmix/master
>>>>>>>>
>>>>>>>> The test server is simptest.c - it will spawn a single copy of
>>>>>>>> simpclient.c by default.
>>>>>>>>
>>>>>>>>> On Oct 27, 2015, at 10:14 PM, George Bosilca <bosi...@icl.utk.edu> wrote:
>>>>>>>>>
>>>>>>>>> Interesting.
>>>>>>>>> Do you have a pointer to the commit (and/or to the discussion)?
>>>>>>>>>
>>>>>>>>> I looked at the PMIx code, and I have identified a few issues, but
>>>>>>>>> unfortunately none of them seem to fix the problem for good. However,
>>>>>>>>> now I need more than 1000 runs to get a deadlock (instead of a few
>>>>>>>>> tens).
>>>>>>>>>
>>>>>>>>> Looking with "netstat -ax" at the status of the UDS while the
>>>>>>>>> processes are deadlocked, I see 2 UDS with the same name: one from
>>>>>>>>> the server, which is in the LISTEN state, and one from the client,
>>>>>>>>> which is in the CONNECTING state (while the client has already sent a
>>>>>>>>> message into the socket and is now waiting in a blocking receive).
>>>>>>>>> This somehow suggests that the server has not yet called accept on
>>>>>>>>> the UDS. Unfortunately, there are 3 threads all doing different
>>>>>>>>> flavors of event_base and select, so I have a hard time tracking the
>>>>>>>>> path of the UDS on the server side.
>>>>>>>>>
>>>>>>>>> So in order to validate my assumption I wrote a minimalistic UDS
>>>>>>>>> client and server application and tried different scenarios. The
>>>>>>>>> conclusion is that in order to see the same type of output from
>>>>>>>>> "netstat -ax" I have to call listen on the server, connect on the
>>>>>>>>> client, and not call accept on the server.
>>>>>>>>>
>>>>>>>>> On the same occasion I also confirmed that the UDS holds the data
>>>>>>>>> sent, so there is no need for further synchronization for the case
>>>>>>>>> where the data is sent first. We only need to find out how the server
>>>>>>>>> forgets to call accept.
>>>>>>>>>
>>>>>>>>> George.
>>>>>>>>>
>>>>>>>>> On Tue, Oct 27, 2015 at 7:52 PM, Ralph Castain <r...@open-mpi.org> wrote:
>>>>>>>>> Hmmm…this looks like it might be that problem we previously saw where
>>>>>>>>> the blocking recv hangs in a proc when the blocking send tries to
>>>>>>>>> send before the domain socket is actually ready, and so the send
>>>>>>>>> fails on the other end. As I recall, it was something to do with the
>>>>>>>>> socket options - and then Paul had a problem on some of his machines,
>>>>>>>>> and we backed it out?
>>>>>>>>>
>>>>>>>>> I wonder if that's what is biting us here again, and what we need is
>>>>>>>>> to either remove the blocking send/recv's altogether, or figure out a
>>>>>>>>> way to wait until the socket is really ready.
>>>>>>>>>
>>>>>>>>> Any thoughts?
>>>>>>>>>
>>>>>>>>>> On Oct 27, 2015, at 4:11 PM, George Bosilca <bosi...@icl.utk.edu> wrote:
>>>>>>>>>>
>>>>>>>>>> It appears the branch solves the problem at least partially. I asked
>>>>>>>>>> one of my students to hammer it pretty badly, and he reported that
>>>>>>>>>> the deadlocks still occur.
>>>>>>>>>> He also graciously provided some stacktraces:
>>>>>>>>>>
>>>>>>>>>> #0  0x00007f4bd5274aed in nanosleep () from /lib64/libc.so.6
>>>>>>>>>> #1  0x00007f4bd52a9c94 in usleep () from /lib64/libc.so.6
>>>>>>>>>> #2  0x00007f4bd2e42b00 in OPAL_PMIX_PMIX1XX_PMIx_Fence (procs=0x0, nprocs=0, info=0x7fff3c561960, ninfo=1) at src/client/pmix_client_fence.c:100
>>>>>>>>>> #3  0x00007f4bd306e6d2 in pmix1_fence (procs=0x0, collect_data=1) at pmix1_client.c:306
>>>>>>>>>> #4  0x00007f4bd57d5cc3 in ompi_mpi_init (argc=3, argv=0x7fff3c561ea8, requested=3, provided=0x7fff3c561d84) at runtime/ompi_mpi_init.c:644
>>>>>>>>>> #5  0x00007f4bd5813399 in PMPI_Init_thread (argc=0x7fff3c561d7c, argv=0x7fff3c561d70, required=3, provided=0x7fff3c561d84) at pinit_thread.c:69
>>>>>>>>>> #6  0x0000000000401516 in main (argc=3, argv=0x7fff3c561ea8) at osu_mbw_mr.c:86
>>>>>>>>>>
>>>>>>>>>> And another process:
>>>>>>>>>>
>>>>>>>>>> #0  0x00007f7b9d7d8bdc in recv () from /lib64/libpthread.so.0
>>>>>>>>>> #1  0x00007f7b9b0aa42d in opal_pmix_pmix1xx_pmix_usock_recv_blocking (sd=13, data=0x7ffd62139004 "", size=4) at src/usock/usock.c:168
>>>>>>>>>> #2  0x00007f7b9b0af5d9 in recv_connect_ack (sd=13) at src/client/pmix_client.c:844
>>>>>>>>>> #3  0x00007f7b9b0b085e in usock_connect (addr=0x7ffd62139330) at src/client/pmix_client.c:1110
>>>>>>>>>> #4  0x00007f7b9b0acc24 in connect_to_server (address=0x7ffd62139330, cbdata=0x7ffd621390e0) at src/client/pmix_client.c:181
>>>>>>>>>> #5  0x00007f7b9b0ad569 in OPAL_PMIX_PMIX1XX_PMIx_Init (proc=0x7f7b9b4e9b60) at src/client/pmix_client.c:362
>>>>>>>>>> #6  0x00007f7b9b2dbd9d in pmix1_client_init () at pmix1_client.c:99
>>>>>>>>>> #7  0x00007f7b9b4eb95f in pmi_component_query (module=0x7ffd62139490, priority=0x7ffd6213948c) at ess_pmi_component.c:90
>>>>>>>>>> #8  0x00007f7b9ce70ec5 in mca_base_select (type_name=0x7f7b9d20e059 "ess", output_id=-1, components_available=0x7f7b9d431eb0, best_module=0x7ffd621394d0, best_component=0x7ffd621394d8, priority_out=0x0) at mca_base_components_select.c:77
>>>>>>>>>> #9  0x00007f7b9d1a956b in orte_ess_base_select () at base/ess_base_select.c:40
>>>>>>>>>> #10 0x00007f7b9d160449 in orte_init (pargc=0x0, pargv=0x0, flags=32) at runtime/orte_init.c:219
>>>>>>>>>> #11 0x00007f7b9da4377a in ompi_mpi_init (argc=3, argv=0x7ffd621397f8, requested=3, provided=0x7ffd621396d4) at runtime/ompi_mpi_init.c:488
>>>>>>>>>> #12 0x00007f7b9da81399 in PMPI_Init_thread (argc=0x7ffd621396cc, argv=0x7ffd621396c0, required=3, provided=0x7ffd621396d4) at pinit_thread.c:69
>>>>>>>>>> #13 0x0000000000401516 in main (argc=3, argv=0x7ffd621397f8) at osu_mbw_mr.c:86
>>>>>>>>>>
>>>>>>>>>> George.
>>>>>>>>>>
>>>>>>>>>> On Tue, Oct 27, 2015 at 2:36 PM, Ralph Castain <r...@open-mpi.org> wrote:
>>>>>>>>>> I haven't been able to replicate this when using the branch in this PR:
>>>>>>>>>>
>>>>>>>>>> https://github.com/open-mpi/ompi/pull/1073
>>>>>>>>>>
>>>>>>>>>> Would you mind giving it a try? It fixes some other race conditions
>>>>>>>>>> and might pick this one up too.
>>>>>>>>>>> On Oct 27, 2015, at 10:04 AM, Ralph Castain <r...@open-mpi.org> wrote:
>>>>>>>>>>>
>>>>>>>>>>> Okay, I'll take a look - I've been chasing a race condition that
>>>>>>>>>>> might be related.
>>>>>>>>>>>
>>>>>>>>>>>> On Oct 27, 2015, at 9:54 AM, George Bosilca <bosi...@icl.utk.edu> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>> No, it's using 2 nodes.
>>>>>>>>>>>> George.
>>>>>>>>>>>>
>>>>>>>>>>>> On Tue, Oct 27, 2015 at 12:35 PM, Ralph Castain <r...@open-mpi.org> wrote:
>>>>>>>>>>>> Is this on a single node?
>>>>>>>>>>>>
>>>>>>>>>>>>> On Oct 27, 2015, at 9:25 AM, George Bosilca <bosi...@icl.utk.edu> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>> I get intermittent deadlocks with the latest trunk. The smallest
>>>>>>>>>>>>> reproducer is a shell for loop around a small (2 processes) short
>>>>>>>>>>>>> (20 seconds) MPI application. After a few tens of iterations
>>>>>>>>>>>>> MPI_Init will deadlock with the following backtrace:
>>>>>>>>>>>>>
>>>>>>>>>>>>> #0  0x00007fa94b5d9aed in nanosleep () from /lib64/libc.so.6
>>>>>>>>>>>>> #1  0x00007fa94b60ec94 in usleep () from /lib64/libc.so.6
>>>>>>>>>>>>> #2  0x00007fa94960ba08 in OPAL_PMIX_PMIX1XX_PMIx_Fence (procs=0x0, nprocs=0, info=0x7ffd7934fb90, ninfo=1) at src/client/pmix_client_fence.c:100
>>>>>>>>>>>>> #3  0x00007fa9498376a2 in pmix1_fence (procs=0x0, collect_data=1) at pmix1_client.c:305
>>>>>>>>>>>>> #4  0x00007fa94bb39ba4 in ompi_mpi_init (argc=3, argv=0x7ffd793500a8, requested=3, provided=0x7ffd7934ff94) at runtime/ompi_mpi_init.c:645
>>>>>>>>>>>>> #5  0x00007fa94bb77281 in PMPI_Init_thread (argc=0x7ffd7934ff8c, argv=0x7ffd7934ff80, required=3, provided=0x7ffd7934ff94) at pinit_thread.c:69
>>>>>>>>>>>>> #6  0x000000000040150f in main (argc=3, argv=0x7ffd793500a8) at osu_mbw_mr.c:86
>>>>>>>>>>>>>
>>>>>>>>>>>>> On my machines this is reproducible at 100% after anywhere
>>>>>>>>>>>>> between 50 and 100 iterations.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>> George.
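Finally, to illustrate the state George diagnosed earlier with "netstat -ax" (listener in LISTEN, client stuck in CONNECTING, the sent bytes parked in the socket buffer because accept() was never called), here is a hypothetical stand-alone reproducer along the lines of the minimal UDS client/server he describes. The socket path and every name in it are invented for illustration; it is not taken from PMIx or from George's tester.

    #include <stdio.h>
    #include <string.h>
    #include <sys/socket.h>
    #include <sys/un.h>
    #include <unistd.h>

    /* Hypothetical reproducer, not part of PMIx: the server listens but never
     * calls accept(); the client's connect() and send() both succeed, and
     * "netstat -ax" then shows the listener in LISTEN and the client endpoint
     * still in CONNECTING - the same picture reported for the deadlocked runs. */
    #define SOCK_PATH "/tmp/uds_no_accept_demo"

    int main(void)
    {
        struct sockaddr_un addr;
        memset(&addr, 0, sizeof(addr));
        addr.sun_family = AF_UNIX;
        strncpy(addr.sun_path, SOCK_PATH, sizeof(addr.sun_path) - 1);

        unlink(SOCK_PATH);
        int lsd = socket(AF_UNIX, SOCK_STREAM, 0);
        if (lsd < 0 || bind(lsd, (struct sockaddr *)&addr, sizeof(addr)) < 0 ||
            listen(lsd, 16) < 0) {
            perror("server setup");
            return 1;
        }

        if (0 == fork()) {                       /* child: the "client" */
            int sd = socket(AF_UNIX, SOCK_STREAM, 0);
            if (connect(sd, (struct sockaddr *)&addr, sizeof(addr)) < 0) {
                perror("connect");
                return 1;
            }
            /* the send succeeds even though the server never accept()s */
            const char msg[] = "connect ack";
            printf("client: sent %zd bytes\n", send(sd, msg, sizeof(msg), 0));
            /* a blocking recv() here would now hang forever, like the
             * pmix_usock_recv_blocking frames in the backtraces above */
            return 0;
        }

        /* parent: the "server" deliberately never calls accept() - run
         * "netstat -ax | grep uds_no_accept_demo" while this sleeps */
        sleep(30);
        close(lsd);
        unlink(SOCK_PATH);
        return 0;
    }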