Thanks, Nysal!! Good catch!

Josh
On Mon, Nov 9, 2015 at 2:27 PM, Mark Santcroos <mark.santcr...@rutgers.edu> wrote:
> It seems the change suggested by Nysal also allows me to run into the next problem ;-)
>
> Mark
>
> > On 09 Nov 2015, at 20:19 , George Bosilca <bosi...@icl.utk.edu> wrote:
> >
> > All 10k tests completed successfully. Nysal pinpointed the real problem behind the deadlocks. :+1:
> >
> > George.
> >
> > On Mon, Nov 9, 2015 at 1:13 PM, Ralph Castain <r...@open-mpi.org> wrote:
> > Looking at it, I think I see what was happening. The thread would start, but then immediately see that the active flag was false and would exit. This left the server without any listening thread - but it wouldn't detect that this had happened. It was therefore a race over whether the thread checked the flag before the server set it.
> >
> > Thanks Nysal - I believe this should indeed fix the problem!
> >
> >> On Nov 9, 2015, at 9:04 AM, George Bosilca <bosi...@icl.utk.edu> wrote:
> >>
> >> Clearly Nysal has a valid point there. I launched a stress test with Nysal's suggestion in the code, and so far it is up to a few hundred iterations without deadlock. I would not claim victory yet; I launched a 10k cycle to see where we stand (btw, this never passed before). I'll let you know the outcome.
> >>
> >> George.
> >>
> >> On Mon, Nov 9, 2015 at 11:55 AM, Artem Polyakov <artpo...@gmail.com> wrote:
> >>
> >> 2015-11-09 22:42 GMT+06:00 Artem Polyakov <artpo...@gmail.com>:
> >> This is a very good point, Nysal!
> >>
> >> This is definitely a problem, and I can say even more: on average 3 out of every 10 tasks were affected by this bug. Once the PR (https://github.com/pmix/master/pull/8) was applied, I was able to run 100 test tasks without any hangs.
> >>
> >> Here is some more information on my symptoms. I was observing this without OMPI, just running the pmix_client test binary from the PMIx test suite with the SLURM PMIx plugin. Periodically the application was hanging. Investigation shows that not all processes are able to initialize correctly. Here is how such a client's backtrace looks:
> >>
> >> P.S. I think that this backtrace may be relevant to George's problem as well. In my case not all of the processes were hanging in connect_to_server; most of them were able to move forward and reach the Fence.
> >> George, was the backtrace that you posted the same on both processes, or was it a "random" one from one of them?
> >>
> >> (gdb) bt
> >> #0  0x00007f1448f1b7eb in recv () from /lib/x86_64-linux-gnu/libpthread.so.0
> >> #1  0x00007f144914c191 in pmix_usock_recv_blocking (sd=9, data=0x7fff367f7c64 "", size=4) at src/usock/usock.c:166
> >> #2  0x00007f1449152d18 in recv_connect_ack (sd=9) at src/client/pmix_client.c:837
> >> #3  0x00007f14491546bf in usock_connect (addr=0x7fff367f7d60) at src/client/pmix_client.c:1103
> >> #4  0x00007f144914f94c in connect_to_server (address=0x7fff367f7d60, cbdata=0x7fff367f7dd0) at src/client/pmix_client.c:179
> >> #5  0x00007f1449150421 in PMIx_Init (proc=0x7fff367f81d0) at src/client/pmix_client.c:355
> >> #6  0x0000000000401b97 in main (argc=9, argv=0x7fff367f83d8) at pmix_client.c:62
> >>
> >> The server-side debug log has the following lines at the end of the file:
> >> [cn33:00482] pmix:server register client slurm.pmix.22.0:10
> >> [cn33:00482] pmix:server _register_client for nspace slurm.pmix.22.0 rank 10
> >> [cn33:00482] pmix:server setup_fork for nspace slurm.pmix.22.0 rank 10
> >>
> >> In normal operation the following lines should appear after the lines above:
> >> ....
> >> [cn33:00188] listen_thread: new connection: (26, 0)
> >> [cn33:00188] connection_handler: new connection: 26
> >> [cn33:00188] RECV CONNECT ACK FROM PEER ON SOCKET 26
> >> [cn33:00188] waiting for blocking recv of 16 bytes
> >> [cn33:00188] blocking receive complete from remote
> >> ....
> >>
> >> On the client side I see the following lines:
> >> [cn33:00491] usock_peer_try_connect: attempting to connect to server
> >> [cn33:00491] usock_peer_try_connect: attempting to connect to server on socket 10
> >> [cn33:00491] pmix: SEND CONNECT ACK
> >> [cn33:00491] sec: native create_cred
> >> [cn33:00491] sec: using credential 1000:1000
> >> [cn33:00491] send blocking of 54 bytes to socket 10
> >> [cn33:00491] blocking send complete to socket 10
> >> [cn33:00491] pmix: RECV CONNECT ACK FROM SERVER
> >> [cn33:00491] waiting for blocking recv of 4 bytes
> >> [cn33:00491] blocking_recv received error 11:Resource temporarily unavailable from remote - cycling
> >> [cn33:00491] blocking_recv received error 11:Resource temporarily unavailable from remote - cycling
> >> [... repeated many times ...]
> >>
> >> With the fix for the problem highlighted by Nysal, everything runs cleanly.
> >>
> >> 2015-11-09 10:53 GMT+06:00 Nysal Jan K A <jny...@gmail.com>:
> >> In listen_thread():
> >> 194         while (pmix_server_globals.listen_thread_active) {
> >> 195             FD_ZERO(&readfds);
> >> 196             FD_SET(pmix_server_globals.listen_socket, &readfds);
> >> 197             max = pmix_server_globals.listen_socket;
> >>
> >> Is it possible that pmix_server_globals.listen_thread_active can be false, in which case the thread just exits and will never call accept()?
> >>
> >> In pmix_start_listening():
> >> 147     /* fork off the listener thread */
> >> 148     if (0 > pthread_create(&engine, NULL, listen_thread, NULL)) {
> >> 149         return PMIX_ERROR;
> >> 150     }
> >> 151     pmix_server_globals.listen_thread_active = true;
> >>
> >> pmix_server_globals.listen_thread_active is set to true after the thread is created; could this cause a race? listen_thread_active might also need to be declared as volatile.
> >>
> >> Regards
> >> --Nysal
> >>
> >> On Sun, Nov 8, 2015 at 10:38 PM, George Bosilca <bosi...@icl.utk.edu> wrote:
> >> We had a power outage last week and the local disks on our cluster were wiped out. My tester was in there. But I can rewrite it after SC.
> >>
> >> George.
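For reference, the reordering Nysal suggests (publish listen_thread_active before the listener thread can observe it, and make the flag safe for cross-thread access) can be sketched as below. This is an illustrative stand-in only, not the actual PMIx patch: the struct, the listener body, and start_listening() are simplified placeholders for the pmix_server_globals / listen_thread / pmix_start_listening code quoted above.

/* Illustrative sketch only - not the actual PMIx fix. The names mirror the
 * snippets quoted above, but everything else is a simplified stand-in. */
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdio.h>
#include <unistd.h>

static struct {
    atomic_bool listen_thread_active;   /* a plain, late-set bool is what races */
    int listen_socket;
} server_globals;

static void *listen_thread(void *arg)
{
    (void)arg;
    /* If this flag is still false on the first check, the thread exits at
     * once and nobody ever calls accept() - the hang described above. */
    while (atomic_load(&server_globals.listen_thread_active)) {
        /* the real code runs its select()/accept() loop here */
        usleep(1000);
    }
    return NULL;
}

static int start_listening(pthread_t *engine)
{
    /* Publish the flag BEFORE creating the thread ... */
    atomic_store(&server_globals.listen_thread_active, true);

    if (0 != pthread_create(engine, NULL, listen_thread, NULL)) {
        /* ... and roll it back if the thread never started. */
        atomic_store(&server_globals.listen_thread_active, false);
        return -1;
    }
    return 0;
}

int main(void)
{
    pthread_t engine;
    if (0 != start_listening(&engine)) {
        fprintf(stderr, "failed to start listener\n");
        return 1;
    }
    sleep(1);
    atomic_store(&server_globals.listen_thread_active, false);  /* shut down */
    pthread_join(engine, NULL);
    return 0;
}

An alternative with the same effect would be to have the thread wait on a condition variable until the parent finishes its startup bookkeeping; either way, the point is that the flag must already read as true by the time the thread performs its first check, and (as Nysal notes) it must not be a plain cached bool.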
> >> On Sat, Nov 7, 2015 at 12:04 PM, Ralph Castain <r...@open-mpi.org> wrote:
> >> Could you send me your stress test? I'm wondering if it is just something about how we set socket options.
> >>
> >>> On Nov 7, 2015, at 8:58 AM, George Bosilca <bosi...@icl.utk.edu> wrote:
> >>>
> >>> I had to postpone this until after SC. However, I ran a stress test of UDS for 3 days, reproducing the opening and sending of data (what Ralph said in his email), and I never could get a deadlock.
> >>>
> >>> George.
> >>>
> >>> On Sat, Nov 7, 2015 at 11:26 AM, Ralph Castain <r...@open-mpi.org> wrote:
> >>> George was looking into it, but I don't know if he has had time recently to continue the investigation. We understand "what" is happening (accept sometimes ignores the connection), but we don't yet know "why". I've done some digging around the web, and found that sometimes you can try to talk to a Unix Domain Socket too quickly - i.e., you open it and then send to it, but the OS hasn't yet set it up. In those cases, you can hang the socket. However, I've tried adding some artificial delay, and while it helped, it didn't completely solve the problem.
> >>>
> >>> I have an idea for a workaround (set a timer and retry after a while), but would obviously prefer a real solution. I'm not even sure it will work, as it is unclear that the server (which is the one hung in accept) will break free if the client closes the socket and retries.
> >>>
> >>>> On Nov 6, 2015, at 10:53 PM, Artem Polyakov <artpo...@gmail.com> wrote:
> >>>>
> >>>> Hello, is there any progress on this topic? This affects our PMIx measurements.
> >>>>
> >>>> 2015-10-30 21:21 GMT+06:00 Ralph Castain <r...@open-mpi.org>:
> >>>> I've verified that the orte/util/listener thread is not being started, so I don't think it should be involved in this problem.
> >>>>
> >>>> HTH
> >>>> Ralph
> >>>>
> >>>>> On Oct 30, 2015, at 8:07 AM, Ralph Castain <r...@open-mpi.org> wrote:
> >>>>>
> >>>>> Hmmm…there is a hook that would allow the PMIx server to utilize that listener thread, but we aren't currently using it. Each daemon plus mpirun will call orte_start_listener, but nothing is currently registering, and so the listener in that code is supposed to just return without starting the thread.
> >>>>>
> >>>>> So the only listener thread that should exist is the one inside the PMIx server itself. If something else is happening, then that would be a bug. I can look at the orte listener code to ensure that the thread isn't incorrectly starting.
> >>>>>
> >>>>>> On Oct 29, 2015, at 10:03 PM, George Bosilca <bosi...@icl.utk.edu> wrote:
> >>>>>>
> >>>>>> Some progress that puzzles me, but it might help you understand. Once the deadlock appears, if I manually kill the MPI process on the node where the deadlock was created, the local orte daemon doesn't notice and will just keep waiting.
> >>>>>>
> >>>>>> Quick question: I am under the impression that the issue is not in the PMIx server but somewhere around the listener_thread_fn in orte/util/listener.c. Possible?
> >>>>>>
> >>>>>> George.
> >>>>>>
> >>>>>> On Wed, Oct 28, 2015 at 3:56 AM, Ralph Castain <r...@open-mpi.org> wrote:
> >>>>>> Should have also clarified: the prior fixes are indeed in the current master.
> >>>>>>
> >>>>>>> On Oct 28, 2015, at 12:42 AM, Ralph Castain <r...@open-mpi.org> wrote:
> >>>>>>>
> >>>>>>> Nope - I was wrong.
> >>>>>>> The correction on the client side consisted of attempting to time out if the blocking recv failed. We then modified the blocking send/recv so they would handle errors.
> >>>>>>>
> >>>>>>> So that problem occurred -after- the server had correctly called accept. The listener code is in opal/mca/pmix/pmix1xx/pmix/src/server/pmix_server_listener.c
> >>>>>>>
> >>>>>>> It looks to me like the only way we could drop the accept (assuming the OS doesn't lose it) is if the file descriptor lies outside the expected range once we fall out of select:
> >>>>>>>
> >>>>>>>     /* Spin accepting connections until all active listen sockets
> >>>>>>>      * do not have any incoming connections, pushing each connection
> >>>>>>>      * onto the event queue for processing
> >>>>>>>      */
> >>>>>>>     do {
> >>>>>>>         accepted_connections = 0;
> >>>>>>>         /* according to the man pages, select replaces the given descriptor
> >>>>>>>          * set with a subset consisting of those descriptors that are ready
> >>>>>>>          * for the specified operation - in this case, a read. So we need to
> >>>>>>>          * first check to see if this file descriptor is included in the
> >>>>>>>          * returned subset
> >>>>>>>          */
> >>>>>>>         if (0 == FD_ISSET(pmix_server_globals.listen_socket, &readfds)) {
> >>>>>>>             /* this descriptor is not included */
> >>>>>>>             continue;
> >>>>>>>         }
> >>>>>>>
> >>>>>>>         /* this descriptor is ready to be read, which means a connection
> >>>>>>>          * request has been received - so harvest it. All we want to do
> >>>>>>>          * here is accept the connection and push the info onto the event
> >>>>>>>          * library for subsequent processing - we don't want to actually
> >>>>>>>          * process the connection here as it takes too long, and so the
> >>>>>>>          * OS might start rejecting connections due to timeout.
> >>>>>>>          */
> >>>>>>>         pending_connection = PMIX_NEW(pmix_pending_connection_t);
> >>>>>>>         event_assign(&pending_connection->ev, pmix_globals.evbase, -1,
> >>>>>>>                      EV_WRITE, connection_handler, pending_connection);
> >>>>>>>         pending_connection->sd = accept(pmix_server_globals.listen_socket,
> >>>>>>>                                         (struct sockaddr*)&(pending_connection->addr),
> >>>>>>>                                         &addrlen);
> >>>>>>>         if (pending_connection->sd < 0) {
> >>>>>>>             PMIX_RELEASE(pending_connection);
> >>>>>>>             if (pmix_socket_errno != EAGAIN ||
> >>>>>>>                 pmix_socket_errno != EWOULDBLOCK) {
> >>>>>>>                 if (EMFILE == pmix_socket_errno) {
> >>>>>>>                     PMIX_ERROR_LOG(PMIX_ERR_OUT_OF_RESOURCE);
> >>>>>>>                 } else {
> >>>>>>>                     pmix_output(0, "listen_thread: accept() failed: %s (%d).",
> >>>>>>>                                 strerror(pmix_socket_errno), pmix_socket_errno);
> >>>>>>>                 }
> >>>>>>>                 goto done;
> >>>>>>>             }
> >>>>>>>             continue;
> >>>>>>>         }
> >>>>>>>
> >>>>>>>         pmix_output_verbose(8, pmix_globals.debug_output,
> >>>>>>>                             "listen_thread: new connection: (%d, %d)",
> >>>>>>>                             pending_connection->sd, pmix_socket_errno);
> >>>>>>>         /* activate the event */
> >>>>>>>         event_active(&pending_connection->ev, EV_WRITE, 1);
> >>>>>>>         accepted_connections++;
> >>>>>>>     } while (accepted_connections > 0);
> >>>>>>>
> >>>>>>>> On Oct 28, 2015, at 12:25 AM, Ralph Castain <r...@open-mpi.org> wrote:
> >>>>>>>>
> >>>>>>>> Looking at the code, it appears that a fix was committed for this problem, and that we correctly resolved the issue found by Paul. The problem is that the fix didn't get upstreamed, and so it was lost the next time we refreshed PMIx. Sigh.
> >>>>>>>>
> >>>>>>>> Let me try to recreate the fix and have you take a gander at it.
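The client-side correction Ralph describes (time out instead of cycling forever when the blocking recv keeps reporting "Resource temporarily unavailable", as in the client log above) could look roughly like the sketch below. This is an assumption-laden illustration, not the code that was actually committed; recv_blocking_with_timeout and its deadline handling are invented for this example.

/* Sketch of a blocking recv that retries on EAGAIN/EWOULDBLOCK but gives up
 * after an overall deadline instead of spinning forever. Illustrative only. */
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <time.h>
#include <unistd.h>

/* Returns true if 'size' bytes arrived before 'timeout_sec' elapsed. */
static bool recv_blocking_with_timeout(int sd, char *data, size_t size,
                                       int timeout_sec)
{
    size_t cnt = 0;
    time_t deadline = time(NULL) + timeout_sec;

    while (cnt < size) {
        ssize_t n = recv(sd, data + cnt, size - cnt, 0);
        if (n > 0) {
            cnt += (size_t)n;
            continue;
        }
        if (0 == n) {
            return false;                  /* peer closed the connection */
        }
        if (EAGAIN == errno || EWOULDBLOCK == errno || EINTR == errno) {
            if (time(NULL) >= deadline) {
                return false;              /* give up instead of cycling */
            }
            usleep(1000);                  /* brief pause, then retry */
            continue;
        }
        return false;                      /* hard error */
    }
    return true;
}

A guard like this would let the caller fall back to closing the socket and reconnecting, which is the retry workaround mentioned earlier in the thread; it does not, of course, address why the server fails to accept in the first place.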
> >>>>>>>>
> >>>>>>>>> On Oct 28, 2015, at 12:22 AM, Ralph Castain <r...@open-mpi.org> wrote:
> >>>>>>>>>
> >>>>>>>>> Here is the discussion - afraid it is fairly lengthy. Ignore the hwloc references in it as that was a separate issue:
> >>>>>>>>>
> >>>>>>>>> http://www.open-mpi.org/community/lists/devel/2015/09/18074.php
> >>>>>>>>>
> >>>>>>>>> It definitely sounds like the same issue creeping in again. I'd appreciate any thoughts on how to correct it. If it helps, you could look at the PMIx master - there are standalone tests in the test/simple directory that fork/exec a child and just do the connection.
> >>>>>>>>>
> >>>>>>>>> https://github.com/pmix/master
> >>>>>>>>>
> >>>>>>>>> The test server is simptest.c - it will spawn a single copy of simpclient.c by default.
> >>>>>>>>>
> >>>>>>>>>> On Oct 27, 2015, at 10:14 PM, George Bosilca <bosi...@icl.utk.edu> wrote:
> >>>>>>>>>>
> >>>>>>>>>> Interesting. Do you have a pointer to the commit (or/and to the discussion)?
> >>>>>>>>>>
> >>>>>>>>>> I looked at the PMIx code and identified a few issues, but unfortunately none of them seems to fix the problem for good. However, now I need more than 1000 runs to get a deadlock (instead of a few tens).
> >>>>>>>>>>
> >>>>>>>>>> Looking with "netstat -ax" at the status of the UDS while the processes are deadlocked, I see 2 UDS with the same name: one from the server, which is in LISTEN state, and one from the client, which is in CONNECTING state (while the client already sent a message on the socket and is now waiting in a blocking receive). This somehow suggests that the server has not yet called accept on the UDS. Unfortunately, there are 3 threads all doing different flavors of event_base and select, so I have a hard time tracking the path of the UDS on the server side.
> >>>>>>>>>>
> >>>>>>>>>> So in order to validate my assumption I wrote a minimalistic UDS client and server application and tried different scenarios. The conclusion is that in order to see the same type of output from "netstat -ax" I have to call listen on the server, connect on the client, and not call accept on the server.
> >>>>>>>>>>
> >>>>>>>>>> At the same time I also confirmed that the UDS holds the data sent, so there is no need for further synchronization for the case where the data is sent first. We only need to find out how the server forgets to call accept.
> >>>>>>>>>>
> >>>>>>>>>> George.
> >>>>>>>>>>
> >>>>>>>>>> On Tue, Oct 27, 2015 at 7:52 PM, Ralph Castain <r...@open-mpi.org> wrote:
> >>>>>>>>>> Hmmm…this looks like it might be that problem we previously saw where the blocking recv hangs in a proc when the blocking send tries to send before the domain socket is actually ready, and so the send fails on the other end. As I recall, it was something to do with the socket options - and then Paul had a problem on some of his machines, and we backed it out?
> >>>>>>>>>>
> >>>>>>>>>> I wonder if that's what is biting us here again, and whether what we need is to either remove the blocking send/recvs altogether, or figure out a way to wait until the socket is really ready.
> >>>>>>>>>>
> >>>>>>>>>> Any thoughts?
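George's "netstat -ax" validation experiment described above (listen on the server, connect from the client, never call accept) is easy to reproduce standalone; a minimal sketch is below. The socket path is an assumption invented for the example, but the key behavior it demonstrates - connect() and send() succeed and the data stays queued even though accept() is never called - is standard Unix domain socket semantics.

/* Minimal sketch of the probe George describes: a listener that never
 * accepts, and a client that connects and sends anyway. Illustrative only. */
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/un.h>
#include <unistd.h>

#define SOCK_PATH "/tmp/uds_accept_probe"   /* hypothetical scratch path */

int main(void)
{
    struct sockaddr_un addr;
    memset(&addr, 0, sizeof(addr));
    addr.sun_family = AF_UNIX;
    strncpy(addr.sun_path, SOCK_PATH, sizeof(addr.sun_path) - 1);
    unlink(SOCK_PATH);

    /* "server": bind and listen, but deliberately never accept() */
    int srv = socket(AF_UNIX, SOCK_STREAM, 0);
    if (srv < 0 || bind(srv, (struct sockaddr *)&addr, sizeof(addr)) < 0 ||
        listen(srv, 8) < 0) {
        perror("server setup");
        return 1;
    }

    /* "client": connect and send - the connection sits in the backlog and
     * the data is queued, so nothing is lost by sending before accept() */
    int cli = socket(AF_UNIX, SOCK_STREAM, 0);
    if (cli < 0 || connect(cli, (struct sockaddr *)&addr, sizeof(addr)) < 0) {
        perror("client connect");
        return 1;
    }
    if (send(cli, "hello", 5, 0) < 0) {
        perror("client send");
        return 1;
    }

    printf("connected and sent; inspect with 'netstat -ax | grep %s'\n", SOCK_PATH);
    sleep(60);          /* keep the sockets open so they can be inspected */

    close(cli);
    close(srv);
    unlink(SOCK_PATH);
    return 0;
}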
> >>>>>>>>>>> On Oct 27, 2015, at 4:11 PM, George Bosilca <bosi...@icl.utk.edu> wrote:
> >>>>>>>>>>>
> >>>>>>>>>>> It appears the branch solves the problem at least partially. I asked one of my students to hammer it pretty badly, and he reported that the deadlocks still occur. He also graciously provided some stack traces:
> >>>>>>>>>>>
> >>>>>>>>>>> #0  0x00007f4bd5274aed in nanosleep () from /lib64/libc.so.6
> >>>>>>>>>>> #1  0x00007f4bd52a9c94 in usleep () from /lib64/libc.so.6
> >>>>>>>>>>> #2  0x00007f4bd2e42b00 in OPAL_PMIX_PMIX1XX_PMIx_Fence (procs=0x0, nprocs=0, info=0x7fff3c561960, ninfo=1) at src/client/pmix_client_fence.c:100
> >>>>>>>>>>> #3  0x00007f4bd306e6d2 in pmix1_fence (procs=0x0, collect_data=1) at pmix1_client.c:306
> >>>>>>>>>>> #4  0x00007f4bd57d5cc3 in ompi_mpi_init (argc=3, argv=0x7fff3c561ea8, requested=3, provided=0x7fff3c561d84) at runtime/ompi_mpi_init.c:644
> >>>>>>>>>>> #5  0x00007f4bd5813399 in PMPI_Init_thread (argc=0x7fff3c561d7c, argv=0x7fff3c561d70, required=3, provided=0x7fff3c561d84) at pinit_thread.c:69
> >>>>>>>>>>> #6  0x0000000000401516 in main (argc=3, argv=0x7fff3c561ea8) at osu_mbw_mr.c:86
> >>>>>>>>>>>
> >>>>>>>>>>> And another process:
> >>>>>>>>>>>
> >>>>>>>>>>> #0  0x00007f7b9d7d8bdc in recv () from /lib64/libpthread.so.0
> >>>>>>>>>>> #1  0x00007f7b9b0aa42d in opal_pmix_pmix1xx_pmix_usock_recv_blocking (sd=13, data=0x7ffd62139004 "", size=4) at src/usock/usock.c:168
> >>>>>>>>>>> #2  0x00007f7b9b0af5d9 in recv_connect_ack (sd=13) at src/client/pmix_client.c:844
> >>>>>>>>>>> #3  0x00007f7b9b0b085e in usock_connect (addr=0x7ffd62139330) at src/client/pmix_client.c:1110
> >>>>>>>>>>> #4  0x00007f7b9b0acc24 in connect_to_server (address=0x7ffd62139330, cbdata=0x7ffd621390e0) at src/client/pmix_client.c:181
> >>>>>>>>>>> #5  0x00007f7b9b0ad569 in OPAL_PMIX_PMIX1XX_PMIx_Init (proc=0x7f7b9b4e9b60) at src/client/pmix_client.c:362
> >>>>>>>>>>> #6  0x00007f7b9b2dbd9d in pmix1_client_init () at pmix1_client.c:99
> >>>>>>>>>>> #7  0x00007f7b9b4eb95f in pmi_component_query (module=0x7ffd62139490, priority=0x7ffd6213948c) at ess_pmi_component.c:90
> >>>>>>>>>>> #8  0x00007f7b9ce70ec5 in mca_base_select (type_name=0x7f7b9d20e059 "ess", output_id=-1, components_available=0x7f7b9d431eb0, best_module=0x7ffd621394d0, best_component=0x7ffd621394d8, priority_out=0x0) at mca_base_components_select.c:77
> >>>>>>>>>>> #9  0x00007f7b9d1a956b in orte_ess_base_select () at base/ess_base_select.c:40
> >>>>>>>>>>> #10 0x00007f7b9d160449 in orte_init (pargc=0x0, pargv=0x0, flags=32) at runtime/orte_init.c:219
> >>>>>>>>>>> #11 0x00007f7b9da4377a in ompi_mpi_init (argc=3, argv=0x7ffd621397f8, requested=3, provided=0x7ffd621396d4) at runtime/ompi_mpi_init.c:488
> >>>>>>>>>>> #12 0x00007f7b9da81399 in PMPI_Init_thread (argc=0x7ffd621396cc, argv=0x7ffd621396c0, required=3, provided=0x7ffd621396d4) at pinit_thread.c:69
> >>>>>>>>>>> #13 0x0000000000401516 in main (argc=3, argv=0x7ffd621397f8) at osu_mbw_mr.c:86
> >>>>>>>>>>>
> >>>>>>>>>>> George.
> >>>>>>>>>>>
> >>>>>>>>>>> On Tue, Oct 27, 2015 at 2:36 PM, Ralph Castain <r...@open-mpi.org> wrote:
> >>>>>>>>>>> I haven't been able to replicate this when using the branch in this PR:
> >>>>>>>>>>>
> >>>>>>>>>>> https://github.com/open-mpi/ompi/pull/1073
> >>>>>>>>>>>
> >>>>>>>>>>> Would you mind giving it a try? It fixes some other race conditions and might pick this one up too.
> >>>>>>>>>>>
> >>>>>>>>>>>> On Oct 27, 2015, at 10:04 AM, Ralph Castain <r...@open-mpi.org> wrote:
> >>>>>>>>>>>>
> >>>>>>>>>>>> Okay, I'll take a look - I've been chasing a race condition that might be related.
> >>>>>>>>>>>>
> >>>>>>>>>>>>> On Oct 27, 2015, at 9:54 AM, George Bosilca <bosi...@icl.utk.edu> wrote:
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> No, it's using 2 nodes.
> >>>>>>>>>>>>> George.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> On Tue, Oct 27, 2015 at 12:35 PM, Ralph Castain <r...@open-mpi.org> wrote:
> >>>>>>>>>>>>> Is this on a single node?
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>> On Oct 27, 2015, at 9:25 AM, George Bosilca <bosi...@icl.utk.edu> wrote:
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> I get intermittent deadlocks with the latest trunk. The smallest reproducer is a shell for loop around a small (2-process), short (20-second) MPI application. After a few tens of iterations, MPI_Init will deadlock with the following backtrace:
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> #0  0x00007fa94b5d9aed in nanosleep () from /lib64/libc.so.6
> >>>>>>>>>>>>>> #1  0x00007fa94b60ec94 in usleep () from /lib64/libc.so.6
> >>>>>>>>>>>>>> #2  0x00007fa94960ba08 in OPAL_PMIX_PMIX1XX_PMIx_Fence (procs=0x0, nprocs=0, info=0x7ffd7934fb90, ninfo=1) at src/client/pmix_client_fence.c:100
> >>>>>>>>>>>>>> #3  0x00007fa9498376a2 in pmix1_fence (procs=0x0, collect_data=1) at pmix1_client.c:305
> >>>>>>>>>>>>>> #4  0x00007fa94bb39ba4 in ompi_mpi_init (argc=3, argv=0x7ffd793500a8, requested=3, provided=0x7ffd7934ff94) at runtime/ompi_mpi_init.c:645
> >>>>>>>>>>>>>> #5  0x00007fa94bb77281 in PMPI_Init_thread (argc=0x7ffd7934ff8c, argv=0x7ffd7934ff80, required=3, provided=0x7ffd7934ff94) at pinit_thread.c:69
> >>>>>>>>>>>>>> #6  0x000000000040150f in main (argc=3, argv=0x7ffd793500a8) at osu_mbw_mr.c:86
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> On my machines this is reproducible at 100% after anywhere between 50 and 100 iterations.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> Thanks,
> >>>>>>>>>>>>>> George.