Clearly Nysal has a valid point there. I launched a stress test with Nysal's suggestion in the code, and so far it is up to a few hundred iterations without a deadlock. I would not claim victory yet; I have launched a 10k-iteration cycle to see where we stand (btw, this never passed before). I'll let you know the outcome.
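For reference, the driver is essentially a loop of the following shape (a hypothetical sketch, not the exact harness; "mpirun", the "timeout" utility, and "./pmix_client" are assumptions):

    /* Hypothetical stress driver: launch a short 2-process run repeatedly
     * and stop at the first iteration that hangs or fails. */
    #include <stdio.h>
    #include <stdlib.h>

    int main(void)
    {
        for (int i = 0; i < 10000; i++) {
            /* kill the run if it has not finished within 60 seconds */
            int rc = system("timeout 60 mpirun -np 2 ./pmix_client");
            if (0 != rc) {
                fprintf(stderr, "iteration %d hung or failed (rc=%d)\n", i, rc);
                return 1;
            }
        }
        puts("all 10000 iterations completed cleanly");
        return 0;
    }

Any iteration that trips the 60-second timeout is a hang one can attach gdb to.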
George.

On Mon, Nov 9, 2015 at 11:55 AM, Artem Polyakov <[email protected]> wrote:
> 2015-11-09 22:42 GMT+06:00 Artem Polyakov <[email protected]>:
>> This is a very good point, Nysal!
>>
>> This is definitely a problem, and I can say even more: on average 3 out of every 10 tasks were affected by this bug. Once the PR (https://github.com/pmix/master/pull/8) was applied I was able to run 100 test tasks without any hangs.
>>
>> Here is some more information on my symptoms. I was observing this without OMPI, just running the pmix_client test binary from the PMIx test suite with the SLURM PMIx plugin. Periodically the application was hanging. Investigation shows that not all processes are able to initialize correctly. Here is how such a client's backtrace looks:
>
> P.S. I think that this backtrace may be relevant to George's problem as well. In my case not all of the processes were hanging in connect_to_server; most of them were able to move forward and reach the Fence. George, was the backtrace that you posted the same on both processes, or was it a "random" one from one of them?
>
>> (gdb) bt
>> #0  0x00007f1448f1b7eb in recv () from /lib/x86_64-linux-gnu/libpthread.so.0
>> #1  0x00007f144914c191 in pmix_usock_recv_blocking (sd=9, data=0x7fff367f7c64 "", size=4) at src/usock/usock.c:166
>> #2  0x00007f1449152d18 in recv_connect_ack (sd=9) at src/client/pmix_client.c:837
>> #3  0x00007f14491546bf in usock_connect (addr=0x7fff367f7d60) at src/client/pmix_client.c:1103
>> #4  0x00007f144914f94c in connect_to_server (address=0x7fff367f7d60, cbdata=0x7fff367f7dd0) at src/client/pmix_client.c:179
>> #5  0x00007f1449150421 in PMIx_Init (proc=0x7fff367f81d0) at src/client/pmix_client.c:355
>> #6  0x0000000000401b97 in main (argc=9, argv=0x7fff367f83d8) at pmix_client.c:62
>>
>> The server-side debug has the following lines at the end of the file:
>> [cn33:00482] pmix:server register client slurm.pmix.22.0:10
>> [cn33:00482] pmix:server _register_client for nspace slurm.pmix.22.0 rank 10
>> [cn33:00482] pmix:server setup_fork for nspace slurm.pmix.22.0 rank 10
>>
>> In normal operation the following lines should appear after the lines above:
>> ....
>> [cn33:00188] listen_thread: new connection: (26, 0)
>> [cn33:00188] connection_handler: new connection: 26
>> [cn33:00188] RECV CONNECT ACK FROM PEER ON SOCKET 26
>> [cn33:00188] waiting for blocking recv of 16 bytes
>> [cn33:00188] blocking receive complete from remote
>> ....
>>
>> At the client side I see the following lines:
>> [cn33:00491] usock_peer_try_connect: attempting to connect to server
>> [cn33:00491] usock_peer_try_connect: attempting to connect to server on socket 10
>> [cn33:00491] pmix: SEND CONNECT ACK
>> [cn33:00491] sec: native create_cred
>> [cn33:00491] sec: using credential 1000:1000
>> [cn33:00491] send blocking of 54 bytes to socket 10
>> [cn33:00491] blocking send complete to socket 10
>> [cn33:00491] pmix: RECV CONNECT ACK FROM SERVER
>> [cn33:00491] waiting for blocking recv of 4 bytes
>> [cn33:00491] blocking_recv received error 11:Resource temporarily unavailable from remote - cycling
>> [cn33:00491] blocking_recv received error 11:Resource temporarily unavailable from remote - cycling
>> [... repeated many times ...]
>>
>> With the fix for the problem highlighted by Nysal, everything runs cleanly.
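The "cycling" lines above come from a blocking receive that retries on EAGAIN; in rough outline (a simplified sketch, not the actual pmix_usock_recv_blocking source):

    /* Simplified sketch of a blocking receive that "cycles" on EAGAIN,
     * as in the client log above (not the actual PMIx source). */
    #include <errno.h>
    #include <stdbool.h>
    #include <stddef.h>
    #include <sys/types.h>
    #include <sys/socket.h>

    static bool recv_blocking_sketch(int sd, char *data, size_t size)
    {
        size_t cnt = 0;
        while (cnt < size) {
            ssize_t n = recv(sd, data + cnt, size - cnt, 0);
            if (n < 0) {
                if (EAGAIN == errno || EWOULDBLOCK == errno) {
                    continue;   /* "Resource temporarily unavailable - cycling" */
                }
                return false;   /* hard error */
            }
            if (0 == n) {
                return false;   /* peer closed the socket */
            }
            cnt += (size_t)n;
        }
        return true;
    }

If the server never accepts the connection, no data ever arrives on the socket and this loop spins forever, which is exactly the hang captured in the backtrace.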
>>
>> 2015-11-09 10:53 GMT+06:00 Nysal Jan K A <[email protected]>:
>>> In listen_thread():
>>>  194     while (pmix_server_globals.listen_thread_active) {
>>>  195         FD_ZERO(&readfds);
>>>  196         FD_SET(pmix_server_globals.listen_socket, &readfds);
>>>  197         max = pmix_server_globals.listen_socket;
>>>
>>> Is it possible that pmix_server_globals.listen_thread_active can be false, in which case the thread just exits and never calls accept()?
>>>
>>> In pmix_start_listening():
>>>  147     /* fork off the listener thread */
>>>  148     if (0 > pthread_create(&engine, NULL, listen_thread, NULL)) {
>>>  149         return PMIX_ERROR;
>>>  150     }
>>>  151     pmix_server_globals.listen_thread_active = true;
>>>
>>> pmix_server_globals.listen_thread_active is set to true after the thread is created; could this cause a race? listen_thread_active might also need to be declared volatile.
>>>
>>> Regards
>>> --Nysal
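For illustration, Nysal's suggested reordering amounts to something like this (a minimal sketch, not the actual patch from the PR; names and return codes are assumptions):

    /* Minimal sketch of Nysal's suggestion: publish the flag before the
     * thread exists, and roll it back if creation fails. */
    #include <pthread.h>
    #include <stdbool.h>
    #include <stddef.h>

    static volatile bool listen_thread_active = false;  /* volatile, per Nysal */

    static void *listen_thread(void *arg)
    {
        (void)arg;   /* stub standing in for the real select/accept loop */
        return NULL;
    }

    int start_listening_sketch(pthread_t *engine)
    {
        listen_thread_active = true;          /* set BEFORE pthread_create */
        if (0 > pthread_create(engine, NULL, listen_thread, NULL)) {
            listen_thread_active = false;     /* creation failed: roll back */
            return -1;                        /* PMIX_ERROR in the real code */
        }
        return 0;
    }

pthread_create itself synchronizes memory between the creating and the new thread, so publishing the flag first is the essential change; the volatile (or a C11 atomic_bool) matters for later updates, such as when another thread clears the flag to stop the listener.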
>>>
>>> On Sun, Nov 8, 2015 at 10:38 PM, George Bosilca <[email protected]> wrote:
>>>> We had a power outage last week and the local disks on our cluster were wiped out. My tester was in there. But I can rewrite it after SC.
>>>>
>>>> George.
>>>>
>>>> On Sat, Nov 7, 2015 at 12:04 PM, Ralph Castain <[email protected]> wrote:
>>>>> Could you send me your stress test? I'm wondering if it is just something about how we set socket options.
>>>>>
>>>>> On Nov 7, 2015, at 8:58 AM, George Bosilca <[email protected]> wrote:
>>>>> I had to postpone this until after SC. However, I ran a stress test of UDS for 3 days, reproducing the opening and sending of data (what Ralph described in his email), and I could never get a deadlock.
>>>>>
>>>>> George.
>>>>>
>>>>> On Sat, Nov 7, 2015 at 11:26 AM, Ralph Castain <[email protected]> wrote:
>>>>>> George was looking into it, but I don't know if he has had time recently to continue the investigation. We understand "what" is happening (accept sometimes ignores the connection), but we don't yet know "why". I've done some digging around the web, and found that sometimes you can try to talk to a Unix Domain Socket too quickly - i.e., you open it and then send to it, but the OS hasn't yet set it up. In those cases, you can hang the socket. However, I've tried adding some artificial delay, and while it helped, it didn't completely solve the problem.
>>>>>>
>>>>>> I have an idea for a workaround (set a timer and retry after a while), but would obviously prefer a real solution. I'm not even sure it will work, as it is unclear that the server (which is the one hung in accept) will break free if the client closes the socket and retries.
>>>>>>
>>>>>> On Nov 6, 2015, at 10:53 PM, Artem Polyakov <[email protected]> wrote:
>>>>>> Hello, is there any progress on this topic? This affects our PMIx measurements.
>>>>>>
>>>>>> 2015-10-30 21:21 GMT+06:00 Ralph Castain <[email protected]>:
>>>>>>> I've verified that the orte/util/listener thread is not being started, so I don't think it should be involved in this problem.
>>>>>>>
>>>>>>> HTH
>>>>>>> Ralph
>>>>>>>
>>>>>>> On Oct 30, 2015, at 8:07 AM, Ralph Castain <[email protected]> wrote:
>>>>>>> Hmmm…there is a hook that would allow the PMIx server to utilize that listener thread, but we aren't currently using it. Each daemon plus mpirun will call orte_start_listener, but nothing is currently registering, and so the listener in that code is supposed to just return without starting the thread.
>>>>>>>
>>>>>>> So the only listener thread that should exist is the one inside the PMIx server itself. If something else is happening, then that would be a bug. I can look at the orte listener code to ensure that the thread isn't incorrectly starting.
>>>>>>>
>>>>>>> On Oct 29, 2015, at 10:03 PM, George Bosilca <[email protected]> wrote:
>>>>>>> Some progress, which puzzles me but might help you understand. Once the deadlock appears, if I manually kill the MPI process on the node where the deadlock was created, the local orte daemon doesn't notice and will just keep waiting.
>>>>>>>
>>>>>>> Quick question: I am under the impression that the issue is not in the PMIx server but somewhere around the listener_thread_fn in orte/util/listener.c. Possible?
>>>>>>>
>>>>>>> George.
>>>>>>>
>>>>>>> On Wed, Oct 28, 2015 at 3:56 AM, Ralph Castain <[email protected]> wrote:
>>>>>>>> Should have also clarified: the prior fixes are indeed in the current master.
>>>>>>>>
>>>>>>>> On Oct 28, 2015, at 12:42 AM, Ralph Castain <[email protected]> wrote:
>>>>>>>> Nope - I was wrong. The correction on the client side consisted of attempting to time out if the blocking recv failed. We then modified the blocking send/recv so they would handle errors.
>>>>>>>>
>>>>>>>> So that problem occurred -after- the server had correctly called accept. The listener code is in opal/mca/pmix/pmix1xx/pmix/src/server/pmix_server_listener.c.
>>>>>>>>
>>>>>>>> It looks to me like the only way we could drop the accept (assuming the OS doesn't lose it) is if the file descriptor lies outside the expected range once we fall out of select:
>>>>>>>>
>>>>>>>>     /* Spin accepting connections until all active listen sockets
>>>>>>>>      * do not have any incoming connections, pushing each connection
>>>>>>>>      * onto the event queue for processing
>>>>>>>>      */
>>>>>>>>     do {
>>>>>>>>         accepted_connections = 0;
>>>>>>>>         /* according to the man pages, select replaces the given descriptor
>>>>>>>>          * set with a subset consisting of those descriptors that are ready
>>>>>>>>          * for the specified operation - in this case, a read. So we need to
>>>>>>>>          * first check to see if this file descriptor is included in the
>>>>>>>>          * returned subset
>>>>>>>>          */
>>>>>>>>         if (0 == FD_ISSET(pmix_server_globals.listen_socket, &readfds)) {
>>>>>>>>             /* this descriptor is not included */
>>>>>>>>             continue;
>>>>>>>>         }
>>>>>>>>
>>>>>>>>         /* this descriptor is ready to be read, which means a connection
>>>>>>>>          * request has been received - so harvest it. All we want to do
>>>>>>>>          * here is accept the connection and push the info onto the event
>>>>>>>>          * library for subsequent processing - we don't want to actually
>>>>>>>>          * process the connection here as it takes too long, and so the
>>>>>>>>          * OS might start rejecting connections due to timeout.
>>>>>>>>          */
>>>>>>>>         pending_connection = PMIX_NEW(pmix_pending_connection_t);
>>>>>>>>         event_assign(&pending_connection->ev, pmix_globals.evbase, -1,
>>>>>>>>                      EV_WRITE, connection_handler, pending_connection);
>>>>>>>>         pending_connection->sd = accept(pmix_server_globals.listen_socket,
>>>>>>>>                                         (struct sockaddr*)&(pending_connection->addr),
>>>>>>>>                                         &addrlen);
>>>>>>>>         if (pending_connection->sd < 0) {
>>>>>>>>             PMIX_RELEASE(pending_connection);
>>>>>>>>             if (pmix_socket_errno != EAGAIN ||
>>>>>>>>                 pmix_socket_errno != EWOULDBLOCK) {
>>>>>>>>                 if (EMFILE == pmix_socket_errno) {
>>>>>>>>                     PMIX_ERROR_LOG(PMIX_ERR_OUT_OF_RESOURCE);
>>>>>>>>                 } else {
>>>>>>>>                     pmix_output(0, "listen_thread: accept() failed: %s (%d).",
>>>>>>>>                                 strerror(pmix_socket_errno), pmix_socket_errno);
>>>>>>>>                 }
>>>>>>>>                 goto done;
>>>>>>>>             }
>>>>>>>>             continue;
>>>>>>>>         }
>>>>>>>>
>>>>>>>>         pmix_output_verbose(8, pmix_globals.debug_output,
>>>>>>>>                             "listen_thread: new connection: (%d, %d)",
>>>>>>>>                             pending_connection->sd, pmix_socket_errno);
>>>>>>>>         /* activate the event */
>>>>>>>>         event_active(&pending_connection->ev, EV_WRITE, 1);
>>>>>>>>         accepted_connections++;
>>>>>>>>     } while (accepted_connections > 0);
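One aside on the excerpt above: on platforms where EAGAIN and EWOULDBLOCK are distinct values, the test "pmix_socket_errno != EAGAIN || pmix_socket_errno != EWOULDBLOCK" is always true, so even a transient EAGAIN would take the goto done path and end the listener loop. The intended check is presumably the && form, roughly:

    /* Presumably intended errno test: fatal only when errno is NEITHER
     * EAGAIN nor EWOULDBLOCK (the quoted "||" form is always true when
     * the two constants differ). */
    #include <errno.h>
    #include <stdbool.h>

    static bool accept_errno_is_fatal(int err)
    {
        return (err != EAGAIN && err != EWOULDBLOCK);
    }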
>>>>>>>>
>>>>>>>> On Oct 28, 2015, at 12:25 AM, Ralph Castain <[email protected]> wrote:
>>>>>>>> Looking at the code, it appears that a fix was committed for this problem, and that we correctly resolved the issue found by Paul. The problem is that the fix didn't get upstreamed, and so it was lost the next time we refreshed PMIx. Sigh.
>>>>>>>>
>>>>>>>> Let me try to recreate the fix and have you take a gander at it.
>>>>>>>>
>>>>>>>> On Oct 28, 2015, at 12:22 AM, Ralph Castain <[email protected]> wrote:
>>>>>>>> Here is the discussion - afraid it is fairly lengthy. Ignore the hwloc references in it, as that was a separate issue:
>>>>>>>>
>>>>>>>> http://www.open-mpi.org/community/lists/devel/2015/09/18074.php
>>>>>>>>
>>>>>>>> It definitely sounds like the same issue creeping in again. I'd appreciate any thoughts on how to correct it. If it helps, you could look at the PMIx master - there are standalone tests in the test/simple directory that fork/exec a child and just do the connection.
>>>>>>>>
>>>>>>>> https://github.com/pmix/master
>>>>>>>>
>>>>>>>> The test server is simptest.c - it will spawn a single copy of simpclient.c by default.
>>>>>>>>
>>>>>>>> On Oct 27, 2015, at 10:14 PM, George Bosilca <[email protected]> wrote:
>>>>>>>> Interesting. Do you have a pointer to the commit (or/and to the discussion)?
>>>>>>>>
>>>>>>>> I looked at the PMIx code, and I have identified a few issues, but unfortunately none of them seems to fix the problem for good. However, now I need more than 1000 runs to get a deadlock (instead of a few tens).
>>>>>>>>
>>>>>>>> Looking with "netstat -ax" at the status of the UDS while the processes are deadlocked, I see 2 UDS with the same name: one from the server, which is in LISTEN state, and one from the client, which is in CONNECTING state (while the client has already sent a message into the socket and is now waiting in a blocking receive). This somehow suggests that the server has not yet called accept on the UDS. Unfortunately, there are 3 threads all doing different flavors of event_base and select, so I have a hard time tracking the path of the UDS on the server side.
>>>>>>>>
>>>>>>>> So in order to validate my assumption I wrote a minimalistic UDS client and server application and tried different scenarios. The conclusion is that in order to see the same type of output from "netstat -ax" I have to call listen on the server, connect on the client, and not call accept on the server.
>>>>>>>>
>>>>>>>> On the same occasion I also confirmed that the UDS holds the data sent, so there is no need for further synchronization for the case where the data is sent first. We only need to find out how the server forgets to call accept.
>>>>>>>>
>>>>>>>> George.
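A reconstruction of the kind of minimal experiment described above (hypothetical; the socket path is a placeholder): the server listens but never calls accept(), the client connects, sends, and then blocks in recv().

    /* Hypothetical reconstruction of the minimal UDS experiment: while
     * both sides run, "netstat -ax" shows one socket in LISTEN and one
     * in CONNECTING, matching the deadlocked processes. */
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>
    #include <sys/socket.h>
    #include <sys/un.h>

    #define SOCK_PATH "/tmp/uds-accept-test"   /* placeholder path */

    int main(int argc, char **argv)
    {
        struct sockaddr_un addr;
        memset(&addr, 0, sizeof(addr));
        addr.sun_family = AF_UNIX;
        strncpy(addr.sun_path, SOCK_PATH, sizeof(addr.sun_path) - 1);

        int sd = socket(AF_UNIX, SOCK_STREAM, 0);
        if (sd < 0) { perror("socket"); return 1; }

        if (argc > 1 && 0 == strcmp(argv[1], "server")) {
            unlink(SOCK_PATH);
            if (bind(sd, (struct sockaddr *)&addr, sizeof(addr)) < 0 ||
                listen(sd, 16) < 0) { perror("server"); return 1; }
            pause();                           /* deliberately never accept() */
        } else {
            if (connect(sd, (struct sockaddr *)&addr, sizeof(addr)) < 0) {
                perror("connect"); return 1;
            }
            send(sd, "ack", 3, 0);             /* succeeds: the kernel queues it */
            char buf[4];
            recv(sd, buf, sizeof(buf), 0);     /* blocks forever - no accept() */
        }
        return 0;
    }

Start one copy with the server argument, then run another with no argument: the client's send() completes even though accept() was never called, matching the observation that no extra synchronization is needed once the data is written.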
>>>>>>>>
>>>>>>>> On Tue, Oct 27, 2015 at 7:52 PM, Ralph Castain <[email protected]> wrote:
>>>>>>>>> Hmmm…this looks like it might be that problem we previously saw where the blocking recv hangs in a proc when the blocking send tries to send before the domain socket is actually ready, and so the send fails on the other end. As I recall, it was something to do with the socket options - and then Paul had a problem on some of his machines, and we backed it out?
>>>>>>>>>
>>>>>>>>> I wonder if that's what is biting us here again, and whether what we need is to either remove the blocking send/recvs altogether or figure out a way to wait until the socket is really ready.
>>>>>>>>>
>>>>>>>>> Any thoughts?
>>>>>>>>>
>>>>>>>>> On Oct 27, 2015, at 4:11 PM, George Bosilca <[email protected]> wrote:
>>>>>>>>> It appears the branch solves the problem at least partially. I asked one of my students to hammer it pretty badly, and he reported that the deadlocks still occur.
>>>>>>>>> He also graciously provided some stacktraces:
>>>>>>>>>
>>>>>>>>> #0  0x00007f4bd5274aed in nanosleep () from /lib64/libc.so.6
>>>>>>>>> #1  0x00007f4bd52a9c94 in usleep () from /lib64/libc.so.6
>>>>>>>>> #2  0x00007f4bd2e42b00 in OPAL_PMIX_PMIX1XX_PMIx_Fence (procs=0x0, nprocs=0, info=0x7fff3c561960, ninfo=1) at src/client/pmix_client_fence.c:100
>>>>>>>>> #3  0x00007f4bd306e6d2 in pmix1_fence (procs=0x0, collect_data=1) at pmix1_client.c:306
>>>>>>>>> #4  0x00007f4bd57d5cc3 in ompi_mpi_init (argc=3, argv=0x7fff3c561ea8, requested=3, provided=0x7fff3c561d84) at runtime/ompi_mpi_init.c:644
>>>>>>>>> #5  0x00007f4bd5813399 in PMPI_Init_thread (argc=0x7fff3c561d7c, argv=0x7fff3c561d70, required=3, provided=0x7fff3c561d84) at pinit_thread.c:69
>>>>>>>>> #6  0x0000000000401516 in main (argc=3, argv=0x7fff3c561ea8) at osu_mbw_mr.c:86
>>>>>>>>>
>>>>>>>>> And another process:
>>>>>>>>>
>>>>>>>>> #0  0x00007f7b9d7d8bdc in recv () from /lib64/libpthread.so.0
>>>>>>>>> #1  0x00007f7b9b0aa42d in opal_pmix_pmix1xx_pmix_usock_recv_blocking (sd=13, data=0x7ffd62139004 "", size=4) at src/usock/usock.c:168
>>>>>>>>> #2  0x00007f7b9b0af5d9 in recv_connect_ack (sd=13) at src/client/pmix_client.c:844
>>>>>>>>> #3  0x00007f7b9b0b085e in usock_connect (addr=0x7ffd62139330) at src/client/pmix_client.c:1110
>>>>>>>>> #4  0x00007f7b9b0acc24 in connect_to_server (address=0x7ffd62139330, cbdata=0x7ffd621390e0) at src/client/pmix_client.c:181
>>>>>>>>> #5  0x00007f7b9b0ad569 in OPAL_PMIX_PMIX1XX_PMIx_Init (proc=0x7f7b9b4e9b60) at src/client/pmix_client.c:362
>>>>>>>>> #6  0x00007f7b9b2dbd9d in pmix1_client_init () at pmix1_client.c:99
>>>>>>>>> #7  0x00007f7b9b4eb95f in pmi_component_query (module=0x7ffd62139490, priority=0x7ffd6213948c) at ess_pmi_component.c:90
>>>>>>>>> #8  0x00007f7b9ce70ec5 in mca_base_select (type_name=0x7f7b9d20e059 "ess", output_id=-1, components_available=0x7f7b9d431eb0, best_module=0x7ffd621394d0, best_component=0x7ffd621394d8, priority_out=0x0) at mca_base_components_select.c:77
>>>>>>>>> #9  0x00007f7b9d1a956b in orte_ess_base_select () at base/ess_base_select.c:40
>>>>>>>>> #10 0x00007f7b9d160449 in orte_init (pargc=0x0, pargv=0x0, flags=32) at runtime/orte_init.c:219
>>>>>>>>> #11 0x00007f7b9da4377a in ompi_mpi_init (argc=3, argv=0x7ffd621397f8, requested=3, provided=0x7ffd621396d4) at runtime/ompi_mpi_init.c:488
>>>>>>>>> #12 0x00007f7b9da81399 in PMPI_Init_thread (argc=0x7ffd621396cc, argv=0x7ffd621396c0, required=3, provided=0x7ffd621396d4) at pinit_thread.c:69
>>>>>>>>> #13 0x0000000000401516 in main (argc=3, argv=0x7ffd621397f8) at osu_mbw_mr.c:86
>>>>>>>>>
>>>>>>>>> George.
>>>>>>>>>
>>>>>>>>> On Tue, Oct 27, 2015 at 2:36 PM, Ralph Castain <[email protected]> wrote:
>>>>>>>>>> I haven't been able to replicate this when using the branch in this PR:
>>>>>>>>>>
>>>>>>>>>> https://github.com/open-mpi/ompi/pull/1073
>>>>>>>>>>
>>>>>>>>>> Would you mind giving it a try? It fixes some other race conditions and might pick this one up too.
>>>>>>>>>>
>>>>>>>>>> On Oct 27, 2015, at 10:04 AM, Ralph Castain <[email protected]> wrote:
>>>>>>>>>> Okay, I'll take a look - I've been chasing a race condition that might be related.
>>>>>>>>>>
>>>>>>>>>> On Oct 27, 2015, at 9:54 AM, George Bosilca <[email protected]> wrote:
>>>>>>>>>> No, it's using 2 nodes.
>>>>>>>>>>
>>>>>>>>>> George.
>>>>>>>>>>
>>>>>>>>>> On Tue, Oct 27, 2015 at 12:35 PM, Ralph Castain <[email protected]> wrote:
>>>>>>>>>>> Is this on a single node?
>>>>>>>>>>>
>>>>>>>>>>> On Oct 27, 2015, at 9:25 AM, George Bosilca <[email protected]> wrote:
>>>>>>>>>>> I get intermittent deadlocks with the latest trunk. The smallest reproducer is a shell for loop around a small (2 processes), short (20 seconds) MPI application. After a few tens of iterations, MPI_Init will deadlock with the following backtrace:
>>>>>>>>>>>
>>>>>>>>>>> #0  0x00007fa94b5d9aed in nanosleep () from /lib64/libc.so.6
>>>>>>>>>>> #1  0x00007fa94b60ec94 in usleep () from /lib64/libc.so.6
>>>>>>>>>>> #2  0x00007fa94960ba08 in OPAL_PMIX_PMIX1XX_PMIx_Fence (procs=0x0, nprocs=0, info=0x7ffd7934fb90, ninfo=1) at src/client/pmix_client_fence.c:100
>>>>>>>>>>> #3  0x00007fa9498376a2 in pmix1_fence (procs=0x0, collect_data=1) at pmix1_client.c:305
>>>>>>>>>>> #4  0x00007fa94bb39ba4 in ompi_mpi_init (argc=3, argv=0x7ffd793500a8, requested=3, provided=0x7ffd7934ff94) at runtime/ompi_mpi_init.c:645
>>>>>>>>>>> #5  0x00007fa94bb77281 in PMPI_Init_thread (argc=0x7ffd7934ff8c, argv=0x7ffd7934ff80, required=3, provided=0x7ffd7934ff94) at pinit_thread.c:69
>>>>>>>>>>> #6  0x000000000040150f in main (argc=3, argv=0x7ffd793500a8) at osu_mbw_mr.c:86
>>>>>>>>>>>
>>>>>>>>>>> On my machines this is reproducible at 100% after anywhere between 50 and 100 iterations.
>>>>>>>>>>>
>>>>>>>>>>> Thanks,
>>>>>>>>>>> George.
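As a side note on the Fence traces above: they sit in usleep because the PMIx 1.x client waits for completion by polling a flag, roughly like this (a simplified sketch with illustrative names, not the actual macro or source):

    /* Simplified sketch of the wait shown in the Fence frames: the
     * client spins on a completion flag with usleep. */
    #include <stdbool.h>
    #include <unistd.h>

    static volatile bool fence_active = true;  /* cleared by the progress thread */

    static void wait_for_fence_completion(void)
    {
        while (fence_active) {
            usleep(10);   /* hence the nanosleep/usleep frames above */
        }
    }

If a peer never gets past PMIx_Init, the fence is never completed and this loop spins forever, which is why one process hangs in recv() while the other hangs in usleep().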
