All 10k tests completed successfully. Nysal pinpointed the real problem behind the deadlocks. :+1:
George. On Mon, Nov 9, 2015 at 1:13 PM, Ralph Castain <r...@open-mpi.org> wrote: > Looking at it, I think I see what was happening. The thread would start, > but then immediately see that the active flag was false and would exit. > This left the server without any listening thread - but it wouldn’t detect > this had happened. It was therefore a race over whether the thread > checked the flag before the server set it. > > Thanks Nysal - I believe this should indeed fix the problem! > > > On Nov 9, 2015, at 9:04 AM, George Bosilca <bosi...@icl.utk.edu> wrote: > > Clearly Nysal has a valid point there. I launched a stress test with Nysal's > suggestion in the code, and so far it's up to a few hundred iterations > without a deadlock. I would not claim victory yet; I launched a 10k cycle to > see where we stand (btw this never passed before). > I'll let you know the outcome. > > George. > > > On Mon, Nov 9, 2015 at 11:55 AM, Artem Polyakov <artpo...@gmail.com> > wrote: > >> >> >> 2015-11-09 22:42 GMT+06:00 Artem Polyakov <artpo...@gmail.com>: >> >>> This is a very good point, Nysal! >>> >>> This is definitely a problem and I can say even more: on average 3 out of every >>> 10 tasks were affected by this bug. Once the PR ( >>> https://github.com/pmix/master/pull/8) was applied I was able to run >>> 100 testing tasks without any hangs. >>> >>> Here is some more information on my symptoms. I was observing this without >>> OMPI, just running the pmix_client test binary from the PMIx test suite with the SLURM >>> PMIx plugin. >>> Periodically the application was hanging. Investigation showed that not all >>> processes were able to initialize correctly. >>> Here is what such a client's backtrace looks like: >>> >> >> P.S. I think that this backtrace may be relevant to George's problem as >> well. In my case not all of the processes were hanging in >> connect_to_server; most of them were able to move forward and reach Fence. >> George, was the backtrace that you've posted the same on both processes, >> or was it a "random" one from one of them? >> >> >>> (gdb) bt >>> #0 0x00007f1448f1b7eb in recv () from >>> /lib/x86_64-linux-gnu/libpthread.so.0 >>> #1 0x00007f144914c191 in pmix_usock_recv_blocking (sd=9, >>> data=0x7fff367f7c64 "", size=4) at src/usock/usock.c:166 >>> #2 0x00007f1449152d18 in recv_connect_ack (sd=9) at >>> src/client/pmix_client.c:837 >>> #3 0x00007f14491546bf in usock_connect (addr=0x7fff367f7d60) at >>> src/client/pmix_client.c:1103 >>> #4 0x00007f144914f94c in connect_to_server (address=0x7fff367f7d60, >>> cbdata=0x7fff367f7dd0) at src/client/pmix_client.c:179 >>> #5 0x00007f1449150421 in PMIx_Init (proc=0x7fff367f81d0) at >>> src/client/pmix_client.c:355 >>> #6 0x0000000000401b97 in main (argc=9, argv=0x7fff367f83d8) at >>> pmix_client.c:62 >>> >>> >>> The server-side debug has the following lines at the end of the file: >>> [cn33:00482] pmix:server register client slurm.pmix.22.0:10 >>> [cn33:00482] pmix:server _register_client for nspace slurm.pmix.22.0 >>> rank 10 >>> [cn33:00482] pmix:server setup_fork for nspace slurm.pmix.22.0 rank 10 >>> >>> In normal operation the following lines should appear after the lines above: >>> .... >>> [cn33:00188] listen_thread: new connection: (26, 0) >>> [cn33:00188] connection_handler: new connection: 26 >>> [cn33:00188] RECV CONNECT ACK FROM PEER ON SOCKET 26 >>> [cn33:00188] waiting for blocking recv of 16 bytes >>> [cn33:00188] blocking receive complete from remote >>> ....
>>> >>> At the client side I see the following lines: >>> [cn33:00491] usock_peer_try_connect: attempting to connect to server >>> [cn33:00491] usock_peer_try_connect: attempting to connect to server on >>> socket 10 >>> [cn33:00491] pmix: SEND CONNECT ACK >>> [cn33:00491] sec: native create_cred >>> [cn33:00491] sec: using credential 1000:1000 >>> [cn33:00491] send blocking of 54 bytes to socket 10 >>> [cn33:00491] blocking send complete to socket 10 >>> [cn33:00491] pmix: RECV CONNECT ACK FROM SERVER >>> [cn33:00491] waiting for blocking recv of 4 bytes >>> [cn33:00491] blocking_recv received error 11:Resource temporarily >>> unavailable from remote - cycling >>> [cn33:00491] blocking_recv received error 11:Resource temporarily >>> unavailable from remote - cycling >>> [... repeated many times ...] >>> >>> With the fix for the problem highlighted by Nysal, everything runs cleanly. >>> >>> >>> 2015-11-09 10:53 GMT+06:00 Nysal Jan K A <jny...@gmail.com>: >>> >>>> In listen_thread(): >>>> 194 while (pmix_server_globals.listen_thread_active) { >>>> 195 FD_ZERO(&readfds); >>>> 196 FD_SET(pmix_server_globals.listen_socket, &readfds); >>>> 197 max = pmix_server_globals.listen_socket; >>>> >>>> Is it possible that pmix_server_globals.listen_thread_active can be >>>> false, in which case the thread just exits and will never call accept()? >>>> >>>> In pmix_start_listening(): >>>> 147 /* fork off the listener thread */ >>>> 148 if (0 > pthread_create(&engine, NULL, listen_thread, NULL)) >>>> { >>>> 149 return PMIX_ERROR; >>>> 150 } >>>> 151 pmix_server_globals.listen_thread_active = true; >>>> >>>> pmix_server_globals.listen_thread_active is set to true after the >>>> thread is created; could this cause a race? >>>> listen_thread_active might also need to be declared as volatile. >>>> >>>> Regards >>>> --Nysal >>>> >>>> On Sun, Nov 8, 2015 at 10:38 PM, George Bosilca <bosi...@icl.utk.edu> >>>> wrote: >>>> >>>>> We had a power outage last week and the local disks on our cluster >>>>> were wiped out. My tester was in there. But I can rewrite it after SC. >>>>> >>>>> George. >>>>> >>>>> On Sat, Nov 7, 2015 at 12:04 PM, Ralph Castain <r...@open-mpi.org> >>>>> wrote: >>>>> >>>>>> Could you send me your stress test? I’m wondering if it is just >>>>>> something about how we set socket options. >>>>>> >>>>>> >>>>>> On Nov 7, 2015, at 8:58 AM, George Bosilca <bosi...@icl.utk.edu> >>>>>> wrote: >>>>>> >>>>>> I had to postpone this until after SC. However, I ran a UDS stress test for 3 days, reproducing the opening and sending of data (what >>>>>> Ralph >>>>>> said in his email), and I never could get a deadlock. >>>>>> >>>>>> George. >>>>>> >>>>>> >>>>>> On Sat, Nov 7, 2015 at 11:26 AM, Ralph Castain <r...@open-mpi.org> >>>>>> wrote: >>>>>> >>>>>>> George was looking into it, but I don’t know if he has had time >>>>>>> recently to continue the investigation. We understand “what” is >>>>>>> happening >>>>>>> (accept sometimes ignores the connection), but we don’t yet know “why”. >>>>>>> I’ve done some digging around the web, and found that sometimes you can >>>>>>> try >>>>>>> to talk to a Unix Domain Socket too quickly - i.e., you open it and then >>>>>>> send to it, but the OS hasn’t yet set it up. In those cases, you can >>>>>>> hang >>>>>>> the socket. However, I’ve tried adding some artificial delay, and while >>>>>>> it >>>>>>> helped, it didn’t completely solve the problem.
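For readers following along: Nysal's observation is a textbook thread-startup race. pmix_start_listening() creates the listener thread first and only afterwards sets pmix_server_globals.listen_thread_active = true, so if the new thread is scheduled before the flag is flipped it evaluates the while() condition as false, returns immediately, and the server is left with nothing ever calling select()/accept(). Below is a minimal, self-contained sketch of the reordering fix; the identifiers mirror the quoted snippets, but this is an illustration, not the actual content of the pmix/master PR referenced above.

/* Illustrative sketch only - identifiers mirror the quoted PMIx snippets
 * (listen_thread, listen_thread_active, pmix_start_listening), not the
 * actual patch. Build with: cc race_fix.c -o race_fix -lpthread */
#include <pthread.h>
#include <stdbool.h>
#include <stdio.h>
#include <unistd.h>

static volatile bool listen_thread_active = false; /* re-read on every pass */
static pthread_t engine;

static void *listen_thread(void *arg)
{
    (void)arg;
    /* If this thread runs before the flag is set, the loop is never entered
     * and the thread silently exits - exactly the race Nysal describes. */
    while (listen_thread_active) {
        /* real code: FD_ZERO/FD_SET, select() on the listen socket, accept() */
        usleep(1000);
    }
    return NULL;
}

static int pmix_start_listening(void)
{
    /* The fix: publish the flag BEFORE forking off the listener thread. */
    listen_thread_active = true;
    if (0 != pthread_create(&engine, NULL, listen_thread, NULL)) {
        listen_thread_active = false;  /* roll back on failure */
        return -1;                     /* PMIX_ERROR in the real code */
    }
    return 0;                          /* PMIX_SUCCESS in the real code */
}

int main(void)
{
    if (0 != pmix_start_listening()) return 1;
    sleep(1);                          /* server runs; clients can connect */
    listen_thread_active = false;      /* request shutdown */
    pthread_join(engine, NULL);
    puts("listener thread exited cleanly");
    return 0;
}

Note that volatile only stops the compiler from caching the flag, which is what Nysal suggests; a strictly portable version would use a C11 atomic or protect the flag with a mutex, since volatile alone gives no cross-thread ordering guarantees.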
>>>>>>> >>>>>>> I have an idea for a workaround (set a timer and retry after >>>>>>> a while), but would obviously prefer a real solution. I’m not even sure >>>>>>> it >>>>>>> will work, as it is unclear whether the server (which is the one hung in >>>>>>> accept) >>>>>>> will break free if the client closes the socket and retries. >>>>>>> >>>>>>> >>>>>>> On Nov 6, 2015, at 10:53 PM, Artem Polyakov <artpo...@gmail.com> >>>>>>> wrote: >>>>>>> >>>>>>> Hello, is there any progress on this topic? This affects our PMIx >>>>>>> measurements. >>>>>>> >>>>>>> 2015-10-30 21:21 GMT+06:00 Ralph Castain <r...@open-mpi.org>: >>>>>>> >>>>>>>> I’ve verified that the orte/util/listener thread is not being >>>>>>>> started, so I don’t think it should be involved in this problem. >>>>>>>> >>>>>>>> HTH >>>>>>>> Ralph >>>>>>>> >>>>>>>> On Oct 30, 2015, at 8:07 AM, Ralph Castain <r...@open-mpi.org> >>>>>>>> wrote: >>>>>>>> >>>>>>>> Hmmm…there is a hook that would allow the PMIx server to utilize >>>>>>>> that listener thread, but we aren’t currently using it. Each daemon >>>>>>>> plus >>>>>>>> mpirun will call orte_start_listener, but nothing is currently >>>>>>>> registering, >>>>>>>> and so the listener in that code is supposed to just return without >>>>>>>> starting the thread. >>>>>>>> >>>>>>>> So the only listener thread that should exist is the one inside the >>>>>>>> PMIx server itself. If something else is happening, then that would be >>>>>>>> a >>>>>>>> bug. I can look at the orte listener code to ensure that the thread >>>>>>>> isn’t >>>>>>>> incorrectly starting. >>>>>>>> >>>>>>>> >>>>>>>> On Oct 29, 2015, at 10:03 PM, George Bosilca <bosi...@icl.utk.edu> >>>>>>>> wrote: >>>>>>>> >>>>>>>> Some progress that puzzles me but might help you understand. Once >>>>>>>> the deadlock appears, if I manually kill the MPI process on the node >>>>>>>> where >>>>>>>> the deadlock was created, the local orte daemon doesn't notice and will >>>>>>>> just keep waiting. >>>>>>>> >>>>>>>> Quick question: I am under the impression that the issue is not in >>>>>>>> the PMIx server but somewhere around the listener_thread_fn in >>>>>>>> orte/util/listener.c. Possible? >>>>>>>> >>>>>>>> George. >>>>>>>> >>>>>>>> >>>>>>>> On Wed, Oct 28, 2015 at 3:56 AM, Ralph Castain <r...@open-mpi.org> >>>>>>>> wrote: >>>>>>>> >>>>>>>>> Should have also clarified: the prior fixes are indeed in the >>>>>>>>> current master. >>>>>>>>> >>>>>>>>> On Oct 28, 2015, at 12:42 AM, Ralph Castain <r...@open-mpi.org> >>>>>>>>> wrote: >>>>>>>>> >>>>>>>>> Nope - I was wrong. The correction on the client side consisted of >>>>>>>>> attempting to time out if the blocking recv failed. We then modified >>>>>>>>> the >>>>>>>>> blocking send/recv so they would handle errors. >>>>>>>>> >>>>>>>>> So that problem occurred -after- the server had correctly called >>>>>>>>> accept.
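The "error 11 ... cycling" lines in the client log and the backtraces stuck in pmix_usock_recv_blocking() come from this kind of loop: the descriptor is non-blocking, so recv() keeps returning EAGAIN until the server finally accepts and answers. A hedged sketch of the "cycle, but eventually give up" idea discussed above follows; the real pmix_usock_recv_blocking() is different, and the function name, timeout parameter, and return convention here are made up for illustration.

/* Sketch only - not the PMIx implementation. Retries EAGAIN/EWOULDBLOCK/EINTR
 * the way the debug log shows ("cycling"), but bails out after a deadline so
 * a server that never accepts cannot hang the client forever. */
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <time.h>
#include <unistd.h>

bool recv_blocking_with_deadline(int sd, char *data, size_t size, int timeout_sec)
{
    size_t cnt = 0;
    time_t deadline = time(NULL) + timeout_sec;

    while (cnt < size) {
        ssize_t rc = recv(sd, data + cnt, size - cnt, 0);
        if (rc > 0) {
            cnt += (size_t)rc;          /* got part of the message, keep going */
        } else if (0 == rc) {
            return false;               /* peer closed the socket */
        } else if (EAGAIN == errno || EWOULDBLOCK == errno || EINTR == errno) {
            if (time(NULL) >= deadline) {
                return false;           /* stop cycling and report the failure */
            }
            usleep(1000);               /* brief pause instead of a hot spin */
        } else {
            return false;               /* hard error */
        }
    }
    return true;
}

With a bound like this, a lost accept() becomes a visible connection failure the caller can retry instead of an indefinite hang - essentially the timer-and-retry workaround mentioned above, although the real fix remains making sure the listener thread is actually running.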
The listener code is in >>>>>>>>> opal/mca/pmix/pmix1xx/pmix/src/server/pmix_server_listener.c >>>>>>>>> >>>>>>>>> It looks to me like the only way we could drop the accept >>>>>>>>> (assuming the OS doesn’t lose it) is if the file descriptor lies >>>>>>>>> outside >>>>>>>>> the expected range once we fall out of select: >>>>>>>>> >>>>>>>>> >>>>>>>>> /* Spin accepting connections until all active listen >>>>>>>>> sockets >>>>>>>>> * do not have any incoming connections, pushing each >>>>>>>>> connection >>>>>>>>> * onto the event queue for processing >>>>>>>>> */ >>>>>>>>> do { >>>>>>>>> accepted_connections = 0; >>>>>>>>> /* according to the man pages, select replaces the >>>>>>>>> given descriptor >>>>>>>>> * set with a subset consisting of those descriptors >>>>>>>>> that are ready >>>>>>>>> * for the specified operation - in this case, a read. >>>>>>>>> So we need to >>>>>>>>> * first check to see if this file descriptor is >>>>>>>>> included in the >>>>>>>>> * returned subset >>>>>>>>> */ >>>>>>>>> if (0 == FD_ISSET(pmix_server_globals.listen_socket, >>>>>>>>> &readfds)) { >>>>>>>>> /* this descriptor is not included */ >>>>>>>>> continue; >>>>>>>>> } >>>>>>>>> >>>>>>>>> /* this descriptor is ready to be read, which means a >>>>>>>>> connection >>>>>>>>> * request has been received - so harvest it. All we >>>>>>>>> want to do >>>>>>>>> * here is accept the connection and push the info >>>>>>>>> onto the event >>>>>>>>> * library for subsequent processing - we don't want >>>>>>>>> to actually >>>>>>>>> * process the connection here as it takes too long, >>>>>>>>> and so the >>>>>>>>> * OS might start rejecting connections due to timeout. >>>>>>>>> */ >>>>>>>>> pending_connection = >>>>>>>>> PMIX_NEW(pmix_pending_connection_t); >>>>>>>>> event_assign(&pending_connection->ev, >>>>>>>>> pmix_globals.evbase, -1, >>>>>>>>> EV_WRITE, connection_handler, >>>>>>>>> pending_connection); >>>>>>>>> pending_connection->sd = >>>>>>>>> accept(pmix_server_globals.listen_socket, >>>>>>>>> (struct >>>>>>>>> sockaddr*)&(pending_connection->addr), >>>>>>>>> &addrlen); >>>>>>>>> if (pending_connection->sd < 0) { >>>>>>>>> PMIX_RELEASE(pending_connection); >>>>>>>>> if (pmix_socket_errno != EAGAIN || >>>>>>>>> pmix_socket_errno != EWOULDBLOCK) { >>>>>>>>> if (EMFILE == pmix_socket_errno) { >>>>>>>>> PMIX_ERROR_LOG(PMIX_ERR_OUT_OF_RESOURCE); >>>>>>>>> } else { >>>>>>>>> pmix_output(0, "listen_thread: accept() >>>>>>>>> failed: %s (%d).", >>>>>>>>> strerror(pmix_socket_errno), >>>>>>>>> pmix_socket_errno); >>>>>>>>> } >>>>>>>>> goto done; >>>>>>>>> } >>>>>>>>> continue; >>>>>>>>> } >>>>>>>>> >>>>>>>>> pmix_output_verbose(8, pmix_globals.debug_output, >>>>>>>>> "listen_thread: new connection: >>>>>>>>> (%d, %d)", >>>>>>>>> pending_connection->sd, >>>>>>>>> pmix_socket_errno); >>>>>>>>> /* activate the event */ >>>>>>>>> event_active(&pending_connection->ev, EV_WRITE, 1); >>>>>>>>> accepted_connections++; >>>>>>>>> } while (accepted_connections > 0); >>>>>>>>> >>>>>>>>> >>>>>>>>> On Oct 28, 2015, at 12:25 AM, Ralph Castain <r...@open-mpi.org> >>>>>>>>> wrote: >>>>>>>>> >>>>>>>>> Looking at the code, it appears that a fix was committed for this >>>>>>>>> problem, and that we correctly resolved the issue found by Paul. The >>>>>>>>> problem is that the fix didn’t get upstreamed, and so it was lost the >>>>>>>>> next >>>>>>>>> time we refreshed PMIx. Sigh. >>>>>>>>> >>>>>>>>> Let me try to recreate the fix and have you take a gander at it. 
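As an aside on the hand-off in the quoted listener loop: it uses a common libevent idiom in which an event is event_assign()ed with no file descriptor (fd = -1) and then manually fired with event_active(), so the accept()ed socket is processed later on the event base rather than inside the listener loop itself. A stripped-down sketch of that idiom is below, assuming libevent 2.x; this is not the PMIx code, and in the real multi-threaded case libevent's locking must also be enabled (e.g. evthread_use_pthreads()).

/* Minimal libevent 2.x sketch of the "assign with fd = -1, then manually
 * activate" hand-off pattern. Build with -levent. Everything runs in one
 * thread here for simplicity. */
#include <event2/event.h>
#include <stdio.h>

struct pending_connection {
    struct event ev;
    int sd;                       /* the accept()ed descriptor in the real code */
};

static void connection_handler(evutil_socket_t fd, short what, void *arg)
{
    (void)fd; (void)what;
    struct pending_connection *pc = arg;
    printf("processing pending connection sd=%d on the event base\n", pc->sd);
}

int main(void)
{
    struct event_base *base = event_base_new();
    struct pending_connection pc = { .sd = 42 };   /* stand-in for accept()'s result */

    /* No fd to monitor: the event exists only so it can be activated by hand. */
    event_assign(&pc.ev, base, -1, EV_WRITE, connection_handler, &pc);

    /* In the listener thread this would follow a successful accept(). */
    event_active(&pc.ev, EV_WRITE, 1);

    event_base_dispatch(base);    /* runs the handler, then exits: nothing pending */
    event_base_free(base);
    return 0;
}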
>>>>>>>>> >>>>>>>>> >>>>>>>>> On Oct 28, 2015, at 12:22 AM, Ralph Castain <r...@open-mpi.org> >>>>>>>>> wrote: >>>>>>>>> >>>>>>>>> Here is the discussion - I’m afraid it is fairly lengthy. Ignore the >>>>>>>>> hwloc references in it, as that was a separate issue: >>>>>>>>> >>>>>>>>> http://www.open-mpi.org/community/lists/devel/2015/09/18074.php >>>>>>>>> >>>>>>>>> It definitely sounds like the same issue creeping in again. I’d >>>>>>>>> appreciate any thoughts on how to correct it. If it helps, you could >>>>>>>>> look >>>>>>>>> at the PMIx master - there are standalone tests in the test/simple >>>>>>>>> directory that fork/exec a child and just do the connection. >>>>>>>>> >>>>>>>>> https://github.com/pmix/master >>>>>>>>> >>>>>>>>> The test server is simptest.c - it will spawn a single copy of >>>>>>>>> simpclient.c by default. >>>>>>>>> >>>>>>>>> >>>>>>>>> On Oct 27, 2015, at 10:14 PM, George Bosilca <bosi...@icl.utk.edu> >>>>>>>>> wrote: >>>>>>>>> >>>>>>>>> Interesting. Do you have a pointer to the commit (and/or to the >>>>>>>>> discussion)? >>>>>>>>> >>>>>>>>> I looked at the PMIx code, and I have identified a few issues, but >>>>>>>>> unfortunately none of them seem to fix the problem for good. However, >>>>>>>>> now I >>>>>>>>> need more than 1000 runs to get a deadlock (instead of a few tens). >>>>>>>>> >>>>>>>>> Looking with "netstat -ax" at the status of the UDS while the >>>>>>>>> processes are deadlocked, I see 2 UDS with the same name: one from the >>>>>>>>> server, which is in LISTEN state, and one from the client, which is >>>>>>>>> in >>>>>>>>> CONNECTING state (while the client has already sent a message into the >>>>>>>>> socket and >>>>>>>>> is now waiting in a blocking receive). This somehow suggests that the >>>>>>>>> server >>>>>>>>> has not yet called accept on the UDS. Unfortunately, there are 3 >>>>>>>>> threads >>>>>>>>> all doing different flavors of event_base and select, so I have a hard >>>>>>>>> time >>>>>>>>> tracking the path of the UDS on the server side. >>>>>>>>> >>>>>>>>> So in order to validate my assumption I wrote a minimalistic UDS >>>>>>>>> client and server application and tried different scenarios. The >>>>>>>>> conclusion >>>>>>>>> is that in order to see the same type of output from "netstat -ax" I >>>>>>>>> have >>>>>>>>> to call listen on the server, connect on the client, and not call >>>>>>>>> accept >>>>>>>>> on the server. >>>>>>>>> >>>>>>>>> On the same occasion I also confirmed that the UDS holds >>>>>>>>> the data sent, so there is no need for further synchronization for the >>>>>>>>> case >>>>>>>>> where the data is sent first. We only need to find out how the server >>>>>>>>> forgets to call accept. >>>>>>>>> >>>>>>>>> George. >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>>> On Tue, Oct 27, 2015 at 7:52 PM, Ralph Castain <r...@open-mpi.org> >>>>>>>>> wrote: >>>>>>>>> >>>>>>>>>> Hmmm…this looks like it might be that problem we previously saw >>>>>>>>>> where the blocking recv hangs in a proc when the blocking send tries >>>>>>>>>> to >>>>>>>>>> send before the domain socket is actually ready, and so the send >>>>>>>>>> fails on >>>>>>>>>> the other end. As I recall, it was something to do with the >>>>>>>>>> socket options - >>>>>>>>>> and then Paul had a problem on some of his machines, and we backed >>>>>>>>>> it out?
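George's minimalistic experiment is easy to reproduce. The sketch below is a toy of my own (not the tester that was lost; the socket path and message are arbitrary): the server listen()s and then deliberately never accept()s, yet the client's connect() and send() both succeed because the kernel queues the connection and buffers the data. While the two processes sleep, "netstat -ax" shows the listening socket plus the unaccepted peer, matching the picture described above.

/* Toy reproducer: UNIX domain socket server that listens but never accepts,
 * and a forked client that connects and sends anyway. */
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/un.h>
#include <sys/wait.h>

#define SOCK_PATH "/tmp/uds_no_accept_demo.sock"

int main(void)
{
    struct sockaddr_un addr;
    memset(&addr, 0, sizeof(addr));
    addr.sun_family = AF_UNIX;
    strncpy(addr.sun_path, SOCK_PATH, sizeof(addr.sun_path) - 1);

    unlink(SOCK_PATH);
    int srv = socket(AF_UNIX, SOCK_STREAM, 0);
    if (srv < 0 || bind(srv, (struct sockaddr *)&addr, sizeof(addr)) < 0 ||
        listen(srv, 8) < 0) {
        perror("server setup");
        return 1;
    }

    pid_t pid = fork();
    if (0 == pid) {                            /* client */
        int cli = socket(AF_UNIX, SOCK_STREAM, 0);
        if (cli < 0 || connect(cli, (struct sockaddr *)&addr, sizeof(addr)) < 0) {
            perror("client connect");
            _exit(1);
        }
        const char msg[] = "connect ack";
        if (send(cli, msg, sizeof(msg), 0) < 0) {  /* buffered by the kernel */
            perror("client send");
            _exit(1);
        }
        printf("client: connected and sent %zu bytes with no accept()\n", sizeof(msg));
        sleep(30);             /* time to run: netstat -ax | grep uds_no_accept */
        _exit(0);
    }

    sleep(30);                                 /* server never calls accept() */
    waitpid(pid, NULL, 0);
    unlink(SOCK_PATH);
    return 0;
}

The exact state label netstat prints for the unaccepted side may vary between kernels, but the key point matches the observation above: the data survives in the socket buffer, so no extra client-side synchronization is needed once the server finally accepts.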
>>>>>>>>>> >>>>>>>>>> I wonder if that’s what is biting us here again, and what we need >>>>>>>>>> is to either remove the blocking send/recv's altogether, or figure >>>>>>>>>> out a >>>>>>>>>> way to wait until the socket is really ready. >>>>>>>>>> >>>>>>>>>> Any thoughts? >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> On Oct 27, 2015, at 4:11 PM, George Bosilca <bosi...@icl.utk.edu> >>>>>>>>>> wrote: >>>>>>>>>> >>>>>>>>>> It appears the branch solves the problem at least partially. I >>>>>>>>>> asked one of my students to hammer it pretty badly, and he reported >>>>>>>>>> that >>>>>>>>>> the deadlocks still occur. He also graciously provided some >>>>>>>>>> stacktraces: >>>>>>>>>> >>>>>>>>>> #0 0x00007f4bd5274aed in nanosleep () from /lib64/libc.so.6 >>>>>>>>>> #1 0x00007f4bd52a9c94 in usleep () from /lib64/libc.so.6 >>>>>>>>>> #2 0x00007f4bd2e42b00 in OPAL_PMIX_PMIX1XX_PMIx_Fence >>>>>>>>>> (procs=0x0, nprocs=0, info=0x7fff3c561960, >>>>>>>>>> ninfo=1) at src/client/pmix_client_fence.c:100 >>>>>>>>>> #3 0x00007f4bd306e6d2 in pmix1_fence (procs=0x0, collect_data=1) >>>>>>>>>> at pmix1_client.c:306 >>>>>>>>>> #4 0x00007f4bd57d5cc3 in ompi_mpi_init (argc=3, >>>>>>>>>> argv=0x7fff3c561ea8, requested=3, >>>>>>>>>> provided=0x7fff3c561d84) at runtime/ompi_mpi_init.c:644 >>>>>>>>>> #5 0x00007f4bd5813399 in PMPI_Init_thread (argc=0x7fff3c561d7c, >>>>>>>>>> argv=0x7fff3c561d70, required=3, >>>>>>>>>> provided=0x7fff3c561d84) at pinit_thread.c:69 >>>>>>>>>> #6 0x0000000000401516 in main (argc=3, argv=0x7fff3c561ea8) at >>>>>>>>>> osu_mbw_mr.c:86 >>>>>>>>>> >>>>>>>>>> And another process: >>>>>>>>>> >>>>>>>>>> #0 0x00007f7b9d7d8bdc in recv () from /lib64/libpthread.so.0 >>>>>>>>>> #1 0x00007f7b9b0aa42d in >>>>>>>>>> opal_pmix_pmix1xx_pmix_usock_recv_blocking (sd=13, >>>>>>>>>> data=0x7ffd62139004 "", >>>>>>>>>> size=4) at src/usock/usock.c:168 >>>>>>>>>> #2 0x00007f7b9b0af5d9 in recv_connect_ack (sd=13) at >>>>>>>>>> src/client/pmix_client.c:844 >>>>>>>>>> #3 0x00007f7b9b0b085e in usock_connect (addr=0x7ffd62139330) at >>>>>>>>>> src/client/pmix_client.c:1110 >>>>>>>>>> #4 0x00007f7b9b0acc24 in connect_to_server >>>>>>>>>> (address=0x7ffd62139330, cbdata=0x7ffd621390e0) >>>>>>>>>> at src/client/pmix_client.c:181 >>>>>>>>>> #5 0x00007f7b9b0ad569 in OPAL_PMIX_PMIX1XX_PMIx_Init >>>>>>>>>> (proc=0x7f7b9b4e9b60) >>>>>>>>>> at src/client/pmix_client.c:362 >>>>>>>>>> #6 0x00007f7b9b2dbd9d in pmix1_client_init () at >>>>>>>>>> pmix1_client.c:99 >>>>>>>>>> #7 0x00007f7b9b4eb95f in pmi_component_query >>>>>>>>>> (module=0x7ffd62139490, priority=0x7ffd6213948c) >>>>>>>>>> at ess_pmi_component.c:90 >>>>>>>>>> #8 0x00007f7b9ce70ec5 in mca_base_select >>>>>>>>>> (type_name=0x7f7b9d20e059 "ess", output_id=-1, >>>>>>>>>> components_available=0x7f7b9d431eb0, >>>>>>>>>> best_module=0x7ffd621394d0, best_component=0x7ffd621394d8, >>>>>>>>>> priority_out=0x0) at mca_base_components_select.c:77 >>>>>>>>>> #9 0x00007f7b9d1a956b in orte_ess_base_select () at >>>>>>>>>> base/ess_base_select.c:40 >>>>>>>>>> #10 0x00007f7b9d160449 in orte_init (pargc=0x0, pargv=0x0, >>>>>>>>>> flags=32) at runtime/orte_init.c:219 >>>>>>>>>> #11 0x00007f7b9da4377a in ompi_mpi_init (argc=3, >>>>>>>>>> argv=0x7ffd621397f8, requested=3, >>>>>>>>>> provided=0x7ffd621396d4) at runtime/ompi_mpi_init.c:488 >>>>>>>>>> #12 0x00007f7b9da81399 in PMPI_Init_thread (argc=0x7ffd621396cc, >>>>>>>>>> argv=0x7ffd621396c0, required=3, >>>>>>>>>> provided=0x7ffd621396d4) at pinit_thread.c:69 >>>>>>>>>> #13 0x0000000000401516 in main (argc=3, argv=0x7ffd621397f8) at 
>>>>>>>>>> osu_mbw_mr.c:86 >>>>>>>>>> >>>>>>>>>> George. >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> On Tue, Oct 27, 2015 at 2:36 PM, Ralph Castain <r...@open-mpi.org> >>>>>>>>>> wrote: >>>>>>>>>> >>>>>>>>>>> I haven’t been able to replicate this when using the branch in >>>>>>>>>>> this PR: >>>>>>>>>>> >>>>>>>>>>> https://github.com/open-mpi/ompi/pull/1073 >>>>>>>>>>> >>>>>>>>>>> Would you mind giving it a try? It fixes some other race >>>>>>>>>>> conditions and might pick this one up too. >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> On Oct 27, 2015, at 10:04 AM, Ralph Castain <r...@open-mpi.org> >>>>>>>>>>> wrote: >>>>>>>>>>> >>>>>>>>>>> Okay, I’ll take a look - I’ve been chasing a race condition that >>>>>>>>>>> might be related. >>>>>>>>>>> >>>>>>>>>>> On Oct 27, 2015, at 9:54 AM, George Bosilca <bosi...@icl.utk.edu> >>>>>>>>>>> wrote: >>>>>>>>>>> >>>>>>>>>>> No, it's using 2 nodes. >>>>>>>>>>> George. >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> On Tue, Oct 27, 2015 at 12:35 PM, Ralph Castain < >>>>>>>>>>> r...@open-mpi.org> wrote: >>>>>>>>>>> >>>>>>>>>>>> Is this on a single node? >>>>>>>>>>>> >>>>>>>>>>>> On Oct 27, 2015, at 9:25 AM, George Bosilca < >>>>>>>>>>>> bosi...@icl.utk.edu> wrote: >>>>>>>>>>>> >>>>>>>>>>>> I get intermittent deadlocks with the latest trunk. The smallest >>>>>>>>>>>> reproducer is a shell for loop around a small (2 processes) short >>>>>>>>>>>> (20 >>>>>>>>>>>> seconds) MPI application. After a few tens of iterations >>>>>>>>>>>> MPI_Init will >>>>>>>>>>>> deadlock with the following backtrace: >>>>>>>>>>>> >>>>>>>>>>>> #0 0x00007fa94b5d9aed in nanosleep () from /lib64/libc.so.6 >>>>>>>>>>>> #1 0x00007fa94b60ec94 in usleep () from /lib64/libc.so.6 >>>>>>>>>>>> #2 0x00007fa94960ba08 in OPAL_PMIX_PMIX1XX_PMIx_Fence >>>>>>>>>>>> (procs=0x0, nprocs=0, info=0x7ffd7934fb90, >>>>>>>>>>>> ninfo=1) at src/client/pmix_client_fence.c:100 >>>>>>>>>>>> #3 0x00007fa9498376a2 in pmix1_fence (procs=0x0, >>>>>>>>>>>> collect_data=1) at pmix1_client.c:305 >>>>>>>>>>>> #4 0x00007fa94bb39ba4 in ompi_mpi_init (argc=3, >>>>>>>>>>>> argv=0x7ffd793500a8, requested=3, >>>>>>>>>>>> provided=0x7ffd7934ff94) at runtime/ompi_mpi_init.c:645 >>>>>>>>>>>> #5 0x00007fa94bb77281 in PMPI_Init_thread >>>>>>>>>>>> (argc=0x7ffd7934ff8c, argv=0x7ffd7934ff80, required=3, >>>>>>>>>>>> provided=0x7ffd7934ff94) at pinit_thread.c:69 >>>>>>>>>>>> #6 0x000000000040150f in main (argc=3, argv=0x7ffd793500a8) at >>>>>>>>>>>> osu_mbw_mr.c:86 >>>>>>>>>>>> >>>>>>>>>>>> On my machines this is reproducible at 100% after anywhere >>>>>>>>>>>> between 50 and 100 iterations. >>>>>>>>>>>> >>>>>>>>>>>> Thanks, >>>>>>>>>>>> George.