George was looking into it, but I don’t know if he has had time recently to continue the investigation. We understand “what” is happening (accept sometimes ignores the connection), but we don’t yet know “why”. I’ve done some digging around the web and found that you can sometimes talk to a Unix domain socket too quickly - i.e., you open it and then send to it before the OS has finished setting it up, and in those cases the socket can hang. However, I’ve tried adding some artificial delay, and while it helped, it didn’t completely solve the problem.

I have an idea for a workaround (set a timer and retry after a while), but I would obviously prefer a real solution. I’m not even sure the workaround will work, as it is unclear whether the server (which is the side hung in accept) will break free if the client closes the socket and retries.
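For illustration, here is a minimal sketch of what that client-side retry could look like (this is not the PMIx code; the socket path, retry count, and delay are made up, and error handling is trimmed):

    /* Sketch of a "close, wait, and retry" workaround for connecting to a
     * Unix domain socket.  Illustrative only - not the PMIx client. */
    #include <errno.h>
    #include <string.h>
    #include <sys/socket.h>
    #include <sys/un.h>
    #include <unistd.h>

    static int connect_with_retry(const char *path, int max_tries, useconds_t delay_us)
    {
        for (int attempt = 0; attempt < max_tries; attempt++) {
            int sd = socket(AF_UNIX, SOCK_STREAM, 0);
            if (sd < 0) {
                return -1;
            }
            struct sockaddr_un addr;
            memset(&addr, 0, sizeof(addr));
            addr.sun_family = AF_UNIX;
            strncpy(addr.sun_path, path, sizeof(addr.sun_path) - 1);
            if (0 == connect(sd, (struct sockaddr *)&addr, sizeof(addr))) {
                return sd;   /* connected - caller proceeds with the handshake */
            }
            /* Close, wait, and retry from scratch so the server sees a fresh
             * connection request rather than one it may have dropped. */
            close(sd);
            usleep(delay_us);
        }
        errno = ETIMEDOUT;
        return -1;
    }

Whether this actually helps hinges on exactly the open question above - whether a fresh connect() after the client backs off wakes up the server side that is stuck without ever calling accept.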
> On Nov 6, 2015, at 10:53 PM, Artem Polyakov <artpo...@gmail.com> wrote:
>
> Hello, is there any progress on this topic? This affects our PMIx measurements.
>
> 2015-10-30 21:21 GMT+06:00 Ralph Castain <r...@open-mpi.org>:
> I’ve verified that the orte/util/listener thread is not being started, so I don’t think it should be involved in this problem.
>
> HTH
> Ralph
>
>> On Oct 30, 2015, at 8:07 AM, Ralph Castain <r...@open-mpi.org> wrote:
>>
>> Hmmm…there is a hook that would allow the PMIx server to utilize that listener thread, but we aren’t currently using it. Each daemon plus mpirun will call orte_start_listener, but nothing is currently registering, and so the listener in that code is supposed to just return without starting the thread.
>>
>> So the only listener thread that should exist is the one inside the PMIx server itself. If something else is happening, then that would be a bug. I can look at the orte listener code to ensure that the thread isn’t incorrectly starting.
>>
>>> On Oct 29, 2015, at 10:03 PM, George Bosilca <bosi...@icl.utk.edu> wrote:
>>>
>>> Some progress that puzzles me but might help you understand. Once the deadlock appears, if I manually kill the MPI process on the node where the deadlock was created, the local orte daemon doesn’t notice and will just keep waiting.
>>>
>>> Quick question: I am under the impression that the issue is not in the PMIx server but somewhere around the listener_thread_fn in orte/util/listener.c. Possible?
>>>
>>> George.
>>>
>>> On Wed, Oct 28, 2015 at 3:56 AM, Ralph Castain <r...@open-mpi.org> wrote:
>>> Should have also clarified: the prior fixes are indeed in the current master.
>>>
>>>> On Oct 28, 2015, at 12:42 AM, Ralph Castain <r...@open-mpi.org> wrote:
>>>>
>>>> Nope - I was wrong. The correction on the client side consisted of attempting to time out if the blocking recv failed. We then modified the blocking send/recv so they would handle errors.
>>>>
>>>> So that problem occurred -after- the server had correctly called accept. The listener code is in opal/mca/pmix/pmix1xx/pmix/src/server/pmix_server_listener.c
>>>>
>>>> It looks to me like the only way we could drop the accept (assuming the OS doesn’t lose it) is if the file descriptor lies outside the expected range once we fall out of select:
>>>>
>>>> /* Spin accepting connections until all active listen sockets
>>>>  * do not have any incoming connections, pushing each connection
>>>>  * onto the event queue for processing
>>>>  */
>>>> do {
>>>>     accepted_connections = 0;
>>>>     /* according to the man pages, select replaces the given descriptor
>>>>      * set with a subset consisting of those descriptors that are ready
>>>>      * for the specified operation - in this case, a read. So we need to
>>>>      * first check to see if this file descriptor is included in the
>>>>      * returned subset
>>>>      */
>>>>     if (0 == FD_ISSET(pmix_server_globals.listen_socket, &readfds)) {
>>>>         /* this descriptor is not included */
>>>>         continue;
>>>>     }
>>>>
>>>>     /* this descriptor is ready to be read, which means a connection
>>>>      * request has been received - so harvest it. All we want to do
>>>>      * here is accept the connection and push the info onto the event
>>>>      * library for subsequent processing - we don't want to actually
>>>>      * process the connection here as it takes too long, and so the
>>>>      * OS might start rejecting connections due to timeout.
>>>>      */
>>>>     pending_connection = PMIX_NEW(pmix_pending_connection_t);
>>>>     event_assign(&pending_connection->ev, pmix_globals.evbase, -1,
>>>>                  EV_WRITE, connection_handler, pending_connection);
>>>>     pending_connection->sd = accept(pmix_server_globals.listen_socket,
>>>>                                     (struct sockaddr*)&(pending_connection->addr),
>>>>                                     &addrlen);
>>>>     if (pending_connection->sd < 0) {
>>>>         PMIX_RELEASE(pending_connection);
>>>>         if (pmix_socket_errno != EAGAIN ||
>>>>             pmix_socket_errno != EWOULDBLOCK) {
>>>>             if (EMFILE == pmix_socket_errno) {
>>>>                 PMIX_ERROR_LOG(PMIX_ERR_OUT_OF_RESOURCE);
>>>>             } else {
>>>>                 pmix_output(0, "listen_thread: accept() failed: %s (%d).",
>>>>                             strerror(pmix_socket_errno), pmix_socket_errno);
>>>>             }
>>>>             goto done;
>>>>         }
>>>>         continue;
>>>>     }
>>>>
>>>>     pmix_output_verbose(8, pmix_globals.debug_output,
>>>>                         "listen_thread: new connection: (%d, %d)",
>>>>                         pending_connection->sd, pmix_socket_errno);
>>>>     /* activate the event */
>>>>     event_active(&pending_connection->ev, EV_WRITE, 1);
>>>>     accepted_connections++;
>>>> } while (accepted_connections > 0);
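For reference, a bare-bones non-blocking accept loop (not the PMIx listener) conventionally treats EAGAIN/EWOULDBLOCK as "nothing left to accept" rather than as a failure; the combined errno test is usually written with &&, i.e. errno != EAGAIN && errno != EWOULDBLOCK, so that neither value falls through to the error path. A sketch, with listen_sd assumed to be a non-blocking listening socket that select() just reported readable:

    /* Comparison sketch only - not the PMIx code. */
    #include <errno.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/socket.h>

    static int drain_pending_connections(int listen_sd)
    {
        int accepted = 0;
        while (1) {
            int sd = accept(listen_sd, NULL, NULL);
            if (sd >= 0) {
                /* hand "sd" off for asynchronous processing here */
                accepted++;
                continue;
            }
            if (EAGAIN == errno || EWOULDBLOCK == errno) {
                break;              /* backlog drained - not an error */
            }
            if (EINTR == errno) {
                continue;           /* interrupted - just retry */
            }
            fprintf(stderr, "accept() failed: %s (%d)\n", strerror(errno), errno);
            return -1;              /* genuine failure */
        }
        return accepted;
    }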
>>>>> On Oct 28, 2015, at 12:25 AM, Ralph Castain <r...@open-mpi.org> wrote:
>>>>>
>>>>> Looking at the code, it appears that a fix was committed for this problem, and that we correctly resolved the issue found by Paul. The problem is that the fix didn’t get upstreamed, and so it was lost the next time we refreshed PMIx. Sigh.
>>>>>
>>>>> Let me try to recreate the fix and have you take a gander at it.
>>>>>
>>>>>> On Oct 28, 2015, at 12:22 AM, Ralph Castain <r...@open-mpi.org> wrote:
>>>>>>
>>>>>> Here is the discussion - afraid it is fairly lengthy. Ignore the hwloc references in it, as that was a separate issue:
>>>>>>
>>>>>> http://www.open-mpi.org/community/lists/devel/2015/09/18074.php
>>>>>>
>>>>>> It definitely sounds like the same issue creeping in again. I’d appreciate any thoughts on how to correct it. If it helps, you could look at the PMIx master - there are standalone tests in the test/simple directory that fork/exec a child and just do the connection.
>>>>>>
>>>>>> https://github.com/pmix/master
>>>>>>
>>>>>> The test server is simptest.c - it will spawn a single copy of simpclient.c by default.
>>>>>>
>>>>>>> On Oct 27, 2015, at 10:14 PM, George Bosilca <bosi...@icl.utk.edu> wrote:
>>>>>>>
>>>>>>> Interesting. Do you have a pointer to the commit (and/or to the discussion)?
>>>>>>>
>>>>>>> I looked at the PMIx code and identified a few issues, but unfortunately none of them seem to fix the problem for good. However, now I need more than 1000 runs to get a deadlock (instead of a few tens).
>>>>>>>
>>>>>>> Looking with "netstat -ax" at the status of the UDS while the processes are deadlocked, I see 2 UDS with the same name: one from the server, which is in LISTEN state, and one for the client, which is in CONNECTING state (while the client has already sent a message into the socket and is now waiting in a blocking receive). This suggests that the server has not yet called accept on the UDS. Unfortunately, there are 3 threads all doing different flavors of event_base and select, so I have a hard time tracking the path of the UDS on the server side.
>>>>>>>
>>>>>>> So in order to validate my assumption I wrote a minimalistic UDS client and server application and tried different scenarios. The conclusion is that in order to see the same type of output from "netstat -ax" I have to call listen on the server, connect on the client, and not call accept on the server.
>>>>>>>
>>>>>>> At the same time I also confirmed that the UDS holds the data sent, so there is no need for further synchronization for the case where the data is sent first. We only need to find out how the server forgets to call accept.
>>>>>>>
>>>>>>> George.
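For anyone who wants to reproduce that netstat observation without Open MPI, here is a minimal sketch along the lines George describes (this is not his actual test program; the socket path and message are made up, and error checks are omitted):

    /* Server: listen but never accept.  Client: connect, send, then block in
     * recv.  With the server running, "netstat -ax" should show one socket in
     * LISTEN(ING) state and one stuck in CONNECTING for the same path. */
    #include <string.h>
    #include <sys/socket.h>
    #include <sys/un.h>
    #include <unistd.h>

    #define SOCK_PATH "/tmp/uds_accept_test"

    int main(int argc, char **argv)
    {
        struct sockaddr_un addr;
        memset(&addr, 0, sizeof(addr));
        addr.sun_family = AF_UNIX;
        strncpy(addr.sun_path, SOCK_PATH, sizeof(addr.sun_path) - 1);

        int sd = socket(AF_UNIX, SOCK_STREAM, 0);

        if (argc > 1 && 0 == strcmp(argv[1], "server")) {
            unlink(SOCK_PATH);
            bind(sd, (struct sockaddr *)&addr, sizeof(addr));
            listen(sd, 128);
            pause();                         /* never call accept() */
        } else {
            connect(sd, (struct sockaddr *)&addr, sizeof(addr));
            char msg[] = "hello";
            send(sd, msg, sizeof(msg), 0);   /* data is queued even before accept */
            char buf[16];
            recv(sd, buf, sizeof(buf), 0);   /* blocks forever, like the hung client */
        }
        close(sd);
        return 0;
    }

With the server started first, the client's send() succeeds and its recv() hangs, matching the LISTEN/CONNECTING pair George reports above.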
>>>>>>> On Tue, Oct 27, 2015 at 7:52 PM, Ralph Castain <r...@open-mpi.org> wrote:
>>>>>>> Hmmm…this looks like it might be that problem we previously saw where the blocking recv hangs in a proc when the blocking send tries to send before the domain socket is actually ready, and so the send fails on the other end. As I recall, it was something to do with the socket options - and then Paul had a problem on some of his machines, and we backed it out?
>>>>>>>
>>>>>>> I wonder if that’s what is biting us here again, and whether what we need is to either remove the blocking send/recv’s altogether or figure out a way to wait until the socket is really ready.
>>>>>>>
>>>>>>> Any thoughts?
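One way to keep the client-side handshake from hanging forever is to put a receive timeout on the socket before the blocking recv, so a dropped or never-accepted connection turns into an error the caller can close and retry. This is only a sketch of that idea - the earlier client-side fix mentioned above may have been implemented differently:

    /* Sketch: bound a blocking recv with SO_RCVTIMEO so a lost accept shows
     * up as a timeout instead of an indefinite hang.  Illustrative only. */
    #include <errno.h>
    #include <sys/socket.h>
    #include <sys/time.h>

    static ssize_t recv_with_timeout(int sd, void *buf, size_t len, int seconds)
    {
        struct timeval tv = { .tv_sec = seconds, .tv_usec = 0 };
        if (setsockopt(sd, SOL_SOCKET, SO_RCVTIMEO, &tv, sizeof(tv)) < 0) {
            return -1;
        }
        ssize_t rc = recv(sd, buf, len, 0);
        if (rc < 0 && (EAGAIN == errno || EWOULDBLOCK == errno)) {
            /* timed out - the caller can close the socket and retry the connect */
            errno = ETIMEDOUT;
        }
        return rc;
    }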
>>>>>>>> On Oct 27, 2015, at 4:11 PM, George Bosilca <bosi...@icl.utk.edu> wrote:
>>>>>>>>
>>>>>>>> It appears the branch solves the problem at least partially. I asked one of my students to hammer it pretty badly, and he reported that the deadlocks still occur. He also graciously provided some stack traces:
>>>>>>>>
>>>>>>>> #0 0x00007f4bd5274aed in nanosleep () from /lib64/libc.so.6
>>>>>>>> #1 0x00007f4bd52a9c94 in usleep () from /lib64/libc.so.6
>>>>>>>> #2 0x00007f4bd2e42b00 in OPAL_PMIX_PMIX1XX_PMIx_Fence (procs=0x0, nprocs=0, info=0x7fff3c561960, ninfo=1) at src/client/pmix_client_fence.c:100
>>>>>>>> #3 0x00007f4bd306e6d2 in pmix1_fence (procs=0x0, collect_data=1) at pmix1_client.c:306
>>>>>>>> #4 0x00007f4bd57d5cc3 in ompi_mpi_init (argc=3, argv=0x7fff3c561ea8, requested=3, provided=0x7fff3c561d84) at runtime/ompi_mpi_init.c:644
>>>>>>>> #5 0x00007f4bd5813399 in PMPI_Init_thread (argc=0x7fff3c561d7c, argv=0x7fff3c561d70, required=3, provided=0x7fff3c561d84) at pinit_thread.c:69
>>>>>>>> #6 0x0000000000401516 in main (argc=3, argv=0x7fff3c561ea8) at osu_mbw_mr.c:86
>>>>>>>>
>>>>>>>> And another process:
>>>>>>>>
>>>>>>>> #0 0x00007f7b9d7d8bdc in recv () from /lib64/libpthread.so.0
>>>>>>>> #1 0x00007f7b9b0aa42d in opal_pmix_pmix1xx_pmix_usock_recv_blocking (sd=13, data=0x7ffd62139004 "", size=4) at src/usock/usock.c:168
>>>>>>>> #2 0x00007f7b9b0af5d9 in recv_connect_ack (sd=13) at src/client/pmix_client.c:844
>>>>>>>> #3 0x00007f7b9b0b085e in usock_connect (addr=0x7ffd62139330) at src/client/pmix_client.c:1110
>>>>>>>> #4 0x00007f7b9b0acc24 in connect_to_server (address=0x7ffd62139330, cbdata=0x7ffd621390e0) at src/client/pmix_client.c:181
>>>>>>>> #5 0x00007f7b9b0ad569 in OPAL_PMIX_PMIX1XX_PMIx_Init (proc=0x7f7b9b4e9b60) at src/client/pmix_client.c:362
>>>>>>>> #6 0x00007f7b9b2dbd9d in pmix1_client_init () at pmix1_client.c:99
>>>>>>>> #7 0x00007f7b9b4eb95f in pmi_component_query (module=0x7ffd62139490, priority=0x7ffd6213948c) at ess_pmi_component.c:90
>>>>>>>> #8 0x00007f7b9ce70ec5 in mca_base_select (type_name=0x7f7b9d20e059 "ess", output_id=-1, components_available=0x7f7b9d431eb0, best_module=0x7ffd621394d0, best_component=0x7ffd621394d8, priority_out=0x0) at mca_base_components_select.c:77
>>>>>>>> #9 0x00007f7b9d1a956b in orte_ess_base_select () at base/ess_base_select.c:40
>>>>>>>> #10 0x00007f7b9d160449 in orte_init (pargc=0x0, pargv=0x0, flags=32) at runtime/orte_init.c:219
>>>>>>>> #11 0x00007f7b9da4377a in ompi_mpi_init (argc=3, argv=0x7ffd621397f8, requested=3, provided=0x7ffd621396d4) at runtime/ompi_mpi_init.c:488
>>>>>>>> #12 0x00007f7b9da81399 in PMPI_Init_thread (argc=0x7ffd621396cc, argv=0x7ffd621396c0, required=3, provided=0x7ffd621396d4) at pinit_thread.c:69
>>>>>>>> #13 0x0000000000401516 in main (argc=3, argv=0x7ffd621397f8) at osu_mbw_mr.c:86
>>>>>>>>
>>>>>>>> George.
>>>>>>>>
>>>>>>>> On Tue, Oct 27, 2015 at 2:36 PM, Ralph Castain <r...@open-mpi.org> wrote:
>>>>>>>> I haven’t been able to replicate this when using the branch in this PR:
>>>>>>>>
>>>>>>>> https://github.com/open-mpi/ompi/pull/1073
>>>>>>>>
>>>>>>>> Would you mind giving it a try? It fixes some other race conditions and might pick this one up too.
>>>>>>>>> On Oct 27, 2015, at 10:04 AM, Ralph Castain <r...@open-mpi.org> wrote:
>>>>>>>>>
>>>>>>>>> Okay, I’ll take a look - I’ve been chasing a race condition that might be related.
>>>>>>>>>
>>>>>>>>>> On Oct 27, 2015, at 9:54 AM, George Bosilca <bosi...@icl.utk.edu> wrote:
>>>>>>>>>>
>>>>>>>>>> No, it's using 2 nodes.
>>>>>>>>>>
>>>>>>>>>> George.
>>>>>>>>>>
>>>>>>>>>> On Tue, Oct 27, 2015 at 12:35 PM, Ralph Castain <r...@open-mpi.org> wrote:
>>>>>>>>>> Is this on a single node?
>>>>>>>>>>
>>>>>>>>>>> On Oct 27, 2015, at 9:25 AM, George Bosilca <bosi...@icl.utk.edu> wrote:
>>>>>>>>>>>
>>>>>>>>>>> I get intermittent deadlocks with the latest trunk. The smallest reproducer is a shell for loop around a small (2-process), short (20-second) MPI application. After a few tens of iterations, MPI_Init will deadlock with the following backtrace:
>>>>>>>>>>>
>>>>>>>>>>> #0 0x00007fa94b5d9aed in nanosleep () from /lib64/libc.so.6
>>>>>>>>>>> #1 0x00007fa94b60ec94 in usleep () from /lib64/libc.so.6
>>>>>>>>>>> #2 0x00007fa94960ba08 in OPAL_PMIX_PMIX1XX_PMIx_Fence (procs=0x0, nprocs=0, info=0x7ffd7934fb90, ninfo=1) at src/client/pmix_client_fence.c:100
>>>>>>>>>>> #3 0x00007fa9498376a2 in pmix1_fence (procs=0x0, collect_data=1) at pmix1_client.c:305
>>>>>>>>>>> #4 0x00007fa94bb39ba4 in ompi_mpi_init (argc=3, argv=0x7ffd793500a8, requested=3, provided=0x7ffd7934ff94) at runtime/ompi_mpi_init.c:645
>>>>>>>>>>> #5 0x00007fa94bb77281 in PMPI_Init_thread (argc=0x7ffd7934ff8c, argv=0x7ffd7934ff80, required=3, provided=0x7ffd7934ff94) at pinit_thread.c:69
>>>>>>>>>>> #6 0x000000000040150f in main (argc=3, argv=0x7ffd793500a8) at osu_mbw_mr.c:86
>>>>>>>>>>>
>>>>>>>>>>> On my machines this is reproducible at 100% after anywhere between 50 and 100 iterations.
>>>>>>>>>>>
>>>>>>>>>>> Thanks,
>>>>>>>>>>> George.
>
> --
> Best regards, Artem Y. Polyakov