Hello, is there any progress on this topic? This affects our PMIx measurements.
2015-10-30 21:21 GMT+06:00 Ralph Castain <r...@open-mpi.org>:

> I’ve verified that the orte/util/listener thread is not being started, so
> I don’t think it should be involved in this problem.
>
> HTH
> Ralph
>
> On Oct 30, 2015, at 8:07 AM, Ralph Castain <r...@open-mpi.org> wrote:
>
> Hmmm…there is a hook that would allow the PMIx server to utilize that
> listener thread, but we aren’t currently using it. Each daemon plus mpirun
> will call orte_start_listener, but nothing is currently registering, so
> the listener in that code is supposed to just return without starting the
> thread.
>
> So the only listener thread that should exist is the one inside the PMIx
> server itself. If something else is happening, then that would be a bug. I
> can look at the orte listener code to ensure that the thread isn’t
> incorrectly starting.
>
> On Oct 29, 2015, at 10:03 PM, George Bosilca <bosi...@icl.utk.edu> wrote:
>
> Some progress that puzzles me but might help you understand: once the
> deadlock appears, if I manually kill the MPI process on the node where the
> deadlock occurred, the local orte daemon doesn't notice and just keeps
> waiting.
>
> Quick question: I am under the impression that the issue is not in the
> PMIx server but somewhere around listener_thread_fn in
> orte/util/listener.c. Possible?
>
> George.
>
> On Wed, Oct 28, 2015 at 3:56 AM, Ralph Castain <r...@open-mpi.org> wrote:
>
>> Should have also clarified: the prior fixes are indeed in the current
>> master.
>>
>> On Oct 28, 2015, at 12:42 AM, Ralph Castain <r...@open-mpi.org> wrote:
>>
>> Nope - I was wrong. The correction on the client side consisted of
>> attempting to time out if the blocking recv failed. We then modified the
>> blocking send/recv so they would handle errors.
>>
>> So that problem occurred -after- the server had correctly called accept.
>> The listener code is in
>> opal/mca/pmix/pmix1xx/pmix/src/server/pmix_server_listener.c
>>
>> It looks to me like the only way we could drop the accept (assuming the
>> OS doesn’t lose it) is if the file descriptor lies outside the expected
>> range once we fall out of select:
>>
>>     /* Spin accepting connections until all active listen sockets
>>      * do not have any incoming connections, pushing each connection
>>      * onto the event queue for processing
>>      */
>>     do {
>>         accepted_connections = 0;
>>         /* according to the man pages, select replaces the given descriptor
>>          * set with a subset consisting of those descriptors that are ready
>>          * for the specified operation - in this case, a read. So we need to
>>          * first check to see if this file descriptor is included in the
>>          * returned subset
>>          */
>>         if (0 == FD_ISSET(pmix_server_globals.listen_socket, &readfds)) {
>>             /* this descriptor is not included */
>>             continue;
>>         }
>>
>>         /* this descriptor is ready to be read, which means a connection
>>          * request has been received - so harvest it. All we want to do
>>          * here is accept the connection and push the info onto the event
>>          * library for subsequent processing - we don't want to actually
>>          * process the connection here as it takes too long, and so the
>>          * OS might start rejecting connections due to timeout.
>>          */
>>         pending_connection = PMIX_NEW(pmix_pending_connection_t);
>>         event_assign(&pending_connection->ev, pmix_globals.evbase, -1,
>>                      EV_WRITE, connection_handler, pending_connection);
>>         pending_connection->sd = accept(pmix_server_globals.listen_socket,
>>                                         (struct sockaddr*)&(pending_connection->addr),
>>                                         &addrlen);
>>         if (pending_connection->sd < 0) {
>>             PMIX_RELEASE(pending_connection);
>>             if (pmix_socket_errno != EAGAIN ||
>>                 pmix_socket_errno != EWOULDBLOCK) {
>>                 if (EMFILE == pmix_socket_errno) {
>>                     PMIX_ERROR_LOG(PMIX_ERR_OUT_OF_RESOURCE);
>>                 } else {
>>                     pmix_output(0, "listen_thread: accept() failed: %s (%d).",
>>                                 strerror(pmix_socket_errno), pmix_socket_errno);
>>                 }
>>                 goto done;
>>             }
>>             continue;
>>         }
>>
>>         pmix_output_verbose(8, pmix_globals.debug_output,
>>                             "listen_thread: new connection: (%d, %d)",
>>                             pending_connection->sd, pmix_socket_errno);
>>         /* activate the event */
>>         event_active(&pending_connection->ev, EV_WRITE, 1);
>>         accepted_connections++;
>>     } while (accepted_connections > 0);
>>
>> On Oct 28, 2015, at 12:25 AM, Ralph Castain <r...@open-mpi.org> wrote:
>>
>> Looking at the code, it appears that a fix was committed for this
>> problem, and that we correctly resolved the issue found by Paul. The
>> problem is that the fix didn’t get upstreamed, and so it was lost the next
>> time we refreshed PMIx. Sigh.
>>
>> Let me try to recreate the fix and have you take a gander at it.
>>
>> On Oct 28, 2015, at 12:22 AM, Ralph Castain <r...@open-mpi.org> wrote:
>>
>> Here is the discussion - afraid it is fairly lengthy. Ignore the hwloc
>> references in it, as that was a separate issue:
>>
>> http://www.open-mpi.org/community/lists/devel/2015/09/18074.php
>>
>> It definitely sounds like the same issue creeping in again. I’d
>> appreciate any thoughts on how to correct it. If it helps, you could look
>> at the PMIx master - there are standalone tests in the test/simple
>> directory that fork/exec a child and just do the connection.
>>
>> https://github.com/pmix/master
>>
>> The test server is simptest.c - it will spawn a single copy of
>> simpclient.c by default.
>>
>> On Oct 27, 2015, at 10:14 PM, George Bosilca <bosi...@icl.utk.edu> wrote:
>>
>> Interesting. Do you have a pointer to the commit (and/or to the
>> discussion)?
>>
>> I looked at the PMIx code and identified a few issues, but unfortunately
>> none of them seems to fix the problem for good. However, I now need more
>> than 1000 runs to get a deadlock (instead of a few tens).
>>
>> Looking with "netstat -ax" at the status of the UDS while the processes
>> are deadlocked, I see two UDS with the same name: one from the server,
>> which is in LISTEN state, and one from the client, which is in CONNECTING
>> state (while the client has already sent a message into the socket and is
>> now waiting in a blocking receive). This suggests that the server has not
>> yet called accept on the UDS. Unfortunately, there are three threads all
>> doing different flavors of event_base and select, so I have a hard time
>> tracking the path of the UDS on the server side.
>>
>> So, in order to validate my assumption, I wrote a minimalistic UDS client
>> and server application and tried different scenarios. The conclusion is
>> that, in order to see the same type of output from "netstat -ax", I have
>> to call listen on the server, connect on the client, and never call
>> accept on the server.
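For reference, the listen-without-accept pattern George describes is easy to reproduce in isolation. Below is only a minimal sketch of that kind of test (my own reconstruction, not George's actual program; the socket path is made up): the "server" binds and listens but deliberately never calls accept, while the "client" connects and writes a few bytes, just as the PMIx client does.

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <unistd.h>
    #include <sys/socket.h>
    #include <sys/un.h>

    #define SOCK_PATH "/tmp/uds_accept_test"   /* made-up path for the example */

    int main(void)
    {
        struct sockaddr_un addr;
        int srv, cli;

        memset(&addr, 0, sizeof(addr));
        addr.sun_family = AF_UNIX;
        strncpy(addr.sun_path, SOCK_PATH, sizeof(addr.sun_path) - 1);
        unlink(SOCK_PATH);

        /* "server": bind + listen, but deliberately never accept() */
        if ((srv = socket(AF_UNIX, SOCK_STREAM, 0)) < 0 ||
            bind(srv, (struct sockaddr *)&addr, sizeof(addr)) < 0 ||
            listen(srv, 8) < 0) {
            perror("server setup");
            return EXIT_FAILURE;
        }

        /* "client": connect and write immediately, as the PMIx client does */
        if ((cli = socket(AF_UNIX, SOCK_STREAM, 0)) < 0 ||
            connect(cli, (struct sockaddr *)&addr, sizeof(addr)) < 0) {
            perror("client setup");
            return EXIT_FAILURE;
        }
        if (write(cli, "hello", 5) != 5) {
            perror("write");
            return EXIT_FAILURE;
        }

        printf("connected and wrote 5 bytes without any accept();\n"
               "now run:  netstat -ax | grep %s\n", SOCK_PATH);
        sleep(60);                 /* window to inspect the socket states */

        close(cli);
        close(srv);
        unlink(SOCK_PATH);
        return EXIT_SUCCESS;
    }

While it sleeps, "netstat -ax" (or "ss -x") can be run against the socket path and the reported states compared with the ones seen in the deadlocked runs; as noted below, the written bytes simply stay queued in the socket until an accept() happens.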
>> While doing so, I also confirmed that the UDS holds the data sent, so
>> there is no need for further synchronization for the case where the data
>> is sent first. We only need to find out how the server forgets to call
>> accept.
>>
>> George.
>>
>> On Tue, Oct 27, 2015 at 7:52 PM, Ralph Castain <r...@open-mpi.org> wrote:
>>
>>> Hmmm…this looks like it might be that problem we previously saw where
>>> the blocking recv hangs in a proc when the blocking send tries to send
>>> before the domain socket is actually ready, and so the send fails on the
>>> other end. As I recall, it was something to do with the socket options -
>>> and then Paul had a problem on some of his machines, and we backed it out?
>>>
>>> I wonder if that’s what is biting us here again, and whether what we need
>>> is to either remove the blocking send/recvs altogether, or figure out a
>>> way to wait until the socket is really ready.
>>>
>>> Any thoughts?
>>>
>>> On Oct 27, 2015, at 4:11 PM, George Bosilca <bosi...@icl.utk.edu> wrote:
>>>
>>> It appears the branch solves the problem, at least partially. I asked one
>>> of my students to hammer it pretty badly, and he reported that the
>>> deadlocks still occur. He also graciously provided some stack traces:
>>>
>>> #0  0x00007f4bd5274aed in nanosleep () from /lib64/libc.so.6
>>> #1  0x00007f4bd52a9c94 in usleep () from /lib64/libc.so.6
>>> #2  0x00007f4bd2e42b00 in OPAL_PMIX_PMIX1XX_PMIx_Fence (procs=0x0, nprocs=0, info=0x7fff3c561960,
>>>     ninfo=1) at src/client/pmix_client_fence.c:100
>>> #3  0x00007f4bd306e6d2 in pmix1_fence (procs=0x0, collect_data=1) at pmix1_client.c:306
>>> #4  0x00007f4bd57d5cc3 in ompi_mpi_init (argc=3, argv=0x7fff3c561ea8, requested=3,
>>>     provided=0x7fff3c561d84) at runtime/ompi_mpi_init.c:644
>>> #5  0x00007f4bd5813399 in PMPI_Init_thread (argc=0x7fff3c561d7c, argv=0x7fff3c561d70, required=3,
>>>     provided=0x7fff3c561d84) at pinit_thread.c:69
>>> #6  0x0000000000401516 in main (argc=3, argv=0x7fff3c561ea8) at osu_mbw_mr.c:86
>>>
>>> And another process:
>>>
>>> #0  0x00007f7b9d7d8bdc in recv () from /lib64/libpthread.so.0
>>> #1  0x00007f7b9b0aa42d in opal_pmix_pmix1xx_pmix_usock_recv_blocking (sd=13, data=0x7ffd62139004 "",
>>>     size=4) at src/usock/usock.c:168
>>> #2  0x00007f7b9b0af5d9 in recv_connect_ack (sd=13) at src/client/pmix_client.c:844
>>> #3  0x00007f7b9b0b085e in usock_connect (addr=0x7ffd62139330) at src/client/pmix_client.c:1110
>>> #4  0x00007f7b9b0acc24 in connect_to_server (address=0x7ffd62139330, cbdata=0x7ffd621390e0)
>>>     at src/client/pmix_client.c:181
>>> #5  0x00007f7b9b0ad569 in OPAL_PMIX_PMIX1XX_PMIx_Init (proc=0x7f7b9b4e9b60)
>>>     at src/client/pmix_client.c:362
>>> #6  0x00007f7b9b2dbd9d in pmix1_client_init () at pmix1_client.c:99
>>> #7  0x00007f7b9b4eb95f in pmi_component_query (module=0x7ffd62139490, priority=0x7ffd6213948c)
>>>     at ess_pmi_component.c:90
>>> #8  0x00007f7b9ce70ec5 in mca_base_select (type_name=0x7f7b9d20e059 "ess", output_id=-1,
>>>     components_available=0x7f7b9d431eb0, best_module=0x7ffd621394d0, best_component=0x7ffd621394d8,
>>>     priority_out=0x0) at mca_base_components_select.c:77
>>> #9  0x00007f7b9d1a956b in orte_ess_base_select () at base/ess_base_select.c:40
>>> #10 0x00007f7b9d160449 in orte_init (pargc=0x0, pargv=0x0, flags=32) at runtime/orte_init.c:219
>>> #11 0x00007f7b9da4377a in ompi_mpi_init (argc=3, argv=0x7ffd621397f8, requested=3,
>>>     provided=0x7ffd621396d4) at runtime/ompi_mpi_init.c:488
>>> #12 0x00007f7b9da81399 in PMPI_Init_thread (argc=0x7ffd621396cc, argv=0x7ffd621396c0, required=3,
>>>     provided=0x7ffd621396d4) at pinit_thread.c:69
>>> #13 0x0000000000401516 in main (argc=3, argv=0x7ffd621397f8) at osu_mbw_mr.c:86
>>>
>>> George.
>>>
>>> On Tue, Oct 27, 2015 at 2:36 PM, Ralph Castain <r...@open-mpi.org> wrote:
>>>
>>>> I haven’t been able to replicate this when using the branch in this PR:
>>>>
>>>> https://github.com/open-mpi/ompi/pull/1073
>>>>
>>>> Would you mind giving it a try? It fixes some other race conditions and
>>>> might pick this one up too.
>>>>
>>>> On Oct 27, 2015, at 10:04 AM, Ralph Castain <r...@open-mpi.org> wrote:
>>>>
>>>> Okay, I’ll take a look - I’ve been chasing a race condition that might
>>>> be related.
>>>>
>>>> On Oct 27, 2015, at 9:54 AM, George Bosilca <bosi...@icl.utk.edu> wrote:
>>>>
>>>> No, it's using 2 nodes.
>>>>
>>>> George.
>>>>
>>>> On Tue, Oct 27, 2015 at 12:35 PM, Ralph Castain <r...@open-mpi.org> wrote:
>>>>
>>>>> Is this on a single node?
>>>>>
>>>>> On Oct 27, 2015, at 9:25 AM, George Bosilca <bosi...@icl.utk.edu> wrote:
>>>>>
>>>>> I get intermittent deadlocks with the latest trunk. The smallest
>>>>> reproducer is a shell for-loop around a small (2-process), short
>>>>> (20-second) MPI application. After a few tens of iterations, MPI_Init
>>>>> will deadlock with the following backtrace:
>>>>>
>>>>> #0  0x00007fa94b5d9aed in nanosleep () from /lib64/libc.so.6
>>>>> #1  0x00007fa94b60ec94 in usleep () from /lib64/libc.so.6
>>>>> #2  0x00007fa94960ba08 in OPAL_PMIX_PMIX1XX_PMIx_Fence (procs=0x0, nprocs=0, info=0x7ffd7934fb90,
>>>>>     ninfo=1) at src/client/pmix_client_fence.c:100
>>>>> #3  0x00007fa9498376a2 in pmix1_fence (procs=0x0, collect_data=1) at pmix1_client.c:305
>>>>> #4  0x00007fa94bb39ba4 in ompi_mpi_init (argc=3, argv=0x7ffd793500a8, requested=3,
>>>>>     provided=0x7ffd7934ff94) at runtime/ompi_mpi_init.c:645
>>>>> #5  0x00007fa94bb77281 in PMPI_Init_thread (argc=0x7ffd7934ff8c, argv=0x7ffd7934ff80, required=3,
>>>>>     provided=0x7ffd7934ff94) at pinit_thread.c:69
>>>>> #6  0x000000000040150f in main (argc=3, argv=0x7ffd793500a8) at osu_mbw_mr.c:86
>>>>>
>>>>> On my machines this is reproducible at 100% after anywhere between 50
>>>>> and 100 iterations.
>>>>>
>>>>> Thanks,
>>>>> George.
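While the missing accept is being tracked down, one mitigation worth considering is the client-side timeout Ralph mentions above, i.e. bounding the blocking connect-ack recv instead of letting it hang forever. The following is only a rough sketch of that idea; the helper name, the timeout value, and the socketpair demo are made up, and this is not the actual pmix_usock code.

    #include <errno.h>
    #include <stdio.h>
    #include <sys/socket.h>
    #include <sys/time.h>
    #include <sys/types.h>
    #include <unistd.h>

    /* Receive exactly "size" bytes, but give up if nothing arrives within
     * "timeout_sec" seconds. Returns 0 on success, -1 on error or timeout. */
    int recv_blocking_with_timeout(int sd, void *data, size_t size, int timeout_sec)
    {
        struct timeval tv = { .tv_sec = timeout_sec, .tv_usec = 0 };
        char *ptr = data;
        size_t got = 0;

        /* Have the kernel fail the recv() with EAGAIN/EWOULDBLOCK when no
         * data shows up within the timeout. */
        if (setsockopt(sd, SOL_SOCKET, SO_RCVTIMEO, &tv, sizeof(tv)) < 0) {
            return -1;
        }

        while (got < size) {
            ssize_t n = recv(sd, ptr + got, size - got, 0);
            if (n > 0) {
                got += (size_t)n;
            } else if (n < 0 && EINTR == errno) {
                continue;                      /* interrupted - just retry */
            } else {
                /* n == 0: peer closed; EAGAIN/EWOULDBLOCK: timed out */
                return -1;
            }
        }
        return 0;
    }

    /* Quick demo: nothing is ever written on the other end of the pair, so
     * the guarded recv gives up after ~2 seconds instead of hanging. */
    int main(void)
    {
        int sv[2];
        char buf[4];

        if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) < 0) {
            perror("socketpair");
            return 1;
        }
        if (recv_blocking_with_timeout(sv[0], buf, sizeof(buf), 2) < 0) {
            fprintf(stderr, "recv timed out or failed (expected here)\n");
        }
        close(sv[0]);
        close(sv[1]);
        return 0;
    }

In the real client, this kind of guard could sit around the recv in recv_connect_ack(), so a process stuck in the second backtrace above could give up and retry the connection instead of blocking indefinitely.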
--
С Уважением, Поляков Артем Юрьевич
Best regards, Artem Y. Polyakov