In listen_thread():

 194     while (pmix_server_globals.listen_thread_active) {
 195         FD_ZERO(&readfds);
 196         FD_SET(pmix_server_globals.listen_socket, &readfds);
 197         max = pmix_server_globals.listen_socket;
Is it possible that pmix_server_globals.listen_thread_active can be false, in which case the thread just exits and never calls accept()? In pmix_start_listening():

 147     /* fork off the listener thread */
 148     if (0 > pthread_create(&engine, NULL, listen_thread, NULL)) {
 149         return PMIX_ERROR;
 150     }
 151     pmix_server_globals.listen_thread_active = true;

pmix_server_globals.listen_thread_active is set to true only after the thread is created - could this cause a race? listen_thread_active might also need to be declared volatile.
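For illustration, a minimal sketch of the reordering I have in mind - untested, with a bare flag standing in for the real pmix_server_globals member, and volatile only as a placeholder for whatever visibility guarantee we settle on (an atomic or a mutex/condvar handshake would be stronger):

    #include <pthread.h>
    #include <stdbool.h>

    /* stand-ins for the real global flag and the listener entry point */
    static volatile bool listen_thread_active = false;
    extern void *listen_thread(void *arg);

    static int start_listening_sketch(pthread_t *engine)
    {
        /* raise the flag *before* the thread exists, so listen_thread()
         * cannot observe a stale 'false' and exit without ever reaching
         * accept() */
        listen_thread_active = true;

        if (0 != pthread_create(engine, NULL, listen_thread, NULL)) {
            listen_thread_active = false;   /* creation failed - roll back */
            return -1;                      /* i.e. PMIX_ERROR */
        }
        return 0;
    }

Setting the flag first means the worst case on a failed pthread_create is a rolled-back flag, rather than a listener that silently never serves accept().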
Regards
--Nysal

On Sun, Nov 8, 2015 at 10:38 PM, George Bosilca <bosi...@icl.utk.edu> wrote:

> We had a power outage last week and the local disks on our cluster were wiped out. My tester was in there. But, I can rewrite it after SC.
>
> George.
>
> On Sat, Nov 7, 2015 at 12:04 PM, Ralph Castain <r...@open-mpi.org> wrote:
>
>> Could you send me your stress test? I'm wondering if it is just something about how we set socket options.
>>
>> On Nov 7, 2015, at 8:58 AM, George Bosilca <bosi...@icl.utk.edu> wrote:
>>
>> I had to postpone this until after SC. However, I ran a stress test of UDS for 3 days, reproducing the opening and sending of data (what Ralph said in his email), and I never could get a deadlock.
>>
>> George.
>>
>> On Sat, Nov 7, 2015 at 11:26 AM, Ralph Castain <r...@open-mpi.org> wrote:
>>
>>> George was looking into it, but I don't know if he has had time recently to continue the investigation. We understand "what" is happening (accept sometimes ignores the connection), but we don't yet know "why". I've done some digging around the web, and found that sometimes you can try to talk to a Unix Domain Socket too quickly - i.e., you open it and then send to it, but the OS hasn't yet set it up. In those cases, you can hang the socket. However, I've tried adding some artificial delay, and while it helped, it didn't completely solve the problem.
>>>
>>> I have an idea for a workaround (set a timer and retry after a while), but would obviously prefer a real solution. I'm not even sure it will work, as it is unclear that the server (which is the one hung in accept) will break free if the client closes the socket and retries.
>>>
>>> On Nov 6, 2015, at 10:53 PM, Artem Polyakov <artpo...@gmail.com> wrote:
>>>
>>> Hello, is there any progress on this topic? This affects our PMIx measurements.
>>>
>>> 2015-10-30 21:21 GMT+06:00 Ralph Castain <r...@open-mpi.org>:
>>>
>>>> I've verified that the orte/util/listener thread is not being started, so I don't think it should be involved in this problem.
>>>>
>>>> HTH
>>>> Ralph
>>>>
>>>> On Oct 30, 2015, at 8:07 AM, Ralph Castain <r...@open-mpi.org> wrote:
>>>>
>>>> Hmmm… there is a hook that would allow the PMIx server to utilize that listener thread, but we aren't currently using it. Each daemon plus mpirun will call orte_start_listener, but nothing is currently registering, and so the listener in that code is supposed to just return without starting the thread.
>>>>
>>>> So the only listener thread that should exist is the one inside the PMIx server itself. If something else is happening, then that would be a bug. I can look at the orte listener code to ensure that the thread isn't incorrectly starting.
>>>>
>>>> On Oct 29, 2015, at 10:03 PM, George Bosilca <bosi...@icl.utk.edu> wrote:
>>>>
>>>> Some progress, which puzzles me but might help you understand. Once the deadlock appears, if I manually kill the MPI process on the node where the deadlock was created, the local orte daemon doesn't notice and will just keep waiting.
>>>>
>>>> Quick question: I am under the impression that the issue is not in the PMIx server but somewhere around the listener_thread_fn in orte/util/listener.c. Possible?
>>>>
>>>> George.
>>>>
>>>> On Wed, Oct 28, 2015 at 3:56 AM, Ralph Castain <r...@open-mpi.org> wrote:
>>>>
>>>>> Should have also clarified: the prior fixes are indeed in the current master.
>>>>>
>>>>> On Oct 28, 2015, at 12:42 AM, Ralph Castain <r...@open-mpi.org> wrote:
>>>>>
>>>>> Nope - I was wrong. The correction on the client side consisted of attempting to timeout if the blocking recv failed. We then modified the blocking send/recv so they would handle errors.
>>>>>
>>>>> So that problem occurred -after- the server had correctly called accept. The listener code is in opal/mca/pmix/pmix1xx/pmix/src/server/pmix_server_listener.c
>>>>>
>>>>> It looks to me like the only way we could drop the accept (assuming the OS doesn't lose it) is if the file descriptor lies outside the expected range once we fall out of select:
>>>>>
>>>>>     /* Spin accepting connections until all active listen sockets
>>>>>      * do not have any incoming connections, pushing each connection
>>>>>      * onto the event queue for processing
>>>>>      */
>>>>>     do {
>>>>>         accepted_connections = 0;
>>>>>         /* according to the man pages, select replaces the given descriptor
>>>>>          * set with a subset consisting of those descriptors that are ready
>>>>>          * for the specified operation - in this case, a read. So we need to
>>>>>          * first check to see if this file descriptor is included in the
>>>>>          * returned subset
>>>>>          */
>>>>>         if (0 == FD_ISSET(pmix_server_globals.listen_socket, &readfds)) {
>>>>>             /* this descriptor is not included */
>>>>>             continue;
>>>>>         }
>>>>>
>>>>>         /* this descriptor is ready to be read, which means a connection
>>>>>          * request has been received - so harvest it. All we want to do
>>>>>          * here is accept the connection and push the info onto the event
>>>>>          * library for subsequent processing - we don't want to actually
>>>>>          * process the connection here as it takes too long, and so the
>>>>>          * OS might start rejecting connections due to timeout.
>>>>>          */
>>>>>         pending_connection = PMIX_NEW(pmix_pending_connection_t);
>>>>>         event_assign(&pending_connection->ev, pmix_globals.evbase, -1,
>>>>>                      EV_WRITE, connection_handler, pending_connection);
>>>>>         pending_connection->sd = accept(pmix_server_globals.listen_socket,
>>>>>                                         (struct sockaddr*)&(pending_connection->addr),
>>>>>                                         &addrlen);
>>>>>         if (pending_connection->sd < 0) {
>>>>>             PMIX_RELEASE(pending_connection);
>>>>>             if (pmix_socket_errno != EAGAIN ||
>>>>>                 pmix_socket_errno != EWOULDBLOCK) {
>>>>>                 if (EMFILE == pmix_socket_errno) {
>>>>>                     PMIX_ERROR_LOG(PMIX_ERR_OUT_OF_RESOURCE);
>>>>>                 } else {
>>>>>                     pmix_output(0, "listen_thread: accept() failed: %s (%d).",
>>>>>                                 strerror(pmix_socket_errno), pmix_socket_errno);
>>>>>                 }
>>>>>                 goto done;
>>>>>             }
>>>>>             continue;
>>>>>         }
>>>>>
>>>>>         pmix_output_verbose(8, pmix_globals.debug_output,
>>>>>                             "listen_thread: new connection: (%d, %d)",
>>>>>                             pending_connection->sd, pmix_socket_errno);
>>>>>         /* activate the event */
>>>>>         event_active(&pending_connection->ev, EV_WRITE, 1);
>>>>>         accepted_connections++;
>>>>>     } while (accepted_connections > 0);
>>>>>
>>>>> On Oct 28, 2015, at 12:25 AM, Ralph Castain <r...@open-mpi.org> wrote:
>>>>>
>>>>> Looking at the code, it appears that a fix was committed for this problem, and that we correctly resolved the issue found by Paul. The problem is that the fix didn't get upstreamed, and so it was lost the next time we refreshed PMIx. Sigh.
>>>>>
>>>>> Let me try to recreate the fix and have you take a gander at it.
>>>>>
>>>>> On Oct 28, 2015, at 12:22 AM, Ralph Castain <r...@open-mpi.org> wrote:
>>>>>
>>>>> Here is the discussion - afraid it is fairly lengthy. Ignore the hwloc references in it, as that was a separate issue:
>>>>>
>>>>> http://www.open-mpi.org/community/lists/devel/2015/09/18074.php
>>>>>
>>>>> It definitely sounds like the same issue creeping in again. I'd appreciate any thoughts on how to correct it. If it helps, you could look at the PMIx master - there are standalone tests in the test/simple directory that fork/exec a child and just do the connection.
>>>>>
>>>>> https://github.com/pmix/master
>>>>>
>>>>> The test server is simptest.c - it will spawn a single copy of simpclient.c by default.
>>>>>
>>>>> On Oct 27, 2015, at 10:14 PM, George Bosilca <bosi...@icl.utk.edu> wrote:
>>>>>
>>>>> Interesting. Do you have a pointer to the commit (and/or to the discussion)?
>>>>>
>>>>> I looked at the PMIx code and identified a few issues, but unfortunately none of them seem to fix the problem for good. However, now I need more than 1000 runs to get a deadlock (instead of a few tens).
>>>>>
>>>>> Looking with "netstat -ax" at the status of the UDS while the processes are deadlocked, I see 2 UDS with the same name: one from the server, which is in LISTEN state, and one for the client, which is in CONNECTING state (while the client has already sent a message into the socket and is now waiting in a blocking receive). This somehow suggests that the server has not yet called accept on the UDS. Unfortunately, there are 3 threads all doing different flavors of event_base and select, so I have a hard time tracking the path of the UDS on the server side.
>>>>>
>>>>> So in order to validate my assumption I wrote a minimalistic UDS client and server application and tried different scenarios.
>>>>> The conclusion is that, in order to see the same type of output from "netstat -ax", I have to call listen on the server, connect on the client, and not call accept on the server.
>>>>>
>>>>> On the same occasion I also confirmed that the UDS holds the data sent, so there is no need for further synchronization for the case where the data is sent first. We only need to find out how the server forgets to call accept.
>>>>>
>>>>> George.
>>>>>
>>>>> On Tue, Oct 27, 2015 at 7:52 PM, Ralph Castain <r...@open-mpi.org> wrote:
>>>>>
>>>>>> Hmmm… this looks like it might be that problem we previously saw where the blocking recv hangs in a proc when the blocking send tries to send before the domain socket is actually ready, and so the send fails on the other end. As I recall, it was something to do with the socket options - and then Paul had a problem on some of his machines, and we backed it out?
>>>>>>
>>>>>> I wonder if that's what is biting us here again, and whether what we need is to either remove the blocking send/recv's altogether or figure out a way to wait until the socket is really ready.
>>>>>>
>>>>>> Any thoughts?
>>>>>>
>>>>>> On Oct 27, 2015, at 4:11 PM, George Bosilca <bosi...@icl.utk.edu> wrote:
>>>>>>
>>>>>> It appears the branch solves the problem at least partially. I asked one of my students to hammer it pretty badly, and he reported that the deadlocks still occur. He also graciously provided some stack traces:
>>>>>>
>>>>>> #0  0x00007f4bd5274aed in nanosleep () from /lib64/libc.so.6
>>>>>> #1  0x00007f4bd52a9c94 in usleep () from /lib64/libc.so.6
>>>>>> #2  0x00007f4bd2e42b00 in OPAL_PMIX_PMIX1XX_PMIx_Fence (procs=0x0, nprocs=0, info=0x7fff3c561960, ninfo=1) at src/client/pmix_client_fence.c:100
>>>>>> #3  0x00007f4bd306e6d2 in pmix1_fence (procs=0x0, collect_data=1) at pmix1_client.c:306
>>>>>> #4  0x00007f4bd57d5cc3 in ompi_mpi_init (argc=3, argv=0x7fff3c561ea8, requested=3, provided=0x7fff3c561d84) at runtime/ompi_mpi_init.c:644
>>>>>> #5  0x00007f4bd5813399 in PMPI_Init_thread (argc=0x7fff3c561d7c, argv=0x7fff3c561d70, required=3, provided=0x7fff3c561d84) at pinit_thread.c:69
>>>>>> #6  0x0000000000401516 in main (argc=3, argv=0x7fff3c561ea8) at osu_mbw_mr.c:86
>>>>>>
>>>>>> And another process:
>>>>>>
>>>>>> #0  0x00007f7b9d7d8bdc in recv () from /lib64/libpthread.so.0
>>>>>> #1  0x00007f7b9b0aa42d in opal_pmix_pmix1xx_pmix_usock_recv_blocking (sd=13, data=0x7ffd62139004 "", size=4) at src/usock/usock.c:168
>>>>>> #2  0x00007f7b9b0af5d9 in recv_connect_ack (sd=13) at src/client/pmix_client.c:844
>>>>>> #3  0x00007f7b9b0b085e in usock_connect (addr=0x7ffd62139330) at src/client/pmix_client.c:1110
>>>>>> #4  0x00007f7b9b0acc24 in connect_to_server (address=0x7ffd62139330, cbdata=0x7ffd621390e0) at src/client/pmix_client.c:181
>>>>>> #5  0x00007f7b9b0ad569 in OPAL_PMIX_PMIX1XX_PMIx_Init (proc=0x7f7b9b4e9b60) at src/client/pmix_client.c:362
>>>>>> #6  0x00007f7b9b2dbd9d in pmix1_client_init () at pmix1_client.c:99
>>>>>> #7  0x00007f7b9b4eb95f in pmi_component_query (module=0x7ffd62139490, priority=0x7ffd6213948c) at ess_pmi_component.c:90
>>>>>> #8  0x00007f7b9ce70ec5 in mca_base_select (type_name=0x7f7b9d20e059 "ess", output_id=-1, components_available=0x7f7b9d431eb0, best_module=0x7ffd621394d0, best_component=0x7ffd621394d8, priority_out=0x0) at mca_base_components_select.c:77
>>>>>> #9  0x00007f7b9d1a956b in orte_ess_base_select () at base/ess_base_select.c:40
>>>>>> #10 0x00007f7b9d160449 in orte_init (pargc=0x0, pargv=0x0, flags=32) at runtime/orte_init.c:219
>>>>>> #11 0x00007f7b9da4377a in ompi_mpi_init (argc=3, argv=0x7ffd621397f8, requested=3, provided=0x7ffd621396d4) at runtime/ompi_mpi_init.c:488
>>>>>> #12 0x00007f7b9da81399 in PMPI_Init_thread (argc=0x7ffd621396cc, argv=0x7ffd621396c0, required=3, provided=0x7ffd621396d4) at pinit_thread.c:69
>>>>>> #13 0x0000000000401516 in main (argc=3, argv=0x7ffd621397f8) at osu_mbw_mr.c:86
>>>>>>
>>>>>> George.
>>>>>>
>>>>>> On Tue, Oct 27, 2015 at 2:36 PM, Ralph Castain <r...@open-mpi.org> wrote:
>>>>>>
>>>>>>> I haven't been able to replicate this when using the branch in this PR:
>>>>>>>
>>>>>>> https://github.com/open-mpi/ompi/pull/1073
>>>>>>>
>>>>>>> Would you mind giving it a try? It fixes some other race conditions and might pick this one up too.
>>>>>>>
>>>>>>> On Oct 27, 2015, at 10:04 AM, Ralph Castain <r...@open-mpi.org> wrote:
>>>>>>>
>>>>>>> Okay, I'll take a look - I've been chasing a race condition that might be related.
>>>>>>>
>>>>>>> On Oct 27, 2015, at 9:54 AM, George Bosilca <bosi...@icl.utk.edu> wrote:
>>>>>>>
>>>>>>> No, it's using 2 nodes.
>>>>>>>
>>>>>>> George.
>>>>>>>
>>>>>>> On Tue, Oct 27, 2015 at 12:35 PM, Ralph Castain <r...@open-mpi.org> wrote:
>>>>>>>
>>>>>>>> Is this on a single node?
>>>>>>>>
>>>>>>>> On Oct 27, 2015, at 9:25 AM, George Bosilca <bosi...@icl.utk.edu> wrote:
>>>>>>>>
>>>>>>>> I get intermittent deadlocks with the latest trunk. The smallest reproducer is a shell for loop around a small (2 processes), short (20 seconds) MPI application. After a few tens of iterations, MPI_Init will deadlock with the following backtrace:
>>>>>>>>
>>>>>>>> #0  0x00007fa94b5d9aed in nanosleep () from /lib64/libc.so.6
>>>>>>>> #1  0x00007fa94b60ec94 in usleep () from /lib64/libc.so.6
>>>>>>>> #2  0x00007fa94960ba08 in OPAL_PMIX_PMIX1XX_PMIx_Fence (procs=0x0, nprocs=0, info=0x7ffd7934fb90, ninfo=1) at src/client/pmix_client_fence.c:100
>>>>>>>> #3  0x00007fa9498376a2 in pmix1_fence (procs=0x0, collect_data=1) at pmix1_client.c:305
>>>>>>>> #4  0x00007fa94bb39ba4 in ompi_mpi_init (argc=3, argv=0x7ffd793500a8, requested=3, provided=0x7ffd7934ff94) at runtime/ompi_mpi_init.c:645
>>>>>>>> #5  0x00007fa94bb77281 in PMPI_Init_thread (argc=0x7ffd7934ff8c, argv=0x7ffd7934ff80, required=3, provided=0x7ffd7934ff94) at pinit_thread.c:69
>>>>>>>> #6  0x000000000040150f in main (argc=3, argv=0x7ffd793500a8) at osu_mbw_mr.c:86
>>>>>>>>
>>>>>>>> On my machines this is reproducible at 100% after anywhere between 50 and 100 iterations.
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> George.
>>>
>>> --
>>> Best regards, Artem Y. Polyakov
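P.S. George's listen-without-accept observation is easy to reproduce outside of PMIx. Below is a rough, untested sketch of a standalone test (the file name, socket path, and all identifiers are made up); run one copy as "server" and another as "client", then inspect the sockets with "netstat -ax":

    /* uds_no_accept.c - hypothetical sketch: the server listens but never
     * accepts, the client connects and writes; "netstat -ax" should then show
     * the listening socket plus a client socket stuck in the queued state
     * (CONNECTING on George's system). */
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>
    #include <sys/socket.h>
    #include <sys/un.h>

    #define SOCK_PATH "/tmp/uds_no_accept.sock"

    int main(int argc, char **argv)
    {
        struct sockaddr_un addr;
        int sd = socket(AF_UNIX, SOCK_STREAM, 0);
        if (sd < 0) { perror("socket"); return 1; }

        memset(&addr, 0, sizeof(addr));
        addr.sun_family = AF_UNIX;
        strncpy(addr.sun_path, SOCK_PATH, sizeof(addr.sun_path) - 1);

        if (argc > 1 && 0 == strcmp(argv[1], "server")) {
            unlink(SOCK_PATH);
            if (bind(sd, (struct sockaddr *)&addr, sizeof(addr)) < 0 ||
                listen(sd, 8) < 0) {
                perror("bind/listen"); return 1;
            }
            /* deliberately never call accept() - just sit here */
            pause();
        } else {
            if (connect(sd, (struct sockaddr *)&addr, sizeof(addr)) < 0) {
                perror("connect"); return 1;
            }
            /* the write succeeds even though the server never accepted */
            const char msg[] = "hello";
            if (write(sd, msg, sizeof(msg)) < 0) perror("write");
            /* now inspect the sockets from another terminal: netstat -ax */
            pause();
        }
        return 0;
    }

Both connect() and write() succeed because the kernel queues the connection and buffers the data, which matches George's point that no extra synchronization is needed for data sent before accept.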
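And a sketch of the client-side workaround Ralph mentions (timer plus retry) - again hypothetical, with a one-byte request/ack standing in for the real PMIx connect handshake. It bounds the blocking recv with SO_RCVTIMEO and simply closes and reconnects on timeout; whether the server side ever breaks free of accept() is exactly the open question:

    #include <string.h>
    #include <unistd.h>
    #include <sys/socket.h>
    #include <sys/time.h>
    #include <sys/un.h>

    /* hypothetical helper: connect, attempt a tiny handshake, and start over
     * from scratch if the server does not answer within the timeout */
    static int connect_with_retry(const char *path, int max_tries)
    {
        for (int attempt = 0; attempt < max_tries; attempt++) {
            int sd = socket(AF_UNIX, SOCK_STREAM, 0);
            if (sd < 0) return -1;

            /* bound the blocking recv of the connect ack so a stuck server
             * cannot hang us forever */
            struct timeval tv = { .tv_sec = 1, .tv_usec = 0 };
            setsockopt(sd, SOL_SOCKET, SO_RCVTIMEO, &tv, sizeof(tv));

            struct sockaddr_un addr;
            memset(&addr, 0, sizeof(addr));
            addr.sun_family = AF_UNIX;
            strncpy(addr.sun_path, path, sizeof(addr.sun_path) - 1);

            char req = 1, ack = 0;
            if (0 == connect(sd, (struct sockaddr *)&addr, sizeof(addr)) &&
                1 == write(sd, &req, 1) &&
                1 == recv(sd, &ack, 1, 0)) {
                return sd;        /* handshake answered - connection is live */
            }

            /* no answer (server possibly stuck before accept): start over */
            close(sd);
            usleep(100000);
        }
        return -1;
    }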