Nope - I was wrong. The correction on the client side consisted of timing out 
if the blocking recv failed. We then modified the blocking send/recv so they 
would handle errors.
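
FWIW, here is a minimal sketch of the kind of timeout-guarded blocking recv I 
mean. This is illustrative only - the function name and the 2-second timeout 
are made up and are not the actual PMIx helpers:

    #include <errno.h>
    #include <sys/socket.h>
    #include <sys/time.h>
    #include <sys/types.h>

    /* Illustrative sketch: read exactly "size" bytes, but give up if the peer
     * stays silent longer than the timeout instead of blocking forever. */
    static int recv_blocking_with_timeout(int sd, char *data, size_t size)
    {
        struct timeval tv = { .tv_sec = 2, .tv_usec = 0 };  /* made-up timeout */
        size_t cnt = 0;

        /* ask the kernel to fail recv with EAGAIN/EWOULDBLOCK once tv expires */
        if (setsockopt(sd, SOL_SOCKET, SO_RCVTIMEO, &tv, sizeof(tv)) < 0) {
            return -1;
        }

        while (cnt < size) {
            ssize_t n = recv(sd, data + cnt, size - cnt, 0);
            if (n < 0) {
                if (EINTR == errno) {
                    continue;        /* interrupted - just retry */
                }
                /* EAGAIN/EWOULDBLOCK here means the timeout fired - return
                 * so the caller can recover instead of hanging forever */
                return -1;
            }
            if (0 == n) {
                return -1;           /* peer closed the socket */
            }
            cnt += (size_t)n;
        }
        return 0;
    }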

So that problem occurred -after- the server had correctly called accept. The 
listener code is in opal/mca/pmix/pmix1xx/pmix/src/server/pmix_server_listener.c

It looks to me like the only way we could drop the accept (assuming the OS 
doesn’t lose it) is if the listen socket’s file descriptor is not in the 
descriptor set returned by select once we fall out of it:


        /* Spin accepting connections until all active listen sockets
         * do not have any incoming connections, pushing each connection
         * onto the event queue for processing
         */
        do {
            accepted_connections = 0;
            /* according to the man pages, select replaces the given descriptor
             * set with a subset consisting of those descriptors that are ready
             * for the specified operation - in this case, a read. So we need to
             * first check to see if this file descriptor is included in the
             * returned subset
             */
            if (0 == FD_ISSET(pmix_server_globals.listen_socket, &readfds)) {
                /* this descriptor is not included */
                continue;
            }

            /* this descriptor is ready to be read, which means a connection
             * request has been received - so harvest it. All we want to do
             * here is accept the connection and push the info onto the event
             * library for subsequent processing - we don't want to actually
             * process the connection here as it takes too long, and so the
             * OS might start rejecting connections due to timeout.
             */
            pending_connection = PMIX_NEW(pmix_pending_connection_t);
            event_assign(&pending_connection->ev, pmix_globals.evbase, -1,
                         EV_WRITE, connection_handler, pending_connection);
            pending_connection->sd = accept(pmix_server_globals.listen_socket,
                                            (struct sockaddr*)&(pending_connection->addr),
                                            &addrlen);
            if (pending_connection->sd < 0) {
                PMIX_RELEASE(pending_connection);
                if (pmix_socket_errno != EAGAIN ||
                    pmix_socket_errno != EWOULDBLOCK) {
                    if (EMFILE == pmix_socket_errno) {
                        PMIX_ERROR_LOG(PMIX_ERR_OUT_OF_RESOURCE);
                    } else {
                        pmix_output(0, "listen_thread: accept() failed: %s (%d).",
                                    strerror(pmix_socket_errno), pmix_socket_errno);
                    }
                    goto done;
                }
                continue;
            }

            pmix_output_verbose(8, pmix_globals.debug_output,
                                "listen_thread: new connection: (%d, %d)",
                                pending_connection->sd, pmix_socket_errno);
            /* activate the event */
            event_active(&pending_connection->ev, EV_WRITE, 1);
            accepted_connections++;
        } while (accepted_connections > 0);
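
As a point of comparison, here is a rough standalone sketch (not George’s 
actual test program - the socket path and buffer sizes are made up) of the 
listen-without-accept scenario he describes further down: the client connects 
and sends before anyone accepts, the connection sits in the backlog, and the 
data is still there when accept finally happens.

    #include <stdio.h>
    #include <string.h>
    #include <sys/socket.h>
    #include <sys/un.h>
    #include <sys/wait.h>
    #include <unistd.h>

    #define SOCK_PATH "/tmp/uds_accept_test"   /* made-up path */

    int main(void)
    {
        struct sockaddr_un addr;
        memset(&addr, 0, sizeof(addr));
        addr.sun_family = AF_UNIX;
        strncpy(addr.sun_path, SOCK_PATH, sizeof(addr.sun_path) - 1);
        unlink(SOCK_PATH);

        /* "server": bind + listen, but deliberately no accept yet */
        int lsd = socket(AF_UNIX, SOCK_STREAM, 0);
        if (bind(lsd, (struct sockaddr *)&addr, sizeof(addr)) < 0 ||
            listen(lsd, 8) < 0) {
            perror("bind/listen");
            return 1;
        }

        if (0 == fork()) {
            /* "client": connect lands in the backlog, and the send is buffered
             * by the kernel even though nobody has accepted yet */
            int csd = socket(AF_UNIX, SOCK_STREAM, 0);
            connect(csd, (struct sockaddr *)&addr, sizeof(addr));
            const char msg[] = "sent before accept";
            send(csd, msg, sizeof(msg), 0);
            /* this blocking recv hangs until the server accepts and answers
             * (or closes) - the same signature as the stuck client */
            char reply[64];
            recv(csd, reply, sizeof(reply), 0);
            close(csd);
            _exit(0);
        }

        sleep(2);   /* window in which to inspect the sockets with "netstat -ax" */

        /* late accept: the queued connection and its data were not lost */
        int sd = accept(lsd, NULL, NULL);
        char buf[64] = "";
        recv(sd, buf, sizeof(buf), 0);
        printf("server read after late accept: \"%s\"\n", buf);

        close(sd);          /* unblocks the child's recv with EOF */
        wait(NULL);
        close(lsd);
        unlink(SOCK_PATH);
        return 0;
    }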


> On Oct 28, 2015, at 12:25 AM, Ralph Castain <r...@open-mpi.org> wrote:
> 
> Looking at the code, it appears that a fix was committed for this problem, 
> and that we correctly resolved the issue found by Paul. The problem is that 
> the fix didn’t get upstreamed, and so it was lost the next time we refreshed 
> PMIx. Sigh.
> 
> Let me try to recreate the fix and have you take a gander at it.
> 
> 
>> On Oct 28, 2015, at 12:22 AM, Ralph Castain <r...@open-mpi.org> wrote:
>> 
>> Here is the discussion - afraid it is fairly lengthy. Ignore the hwloc 
>> references in it as that was a separate issue:
>> 
>> http://www.open-mpi.org/community/lists/devel/2015/09/18074.php
>> 
>> It definitely sounds like the same issue creeping in again. I’d appreciate 
>> any thoughts on how to correct it. If it helps, you could look at the PMIx 
>> master - there are standalone tests in the test/simple directory that 
>> fork/exec a child and just do the connection.
>> 
>> https://github.com/pmix/master
>> 
>> The test server is simptest.c - it will spawn a single copy of simpclient.c 
>> by default.
>> 
>> 
>>> On Oct 27, 2015, at 10:14 PM, George Bosilca <bosi...@icl.utk.edu> wrote:
>>> 
>>> Interesting. Do you have a pointer to the commit (and/or to the discussion)?
>>> 
>>> I looked at the PMIx code and identified a few issues, but unfortunately 
>>> none of them seems to fix the problem for good. However, now I need more 
>>> than 1000 runs to get a deadlock (instead of a few tens).
>>> 
>>> Looking with "netstat -ax" at the status of the UDS while the processes are 
>>> deadlocked, I see 2 UDS with the same name: one from the server which is in 
>>> LISTEN state, and one for the client which is being in CONNECTING state 
>>> (while the client already sent a message in the socket and is now waiting 
>>> in a blocking receive). This somehow suggest that the server has not yet 
>>> called accept on the UDS. Unfortunately, there are 3 threads all doing 
>>> different flavors of even_base and select, so I have a hard time tracking 
>>> the path of the UDS on the server side.
>>> 
>>> So in order to validate my assumption I wrote a minimalistic UDS client and 
>>> server application and tried different scenarios. The conclusion is that in 
>>> order to see the same type of output from "netstat -ax" I have to call 
>>> listen on the server, connect on the client, and not call accept on the 
>>> server.
>>> 
>>> On the same occasion I also confirmed that the UDS holds the data that was 
>>> sent, so there is no need for further synchronization for the case where 
>>> the data is sent first. We only need to find out how the server forgets to 
>>> call accept.
>>> 
>>>   George.
>>> 
>>> 
>>> 
>>> On Tue, Oct 27, 2015 at 7:52 PM, Ralph Castain <r...@open-mpi.org> wrote:
>>> Hmmm…this looks like it might be that problem we previously saw where the 
>>> blocking recv hangs in a proc when the blocking send tries to send before 
>>> the domain socket is actually ready, and so the send fails on the other 
>>> end. As I recall, it had something to do with the socket options - and then 
>>> Paul had a problem on some of his machines, and we backed it out?
>>> 
>>> I wonder if that’s what is biting us here again, and what we need is to 
>>> either remove the blocking send/recv’s altogether, or figure out a way to 
>>> wait until the socket is really ready.
>>> 
>>> Any thoughts?
>>> 
>>> 
>>>> On Oct 27, 2015, at 4:11 PM, George Bosilca <bosi...@icl.utk.edu> wrote:
>>>> 
>>>> It appears the branch solves the problem at least partially. I asked one 
>>>> of my students to hammer it pretty badly, and he reported that the 
>>>> deadlocks still occur. He also graciously provided some stack traces:
>>>> 
>>>> #0  0x00007f4bd5274aed in nanosleep () from /lib64/libc.so.6
>>>> #1  0x00007f4bd52a9c94 in usleep () from /lib64/libc.so.6
>>>> #2  0x00007f4bd2e42b00 in OPAL_PMIX_PMIX1XX_PMIx_Fence (procs=0x0, 
>>>> nprocs=0, info=0x7fff3c561960, 
>>>>     ninfo=1) at src/client/pmix_client_fence.c:100
>>>> #3  0x00007f4bd306e6d2 in pmix1_fence (procs=0x0, collect_data=1) at 
>>>> pmix1_client.c:306
>>>> #4  0x00007f4bd57d5cc3 in ompi_mpi_init (argc=3, argv=0x7fff3c561ea8, 
>>>> requested=3, 
>>>>     provided=0x7fff3c561d84) at runtime/ompi_mpi_init.c:644
>>>> #5  0x00007f4bd5813399 in PMPI_Init_thread (argc=0x7fff3c561d7c, 
>>>> argv=0x7fff3c561d70, required=3, 
>>>>     provided=0x7fff3c561d84) at pinit_thread.c:69
>>>> #6  0x0000000000401516 in main (argc=3, argv=0x7fff3c561ea8) at 
>>>> osu_mbw_mr.c:86
>>>> 
>>>> And another process:
>>>> 
>>>> #0  0x00007f7b9d7d8bdc in recv () from /lib64/libpthread.so.0
>>>> #1  0x00007f7b9b0aa42d in opal_pmix_pmix1xx_pmix_usock_recv_blocking 
>>>> (sd=13, data=0x7ffd62139004 "", 
>>>>     size=4) at src/usock/usock.c:168
>>>> #2  0x00007f7b9b0af5d9 in recv_connect_ack (sd=13) at 
>>>> src/client/pmix_client.c:844
>>>> #3  0x00007f7b9b0b085e in usock_connect (addr=0x7ffd62139330) at 
>>>> src/client/pmix_client.c:1110
>>>> #4  0x00007f7b9b0acc24 in connect_to_server (address=0x7ffd62139330, 
>>>> cbdata=0x7ffd621390e0)
>>>>     at src/client/pmix_client.c:181
>>>> #5  0x00007f7b9b0ad569 in OPAL_PMIX_PMIX1XX_PMIx_Init (proc=0x7f7b9b4e9b60)
>>>>     at src/client/pmix_client.c:362
>>>> #6  0x00007f7b9b2dbd9d in pmix1_client_init () at pmix1_client.c:99
>>>> #7  0x00007f7b9b4eb95f in pmi_component_query (module=0x7ffd62139490, 
>>>> priority=0x7ffd6213948c)
>>>>     at ess_pmi_component.c:90
>>>> #8  0x00007f7b9ce70ec5 in mca_base_select (type_name=0x7f7b9d20e059 "ess", 
>>>> output_id=-1, 
>>>>     components_available=0x7f7b9d431eb0, best_module=0x7ffd621394d0, 
>>>> best_component=0x7ffd621394d8, 
>>>>     priority_out=0x0) at mca_base_components_select.c:77
>>>> #9  0x00007f7b9d1a956b in orte_ess_base_select () at 
>>>> base/ess_base_select.c:40
>>>> #10 0x00007f7b9d160449 in orte_init (pargc=0x0, pargv=0x0, flags=32) at 
>>>> runtime/orte_init.c:219
>>>> #11 0x00007f7b9da4377a in ompi_mpi_init (argc=3, argv=0x7ffd621397f8, 
>>>> requested=3, 
>>>>     provided=0x7ffd621396d4) at runtime/ompi_mpi_init.c:488
>>>> #12 0x00007f7b9da81399 in PMPI_Init_thread (argc=0x7ffd621396cc, 
>>>> argv=0x7ffd621396c0, required=3, 
>>>>     provided=0x7ffd621396d4) at pinit_thread.c:69
>>>> #13 0x0000000000401516 in main (argc=3, argv=0x7ffd621397f8) at 
>>>> osu_mbw_mr.c:86
>>>> 
>>>>   George.
>>>> 
>>>> 
>>>> 
>>>> On Tue, Oct 27, 2015 at 2:36 PM, Ralph Castain <r...@open-mpi.org> wrote:
>>>> I haven’t been able to replicate this when using the branch in this PR:
>>>> 
>>>> https://github.com/open-mpi/ompi/pull/1073
>>>> 
>>>> Would you mind giving it a try? It fixes some other race conditions and 
>>>> might pick this one up too.
>>>> 
>>>> 
>>>>> On Oct 27, 2015, at 10:04 AM, Ralph Castain <r...@open-mpi.org> wrote:
>>>>> 
>>>>> Okay, I’ll take a look - I’ve been chasing a race condition that might be 
>>>>> related
>>>>> 
>>>>>> On Oct 27, 2015, at 9:54 AM, George Bosilca <bosi...@icl.utk.edu> wrote:
>>>>>> 
>>>>>> No, it's using 2 nodes.
>>>>>>   George.
>>>>>> 
>>>>>> 
>>>>>> On Tue, Oct 27, 2015 at 12:35 PM, Ralph Castain <r...@open-mpi.org> wrote:
>>>>>> Is this on a single node?
>>>>>> 
>>>>>>> On Oct 27, 2015, at 9:25 AM, George Bosilca <bosi...@icl.utk.edu> wrote:
>>>>>>> 
>>>>>>> I get intermittent deadlocks with the latest trunk. The smallest 
>>>>>>> reproducer is a shell for loop around a small (2-process), short (20 
>>>>>>> second) MPI application. After a few tens of iterations, MPI_Init 
>>>>>>> will deadlock with the following backtrace:
>>>>>>> 
>>>>>>> #0  0x00007fa94b5d9aed in nanosleep () from /lib64/libc.so.6
>>>>>>> #1  0x00007fa94b60ec94 in usleep () from /lib64/libc.so.6
>>>>>>> #2  0x00007fa94960ba08 in OPAL_PMIX_PMIX1XX_PMIx_Fence (procs=0x0, 
>>>>>>> nprocs=0, info=0x7ffd7934fb90, 
>>>>>>>     ninfo=1) at src/client/pmix_client_fence.c:100
>>>>>>> #3  0x00007fa9498376a2 in pmix1_fence (procs=0x0, collect_data=1) at 
>>>>>>> pmix1_client.c:305
>>>>>>> #4  0x00007fa94bb39ba4 in ompi_mpi_init (argc=3, argv=0x7ffd793500a8, 
>>>>>>> requested=3, 
>>>>>>>     provided=0x7ffd7934ff94) at runtime/ompi_mpi_init.c:645
>>>>>>> #5  0x00007fa94bb77281 in PMPI_Init_thread (argc=0x7ffd7934ff8c, 
>>>>>>> argv=0x7ffd7934ff80, required=3, 
>>>>>>>     provided=0x7ffd7934ff94) at pinit_thread.c:69
>>>>>>> #6  0x000000000040150f in main (argc=3, argv=0x7ffd793500a8) at 
>>>>>>> osu_mbw_mr.c:86
>>>>>>> 
>>>>>>> On my machines this is reproducible at 100% after anywhere between 50 
>>>>>>> and 100 iterations.
>>>>>>> 
>>>>>>>   Thanks,
>>>>>>>     George.
>>>>>>> 