We had a power outage last week and the local disks on our cluster were
wiped out. My tester was on them, but I can rewrite it after SC.

  George.

On Sat, Nov 7, 2015 at 12:04 PM, Ralph Castain <r...@open-mpi.org> wrote:

> Could you send me your stress test? I’m wondering if it is just something
> about how we set socket options
>
>
> On Nov 7, 2015, at 8:58 AM, George Bosilca <bosi...@icl.utk.edu> wrote:
>
> I have to postpone this until after SC. However, I ran a stress test of
> UDS for 3 days, reproducing the opening and sending of data (what Ralph
> described in his email), and I never got a deadlock.
>
>   George.
>
>
> On Sat, Nov 7, 2015 at 11:26 AM, Ralph Castain <r...@open-mpi.org> wrote:
>
>> George was looking into it, but I don’t know if he has had time recently
>> to continue the investigation. We understand “what” is happening (accept
>> sometimes ignores the connection), but we don’t yet know “why”. I’ve done
>> some digging around the web, and found that sometimes you can try to talk
>> to a Unix Domain Socket too quickly - i.e., you open it and then send to
>> it, but the OS hasn’t yet set it up. In those cases, you can hang the
>> socket. However, I’ve tried adding some artificial delay, and while it
>> helped, it didn’t completely solve the problem.
>>
>> I have an idea for a workaround (set a timer and retry after a while), but
>> would obviously prefer a real solution. I’m not even sure it will work as
>> it is unclear that the server (who is the one hung in accept) will break
>> free if the client closes the socket and retries.
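
For reference, the client side of the timer-and-retry idea could look roughly like the sketch below. It is only an illustration: try_handshake, connect_with_retry, the one-byte reply and the timeout values are hypothetical and are not taken from the PMIx code. It also does not answer the open question above, namely whether the server side recovers when the client closes the socket and retries.

```c
#include <errno.h>
#include <poll.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/un.h>
#include <unistd.h>

#ifndef MSG_NOSIGNAL
#define MSG_NOSIGNAL 0   /* not available everywhere; fall back to plain send */
#endif

/* Hypothetical helper: connect to the Unix domain socket at `path`, send
 * `len` bytes of handshake data, then wait up to `timeout_ms` for a
 * one-byte reply. Returns the connected descriptor on success, or -1 if
 * the attempt failed or timed out (the descriptor is closed in that case). */
static int try_handshake(const char *path, const void *msg, size_t len,
                         int timeout_ms)
{
    struct sockaddr_un addr;
    struct pollfd pfd;
    char ack;
    int sd = socket(AF_UNIX, SOCK_STREAM, 0);

    if (sd < 0) {
        return -1;
    }
    memset(&addr, 0, sizeof(addr));
    addr.sun_family = AF_UNIX;
    strncpy(addr.sun_path, path, sizeof(addr.sun_path) - 1);

    if (connect(sd, (struct sockaddr *)&addr, sizeof(addr)) < 0 ||
        send(sd, msg, len, MSG_NOSIGNAL) != (ssize_t)len) {
        close(sd);
        return -1;
    }

    /* Bounded wait for the reply instead of an unbounded blocking recv */
    pfd.fd = sd;
    pfd.events = POLLIN;
    pfd.revents = 0;
    if (poll(&pfd, 1, timeout_ms) == 1 && recv(sd, &ack, 1, 0) == 1) {
        return sd;
    }
    close(sd);
    return -1;
}

/* Hypothetical helper: repeat the handshake up to `max_attempts` times,
 * using a fresh socket for every attempt. */
static int connect_with_retry(const char *path, const void *msg, size_t len,
                              int timeout_ms, int max_attempts)
{
    int i;

    for (i = 0; i < max_attempts; i++) {
        int sd = try_handshake(path, msg, len, timeout_ms);
        if (sd >= 0) {
            return sd;
        }
    }
    return -1;
}
```

Every failed attempt leaves a closed connection in the listener's backlog until the server eventually accepts it, so the number of attempts should stay well below the backlog passed to listen.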
>>
>>
>> On Nov 6, 2015, at 10:53 PM, Artem Polyakov <artpo...@gmail.com> wrote:
>>
>> Hello, is there any progress on this topic? This affects our PMIx
>> measurements.
>>
>> 2015-10-30 21:21 GMT+06:00 Ralph Castain <r...@open-mpi.org>:
>>
>>> I’ve verified that the orte/util/listener thread is not being started,
>>> so I don’t think it should be involved in this problem.
>>>
>>> HTH
>>> Ralph
>>>
>>> On Oct 30, 2015, at 8:07 AM, Ralph Castain <r...@open-mpi.org> wrote:
>>>
>>> Hmmm…there is a hook that would allow the PMIx server to utilize that
>>> listener thread, but we aren’t currently using it. Each daemon plus mpirun
>>> will call orte_start_listener, but nothing is currently registering and so
>>> the listener in that code is supposed to just return without starting the
>>> thread.
>>>
>>> So the only listener thread that should exist is the one inside the PMIx
>>> server itself. If something else is happening, then that would be a bug. I
>>> can look at the orte listener code to ensure that the thread isn’t
>>> incorrectly starting.
>>>
>>>
>>> On Oct 29, 2015, at 10:03 PM, George Bosilca <bosi...@icl.utk.edu>
>>> wrote:
>>>
>>> Some progress, that puzzles me but might help you understand. Once the
>>> deadlock appears, if I manually kill the MPI process on the node where the
>>> deadlock was created, the local orte daemon doesn't notice and will just
>>> keep waiting.
>>>
>>> Quick question: I am under the impression that the issue is not in the
>>> PMIX server but somewhere around the listener_thread_fn in
>>> orte/util/listener.c. Is that possible?
>>>
>>>   George.
>>>
>>>
>>> On Wed, Oct 28, 2015 at 3:56 AM, Ralph Castain <r...@open-mpi.org> wrote:
>>>
>>>> Should have also clarified: the prior fixes are indeed in the current
>>>> master.
>>>>
>>>> On Oct 28, 2015, at 12:42 AM, Ralph Castain <r...@open-mpi.org> wrote:
>>>>
>>>> Nope - I was wrong. The correction on the client side consisted of
>>>> attempting to timeout if the blocking recv failed. We then modified the
>>>> blocking send/recv so they would handle errors.
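
As an illustration of a blocking receive that times out and reports errors rather than hanging, here is a minimal sketch. The name recv_exact and its return convention are hypothetical, and this is not the PMIx usock implementation; it only uses the standard SO_RCVTIMEO socket option and the usual recv return values.

```c
#include <errno.h>
#include <stddef.h>
#include <sys/socket.h>
#include <sys/time.h>
#include <sys/types.h>

/* Hypothetical helper: read exactly `size` bytes from `sd`.
 * Returns 0 on success and -1 on error, on timeout, or if the peer closes
 * the connection first. A `timeout_sec` of 0 means "no timeout". */
static int recv_exact(int sd, void *data, size_t size, int timeout_sec)
{
    struct timeval tv;
    char *ptr = data;
    size_t got = 0;

    tv.tv_sec = timeout_sec;
    tv.tv_usec = 0;
    if (setsockopt(sd, SOL_SOCKET, SO_RCVTIMEO, &tv, sizeof(tv)) < 0) {
        return -1;
    }

    while (got < size) {
        ssize_t n = recv(sd, ptr + got, size - got, 0);
        if (n > 0) {
            got += (size_t)n;       /* partial read: keep going */
        } else if (n == 0) {
            return -1;              /* peer closed the connection */
        } else if (errno == EINTR) {
            continue;               /* interrupted by a signal: retry */
        } else {
            return -1;              /* EAGAIN/EWOULDBLOCK here means timeout */
        }
    }
    return 0;
}
```

A caller such as a connect-ack reader can then treat -1 as "give up on this connection" instead of waiting forever.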
>>>>
>>>> So that problem occurred -after- the server had correctly called
>>>> accept. The listener code is in
>>>> opal/mca/pmix/pmix1xx/pmix/src/server/pmix_server_listener.c
>>>>
>>>> It looks to me like the only way we could drop the accept (assuming the
>>>> OS doesn’t lose it) is if the file descriptor lies outside the expected
>>>> range once we fall out of select:
>>>>
>>>>
>>>>         /* Spin accepting connections until all active listen sockets
>>>>          * do not have any incoming connections, pushing each connection
>>>>          * onto the event queue for processing
>>>>          */
>>>>         do {
>>>>             accepted_connections = 0;
>>>>             /* according to the man pages, select replaces the given descriptor
>>>>              * set with a subset consisting of those descriptors that are ready
>>>>              * for the specified operation - in this case, a read. So we need to
>>>>              * first check to see if this file descriptor is included in the
>>>>              * returned subset
>>>>              */
>>>>             if (0 == FD_ISSET(pmix_server_globals.listen_socket, &readfds)) {
>>>>                 /* this descriptor is not included */
>>>>                 continue;
>>>>             }
>>>>
>>>>             /* this descriptor is ready to be read, which means a connection
>>>>              * request has been received - so harvest it. All we want to do
>>>>              * here is accept the connection and push the info onto the event
>>>>              * library for subsequent processing - we don't want to actually
>>>>              * process the connection here as it takes too long, and so the
>>>>              * OS might start rejecting connections due to timeout.
>>>>              */
>>>>             pending_connection = PMIX_NEW(pmix_pending_connection_t);
>>>>             event_assign(&pending_connection->ev, pmix_globals.evbase, -1,
>>>>                          EV_WRITE, connection_handler, pending_connection);
>>>>             pending_connection->sd = accept(pmix_server_globals.listen_socket,
>>>>                                             (struct sockaddr*)&(pending_connection->addr),
>>>>                                             &addrlen);
>>>>             if (pending_connection->sd < 0) {
>>>>                 PMIX_RELEASE(pending_connection);
>>>>                 if (pmix_socket_errno != EAGAIN ||
>>>>                     pmix_socket_errno != EWOULDBLOCK) {
>>>>                     if (EMFILE == pmix_socket_errno) {
>>>>                         PMIX_ERROR_LOG(PMIX_ERR_OUT_OF_RESOURCE);
>>>>                     } else {
>>>>                         pmix_output(0, "listen_thread: accept() failed: %s (%d).",
>>>>                                     strerror(pmix_socket_errno), pmix_socket_errno);
>>>>                     }
>>>>                     goto done;
>>>>                 }
>>>>                 continue;
>>>>             }
>>>>
>>>>             pmix_output_verbose(8, pmix_globals.debug_output,
>>>>                                 "listen_thread: new connection: (%d, %d)",
>>>>                                 pending_connection->sd, pmix_socket_errno);
>>>>             /* activate the event */
>>>>             event_active(&pending_connection->ev, EV_WRITE, 1);
>>>>             accepted_connections++;
>>>>         } while (accepted_connections > 0);
>>>>
>>>>
>>>> On Oct 28, 2015, at 12:25 AM, Ralph Castain <r...@open-mpi.org> wrote:
>>>>
>>>> Looking at the code, it appears that a fix was committed for this
>>>> problem, and that we correctly resolved the issue found by Paul. The
>>>> problem is that the fix didn’t get upstreamed, and so it was lost the next
>>>> time we refreshed PMIx. Sigh.
>>>>
>>>> Let me try to recreate the fix and have you take a gander at it.
>>>>
>>>>
>>>> On Oct 28, 2015, at 12:22 AM, Ralph Castain <r...@open-mpi.org> wrote:
>>>>
>>>> Here is the discussion - afraid it is fairly lengthy. Ignore the hwloc
>>>> references in it as that was a separate issue:
>>>>
>>>> http://www.open-mpi.org/community/lists/devel/2015/09/18074.php
>>>>
>>>> It definitely sounds like the same issue creeping in again. I’d
>>>> appreciate any thoughts on how to correct it. If it helps, you could look
>>>> at the PMIx master - there are standalone tests in the test/simple
>>>> directory that fork/exec a child and just do the connection.
>>>>
>>>> https://github.com/pmix/master
>>>>
>>>> The test server is simptest.c - it will spawn a single copy of
>>>> simpclient.c by default.
>>>>
>>>>
>>>> On Oct 27, 2015, at 10:14 PM, George Bosilca <bosi...@icl.utk.edu>
>>>> wrote:
>>>>
>>>> Interesting. Do you have a pointer to the commit (and/or to the
>>>> discussion)?
>>>>
>>>> I looked at the PMIX code, and I have identified a few issues, but
>>>> unfortunately none of them seem to fix the problem for good. However, now I
>>>> need more than 1000 runs to get a deadlock (instead of a few tens).
>>>>
>>>> Looking with "netstat -ax" at the status of the UDS while the processes
>>>> are deadlocked, I see 2 UDS with the same name: one from the server, which
>>>> is in LISTEN state, and one for the client, which is in CONNECTING
>>>> state (while the client has already sent a message on the socket and is now
>>>> waiting in a blocking receive). This suggests that the server has
>>>> not yet called accept on the UDS. Unfortunately, there are 3 threads all
>>>> doing different flavors of event_base and select, so I have a hard time
>>>> tracking the path of the UDS on the server side.
>>>>
>>>> So in order to validate my assumption I wrote a minimalistic UDS client
>>>> and server application and tried different scenarios. The conclusion is
>>>> that in order to see the same type of output from "netstat -ax" I have to
>>>> call listen on the server, call connect on the client, and not call accept
>>>> on the server.
>>>>
>>>> At the same time I also confirmed that the UDS hold the data that was
>>>> sent, so there is no need for further synchronization for the case
>>>> where the data is sent first. We only need to find out how the server
>>>> forgets to call accept.
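
The scenario described above (listen on the server, connect and send on the client, accept only afterwards) can be reproduced in a single process. The sketch below is not the original test program, which is not part of this thread; the function name send_before_accept and the socket path passed to it are hypothetical.

```c
#include <string.h>
#include <sys/socket.h>
#include <sys/un.h>
#include <unistd.h>

/* Hypothetical experiment: the client connects and sends `msg` before the
 * server has called accept; the server accepts only afterwards and reads.
 * Returns the number of bytes the server received, or -1 on any failure.
 * `msg` is assumed to be a short string (at most 256 bytes). */
static long send_before_accept(const char *path, const char *msg)
{
    struct sockaddr_un addr;
    char buf[256];
    size_t len = strlen(msg);
    long got = -1;
    int lsd, csd, asd = -1;

    memset(&addr, 0, sizeof(addr));
    addr.sun_family = AF_UNIX;
    strncpy(addr.sun_path, path, sizeof(addr.sun_path) - 1);
    unlink(path);                       /* remove a stale socket file */

    lsd = socket(AF_UNIX, SOCK_STREAM, 0);
    csd = socket(AF_UNIX, SOCK_STREAM, 0);
    if (lsd < 0 || csd < 0) goto out;

    /* Server side: bind and listen, but do not accept yet */
    if (bind(lsd, (struct sockaddr *)&addr, sizeof(addr)) < 0) goto out;
    if (listen(lsd, 8) < 0) goto out;

    /* Client side: connect completes while the request waits in the backlog */
    if (connect(csd, (struct sockaddr *)&addr, sizeof(addr)) < 0) goto out;
    if (send(csd, msg, len, 0) != (ssize_t)len) goto out;

    /* This is the window in which "netstat -ax" can be inspected.
     * The server now accepts late and still finds the data. */
    asd = accept(lsd, NULL, NULL);
    if (asd < 0) goto out;
    got = (long)recv(asd, buf, sizeof(buf), 0);

out:
    if (asd >= 0) close(asd);
    if (csd >= 0) close(csd);
    if (lsd >= 0) close(lsd);
    unlink(path);
    return got;
}
```

If the socket buffers the data as described, a call such as send_before_accept("/tmp/uds_demo.sock", "hello") returns the full message length even though accept ran after the send.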
>>>>
>>>>   George.
>>>>
>>>>
>>>>
>>>> On Tue, Oct 27, 2015 at 7:52 PM, Ralph Castain <r...@open-mpi.org>
>>>> wrote:
>>>>
>>>>> Hmmm…this looks like it might be that problem we previously saw where
>>>>> the blocking recv hangs in a proc when the blocking send tries to send
>>>>> before the domain socket is actually ready, and so the send fails on the
>>>>> other end. As I recall, it was something to do with the socket options -
>>>>> and then Paul had a problem on some of his machines, and we backed it out?
>>>>>
>>>>> I wonder if that’s what is biting us here again, and what we need is
>>>>> to either remove the blocking send/recv’s altogether, or figure out a way
>>>>> to wait until the socket is really ready.
>>>>>
>>>>> Any thoughts?
>>>>>
>>>>>
>>>>> On Oct 27, 2015, at 4:11 PM, George Bosilca <bosi...@icl.utk.edu>
>>>>> wrote:
>>>>>
>>>>> It appears the branch solves the problem at least partially. I asked one
>>>>> of my students to hammer it pretty badly, and he reported that the
>>>>> deadlocks still occur. He also graciously provided some stacktraces:
>>>>>
>>>>> #0  0x00007f4bd5274aed in nanosleep () from /lib64/libc.so.6
>>>>> #1  0x00007f4bd52a9c94 in usleep () from /lib64/libc.so.6
>>>>> #2  0x00007f4bd2e42b00 in OPAL_PMIX_PMIX1XX_PMIx_Fence (procs=0x0, nprocs=0, info=0x7fff3c561960, ninfo=1) at src/client/pmix_client_fence.c:100
>>>>> #3  0x00007f4bd306e6d2 in pmix1_fence (procs=0x0, collect_data=1) at pmix1_client.c:306
>>>>> #4  0x00007f4bd57d5cc3 in ompi_mpi_init (argc=3, argv=0x7fff3c561ea8, requested=3, provided=0x7fff3c561d84) at runtime/ompi_mpi_init.c:644
>>>>> #5  0x00007f4bd5813399 in PMPI_Init_thread (argc=0x7fff3c561d7c, argv=0x7fff3c561d70, required=3, provided=0x7fff3c561d84) at pinit_thread.c:69
>>>>> #6  0x0000000000401516 in main (argc=3, argv=0x7fff3c561ea8) at osu_mbw_mr.c:86
>>>>>
>>>>> And another process:
>>>>>
>>>>> #0  0x00007f7b9d7d8bdc in recv () from /lib64/libpthread.so.0
>>>>> #1  0x00007f7b9b0aa42d in opal_pmix_pmix1xx_pmix_usock_recv_blocking (sd=13, data=0x7ffd62139004 "", size=4) at src/usock/usock.c:168
>>>>> #2  0x00007f7b9b0af5d9 in recv_connect_ack (sd=13) at src/client/pmix_client.c:844
>>>>> #3  0x00007f7b9b0b085e in usock_connect (addr=0x7ffd62139330) at src/client/pmix_client.c:1110
>>>>> #4  0x00007f7b9b0acc24 in connect_to_server (address=0x7ffd62139330, cbdata=0x7ffd621390e0) at src/client/pmix_client.c:181
>>>>> #5  0x00007f7b9b0ad569 in OPAL_PMIX_PMIX1XX_PMIx_Init (proc=0x7f7b9b4e9b60) at src/client/pmix_client.c:362
>>>>> #6  0x00007f7b9b2dbd9d in pmix1_client_init () at pmix1_client.c:99
>>>>> #7  0x00007f7b9b4eb95f in pmi_component_query (module=0x7ffd62139490, priority=0x7ffd6213948c) at ess_pmi_component.c:90
>>>>> #8  0x00007f7b9ce70ec5 in mca_base_select (type_name=0x7f7b9d20e059 "ess", output_id=-1, components_available=0x7f7b9d431eb0, best_module=0x7ffd621394d0, best_component=0x7ffd621394d8, priority_out=0x0) at mca_base_components_select.c:77
>>>>> #9  0x00007f7b9d1a956b in orte_ess_base_select () at base/ess_base_select.c:40
>>>>> #10 0x00007f7b9d160449 in orte_init (pargc=0x0, pargv=0x0, flags=32) at runtime/orte_init.c:219
>>>>> #11 0x00007f7b9da4377a in ompi_mpi_init (argc=3, argv=0x7ffd621397f8, requested=3, provided=0x7ffd621396d4) at runtime/ompi_mpi_init.c:488
>>>>> #12 0x00007f7b9da81399 in PMPI_Init_thread (argc=0x7ffd621396cc, argv=0x7ffd621396c0, required=3, provided=0x7ffd621396d4) at pinit_thread.c:69
>>>>> #13 0x0000000000401516 in main (argc=3, argv=0x7ffd621397f8) at osu_mbw_mr.c:86
>>>>>
>>>>>   George.
>>>>>
>>>>>
>>>>>
>>>>> On Tue, Oct 27, 2015 at 2:36 PM, Ralph Castain <r...@open-mpi.org>
>>>>> wrote:
>>>>>
>>>>>> I haven’t been able to replicate this when using the branch in this
>>>>>> PR:
>>>>>>
>>>>>> https://github.com/open-mpi/ompi/pull/1073
>>>>>>
>>>>>> Would you mind giving it a try? It fixes some other race conditions
>>>>>> and might pick this one up too.
>>>>>>
>>>>>>
>>>>>> On Oct 27, 2015, at 10:04 AM, Ralph Castain <r...@open-mpi.org> wrote:
>>>>>>
>>>>>> Okay, I’ll take a look - I’ve been chasing a race condition that
>>>>>> might be related
>>>>>>
>>>>>> On Oct 27, 2015, at 9:54 AM, George Bosilca <bosi...@icl.utk.edu>
>>>>>> wrote:
>>>>>>
>>>>>> No, it's using 2 nodes.
>>>>>>   George.
>>>>>>
>>>>>>
>>>>>> On Tue, Oct 27, 2015 at 12:35 PM, Ralph Castain <r...@open-mpi.org>
>>>>>> wrote:
>>>>>>
>>>>>>> Is this on a single node?
>>>>>>>
>>>>>>> On Oct 27, 2015, at 9:25 AM, George Bosilca <bosi...@icl.utk.edu>
>>>>>>> wrote:
>>>>>>>
>>>>>>> I get intermittent deadlocks with the latest trunk. The smallest
>>>>>>> reproducer is a shell for loop around a small (2 processes) short (20
>>>>>>> seconds) MPI application. After a few tens of iterations the MPI_Init will
>>>>>>> deadlock with the following backtrace:
>>>>>>>
>>>>>>> #0  0x00007fa94b5d9aed in nanosleep () from /lib64/libc.so.6
>>>>>>> #1  0x00007fa94b60ec94 in usleep () from /lib64/libc.so.6
>>>>>>> #2  0x00007fa94960ba08 in OPAL_PMIX_PMIX1XX_PMIx_Fence (procs=0x0, nprocs=0, info=0x7ffd7934fb90, ninfo=1) at src/client/pmix_client_fence.c:100
>>>>>>> #3  0x00007fa9498376a2 in pmix1_fence (procs=0x0, collect_data=1) at pmix1_client.c:305
>>>>>>> #4  0x00007fa94bb39ba4 in ompi_mpi_init (argc=3, argv=0x7ffd793500a8, requested=3, provided=0x7ffd7934ff94) at runtime/ompi_mpi_init.c:645
>>>>>>> #5  0x00007fa94bb77281 in PMPI_Init_thread (argc=0x7ffd7934ff8c, argv=0x7ffd7934ff80, required=3, provided=0x7ffd7934ff94) at pinit_thread.c:69
>>>>>>> #6  0x000000000040150f in main (argc=3, argv=0x7ffd793500a8) at osu_mbw_mr.c:86
>>>>>>>
>>>>>>> On my machines this reproduces 100% of the time, after anywhere
>>>>>>> between 50 and 100 iterations.
>>>>>>>
>>>>>>>   Thanks,
>>>>>>>     George.
>>>>>>>
>>>>>>> _______________________________________________
>>>>>>> devel mailing list
>>>>>>> de...@open-mpi.org
>>>>>>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
>>>>>>> Link to this post:
>>>>>>> http://www.open-mpi.org/community/lists/devel/2015/10/18280.php
>> --
>> С Уважением, Поляков Артем Юрьевич
>> Best regards, Artem Y. Polyakov