Ralph,

yes and no ...

an MPI hello world run on four nodes can be used to reproduce the issue.
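
for reference, the reproducer is nothing more than a standard MPI hello world,
sketched below (the mpirun invocation in the comment is only an example; the
exact MCA option syntax to select the grpcomm/rcd module depends on your Open
MPI version):

    /* hello_mpi.c -- standard MPI hello world
     * build: mpicc -o hello_mpi hello_mpi.c
     * run  : mpirun -np 4 --map-by node --mca grpcomm rcd ./hello_mpi
     *        (example invocation only: 4 processes spread over 4 nodes)
     */
    #include <stdio.h>
    #include <mpi.h>

    int main(int argc, char *argv[])
    {
        int rank, size;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);
        printf("Hello world from rank %d of %d\n", rank, size);
        MPI_Finalize();
        return 0;
    }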


you can increase the likelihood of hitting the race condition by hacking
./opal/mca/event/libevent2021/libevent/poll.c
and replacing
        i = random() % nfds;
with
        if (nfds < 2) {
            i = 0;
        } else {
            i = nfds - 2;
        }
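
to see why pinning the starting index helps, here is a small standalone sketch
of the fd-scan pattern used by poll_dispatch (the wrap-around loop shape is
paraphrased from libevent 2.0.x, not copied verbatim): with a random start, the
first ready fd to be serviced is random, whereas the hack always starts the
scan at the last registered fd, which is the ordering that (per this thread)
exposes the race.

    /* scan_order.c -- standalone illustration of the fd-scan order used by
     * libevent's poll_dispatch(): it walks the pollfd array starting one
     * past a chosen index (loop shape paraphrased, not the library source). */
    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    static void show_scan(int nfds, int i, const char *label)
    {
        printf("%s: service order =", label);
        for (int j = 0; j < nfds; j++) {
            if (++i == nfds)     /* same wrap-around walk as poll_dispatch */
                i = 0;
            printf(" fd[%d]", i);
        }
        printf("\n");
    }

    int main(void)
    {
        const int nfds = 4;      /* pretend 4 fds are ready */

        srandom((unsigned)time(NULL));
        show_scan(nfds, (int)(random() % nfds), "stock libevent (random start)");
        show_scan(nfds, nfds < 2 ? 0 : nfds - 2, "hacked poll.c (pinned start)");
        return 0;
    }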

but since this is really a race condition, all I could do is show you
how to use a debugger in order to force it.


here is what really happens:
- thanks to your patch, when vpid 2 cannot read the connect ack, it is
no longer a fatal error.
- that being said, peer->recv_event is not removed from libevent
- later, send_event is added to libevent
- and then peer->recv_event is added to libevent again
/* this is clearly not supported, and the interesting behaviour is that
peer->send_event gets kicked out of libevent (!) */
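
the safe pattern is to delete both events from libevent before the retry
re-adds them. here is a minimal sketch of that pattern, mirroring the hunk in
the attached patch (the helper name reset_peer_events is hypothetical; the
fields are the oob/tcp peer fields discussed above):

    /* hypothetical helper sketching the pattern applied by the attached patch */
    static void reset_peer_events(mca_oob_tcp_peer_t *peer)
    {
        /* make libevent forget both events before they are re-added during
         * the retry; re-adding recv_event while its old registration is
         * still live is what knocked send_event out of libevent */
        if (peer->recv_ev_active) {
            opal_event_del(&peer->recv_event);
            peer->recv_ev_active = false;
        }
        if (peer->send_ev_active) {
            opal_event_del(&peer->send_event);
            peer->send_ev_active = false;
        }
    }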

The attached patch fixes this race condition; could you please review it?

Cheers,

Gilles

On 2014/09/17 22:17, Ralph Castain wrote:
> Do you have a reproducer you can share for testing this? I'm unable to get it 
> to happen on my machine, but maybe you have a test code that triggers it so I 
> can continue debugging
>
> Ralph
>
> On Sep 17, 2014, at 4:07 AM, Gilles Gouaillardet 
> <gilles.gouaillar...@iferc.org> wrote:
>
>> Thanks Ralph,
>>
>> this is much better, but there is still a bug:
>> with the very same scenario I described earlier, vpid 2 does not send
>> its message to vpid 3 once the connection has been established.
>>
>> I tried to debug it, but I have been pretty unsuccessful so far ...
>>
>> vpid 2 calls tcp_peer_connected and executes the following snippet:
>>
>>    if (NULL != peer->send_msg && !peer->send_ev_active) {
>>        opal_event_add(&peer->send_event, 0);
>>        peer->send_ev_active = true;
>>    }
>>
>> but when evmap_io_active is invoked later, the following part:
>>
>>    TAILQ_FOREACH(ev, &ctx->events, ev_io_next) {
>>        if (ev->ev_events & events)
>>            event_active_nolock(ev, ev->ev_events & events, 1);
>>    }
>>
>> finds only one ev (mca_oob_tcp_recv_handler and *no*
>> mca_oob_tcp_send_handler)
>>
>> I will resume my investigations tomorrow.
>>
>> Cheers,
>>
>> Gilles
>>
>> On 2014/09/17 4:01, Ralph Castain wrote:
>>> Hi Gilles
>>>
>>> I took a crack at solving this in r32744 - CMRd it for 1.8.3 and assigned 
>>> it to you for review. Give it a try and let me know if I (hopefully) got it.
>>>
>>> The approach we have used in the past is to have both sides close their 
>>> connections, and then have the higher vpid retry while the lower one waits. 
>>> The logic for that was still in place, but it looks like you are hitting a 
>>> different code path, and I found another potential one as well. So I think 
>>> I plugged the holes, but will wait to hear if you confirm.
>>>
>>> Thanks
>>> Ralph
>>>
>>> On Sep 16, 2014, at 6:27 AM, Gilles Gouaillardet 
>>> <gilles.gouaillar...@gmail.com> wrote:
>>>
>>>> Ralph,
>>>>
>>>> here is the full description of a race condition in oob/tcp that I very
>>>> briefly mentioned in a previous post:
>>>>
>>>> the race condition can occur when two not-yet-connected orted daemons try to
>>>> send a message to each other for the first time, at the same time.
>>>>
>>>> that can occur when running an MPI hello world on 4 nodes with the
>>>> grpcomm/rcd module.
>>>>
>>>> here is a scenario in which the race condition occurs:
>>>>
>>>> orted vpid 2 and 3 enter the allgather
>>>> /* they are not yet oob/tcp connected */
>>>> and they each call orte.send_buffer_nb toward the other.
>>>> from a libevent point of view, vpid 2 and 3 will call
>>>> mca_oob_tcp_peer_try_connect
>>>>
>>>> vpid 2 calls mca_oob_tcp_send_handler
>>>>
>>>> vpid 3 calls connection_event_handler
>>>>
>>>> depending on the value returned by random() in libevent, vpid 3 will
>>>> either call mca_oob_tcp_send_handler (likely) or recv_handler (unlikely).
>>>> if vpid 3 calls recv_handler, it will close the two sockets to vpid 2.
>>>>
>>>> then vpid 2 will call mca_oob_tcp_recv_handler
>>>> (peer->state is MCA_OOB_TCP_CONNECT_ACK),
>>>> which will invoke mca_oob_tcp_recv_connect_ack.
>>>> tcp_peer_recv_blocking will fail
>>>> /* zero bytes are recv'ed since vpid 3 previously closed the socket before
>>>> writing a header */
>>>> and this is handled by mca_oob_tcp_recv_handler as a fatal error
>>>> /* ORTE_FORCED_TERMINATE(1) */.
>>>>
>>>> could you please have a look at it?
>>>>
>>>> if you are too busy, could you please advise where this scenario should be
>>>> handled differently?
>>>> - should vpid 3 keep one socket instead of closing both and retrying?
>>>> - should vpid 2 handle the failure as a non-fatal error?
>>>>
>>>> Cheers,
>>>>
>>>> Gilles

Index: orte/mca/oob/tcp/oob_tcp_connection.c
===================================================================
--- orte/mca/oob/tcp/oob_tcp_connection.c       (revision 32752)
+++ orte/mca/oob/tcp/oob_tcp_connection.c       (working copy)
@@ -14,6 +14,8 @@
  * Copyright (c) 2009-2014 Cisco Systems, Inc.  All rights reserved.
  * Copyright (c) 2011      Oak Ridge National Labs.  All rights reserved.
  * Copyright (c) 2013-2014 Intel, Inc.  All rights reserved.
+ * Copyright (c) 2014      Research Organization for Information Science
+ *                         and Technology (RIST). All rights reserved.
  * $COPYRIGHT$
  * 
  * Additional copyrights may follow
@@ -867,6 +869,16 @@
         peer->active_addr->state = MCA_OOB_TCP_CLOSED;
     }

+    /* unregister active events */
+    if (peer->recv_ev_active) {
+        opal_event_del(&peer->recv_event);
+        peer->recv_ev_active = false;
+    }
+    if (peer->send_ev_active) {
+        opal_event_del(&peer->send_event);
+        peer->send_ev_active = false;
+    }
+
     /* inform the component-level that we have lost a connection so
      * it can decide what to do about it.
      */
