Ralph,

yes and no ...

an MPI hello world run on four nodes can be used to reproduce the issue.
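
for reference, the reproducer is nothing more than a standard MPI hello world,
sketched below (the mpirun invocation in the comment is only an example; the
exact MCA option syntax to select the grpcomm/rcd module depends on your Open
MPI version):

    /* hello_mpi.c -- standard MPI hello world
     * build: mpicc -o hello_mpi hello_mpi.c
     * run  : mpirun -np 4 --map-by node --mca grpcomm rcd ./hello_mpi
     *        (example invocation only: 4 processes spread over 4 nodes)
     */
    #include <stdio.h>
    #include <mpi.h>

    int main(int argc, char *argv[])
    {
        int rank, size;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);
        printf("Hello world from rank %d of %d\n", rank, size);
        MPI_Finalize();
        return 0;
    }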


you can increase the likelihood of hitting the race condition by hacking
./opal/mca/event/libevent2021/libevent/poll.c
and replacing
        i = random() % nfds;
with
        if (nfds < 2) {
            i = 0;
        } else {
            i = nfds - 2;
        }
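
to see why pinning the starting index helps, here is a small standalone sketch
of the fd-scan pattern used by poll_dispatch (the wrap-around loop shape is
paraphrased from libevent 2.0.x, not copied verbatim): with a random start, the
first ready fd to be serviced is random, whereas the hack always starts the
scan at the last registered fd, which is the ordering that (per this thread)
exposes the race.

    /* scan_order.c -- standalone illustration of the fd-scan order used by
     * libevent's poll_dispatch(): it walks the pollfd array starting one
     * past a chosen index (loop shape paraphrased, not the library source). */
    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    static void show_scan(int nfds, int i, const char *label)
    {
        printf("%s: service order =", label);
        for (int j = 0; j < nfds; j++) {
            if (++i == nfds)     /* same wrap-around walk as poll_dispatch */
                i = 0;
            printf(" fd[%d]", i);
        }
        printf("\n");
    }

    int main(void)
    {
        const int nfds = 4;      /* pretend 4 fds are ready */

        srandom((unsigned)time(NULL));
        show_scan(nfds, (int)(random() % nfds), "stock libevent (random start)");
        show_scan(nfds, nfds < 2 ? 0 : nfds - 2, "hacked poll.c (pinned start)");
        return 0;
    }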

but since this is really a race condition, all I could do is show you
how to use a debugger in order to force it.


here is what really happens:
- thanks to your patch, when vpid 2 cannot read the connect ack, it is
no longer a fatal error.
- that being said, peer->recv_event is not removed from libevent
- later, send_event is added to libevent
- and then peer->recv_event is added to libevent again
/* this is clearly not supported, and the interesting behaviour is that
peer->send_event gets kicked out of libevent (!) */
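
the safe pattern is to delete both events from libevent before the retry
re-adds them. here is a minimal sketch of that pattern, mirroring the hunk in
the attached patch (the helper name reset_peer_events is hypothetical; the
fields are the oob/tcp peer fields discussed above):

    /* hypothetical helper sketching the pattern applied by the attached patch */
    static void reset_peer_events(mca_oob_tcp_peer_t *peer)
    {
        /* make libevent forget both events before they are re-added during
         * the retry; re-adding recv_event while its old registration is
         * still live is what knocked send_event out of libevent */
        if (peer->recv_ev_active) {
            opal_event_del(&peer->recv_event);
            peer->recv_ev_active = false;
        }
        if (peer->send_ev_active) {
            opal_event_del(&peer->send_event);
            peer->send_ev_active = false;
        }
    }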

The attached patch fixes this race condition; could you please review it?

Cheers,

Gilles

On 2014/09/17 22:17, Ralph Castain wrote:
> Do you have a reproducer you can share for testing this? I'm unable to get it 
> to happen on my machine, but maybe you have a test code that triggers it so I 
> can continue debugging
>
> Ralph
>
> On Sep 17, 2014, at 4:07 AM, Gilles Gouaillardet 
> <gilles.gouaillar...@iferc.org> wrote:
>
>> Thanks Ralph,
>>
>> this is much better, but there is still a bug:
>> with the very same scenario I described earlier, vpid 2 does not send
>> its message to vpid 3 once the connection has been established.
>>
>> I tried to debug it, but I have been pretty unsuccessful so far ...
>>
>> vpid 2 calls tcp_peer_connected and executes the following snippet:
>>
>>    if (NULL != peer->send_msg && !peer->send_ev_active) {
>>        opal_event_add(&peer->send_event, 0);
>>        peer->send_ev_active = true;
>>    }
>>
>> but when evmap_io_active is invoked later, the following part:
>>
>>    TAILQ_FOREACH(ev, &ctx->events, ev_io_next) {
>>        if (ev->ev_events & events)
>>            event_active_nolock(ev, ev->ev_events & events, 1);
>>    }
>>
>> finds only one ev (mca_oob_tcp_recv_handler and *no*
>> mca_oob_tcp_send_handler)
>>
>> I will resume my investigations tomorrow.
>>
>> Cheers,
>>
>> Gilles
>>
>> On 2014/09/17 4:01, Ralph Castain wrote:
>>> Hi Gilles
>>>
>>> I took a crack at solving this in r32744 - CMRd it for 1.8.3 and assigned 
>>> it to you for review. Give it a try and let me know if I (hopefully) got it.
>>>
>>> The approach we have used in the past is to have both sides close their 
>>> connections, and then have the higher vpid retry while the lower one waits. 
>>> The logic for that was still in place, but it looks like you are hitting a 
>>> different code path, and I found another potential one as well. So I think 
>>> I plugged the holes, but will wait to hear if you confirm.
>>>
>>> Thanks
>>> Ralph
>>>
>>> On Sep 16, 2014, at 6:27 AM, Gilles Gouaillardet 
>>> <gilles.gouaillar...@gmail.com> wrote:
>>>
>>>> Ralph,
>>>>
>>>> here is the full description of a race condition in oob/tcp that I very
>>>> briefly mentioned in a previous post:
>>>>
>>>> the race condition can occur when two not-yet-connected orted daemons try to
>>>> send a message to each other for the first time, at the same time.
>>>>
>>>> that can occur when running an MPI hello world on 4 nodes with the
>>>> grpcomm/rcd module.
>>>>
>>>> here is a scenario in which the race condition occurs:
>>>>
>>>> orted vpid 2 and 3 enter the allgather
>>>> /* they are not yet oob/tcp connected */
>>>> and they each call orte.send_buffer_nb toward the other.
>>>> from a libevent point of view, vpid 2 and 3 will call
>>>> mca_oob_tcp_peer_try_connect
>>>>
>>>> vpid 2 calls mca_oob_tcp_send_handler
>>>>
>>>> vpid 3 calls connection_event_handler
>>>>
>>>> depending on the value returned by random() in libevent, vpid 3 will
>>>> either call mca_oob_tcp_send_handler (likely) or recv_handler (unlikely).
>>>> if vpid 3 calls recv_handler, it will close the two sockets to vpid 2.
>>>>
>>>> then vpid 2 will call mca_oob_tcp_recv_handler
>>>> (peer->state is MCA_OOB_TCP_CONNECT_ACK),
>>>> which will invoke mca_oob_tcp_recv_connect_ack.
>>>> tcp_peer_recv_blocking will fail
>>>> /* zero bytes are recv'ed since vpid 3 previously closed the socket before
>>>> writing a header */
>>>> and this is handled by mca_oob_tcp_recv_handler as a fatal error
>>>> /* ORTE_FORCED_TERMINATE(1) */.
>>>>
>>>> could you please have a look at it?
>>>>
>>>> if you are too busy, could you please advise where this scenario should be
>>>> handled differently?
>>>> - should vpid 3 keep one socket instead of closing both and retrying?
>>>> - should vpid 2 handle the failure as a non-fatal error?
>>>>
>>>> Cheers,
>>>>
>>>> Gilles

Index: orte/mca/oob/tcp/oob_tcp_connection.c
===================================================================
--- orte/mca/oob/tcp/oob_tcp_connection.c       (revision 32752)
+++ orte/mca/oob/tcp/oob_tcp_connection.c       (working copy)
@@ -14,6 +14,8 @@
  * Copyright (c) 2009-2014 Cisco Systems, Inc.  All rights reserved.
  * Copyright (c) 2011      Oak Ridge National Labs.  All rights reserved.
  * Copyright (c) 2013-2014 Intel, Inc.  All rights reserved.
+ * Copyright (c) 2014      Research Organization for Information Science
+ *                         and Technology (RIST). All rights reserved.
  * $COPYRIGHT$
  * 
  * Additional copyrights may follow
@@ -867,6 +869,16 @@
         peer->active_addr->state = MCA_OOB_TCP_CLOSED;
     }

+    /* unregister active events */
+    if (peer->recv_ev_active) {
+        opal_event_del(&peer->recv_event);
+        peer->recv_ev_active = false;
+    }
+    if (peer->send_ev_active) {
+        opal_event_del(&peer->send_event);
+        peer->send_ev_active = false;
+    }
+
     /* inform the component-level that we have lost a connection so
      * it can decide what to do about it.
      */
