Ralph,

yes and no ...
An MPI hello world run on four nodes can be used to reproduce the issue. You can
increase the likelihood of hitting the race condition by hacking
./opal/mca/event/libevent2021/libevent/poll.c and replacing

    i = random() % nfds;

with

    if (nfds < 2) { i = 0; } else { i = nfds - 2; }

but since this is really a race condition, all I could do is show you how to
use a debugger in order to force it.

Here is what really happens:
- thanks to your patch, when vpid 2 cannot read the acknowledgment, this is no
  longer a fatal error.
- that being said, peer->recv_event is not removed from libevent.
- later, peer->send_event will be added to libevent.
- and then peer->recv_event will be added to libevent again.
  /* this is clearly not supported, and the interesting behaviour is that
     peer->send_event will be kicked out of libevent (!) */
  (The same sequence is spelled out schematically just before the patch at the
  end of this mail.)

The attached patch fixes this race condition, could you please review it?

Cheers,

Gilles
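P.S. For completeness, the reproducer does not need to be anything fancier than
a plain MPI hello world along these lines (any equivalent program should do,
this is just a sketch):

    #include <stdio.h>
    #include <mpi.h>

    int main(int argc, char *argv[])
    {
        int rank, size, len;
        char name[MPI_MAX_PROCESSOR_NAME];

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);
        MPI_Get_processor_name(name, &len);

        /* one line per rank, so we can see which daemons took part */
        printf("hello from rank %d of %d on %s\n", rank, size, name);

        MPI_Finalize();
        return 0;
    }

I run it across four nodes with the grpcomm/rcd module, with something along
the lines of (hostnames and binary name are placeholders, of course):

    mpirun -np 4 -host node0,node1,node2,node3 --mca grpcomm rcd ./hello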
On 2014/09/17 22:17, Ralph Castain wrote:
> Do you have a reproducer you can share for testing this? I'm unable to get it
> to happen on my machine, but maybe you have a test code that triggers it so I
> can continue debugging.
>
> Ralph
>
> On Sep 17, 2014, at 4:07 AM, Gilles Gouaillardet
> <gilles.gouaillar...@iferc.org> wrote:
>
>> Thanks Ralph,
>>
>> this is much better but there is still a bug:
>> with the very same scenario I described earlier, vpid 2 does not send
>> its message to vpid 3 once the connection has been established.
>>
>> I tried to debug it but I have been pretty unsuccessful so far ...
>>
>> vpid 2 calls tcp_peer_connected and executes the following snippet:
>>
>> if (NULL != peer->send_msg && !peer->send_ev_active) {
>>     opal_event_add(&peer->send_event, 0);
>>     peer->send_ev_active = true;
>> }
>>
>> but when evmap_io_active is invoked later, the following part:
>>
>> TAILQ_FOREACH(ev, &ctx->events, ev_io_next) {
>>     if (ev->ev_events & events)
>>         event_active_nolock(ev, ev->ev_events & events, 1);
>> }
>>
>> finds only one ev (mca_oob_tcp_recv_handler and *no*
>> mca_oob_tcp_send_handler).
>>
>> I will resume my investigations tomorrow.
>>
>> Cheers,
>>
>> Gilles
>>
>> On 2014/09/17 4:01, Ralph Castain wrote:
>>> Hi Gilles
>>>
>>> I took a crack at solving this in r32744 - CMRd it for 1.8.3 and assigned
>>> it to you for review. Give it a try and let me know if I (hopefully) got it.
>>>
>>> The approach we have used in the past is to have both sides close their
>>> connections, and then have the higher vpid retry while the lower one waits.
>>> The logic for that was still in place, but it looks like you are hitting a
>>> different code path, and I found another potential one as well. So I think
>>> I plugged the holes, but will wait to hear if you confirm.
>>>
>>> Thanks
>>> Ralph
>>>
>>> On Sep 16, 2014, at 6:27 AM, Gilles Gouaillardet
>>> <gilles.gouaillar...@gmail.com> wrote:
>>>
>>>> Ralph,
>>>>
>>>> here is the full description of a race condition in oob/tcp I very briefly
>>>> mentioned in a previous post.
>>>>
>>>> The race condition can occur when two not-yet-connected orted processes
>>>> try to send a message to each other for the first time and at the same
>>>> time.
>>>>
>>>> That can occur when running an MPI hello world on 4 nodes with the
>>>> grpcomm/rcd module.
>>>>
>>>> Here is a scenario in which the race condition occurs:
>>>>
>>>> orted vpid 2 and 3 enter the allgather
>>>> /* they are not yet oob/tcp connected */
>>>> and they call orte.send_buffer_nb to each other.
>>>>
>>>> From a libevent point of view, vpid 2 and 3 will call
>>>> mca_oob_tcp_peer_try_connect.
>>>>
>>>> vpid 2 calls mca_oob_tcp_send_handler.
>>>>
>>>> vpid 3 calls connection_event_handler.
>>>>
>>>> Depending on the value returned by random() in libevent, vpid 3 will
>>>> either call mca_oob_tcp_send_handler (likely) or recv_handler (unlikely).
>>>> If vpid 3 calls recv_handler, it will close the two sockets to vpid 2.
>>>>
>>>> Then vpid 2 will call mca_oob_tcp_recv_handler
>>>> (peer->state is MCA_OOB_TCP_CONNECT_ACK),
>>>> which will invoke mca_oob_tcp_recv_connect_ack,
>>>> and tcp_peer_recv_blocking will fail
>>>> /* zero bytes are recv'ed since vpid 3 previously closed the socket before
>>>> writing a header */
>>>> and this is handled by mca_oob_tcp_recv_handler as a fatal error
>>>> /* ORTE_FORCED_TERMINATE(1) */
>>>>
>>>> Could you please have a look at it?
>>>>
>>>> If you are too busy, could you please advise where this scenario should be
>>>> handled differently?
>>>> - should vpid 3 keep one socket instead of closing both and retrying?
>>>> - should vpid 2 handle the failure as a non-fatal error?
>>>>
>>>> Cheers,
>>>>
>>>> Gilles
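To tie the sequence described at the top of this mail to the patch:
schematically (this is NOT the literal code path, the call sites and arguments
are simplified), vpid 2 ends up doing the equivalent of

    /* the connect ack could not be read; with your patch this is no longer
     * fatal and the connection will be retried, BUT the old events are left
     * registered, i.e. there is no
     *     opal_event_del(&peer->recv_event);
     * before the retry */

    /* on the retry, the send side is armed ... */
    opal_event_add(&peer->send_event, 0);

    /* ... and recv_event is added again while its previous registration is
     * still active, which, as noted above, is not supported -- the observed
     * effect is that send_event is kicked out, so the queued message to
     * vpid 3 is never sent */
    opal_event_add(&peer->recv_event, 0);

so the patch below simply deletes both events (when they are active) once the
connection is dropped, before any retry takes place.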
Index: orte/mca/oob/tcp/oob_tcp_connection.c
===================================================================
--- orte/mca/oob/tcp/oob_tcp_connection.c	(revision 32752)
+++ orte/mca/oob/tcp/oob_tcp_connection.c	(working copy)
@@ -14,6 +14,8 @@
  * Copyright (c) 2009-2014 Cisco Systems, Inc.  All rights reserved.
  * Copyright (c) 2011      Oak Ridge National Labs.  All rights reserved.
  * Copyright (c) 2013-2014 Intel, Inc. All rights reserved.
+ * Copyright (c) 2014      Research Organization for Information Science
+ *                         and Technology (RIST). All rights reserved.
  * $COPYRIGHT$
  *
  * Additional copyrights may follow
@@ -867,6 +869,16 @@
         peer->active_addr->state = MCA_OOB_TCP_CLOSED;
     }
 
+    /* unregister active events */
+    if (peer->recv_ev_active) {
+        opal_event_del(&peer->recv_event);
+        peer->recv_ev_active = false;
+    }
+    if (peer->send_ev_active) {
+        opal_event_del(&peer->send_event);
+        peer->send_ev_active = false;
+    }
+
     /* inform the component-level that we have lost a connection so
      * it can decide what to do about it.
      */