Ralph,

I found another race condition. In a very specific scenario, vpid 3 is in the MCA_OOB_TCP_CLOSED state and processes data received on the socket from vpid 2. vpid 3 is in the MCA_OOB_TCP_CLOSED state because vpid 2 called retry() and closed both of its sockets to vpid 3.

vpid 3 reads the ack data that was sent to the socket (ok) and then ends up calling tcp_peer_send_blocking :

    main (orted.c:62)
    orte_daemon (orted_main.c:828)
    opal_libevent2021_event_base_loop (event.c:1645)
    event_process_active (event.c:1437)
    event_process_active_single_queue (event.c:1367)
    recv_handler (oob_tcp.c:599)
    mca_oob_tcp_peer_accept (oob_tcp_connection.c:1071)
    tcp_peer_send_connect_ack (oob_tcp_connection.c:384)
    tcp_peer_send_blocking (oob_tcp_connection.c:525)

Even though the socket (fd 17 in my case) has been closed by the peer, and is hence reported in the CLOSE_WAIT state by lsof, send(17, ...) succeeds (!!!).

I thought the root cause was that we had previously set the O_NONBLOCK flag on this socket, so I explicitly cleared this flag (it was not set anyway...) before invoking mca_oob_tcp_peer_accept, but I got the very same behaviour :-(
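The same behaviour can in fact be reproduced outside Open MPI with plain sockets : the first send() issued after the peer's close() is accepted by the local kernel while the socket sits in CLOSE_WAIT, and only a later send() fails once the peer has answered with a RST. Here is a minimal standalone sketch (not oob/tcp code; the loopback port is arbitrary and error checking is omitted) :

    /* demo: send() on a CLOSE_WAIT socket succeeds, only a later send() fails */
    #include <arpa/inet.h>
    #include <errno.h>
    #include <netinet/in.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/socket.h>
    #include <unistd.h>

    int main(void)
    {
        struct sockaddr_in sin;
        memset(&sin, 0, sizeof(sin));
        sin.sin_family = AF_INET;
        sin.sin_port = htons(5555);
        sin.sin_addr.s_addr = htonl(INADDR_LOOPBACK);

        int srv = socket(AF_INET, SOCK_STREAM, 0);
        int one = 1;
        setsockopt(srv, SOL_SOCKET, SO_REUSEADDR, &one, sizeof(one));
        bind(srv, (struct sockaddr *)&sin, sizeof(sin));
        listen(srv, 1);

        int cli = socket(AF_INET, SOCK_STREAM, 0);
        connect(cli, (struct sockaddr *)&sin, sizeof(sin));
        int acc = accept(srv, NULL, NULL);

        /* the "peer" sends its data and closes both directions,
         * like vpid 2 does after retry() */
        send(cli, "ack", 3, 0);
        close(cli);
        sleep(1);            /* let the FIN arrive: acc is now in CLOSE_WAIT */

        char buf[8];
        printf("recv   : %zd\n", recv(acc, buf, sizeof(buf), 0));    /* 3: buffered data */
        printf("send #1: %zd\n", send(acc, "hdr", 3, MSG_NOSIGNAL)); /* 3: succeeds! */
        sleep(1);            /* the peer's kernel answers that data with a RST */
        ssize_t rc = send(acc, "hdr", 3, MSG_NOSIGNAL);              /* -1, typically EPIPE */
        printf("send #2: %zd (%s)\n", rc, rc < 0 ? strerror(errno) : "ok");
        return 0;
    }

So the success of send(17, ...) by itself does not tell us anything about the state of the peer.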
Could you please advise :
- should the send fail because the socket is in the CLOSE_WAIT state ?
- if a success is not a bad behaviour, does this mean another step should be added to the oob/tcp "handshake" ?
- or could this mean that when the peer state was moved from MCA_OOB_TCP_CONNECT_ACK to MCA_OOB_TCP_CLOSED, retry() should have been invoked ?

Cheers,

Gilles

On 2014/09/18 17:02, Ralph Castain wrote:
> The patch looks fine to me - please go ahead and apply it. Thanks!
>
> On Sep 17, 2014, at 11:35 PM, Gilles Gouaillardet
> <gilles.gouaillar...@iferc.org> wrote:
>
>> Ralph,
>>
>> yes and no ...
>>
>> mpi hello world with four nodes can be used to reproduce the issue.
>>
>> you can increase the likelihood of producing the race condition by hacking
>> ./opal/mca/event/libevent2021/libevent/poll.c
>> and replacing
>>     i = random() % nfds;
>> with
>>     if (nfds < 2) {
>>         i = 0;
>>     } else {
>>         i = nfds - 2;
>>     }
>>
>> but since this is really a race condition, all i could do is show you
>> how to use a debugger in order to force it.
>>
>> here is what really happens :
>> - thanks to your patch, when vpid 2 cannot read the connect ack, this is
>>   no longer a fatal error.
>> - that being said, peer->recv_event is not removed from libevent
>> - later, send_event will be added to libevent
>> - and then peer->recv_event will be added to libevent
>>   /* this is clearly not supported, and the interesting behaviour is that
>>      peer->send_event will be kicked out of libevent (!) */
>>
>> The attached patch fixes this race condition, could you please review it ?
>>
>> Cheers,
>>
>> Gilles
>>
>> On 2014/09/17 22:17, Ralph Castain wrote:
>>> Do you have a reproducer you can share for testing this? I'm unable to get
>>> it to happen on my machine, but maybe you have a test code that triggers it
>>> so I can continue debugging.
>>>
>>> Ralph
>>>
>>> On Sep 17, 2014, at 4:07 AM, Gilles Gouaillardet
>>> <gilles.gouaillar...@iferc.org> wrote:
>>>
>>>> Thanks Ralph,
>>>>
>>>> this is much better, but there is still a bug :
>>>> with the very same scenario i described earlier, vpid 2 does not send
>>>> its message to vpid 3 once the connection has been established.
>>>>
>>>> i tried to debug it but i have been pretty unsuccessful so far ..
>>>>
>>>> vpid 2 calls tcp_peer_connected and executes the following snippet :
>>>>
>>>>     if (NULL != peer->send_msg && !peer->send_ev_active) {
>>>>         opal_event_add(&peer->send_event, 0);
>>>>         peer->send_ev_active = true;
>>>>     }
>>>>
>>>> but when evmap_io_active is invoked later, the following part :
>>>>
>>>>     TAILQ_FOREACH(ev, &ctx->events, ev_io_next) {
>>>>         if (ev->ev_events & events)
>>>>             event_active_nolock(ev, ev->ev_events & events, 1);
>>>>     }
>>>>
>>>> finds only one ev (mca_oob_tcp_recv_handler and *no*
>>>> mca_oob_tcp_send_handler)
>>>>
>>>> i will resume my investigations tomorrow
>>>>
>>>> Cheers,
>>>>
>>>> Gilles
>>>>
>>>> On 2014/09/17 4:01, Ralph Castain wrote:
>>>>> Hi Gilles
>>>>>
>>>>> I took a crack at solving this in r32744 - CMRd it for 1.8.3 and assigned
>>>>> it to you for review. Give it a try and let me know if I (hopefully) got it.
>>>>>
>>>>> The approach we have used in the past is to have both sides close their
>>>>> connections, and then have the higher vpid retry while the lower one
>>>>> waits. The logic for that was still in place, but it looks like you are
>>>>> hitting a different code path, and I found another potential one as well.
>>>>> So I think I plugged the holes, but will wait to hear if you confirm.
>>>>>
>>>>> Thanks
>>>>> Ralph
>>>>>
>>>>> On Sep 16, 2014, at 6:27 AM, Gilles Gouaillardet
>>>>> <gilles.gouaillar...@gmail.com> wrote:
>>>>>
>>>>>> Ralph,
>>>>>>
>>>>>> here is the full description of a race condition in oob/tcp i very
>>>>>> briefly mentioned in a previous post :
>>>>>>
>>>>>> the race condition can occur when two not yet connected orted try to
>>>>>> send a message to each other for the first time and at the same time.
>>>>>>
>>>>>> that can occur when running mpi helloworld on 4 nodes with the
>>>>>> grpcomm/rcd module.
>>>>>>
>>>>>> here is a scenario in which the race condition occurs :
>>>>>>
>>>>>> orted vpid 2 and 3 enter the allgather
>>>>>> /* they are not yet oob/tcp connected */
>>>>>> and they call orte.send_buffer_nb to each other.
>>>>>> from a libevent point of view, vpid 2 and 3 will call
>>>>>> mca_oob_tcp_peer_try_connect
>>>>>>
>>>>>> vpid 2 calls mca_oob_tcp_send_handler
>>>>>>
>>>>>> vpid 3 calls connection_event_handler
>>>>>>
>>>>>> depending on the value returned by random() in libevent, vpid 3 will
>>>>>> either call mca_oob_tcp_send_handler (likely) or recv_handler (unlikely).
>>>>>> if vpid 3 calls recv_handler, it will close the two sockets to vpid 2.
>>>>>>
>>>>>> then vpid 2 will call mca_oob_tcp_recv_handler
>>>>>> (peer->state is MCA_OOB_TCP_CONNECT_ACK)
>>>>>> that will invoke mca_oob_tcp_recv_connect_ack
>>>>>> tcp_peer_recv_blocking will fail
>>>>>> /* zero bytes are recv'ed since vpid 3 previously closed the socket
>>>>>>    before writing a header */
>>>>>> and this is handled by mca_oob_tcp_recv_handler as a fatal error
>>>>>> /* ORTE_FORCED_TERMINATE(1) */
>>>>>>
>>>>>> could you please have a look at it ?
>>>>>>
>>>>>> if you are too busy, could you please advise where this scenario should
>>>>>> be handled differently ?
>>>>>> - should vpid 3 keep one socket instead of closing both and retrying ?
>>>>>> - should vpid 2 handle the failure as a non fatal error ?
>>>>>>
>>>>>> Cheers,
>>>>>>
>>>>>> Gilles
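Regarding the second question above (whether another step should be added to the oob/tcp "handshake") : since the send() itself cannot be relied on to report the peer close, one possible extra guard - just a sketch of the idea, not a proposed patch, and peer_already_closed() is a made-up helper name - would be to peek the socket before writing the connect ack, since recv() returning 0 is the reliable sign that the peer's FIN has already arrived :

    /* sketch: detect a peer close before sending the connect ack.
     * recv() with MSG_PEEK | MSG_DONTWAIT returns 0 once the peer's FIN has
     * been received and no data is left to read (the CLOSE_WAIT case above). */
    #include <errno.h>
    #include <stdbool.h>
    #include <stdio.h>
    #include <sys/socket.h>
    #include <sys/types.h>
    #include <unistd.h>

    static bool peer_already_closed(int sd)
    {
        char c;
        ssize_t n = recv(sd, &c, 1, MSG_PEEK | MSG_DONTWAIT);
        if (0 == n) {
            return true;     /* orderly shutdown: FIN already received */
        }
        if (n < 0 && ECONNRESET == errno) {
            return true;     /* connection already reset */
        }
        return false;        /* data still pending, or nothing yet (EAGAIN) */
    }

    /* tiny demo with a socketpair: prints 0 before the peer closes, 1 after */
    int main(void)
    {
        int sv[2];
        socketpair(AF_UNIX, SOCK_STREAM, 0, sv);
        printf("before close: %d\n", peer_already_closed(sv[0]));
        close(sv[1]);
        printf("after close : %d\n", peer_already_closed(sv[0]));
        return 0;
    }

Note that such a check only catches a close that has already reached us; it does not remove the need for the retry logic when the close races with the ack.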