Or copy the handshake protocol design of the TCP BTL...

  George.
On Fri, Sep 19, 2014 at 6:23 PM, Ralph Castain <r...@open-mpi.org> wrote:
> You know, I'm almost beginning to dread opening my email in the morning
> for fear of seeing another "race condition" subject line! :-)
>
> I think the correct answer here is that orted 3 should be entering "retry"
> when it sees the peer state change to "closed", regardless of what happened
> in the send. I'll take a look at it, but since you have a reliable
> reproducer and I don't, please feel free to poke at it yourself.
>
> On Sep 19, 2014, at 8:30 AM, Gilles Gouaillardet <
> gilles.gouaillar...@gmail.com> wrote:
>
> Ralph,
>
> let me detail the new race condition.
>
> orted 2 and orted 3 are not connected to each other, and each sends a
> message to the other.
>
> orted 2 and 3 both call send_process (which sets peer->state =
> MCA_OOB_TCP_CONNECTING), and they both end up calling
> mca_oob_tcp_peer_try_connect.
>
> now, if orted 3 calls mca_oob_tcp_send_handler:
> peer->state = MCA_OOB_TCP_CONNECT_ACK
> peer->fd = 15 /* in my environment */
>
> now orted 2 accepts the connection from orted 3 (fd=17 in my environment)
> and invokes recv_handler,
> which calls mca_oob_tcp_peer_recv_connect_ack,
> which in turn calls retry() because peer->state is MCA_OOB_TCP_CONNECTING.
> retry() closes fd 15 and 17.
>
> now orted 3 decides to handle fd 15 (this depends on random() in libevent ...):
> mca_oob_tcp_recv_handler calls mca_oob_tcp_peer_recv_connect_ack,
> which calls tcp_peer_recv_blocking.
> recv(15, ...) returns 0 /* orted 2 has not invoked
> mca_oob_tcp_send_handler yet */
> then mca_oob_tcp_peer_close sets peer->state = MCA_OOB_TCP_CLOSED,
> and mca_oob_tcp_peer_recv_connect_ack does *not* invoke retry().
>
> now orted 3 accepts the connection from orted 2 (fd=17 in my environment):
> it successfully reads the connection request, writes its ack, and
> sets peer->state = MCA_OOB_TCP_CONNECTED.
>
> /* in my previous email, I was surprised the write did not fail.
> on second thought, this is not so surprising: depending on the timing,
> orted 3 might not be aware that orted 2 closed the other end of the socket.
> bottom line, we cannot reliably expect the write to fail in all cases */
>
> the following write to fd=17 fails, and because the state is
> MCA_OOB_TCP_CONNECTED, the result is either a hang (the message from rml
> is never sent) or an abort of the orted.
>
> a possible fix is to ensure retry() is invoked on orted 3 by
> mca_oob_tcp_peer_recv_connect_ack when peer->state is MCA_OOB_TCP_CLOSED.
>
> another possible fix is to add a new state to the sequence:
> orted 3 moves peer->state to MCA_OOB_TCP_CONNECTED when it receives an ack
> from orted 2 (and not when it sends its ack to orted 2).
>
> or something I did not think of yet ...
>
> Cheers,
>
> Gilles
>
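For what it's worth, here is a minimal sketch of the retry-on-closed idea discussed above. The peer structure, state names, and helpers below are simplified stand-ins invented for the example, not the actual oob/tcp code:

    /* sketch: fall back to retry when the handshake finds the peer closed */
    #include <stdio.h>
    #include <sys/types.h>

    typedef enum {
        PEER_CONNECTING,
        PEER_CONNECT_ACK,
        PEER_CONNECTED,
        PEER_CLOSED
    } peer_state_t;

    typedef struct {
        int fd;
        peer_state_t state;
    } peer_t;

    /* stand-in for retry(): drop the sockets and restart the handshake */
    static void retry(peer_t *peer)
    {
        printf("restarting handshake with peer on fd %d\n", peer->fd);
        peer->state = PEER_CONNECTING;
    }

    /* stand-in for the connect-ack receive path */
    static void recv_connect_ack(peer_t *peer, ssize_t nread)
    {
        if (0 == nread) {
            /* the peer closed the socket under us (the orted 3 case above):
             * instead of just marking the peer CLOSED and dropping out,
             * fall back to the retry logic so the handshake is restarted */
            peer->state = PEER_CLOSED;
            retry(peer);
            return;
        }
        /* normal path: the ack was read, the connection is usable */
        peer->state = PEER_CONNECTED;
    }

    int main(void)
    {
        peer_t peer = { 15, PEER_CONNECT_ACK };
        recv_connect_ack(&peer, 0);   /* simulate the zero-byte recv */
        return 0;
    }

The point is only that a zero-byte read during the connect-ack exchange falls back to the retry path instead of leaving the peer parked in the closed state with a message still queued.
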
> On Fri, Sep 19, 2014 at 8:06 PM, Gilles Gouaillardet <
> gilles.gouaillar...@iferc.org> wrote:
>> Ralph,
>>
>> I found another race condition.
>> in a very specific scenario, vpid 3 is in the MCA_OOB_TCP_CLOSED state
>> and processes data from the socket received from vpid 2.
>> vpid 3 is in the MCA_OOB_TCP_CLOSED state because vpid 2 called retry()
>> and closed both of its sockets to vpid 3.
>>
>> vpid 3 reads the ack data that was sent to the socket (ok) and then ends
>> up calling tcp_peer_send_blocking:
>>
>> Function
>> main (orted.c:62)
>> orte_daemon (orted_main.c:828)
>> opal_libevent2021_event_base_loop (event.c:1645)
>> event_process_active (event.c:1437)
>> event_process_active_single_queue (event.c:1367)
>> recv_handler (oob_tcp.c:599)
>> mca_oob_tcp_peer_accept (oob_tcp_connection.c:1071)
>> tcp_peer_send_connect_ack (oob_tcp_connection.c:384)
>> tcp_peer_send_blocking (oob_tcp_connection.c:525)
>>
>> though the socket (fd 17 in my case) has been closed by the peer, and is
>> hence reported in the CLOSE_WAIT state by lsof,
>> send(17, ...) succeeds (!!!)
>>
>> I thought the root cause was that we previously set the O_NONBLOCK flag
>> on this socket,
>> so I explicitly cleared this flag (which was not set anyway...) before
>> invoking mca_oob_tcp_peer_accept,
>> but I got the very same behaviour :-(
>>
>> could you please advise:
>> - should the send fail because the socket is in the CLOSE_WAIT state?
>> - if a success is not bad behaviour, does this mean another step
>> should be added to the oob/tcp "handshake"?
>> - or could this mean that when the peer state was moved from
>> MCA_OOB_TCP_CONNECT_ACK to MCA_OOB_TCP_CLOSED,
>> retry() should have been invoked?
>>
>> Cheers,
>>
>> Gilles
>>
>> On 2014/09/18 17:02, Ralph Castain wrote:
>> > The patch looks fine to me - please go ahead and apply it. Thanks!
>> >
>> > On Sep 17, 2014, at 11:35 PM, Gilles Gouaillardet <
>> > gilles.gouaillar...@iferc.org> wrote:
>> >
>> >> Ralph,
>> >>
>> >> yes and no ...
>> >>
>> >> an mpi hello world on four nodes can be used to reproduce the issue.
>> >>
>> >> you can increase the likelihood of producing the race condition by
>> >> hacking ./opal/mca/event/libevent2021/libevent/poll.c
>> >> and replacing
>> >>     i = random() % nfds;
>> >> with
>> >>     if (nfds < 2) {
>> >>         i = 0;
>> >>     } else {
>> >>         i = nfds - 2;
>> >>     }
>> >>
>> >> but since this is really a race condition, all I could do is show you
>> >> how to use a debugger in order to force it.
>> >>
>> >> here is what really happens:
>> >> - thanks to your patch, when vpid 2 cannot read the acknowledgment,
>> >> this is no longer a fatal error.
>> >> - that being said, peer->recv_event is not removed from libevent.
>> >> - later, send_event is added to libevent,
>> >> - and then peer->recv_event is added to libevent again.
>> >> /* this is clearly not supported, and the interesting behaviour is that
>> >> peer->send_event gets kicked out of libevent (!) */
>> >>
>> >> The attached patch fixes this race condition, could you please review
>> >> it?
>> >>
>> >> Cheers,
>> >>
>> >> Gilles
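To illustrate the event bookkeeping problem described just above, here is a rough sketch of the pattern: delete a still-registered recv event before the peer is torn down, and only add events whose *_ev_active flag is false. The event type and add/del helpers are stand-ins, and this is not the attached patch itself:

    /* sketch: keep the *_ev_active flags in sync with what libevent holds */
    #include <stdbool.h>
    #include <stddef.h>

    typedef struct { bool registered; } event_t;                 /* stand-in        */
    static void ev_add(event_t *ev) { ev->registered = true; }   /* ~opal_event_add */
    static void ev_del(event_t *ev) { ev->registered = false; }  /* ~opal_event_del */

    typedef struct {
        event_t recv_event, send_event;
        bool recv_ev_active, send_ev_active;
        void *send_msg;
    } peer_t;

    /* on a failed CONNECT_ACK: remove the recv event before the peer is
     * closed, so it is not left registered and later re-added on top of
     * itself (which is what ends up evicting the send event) */
    static void handshake_failed(peer_t *peer)
    {
        if (peer->recv_ev_active) {
            ev_del(&peer->recv_event);
            peer->recv_ev_active = false;
        }
    }

    /* on a (re)established connection: add each event only when it is not
     * already active, exactly as the tcp_peer_connected() snippet quoted
     * further down does for the send side */
    static void peer_connected(peer_t *peer)
    {
        if (!peer->recv_ev_active) {
            ev_add(&peer->recv_event);
            peer->recv_ev_active = true;
        }
        if (NULL != peer->send_msg && !peer->send_ev_active) {
            ev_add(&peer->send_event);
            peer->send_ev_active = true;
        }
    }

    int main(void)
    {
        peer_t peer = { .send_msg = NULL };
        handshake_failed(&peer);
        peer_connected(&peer);
        return 0;
    }
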
>> >>
>> >> On 2014/09/17 22:17, Ralph Castain wrote:
>> >>> Do you have a reproducer you can share for testing this? I'm unable
>> >>> to get it to happen on my machine, but maybe you have a test code that
>> >>> triggers it so I can continue debugging.
>> >>>
>> >>> Ralph
>> >>>
>> >>> On Sep 17, 2014, at 4:07 AM, Gilles Gouaillardet <
>> >>> gilles.gouaillar...@iferc.org> wrote:
>> >>>
>> >>>> Thanks Ralph,
>> >>>>
>> >>>> this is much better, but there is still a bug:
>> >>>> with the very same scenario I described earlier, vpid 2 does not send
>> >>>> its message to vpid 3 once the connection has been established.
>> >>>>
>> >>>> I tried to debug it but I have been pretty unsuccessful so far ...
>> >>>>
>> >>>> vpid 2 calls tcp_peer_connected and executes the following snippet:
>> >>>>
>> >>>>     if (NULL != peer->send_msg && !peer->send_ev_active) {
>> >>>>         opal_event_add(&peer->send_event, 0);
>> >>>>         peer->send_ev_active = true;
>> >>>>     }
>> >>>>
>> >>>> but when evmap_io_active is invoked later, the following part:
>> >>>>
>> >>>>     TAILQ_FOREACH(ev, &ctx->events, ev_io_next) {
>> >>>>         if (ev->ev_events & events)
>> >>>>             event_active_nolock(ev, ev->ev_events & events, 1);
>> >>>>     }
>> >>>>
>> >>>> finds only one ev (mca_oob_tcp_recv_handler and *no*
>> >>>> mca_oob_tcp_send_handler).
>> >>>>
>> >>>> I will resume my investigations tomorrow.
>> >>>>
>> >>>> Cheers,
>> >>>>
>> >>>> Gilles
>> >>>>
>> >>>> On 2014/09/17 4:01, Ralph Castain wrote:
>> >>>>> Hi Gilles
>> >>>>>
>> >>>>> I took a crack at solving this in r32744 - CMRd it for 1.8.3 and
>> >>>>> assigned it to you for review. Give it a try and let me know if I
>> >>>>> (hopefully) got it.
>> >>>>>
>> >>>>> The approach we have used in the past is to have both sides close
>> >>>>> their connections, and then have the higher vpid retry while the
>> >>>>> lower one waits. The logic for that was still in place, but it looks
>> >>>>> like you are hitting a different code path, and I found another
>> >>>>> potential one as well. So I think I plugged the holes, but will wait
>> >>>>> to hear if you confirm.
>> >>>>>
>> >>>>> Thanks
>> >>>>> Ralph
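As a rough illustration of that tie-break (higher vpid retries, lower vpid waits), here is a small sketch; all of the helper names below are placeholders invented for the example, not the real code path:

    /* sketch: resolve a simultaneous-connect race by vpid comparison */
    #include <stdbool.h>
    #include <stdint.h>
    #include <stdio.h>

    typedef uint32_t vpid_t;

    static void drop_both_sockets(vpid_t peer) { (void)peer; } /* placeholder */
    static void start_connect(vpid_t peer)     { (void)peer; } /* placeholder */

    /* returns true if this daemon is the one that must actively reconnect */
    static bool resolve_connection_race(vpid_t my_vpid, vpid_t peer_vpid)
    {
        drop_both_sockets(peer_vpid);      /* both sides start from a clean slate */
        if (my_vpid > peer_vpid) {
            start_connect(peer_vpid);      /* the higher vpid retries the connect */
            return true;
        }
        return false;                      /* the lower vpid waits to be accepted */
    }

    int main(void)
    {
        printf("vpid 3 retries: %d\n", resolve_connection_race(3, 2));
        printf("vpid 2 retries: %d\n", resolve_connection_race(2, 3));
        return 0;
    }

With a rule like this only one side actively reconnects after both drop their sockets, so the simultaneous-connect race cannot repeat indefinitely.
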
>> >>>>>
>> >>>>> On Sep 16, 2014, at 6:27 AM, Gilles Gouaillardet <
>> >>>>> gilles.gouaillar...@gmail.com> wrote:
>> >>>>>
>> >>>>>> Ralph,
>> >>>>>>
>> >>>>>> here is the full description of a race condition in oob/tcp that I
>> >>>>>> very briefly mentioned in a previous post:
>> >>>>>>
>> >>>>>> the race condition can occur when two not-yet-connected orteds try
>> >>>>>> to send a message to each other for the first time and at the same
>> >>>>>> time.
>> >>>>>>
>> >>>>>> that can occur when running an mpi hello world on 4 nodes with the
>> >>>>>> grpcomm/rcd module.
>> >>>>>>
>> >>>>>> here is a scenario in which the race condition occurs:
>> >>>>>>
>> >>>>>> orted vpid 2 and 3 enter the allgather
>> >>>>>> /* they are not yet oob/tcp connected */
>> >>>>>> and they call orte.send_buffer_nb to each other.
>> >>>>>> from a libevent point of view, vpid 2 and 3 will call
>> >>>>>> mca_oob_tcp_peer_try_connect.
>> >>>>>>
>> >>>>>> vpid 2 calls mca_oob_tcp_send_handler
>> >>>>>>
>> >>>>>> vpid 3 calls connection_event_handler
>> >>>>>>
>> >>>>>> depending on the value returned by random() in libevent, vpid 3 will
>> >>>>>> either call mca_oob_tcp_send_handler (likely) or recv_handler
>> >>>>>> (unlikely).
>> >>>>>> if vpid 3 calls recv_handler, it will close the two sockets to
>> >>>>>> vpid 2.
>> >>>>>>
>> >>>>>> then vpid 2 will call mca_oob_tcp_recv_handler
>> >>>>>> (peer->state is MCA_OOB_TCP_CONNECT_ACK),
>> >>>>>> which will invoke mca_oob_tcp_recv_connect_ack;
>> >>>>>> tcp_peer_recv_blocking will fail
>> >>>>>> /* zero bytes are recv'ed since vpid 3 previously closed the
>> >>>>>> socket before writing a header */
>> >>>>>> and this is handled by mca_oob_tcp_recv_handler as a fatal error
>> >>>>>> /* ORTE_FORCED_TERMINATE(1) */
>> >>>>>>
>> >>>>>> could you please have a look at it?
>> >>>>>>
>> >>>>>> if you are too busy, could you please advise where this scenario
>> >>>>>> should be handled differently?
>> >>>>>> - should vpid 3 keep one socket instead of closing both and
>> >>>>>> retrying?
>> >>>>>> - should vpid 2 handle the failure as a non-fatal error?
>> >>>>>>
>> >>>>>> Cheers,
>> >>>>>>
>> >>>>>> Gilles