You know, I'm almost beginning to dread opening my email in the morning for fear of seeing another "race condition" subject line! :-)
I think the correct answer here is that orted 3 should be entering "retry" when it sees the peer state change to "closed", regardless of what happened in the send. I'll take a look at it, but since you have a reliable reproducer and I don't, please feel free to poke at it yourself.

On Sep 19, 2014, at 8:30 AM, Gilles Gouaillardet <gilles.gouaillar...@gmail.com> wrote:

> Ralph,
>
> let me detail the new race condition.
>
> orted 2 and 3 are not connected to each other, and each sends a message to the other.
>
> orted 2 and 3 call send_process (which sets peer->state = MCA_OOB_TCP_CONNECTING);
> they both end up calling mca_oob_tcp_peer_try_connect.
>
> now orted 3 calls mca_oob_tcp_send_handler:
> peer->state = MCA_OOB_TCP_CONNECT_ACK
> peer->fd = 15 /* in my environment */
>
> now orted 2 accepts the connection from orted 3 (fd=17 in my environment)
> and invokes recv_handler,
> which calls mca_oob_tcp_peer_recv_connect_ack,
> which in turn calls retry() because peer->state is MCA_OOB_TCP_CONNECTING;
> retry() closes fd 15 and 17.
>
> now orted 3 decides to handle fd 15 (this depends on the random() call in libevent ...):
> mca_oob_tcp_recv_handler calls mca_oob_tcp_peer_recv_connect_ack,
> which calls tcp_peer_recv_blocking;
> recv(15, ...) returns 0 /* orted 2 has not invoked mca_oob_tcp_send_handler yet */,
> then mca_oob_tcp_peer_close sets peer->state = MCA_OOB_TCP_CLOSED,
> and mca_oob_tcp_peer_recv_connect_ack does *not* invoke retry().
>
> now orted 3 accepts the connection from orted 2 (fd=17 in my environment);
> it successfully reads the connection request, writes its ack, and sets
> peer->state = MCA_OOB_TCP_CONNECTED.
>
> /* in my previous email, I was surprised the write did not fail.
> on second thought, this is not so surprising: depending on the timing, orted 3
> might not be aware that orted 2 closed the other end of the socket. bottom
> line, we cannot reliably expect the write to fail in all cases */
>
> the following write to fd=17 fails, and because the state is
> MCA_OOB_TCP_CONNECTED, the result is either a hang (the message from rml is
> never sent) or an abort of the orted.
>
> a possible fix is to ensure retry() is invoked on orted 3 by
> mca_oob_tcp_peer_recv_connect_ack when peer->state is MCA_OOB_TCP_CLOSED.
>
> another possible fix is to add a new state to the sequence:
> orted 3 would move peer->state to MCA_OOB_TCP_CONNECTED when it receives an ack
> from orted 2 (and not when it sends its ack to orted 2).
>
> or something I did not think of yet ...
>
> Cheers,
>
> Gilles
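To make that concrete, here is a minimal stand-alone model of the first fix proposed above (fall into retry() when the connect-ack handler finds the peer already CLOSED). Only the state names are taken from the code above; the handler, the retry() helper and their signatures are simplified stand-ins, not the actual oob/tcp implementation:

    /* Stand-alone model of the proposed fix -- not the real oob/tcp code.
     * State names are kept; everything else is an illustrative stand-in. */
    #include <stdio.h>

    typedef enum {
        MCA_OOB_TCP_CLOSED,
        MCA_OOB_TCP_CONNECTING,
        MCA_OOB_TCP_CONNECT_ACK,
        MCA_OOB_TCP_CONNECTED
    } peer_state_t;

    static void retry(peer_state_t *state)
    {
        /* in the real code this closes the sockets and re-initiates the
         * connection (per the thread: higher vpid retries, lower one waits) */
        printf("retrying connection\n");
        *state = MCA_OOB_TCP_CONNECTING;
    }

    static void recv_connect_ack(peer_state_t *state, int ack_ok)
    {
        switch (*state) {
        case MCA_OOB_TCP_CONNECT_ACK:
            if (ack_ok) {
                *state = MCA_OOB_TCP_CONNECTED;
                break;
            }
            /* recv() returned 0: peer_close() already moved us to CLOSED */
            *state = MCA_OOB_TCP_CLOSED;
            /* fall through */
        case MCA_OOB_TCP_CONNECTING:
        case MCA_OOB_TCP_CLOSED:      /* <-- the case that was not retried */
            retry(state);
            break;
        default:
            break;
        }
    }

    int main(void)
    {
        /* orted 3 after orted 2 closed both sockets: the ack read fails,
         * and the peer now retries instead of staying CLOSED */
        peer_state_t state = MCA_OOB_TCP_CONNECT_ACK;
        recv_connect_ack(&state, 0);
        return 0;
    }

In the real code the equivalent change would be to add MCA_OOB_TCP_CLOSED to the set of peer states for which mca_oob_tcp_peer_recv_connect_ack falls into the retry path, as proposed above.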
> On Fri, Sep 19, 2014 at 8:06 PM, Gilles Gouaillardet
> <gilles.gouaillar...@iferc.org> wrote:
> Ralph,
>
> I found another race condition.
>
> in a very specific scenario, vpid 3 is in the MCA_OOB_TCP_CLOSED state
> and processes data from the socket received from vpid 2.
> vpid 3 is in the MCA_OOB_TCP_CLOSED state because vpid 2 called retry()
> and closed both of its sockets to vpid 3.
>
> vpid 3 reads the ack data that was sent to the socket (ok) and then ends
> up calling tcp_peer_send_blocking:
>
> Function
> main (orted.c:62)
> orte_daemon (orted_main.c:828)
> opal_libevent2021_event_base_loop (event.c:1645)
> event_process_active (event.c:1437)
> event_process_active_single_queue (event.c:1367)
> recv_handler (oob_tcp.c:599)
> mca_oob_tcp_peer_accept (oob_tcp_connection.c:1071)
> tcp_peer_send_connect_ack (oob_tcp_connection.c:384)
> tcp_peer_send_blocking (oob_tcp_connection.c:525)
>
> though the socket (fd 17 in my case) has been closed by the peer, and is
> hence reported in the CLOSE_WAIT state by lsof,
> send(17, ...) succeeds (!!!)
>
> I thought the root cause was that we previously set the O_NONBLOCK flag on
> this socket,
> so I explicitly cleared this flag (which was not set anyway ...) before
> invoking mca_oob_tcp_peer_accept,
> but I got the very same behaviour :-(
>
> could you please advise:
> - should the send fail because the socket is in the CLOSE_WAIT state?
> - if a success is not a bad behaviour, does this mean another step
> should be added to the oob/tcp "handshake"?
> - or could this mean that when the peer state was moved from
> MCA_OOB_TCP_CONNECT_ACK to MCA_OOB_TCP_CLOSED,
> retry() should have been invoked?
>
> Cheers,
>
> Gilles
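On the first question: a send() to a socket sitting in CLOSE_WAIT is generally expected to succeed, because the data is merely queued by the kernel; the failure (ECONNRESET or EPIPE) only shows up on a later send(), after the fully-closed peer has answered with an RST. The following stand-alone demo (not Open MPI code; plain loopback TCP on a POSIX system, error checking omitted for brevity, names illustrative) shows that behaviour:

    /* Stand-alone demo (not oob/tcp code): the first send() to a socket whose
     * peer has already closed typically succeeds; only a later send() fails. */
    #include <arpa/inet.h>
    #include <errno.h>
    #include <netinet/in.h>
    #include <signal.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/socket.h>
    #include <sys/types.h>
    #include <unistd.h>

    static void try_send(int sd, const char *label)
    {
        ssize_t rc = send(sd, "ack", 3, 0);
        if (rc < 0) {
            printf("%s: failed (%s)\n", label, strerror(errno));
        } else {
            printf("%s: success (%zd bytes)\n", label, rc);
        }
    }

    int main(void)
    {
        signal(SIGPIPE, SIG_IGN);   /* report EPIPE instead of being killed */

        /* loopback listener on an ephemeral port */
        int lsd = socket(AF_INET, SOCK_STREAM, 0);
        struct sockaddr_in addr;
        memset(&addr, 0, sizeof(addr));
        addr.sin_family = AF_INET;
        addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
        bind(lsd, (struct sockaddr *)&addr, sizeof(addr));
        listen(lsd, 1);
        socklen_t len = sizeof(addr);
        getsockname(lsd, (struct sockaddr *)&addr, &len);

        int cli = socket(AF_INET, SOCK_STREAM, 0);   /* "vpid 3" side */
        connect(cli, (struct sockaddr *)&addr, sizeof(addr));

        int srv = accept(lsd, NULL, NULL);           /* "vpid 2" side */
        close(srv);                 /* peer closes; cli goes to CLOSE_WAIT */
        sleep(1);

        try_send(cli, "send #1");   /* succeeds: data only buffered locally */
        sleep(1);                   /* the closed peer answers with an RST  */
        try_send(cli, "send #2");   /* now fails (ECONNRESET or EPIPE)      */
        return 0;
    }

This is consistent with the remark earlier in the thread: the handshake cannot rely on the connect-ack write failing promptly.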
> On 2014/09/18 17:02, Ralph Castain wrote:
> > The patch looks fine to me - please go ahead and apply it. Thanks!
> >
> > On Sep 17, 2014, at 11:35 PM, Gilles Gouaillardet
> > <gilles.gouaillar...@iferc.org> wrote:
> >
> >> Ralph,
> >>
> >> yes and no ...
> >>
> >> mpi hello world with four nodes can be used to reproduce the issue.
> >>
> >> you can increase the likelihood of reproducing the race condition by hacking
> >> ./opal/mca/event/libevent2021/libevent/poll.c
> >> and replacing
> >>     i = random() % nfds;
> >> with
> >>     if (nfds < 2) {
> >>         i = 0;
> >>     } else {
> >>         i = nfds - 2;
> >>     }
> >>
> >> but since this is really a race condition, all I could do is show you
> >> how to use a debugger in order to force it.
> >>
> >> here is what really happens:
> >> - thanks to your patch, when vpid 2 cannot read the acknowledgment, this is
> >> no longer a fatal error.
> >> - that being said, peer->recv_event is not removed from libevent
> >> - later, send_event is added to libevent
> >> - and then peer->recv_event is added to libevent again
> >> /* this is clearly not supported, and the interesting behaviour is that
> >> peer->send_event gets kicked out of libevent (!) */
> >>
> >> The attached patch fixes this race condition; could you please review it?
> >>
> >> Cheers,
> >>
> >> Gilles
> >>
> >> On 2014/09/17 22:17, Ralph Castain wrote:
> >>> Do you have a reproducer you can share for testing this? I'm unable to
> >>> get it to happen on my machine, but maybe you have a test code that
> >>> triggers it so I can continue debugging.
> >>>
> >>> Ralph
> >>>
> >>> On Sep 17, 2014, at 4:07 AM, Gilles Gouaillardet
> >>> <gilles.gouaillar...@iferc.org> wrote:
> >>>
> >>>> Thanks Ralph,
> >>>>
> >>>> this is much better, but there is still a bug:
> >>>> with the very same scenario I described earlier, vpid 2 does not send
> >>>> its message to vpid 3 once the connection has been established.
> >>>>
> >>>> I tried to debug it, but I have been pretty unsuccessful so far ...
> >>>>
> >>>> vpid 2 calls tcp_peer_connected and executes the following snippet:
> >>>>
> >>>> if (NULL != peer->send_msg && !peer->send_ev_active) {
> >>>>     opal_event_add(&peer->send_event, 0);
> >>>>     peer->send_ev_active = true;
> >>>> }
> >>>>
> >>>> but when evmap_io_active is invoked later, the following part:
> >>>>
> >>>> TAILQ_FOREACH(ev, &ctx->events, ev_io_next) {
> >>>>     if (ev->ev_events & events)
> >>>>         event_active_nolock(ev, ev->ev_events & events, 1);
> >>>> }
> >>>>
> >>>> finds only one ev (mca_oob_tcp_recv_handler, and *no*
> >>>> mca_oob_tcp_send_handler).
> >>>>
> >>>> I will resume my investigations tomorrow.
> >>>>
> >>>> Cheers,
> >>>>
> >>>> Gilles
> >>>>
> >>>> On 2014/09/17 4:01, Ralph Castain wrote:
> >>>>> Hi Gilles
> >>>>>
> >>>>> I took a crack at solving this in r32744 - CMRd it for 1.8.3 and
> >>>>> assigned it to you for review. Give it a try and let me know if I
> >>>>> (hopefully) got it.
> >>>>>
> >>>>> The approach we have used in the past is to have both sides close their
> >>>>> connections, and then have the higher vpid retry while the lower one
> >>>>> waits. The logic for that was still in place, but it looks like you are
> >>>>> hitting a different code path, and I found another potential one as
> >>>>> well. So I think I plugged the holes, but will wait to hear if you
> >>>>> confirm.
> >>>>>
> >>>>> Thanks
> >>>>> Ralph
> >>>>>
> >>>>> On Sep 16, 2014, at 6:27 AM, Gilles Gouaillardet
> >>>>> <gilles.gouaillar...@gmail.com> wrote:
> >>>>>
> >>>>>> Ralph,
> >>>>>>
> >>>>>> here is the full description of a race condition in oob/tcp I very
> >>>>>> briefly mentioned in a previous post:
> >>>>>>
> >>>>>> the race condition can occur when two orteds that are not yet connected
> >>>>>> try to send a message to each other for the first time, at the same time.
> >>>>>>
> >>>>>> that can occur when running mpi helloworld on 4 nodes with the
> >>>>>> grpcomm/rcd module.
> >>>>>>
> >>>>>> here is a scenario in which the race condition occurs:
> >>>>>>
> >>>>>> orted vpid 2 and 3 enter the allgather
> >>>>>> /* they are not yet oob/tcp connected */
> >>>>>> and they call orte_rml.send_buffer_nb to each other.
> >>>>>> from a libevent point of view, vpid 2 and 3 will call
> >>>>>> mca_oob_tcp_peer_try_connect.
> >>>>>>
> >>>>>> vpid 2 calls mca_oob_tcp_send_handler.
> >>>>>>
> >>>>>> vpid 3 calls connection_event_handler.
> >>>>>>
> >>>>>> depending on the value returned by random() in libevent, vpid 3 will
> >>>>>> either call mca_oob_tcp_send_handler (likely) or recv_handler (unlikely).
> >>>>>> if vpid 3 calls recv_handler, it will close the two sockets to vpid 2.
> >>>>>>
> >>>>>> then vpid 2 will call mca_oob_tcp_recv_handler
> >>>>>> (peer->state is MCA_OOB_TCP_CONNECT_ACK),
> >>>>>> which will invoke mca_oob_tcp_recv_connect_ack;
> >>>>>> tcp_peer_recv_blocking will fail
> >>>>>> /* zero bytes are recv'ed since vpid 3 previously closed the socket
> >>>>>> before writing a header */
> >>>>>> and this is handled by mca_oob_tcp_recv_handler as a fatal error
> >>>>>> /* ORTE_FORCED_TERMINATE(1) */.
> >>>>>>
> >>>>>> could you please have a look at it?
> >>>>>>
> >>>>>> if you are too busy, could you please advise where this scenario
> >>>>>> should be handled differently?
> >>>>>> - should vpid 3 keep one socket instead of closing both and retrying?
> >>>>>> - should vpid 2 handle the failure as a non-fatal error?
> >>>>>>
> >>>>>> Cheers,
> >>>>>>
> >>>>>> Gilles
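On those last two questions: the approach described earlier in the thread -- both sides close their connections, then the higher vpid retries while the lower one waits -- boils down to a tie-break on simultaneous connection attempts. A stand-alone sketch of that rule (illustrative names only, not the actual oob/tcp logic):

    /* Stand-alone sketch of the "higher vpid retries, lower vpid waits"
     * tie-break described in the thread -- not the actual oob/tcp code. */
    #include <stdbool.h>
    #include <stdint.h>
    #include <stdio.h>

    /* decide what to do after both sides of a simultaneous connection
     * attempt have closed their sockets */
    static bool i_should_retry(uint32_t my_vpid, uint32_t peer_vpid)
    {
        /* the higher vpid re-initiates the connection;
         * the lower vpid simply waits to be connected to */
        return my_vpid > peer_vpid;
    }

    int main(void)
    {
        /* the scenario from the thread: orted vpid 2 and vpid 3 collide */
        printf("vpid 3 vs vpid 2: %s\n",
               i_should_retry(3, 2) ? "retry" : "wait");
        printf("vpid 2 vs vpid 3: %s\n",
               i_should_retry(2, 3) ? "retry" : "wait");
        return 0;
    }

Under that rule only one of the two daemons re-initiates the connection, which is what prevents the two sides from repeatedly closing each other's sockets.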