You know, I'm almost beginning to dread opening my email in the morning for fear of seeing another "race condition" subject line! :-)
I think the correct answer here is that orted 3 should be entering "retry" when it sees the peer state change to "closed", regardless of what happened in the send. I'll take a look at it, but since you have a reliable reproducer and I don't, please feel free to poke at it yourself.

On Sep 19, 2014, at 8:30 AM, Gilles Gouaillardet <gilles.gouaillar...@gmail.com> wrote:

> Ralph,
>
> let me detail the new race condition.
>
> orted 2 and 3 are not connected to each other, and each sends a message to the other.
>
> orted 2 and 3 call send_process (which sets peer->state = MCA_OOB_TCP_CONNECTING);
> they both end up calling mca_oob_tcp_peer_try_connect.
>
> now orted 3 calls mca_oob_tcp_send_handler:
> peer->state = MCA_OOB_TCP_CONNECT_ACK
> peer->fd = 15 /* in my environment */
>
> now orted 2 accepts the connection from orted 3 (fd=17 in my environment)
> and invokes recv_handler,
> which calls mca_oob_tcp_peer_recv_connect_ack,
> which in turn calls retry() because peer->state is MCA_OOB_TCP_CONNECTING;
> retry() closes fd 15 and 17.
>
> now orted 3 decides to handle fd 15 (this depends on the random() call in libevent ...):
> mca_oob_tcp_recv_handler calls mca_oob_tcp_peer_recv_connect_ack,
> which calls tcp_peer_recv_blocking;
> recv(15, ...) returns 0 /* orted 2 has not invoked mca_oob_tcp_send_handler yet */,
> then mca_oob_tcp_peer_close sets peer->state = MCA_OOB_TCP_CLOSED,
> and mca_oob_tcp_peer_recv_connect_ack does *not* invoke retry().
>
> now orted 3 accepts the connection from orted 2 (fd=17 in my environment);
> it successfully reads the connection request, writes its ack, and sets
> peer->state = MCA_OOB_TCP_CONNECTED.
>
> /* in my previous email, I was surprised the write did not fail.
> on second thought, this is not so surprising: depending on the timing, orted 3
> might not be aware that orted 2 closed the other end of the socket. bottom
> line, we cannot reliably expect the write to fail in all cases */
>
> the following write to fd=17 fails, and because the state is
> MCA_OOB_TCP_CONNECTED, the result is either a hang (the message from rml is
> never sent) or an abort of the orted.
>
> a possible fix is to ensure retry() is invoked on orted 3 by
> mca_oob_tcp_peer_recv_connect_ack when peer->state is MCA_OOB_TCP_CLOSED.
>
> another possible fix is to add a new state to the sequence:
> orted 3 would move peer->state to MCA_OOB_TCP_CONNECTED when it receives an ack
> from orted 2 (and not when it sends its ack to orted 2).
>
> or something I did not think of yet ...
>
> Cheers,
>
> Gilles
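To make that concrete, here is a minimal stand-alone model of the first fix proposed above (fall into retry() when the connect-ack handler finds the peer already CLOSED). Only the state names are taken from the code above; the handler, the retry() helper and their signatures are simplified stand-ins, not the actual oob/tcp implementation:

    /* Stand-alone model of the proposed fix -- not the real oob/tcp code.
     * State names are kept; everything else is an illustrative stand-in. */
    #include <stdio.h>

    typedef enum {
        MCA_OOB_TCP_CLOSED,
        MCA_OOB_TCP_CONNECTING,
        MCA_OOB_TCP_CONNECT_ACK,
        MCA_OOB_TCP_CONNECTED
    } peer_state_t;

    static void retry(peer_state_t *state)
    {
        /* in the real code this closes the sockets and re-initiates the
         * connection (per the thread: higher vpid retries, lower one waits) */
        printf("retrying connection\n");
        *state = MCA_OOB_TCP_CONNECTING;
    }

    static void recv_connect_ack(peer_state_t *state, int ack_ok)
    {
        switch (*state) {
        case MCA_OOB_TCP_CONNECT_ACK:
            if (ack_ok) {
                *state = MCA_OOB_TCP_CONNECTED;
                break;
            }
            /* recv() returned 0: peer_close() already moved us to CLOSED */
            *state = MCA_OOB_TCP_CLOSED;
            /* fall through */
        case MCA_OOB_TCP_CONNECTING:
        case MCA_OOB_TCP_CLOSED:      /* <-- the case that was not retried */
            retry(state);
            break;
        default:
            break;
        }
    }

    int main(void)
    {
        /* orted 3 after orted 2 closed both sockets: the ack read fails,
         * and the peer now retries instead of staying CLOSED */
        peer_state_t state = MCA_OOB_TCP_CONNECT_ACK;
        recv_connect_ack(&state, 0);
        return 0;
    }

In the real code the equivalent change would be to add MCA_OOB_TCP_CLOSED to the set of peer states for which mca_oob_tcp_peer_recv_connect_ack falls into the retry path, as proposed above.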
> On Fri, Sep 19, 2014 at 8:06 PM, Gilles Gouaillardet
> <gilles.gouaillar...@iferc.org> wrote:
> Ralph,
>
> I found another race condition.
>
> in a very specific scenario, vpid 3 is in the MCA_OOB_TCP_CLOSED state
> and processes data from the socket received from vpid 2.
> vpid 3 is in the MCA_OOB_TCP_CLOSED state because vpid 2 called retry()
> and closed both of its sockets to vpid 3.
>
> vpid 3 reads the ack data that was sent to the socket (ok) and then ends
> up calling tcp_peer_send_blocking:
>
> Function
> main (orted.c:62)
> orte_daemon (orted_main.c:828)
> opal_libevent2021_event_base_loop (event.c:1645)
> event_process_active (event.c:1437)
> event_process_active_single_queue (event.c:1367)
> recv_handler (oob_tcp.c:599)
> mca_oob_tcp_peer_accept (oob_tcp_connection.c:1071)
> tcp_peer_send_connect_ack (oob_tcp_connection.c:384)
> tcp_peer_send_blocking (oob_tcp_connection.c:525)
>
> though the socket (fd 17 in my case) has been closed by the peer, and is
> hence reported in the CLOSE_WAIT state by lsof,
> send(17, ...) succeeds (!!!)
>
> I thought the root cause was that we previously set the O_NONBLOCK flag on
> this socket,
> so I explicitly cleared this flag (which was not set anyway ...) before
> invoking mca_oob_tcp_peer_accept,
> but I got the very same behaviour :-(
>
> could you please advise:
> - should the send fail because the socket is in the CLOSE_WAIT state?
> - if a success is not a bad behaviour, does this mean another step
> should be added to the oob/tcp "handshake"?
> - or could this mean that when the peer state was moved from
> MCA_OOB_TCP_CONNECT_ACK to MCA_OOB_TCP_CLOSED,
> retry() should have been invoked?
>
> Cheers,
>
> Gilles
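On the first question: a send() to a socket sitting in CLOSE_WAIT is generally expected to succeed, because the data is merely queued by the kernel; the failure (ECONNRESET or EPIPE) only shows up on a later send(), after the fully-closed peer has answered with an RST. The following stand-alone demo (not Open MPI code; plain loopback TCP on a POSIX system, error checking omitted for brevity, names illustrative) shows that behaviour:

    /* Stand-alone demo (not oob/tcp code): the first send() to a socket whose
     * peer has already closed typically succeeds; only a later send() fails. */
    #include <arpa/inet.h>
    #include <errno.h>
    #include <netinet/in.h>
    #include <signal.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/socket.h>
    #include <sys/types.h>
    #include <unistd.h>

    static void try_send(int sd, const char *label)
    {
        ssize_t rc = send(sd, "ack", 3, 0);
        if (rc < 0) {
            printf("%s: failed (%s)\n", label, strerror(errno));
        } else {
            printf("%s: success (%zd bytes)\n", label, rc);
        }
    }

    int main(void)
    {
        signal(SIGPIPE, SIG_IGN);   /* report EPIPE instead of being killed */

        /* loopback listener on an ephemeral port */
        int lsd = socket(AF_INET, SOCK_STREAM, 0);
        struct sockaddr_in addr;
        memset(&addr, 0, sizeof(addr));
        addr.sin_family = AF_INET;
        addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
        bind(lsd, (struct sockaddr *)&addr, sizeof(addr));
        listen(lsd, 1);
        socklen_t len = sizeof(addr);
        getsockname(lsd, (struct sockaddr *)&addr, &len);

        int cli = socket(AF_INET, SOCK_STREAM, 0);   /* "vpid 3" side */
        connect(cli, (struct sockaddr *)&addr, sizeof(addr));

        int srv = accept(lsd, NULL, NULL);           /* "vpid 2" side */
        close(srv);                 /* peer closes; cli goes to CLOSE_WAIT */
        sleep(1);

        try_send(cli, "send #1");   /* succeeds: data only buffered locally */
        sleep(1);                   /* the closed peer answers with an RST  */
        try_send(cli, "send #2");   /* now fails (ECONNRESET or EPIPE)      */
        return 0;
    }

This is consistent with the remark earlier in the thread: the handshake cannot rely on the connect-ack write failing promptly.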
> On 2014/09/18 17:02, Ralph Castain wrote:
> > The patch looks fine to me - please go ahead and apply it. Thanks!
> >
> > On Sep 17, 2014, at 11:35 PM, Gilles Gouaillardet
> > <gilles.gouaillar...@iferc.org> wrote:
> >
> >> Ralph,
> >>
> >> yes and no ...
> >>
> >> mpi hello world with four nodes can be used to reproduce the issue.
> >>
> >> you can increase the likelihood of reproducing the race condition by hacking
> >> ./opal/mca/event/libevent2021/libevent/poll.c
> >> and replacing
> >>     i = random() % nfds;
> >> with
> >>     if (nfds < 2) {
> >>         i = 0;
> >>     } else {
> >>         i = nfds - 2;
> >>     }
> >>
> >> but since this is really a race condition, all I could do is show you
> >> how to use a debugger in order to force it.
> >>
> >> here is what really happens:
> >> - thanks to your patch, when vpid 2 cannot read the acknowledgment, this is
> >> no longer a fatal error.
> >> - that being said, peer->recv_event is not removed from libevent
> >> - later, send_event is added to libevent
> >> - and then peer->recv_event is added to libevent again
> >> /* this is clearly not supported, and the interesting behaviour is that
> >> peer->send_event gets kicked out of libevent (!) */
> >>
> >> The attached patch fixes this race condition; could you please review it?
> >>
> >> Cheers,
> >>
> >> Gilles
> >>
> >> On 2014/09/17 22:17, Ralph Castain wrote:
> >>> Do you have a reproducer you can share for testing this? I'm unable to
> >>> get it to happen on my machine, but maybe you have a test code that
> >>> triggers it so I can continue debugging.
> >>>
> >>> Ralph
> >>>
> >>> On Sep 17, 2014, at 4:07 AM, Gilles Gouaillardet
> >>> <gilles.gouaillar...@iferc.org> wrote:
> >>>
> >>>> Thanks Ralph,
> >>>>
> >>>> this is much better, but there is still a bug:
> >>>> with the very same scenario I described earlier, vpid 2 does not send
> >>>> its message to vpid 3 once the connection has been established.
> >>>>
> >>>> I tried to debug it, but I have been pretty unsuccessful so far ...
> >>>>
> >>>> vpid 2 calls tcp_peer_connected and executes the following snippet:
> >>>>
> >>>> if (NULL != peer->send_msg && !peer->send_ev_active) {
> >>>>     opal_event_add(&peer->send_event, 0);
> >>>>     peer->send_ev_active = true;
> >>>> }
> >>>>
> >>>> but when evmap_io_active is invoked later, the following part:
> >>>>
> >>>> TAILQ_FOREACH(ev, &ctx->events, ev_io_next) {
> >>>>     if (ev->ev_events & events)
> >>>>         event_active_nolock(ev, ev->ev_events & events, 1);
> >>>> }
> >>>>
> >>>> finds only one ev (mca_oob_tcp_recv_handler, and *no*
> >>>> mca_oob_tcp_send_handler).
> >>>>
> >>>> I will resume my investigations tomorrow.
> >>>>
> >>>> Cheers,
> >>>>
> >>>> Gilles
> >>>>
> >>>> On 2014/09/17 4:01, Ralph Castain wrote:
> >>>>> Hi Gilles
> >>>>>
> >>>>> I took a crack at solving this in r32744 - CMRd it for 1.8.3 and
> >>>>> assigned it to you for review. Give it a try and let me know if I
> >>>>> (hopefully) got it.
> >>>>>
> >>>>> The approach we have used in the past is to have both sides close their
> >>>>> connections, and then have the higher vpid retry while the lower one
> >>>>> waits. The logic for that was still in place, but it looks like you are
> >>>>> hitting a different code path, and I found another potential one as
> >>>>> well. So I think I plugged the holes, but will wait to hear if you
> >>>>> confirm.
> >>>>>
> >>>>> Thanks
> >>>>> Ralph
> >>>>>
> >>>>> On Sep 16, 2014, at 6:27 AM, Gilles Gouaillardet
> >>>>> <gilles.gouaillar...@gmail.com> wrote:
> >>>>>
> >>>>>> Ralph,
> >>>>>>
> >>>>>> here is the full description of a race condition in oob/tcp I very
> >>>>>> briefly mentioned in a previous post:
> >>>>>>
> >>>>>> the race condition can occur when two orteds that are not yet connected
> >>>>>> try to send a message to each other for the first time, at the same time.
> >>>>>>
> >>>>>> that can occur when running mpi helloworld on 4 nodes with the
> >>>>>> grpcomm/rcd module.
> >>>>>>
> >>>>>> here is a scenario in which the race condition occurs:
> >>>>>>
> >>>>>> orted vpid 2 and 3 enter the allgather
> >>>>>> /* they are not yet oob/tcp connected */
> >>>>>> and they call orte_rml.send_buffer_nb to each other.
> >>>>>> from a libevent point of view, vpid 2 and 3 will call
> >>>>>> mca_oob_tcp_peer_try_connect.
> >>>>>>
> >>>>>> vpid 2 calls mca_oob_tcp_send_handler.
> >>>>>>
> >>>>>> vpid 3 calls connection_event_handler.
> >>>>>>
> >>>>>> depending on the value returned by random() in libevent, vpid 3 will
> >>>>>> either call mca_oob_tcp_send_handler (likely) or recv_handler (unlikely).
> >>>>>> if vpid 3 calls recv_handler, it will close the two sockets to vpid 2.
> >>>>>>
> >>>>>> then vpid 2 will call mca_oob_tcp_recv_handler
> >>>>>> (peer->state is MCA_OOB_TCP_CONNECT_ACK),
> >>>>>> which will invoke mca_oob_tcp_recv_connect_ack;
> >>>>>> tcp_peer_recv_blocking will fail
> >>>>>> /* zero bytes are recv'ed since vpid 3 previously closed the socket
> >>>>>> before writing a header */
> >>>>>> and this is handled by mca_oob_tcp_recv_handler as a fatal error
> >>>>>> /* ORTE_FORCED_TERMINATE(1) */.
> >>>>>>
> >>>>>> could you please have a look at it?
> >>>>>>
> >>>>>> if you are too busy, could you please advise where this scenario
> >>>>>> should be handled differently?
> >>>>>> - should vpid 3 keep one socket instead of closing both and retrying?
> >>>>>> - should vpid 2 handle the failure as a non-fatal error?
> >>>>>>
> >>>>>> Cheers,
> >>>>>>
> >>>>>> Gilles
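On those last two questions: the approach described earlier in the thread -- both sides close their connections, then the higher vpid retries while the lower one waits -- boils down to a tie-break on simultaneous connection attempts. A stand-alone sketch of that rule (illustrative names only, not the actual oob/tcp logic):

    /* Stand-alone sketch of the "higher vpid retries, lower vpid waits"
     * tie-break described in the thread -- not the actual oob/tcp code. */
    #include <stdbool.h>
    #include <stdint.h>
    #include <stdio.h>

    /* decide what to do after both sides of a simultaneous connection
     * attempt have closed their sockets */
    static bool i_should_retry(uint32_t my_vpid, uint32_t peer_vpid)
    {
        /* the higher vpid re-initiates the connection;
         * the lower vpid simply waits to be connected to */
        return my_vpid > peer_vpid;
    }

    int main(void)
    {
        /* the scenario from the thread: orted vpid 2 and vpid 3 collide */
        printf("vpid 3 vs vpid 2: %s\n",
               i_should_retry(3, 2) ? "retry" : "wait");
        printf("vpid 2 vs vpid 3: %s\n",
               i_should_retry(2, 3) ? "retry" : "wait");
        return 0;
    }

Under that rule only one of the two daemons re-initiates the connection, which is what prevents the two sides from repeatedly closing each other's sockets.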