Ralph,

let me detail the new race condition.

orted 2 and orted 3 are not yet connected to each other, and each one sends a
message to the other at the same time.

orted 2 and 3 both call send_process (which sets peer->state =
MCA_OOB_TCP_CONNECTING), and they both end up calling
mca_oob_tcp_peer_try_connect.

now suppose orted 3 calls mca_oob_tcp_send_handler ; on orted 3 this leaves
peer->state = MCA_OOB_TCP_CONNECT_ACK
peer->fd = 15 /* in my environment */

now orted 2 accepts the connection from orted 3 (fd=17 in my environment)
and invokes recv_handler,
which calls mca_oob_tcp_peer_recv_connect_ack,
which in turn calls retry() because peer->state is MCA_OOB_TCP_CONNECTING.
retry() closes fd 15 and fd 17 on orted 2.

now orted 3 decides to handle fd 15 (this depends on the random() call inside
libevent ...) :
mca_oob_tcp_recv_handler calls mca_oob_tcp_peer_recv_connect_ack,
which calls tcp_peer_recv_blocking.
recv(15, ...) returns 0 /* orted 2 did not invoke mca_oob_tcp_send_handler
yet, and its retry() has since closed the other end of this socket */
then mca_oob_tcp_peer_close sets peer->state = MCA_OOB_TCP_CLOSED,
and mca_oob_tcp_peer_recv_connect_ack does *not* invoke retry().

now orted 3 accepts the connection from orted 2 (fd=17 in my environment).
it successfully reads the connection request, writes its ack, and sets
peer->state = MCA_OOB_TCP_CONNECTED.

/* in my previous email, i was surprised the write did not fail.
on second thought, this is not so surprising : depending on the timing,
orted 3 might not be aware that orted 2 closed the other end of the socket.
bottom line, we cannot reliably expect the write to fail in all cases */
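
as a side note, here is a minimal standalone demo (plain POSIX sockets, error
checking omitted, nothing to do with the oob/tcp code itself, all names and
values are mine) of both behaviours described above : recv() returning 0 once
the peer has closed its end without writing anything, and the first send() on
a socket already in CLOSE_WAIT typically still succeeding :

    /* demo only: connect over loopback, close the accepted side,
     * then look at what the still-open side observes */
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>
    #include <signal.h>
    #include <sys/socket.h>
    #include <netinet/in.h>
    #include <arpa/inet.h>

    int main(void)
    {
        signal(SIGPIPE, SIG_IGN);   /* report EPIPE instead of killing us */

        int srv = socket(AF_INET, SOCK_STREAM, 0);
        struct sockaddr_in addr;
        memset(&addr, 0, sizeof(addr));
        addr.sin_family = AF_INET;
        addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
        addr.sin_port = 0;          /* let the kernel pick a port */
        bind(srv, (struct sockaddr *)&addr, sizeof(addr));
        listen(srv, 1);
        socklen_t len = sizeof(addr);
        getsockname(srv, (struct sockaddr *)&addr, &len);

        int cli = socket(AF_INET, SOCK_STREAM, 0);
        connect(cli, (struct sockaddr *)&addr, sizeof(addr));
        int acc = accept(srv, NULL, NULL);

        close(acc);                 /* the "peer" closes without writing */
        sleep(1);                   /* let the FIN reach cli: CLOSE_WAIT */

        char c;
        printf("recv:     %zd\n", recv(cli, &c, 1, 0));   /* 0, i.e. EOF   */
        printf("1st send: %zd\n", send(cli, "x", 1, 0));  /* succeeds      */
        sleep(1);                   /* the peer answers that byte with RST */
        printf("2nd send: %zd\n", send(cli, "x", 1, 0));  /* fails, EPIPE  */
        return 0;
    }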

the following write to fd=17 then fails, and because the state is already
MCA_OOB_TCP_CONNECTED, the result is either a hang (the message from the rml
is never sent) or an abort of orted.


a possible fix is to ensure retry() is invoked on orted 3 by
mca_oob_tcp_peer_recv_connect_ack when peer->state is MCA_OOB_TCP_CLOSED
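
in code, that could look roughly like this (hand-written sketch, not tested
against the actual tree, and the exact arguments of retry() are a guess) :

    /* sketch only: in mca_oob_tcp_peer_recv_connect_ack(), once
     * tcp_peer_recv_blocking() has failed and mca_oob_tcp_peer_close()
     * has left the peer in the CLOSED state, fall back to the same
     * retry logic that is already used for the CONNECTING case */
    if (MCA_OOB_TCP_CLOSED == peer->state) {
        retry(peer);    /* arguments are a guess */
    }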

another possible fix is to add a new state to the sequence :
orted 3 moves peer->state to MCA_OOB_TCP_CONNECTED only when it receives an
ack from orted 2 (and not when it sends its own ack to orted 2)
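
i.e. something along these lines (sketch only ; MCA_OOB_TCP_ACK_SENT is a name
i just made up, it does not exist today) :

    /* in the path that accepts the connection and writes our own ack
     * (today this is where the state is set to MCA_OOB_TCP_CONNECTED) */
    peer->state = MCA_OOB_TCP_ACK_SENT;        /* hypothetical new state */

    /* and only in the recv path, once the ack from the peer has actually
     * been read back, declare the connection established */
    if (MCA_OOB_TCP_ACK_SENT == peer->state) {
        peer->state = MCA_OOB_TCP_CONNECTED;
    }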

or something i did not think of yet ...

Cheers,

Gilles

On Fri, Sep 19, 2014 at 8:06 PM, Gilles Gouaillardet <
gilles.gouaillar...@iferc.org> wrote:

> Ralph,
>
> i found another race condition.
> in a very specific scenario, vpid 3 is in the MCA_OOB_TCP_CLOSED state,
> and processes data from the socket received from vpid 2.
> vpid 3 is in the MCA_OOB_TCP_CLOSED state because vpid 2 called retry()
> and closed both of its sockets to vpid 3.
>
> vpid 3 reads the ack data that was sent to the socket (ok) and then ends
> up calling tcp_peer_send_blocking
>
> Function
> main (orted.c:62)
>   orte_daemon (orted_main.c:828)
>     opal_libevent2021_event_base_loop (event.c:1645)
>       event_process_active (event.c:1437)
>         event_process_active_single_queue (event.c:1367)
>           recv_handler (oob_tcp.c:599)
>             mca_oob_tcp_peer_accept (oob_tcp_connection.c:1071)
>               tcp_peer_send_connect_ack (oob_tcp_connection.c:384)
>                 tcp_peer_send_blocking (oob_tcp_connection.c:525)
>
>
> though the socket (fd 17) in my case has been closed by the peer, and is
> hence reported in the CLOSE_WAIT state by lsof,
> send(17, ...) succeeds (!!!)
>
> i thought the root cause was that we previously set the O_NONBLOCK flag on
> this socket,
> so i explicitly cleared this flag (which was not set anyway...) before
> invoking mca_oob_tcp_peer_accept,
> but i got the very same behaviour :-(
>
> could you please advise :
> - should the send fail because the socket is in the CLOSE_WAIT state ?
> - if a success is not a bad behaviour, does this mean another step
> should be added to the oob/tcp "handshake" ?
> - or could this mean that when the peer state was moved from
> MCA_OOB_TCP_CONNECT_ACK to MCA_OOB_TCP_CLOSED,
> retry() should have been invoked ?
>
> Cheers,
>
> Gilles
>
> On 2014/09/18 17:02, Ralph Castain wrote:
> > The patch looks fine to me - please go ahead and apply it. Thanks!
> >
> > On Sep 17, 2014, at 11:35 PM, Gilles Gouaillardet <
> gilles.gouaillar...@iferc.org> wrote:
> >
> >> Ralph,
> >>
> >> yes and no ...
> >>
> >> mpi hello world with four nodes can be used to reproduce the issue,
> >>
> >>
> >> you can increase the likelihood of producing the race condition by
> hacking
> >> ./opal/mca/event/libevent2021/libevent/poll.c
> >> and replace
> >>         i = random() % nfds;
> >> with
> >>         if (nfds < 2) {
> >>             i = 0;
> >>         } else {
> >>             i = nfds - 2;
> >>         }
> >>
> >> but since this is really a race condition, all i could do is show you
> >> how to use a debugger in order to force it
> >>
> >>
> >> here is what really happens :
> >> - thanks to your patch, when vpid 2 cannot read the acknowledgment, this
> >> is no longer a fatal error.
> >> - that being said, peer->recv_event is not removed from libevent
> >> - later, send_event will be added to libevent
> >> - and then peer->recv_event will be added to libevent again (it was never
> >> removed)
> >> /* this is clearly not supported, and the interesting behaviour is that
> >> peer->send_event will be kicked out of libevent (!) */
> >>
> >> The attached patch fixes this race condition, could you please review
> it ?
> >>
> >> Cheers,
> >>
> >> Gilles
> >>
> >> On 2014/09/17 22:17, Ralph Castain wrote:
> >>> Do you have a reproducer you can share for testing this? I'm unable to
> get it to happen on my machine, but maybe you have a test code that
> triggers it so I can continue debugging
> >>>
> >>> Ralph
> >>>
> >>> On Sep 17, 2014, at 4:07 AM, Gilles Gouaillardet <
> gilles.gouaillar...@iferc.org> wrote:
> >>>
> >>>> Thanks Ralph,
> >>>>
> >>>> this is much better but there is still a bug :
> >>>> with the very same scenario i described earlier, vpid 2 does not send
> >>>> its message to vpid 3 once the connection has been established.
> >>>>
> >>>> i tried to debug it but i have been pretty unsuccessful so far ..
> >>>>
> >>>> vpid 2 calls tcp_peer_connected and executes the following snippet :
> >>>>
> >>>> if (NULL != peer->send_msg && !peer->send_ev_active) {
> >>>>     opal_event_add(&peer->send_event, 0);
> >>>>     peer->send_ev_active = true;
> >>>> }
> >>>>
> >>>> but when evmap_io_active is invoked later, the following part :
> >>>>
> >>>> TAILQ_FOREACH(ev, &ctx->events, ev_io_next) {
> >>>>     if (ev->ev_events & events)
> >>>>         event_active_nolock(ev, ev->ev_events & events, 1);
> >>>> }
> >>>>
> >>>> finds only one ev (mca_oob_tcp_recv_handler and *no*
> >>>> mca_oob_tcp_send_handler)
> >>>>
> >>>> i will resume my investigations tomorrow
> >>>>
> >>>> Cheers,
> >>>>
> >>>> Gilles
> >>>>
> >>>> On 2014/09/17 4:01, Ralph Castain wrote:
> >>>>> Hi Gilles
> >>>>>
> >>>>> I took a crack at solving this in r32744 - CMRd it for 1.8.3 and
> assigned it to you for review. Give it a try and let me know if I
> (hopefully) got it.
> >>>>>
> >>>>> The approach we have used in the past is to have both sides close
> their connections, and then have the higher vpid retry while the lower one
> waits. The logic for that was still in place, but it looks like you are
> hitting a different code path, and I found another potential one as well.
> So I think I plugged the holes, but will wait to hear if you confirm.
> >>>>>
> >>>>> Thanks
> >>>>> Ralph
> >>>>>
> >>>>> On Sep 16, 2014, at 6:27 AM, Gilles Gouaillardet <
> gilles.gouaillar...@gmail.com> wrote:
> >>>>>
> >>>>>> Ralph,
> >>>>>>
> >>>>>> here is the full description of a race condition in oob/tcp i very
> briefly mentioned in a previous post :
> >>>>>>
> >>>>>> the race condition can occur when two orteds that are not yet
> connected try to send a message to each other for the first time, at the
> same time.
> >>>>>>
> >>>>>> that can occur when running mpi helloworld on 4 nodes with the
> grpcomm/rcd module.
> >>>>>>
> >>>>>> here is a scenario in which the race condition occurs :
> >>>>>>
> >>>>>> orted vpid 2 and 3 enter the allgather
> >>>>>> /* they are not yet oob/tcp connected */
> >>>>>> and they call orte.send_buffer_nb to each other.
> >>>>>> from a libevent point of view, vpid 2 and 3 will call
> mca_oob_tcp_peer_try_connect
> >>>>>>
> >>>>>> vpid 2 calls mca_oob_tcp_send_handler
> >>>>>>
> >>>>>> vpid 3 calls connection_event_handler
> >>>>>>
> >>>>>> depending on the value returned by random() in libevent, vpid 3 will
> >>>>>> either call mca_oob_tcp_send_handler (likely) or recv_handler
> (unlikely)
> >>>>>> if vpid 3 calls recv_handler, it will close the two sockets to vpid
> 2
> >>>>>>
> >>>>>> then vpid 2 will call mca_oob_tcp_recv_handler
> >>>>>> (peer->state is MCA_OOB_TCP_CONNECT_ACK)
> >>>>>> that will invoke mca_oob_tcp_recv_connect_ack
> >>>>>> tcp_peer_recv_blocking will fail
> >>>>>> /* zero bytes are recv'ed since vpid 3 previously closed the socket
> before writing a header */
> >>>>>> and this is handled by mca_oob_tcp_recv_handler as a fatal error
> >>>>>> /* ORTE_FORCED_TERMINATE(1) */
> >>>>>>
> >>>>>> could you please have a look at it ?
> >>>>>>
> >>>>>> if you are too busy, could you please advise where this scenario
> should be handled differently ?
> >>>>>> - should vpid 3 keep one socket instead of closing both and
> retrying ?
> >>>>>> - should vpid 2 handle the failure as a non fatal error ?
> >>>>>>
> >>>>>> Cheers,
> >>>>>>
> >>>>>> Gilles
> >> <oob_tcp.patch>
>
>
