Or copy the handshake protocol design of the TCP BTL...

  George.

On Fri, Sep 19, 2014 at 6:23 PM, Ralph Castain <r...@open-mpi.org> wrote:

> You know, I'm almost beginning to dread opening my email in the morning
> for fear of seeing another "race condition" subject line! :-)
>
> I think the correct answer here is that orted 3 should be entering "retry"
> when it sees the peer state change to "closed", regardless of what happened
> in the send. I'll take a look at it, but since you have a reliable
> reproducer and I don't, please feel free to poke at it yourself.
>
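> A rough standalone sketch of the handling suggested here, using made-up
> stand-in types rather than the real oob/tcp declarations (so this is only
> an illustration of the idea, not a patch): when the connect ack cannot be
> read and the peer has already moved to the "closed" state, fall back to
> the retry path instead of treating it as a dead end.
>
> #include <stdio.h>
>
> typedef enum {
>     PEER_CONNECTING,
>     PEER_CONNECT_ACK,
>     PEER_CONNECTED,
>     PEER_CLOSED
> } peer_state_t;            /* stand-ins for the MCA_OOB_TCP_* states */
>
> typedef struct {
>     peer_state_t state;
>     int fd;
> } peer_t;
>
> /* stand-in for the real retry path: drop the socket and reschedule */
> static void peer_retry(peer_t *peer)
> {
>     printf("retrying connection to peer (old fd=%d)\n", peer->fd);
>     peer->fd = -1;
>     peer->state = PEER_CONNECTING;
> }
>
> /* called when reading the connect ack fails */
> static void handle_failed_connect_ack(peer_t *peer)
> {
>     if (PEER_CLOSED == peer->state) {
>         /* the peer dropped the connection during the handshake:
>          * not fatal, go back to the retry path */
>         peer_retry(peer);
>     } else {
>         fprintf(stderr, "unexpected peer state %d\n", (int)peer->state);
>     }
> }
>
> int main(void)
> {
>     peer_t peer = { .state = PEER_CLOSED, .fd = 15 };
>     handle_failed_connect_ack(&peer);   /* takes the retry path */
>     return 0;
> }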
>
> On Sep 19, 2014, at 8:30 AM, Gilles Gouaillardet <
> gilles.gouaillar...@gmail.com> wrote:
>
> Ralph,
>
> Let me detail the new race condition.
>
> orted 2 and orted 3 are not connected to each other, and each sends a
> message to the other.
>
> orted 2 and orted 3 both call send_process (which sets peer->state =
> MCA_OOB_TCP_CONNECTING), and they both end up calling
> mca_oob_tcp_peer_try_connect.
>
> Now suppose orted 3 calls mca_oob_tcp_send_handler:
> peer->state = MCA_OOB_TCP_CONNECT_ACK
> peer->fd = 15 /* in my environment */
>
> Now orted 2 accepts the connection from orted 3 (fd=17 in my environment)
> and invokes recv_handler, which calls mca_oob_tcp_peer_recv_connect_ack,
> which in turn calls retry() because peer->state is MCA_OOB_TCP_CONNECTING.
> retry() closes fd 15 and fd 17.
>
> Now orted 3 decides to handle fd 15 (this depends on the random() call in
> libevent ...):
> mca_oob_tcp_recv_handler calls mca_oob_tcp_peer_recv_connect_ack,
> which calls tcp_peer_recv_blocking.
> recv(15, ...) returns 0 /* orted 2 never invoked mca_oob_tcp_send_handler,
> and its retry() already closed the other end of this socket */
> Then mca_oob_tcp_peer_close sets peer->state = MCA_OOB_TCP_CLOSED,
> and mca_oob_tcp_peer_recv_connect_ack does *not* invoke retry().
>
> Now orted 3 accepts the connection from orted 2 (fd=17 in my environment).
> It successfully reads the connection request, writes its ack, and
> sets peer->state = MCA_OOB_TCP_CONNECTED.
>
> /* in my previous email, I was surprised the write did not fail.
> On second thought, this is not so surprising: depending on the timing,
> orted 3 might not yet be aware that orted 2 closed the other end of the
> socket. Bottom line: we cannot reliably expect the write to fail in all
> cases */
>
> The following write to fd=17 will fail, and because the state is
> MCA_OOB_TCP_CONNECTED, the result is either a hang (the message from the
> RML is never sent) or an abort of orted.
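>
> As a standalone illustration (plain POSIX sockets, not Open MPI code) of
> why recv(15, ...) returns 0 above: when the remote side has closed its end
> without sending anything, recv() reports end-of-stream as 0 bytes rather
> than as an error, so the return value alone does not distinguish a clean
> shutdown from a half-finished handshake.
>
> #include <stdio.h>
> #include <sys/socket.h>
> #include <unistd.h>
>
> int main(void)
> {
>     int sv[2];
>     char buf[16];
>
>     /* a connected socket pair stands in for the orted 2 <-> orted 3 link */
>     if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) != 0) {
>         perror("socketpair");
>         return 1;
>     }
>
>     close(sv[1]);   /* "orted 2" closes its end without writing anything */
>
>     ssize_t n = recv(sv[0], buf, sizeof(buf), 0);
>     printf("recv returned %zd\n", n);   /* prints 0: orderly close, no error */
>
>     close(sv[0]);
>     return 0;
> }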
>
>
> A possible fix is to ensure retry() is invoked on orted 3 by
> mca_oob_tcp_peer_recv_connect_ack when peer->state is MCA_OOB_TCP_CLOSED.
>
> Another possible fix is to add a new state to the sequence:
> orted 3 would move peer->state to MCA_OOB_TCP_CONNECTED only when it
> receives an ack from orted 2 (and not when it sends its own ack to orted 2).
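>
> A hypothetical sketch of that second option (invented state names, not the
> real oob/tcp states): after sending its own ack the accepting side would
> only reach an intermediate state, and would be considered connected only
> once the peer's ack has actually been received.
>
> #include <stdio.h>
>
> typedef enum {
>     HS_CONNECTING,   /* connect issued, nothing exchanged yet         */
>     HS_ACK_SENT,     /* our ack is on the wire, peer's ack still due  */
>     HS_CONNECTED     /* peer's ack received: safe to send queued msgs */
> } handshake_state_t;
>
> /* after we write our ack: today the code jumps straight to "connected";
>  * with the extra state it would only reach the intermediate step */
> static handshake_state_t on_ack_sent(void)
> {
>     return HS_ACK_SENT;
> }
>
> /* only the peer's ack completes the handshake */
> static handshake_state_t on_ack_received(void)
> {
>     return HS_CONNECTED;
> }
>
> int main(void)
> {
>     handshake_state_t s = HS_CONNECTING;
>     s = on_ack_sent();          /* not yet connected */
>     s = on_ack_received();      /* now connected */
>     printf("%s\n", HS_CONNECTED == s ? "connected" : "not connected");
>     return 0;
> }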
>
> Or something I have not thought of yet ...
>
> Cheers,
>
> Gilles
>
> On Fri, Sep 19, 2014 at 8:06 PM, Gilles Gouaillardet <
> gilles.gouaillar...@iferc.org> wrote:
>
>> Ralph,
>>
>> I found another race condition.
>> In a very specific scenario, vpid 3 is in the MCA_OOB_TCP_CLOSED state
>> and processes data from the socket it accepted from vpid 2.
>> vpid 3 is in the MCA_OOB_TCP_CLOSED state because vpid 2 called retry()
>> and closed both of its sockets to vpid 3.
>>
>> vpid 3 reads the ack data that was sent to the socket (OK) and then ends
>> up calling tcp_peer_send_blocking:
>>
>> Function
>> main (orted.c:62)
>>   orte_daemon (orted_main.c:828)
>>     opal_libevent2021_event_base_loop (event.c:1645)
>>       event_process_active (event.c:1437)
>>         event_process_active_single_queue (event.c:1367)
>>           recv_handler (oob_tcp.c:599)
>>             mca_oob_tcp_peer_accept (oob_tcp_connection.c:1071)
>>               tcp_peer_send_connect_ack (oob_tcp_connection.c:384)
>>                 tcp_peer_send_blocking (oob_tcp_connection.c:525)
>>
>>
>> Though the socket (fd 17 in my case) has been closed by the peer, and is
>> hence reported in the CLOSE_WAIT state by lsof,
>> send(17, ...) succeeds (!!!)
>>
>> I thought the root cause was that we had previously set the O_NONBLOCK
>> flag on this socket, so I explicitly cleared the flag (which was not set
>> anyway ...) before invoking mca_oob_tcp_peer_accept,
>> but I got the very same behaviour :-(
>>
>> Could you please advise:
>> - should the send fail because the socket is in the CLOSE_WAIT state?
>>   (see the standalone sketch below)
>> - if a success is not bad behaviour, does this mean another step
>> should be added to the oob/tcp "handshake"?
>> - or could this mean that when the peer state was moved from
>> MCA_OOB_TCP_CONNECT_ACK to MCA_OOB_TCP_CLOSED,
>> retry() should have been invoked?
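>>
>> For what it is worth, here is a standalone sketch (plain POSIX sockets on
>> Linux, not Open MPI code) of the behaviour behind the first question: after
>> the remote side closes, the local socket sits in CLOSE_WAIT and the *first*
>> send() typically still succeeds, because the data is merely queued locally;
>> the failure only surfaces on a later send(), once the peer's RST has come
>> back. So a successful send() cannot be relied on to detect the close.
>>
>> #include <arpa/inet.h>
>> #include <errno.h>
>> #include <netinet/in.h>
>> #include <stdio.h>
>> #include <string.h>
>> #include <sys/socket.h>
>> #include <unistd.h>
>>
>> int main(void)
>> {
>>     int listener = socket(AF_INET, SOCK_STREAM, 0);
>>     struct sockaddr_in addr = { .sin_family = AF_INET,
>>                                 .sin_addr.s_addr = htonl(INADDR_LOOPBACK) };
>>     socklen_t len = sizeof(addr);
>>
>>     bind(listener, (struct sockaddr *)&addr, sizeof(addr));
>>     listen(listener, 1);
>>     getsockname(listener, (struct sockaddr *)&addr, &len);  /* learn the port */
>>
>>     int client = socket(AF_INET, SOCK_STREAM, 0);
>>     connect(client, (struct sockaddr *)&addr, sizeof(addr));
>>     int accepted = accept(listener, NULL, NULL);
>>
>>     close(accepted);   /* the peer goes away: client moves to CLOSE_WAIT */
>>     sleep(1);          /* let the FIN reach the client side */
>>
>>     ssize_t n = send(client, "x", 1, MSG_NOSIGNAL);
>>     printf("first send:  %zd (%s)\n", n, n < 0 ? strerror(errno) : "success");
>>
>>     sleep(1);          /* give the peer's RST time to come back */
>>     n = send(client, "x", 1, MSG_NOSIGNAL);
>>     printf("second send: %zd (%s)\n", n, n < 0 ? strerror(errno) : "success");
>>
>>     close(client);
>>     close(listener);
>>     return 0;
>> }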
>>
>> Cheers,
>>
>> Gilles
>>
>> On 2014/09/18 17:02, Ralph Castain wrote:
>> > The patch looks fine to me - please go ahead and apply it. Thanks!
>> >
>> > On Sep 17, 2014, at 11:35 PM, Gilles Gouaillardet <
>> gilles.gouaillar...@iferc.org> wrote:
>> >
>> >> Ralph,
>> >>
>> >> yes and no ...
>> >>
>> >> An MPI hello world with four nodes can be used to reproduce the issue.
>> >>
>> >>
>> >> You can increase the likelihood of producing the race condition by
>> >> hacking ./opal/mca/event/libevent2021/libevent/poll.c
>> >> and replacing
>> >>         i = random() % nfds;
>> >> with
>> >>         if (nfds < 2) {
>> >>             i = 0;
>> >>         } else {
>> >>             i = nfds - 2;
>> >>         }
>> >>
>> >> But since this is really a race condition, all I could do is show you
>> >> how to use a debugger in order to force it.
>> >>
>> >>
>> >> Here is what really happens:
>> >> - thanks to your patch, when vpid 2 cannot read the acknowledgment,
>> >> this is no longer a fatal error.
>> >> - that being said, peer->recv_event is not removed from libevent
>> >> - later, send_event is added to libevent
>> >> - and then peer->recv_event is added to libevent again
>> >> /* this is clearly not supported, and the interesting behaviour is that
>> >> peer->send_event gets kicked out of libevent (!) */
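>> >>
>> >> (Purely as an illustration of the kind of guard that seems to be needed
>> >> here -- this is not the attached patch, and it assumes a recv_ev_active
>> >> flag analogous to the send_ev_active flag used elsewhere in oob/tcp: an
>> >> event should be deleted, or known not to be registered, before it is
>> >> added again.)
>> >>
>> >>     if (peer->recv_ev_active) {
>> >>         /* drop the stale registration before re-adding the event */
>> >>         opal_event_del(&peer->recv_event);
>> >>         peer->recv_ev_active = false;
>> >>     }
>> >>     opal_event_add(&peer->recv_event, 0);
>> >>     peer->recv_ev_active = true;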
>> >>
>> >> The attached patch fixes this race condition; could you please review it?
>> >>
>> >> Cheers,
>> >>
>> >> Gilles
>> >>
>> >> On 2014/09/17 22:17, Ralph Castain wrote:
>> >>> Do you have a reproducer you can share for testing this? I'm unable
>> to get it to happen on my machine, but maybe you have a test code that
>> triggers it so I can continue debugging
>> >>>
>> >>> Ralph
>> >>>
>> >>> On Sep 17, 2014, at 4:07 AM, Gilles Gouaillardet <
>> gilles.gouaillar...@iferc.org> wrote:
>> >>>
>> >>>> Thanks Ralph,
>> >>>>
>> >>>> This is much better, but there is still a bug:
>> >>>> with the very same scenario I described earlier, vpid 2 does not send
>> >>>> its message to vpid 3 once the connection has been established.
>> >>>>
>> >>>> I tried to debug it, but I have been pretty unsuccessful so far ...
>> >>>>
>> >>>> vpid 2 calls tcp_peer_connected and executes the following snippet:
>> >>>>
>> >>>> if (NULL != peer->send_msg && !peer->send_ev_active) {
>> >>>>     opal_event_add(&peer->send_event, 0);
>> >>>>     peer->send_ev_active = true;
>> >>>> }
>> >>>>
>> >>>> But when evmap_io_active is invoked later, the following part:
>> >>>>
>> >>>>   TAILQ_FOREACH(ev, &ctx->events, ev_io_next) {
>> >>>>       if (ev->ev_events & events)
>> >>>>           event_active_nolock(ev, ev->ev_events & events, 1);
>> >>>>   }
>> >>>>
>> >>>> finds only one ev (the one for mca_oob_tcp_recv_handler, and *no*
>> >>>> mca_oob_tcp_send_handler).
>> >>>>
>> >>>> I will resume my investigations tomorrow.
>> >>>>
>> >>>> Cheers,
>> >>>>
>> >>>> Gilles
>> >>>>
>> >>>> On 2014/09/17 4:01, Ralph Castain wrote:
>> >>>>> Hi Gilles
>> >>>>>
>> >>>>> I took a crack at solving this in r32744 - CMRd it for 1.8.3 and
>> >>>>> assigned it to you for review. Give it a try and let me know if I
>> >>>>> (hopefully) got it.
>> >>>>>
>> >>>>> The approach we have used in the past is to have both sides close
>> >>>>> their connections, and then have the higher vpid retry while the
>> >>>>> lower one waits. The logic for that was still in place, but it looks
>> >>>>> like you are hitting a different code path, and I found another
>> >>>>> potential one as well. So I think I plugged the holes, but will wait
>> >>>>> to hear if you confirm.
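>> >>>>>
>> >>>>> A tiny standalone model of the tie-break described above (hypothetical
>> >>>>> names, not the oob/tcp source): when both sides connect at the same
>> >>>>> time and both drop their sockets, the peer with the higher vpid retries
>> >>>>> the connection while the lower one simply waits to be contacted.
>> >>>>>
>> >>>>> #include <stdbool.h>
>> >>>>> #include <stdio.h>
>> >>>>>
>> >>>>> /* higher vpid retries; lower vpid waits for the incoming connection */
>> >>>>> static bool should_retry_connect(unsigned my_vpid, unsigned peer_vpid)
>> >>>>> {
>> >>>>>     return my_vpid > peer_vpid;
>> >>>>> }
>> >>>>>
>> >>>>> int main(void)
>> >>>>> {
>> >>>>>     printf("vpid 3 vs vpid 2: %s\n",
>> >>>>>            should_retry_connect(3, 2) ? "retry" : "wait for the peer");
>> >>>>>     printf("vpid 2 vs vpid 3: %s\n",
>> >>>>>            should_retry_connect(2, 3) ? "retry" : "wait for the peer");
>> >>>>>     return 0;
>> >>>>> }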
>> >>>>>
>> >>>>> Thanks
>> >>>>> Ralph
>> >>>>>
>> >>>>> On Sep 16, 2014, at 6:27 AM, Gilles Gouaillardet <
>> gilles.gouaillar...@gmail.com> wrote:
>> >>>>>
>> >>>>>> Ralph,
>> >>>>>>
>> >>>>>> Here is the full description of a race condition in oob/tcp that I
>> >>>>>> very briefly mentioned in a previous post:
>> >>>>>>
>> >>>>>> The race condition can occur when two orteds that are not yet
>> >>>>>> connected try to send a message to each other for the first time,
>> >>>>>> at the same time.
>> >>>>>>
>> >>>>>> That can occur when running an MPI hello world on 4 nodes with the
>> >>>>>> grpcomm/rcd module.
>> >>>>>>
>> >>>>>> Here is a scenario in which the race condition occurs:
>> >>>>>>
>> >>>>>> orted vpid 2 and vpid 3 enter the allgather
>> >>>>>> /* they are not yet oob/tcp connected */
>> >>>>>> and they each call orte_rml.send_buffer_nb to the other.
>> >>>>>> From a libevent point of view, vpid 2 and vpid 3 will both call
>> >>>>>> mca_oob_tcp_peer_try_connect.
>> >>>>>>
>> >>>>>> vpid 2 calls mca_oob_tcp_send_handler.
>> >>>>>>
>> >>>>>> vpid 3 calls connection_event_handler.
>> >>>>>>
>> >>>>>> Depending on the value returned by random() in libevent, vpid 3
>> >>>>>> will either call mca_oob_tcp_send_handler (likely) or recv_handler
>> >>>>>> (unlikely).
>> >>>>>> If vpid 3 calls recv_handler, it will close the two sockets to
>> >>>>>> vpid 2.
>> >>>>>>
>> >>>>>> Then vpid 2 will call mca_oob_tcp_recv_handler
>> >>>>>> (peer->state is MCA_OOB_TCP_CONNECT_ACK),
>> >>>>>> which will invoke mca_oob_tcp_peer_recv_connect_ack.
>> >>>>>> tcp_peer_recv_blocking will fail
>> >>>>>> /* zero bytes are recv'ed since vpid 3 previously closed the
>> >>>>>> socket before writing a header */
>> >>>>>> and this is handled by mca_oob_tcp_recv_handler as a fatal error
>> >>>>>> /* ORTE_FORCED_TERMINATE(1) */
>> >>>>>>
>> >>>>>> Could you please have a look at it?
>> >>>>>>
>> >>>>>> If you are too busy, could you please advise where this scenario
>> >>>>>> should be handled differently?
>> >>>>>> - should vpid 3 keep one socket instead of closing both and
>> >>>>>> retrying?
>> >>>>>> - should vpid 2 handle the failure as a non-fatal error?
>> >>>>>>
>> >>>>>> Cheers,
>> >>>>>>
>> >>>>>> Gilles
>> >> <oob_tcp.patch>
>>
>>
>
>
>
>
>
