The patch looks fine to me - please go ahead and apply it. Thanks!

On Sep 17, 2014, at 11:35 PM, Gilles Gouaillardet <gilles.gouaillar...@iferc.org> wrote:
> Ralph,
>
> yes and no ...
>
> an mpi hello world with four nodes can be used to reproduce the issue.
>
> you can increase the likelihood of hitting the race condition by hacking
> ./opal/mca/event/libevent2021/libevent/poll.c
> and replacing
>     i = random() % nfds;
> with
>     if (nfds < 2) {
>         i = 0;
>     } else {
>         i = nfds - 2;
>     }
>
> but since this really is a race condition, all i could do is show you
> how to use a debugger in order to force it.
>
> here is what really happens :
> - thanks to your patch, when vpid 2 cannot read the acknowledgment, this is
>   no longer a fatal error.
> - that being said, peer->recv_event is not removed from libevent.
> - later, send_event is added to libevent,
> - and then peer->recv_event is added to libevent again.
>   /* this is clearly not supported, and the interesting behaviour is that
>      peer->send_event gets kicked out of libevent (!) */
>
> The attached patch fixes this race condition, could you please review it ?
>
> Cheers,
>
> Gilles
>
> On 2014/09/17 22:17, Ralph Castain wrote:
>> Do you have a reproducer you can share for testing this? I'm unable to get
>> it to happen on my machine, but maybe you have a test code that triggers it
>> so I can continue debugging.
>>
>> Ralph
>>
>> On Sep 17, 2014, at 4:07 AM, Gilles Gouaillardet
>> <gilles.gouaillar...@iferc.org> wrote:
>>
>>> Thanks Ralph,
>>>
>>> this is much better, but there is still a bug :
>>> with the very same scenario i described earlier, vpid 2 does not send
>>> its message to vpid 3 once the connection has been established.
>>>
>>> i tried to debug it but i have been pretty unsuccessful so far ..
>>>
>>> vpid 2 calls tcp_peer_connected and executes the following snippet :
>>>
>>>     if (NULL != peer->send_msg && !peer->send_ev_active) {
>>>         opal_event_add(&peer->send_event, 0);
>>>         peer->send_ev_active = true;
>>>     }
>>>
>>> but when evmap_io_active is invoked later, the following part :
>>>
>>>     TAILQ_FOREACH(ev, &ctx->events, ev_io_next) {
>>>         if (ev->ev_events & events)
>>>             event_active_nolock(ev, ev->ev_events & events, 1);
>>>     }
>>>
>>> finds only one ev (mca_oob_tcp_recv_handler and *no*
>>> mca_oob_tcp_send_handler).
>>>
>>> i will resume my investigations tomorrow.
>>>
>>> Cheers,
>>>
>>> Gilles
>>>
>>> On 2014/09/17 4:01, Ralph Castain wrote:
>>>> Hi Gilles
>>>>
>>>> I took a crack at solving this in r32744 - CMRd it for 1.8.3 and assigned
>>>> it to you for review. Give it a try and let me know if I (hopefully) got
>>>> it.
>>>>
>>>> The approach we have used in the past is to have both sides close their
>>>> connections, and then have the higher vpid retry while the lower one
>>>> waits. The logic for that was still in place, but it looks like you are
>>>> hitting a different code path, and I found another potential one as well.
>>>> So I think I plugged the holes, but will wait to hear if you confirm.
>>>>
>>>> Thanks
>>>> Ralph
>>>>
>>>> On Sep 16, 2014, at 6:27 AM, Gilles Gouaillardet
>>>> <gilles.gouaillar...@gmail.com> wrote:
>>>>
>>>>> Ralph,
>>>>>
>>>>> here is the full description of a race condition in oob/tcp that i very
>>>>> briefly mentioned in a previous post :
>>>>>
>>>>> the race condition can occur when two not-yet-connected orteds try to
>>>>> send a message to each other for the first time, at the same time.
>>>>>
>>>>> this can occur when running an mpi hello world on 4 nodes with the
>>>>> grpcomm/rcd module.
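For reference, the reproducer does not need to be more than a trivial MPI hello world run on four nodes; a minimal sketch follows (the exact mpirun invocation and the --mca grpcomm rcd selection are assumptions about how the rcd module is chosen, not taken from this thread) :

    #include <stdio.h>
    #include <mpi.h>

    /* trivial hello world: what exercises the race is the daemon-level
     * allgather during startup, not the MPI code itself */
    int main(int argc, char *argv[])
    {
        int rank, size;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);
        printf("hello from rank %d of %d\n", rank, size);
        MPI_Finalize();
        return 0;
    }

    /* assumed invocation across four nodes :
     *   mpirun -np 4 --host n0,n1,n2,n3 --mca grpcomm rcd ./hello
     */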
>>>>>
>>>>> here is a scenario in which the race condition occurs :
>>>>>
>>>>> orted vpid 2 and 3 enter the allgather
>>>>> /* they are not yet oob/tcp connected */
>>>>> and they call orte.send_buffer_nb to each other.
>>>>> from a libevent point of view, vpid 2 and 3 will call
>>>>> mca_oob_tcp_peer_try_connect.
>>>>>
>>>>> vpid 2 calls mca_oob_tcp_send_handler.
>>>>>
>>>>> vpid 3 calls connection_event_handler.
>>>>>
>>>>> depending on the value returned by random() in libevent, vpid 3 will
>>>>> either call mca_oob_tcp_send_handler (likely) or recv_handler (unlikely).
>>>>> if vpid 3 calls recv_handler, it will close the two sockets to vpid 2.
>>>>>
>>>>> then vpid 2 will call mca_oob_tcp_recv_handler
>>>>> (peer->state is MCA_OOB_TCP_CONNECT_ACK),
>>>>> which will invoke mca_oob_tcp_recv_connect_ack.
>>>>> tcp_peer_recv_blocking will fail
>>>>> /* zero bytes are recv'ed since vpid 3 previously closed the socket
>>>>> before writing a header */
>>>>> and this is handled by mca_oob_tcp_recv_handler as a fatal error
>>>>> /* ORTE_FORCED_TERMINATE(1) */
>>>>>
>>>>> could you please have a look at it ?
>>>>>
>>>>> if you are too busy, could you please advise where this scenario should
>>>>> be handled differently ?
>>>>> - should vpid 3 keep one socket instead of closing both and retrying ?
>>>>> - should vpid 2 handle the failure as a non-fatal error ?
>>>>>
>>>>> Cheers,
>>>>>
>>>>> Gilles
>
> <oob_tcp.patch>
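As a side note on the non-fatal-error question raised above: what the handshake code has to tell apart is a genuine socket error versus a peer that simply closed the connection before writing its header, which shows up as a zero-byte recv(). A standalone sketch of that read pattern (hypothetical helper on a blocking socket, not the actual mca_oob_tcp / tcp_peer_recv_blocking code) :

    #include <errno.h>
    #include <sys/types.h>
    #include <sys/socket.h>

    enum read_status { READ_OK, READ_PEER_CLOSED, READ_ERROR };

    /* read exactly len bytes from sd, distinguishing "peer closed the
     * connection" (recv returns 0) from a genuine socket error */
    static enum read_status read_exact(int sd, void *buf, size_t len)
    {
        size_t got = 0;
        while (got < len) {
            ssize_t n = recv(sd, (char *)buf + got, len - got, 0);
            if (n > 0) {
                got += (size_t)n;
            } else if (n == 0) {
                /* orderly shutdown by the peer: in the scenario above this
                 * would mean "drop the connection and retry", not a fatal error */
                return READ_PEER_CLOSED;
            } else if (errno == EINTR || errno == EAGAIN || errno == EWOULDBLOCK) {
                continue;   /* transient, retry the recv */
            } else {
                return READ_ERROR;  /* genuine socket error */
            }
        }
        return READ_OK;
    }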