Thanks Ralph,
this is much better, but there is still a bug:
with the very same scenario I described earlier, vpid 2 does not send
its message to vpid 3 once the connection has been established.
I tried to debug it, but I have been pretty unsuccessful so far.
vpid 2 calls tcp_peer_connected and executes the following snippet:
    if (NULL != peer->send_msg && !peer->send_ev_active) {
        opal_event_add(&peer->send_event, 0);
        peer->send_ev_active = true;
    }
but when evmap_io_active is invoked later, the following part:
    TAILQ_FOREACH(ev, &ctx->events, ev_io_next) {
        if (ev->ev_events & events)
            event_active_nolock(ev, ev->ev_events & events, 1);
    }
finds only one ev (the one for mca_oob_tcp_recv_handler, and *no* event for
mca_oob_tcp_send_handler).
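
For reference, a small diagnostic along these lines could confirm whether the
send event is really registered, and against which fd (this assumes the
opal_event wrappers are thin aliases over libevent 2.x, and that peer->sd is
the connected socket; both are assumptions on my side, not verified yet):

    #include <stdio.h>
    #include <event2/event.h>

    /* print whether an event is pending for EV_WRITE, and which fd it was
       created against; meant to be called right after the opal_event_add()
       in the snippet above, e.g. dump_send_event(&peer->send_event, peer->sd) */
    static void dump_send_event(struct event *send_event, int expected_fd)
    {
        int pending = event_pending(send_event, EV_WRITE, NULL);
        int fd = (int) event_get_fd(send_event);
        fprintf(stderr, "send_event: EV_WRITE pending=%d fd=%d (expected %d)\n",
                pending, fd, expected_fd);
    }
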
I will resume my investigation tomorrow.
Cheers,
Gilles
On 2014/09/17 4:01, Ralph Castain wrote:
> Hi Gilles
>
> I took a crack at solving this in r32744 - CMRd it for 1.8.3 and assigned it
> to you for review. Give it a try and let me know if I (hopefully) got it.
>
> The approach we have used in the past is to have both sides close their
> connections, and then have the higher vpid retry while the lower one waits.
> The logic for that was still in place, but it looks like you are hitting a
> different code path, and I found another potential one as well. So I think I
> plugged the holes, but will wait to hear if you confirm.
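>
> In case it helps, the tie-break itself is tiny; here is a standalone sketch of
> the idea only (the real code works on the oob/tcp peer structures and process
> names, not on bare vpid integers):
>
>     #include <stdbool.h>
>     #include <stdint.h>
>
>     /* when both sides of a simultaneous connect have closed their sockets,
>        exactly one of them should reconnect: the higher vpid retries while
>        the lower vpid simply waits for the incoming connection */
>     static bool should_retry_connect(uint32_t my_vpid, uint32_t peer_vpid)
>     {
>         return my_vpid > peer_vpid;
>     }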
>
> Thanks
> Ralph
>
> On Sep 16, 2014, at 6:27 AM, Gilles Gouaillardet
> <[email protected]> wrote:
>
>> Ralph,
>>
>> here is the full description of a race condition in oob/tcp I very briefly
>> mentioned in a previous post:
>>
>> the race condition can occur when two orteds that are not yet connected try
>> to send a message to each other for the first time, and at the same time.
>>
>> this can occur when running an MPI hello world on 4 nodes with the
>> grpcomm/rcd module.
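>>
>> for instance something along these lines (hostfile and binary names are just
>> placeholders for whatever you have at hand):
>>
>>     mpirun -np 4 -hostfile my_hosts --mca grpcomm rcd ./hello_world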
>>
>> here is a scenario in which the race condition occurs:
>>
>> orted vpid 2 and vpid 3 enter the allgather
>> /* they are not yet oob/tcp connected */
>> and they each call orte.send_buffer_nb to the other.
>> from a libevent point of view, vpid 2 and vpid 3 will both call
>> mca_oob_tcp_peer_try_connect
>>
>> vpid 2 calls mca_oob_tcp_send_handler
>>
>> vpid 3 calls connection_event_handler
>>
>> depending on the value returned by random() in libevent, vpid 3 will either
>> call mca_oob_tcp_send_handler (likely) or mca_oob_tcp_recv_handler (unlikely).
>> if vpid 3 calls mca_oob_tcp_recv_handler, it will close both sockets to vpid 2.
>>
>> then vpid 2 will call mca_oob_tcp_recv_handler
>> (peer->state is MCA_OOB_TCP_CONNECT_ACK),
>> which will invoke mca_oob_tcp_recv_connect_ack;
>> tcp_peer_recv_blocking will then fail
>> /* zero bytes are recv'ed, since vpid 3 previously closed the socket before
>> writing a header */
>> and this is handled by mca_oob_tcp_recv_handler as a fatal error
>> /* ORTE_FORCED_TERMINATE(1) */
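>>
>> just to rule out anything oob/tcp specific in the "zero bytes" part: this is
>> the normal behaviour of a socket whose peer closed it before writing anything,
>> as this tiny standalone program (plain POSIX sockets, nothing open-mpi
>> specific) shows:
>>
>>     #include <stdio.h>
>>     #include <sys/types.h>
>>     #include <sys/socket.h>
>>     #include <unistd.h>
>>
>>     int main(void)
>>     {
>>         int sv[2];
>>         char buf[16];
>>         if (0 != socketpair(AF_UNIX, SOCK_STREAM, 0, sv)) {
>>             return 1;
>>         }
>>         close(sv[1]);                      /* "vpid 3" closes before writing a header */
>>         ssize_t n = recv(sv[0], buf, sizeof(buf), 0);
>>         printf("recv returned %zd\n", n);  /* prints 0: orderly close, not an error */
>>         close(sv[0]);
>>         return 0;
>>     }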
>>
>> could you please have a look at it?
>>
>> if you are too busy, could you please advise where this scenario should be
>> handled differently?
>> - should vpid 3 keep one socket instead of closing both and retrying?
>> - should vpid 2 handle the failure as a non-fatal error? (see the sketch below)
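>>
>> to make the second option concrete, here is the behaviour I have in mind,
>> written as a standalone helper (the function name and the exact errno list
>> are mine, not the oob/tcp code):
>>
>>     #include <errno.h>
>>     #include <stdbool.h>
>>     #include <sys/types.h>
>>
>>     /* classify the result of the blocking recv of the connect ack header:
>>        a return of 0 means the peer closed the socket (the simultaneous
>>        connect case above) and should trigger a reconnect, not a job abort */
>>     static bool connect_ack_recv_is_fatal(ssize_t n)
>>     {
>>         if (n > 0)  return false;   /* got (part of) the header, keep going */
>>         if (0 == n) return false;   /* orderly close: retry the connection  */
>>         return !(EINTR == errno || EAGAIN == errno || EWOULDBLOCK == errno);
>>     }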
>>
>> Cheers,
>>
>> Gilles