Re: [OMPI devel] Retrying a MPI_SEND

George Bosilca Fri, 18 Nov 2011 12:33:42 -0500

On Nov 18, 2011, at 07:29 , Hugo Daniel Meyer wrote:

> Hello again.
> 
> I was tracing through the PML_OB1 files. I started following an MPI_Ssend(), 
> trying to find where a message is stored (on the sender) if it is not sent 
> until the receiver posts the recv, but I didn't find that place.


Right, you can't find this as the message is not stored on the sender. The 
pointer to the send request is sent encapsulated in the matching header, and 
the receiver will provide it back once the message has been matched (this means 
the data is now ready to flow).
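
The handshake described above can be sketched in plain C. This is only an illustrative model, not the ob1 code: the type and function names (send_req_t, rndv_hdr_t, ack_hdr_t, make_rndv_hdr, match_and_ack, on_ack) are hypothetical, and the "network" is just a function call. It assumes the receiver treats the request handle as opaque and echoes it back unchanged once a posted receive matches.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical sender-side request: the payload stays in the user's buffer. */
typedef struct {
    const void *user_buf;  /* not copied anywhere by the sender */
    size_t      len;
    int         matched;   /* set once the receiver has acknowledged the match */
} send_req_t;

/* Hypothetical matching header: carries an opaque handle to the send request. */
typedef struct {
    int       tag;
    size_t    len;
    uintptr_t src_req;     /* sender's request pointer, opaque to the receiver */
} rndv_hdr_t;

/* Hypothetical acknowledgement: the receiver echoes the handle back. */
typedef struct {
    uintptr_t src_req;
} ack_hdr_t;

/* Sender: build the header that announces the message. */
rndv_hdr_t make_rndv_hdr(send_req_t *req, int tag) {
    rndv_hdr_t hdr = { tag, req->len, (uintptr_t)req };
    return hdr;
}

/* Receiver: if the posted receive matches, return the handle to the sender. */
ack_hdr_t match_and_ack(const rndv_hdr_t *hdr, int posted_tag, int *did_match) {
    ack_hdr_t ack = { 0 };
    *did_match = (hdr->tag == posted_tag);
    if (*did_match) {
        ack.src_req = hdr->src_req;
    }
    return ack;
}

/* Sender: recover the request from the echoed handle; data may now flow. */
send_req_t *on_ack(const ack_hdr_t *ack) {
    send_req_t *req = (send_req_t *)ack->src_req;
    assert(req != NULL);
    req->matched = 1;
    return req;
}
```

In this model the sender keeps no queue of unmatched messages: the only state is the request itself, and the sender finds it again through the handle the receiver returns.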

> I've noticed that the message to be sent enters 
> mca_pml_ob1_rndv_completion_request (pml_ob1_sendreq.c), and that rc = 
> send_request_pml_complete_check(sendreq) returns false when the request 
> hasn't been completed, but the execution never passes through 
> MCA_PML_OB1_PROGRESS_PENDING; at least, none of the possible options is 
> executed.
> 
> So, to reorient my question: where is this message stored until delivery? 
> And is there any way to know that the receiver has gone down? With this 
> information I will be able to detect the failure of the receiver and try 
> to resend the message to another place.

If you want to track the send requests, you will have to implement your own way 
of tracking them, as we do not expose this in our PML. Ultimately, writing your 
own PML might be necessary.
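
One possible shape for such tracking, kept outside the PML, is a small table of outstanding sends that a wrapper fills in before each send and clears on completion. The sketch below is an assumption about how this could look, not anything Open MPI provides: send_tracker_t, tracker_add, tracker_complete and tracker_pending_for are hypothetical names, and the fixed table size is arbitrary.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_TRACKED 64  /* arbitrary capacity for this sketch */

/* One outstanding send; the buffer is still owned by the caller. */
typedef struct {
    int         in_use;
    int         dest;   /* destination rank */
    int         tag;
    const void *buf;
    size_t      len;
} tracked_send_t;

/* Zero-initialise before first use. */
typedef struct {
    tracked_send_t slots[MAX_TRACKED];
} send_tracker_t;

/* Record a send before handing it to the messaging layer.
 * Returns a slot id, or -1 if the table is full. */
int tracker_add(send_tracker_t *t, int dest, int tag,
                const void *buf, size_t len) {
    assert(t != NULL);
    for (int i = 0; i < MAX_TRACKED; i++) {
        if (!t->slots[i].in_use) {
            t->slots[i] = (tracked_send_t){ 1, dest, tag, buf, len };
            return i;
        }
    }
    return -1;
}

/* Forget a send once it has completed. */
void tracker_complete(send_tracker_t *t, int id) {
    assert(t != NULL);
    if (id >= 0 && id < MAX_TRACKED) {
        t->slots[id].in_use = 0;
    }
}

/* Collect the ids of sends still outstanding to one peer.
 * Returns how many ids were written to out. */
int tracker_pending_for(const send_tracker_t *t, int dest,
                        int *out, int max_out) {
    assert(t != NULL);
    int n = 0;
    for (int i = 0; i < MAX_TRACKED && n < max_out; i++) {
        if (t->slots[i].in_use && t->slots[i].dest == dest) {
            out[n++] = i;
        }
    }
    return n;
}
```

When a peer is known to have failed, tracker_pending_for gives the list of sends that would need to be reposted. How the failure is detected is a separate problem that this table does not address.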

However, as a user I would find it very disturbing if the MPI runtime decided 
to send the message to another peer on my behalf. I would much prefer that 
MPI_Send return some kind of error, allowing the upper-level algorithm to 
repost the send to another peer. Look at the proposals in the MPI Forum for 
more information about what is being discussed regarding MPI resilience.
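
The application-level pattern could look roughly like the following sketch. To keep it independent of any MPI installation, the send is abstracted behind a hypothetical callback type (send_fn); send_with_fallback and fake_send are likewise made-up names. In a real program the callback would presumably wrap MPI_Send on a communicator whose error handler has been set to MPI_ERRORS_RETURN with MPI_Comm_set_errhandler, since the default handler, MPI_ERRORS_ARE_FATAL, aborts instead of returning. Whether MPI is still usable after such an error is exactly what the resilience proposals try to define, so treat this as a sketch of the control flow only.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical send callback: returns 0 on success, non-zero on failure. */
typedef int (*send_fn)(const void *buf, size_t len, int dest, void *ctx);

/* Try each candidate peer in order. Returns the rank that accepted the
 * message, or -1 if every attempt failed. */
int send_with_fallback(send_fn send, void *ctx,
                       const void *buf, size_t len,
                       const int *peers, int npeers) {
    assert(send != NULL);
    for (int i = 0; i < npeers; i++) {
        if (send(buf, len, peers[i], ctx) == 0) {
            return peers[i];
        }
    }
    return -1;
}

/* Stand-in transport for illustration: fails for the rank stored in *ctx. */
int fake_send(const void *buf, size_t len, int dest, void *ctx) {
    (void)buf;
    (void)len;
    return dest == *(const int *)ctx ? 1 : 0;
}
```

The decision about which peer to try next stays in the application, which is the point of the preference stated above: the runtime reports the failure, and the algorithm decides what to do about it.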

  george.

> 
> Thanks again.
> 
> Hugo Meyer
> 
> 2011/11/17 Hugo Daniel Meyer <[email protected]>
> Hello @ll.
> 
> I'm making some changes to the communication framework. Right now I'm working 
> on a "secure" MPI_Send. This send needs to know when an endpoint goes down, 
> and then retry the communication by constructing a new endpoint, or at least 
> by overwriting the data of the old endpoint with the new address of the 
> receiver process. Overwriting the data of the endpoint is no longer a 
> problem, because I've done that before.
> 
> For example, consider a Master/Worker application, where the master sends 
> data to the workers and the workers start the computation. The master posts 
> a send to worker1, which fails and gets restarted on another node; in its 
> new location, worker1 posts the recv for the master's send. The problem here 
> is that the master posted the send when the process was residing on one 
> node, but the process expects the message on another node. I need the sender 
> to realize that the process is now on another node and retry the 
> communication with a modified endpoint. Could anyone please tell me where in 
> the send code I can obtain the status of a message that hasn't been sent, 
> and resend it to a new location? I would also like to know where I can 
> obtain information about an endpoint failure.
> 
> Thanks in advance.
> 
> Hugo
> 
> _______________________________________________
> devel mailing list
> [email protected]
> http://www.open-mpi.org/mailman/listinfo.cgi/devel
