Hello @ll.

I'm needing some help to restart the communication with a process that i
restore in a different node. My situation is as follows:

The process fails and it's restored in another node succesfully from a
previous checkpoint that i sent there. Now, when a process try to send a
message to this restored process it will fail, or at least, it will be
locked in *ompi_request_wait_completion. *
*
*
So, when this happens i have to send a message to the daemon of the sender
that will have the uri of where the process has been restored and answer to
the proc with this and it will update this info.

So, i need to know where in the code i can capture this attempt to send and
then send the message to his daemon and where and how i can update this info
to send the message to the right place (Same rank but new uri).

I have to do it in this way to avoid a collective communication.

If you give me a hand on this, it will be great.

Best regards.

Hugo

Reply via email to