Hi Jiajun,

Thank you for your answer.
I use a TCP connection between the MPI application and a logger entity, just to store some information about the MPI messages. I have noticed that once the connection between the MPI application and the logger is closed, DMTCP is able to take the checkpoint. Would it be possible to close the connection just before the checkpoint and restore it once the checkpoint has finished?

I am trying to use dmtcp_event_hook, but the DMTCP_EVENT_WRITE_CKPT event seems to be delivered only after the following warning:

[42000] WARNING at kernelbufferdrainer.cpp:124 in onTimeoutInterval;
REASON='JWARNING(false) failed'
_dataSockets[i]->socket().sockfd() = 14
buffer.size() = 0
WARN_INTERVAL_SEC = 10
Message: Still draining socket... perhaps remote host is not running under DMTCP?

Is there a way to capture an event before that warning?

Thanks a lot!!!

Edson

On Oct 26, 2015 5:25 PM, "Jiajun Cao" <jia...@ccs.neu.edu> wrote:
> Hi Edson,
>
> The error is what's expected. DMTCP considers the computation as a whole,
> i.e., all processes involved in a computation must run under DMTCP.
> Technically, this is because DMTCP must handle the network communication.
> At the time of a checkpoint, DMTCP needs to drain the data in the sockets
> so that no in-flight data is lost. In your case, the other side of the
> socket is not under the control of DMTCP.
>
> Also, if possible, could you tell us what kind of application you are
> running? I haven't tested DMTCP on MPI applications communicating with the
> external world. This could be a good test suite for us.
>
> Best,
> Jiajun
>
> On Mon, Oct 26, 2015 at 6:46 AM, Edson Tavares de Camargo <
> etcamarg...@gmail.com> wrote:
>
>> Hi Everyone!
>>
>> I have a question: what is the expected behaviour of DMTCP when I use
>> DMTCP on an MPI application that exchanges messages with another
>> application that is not running under dmtcp_launch?
>>
>> I ask because I get an error when I execute an MPI application that
>> exchanges messages via TCP with another application. Both applications
>> are running on my cluster, but I only need to checkpoint the MPI
>> application. The error is the following:
>>
>> ========
>> WARNING at kernelbufferdrainer.cpp:120 in onTimeoutInterval;
>> REASON='JWARNING(false) failed'
>> _dataSockets[i]->socket().sockfd() = 15
>> buffer.size() = 1059
>> WARN_INTERVAL_SEC = 10
>> Message: Still draining socket... perhaps remote host is not running
>> under DMTCP?
>> =======
>>
>> Thanks!
>>
>> Edson
>> -------
>>
>> ------------------------------------------------------------------------------
>> _______________________________________________
>> Dmtcp-forum mailing list
>> Dmtcp-forum@lists.sourceforge.net
>> https://lists.sourceforge.net/lists/listinfo/dmtcp-forum
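Jiajun's point about draining can be illustrated with a simplified model. This is a hypothetical sketch, not DMTCP's kernelbufferdrainer code: it assumes a marker-based drain in which each endpoint writes a marker into the socket and then reads until the peer's marker arrives, keeping the bytes received before it as in-flight data to be saved. A peer that is not running under DMTCP never writes a marker, so the loop only times out, which is the situation the "Still draining socket..." warning above reports. The marker string, the drain_socket name, and the timeout parameters are illustrative assumptions.

```c
#include <assert.h>
#include <errno.h>
#include <poll.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>

/* Hypothetical marker; NOT the cookie DMTCP actually uses. */
static const char DRAIN_MARKER[] = "<DRAIN-MARKER>";
#define DRAIN_MARKER_LEN (sizeof(DRAIN_MARKER) - 1)

/*
 * Simplified drain (assumed model): send our marker, then read until the
 * peer's marker appears at the end of the received data. Bytes received
 * before the marker are the in-flight data a checkpoint would have to save.
 *
 * Returns the number of in-flight bytes stored in buf (marker stripped),
 * or -1 if the marker did not arrive within max_rounds poll timeouts,
 * the buffer filled up, or an error/EOF occurred.
 */
ssize_t drain_socket(int fd, char *buf, size_t cap,
                     int timeout_ms, int max_rounds)
{
    assert(cap >= DRAIN_MARKER_LEN);
    if (send(fd, DRAIN_MARKER, DRAIN_MARKER_LEN, 0) != (ssize_t)DRAIN_MARKER_LEN)
        return -1;

    size_t len = 0;
    int rounds = 0;
    while (rounds < max_rounds) {
        struct pollfd pfd = { .fd = fd, .events = POLLIN };
        int rc = poll(&pfd, 1, timeout_ms);
        if (rc < 0) {
            if (errno == EINTR)
                continue;
            return -1;
        }
        if (rc == 0) {
            /* Timeout with no marker yet: analogous to "Still draining socket...". */
            rounds++;
            continue;
        }
        if (len == cap)
            return -1;              /* buffer full and still no marker */
        ssize_t n = recv(fd, buf + len, cap - len, 0);
        if (n <= 0)
            return -1;              /* EOF or error before the marker */
        len += (size_t)n;
        if (len >= DRAIN_MARKER_LEN &&
            memcmp(buf + len - DRAIN_MARKER_LEN, DRAIN_MARKER, DRAIN_MARKER_LEN) == 0)
            return (ssize_t)(len - DRAIN_MARKER_LEN);
    }
    return -1;                      /* peer never sent its marker */
}
```

For example, on a socketpair(2) where the peer sends some data followed by its marker, drain_socket returns the number of data bytes; if the peer sends data but never a marker (as an application outside DMTCP would), it returns -1 after max_rounds timeouts.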
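The approach Edson proposes (close the logger connection before the checkpoint, reopen it afterwards) could be organized as below. This is a hypothetical sketch: the logger_* names and struct are invented, the socket passed to logger_after_checkpoint is assumed to come from the application's existing socket()/connect() code, and it deliberately leaves open which DMTCP event, if any, fires early enough to call logger_before_checkpoint, since that is the open question in this thread. Records logged while disconnected are buffered locally and flushed after reconnecting.

```c
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical wrapper around the connection to the logger. */
struct logger_conn {
    int fd;                 /* connected socket, or -1 while disconnected */
    char pending[4096];     /* records buffered while disconnected */
    size_t pending_len;
};

/* Send all n bytes, retrying on short writes. Returns 0 on success. */
static int send_all(int fd, const char *p, size_t n)
{
    while (n > 0) {
        ssize_t w = send(fd, p, n, 0);
        if (w <= 0)
            return -1;
        p += w;
        n -= (size_t)w;
    }
    return 0;
}

/* Log one record: send it if connected, otherwise buffer it locally.
 * Returns 0 on success, -1 if sending fails or the buffer is full. */
int logger_write(struct logger_conn *lc, const char *rec, size_t n)
{
    if (lc->fd >= 0)
        return send_all(lc->fd, rec, n);
    if (n > sizeof lc->pending - lc->pending_len)
        return -1;
    memcpy(lc->pending + lc->pending_len, rec, n);
    lc->pending_len += n;
    return 0;
}

/* Call before the checkpoint: once the socket is closed there is nothing
 * left on it to drain. */
void logger_before_checkpoint(struct logger_conn *lc)
{
    if (lc->fd >= 0) {
        close(lc->fd);
        lc->fd = -1;
    }
}

/* Call after resume/restart with a freshly connected socket new_fd:
 * adopt it and flush the records buffered while disconnected.
 * Returns 0 on success, -1 on failure (records stay in the buffer,
 * though some may already have been sent). */
int logger_after_checkpoint(struct logger_conn *lc, int new_fd)
{
    if (new_fd < 0)
        return -1;
    if (lc->fd >= 0)
        close(lc->fd);          /* avoid leaking a previous connection */
    lc->fd = new_fd;
    if (send_all(new_fd, lc->pending, lc->pending_len) != 0)
        return -1;
    lc->pending_len = 0;
    return 0;
}
```

The local buffer is a design choice: it keeps records from being dropped in the window between closing and reopening the connection, at the cost of a fixed capacity (logger_write returns -1 once it is full).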