Hi Jiajun,

I haven't shut down the connection to the outside world when threads suspend. I've built a plugin with the dmtcp_event_hook function, but I haven't put any application logic in the dmtcp_event_hook function. I am using libevent[1] to connect my MPI processes to the outside world. That is, libevent is responsible for managing the TCP connection. Maybe it happened because libevent...
Anyway, things are working now. If I have more problems I will let you know.

Thanks a lot again!

[1] http://libevent.org/

Edson

2015-10-31 22:12 GMT+01:00 Jiajun Cao <jia...@ccs.neu.edu>:

> Hi Edson,
>
> I suppose you wrote the plugin based on what I suggested earlier, i.e.,
> shutting down the connection to the outside world when threads suspend.
> Your plugin library is preloaded by dmtcp, and its handler for each event
> (DMTCP_EVENT_THREADS_SUSPEND, DMTCP_EVENT_WRITE_CKPT, etc.) is called
> in sequence. In your case, at the event DMTCP_EVENT_THREADS_SUSPEND,
> your plugin will shut down the tcp connection, so that dmtcp doesn't need
> to handle it at later events (since it's closed, or disconnected).
> Otherwise, for a connected socket, dmtcp will try to drain data from both
> sides of the connection, but the other side (the outside world) is not
> under the control of dmtcp.
>
> In fact, many internal functionalities of dmtcp are implemented using
> plugins. This makes the implementation more loosely coupled, and makes it
> easier for end users to write their own plugins. We're very happy to help
> you with this, and we'd welcome any advice from you about improving the
> plugin API, or about any difficulties understanding/writing your own plugin.
>
> Best,
> Jiajun
>
> On Sat, Oct 31, 2015 at 11:53 AM, Edson Tavares de Camargo <
> etcamarg...@gmail.com> wrote:
>
>> Hi Everyone
>>
>> Actually my problem was solved after I built a simple plugin and
>> ran it with dmtcp_launch. DMTCP manages the TCP connection between my
>> MPI application and my external application (which is inside my cluster)
>> only if the plugin is run together with dmtcp_launch. For example:
>>
>> - dmtcp_launch mpirun ...
>>
>> It doesn't work. DMTCP doesn't manage to drain the buffers.
>>
>> - dmtcp_launch --with-plugin plugin.so mpirun ...
>>
>> It works fine!
>>
>> Could you explain to me why it works with the plugin and doesn't work
>> without the plugin?
>>
>> Thanks!
>>
>> Edson
>>
>> 2015-10-29 2:44 GMT+01:00 Jiajun Cao <jia...@ccs.neu.edu>:
>>
>>> Hi Edson,
>>>
>>> The best way to achieve this is to write a tiny plugin, and in
>>> dmtcp_event_hook(), shut down the connection at
>>> DMTCP_EVENT_THREADS_SUSPEND.
>>> Your application should expose an API to do this. Or, you can define a
>>> weak symbol, and use dlsym() and RTLD_NEXT to find the symbol in your app.
>>>
>>> Best,
>>> Jiajun
>>>
>>> On Wed, Oct 28, 2015 at 2:12 PM, Edson Tavares de Camargo <
>>> etcamarg...@gmail.com> wrote:
>>>
>>>> Hi, Jiajun
>>>>
>>>> > I guess what you want to do is to make sure the connection to the
>>>> outside world is shut down before DMTCP handles the network connection.
>>>>
>>>> Yes, that is exactly what I would like to do.
>>>>
>>>> > If that is the case, you should use the event
>>>> DMTCP_EVENT_THREADS_SUSPEND instead, and restore the connection at event
>>>> DMTCP_EVENT_THREADS_RESUME.
>>>>
>>>> But my question now is how I can get the event
>>>> DMTCP_EVENT_THREADS_SUSPEND inside my application?
>>>>
>>>> I have managed to get the event DMTCP_EVENT_THREADS_SUSPEND inside a
>>>> DMTCP plugin, but I still can't work out how to make the DMTCP
>>>> plugin send a message to my application, or even how to make the DMTCP
>>>> plugin call a function in my application (asking it to shut down the
>>>> connection).
>>>>
>>>> It would be very nice if I could have the dmtcp_event_hook inside my
>>>> application. Is it possible to do that? Is there another way to tell
>>>> my application to shut down the connection to the outside world just
>>>> before the checkpoint?
>>>>
>>>> Thanks!
>>>>
>>>> Edson
>>>>
>>>> 2015-10-27 20:45 GMT+01:00 Jiajun Cao <jia...@ccs.neu.edu>:
>>>>
>>>>> Hi Edson,
>>>>>
>>>>> DMTCP_EVENT_WRITE_CKPT corresponds to the event right at the time of
>>>>> writing the checkpoint images into storage. At this point, the processing
>>>>> of network connections is already finished.
>>>>> I guess what you want to do is
>>>>> to make sure the connection to the outside world is shut down before
>>>>> DMTCP handles the network connection. If that is the case, you should
>>>>> use the event DMTCP_EVENT_THREADS_SUSPEND instead, and restore the
>>>>> connection at event DMTCP_EVENT_THREADS_RESUME.
>>>>>
>>>>> Best,
>>>>> Jiajun
>>>>>
>>>>> On Tue, Oct 27, 2015 at 2:48 PM, Edson Tavares de Camargo <
>>>>> etcamarg...@gmail.com> wrote:
>>>>>
>>>>>> Hi Jiajun,
>>>>>>
>>>>>> Thank you for your answer.
>>>>>>
>>>>>> I use a TCP connection between the MPI application and a logger
>>>>>> entity. It is just to store some information about the MPI messages.
>>>>>>
>>>>>> I have realized that when the connection between the MPI application
>>>>>> and the Logger is closed, DMTCP is able to make the checkpoint. Would
>>>>>> it be possible to close the connection just before the checkpoint
>>>>>> and restore it when the checkpoint has finished?
>>>>>>
>>>>>> I am trying to use the dmtcp_event_hook, but the event
>>>>>> DMTCP_EVENT_WRITE_CKPT seems to be delivered only after the following
>>>>>> warning message:
>>>>>>
>>>>>> [42000] WARNING at kernelbufferdrainer.cpp:124 in onTimeoutInterval;
>>>>>> REASON='JWARNING(false) failed'
>>>>>> _dataSockets[i]->socket().sockfd() = 14
>>>>>> buffer.size() = 0
>>>>>>
>>>>>> WARN_INTERVAL_SEC = 10
>>>>>> Message: Still draining socket... perhaps remote host is not running
>>>>>> under DMTCP?
>>>>>>
>>>>>> Is there a way to capture an event before that warning message?
>>>>>>
>>>>>> Thanks a lot!!!
>>>>>>
>>>>>> Edson
>>>>>>
>>>>>> On Oct 26, 2015 5:25 PM, "Jiajun Cao" <jia...@ccs.neu.edu> wrote:
>>>>>>
>>>>>>> Hi Edson,
>>>>>>>
>>>>>>> The error is what's expected. DMTCP considers the computation as a
>>>>>>> whole, i.e., all processes involved in a computation must run
>>>>>>> under DMTCP. Technically, this is because DMTCP must handle the network
>>>>>>> communication.
>>>>>>> At the time of a checkpoint, DMTCP needs to drain the data
>>>>>>> in the sockets so that no in-flight data is lost. In your
>>>>>>> case, the other side of the socket is not under the control of DMTCP.
>>>>>>>
>>>>>>> Also, if possible, could you tell us what kind of application
>>>>>>> you are running? I haven't tested DMTCP on MPI applications
>>>>>>> communicating with the external world. This could be a good test case
>>>>>>> for us.
>>>>>>>
>>>>>>> Best,
>>>>>>> Jiajun
>>>>>>>
>>>>>>> On Mon, Oct 26, 2015 at 6:46 AM, Edson Tavares de Camargo <
>>>>>>> etcamarg...@gmail.com> wrote:
>>>>>>>
>>>>>>>> Hi Everyone!
>>>>>>>>
>>>>>>>> I have a question: What is the expected behaviour of DMTCP when I
>>>>>>>> use DMTCP on an MPI application that exchanges messages with another
>>>>>>>> application that is not running under dmtcp_launch?
>>>>>>>>
>>>>>>>> I ask because I get an error when I execute an MPI application that
>>>>>>>> exchanges messages via TCP with another application. Both applications
>>>>>>>> are running on my cluster, but I only need to checkpoint the MPI
>>>>>>>> application. The error is the following:
>>>>>>>>
>>>>>>>> ========
>>>>>>>> WARNING at kernelbufferdrainer.cpp:120 in onTimeoutInterval;
>>>>>>>> REASON='JWARNING(false) failed'
>>>>>>>> _dataSockets[i]->socket().sockfd() = 15
>>>>>>>> buffer.size() = 1059
>>>>>>>> WARN_INTERVAL_SEC = 10
>>>>>>>> Message: Still draining socket... perhaps remote host is not
>>>>>>>> running under DMTCP?
>>>>>>>> =======
>>>>>>>>
>>>>>>>> Thanks!
>>>>>>>>
>>>>>>>> Edson
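
For anyone reading this thread later, here is a rough sketch of the weak-symbol idea suggested above: the plugin calls into the application on suspend and resume, and does nothing if the application does not provide the callbacks. It is only an illustration under assumptions. The enum is a stand-in so the code compiles without dmtcp.h, and the names app_close_logger_connection, app_reopen_logger_connection and sketch_dispatch are made up for this example. In a real plugin the same switch would presumably sit inside dmtcp_event_hook() and use DMTCP_EVENT_THREADS_SUSPEND and DMTCP_EVENT_THREADS_RESUME from dmtcp.h.

```c
#include <stddef.h>

/*
 * Stand-ins for the DMTCP event constants so that this sketch compiles
 * without dmtcp.h (assumption for illustration only). A real plugin would
 * switch on DMTCP_EVENT_THREADS_SUSPEND / DMTCP_EVENT_THREADS_RESUME
 * inside dmtcp_event_hook() instead.
 */
typedef enum {
  SKETCH_THREADS_SUSPEND,
  SKETCH_THREADS_RESUME,
  SKETCH_OTHER_EVENT
} sketch_event_t;

/*
 * Hypothetical functions the application would export; the names are
 * assumptions. They are declared weak so that the references resolve to
 * NULL, instead of causing a link error, when nothing defines them.
 */
extern void app_close_logger_connection(void) __attribute__((weak));
extern void app_reopen_logger_connection(void) __attribute__((weak));

/* Returns 1 if an application callback was invoked, 0 otherwise. */
int sketch_dispatch(sketch_event_t event)
{
  switch (event) {
  case SKETCH_THREADS_SUSPEND:
    /* Close the outside connection before the checkpoint proceeds. */
    if (app_close_logger_connection != NULL) {
      app_close_logger_connection();
      return 1;
    }
    return 0;
  case SKETCH_THREADS_RESUME:
    /* Re-establish the connection once threads are running again. */
    if (app_reopen_logger_connection != NULL) {
      app_reopen_logger_connection();
      return 1;
    }
    return 0;
  default:
    /* All other events are ignored by this sketch. */
    return 0;
  }
}
```

For the plugin (a shared library) to see the two functions, the application would likely have to export them in its dynamic symbol table, for example by linking with -rdynamic. With libevent managing the socket, the close callback is where the application would tear down its own connection object.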
------------------------------------------------------------------------------
_______________________________________________
Dmtcp-forum mailing list
Dmtcp-forum@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/dmtcp-forum