Dear DMTCP team,

I am trying to find a way to use DMTCP for migration after checkpointing. Unfortunately, I ran into problems as soon as I tried running DMTCP with MPICH, without any third-party plugins or modifications of any kind.

The problem is as follows: I start a dmtcp_coordinator on localhost and then launch my MPI application. The application just sends messages from one process to another for a specified time. I use mpich-3.2 and mpirun with four processes on two hosts.

Everything runs as expected until a checkpoint is initiated. As soon as a checkpoint is initiated, DMTCP and my MPI application hang, and I have to kill all connected processes manually. No checkpoint images are written to the specified directory. If I print the process list from the coordinator, the processes are sometimes listed as checkpointing and sometimes as suspended. If I do not initiate a checkpoint, the application runs to completion.

Often, but not always, DMTCP prints the following messages when it gets stuck:

[42000] WARNING at kernelbufferdrainer.cpp:125 in onTimeoutInterval; REASON='JWARNING(false) failed'
     _dataSockets[i]->socket().sockfd() = 10
     buffer.size() = 129
     WARN_INTERVAL_SEC = 10
Message: Still draining socket... perhaps remote host is not running under DMTCP?
[40000] WARNING at kernelbufferdrainer.cpp:125 in onTimeoutInterval; REASON='JWARNING(false) failed'
     _dataSockets[i]->socket().sockfd() = 7
     buffer.size() = 129
     WARN_INTERVAL_SEC = 10
Message: Still draining socket... perhaps remote host is not running under DMTCP?
[43000] WARNING at kernelbufferdrainer.cpp:125 in onTimeoutInterval; REASON='JWARNING(false) failed'
     _dataSockets[i]->socket().sockfd() = 16
     buffer.size() = 177
     WARN_INTERVAL_SEC = 10
Message: Still draining socket... perhaps remote host is not running under DMTCP?
[44000] WARNING at kernelbufferdrainer.cpp:125 in onTimeoutInterval; REASON='JWARNING(false) failed'
     _dataSockets[i]->socket().sockfd() = 16
     buffer.size() = 177
     WARN_INTERVAL_SEC = 10
Message: Still draining socket... perhaps remote host is not running under DMTCP?
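
In case it helps when comparing runs, the warnings above follow a regular pattern, so they can be tallied with a small script. The following is only a sketch: it assumes the log format is exactly as shown above, the function name parse_drain_warnings and the key "tag" are made up for illustration, and I am only guessing that the bracketed number identifies the process as seen by DMTCP.

```python
import re
from collections import Counter

# Matches one warning as it appears in the log above: the bracketed number,
# the source location, then the sockfd and buffer.size() fields.
# Assumption: every warning carries both fields, in this order.
WARNING_RE = re.compile(
    r"\[(?P<tag>\d+)\] WARNING at (?P<file>[\w.]+):(?P<line>\d+) in (?P<func>\w+);"
    r".*?sockfd\(\) = (?P<sockfd>\d+)"
    r"\s+buffer\.size\(\) = (?P<bufsize>\d+)",
    re.DOTALL,
)


def parse_drain_warnings(log_text):
    """Return one dict per warning found in log_text (hypothetical helper)."""
    records = []
    for match in WARNING_RE.finditer(log_text):
        records.append({
            "tag": int(match.group("tag")),          # bracketed number, e.g. 42000
            "sockfd": int(match.group("sockfd")),    # socket still being drained
            "buffer_size": int(match.group("bufsize")),
        })
    return records


if __name__ == "__main__":
    sample = (
        "[42000] WARNING at kernelbufferdrainer.cpp:125 in onTimeoutInterval; "
        "REASON='JWARNING(false) failed' "
        "_dataSockets[i]->socket().sockfd() = 10 buffer.size() = 129 "
        "WARN_INTERVAL_SEC = 10 Message: Still draining socket...\n"
        "[43000] WARNING at kernelbufferdrainer.cpp:125 in onTimeoutInterval; "
        "REASON='JWARNING(false) failed' "
        "_dataSockets[i]->socket().sockfd() = 16 buffer.size() = 177 "
        "WARN_INTERVAL_SEC = 10 Message: Still draining socket...\n"
    )
    records = parse_drain_warnings(sample)
    for record in records:
        print(record)
    # Count how many warnings each bracketed number produced.
    print(Counter(record["tag"] for record in records))
```

Saving the coordinator and application output to a file and passing its contents to this function would show whether the same processes and sockets are stuck on every failed checkpoint.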

A weird thing is that, out of 20+ attempts to run it exactly as described above, a checkpoint actually succeeded on two or three occasions before the application hung at the next checkpoint. I did not change anything between attempts, and the end result stayed the same. However, this gave me a checkpoint image from which to try a restart, and there I encountered another problem.

If I restart from the restart_script, not all processes are restarted. The dmtcp_ssh and dmtcp_sshd processes and the MPICH process-manager processes hydra and mpiexec are not restarted. If I use dmtcp_restart and specify all images, the application restarts without any problems, although it now restarts on a single host only. If I try to checkpoint at that point, the situation is the same as above (it freezes).

DMTCP runs smoothly on a single host: I can checkpoint and restart as often as I want. Even there, the restart_script seems to swallow a process. Initially six processes were connected to the coordinator; after a restart with the restart_script only five processes are connected, whereas after a restart with dmtcp_restart six processes are connected.

I am working on a local cluster at my university, using two nodes connected via Ethernet.

I would be very grateful if you could give me a hint as to how I can solve these problems.

Kind regards,
Moritz

_______________________________________________
Dmtcp-forum mailing list
Dmtcp-forum@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/dmtcp-forum