[Dmtcp-forum] Freezing after checkpoint is initiated - MPI application on two hosts

Eilfort, Moritz Emanuel Christoph Fri, 26 Aug 2016 03:23:44 -0700

Dear DMTCP-Team,

I am trying to find a way to use DMTCP to migrate an application after
checkpointing. Unfortunately, I have already run into problems running
DMTCP with MPICH, without any third-party plugins or changes of any
kind.


The problem is as follows:
I start a dmtcp_coordinator on the local host and then launch my MPI
application. The MPI application just sends messages from one
process to another for a specified time. I use mpich-3.2 and mpirun
with four processes on two hosts. Everything runs as expected until a
checkpoint is initiated. As soon as a checkpoint is initiated, DMTCP and
my MPI application get stuck, and I have to kill all connected processes
manually. Checkpoint images are not written to the specified directory.
If I print the process list using the coordinator, the processes are
sometimes listed as checkpointing and sometimes as suspended. If I do
not initiate a checkpoint, the application runs until it is finished.

Often, but not always, DMTCP prints the following warnings when it gets
stuck:

[42000] WARNING at kernelbufferdrainer.cpp:125 in onTimeoutInterval;
REASON='JWARNING(false) failed'
     _dataSockets[i]->socket().sockfd() = 10
     buffer.size() = 129
     WARN_INTERVAL_SEC = 10
Message: Still draining socket... perhaps remote host is not running
under DMTCP?
[40000] WARNING at kernelbufferdrainer.cpp:125 in onTimeoutInterval;
REASON='JWARNING(false) failed'
     _dataSockets[i]->socket().sockfd() = 7
     buffer.size() = 129
     WARN_INTERVAL_SEC = 10
Message: Still draining socket... perhaps remote host is not running
under DMTCP?
[43000] WARNING at kernelbufferdrainer.cpp:125 in onTimeoutInterval;
REASON='JWARNING(false) failed'
     _dataSockets[i]->socket().sockfd() = 16
     buffer.size() = 177
     WARN_INTERVAL_SEC = 10
Message: Still draining socket... perhaps remote host is not running
under DMTCP?
[44000] WARNING at kernelbufferdrainer.cpp:125 in onTimeoutInterval;
REASON='JWARNING(false) failed'
     _dataSockets[i]->socket().sockfd() = 16
     buffer.size() = 177
     WARN_INTERVAL_SEC = 10
Message: Still draining socket... perhaps remote host is not running
under DMTCP?
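
To compare such warnings across runs, the blocks can be tallied per
process and socket. The following is a minimal Python sketch, not part
of DMTCP; the names DrainWarning and parse_drain_warnings are
hypothetical, and the regular expressions assume exactly the layout
shown above (a bracketed id, then the sockfd line, then the
buffer.size line).

```python
import re
from collections import namedtuple

# Hypothetical record type: one entry per "Still draining socket" warning.
DrainWarning = namedtuple("DrainWarning", "proc_id sockfd buffered_bytes")

# Patterns assume the log layout quoted in this message.
HEADER = re.compile(r"^\[(\d+)\] WARNING at kernelbufferdrainer\.cpp")
SOCKFD = re.compile(r"sockfd\(\)\s*=\s*(\d+)")
BUFSIZE = re.compile(r"buffer\.size\(\)\s*=\s*(\d+)")


def parse_drain_warnings(text):
    """Return a list of DrainWarning tuples found in the log text."""
    warnings = []
    proc_id = sockfd = None
    for raw in text.splitlines():
        line = raw.strip()
        m = HEADER.match(line)
        if m:
            # A new warning block starts; remember its bracketed id.
            proc_id, sockfd = int(m.group(1)), None
            continue
        m = SOCKFD.search(line)
        if m and proc_id is not None:
            sockfd = int(m.group(1))
            continue
        m = BUFSIZE.search(line)
        if m and proc_id is not None and sockfd is not None:
            warnings.append(DrainWarning(proc_id, sockfd, int(m.group(1))))
            proc_id = sockfd = None
    return warnings
```

Calling parse_drain_warnings() on the saved output gives one tuple per
warning block, which makes it easy to see whether the same processes and
sockets are involved in every hang.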

A weird thing is that out of 20+ attempts to run it exactly as
described above, I actually managed to complete a checkpoint on two or
three occasions before it crashed at the next initiated checkpoint.
I did not change anything, and the end result stayed the same.
However, I then had a checkpoint image from which to try a restart, and
I encountered another problem. If I restart from the restart_script,
not all processes are restarted: the dmtcp_ssh and dmtcp_sshd processes
and the MPICH process-manager processes hydra and mpiexec are not
restarted. If I use dmtcp_restart and specify all images, the
application restarts without any problems, although it now only
restarts on a single host. If I try to checkpoint now, the situation is
the same as above (it freezes).

DMTCP runs smoothly on a single host. I can checkpoint and restart as
often as I want to. The restart_script still seems to be swallowing a
process, however. Initially six processes were connected to the
coordinator; after a restart with the restart_script only five processes
are connected, and after a restart with dmtcp_restart six processes are
connected to the coordinator.

I am working on a local cluster at my university. I use two nodes
connected via Ethernet. 

I would be very grateful if you could give me a hint as to how I can
solve these problems. 

Kind regards,
Moritz


------------------------------------------------------------------------------
_______________________________________________
Dmtcp-forum mailing list
Dmtcp-forum@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/dmtcp-forum
