Hi Husen,

I'll start with some basic questions about your setup.
- Could you share with us your launch scripts for MPICH and Open MPI?
- What DMTCP version are you using?
- Do you have InfiniBand on your setup? If yes, you'd need to configure DMTCP
  with InfiniBand support (`./configure --enable-infiniband-support`) and use
  the `--ib` flag with dmtcp_launch.

Next, you wrote:

> I have tried to use --rm in mpich-3.2, and it doesn't work. mpich-3.2
> doesn't recognize --rm option.

The `--rm` flag is a `dmtcp_launch` option; it's not an MPICH option. It goes
on the dmtcp_launch command line, e.g., `dmtcp_launch --rm mpiexec ./mm.o`,
not on the mpiexec command line.

You seem to be seeing two kinds of warnings:

a) "Still draining socket... perhaps remote host is not running under DMTCP";
   and
b) "Datagram Sockets not supported. Hopefully, this is a short lived
   connection".

The first one indicates that there are sockets in your process going out to
entities not running under DMTCP. I think this could be specific to your
SLURM/MPI setup.

The second warning could imply many different things. I haven't usually seen
MPI implementations use datagram sockets, and datagram sockets are not
supported in DMTCP out of the box. Is your application using them? Are you
trying to checkpoint a GUI-based application? In either case, the warnings
are not fatal, or at least not immediately fatal. However, they could lead to
other issues that show up at restart time.

Moving forward, I think the first thing you need to do is to verify that the
checkpoint was "successful". If it was, you should see checkpoint images
corresponding to each MPI rank, i.e., there should be one checkpoint image (a
*.dmtcp file) per MPI process. Do you see that? Do you see a restart script?
(The first sketch below shows one way to check.)

The next step would be to verify the restart part. The restart script is a
little tricky and might need some modifications depending on your setup; in
other words, don't rely on it to work out of the box. You could try to
restart the computation manually to isolate the issue. Here's how I would do
it:

- Allocate N interactive nodes. N could be 1 or more; it's easier to debug
  with 1 node, assuming you have enough RAM on the node.
- Start dmtcp_coordinator. You could start it on the head node or on one of
  the allocated compute nodes.
- ssh to an allocated node, and manually execute the restart commands:

    dmtcp_restart -h <coord-host> -p <coord-port> ckpt_image1.dmtcp
    dmtcp_restart -h <coord-host> -p <coord-port> ckpt_image2.dmtcp
    ...

The only thing you need to ensure when doing this manually is that the MPI
ranks that were sharing a node prior to checkpointing are restarted on one
node. This is because the MPI processes might be using (SysV) shared memory
for intra-node communication. On restart, DMTCP will try to restore the
shared-memory region and will fail if the processes are not restarted on one
node. (The second sketch below wraps these steps in a small script.)

Finally, I think what you are seeing is due to some configuration issue. We
have tested with different MPIs recently, and it works; I could be wrong,
though. Would it be possible for you to give us a guest account for debugging
on your setup? It'll be the most efficient way of resolving this.
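First sketch: verifying the checkpoint. This assumes the images were written
to the directory the job ran in, and that your job used 8 ranks (2 nodes x 4
tasks per node, as in your sbatch script below); adjust the path and count
for your setup.

    #!/bin/bash
    # Sketch: check for one checkpoint image per MPI rank.
    # CKPT_DIR and EXPECTED are assumptions based on your job script;
    # adjust them for your setup.
    CKPT_DIR=.
    EXPECTED=8

    n_images=$(ls "$CKPT_DIR"/ckpt_*.dmtcp 2>/dev/null | wc -l)
    echo "Found $n_images checkpoint image(s); expected $EXPECTED."

    # A successful checkpoint also writes a restart script.
    ls "$CKPT_DIR"/dmtcp_restart_script*.sh 2>/dev/null \
        || echo "No restart script found."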
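Second sketch: the manual restart, assuming all ranks fit on a single
allocated node and the images are in the current directory. You would run
this on the allocated node itself; the port number is arbitrary.

    #!/bin/bash
    # Sketch: manual restart of all ranks on one node.
    # Assumes this node has enough RAM for all ranks and that the
    # ckpt_*.dmtcp images are in the current directory.
    COORD_HOST=$(hostname)
    COORD_PORT=7779   # arbitrary free port

    # Start a coordinator on this node; --exit-on-last makes it quit
    # once the computation finishes.
    dmtcp_coordinator --daemon --exit-on-last -p "$COORD_PORT"

    # Restart each image under the same coordinator. Ranks that shared
    # a node before checkpointing must be restarted together, since
    # they may use (SysV) shared memory for intra-node communication.
    for img in ckpt_*.dmtcp; do
        dmtcp_restart -h "$COORD_HOST" -p "$COORD_PORT" "$img" &
    done
    wait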
-Rohan

On Fri, May 20, 2016 at 06:32:09PM +0700, Husen R wrote:
> Hi Gene,
>
> Thank you for your reply!
>
> I have tried to use --rm in mpich-3.2, and it doesn't work. mpich-3.2
> doesn't recognize --rm option.
> I don't know exactly, what's the difference between mpich-3.2 and mpich2?
>
> recently I tried to use openmpi-1.6 to checkpoint mpi application using
> dmtcp and slurm,
> but I got the following error:
>
> [40000] WARNING at socketconnection.cpp:187 in TcpConnection;
> REASON='JWARNING(false) failed'
>      type = 2
> Message: Datagram Sockets not supported. Hopefully, this is a short lived
> connection!
> [46000] WARNING at socketconnection.cpp:187 in TcpConnection;
> REASON='JWARNING(false) failed'
>      type = 2
> Message: Datagram Sockets not supported. Hopefully, this is a short lived
> connection!
> [50000] WARNING at socketconnection.cpp:187 in TcpConnection;
> REASON='JWARNING(false) failed'
>      type = 2
> Message: Datagram Sockets not supported. Hopefully, this is a short lived
> connection!
> [45000] WARNING at socketconnection.cpp:187 in TcpConnection;
> REASON='JWARNING(false) failed'
>      type = 2
> Message: Datagram Sockets not supported. Hopefully, this is a short lived
> connection!
> [48000] WARNING at socketconnection.cpp:187 in TcpConnection;
> REASON='JWARNING(false) failed'
>      type = 2
> Message: Datagram Sockets not supported. Hopefully, this is a short lived
> connection!
> ...
> ...
> ...
> [41000] ERROR at dmtcpmessagetypes.cpp:65 in assertValid;
> REASON='JASSERT(strcmp ( DMTCP_MAGIC_STRING,_magicBits ) == 0) failed'
>      _magicBits =
> Message: read invalid message, _magicBits mismatch. Did DMTCP coordinator
> die uncleanly?
> dmtcp_srun_helper (41000): Terminating...
>
> in addition, the slurm_restart.job seems not working at all.
> I need help.
> Thank you in advance,
>
> Regards,
>
> Husen
>
> On Fri, May 20, 2016 at 5:36 PM, Gene Cooperman <g...@ccs.neu.edu> wrote:
> > Hi William and Husen,
> > As far as I know, the combination "--rm --ib" should work with
> > the major MPI implementations: Open MPI, MVAPICH2, Intel MPI, MPICH2.
> > But I'm not sure which ones we've tested with very recently.
> > I'm pretty sure that we've used MVAPICH2 and Open MPI in this way.
> >
> > Jiajun and Rohan,
> > Could you confirm which implementations you've used _with the
> > "--rm --ib" combination_? If it's not working with one of the
> > major MPI implementations, we need to fix that.
> >
> > Thanks,
> > - Gene
> >
> > On Thu, May 19, 2016 at 03:42:06PM -0700, William Fox wrote:
> > > At least for me (I am not a developer for dmtcp) I was forced to
> > > switch to openmpi (version 1.6 specifically) in order to get --rm to
> > > work correctly. What version of mpi are you running? In addition, if
> > > you are using infiniband, --ib will need to be installed and utilized
> > > in order to accomplish a restart.
> > >
> > > On Wed, May 18, 2016 at 1:15 AM, Husen R <hus...@gmail.com> wrote:
> > > > dear all,
> > > >
> > > > I have tried to checkpoint mpi application using dmtcp but I failed
> > > > with the error message as follows:
> > > >
> > > > [40000] WARNING at kernelbufferdrainer.cpp:124 in onTimeoutInterval;
> > > > REASON='JWARNING(false) failed'
> > > >      _dataSockets[i]->socket().sockfd() = 9
> > > >      buffer.size() = 0
> > > >      WARN_INTERVAL_SEC = 10
> > > > Message: Still draining socket... perhaps remote host is not running
> > > > under DMTCP?
> > > > [40000] WARNING at kernelbufferdrainer.cpp:124 in onTimeoutInterval;
> > > > REASON='JWARNING(false) failed'
> > > >      _dataSockets[i]->socket().sockfd() = 7
> > > >      buffer.size() = 0
> > > >      WARN_INTERVAL_SEC = 10
> > > > Message: Still draining socket... perhaps remote host is not running
> > > > under DMTCP?
> > > > ......
> > > > ......
> > > > ......
> > > >
> > > > I use this sbatch script to submit job:
> > > >
> > > > #####################################SBATCH###########################
> > > > #!/bin/bash
> > > > # Put your SLURM options here
> > > > #SBATCH --partition=comeon
> > > > #SBATCH --time=01:15:00
> > > > #SBATCH --nodes=2
> > > > #SBATCH --ntasks-per-node=4
> > > > #SBATCH --job-name="dmtcp_job"
> > > > #SBATCH --output=dmtcp_ckpt_img/dmtcp-%j.out
> > > >
> > > > start_coordinator()
> > > > {
> > > >     fname=dmtcp_command.$SLURM_JOBID
> > > >     h=$(hostname)
> > > >     check_coordinator=$(which dmtcp_coordinator)
> > > >
> > > >     if [ -z "$check_coordinator" ]; then
> > > >         echo "No dmtcp_coordinator found. Check your DMTCP installation and PATH settings."
> > > >         exit 0
> > > >     fi
> > > >
> > > >     dmtcp_coordinator --daemon --exit-on-last -p 0 --port-file $fname $@ 1>/dev/null 2>&1
> > > >
> > > >     p=`cat $fname`
> > > >     chmod +x $fname
> > > >     echo "#!/bin/bash" > $fname
> > > >     echo >> $fname
> > > >     echo "export PATH=$PATH" >> $fname
> > > >     echo "export DMTCP_COORD_HOST=$h" >> $fname
> > > >     echo "export DMTCP_COORD_PORT=$p" >> $fname
> > > >     echo "dmtcp_command \$@" >> $fname
> > > >
> > > >     # Set up local environment for DMTCP
> > > >     export DMTCP_COORD_HOST=$h
> > > >     export DMTCP_COORD_PORT=$p
> > > > }
> > > >
> > > > cd $SLURM_SUBMIT_DIR
> > > > start_coordinator -i 240
> > > > dmtcp_launch -h $h -p $p mpiexec ./mm.o
> > > > #########################################################################
> > > >
> > > > I also have tried using --rm option in dmtcp_launch but it doesn't
> > > > work and no output at all.
> > > >
> > > > anybody tell me how to solve this please? I need help
> > > >
> > > > Regards,
> > > >
> > > > Husen
> > >
> > > --
> > > William Fox
> > >
> > > Lawrence Berkeley National Laboratory
> > > Computational Research Division