Hi Gene, Rohan,

Thank you for your reply and the step-by-step guidance!
Sorry, I made an incorrect statement about the --rm option. What I meant was the --mca option; it seems to be available only in Open MPI. Here are the details of my setup:

1. Attached to this email are 6 files (dmtcp_MPICH_launch.job, dmtcp_MPICH_result.out, dmtcp_openmpi_launch.job, dmtcp_openmpi_result.out, mm.c and slurm.conf).

2. I use dmtcp-2.4.4.

3. I use Ethernet as my cluster interconnect.

4. I have tried MVAPICH2-2.2b, MPICH-3.2 and OpenMPI-1.10.2. Now I use OpenMPI-1.6.

5. I configured openmpi-1.6 with the following command:

       ./configure --with-slurm --with-ft=cr --with-blcr=/usr/local --enable-orterun-prefix-by-default

   I don't know exactly what --enable-orterun-prefix-by-default does, but if I don't specify it, the mpirun command doesn't work.

6. I am just trying to checkpoint a simple MPI matrix multiplication. I attached the code, named mm.c, as mentioned in point 1. I don't know whether the MPI matrix multiplication does anything with datagram sockets or not. I use SLURM to submit the job; as far as I know, Slurm uses MUNGE as its authentication program.

7. Yes, there are some checkpoint files created, and yes, I see the restart script. Here are the details:
   - dmtcp_command.[JOBID]
   - 8 ckpt_mm.o_*.dmtcp files (I use 8 processes)
   - dmtcp_restart_script.sh
   - dmtcp_restart_script_[combination_of_numbers_and_letters].sh
   - ckpt_orterun_[combination_of_numbers_and_letters].dmtcp
   - ckpt_orted_[combination_of_numbers_and_letters].dmtcp
   - ckpt_dmtcp_srun_helper_[combination_of_numbers_and_letters].dmtcp
   - a directory named ckpt_orted_[combination_of_numbers_and_letters]_files
   - a directory named ckpt_orterun_[combination_of_numbers_and_letters]_files

   Was the checkpoint successful? If it was, what about the warning and the error messages in the output file?

8. So far, I have only a little restart experience. I'll try to restart manually using the following commands, as you suggested (see also the consolidated sketch after the quoted thread below):

       dmtcp_restart -h <coord-host> -p <coord-port> ckpt_image1.dmtcp
       dmtcp_restart -h <coord-host> -p <coord-port> ckpt_image2.dmtcp
       ...

9. About a guest account, I have to consult the administrator.

10. I use Ubuntu 14.04 LTS.

That's all the information about my setup. If there is any other information I have to provide, just let me know. Thank you in advance.

Regards,
Husen

On Fri, May 20, 2016 at 7:40 PM, Rohan Garg <rohg...@ccs.neu.edu> wrote:
> Hi Husen,
>
> I'll start with some basic questions about your setup.
>
> - Could you share with us your launch scripts for MPICH and OpenMPI?
> - What DMTCP version are you using?
> - Do you have InfiniBand on your setup? If yes, then you'd need to
>   configure DMTCP with IB support (`./configure --enable-infiniband-support`)
>   and use the `--ib` flag with dmtcp_launch.
>
> Next, you wrote:
>
> > I have tried to use --rm in mpich-3.2, and it doesn't work. mpich-3.2
> > doesn't recognize the --rm option.
>
> The `--rm` flag is a `dmtcp_launch` option; it's not an MPICH option.
>
> You seem to be seeing two kinds of warnings:
>
> a) "Still draining socket... perhaps remote host is not running under
>    DMTCP"; and
> b) "Datagram Sockets not supported. Hopefully, this is a short lived
>    connection".
>
> The first one indicates that there are sockets in your process going
> out to entities not running under DMTCP. I think this could be specific
> to your SLURM/MPI setup.
>
> The second warning could imply many different things. I haven't
> usually seen MPIs using datagram sockets. Datagram sockets are not
> supported in DMTCP out-of-the-box. Is your application doing that?
> Are you trying to checkpoint a GUI-based application?
>
> In either case, the warnings are not fatal, or at least, not
> immediately fatal. However, the warnings could lead to other issues
> that arise at restart time.
>
> Moving forward ...
>
> I think the first thing you need to do is to verify whether the
> checkpoint was "successful".
>
> If the checkpoint was "successful", you should see checkpoint images
> corresponding to each MPI rank, i.e., there should be one checkpoint
> image (a *.dmtcp file) per MPI process. Do you see that? Do you see a
> restart script?
>
> The next step would be to verify the restart part.
>
> The restart script is a little tricky and might need some modifications
> depending on your setup. In other words, don't rely on it to work
> out-of-the-box. You could try to restart the computation manually
> to isolate the issue. Here's how I would do it:
>
> - Allocate N interactive nodes. N could be 1 or more; it's easier to
>   debug with 1 node, assuming you have enough RAM on the node.
> - Start dmtcp_coordinator: you could start it on the head node or on
>   one of the allocated compute nodes.
> - ssh to an allocated node, and manually execute the restart commands:
>
>       dmtcp_restart -h <coord-host> -p <coord-port> ckpt_image1.dmtcp
>       dmtcp_restart -h <coord-host> -p <coord-port> ckpt_image2.dmtcp
>       ...
>
> The only thing you need to ensure when doing this manually is that
> the MPI ranks that were sharing a node prior to checkpointing are
> restarted on one node. This is because the MPI processes might be
> using (SysV) shared memory for intra-node communication. On restart,
> DMTCP will try to restore the shared-memory region and will fail if
> the processes are not restarted on one node.
>
> Finally, I think what you are seeing is because of some configuration
> issue. We have tested with different MPIs recently and it works. I
> could be wrong, though.
>
> Would it be possible for you to give us a guest account for debugging
> on your setup? It'll be the most efficient way of resolving this.
>
> -Rohan
>
> On Fri, May 20, 2016 at 06:32:09PM +0700, Husen R wrote:
> > Hi Gene,
> >
> > Thank you for your reply!
> >
> > I have tried to use --rm in mpich-3.2, and it doesn't work; mpich-3.2
> > doesn't recognize the --rm option.
> > I don't know exactly what the difference is between mpich-3.2 and mpich2.
> >
> > Recently I tried to use openmpi-1.6 to checkpoint an MPI application
> > using dmtcp and slurm, but I got the following error:
> >
> > [40000] WARNING at socketconnection.cpp:187 in TcpConnection;
> > REASON='JWARNING(false) failed'
> >      type = 2
> > Message: Datagram Sockets not supported. Hopefully, this is a short lived
> > connection!
> > [the same warning repeats for pids 46000, 50000, 45000, and 48000]
> > ...
> > [41000] ERROR at dmtcpmessagetypes.cpp:65 in assertValid;
> > REASON='JASSERT(strcmp ( DMTCP_MAGIC_STRING,_magicBits ) == 0) failed'
> >      _magicBits =
> > Message: read invalid message, _magicBits mismatch. Did DMTCP coordinator
> > die uncleanly?
> > dmtcp_srun_helper (41000): Terminating...
> >
> > In addition, the slurm_restart.job seems not to be working at all.
> > I need help.
> > Thank you in advance,
> >
> > Regards,
> >
> > Husen
> >
> > On Fri, May 20, 2016 at 5:36 PM, Gene Cooperman <g...@ccs.neu.edu> wrote:
> >
> > > Hi William and Husen,
> > >     As far as I know, the combination "--rm --ib" should work with
> > > the major MPI implementations: Open MPI, MVAPICH2, Intel MPI, MPICH2.
> > > But I'm not sure which ones we've tested with very recently.
> > > I'm pretty sure that we've used MVAPICH2 and Open MPI in this way.
> > >
> > > Jiajun and Rohan,
> > >     Could you confirm which implementations you've used _with the
> > > "--rm --ib" combination_? If it's not working with one of the
> > > major MPI implementations, we need to fix that.
> > >
> > > Thanks,
> > > - Gene
> > >
> > > On Thu, May 19, 2016 at 03:42:06PM -0700, William Fox wrote:
> > > > At least for me (I am not a developer of dmtcp), I was forced to
> > > > switch to openmpi (version 1.6 specifically) in order to get --rm
> > > > to work correctly. What version of mpi are you running? In addition,
> > > > if you are using infiniband, --ib will need to be installed and
> > > > utilized in order to accomplish a restart.
> > > >
> > > > On Wed, May 18, 2016 at 1:15 AM, Husen R <hus...@gmail.com> wrote:
> > > >
> > > > > Dear all,
> > > > >
> > > > > I have tried to checkpoint an MPI application using dmtcp, but I
> > > > > failed with the following error message:
> > > > >
> > > > > [40000] WARNING at kernelbufferdrainer.cpp:124 in onTimeoutInterval;
> > > > > REASON='JWARNING(false) failed'
> > > > >      _dataSockets[i]->socket().sockfd() = 9
> > > > >      buffer.size() = 0
> > > > >      WARN_INTERVAL_SEC = 10
> > > > > Message: Still draining socket... perhaps remote host is not
> > > > > running under DMTCP?
> > > > > [the same warning repeats for sockfd = 7]
> > > > > ......
> > > > >
> > > > > I use this sbatch script to submit the job:
> > > > >
> > > > > #####################################SBATCH###########################
> > > > > #!/bin/bash
> > > > > # Put your SLURM options here
> > > > > #SBATCH --partition=comeon
> > > > > #SBATCH --time=01:15:00
> > > > > #SBATCH --nodes=2
> > > > > #SBATCH --ntasks-per-node=4
> > > > > #SBATCH --job-name="dmtcp_job"
> > > > > #SBATCH --output=dmtcp_ckpt_img/dmtcp-%j.out
> > > > >
> > > > > start_coordinator()
> > > > > {
> > > > >     fname=dmtcp_command.$SLURM_JOBID
> > > > >     h=$(hostname)
> > > > >     check_coordinator=$(which dmtcp_coordinator)
> > > > >
> > > > >     if [ -z "$check_coordinator" ]; then
> > > > >         echo "No dmtcp_coordinator found. Check your DMTCP installation and PATH settings."
> > > > >         exit 0
> > > > >     fi
> > > > >
> > > > >     dmtcp_coordinator --daemon --exit-on-last -p 0 --port-file $fname $@ 1>/dev/null 2>&1
> > > > >
> > > > >     p=`cat $fname`
> > > > >     chmod +x $fname
> > > > >     echo "#!/bin/bash" > $fname
> > > > >     echo >> $fname
> > > > >     echo "export PATH=$PATH" >> $fname
> > > > >     echo "export DMTCP_COORD_HOST=$h" >> $fname
> > > > >     echo "export DMTCP_COORD_PORT=$p" >> $fname
> > > > >     echo "dmtcp_command \$@" >> $fname
> > > > >
> > > > >     # Set up local environment for DMTCP
> > > > >     export DMTCP_COORD_HOST=$h
> > > > >     export DMTCP_COORD_PORT=$p
> > > > > }
> > > > >
> > > > > cd $SLURM_SUBMIT_DIR
> > > > > start_coordinator -i 240
> > > > > dmtcp_launch -h $h -p $p mpiexec ./mm.o
> > > > > #########################################################################
> > > > >
> > > > > I have also tried using the --rm option in dmtcp_launch, but it
> > > > > doesn't work and there is no output at all.
> > > > >
> > > > > Can anybody tell me how to solve this, please? I need help.
> > > > >
> > > > > Regards,
> > > > >
> > > > > Husen
> > > >
> > > > --
> > > > William Fox
> > > >
> > > > Lawrence Berkeley National Laboratory
> > > > Computational Research Division
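As referenced in point 8 above, here is a consolidated sketch of the manual restart sequence Rohan describes. It is only a sketch under assumptions: node1 and node2 are placeholder hostnames (not from my actual setup), the checkpoint images are assumed to be in the current directory, and the ckpt_image*.dmtcp names stand in for the real image file names from point 7.

    #!/bin/bash
    # Sketch of the manual restart sequence described in this thread.
    # Assumptions: images are in $PWD; node1/node2 are placeholder hostnames.

    # 1. Start a coordinator on this host; -p 0 picks a free port and
    #    --port-file records it (same idiom as the start_coordinator function).
    dmtcp_coordinator --daemon --exit-on-last -p 0 --port-file coord.port
    COORD_HOST=$(hostname)
    COORD_PORT=$(cat coord.port)

    # 2. Restart each checkpoint image against that coordinator, keeping
    #    ranks that shared a node before the checkpoint together on one
    #    node (required because of the SysV shared-memory regions).
    ssh node1 "cd $PWD && dmtcp_restart -h $COORD_HOST -p $COORD_PORT ckpt_image1.dmtcp" &
    ssh node2 "cd $PWD && dmtcp_restart -h $COORD_HOST -p $COORD_PORT ckpt_image2.dmtcp" &
    wait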
dmtcp_MPICH_launch.job
Description: Binary data
dmtcp_MPICH_result.out
Description: Binary data
dmtcp_openmpi_launch.job
Description: Binary data
dmtcp_openmpi_result.out
Description: Binary data
/*
 * File:   mpi_mm.c
 * Author: node1
 *
 * Created on March 15, 2015, 12:15 PM
 */
#include <stdio.h>
#include <stdlib.h>
#include "mpi.h"

#define NRA 6000
#define NCA 6000
#define NCB 6000
#define MASTER 0
#define FROM_MASTER 1
#define FROM_WORKER 2

double a[NRA][NCA], b[NCA][NCB], c[NRA][NCB];

int main(int argc, char *argv[])
{
    double start, time;
    int numtasks, taskid, numworkers, source, dest, mtype, rows, averow,
        extra, offset, name_len, i, j, k;
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &taskid);
    MPI_Comm_size(MPI_COMM_WORLD, &numtasks);
    if (numtasks < 2) {
        printf("Need at least two MPI tasks. Quitting...\n");
        MPI_Abort(MPI_COMM_WORLD, 1);  /* abort with a nonzero error code */
        exit(1);
    }
    numworkers = numtasks - 1;
    start = MPI_Wtime();

    /**************************** master task ****************************/
    if (taskid == MASTER) {
        printf("Matrix multiplication (%d x %d) has started with %d tasks.\n",
               NRA, NCB, numtasks);
        printf("Initializing arrays...\n");
        for (i = 0; i < NRA; i++)
            for (j = 0; j < NCA; j++)
                a[i][j] = i + j;
        for (i = 0; i < NCA; i++)
            for (j = 0; j < NCB; j++)
                b[i][j] = i * j;

        /* Send matrix data to the worker tasks */
        averow = NRA / numworkers;
        extra = NRA % numworkers;
        offset = 0;
        mtype = FROM_MASTER;
        for (dest = 1; dest <= numworkers; dest++) {
            rows = (dest <= extra) ? averow + 1 : averow;
            printf("Sending %d rows to task %d offset=%d\n", rows, dest, offset);
            MPI_Send(&offset, 1, MPI_INT, dest, mtype, MPI_COMM_WORLD);
            MPI_Send(&rows, 1, MPI_INT, dest, mtype, MPI_COMM_WORLD);
            MPI_Send(&a[offset][0], rows*NCA, MPI_DOUBLE, dest, mtype, MPI_COMM_WORLD);
            MPI_Send(&b, NCA*NCB, MPI_DOUBLE, dest, mtype, MPI_COMM_WORLD);
            offset = offset + rows;
        }

        /* Receive results from worker tasks */
        mtype = FROM_WORKER;
        for (i = 1; i <= numworkers; i++) {
            source = i;
            MPI_Recv(&offset, 1, MPI_INT, source, mtype, MPI_COMM_WORLD, &status);
            MPI_Recv(&rows, 1, MPI_INT, source, mtype, MPI_COMM_WORLD, &status);
            MPI_Recv(&c[offset][0], rows*NCB, MPI_DOUBLE, source, mtype,
                     MPI_COMM_WORLD, &status);
            printf("Received results from task %d\n", source);
        }
        time = MPI_Wtime() - start;
        printf("Time : %.6f\n", time);

        /* Optionally print the result matrix (disabled):
        printf("Result Matrix:\n");
        for (i = 0; i < NRA; i++) {
            printf("\n");
            for (j = 0; j < NCB; j++)
                printf("%6.2f ", c[i][j]);
        }
        printf("\nDone.\n");
        */
    }

    /**************************** worker task ****************************/
    if (taskid > MASTER) {
        char processor_name[MPI_MAX_PROCESSOR_NAME];
        MPI_Get_processor_name(processor_name, &name_len);
        printf("Processing taskid %d on %s\n", taskid, processor_name);

        mtype = FROM_MASTER;
        MPI_Recv(&offset, 1, MPI_INT, MASTER, mtype, MPI_COMM_WORLD, &status);
        MPI_Recv(&rows, 1, MPI_INT, MASTER, mtype, MPI_COMM_WORLD, &status);
        MPI_Recv(&a, rows*NCA, MPI_DOUBLE, MASTER, mtype, MPI_COMM_WORLD, &status);
        MPI_Recv(&b, NCA*NCB, MPI_DOUBLE, MASTER, mtype, MPI_COMM_WORLD, &status);

        for (k = 0; k < NCB; k++)
            for (i = 0; i < rows; i++) {
                c[i][k] = 0.0;
                for (j = 0; j < NCA; j++)
                    c[i][k] = c[i][k] + a[i][j] * b[j][k];
            }

        mtype = FROM_WORKER;
        MPI_Send(&offset, 1, MPI_INT, MASTER, mtype, MPI_COMM_WORLD);
        MPI_Send(&rows, 1, MPI_INT, MASTER, mtype, MPI_COMM_WORLD);
        MPI_Send(&c, rows*NCB, MPI_DOUBLE, MASTER, mtype, MPI_COMM_WORLD);
    }

    MPI_Finalize();
    return 0;
}
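For reference, here is a minimal sketch of how the attached program can be built and launched under DMTCP, following the job script quoted earlier in the thread. The mpicc flags and the explicit -n 8 (matching the 8 processes from point 7) are assumptions, not the exact contents of the attached .job files.

    # Compile with the MPI compiler wrapper; the output name mm.o matches
    # the dmtcp_launch line in the quoted job script.
    mpicc -O2 -o mm.o mm.c

    # Launch under DMTCP; DMTCP_COORD_HOST/DMTCP_COORD_PORT are exported
    # by the start_coordinator function in the quoted job script.
    dmtcp_launch -h $DMTCP_COORD_HOST -p $DMTCP_COORD_PORT mpiexec -n 8 ./mm.o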
slurm.conf
Description: Binary data