Hi Sara,

I followed the steps you wrote, but I have been unable to reproduce the
issue locally. Checkpoint and restart works fine for me. Here's what I did:
 1) Set up MPI-ULFM on node1.
 2) Run the dummy program on node1 under dmtcp.
      $ dmtcp_launch --ckpt-signal 27 mpirun -n 2 ./dummy 10 10000
 3) Take a checkpoint. I get three checkpoint images: 1 for the orterun
    process, and 2 for the two dummy MPI processes.
 4) Restart from the checkpoint images.
      $ dmtcp_restart ckpt*.dmtcp

(I was using DMTCP-2.5 top of the tree from Github.)

Is it possible to have a guest account on your system? It'll be the
most efficient way to debug this issue.

Thanks,
Rohan

On Tue, Oct 25, 2016 at 08:24:58AM -0400, Rohan Garg wrote:
> Thanks for the information, Sara. I'll try this locally and see if
> I can reproduce the error you are seeing.
>
> On Tue, Oct 25, 2016 at 03:00:41AM +0000, Sara Salem Hamouda wrote:
> > Dear Rohan,
> >
> > My sincere apologies for my late response.
> >
> > Regarding your first question: yes, your patch allowed me to checkpoint and
> > restart MPICH program over the CentOS machines.
> >
> > ULFM fails in restart on the same machines and throws the errors I sent
> > before. The following are steps you can follow to reproduce the problem:
> >
> > 1. Install MPI-ULFM dependencies (libtool, autoconf, and flex).
> >
> >    On a debian machine you can run this command:
> >
> >    sudo apt-get install libtool autoconf flex
> >
> > 2. Create a folder to install MPI-ULFM, say:
> >
> >    mkdir /home/rohan/packages/ulfm
> >
> > 3. Download MPI-ULFM:
> >
> >    hg clone https://bitbucket.org/icldistcomp/ulfm
> >
> > 4. A folder called ulfm will be downloaded by the previous step; change
> >    directory to that folder:
> >
> >    cd ulfm
> >
> > 5. Run the following commands:
> >
> >    ./autogen.pl
> >    ./configure --prefix=/home/rohan/packages/ulfm \
> >        --enable-mpi-ext=ftmpi --with-ft=mpi \
> >        --disable-io-romio --enable-contrib-no-build=vt \
> >        --with-platform=optimized \
> >        CC=gcc CXX=g++ F77=gfortran FC=gfortran
> >    make
> >    make install
> >
> > 6. Update the following environment variables:
> >
> >    export MPI=/home/rohan/packages/ulfm
> >    export PATH=$MPI/bin:$PATH
> >    export LD_LIBRARY_PATH=$MPI/lib:$LD_LIBRARY_PATH
> >
> > 7. Compile and run any program using MPI-ULFM. I attached dummy.c which I
> >    use for testing. The program repeats an all_reduce operation for a number
> >    of times given in the second parameter. The first parameter is the array
> >    size.
> >
> >    On terminal-1:
> >    dmtcp_coordinator
> >
> >    On terminal-2:
> >    mpicc dummy.c -o dummy.ulfm
> >    dmtcp_launch mpirun -n 3 ./dummy.ulfm 10 10000
> >
> > 8. Take a checkpoint, terminate, then restart
> >    On terminal-1:
> >    press 'c'
> >    press 'q'
> >    ./dmtcp_restart_script.sh
> >
> > Thanks Rohan, and sorry again for my late response.
> >
> > Best Regards,
> > Sara
> > ________________________________
> > From: Rohan Garg <rohg...@ccs.neu.edu>
> > Sent: Wednesday, October 19, 2016 6:48:30 AM
> > To: Sara Salem Hamouda
> > Cc: dmtcp-forum@lists.sourceforge.net
> > Subject: Re: [Dmtcp-forum] DMTCP MPI restart error on a single node
> >
> > Just to clarify: you are able to now checkpoint and restart MPICH programs
> > after the patch?
> >
> > For ULFM, could you send us the steps to follow to reproduce the problem
> > locally?
> >
> > On Mon, Oct 17, 2016 at 02:56:26AM +0000, Sara Salem Hamouda wrote:
> > > Dear Rohan,
> > >
> > > Thanks very much for the patch, it fixed the error raised when restarting
> > > my MPICH program over the CentOS machines.
> > >
> > > My OpenMPI-ULFM programs now raise a different error upon restart:
> > >
> > > size = 1
> > > [40000] WARNING at socketconnection.cpp:540 in postRestart;
> > > REASON='JWARNING(_real_bind(_fds[0], (sockaddr*) &_bindAddr,_bindAddrlen)
> > > == 0) failed'
> > >      (strerror((*__errno_location ()))) = Address already in use
> > >      id() = 216034594ce6504-40000-58043957(100860)
> > > Message: Bind failed.
> > > [41000] ERROR at connection.cpp:79 in restoreOptions;
> > > REASON='JASSERT(fcntl(_fds[0], F_SETFL, (int)_fcntlFlags) == 0) failed'
> > >      _fds[0] = 13
> > >      _fcntlFlags = 32770
> > >      (strerror((*__errno_location ()))) = Bad file descriptor
> > > dummy.ulfm (41000): Terminating...
> > > [40000] ERROR at connectionidentifier.h:96 in assertValid;
> > > REASON='JASSERT(strcmp(sign, HANDSHAKE_SIGNATURE_MSG) == 0) failed'
> > >      sign =
> > > Message: read invalid message, signature mismatch. (External socket?)
> > > orterun (40000): Terminating...
> > >
> > > [43000] ERROR at connection.cpp:79 in restoreOptions;
> > > REASON='JASSERT(fcntl(_fds[0], F_SETFL, (int)_fcntlFlags) == 0) failed'
> > >      _fds[0] = 13
> > >      _fcntlFlags = 32770
> > >      (strerror((*__errno_location ()))) = Bad file descriptor
> > > dummy.ulfm (43000): Terminating...
> > > [42000] ERROR at connection.cpp:79 in restoreOptions;
> > > REASON='JASSERT(fcntl(_fds[0], F_SETFL, (int)_fcntlFlags) == 0) failed'
> > >      _fds[0] = 13
> > >      _fcntlFlags = 32770
> > >      (strerror((*__errno_location ()))) = Bad file descriptor
> > > dummy.ulfm (42000): Terminating...
> > >
> > > Thanks Rohan, I really appreciate your support.
> > >
> > > Best Regards,
> > >
> > > Sara
> > >
> > > Sara S. Hamouda
> > > PhD Candidate (Computer Systems Group)
> > > College of Engineering and Computer Science
> > > The Australian National University
> > > ________________________________
> > > From: Rohan Garg <rohg...@ccs.neu.edu>
> > > Sent: Saturday, October 15, 2016 4:25:51 AM
> > > To: Sara Salem Hamouda
> > > Cc: dmtcp-forum@lists.sourceforge.net
> > > Subject: Re: [Dmtcp-forum] DMTCP MPI restart error on a single node
> > >
> > > Hi Sara,
> > >
> > > Could you please re-try after applying the following patch to
> > > the DMTCP source?
> > >
> > > diff --git a/src/util_misc.cpp b/src/util_misc.cpp
> > > index f5bc84a..86650cf 100644
> > > --- a/src/util_misc.cpp
> > > +++ b/src/util_misc.cpp
> > > @@ -633,6 +633,7 @@ bool Util::isNscdArea(const ProcMapsArea& area)
> > >    if (strStartsWith(area.name, "/run/nscd") || // OpenSUSE (newer)
> > >        strStartsWith(area.name, "/var/run/nscd") || // OpenSUSE (older)
> > >        strStartsWith(area.name, "/var/cache/nscd") || // Debian/Ubuntu
> > > +      strStartsWith(area.name, "/ram/var/run/nscd") || // CentOS-6.8
> > >        strStartsWith(area.name, "/var/db/nscd")) { // RedHat/Fedora
> > >      return true;
> > >    }
> > >
> > > Thanks,
> > > Rohan
> > >
> > > On Fri, Oct 14, 2016 at 07:02:04AM +0000, Sara Salem Hamouda wrote:
> > > >
> > > > Hi Rohan,
> > > >
> > > > I am using the latest release on github, which is DMTCP-2.4.5.
> > > > Same error received with mpirun.
> > > >
> > > > I tried another mpi implementation, called OpenMPI-ULFM
> > > > (https://bitbucket.org/icldistcomp/ulfm), which I use in my research,
> > > > and I got the same error:
> > > >
> > > > [40000] ERROR at fileconnlist.cpp:318 in recreateShmFileAndMap;
> > > > REASON='JASSERT(fd != -1 || errno == EEXIST) failed'
> > > >      area.name = /ram/var/run/nscd/dbuYHRnM
> > > > orterun (40000): Terminating...
> > > > ssh659@raijin3:~/dmtcp/dir_ckpt$ [41000] ERROR at fileconnlist.cpp:318
> > > > in recreateShmFileAndMap; REASON='JASSERT(fd != -1 || errno == EEXIST)
> > > > failed'
> > > >      area.name = /ram/var/run/nscd/dbCEJazi
> > > > dummy.ulfm (41000): Terminating...
> > > > [42000] ERROR at fileconnlist.cpp:318 in recreateShmFileAndMap;
> > > > REASON='JASSERT(fd != -1 || errno == EEXIST) failed'
> > > >      area.name = /ram/var/run/nscd/dbCEJazi
> > > > dummy.ulfm (42000): Terminating...
> > > > [43000] ERROR at fileconnlist.cpp:318 in recreateShmFileAndMap;
> > > > REASON='JASSERT(fd != -1 || errno == EEXIST) failed'
> > > >      area.name = /ram/var/run/nscd/dbCEJazi
> > > > dummy.ulfm (43000): Terminating...
> > > >
> > > > The HANDSHAKE error appeared with MPICH, but not with OpenMPI-ULFM.
> > > >
> > > > Best Regards,
> > > > Sara
> > > >
> > > > Sara S. Hamouda
> > > > PhD Candidate (Computer Systems Group)
> > > > College of Engineering and Computer Science
> > > > The Australian National University
> > > > ________________________________
> > > > From: Rohan Garg <rohg...@ccs.neu.edu>
> > > > Sent: Friday, October 14, 2016 7:11:12 AM
> > > > To: Sara Salem Hamouda
> > > > Cc: dmtcp-forum@lists.sourceforge.net
> > > > Subject: Re: [Dmtcp-forum] DMTCP MPI restart error on a single node
> > > >
> > > > Hi Sara,
> > > >
> > > > What version of DMTCP were you using? DMTCP-3.0 has some known issues
> > > > with mpich-3.2, as reported by a DMTCP user. I'd recommend trying with
> > > > DMTCP-2.5.
> > > >
> > > > Also, could you try launching your MPI program with mpirun instead of
> > > > mpiexec?
> > > >
> > > > Thanks,
> > > > Rohan
> > > >
> > > > On Wed, Oct 12, 2016 at 11:30:41AM +0000, Sara Salem Hamouda wrote:
> > > > > Dear DMTCP team,
> > > > >
> > > > > Appreciate your support regarding the below issue.
> > > > >
> > > > > I am using a single machine to learn DMTCP. The operating system is
> > > > > "CentOS release 6.8", and it uses a network file system. I run a
> > > > > simple MPI program (dummy.c), using mpich V3.2.
> > > > >
> > > > > On terminal-1:
> > > > >
> > > > > dmtcp_coordinator
> > > > >
> > > > > On terminal-2:
> > > > >
> > > > > dmtcp_launch mpiexec -n 3 ./dummy.mpich2 10 10000
> > > > >
> > > > > While dummy is running in terminal-2, I move to terminal-1 and press
> > > > > 'c', then 'q' to exit.
> > > > >
> > > > > To restart, I run the generated dmtcp_restart_script.sh script, but I
> > > > > get the error below. Would you please advise on a possible fix for
> > > > > this issue?
> > > > >
> > > > > (P.S. I tried the same steps on another machine (with Ubuntu 14.04
> > > > > OS) that has a local file system, and the restart worked
> > > > > successfully. Is there a specific configuration I should use with
> > > > > network file systems?)
> > > > >
> > > > > size = 1
> > > > > [43000] ERROR at fileconnlist.cpp:318 in recreateShmFileAndMap;
> > > > > REASON='JASSERT(fd != -1 || errno == EEXIST) failed'
> > > > >      area.name = /ram/var/run/nscd/dbbxzrxW
> > > > > dummy.mpich2 (43000): Terminating...
> > > > > [44000] ERROR at fileconnlist.cpp:318 in recreateShmFileAndMap;
> > > > > REASON='JASSERT(fd != -1 || errno == EEXIST) failed'
> > > > >      area.name = /ram/var/run/nscd/dbbxzrxW
> > > > > dummy.mpich2 (44000): Terminating...
> > > > > [42000] ERROR at fileconnlist.cpp:318 in recreateShmFileAndMap;
> > > > > REASON='JASSERT(fd != -1 || errno == EEXIST) failed'
> > > > >      area.name = /ram/var/run/nscd/dbbxzrxW
> > > > > dummy.mpich2 (42000): Terminating...
> > > > > [40000] ERROR at connectionidentifier.h:96 in assertValid;
> > > > > REASON='JASSERT(strcmp(sign, HANDSHAKE_SIGNATURE_MSG) == 0) failed'
> > > > >      sign =
> > > > > Message: read invalid message, signature mismatch. (External socket?)
> > > > > mpiexec.hydra (40000): Terminating...
> > > > > [41000] ERROR at connectionidentifier.h:96 in assertValid;
> > > > > REASON='JASSERT(strcmp(sign, HANDSHAKE_SIGNATURE_MSG) == 0) failed'
> > > > >      sign =
> > > > > Message: read invalid message, signature mismatch. (External socket?)
> > > > > hydra_pmi_proxy (41000): Terminating...
> > > > >
> > > > > Best Regards,
> > > > > Sara
> > > > >
> > > > > Sara S. Hamouda
> > > > > PhD Candidate (Computer Systems Group)
> > > > > College of Engineering and Computer Science
> > > > > The Australian National University
> > > > >
> > > > > _______________________________________________
> > > > > Dmtcp-forum mailing list
> > > > > Dmtcp-forum@lists.sourceforge.net
> > > > > https://lists.sourceforge.net/lists/listinfo/dmtcp-forum
> >
> > // Author: Wes Kendall
> > // Copyright 2013 www.mpitutorial.com
> > // This code is provided freely with the tutorials on mpitutorial.com. Feel
> > // free to modify it for your own use. Any distribution of the code must
> > // either provide a link to www.mpitutorial.com or keep this header intact.
> > //
> > // Program that computes the standard deviation of an array of elements in
> > // parallel using MPI_Reduce.
> > //
> > #include <stdio.h>
> > #include <stdlib.h>
> > #include <mpi.h>
> > #include <math.h>
> > #include <assert.h>
> > #include <time.h>
> > #include <unistd.h>
> >
> > unsigned int microseconds = 20000;
> >
> > // Creates an array of random numbers. Each number has a value from 0 - 1
> > float *create_rand_nums(int num_elements) {
> >   float *rand_nums = (float *)malloc(sizeof(float) * num_elements);
> >   assert(rand_nums != NULL);
> >   int i;
> >   for (i = 0; i < num_elements; i++) {
> >     rand_nums[i] = (rand() / (float)RAND_MAX);
> >   }
> >   return rand_nums;
> > }
> >
> > int main(int argc, char** argv) {
> >   if (argc != 3) {
> >     fprintf(stderr, "Usage: avg num_elements_per_proc repeat_times\n");
> >     exit(1);
> >   }
> >
> >   int num_elements_per_proc = atoi(argv[1]);
> >   int num_repeat = atoi(argv[2]);
> >   int repeat_id=0;
> >   MPI_Init(NULL, NULL);
> >
> >   int world_rank;
> >   MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);
> >   int world_size;
> >   MPI_Comm_size(MPI_COMM_WORLD, &world_size);
> >
> >   // Create a random array of elements on all processes.
> >   // Seed the random number generator of processes uniquely
> >   srand(time(NULL)*world_rank);
> >   float *rand_nums = NULL;
> >   rand_nums = create_rand_nums(num_elements_per_proc);
> >
> >   while (repeat_id < num_repeat) {
> >     usleep(microseconds);
> >
> >     if (world_rank == 0)
> >       printf("\repeat-%d ",repeat_id);
> >
> >     // Sum the numbers locally
> >     float local_sum = 0;
> >     int i;
> >     for (i = 0; i < num_elements_per_proc; i++) {
> >       local_sum += rand_nums[i];
> >     }
> >
> >     // Reduce all of the local sums into the global sum in order to
> >     // calculate the mean
> >     float global_sum;
> >     MPI_Allreduce(&local_sum, &global_sum, 1, MPI_FLOAT, MPI_SUM,
> >                   MPI_COMM_WORLD);
> >     float mean = global_sum / (num_elements_per_proc * world_size);
> >
> >     // Compute the local sum of the squared differences from the mean
> >     float local_sq_diff = 0;
> >     for (i = 0; i < num_elements_per_proc; i++) {
> >       local_sq_diff += (rand_nums[i] - mean) * (rand_nums[i] - mean);
> >     }
> >
> >     // Reduce the global sum of the squared differences to the root process
> >     // and print off the answer
> >     float global_sq_diff;
> >     MPI_Reduce(&local_sq_diff, &global_sq_diff, 1, MPI_FLOAT, MPI_SUM, 0,
> >                MPI_COMM_WORLD);
> >
> >     // The standard deviation is the square root of the mean of the squared
> >     // differences.
> >     if (world_rank == 0) {
> >       float stddev = sqrt(global_sq_diff /
> >                           (num_elements_per_proc * world_size));
> >       printf("Mean - %f, Standard deviation = %f\n", mean, stddev);
> >     }
> >     repeat_id++;
> >
> >   }
> >
> >   // Clean up
> >   free(rand_nums);
> >
> >   MPI_Barrier(MPI_COMM_WORLD);
> >   MPI_Finalize();
> > }
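
For anyone reading this thread later: the patch quoted above adds one more path prefix to the list that DMTCP checks in Util::isNscdArea, so that nscd regions under /ram/var/run/nscd are recognized on CentOS-6.8. The following is only a minimal C sketch of that kind of prefix check, not DMTCP's actual code (which is C++ and takes a ProcMapsArea). The names has_prefix and is_nscd_area are hypothetical; the prefix list is copied from the patch.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: returns 1 if `s` begins with `prefix`, else 0. */
static int has_prefix(const char *s, const char *prefix) {
  return strncmp(s, prefix, strlen(prefix)) == 0;
}

/* Hypothetical stand-in for the check touched by the patch: returns 1 if
 * the mapped area's name looks like an nscd cache file, else 0. */
int is_nscd_area(const char *area_name) {
  static const char *const prefixes[] = {
    "/run/nscd",         /* OpenSUSE (newer) */
    "/var/run/nscd",     /* OpenSUSE (older) */
    "/var/cache/nscd",   /* Debian/Ubuntu */
    "/ram/var/run/nscd", /* CentOS-6.8 (the prefix added by the patch) */
    "/var/db/nscd",      /* RedHat/Fedora */
  };
  size_t i;

  if (area_name == NULL) {
    return 0;
  }
  for (i = 0; i < sizeof(prefixes) / sizeof(prefixes[0]); i++) {
    if (has_prefix(area_name, prefixes[i])) {
      return 1;
    }
  }
  return 0;
}
```

With this sketch, is_nscd_area("/ram/var/run/nscd/dbuYHRnM") returns 1. Without the "/ram/var/run/nscd" entry it would return 0, because "/var/run/nscd" only matches at the start of the name. That is consistent with the recreateShmFileAndMap failures reported for /ram/var/run/nscd/... before the patch.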