Hi Rohan,
Thanks a lot for taking the time to try to reproduce my error.
From what you described, the main difference between us is the way we restart
the job: I use ./dmtcp_restart_script.sh, and you use dmtcp_restart
ckpt*.dmtcp.
Would you please try using ./dmtcp_restart_script.sh?
Best Regards,
Sara
________________________________
From: Rohan Garg <rohg...@ccs.neu.edu>
Sent: Friday, November 4, 2016 6:10:03 AM
To: Sara Salem Hamouda; dmtcp-forum@lists.sourceforge.net
Subject: Re: [Dmtcp-forum] DMTCP MPI restart error on a single node
Hi Sara,
I followed the steps you wrote, but I have been unable to reproduce the
issue locally. Checkpoint and restart work fine for me. Here's what I
did:
1) Set up MPI-ULFM on node1.
2) Run the dummy program on node1 under dmtcp.
$ dmtcp_launch --ckpt-signal 27 mpirun -n 2 ./dummy 10 10000
3) Take a checkpoint. I get three checkpoint images: 1 for the orterun
process, and 2 for the two dummy MPI processes.
4) Restart from the checkpoint images.
$ dmtcp_restart ckpt*.dmtcp
(I was using the top of the tree of DMTCP-2.5 from GitHub.)
Is it possible to have a guest account on your system? It'll be the
most efficient way to debug this issue.
Thanks,
Rohan
On Tue, Oct 25, 2016 at 08:24:58AM -0400, Rohan Garg wrote:
> Thanks for the information, Sara. I'll try this locally and see if
> I can reproduce the error you are seeing.
>
> On Tue, Oct 25, 2016 at 03:00:41AM +0000, Sara Salem Hamouda wrote:
> > Dear Rohan,
> >
> > My sincere apologies for my late response.
> >
> >
> > Regarding your first question: yes, your patch allowed me to checkpoint and
> > restart MPICH programs on the CentOS machines.
> >
> >
> > ULFM fails on restart on the same machines and throws the errors I sent
> > before. You can follow these steps to reproduce the problem:
> >
> >
> > 1. Install MPI-ULFM dependencies (libtool, autoconf, and flex).
> >
> > On a debian machine you can run this command:
> >
> > sudo apt-get install libtool autoconf flex
> >
> >
> > 2. Create a folder to install MPI-ULFM, say:
> >
> > mkdir /home/rohan/packages/ulfm
> >
> >
> > 3. Download MPI-ULFM:
> >
> > hg clone https://bitbucket.org/icldistcomp/ulfm
> >
> >
> > 4. The previous step downloads a folder called ulfm; change directory
> > to that folder:
> >
> > cd ulfm
> >
> >
> > 5. Run the following commands:
> >
> > ./autogen.pl
> > ./configure --prefix=/home/rohan/packages/ulfm \
> > --enable-mpi-ext=ftmpi --with-ft=mpi \
> > --disable-io-romio --enable-contrib-no-build=vt \
> > --with-platform=optimized \
> > CC=gcc CXX=g++ F77=gfortran FC=gfortran
> > make
> > make install
> >
> > 6. Update the following environment variables:
> >
> > export MPI=/home/rohan/packages/ulfm
> > export PATH=$MPI/bin:$PATH
> > export LD_LIBRARY_PATH=$MPI/lib:$LD_LIBRARY_PATH
> >
> > 7. Compile and run any program using MPI-ULFM. I attached dummy.c, which I
> > use for testing. The program repeats an all-reduce operation the number
> > of times given by the second parameter. The first parameter is the array
> > size.
> >
> >
> > On terminal-1:
> > dmtcp_coordinator
> >
> > On terminal-2:
> > mpicc dummy.c -o dummy.ulfm
> > dmtcp_launch mpirun -n 3 ./dummy.ulfm 10 10000
> >
> > 8. Take a checkpoint, terminate, then restart
> > On terminal-1:
> > press 'c'
> > press 'q'
> > ./dmtcp_restart_script.sh
> >
> >
> >
> > Thanks Rohan, and sorry again for my late response.
> >
> >
> > Best Regards,
> > Sara
> > ________________________________
> > From: Rohan Garg <rohg...@ccs.neu.edu>
> > Sent: Wednesday, October 19, 2016 6:48:30 AM
> > To: Sara Salem Hamouda
> > Cc: dmtcp-forum@lists.sourceforge.net
> > Subject: Re: [Dmtcp-forum] DMTCP MPI restart error on a single node
> >
> > Just to clarify: you are now able to checkpoint and restart MPICH programs
> > after the patch?
> >
> > For ULFM, could you send us the steps to follow to reproduce the problem
> > locally?
> >
> >
> > On Mon, Oct 17, 2016 at 02:56:26AM +0000, Sara Salem Hamouda wrote:
> > > Dear Rohan,
> > >
> > >
> > > Thanks very much for the patch, it fixed the error raised when restarting
> > > my MPICH program over the CentOS machines.
> > >
> > >
> > > My OpenMPI-ULFM programs now raise a different error upon restart:
> > >
> > > size = 1
> > > [40000] WARNING at socketconnection.cpp:540 in postRestart;
> > > REASON='JWARNING(_real_bind(_fds[0], (sockaddr*) &_bindAddr,_bindAddrlen)
> > > == 0) failed'
> > > (strerror((*__errno_location ()))) = Address already in use
> > > id() = 216034594ce6504-40000-58043957(100860)
> > > Message: Bind failed.
> > > [41000] ERROR at connection.cpp:79 in restoreOptions;
> > > REASON='JASSERT(fcntl(_fds[0], F_SETFL, (int)_fcntlFlags) == 0) failed'
> > > _fds[0] = 13
> > > _fcntlFlags = 32770
> > > (strerror((*__errno_location ()))) = Bad file descriptor
> > > dummy.ulfm (41000): Terminating...
> > > [40000] ERROR at connectionidentifier.h:96 in assertValid;
> > > REASON='JASSERT(strcmp(sign, HANDSHAKE_SIGNATURE_MSG) == 0) failed'
> > > sign =
> > > Message: read invalid message, signature mismatch. (External socket?)
> > > orterun (40000): Terminating...
> > >
> > > [43000] ERROR at connection.cpp:79 in restoreOptions;
> > > REASON='JASSERT(fcntl(_fds[0], F_SETFL, (int)_fcntlFlags) == 0) failed'
> > > _fds[0] = 13
> > > _fcntlFlags = 32770
> > > (strerror((*__errno_location ()))) = Bad file descriptor
> > > dummy.ulfm (43000): Terminating...
> > > [42000] ERROR at connection.cpp:79 in restoreOptions;
> > > REASON='JASSERT(fcntl(_fds[0], F_SETFL, (int)_fcntlFlags) == 0) failed'
> > > _fds[0] = 13
> > > _fcntlFlags = 32770
> > > (strerror((*__errno_location ()))) = Bad file descriptor
> > > dummy.ulfm (42000): Terminating...
> > >
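A note on the restoreOptions failures in the log above: they are fcntl(F_SETFL) failing with EBADF, which is what the call reports whenever the descriptor number is not open in the calling process. The standalone C sketch below only illustrates that behaviour. It is not DMTCP code, and restore_status_flags and make_stale_fd are hypothetical names chosen for illustration.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/* Hypothetical helper: try to reapply saved file status flags to fd.
 * Returns 0 on success, otherwise the errno value set by fcntl(). */
int restore_status_flags(int fd, int saved_flags) {
    assert(fd >= 0);
    if (fcntl(fd, F_SETFL, saved_flags) == 0) {
        return 0;
    }
    return errno;
}

/* Hypothetical helper: return a descriptor number that has just been
 * closed, so that it no longer refers to an open file. Returns -1 if
 * the pipe could not be created. */
int make_stale_fd(void) {
    int p[2];
    if (pipe(p) != 0) {
        return -1;
    }
    close(p[0]);
    close(p[1]);
    return p[0];
}
```

Calling restore_status_flags(make_stale_fd(), O_NONBLOCK) should return EBADF, which strerror() renders as "Bad file descriptor", the same text as in the log. Why fd 13 was not open at that point of the restart is a separate question that this sketch does not answer.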
> > > Thanks Rohan, I really appreciate your support.
> > >
> > >
> > > Best Regards,
> > >
> > > Sara
> > >
> > > Sara S. Hamouda
> > > PhD Candidate (Computer Systems Group)
> > > College of Engineering and Computer Science
> > > The Australian National University
> > > ________________________________
> > > From: Rohan Garg <rohg...@ccs.neu.edu>
> > > Sent: Saturday, October 15, 2016 4:25:51 AM
> > > To: Sara Salem Hamouda
> > > Cc: dmtcp-forum@lists.sourceforge.net
> > > Subject: Re: [Dmtcp-forum] DMTCP MPI restart error on a single node
> > >
> > > Hi Sara,
> > >
> > > Could you please re-try after applying the following patch to
> > > the DMTCP source?
> > >
> > > diff --git a/src/util_misc.cpp b/src/util_misc.cpp
> > > index f5bc84a..86650cf 100644
> > > --- a/src/util_misc.cpp
> > > +++ b/src/util_misc.cpp
> > > @@ -633,6 +633,7 @@ bool Util::isNscdArea(const ProcMapsArea& area)
> > > if (strStartsWith(area.name, "/run/nscd") || // OpenSUSE (newer)
> > > strStartsWith(area.name, "/var/run/nscd") || // OpenSUSE (older)
> > > strStartsWith(area.name, "/var/cache/nscd") || // Debian/Ubuntu
> > > + strStartsWith(area.name, "/ram/var/run/nscd") || // CentOS-6.8
> > > strStartsWith(area.name, "/var/db/nscd")) { // RedHat/Fedora
> > > return true;
> > > }
> > >
> > > Thanks,
> > > Rohan
> > >
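The patch above extends a prefix test on the name of a mapped file. As a rough illustration of that check in plain C (a sketch, not the DMTCP source: starts_with and is_nscd_area are hypothetical stand-ins for strStartsWith and Util::isNscdArea, and only the five prefixes come from the patch):

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for DMTCP's strStartsWith helper. */
static bool starts_with(const char *s, const char *prefix) {
    return strncmp(s, prefix, strlen(prefix)) == 0;
}

/* Path prefixes copied from the patch; the CentOS-6.8 entry is the new one. */
static const char *const NSCD_PREFIXES[] = {
    "/run/nscd",         /* OpenSUSE (newer) */
    "/var/run/nscd",     /* OpenSUSE (older) */
    "/var/cache/nscd",   /* Debian/Ubuntu */
    "/ram/var/run/nscd", /* CentOS-6.8 */
    "/var/db/nscd",      /* RedHat/Fedora */
};

/* Returns true if the mapped file name lies under a known nscd directory. */
bool is_nscd_area(const char *name) {
    assert(name != NULL);
    size_t count = sizeof NSCD_PREFIXES / sizeof NSCD_PREFIXES[0];
    for (size_t i = 0; i < count; i++) {
        if (starts_with(name, NSCD_PREFIXES[i])) {
            return true;
        }
    }
    return false;
}
```

With the CentOS-6.8 prefix in the list, a name such as /ram/var/run/nscd/dbbxzrxW (taken from the error reports in this thread) is recognised. Without it, none of the other prefixes match, because the path begins with /ram.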
> > > On Fri, Oct 14, 2016 at 07:02:04AM +0000, Sara Salem Hamouda wrote:
> > > >
> > > > Hi Rohan,
> > > >
> > > > I am using the latest release on GitHub, which is DMTCP-2.4.5.
> > > > I get the same error with mpirun.
> > > >
> > > >
> > > > I tried another MPI implementation, called OpenMPI-ULFM
> > > > (https://bitbucket.org/icldistcomp/ulfm), which I use in my research,
> > > > and I got the same error:
> > > >
> > > >
> > > > [40000] ERROR at fileconnlist.cpp:318 in recreateShmFileAndMap;
> > > > REASON='JASSERT(fd != -1 || errno == EEXIST) failed'
> > > > area.name = /ram/var/run/nscd/dbuYHRnM
> > > > orterun (40000): Terminating...
> > > > ssh659@raijin3:~/dmtcp/dir_ckpt$ [41000] ERROR at fileconnlist.cpp:318
> > > > in recreateShmFileAndMap; REASON='JASSERT(fd != -1 || errno == EEXIST)
> > > > failed'
> > > > area.name = /ram/var/run/nscd/dbCEJazi
> > > > dummy.ulfm (41000): Terminating...
> > > > [42000] ERROR at fileconnlist.cpp:318 in recreateShmFileAndMap;
> > > > REASON='JASSERT(fd != -1 || errno == EEXIST) failed'
> > > > area.name = /ram/var/run/nscd/dbCEJazi
> > > > dummy.ulfm (42000): Terminating...
> > > > [43000] ERROR at fileconnlist.cpp:318 in recreateShmFileAndMap;
> > > > REASON='JASSERT(fd != -1 || errno == EEXIST) failed'
> > > > area.name = /ram/var/run/nscd/dbCEJazi
> > > > dummy.ulfm (43000): Terminating...
> > > >
> > > > The HANDSHAKE error appeared with MPICH, but not with OpenMPI-ULFM.
> > > >
> > > >
> > > > Best Regards,
> > > > Sara
> > > >
> > > > Sara S. Hamouda
> > > > PhD Candidate (Computer Systems Group)
> > > > College of Engineering and Computer Science
> > > > The Australian National University
> > > > ________________________________
> > > > From: Rohan Garg <rohg...@ccs.neu.edu>
> > > > Sent: Friday, October 14, 2016 7:11:12 AM
> > > > To: Sara Salem Hamouda
> > > > Cc: dmtcp-forum@lists.sourceforge.net
> > > > Subject: Re: [Dmtcp-forum] DMTCP MPI restart error on a single node
> > > >
> > > > Hi Sara,
> > > >
> > > > What version of DMTCP were you using? DMTCP-3.0 has some known issues
> > > > with mpich-3.2, as reported by a DMTCP user. I'd recommend trying with
> > > > DMTCP-2.5.
> > > >
> > > > Also, could you try launching your MPI program with mpirun instead of
> > > > mpiexec?
> > > >
> > > > Thanks,
> > > > Rohan
> > > >
> > > > On Wed, Oct 12, 2016 at 11:30:41AM +0000, Sara Salem Hamouda wrote:
> > > > > Dear DMTCP team,
> > > > >
> > > > > Appreciate your support regarding the below issue.
> > > > >
> > > > >
> > > > > I am using a single machine to learn DMTCP. The operating system is
> > > > > "CentOS release 6.8", and it uses a network file system. I run a
> > > > > simple MPI program (dummy.c), using mpich V3.2.
> > > > >
> > > > >
> > > > > On terminal-1:
> > > > >
> > > > > dmtcp_coordinator
> > > > >
> > > > >
> > > > > On terminal-2:
> > > > >
> > > > > dmtcp_launch mpiexec -n 3 ./dummy.mpich2 10 10000
> > > > >
> > > > >
> > > > > While dummy is running in terminal-2, I move to terminal-1 and press
> > > > > 'c' , then 'q' to exit.
> > > > >
> > > > >
> > > > > To restart, I run the generated dmtcp_restart_script.sh script, but I
> > > > > get the error below. Would you please advise on a possible fix for
> > > > > this issue?
> > > > >
> > > > >
> > > > > (P.S. I tried the same steps on another machine (with Ubuntu 14.04
> > > > > OS) that has a local file system, and the restart worked
> > > > > successfully. Is there a specific configuration I should use with
> > > > > network file systems?)
> > > > >
> > > > >
> > > > > size = 1
> > > > > [43000] ERROR at fileconnlist.cpp:318 in recreateShmFileAndMap;
> > > > > REASON='JASSERT(fd != -1 || errno == EEXIST) failed'
> > > > > area.name = /ram/var/run/nscd/dbbxzrxW
> > > > > dummy.mpich2 (43000): Terminating...
> > > > > [44000] ERROR at fileconnlist.cpp:318 in recreateShmFileAndMap;
> > > > > REASON='JASSERT(fd != -1 || errno == EEXIST) failed'
> > > > > area.name = /ram/var/run/nscd/dbbxzrxW
> > > > > dummy.mpich2 (44000): Terminating...
> > > > > [42000] ERROR at fileconnlist.cpp:318 in recreateShmFileAndMap;
> > > > > REASON='JASSERT(fd != -1 || errno == EEXIST) failed'
> > > > > area.name = /ram/var/run/nscd/dbbxzrxW
> > > > > dummy.mpich2 (42000): Terminating...
> > > > > [40000] ERROR at connectionidentifier.h:96 in assertValid;
> > > > > REASON='JASSERT(strcmp(sign, HANDSHAKE_SIGNATURE_MSG) == 0) failed'
> > > > > sign =
> > > > > Message: read invalid message, signature mismatch. (External socket?)
> > > > > mpiexec.hydra (40000): Terminating...
> > > > > [41000] ERROR at connectionidentifier.h:96 in assertValid;
> > > > > REASON='JASSERT(strcmp(sign, HANDSHAKE_SIGNATURE_MSG) == 0) failed'
> > > > > sign =
> > > > > Message: read invalid message, signature mismatch. (External socket?)
> > > > > hydra_pmi_proxy (41000): Terminating...
> > > > >
> > > > >
> > > > >
> > > > > Best Regards,
> > > > > Sara
> > > > >
> > > > > Sara S. Hamouda
> > > > > PhD Candidate (Computer Systems Group)
> > > > > College of Engineering and Computer Science
> > > > > The Australian National University
> > > >
> > > >
> > > > > _______________________________________________
> > > > > Dmtcp-forum mailing list
> > > > > Dmtcp-forum@lists.sourceforge.net
> > > > > https://lists.sourceforge.net/lists/listinfo/dmtcp-forum
>
> > // Author: Wes Kendall
> > // Copyright 2013 www.mpitutorial.com
> > // This code is provided freely with the tutorials on mpitutorial.com. Feel
> > // free to modify it for your own use. Any distribution of the code must
> > // either provide a link to www.mpitutorial.com or keep this header intact.
> > //
> > // Program that computes the standard deviation of an array of elements in
> > // parallel using MPI_Reduce.
> > //
> > #include <stdio.h>
> > #include <stdlib.h>
> > #include <mpi.h>
> > #include <math.h>
> > #include <assert.h>
> > #include <time.h>    // time(), used to seed rand()
> > #include <unistd.h>  // usleep()
> >
> > // Delay between iterations, so there is time to take a checkpoint
> > unsigned int microseconds = 20000;
> >
> > // Creates an array of random numbers. Each number has a value from 0 - 1
> > float *create_rand_nums(int num_elements) {
> >   float *rand_nums = (float *)malloc(sizeof(float) * num_elements);
> >   assert(rand_nums != NULL);
> >   int i;
> >   for (i = 0; i < num_elements; i++) {
> >     rand_nums[i] = (rand() / (float)RAND_MAX);
> >   }
> >   return rand_nums;
> > }
> >
> > int main(int argc, char** argv) {
> >   if (argc != 3) {
> >     fprintf(stderr, "Usage: avg num_elements_per_proc repeat_times\n");
> >     exit(1);
> >   }
> >
> >   int num_elements_per_proc = atoi(argv[1]);
> >   int num_repeat = atoi(argv[2]);
> >   int repeat_id = 0;
> >   MPI_Init(NULL, NULL);
> >
> >   int world_rank;
> >   MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);
> >   int world_size;
> >   MPI_Comm_size(MPI_COMM_WORLD, &world_size);
> >
> >   // Create a random array of elements on all processes.
> >   // Seed the random number generator of processes uniquely.
> >   srand(time(NULL) * world_rank);
> >   float *rand_nums = NULL;
> >   rand_nums = create_rand_nums(num_elements_per_proc);
> >
> >   while (repeat_id < num_repeat) {
> >     usleep(microseconds);
> >
> >     // Only rank 0 prints the iteration counter
> >     if (world_rank == 0)
> >       printf("repeat-%d ", repeat_id);
> >
> >     // Sum the numbers locally
> >     float local_sum = 0;
> >     int i;
> >     for (i = 0; i < num_elements_per_proc; i++) {
> >       local_sum += rand_nums[i];
> >     }
> >
> >     // Reduce all of the local sums into the global sum in order to
> >     // calculate the mean
> >     float global_sum;
> >     MPI_Allreduce(&local_sum, &global_sum, 1, MPI_FLOAT, MPI_SUM,
> >                   MPI_COMM_WORLD);
> >     float mean = global_sum / (num_elements_per_proc * world_size);
> >
> >     // Compute the local sum of the squared differences from the mean
> >     float local_sq_diff = 0;
> >     for (i = 0; i < num_elements_per_proc; i++) {
> >       local_sq_diff += (rand_nums[i] - mean) * (rand_nums[i] - mean);
> >     }
> >
> >     // Reduce the global sum of the squared differences to the root process
> >     // and print off the answer
> >     float global_sq_diff;
> >     MPI_Reduce(&local_sq_diff, &global_sq_diff, 1, MPI_FLOAT, MPI_SUM, 0,
> >                MPI_COMM_WORLD);
> >
> >     // The standard deviation is the square root of the mean of the squared
> >     // differences.
> >     if (world_rank == 0) {
> >       float stddev = sqrt(global_sq_diff /
> >                           (num_elements_per_proc * world_size));
> >       printf("Mean - %f, Standard deviation = %f\n", mean, stddev);
> >     }
> >     repeat_id++;
> >   }
> >
> >   // Clean up
> >   free(rand_nums);
> >
> >   MPI_Barrier(MPI_COMM_WORLD);
> >   MPI_Finalize();
> >   return 0;
> > }
> >
_______________________________________________
Dmtcp-forum mailing list
Dmtcp-forum@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/dmtcp-forum