Hi Rohan,

 Thanks a lot for taking the time to try to reproduce my error.


From what you described, the main difference between us is the way we restart
the job: I use ./dmtcp_restart_script.sh, and you use dmtcp_restart
ckpt*.dmtcp.

Would you please try restarting with ./dmtcp_restart_script.sh?


Best Regards,
Sara
________________________________
From: Rohan Garg <rohg...@ccs.neu.edu>
Sent: Friday, November 4, 2016 6:10:03 AM
To: Sara Salem Hamouda; dmtcp-forum@lists.sourceforge.net
Subject: Re: [Dmtcp-forum] DMTCP MPI restart error on a single node

Hi Sara,

I followed the steps you wrote, but I have been unable to reproduce the
issue locally. Checkpoint and restart work fine for me. Here's what I
did:

 1) Set up MPI-ULFM on node1.

 2) Run the dummy program on node1 under dmtcp.

      $ dmtcp_launch --ckpt-signal 27 mpirun -n 2 ./dummy 10 10000

 3) Take a checkpoint. I get three checkpoint images: 1 for the orterun
    process, and 2 for the two dummy MPI processes.

 4) Restart from the checkpoint images.

      $ dmtcp_restart ckpt*.dmtcp

(I was using DMTCP-2.5, top of the tree, from GitHub.)

Would it be possible to get a guest account on your system? That would be
the most efficient way to debug this issue.

Thanks,
Rohan


On Tue, Oct 25, 2016 at 08:24:58AM -0400, Rohan Garg wrote:
> Thanks for the information, Sara. I'll try this locally and see if
> I can reproduce the error you are seeing.
>
> On Tue, Oct 25, 2016 at 03:00:41AM +0000, Sara Salem Hamouda wrote:
> > Dear Rohan,
> >
> >    My sincere apologies for my late response.
> >
> >
> > Regarding your first question: yes, your patch allowed me to checkpoint and 
> > restart an MPICH program on the CentOS machines.
> >
> >
> > ULFM still fails on restart on the same machines and throws the errors I 
> > sent before. The following steps reproduce the problem:
> >
> >
> > 1. Install MPI-ULFM dependencies (libtool, autoconf, and flex).
> >
> > On a debian machine you can run this command:
> >
> > sudo apt-get install libtool autoconf flex
> >
> >
> > 2. Create a folder to install MPI-ULFM, say:
> >
> > mkdir /home/rohan/packages/ulfm
> >
> >
> > 3. Download MPI-ULFM:
> >
> > hg clone https://bitbucket.org/icldistcomp/ulfm
> >
> >
> > 4. The previous step creates a folder called ulfm; change directory to it:
> >
> > cd ulfm
> >
> >
> > 5. Run the following commands:
> >
> > ./autogen.pl
> > ./configure --prefix=/home/rohan/packages/ulfm \
> >        --enable-mpi-ext=ftmpi --with-ft=mpi \
> >        --disable-io-romio --enable-contrib-no-build=vt \
> >        --with-platform=optimized \
> >        CC=gcc CXX=g++ F77=gfortran FC=gfortran
> > make
> > make install
> >
> > 6. Update the following environment variables:
> >
> > export MPI=/home/rohan/packages/ulfm
> > export PATH=$MPI/bin:$PATH
> > export LD_LIBRARY_PATH=$MPI/lib:$LD_LIBRARY_PATH
> >
> > 7. Compile and run any program using MPI-ULFM. I have attached dummy.c, 
> > which I use for testing. The program repeats an all-reduce operation the 
> > number of times given by the second parameter; the first parameter is the 
> > array size.
> >
> >
> > On terminal-1:
> > dmtcp_coordinator
> >
> > On terminal-2:
> > mpicc dummy.c -o dummy.ulfm
> > dmtcp_launch mpirun -n 3 ./dummy.ulfm 10 10000
> >
> > 8. Take a checkpoint, terminate, then restart
> > On terminal-1:
> > press 'c'
> > press 'q'
> > ./dmtcp_restart_script.sh
> >
> >
> >
> > Thanks Rohan, and sorry again for my late response.
> >
> >
> > Best Regards,
> > Sara
> > ________________________________
> > From: Rohan Garg <rohg...@ccs.neu.edu>
> > Sent: Wednesday, October 19, 2016 6:48:30 AM
> > To: Sara Salem Hamouda
> > Cc: dmtcp-forum@lists.sourceforge.net
> > Subject: Re: [Dmtcp-forum] DMTCP MPI restart error on a single node
> >
> > Just to clarify: you are able to now checkpoint and restart MPICH programs
> > after the patch?
> >
> > For ULFM, could you send us the steps to follow to reproduce the problem
> > locally?
> >
> >
> > On Mon, Oct 17, 2016 at 02:56:26AM +0000, Sara Salem Hamouda wrote:
> > > Dear Rohan,
> > >
> > >
> > > Thanks very much for the patch; it fixed the error raised when restarting 
> > > my MPICH program on the CentOS machines.
> > >
> > >
> > > My OpenMPI-ULFM programs now raise a different error upon restart:
> > >
> > > size = 1
> > > [40000] WARNING at socketconnection.cpp:540 in postRestart; 
> > > REASON='JWARNING(_real_bind(_fds[0], (sockaddr*) &_bindAddr,_bindAddrlen) 
> > > == 0) failed'
> > >      (strerror((*__errno_location ()))) = Address already in use
> > >      id() = 216034594ce6504-40000-58043957(100860)
> > > Message: Bind failed.
> > > [41000] ERROR at connection.cpp:79 in restoreOptions; 
> > > REASON='JASSERT(fcntl(_fds[0], F_SETFL, (int)_fcntlFlags) == 0) failed'
> > >      _fds[0] = 13
> > >      _fcntlFlags = 32770
> > >      (strerror((*__errno_location ()))) = Bad file descriptor
> > > dummy.ulfm (41000): Terminating...
> > > [40000] ERROR at connectionidentifier.h:96 in assertValid; 
> > > REASON='JASSERT(strcmp(sign, HANDSHAKE_SIGNATURE_MSG) == 0) failed'
> > >      sign =
> > > Message: read invalid message, signature mismatch. (External socket?)
> > > orterun (40000): Terminating...
> > >
> > > [43000] ERROR at connection.cpp:79 in restoreOptions; 
> > > REASON='JASSERT(fcntl(_fds[0], F_SETFL, (int)_fcntlFlags) == 0) failed'
> > >      _fds[0] = 13
> > >      _fcntlFlags = 32770
> > >      (strerror((*__errno_location ()))) = Bad file descriptor
> > > dummy.ulfm (43000): Terminating...
> > > [42000] ERROR at connection.cpp:79 in restoreOptions; 
> > > REASON='JASSERT(fcntl(_fds[0], F_SETFL, (int)_fcntlFlags) == 0) failed'
> > >      _fds[0] = 13
> > >      _fcntlFlags = 32770
> > >      (strerror((*__errno_location ()))) = Bad file descriptor
> > > dummy.ulfm (42000): Terminating...
> > >
> > > Thanks Rohan, I really appreciate your support.
> > >
> > >
> > > Best Regards,
> > >
> > > Sara
> > >
> > > Sara S. Hamouda
> > > PhD Candidate (Computer Systems Group)
> > > College of Engineering and Computer Science
> > > The Australian National University
> > > ________________________________
> > > From: Rohan Garg <rohg...@ccs.neu.edu>
> > > Sent: Saturday, October 15, 2016 4:25:51 AM
> > > To: Sara Salem Hamouda
> > > Cc: dmtcp-forum@lists.sourceforge.net
> > > Subject: Re: [Dmtcp-forum] DMTCP MPI restart error on a single node
> > >
> > > Hi Sara,
> > >
> > > Could you please re-try after applying the following patch to
> > > the DMTCP source?
> > >
> > > diff --git a/src/util_misc.cpp b/src/util_misc.cpp
> > > index f5bc84a..86650cf 100644
> > > --- a/src/util_misc.cpp
> > > +++ b/src/util_misc.cpp
> > > @@ -633,6 +633,7 @@ bool Util::isNscdArea(const ProcMapsArea& area)
> > >    if (strStartsWith(area.name, "/run/nscd") || // OpenSUSE (newer)
> > >        strStartsWith(area.name, "/var/run/nscd") || // OpenSUSE (older)
> > >        strStartsWith(area.name, "/var/cache/nscd") || // Debian/Ubuntu
> > > +      strStartsWith(area.name, "/ram/var/run/nscd") || // CentOS-6.8
> > >        strStartsWith(area.name, "/var/db/nscd")) { // RedHat/Fedora
> > >      return true;
> > >    }
> > >
> > > Thanks,
> > > Rohan
> > >
> > > On Fri, Oct 14, 2016 at 07:02:04AM +0000, Sara Salem Hamouda wrote:
> > > >
> > > > Hi Rohan,
> > > >
> > > >     I am using the latest release on GitHub, which is DMTCP-2.4.5. I 
> > > > receive the same error with mpirun.
> > > >
> > > >
> > > > I tried another MPI implementation, OpenMPI-ULFM 
> > > > (https://bitbucket.org/icldistcomp/ulfm), which I use in my research, 
> > > > and I got the same error:
> > > >
> > > >
> > > > [40000] ERROR at fileconnlist.cpp:318 in recreateShmFileAndMap; 
> > > > REASON='JASSERT(fd != -1 || errno == EEXIST) failed'
> > > >      area.name = /ram/var/run/nscd/dbuYHRnM
> > > > orterun (40000): Terminating...
> > > > ssh659@raijin3:~/dmtcp/dir_ckpt$ [41000] ERROR at fileconnlist.cpp:318 
> > > > in recreateShmFileAndMap; REASON='JASSERT(fd != -1 || errno == EEXIST) 
> > > > failed'
> > > >      area.name = /ram/var/run/nscd/dbCEJazi
> > > > dummy.ulfm (41000): Terminating...
> > > > [42000] ERROR at fileconnlist.cpp:318 in recreateShmFileAndMap; 
> > > > REASON='JASSERT(fd != -1 || errno == EEXIST) failed'
> > > >      area.name = /ram/var/run/nscd/dbCEJazi
> > > > dummy.ulfm (42000): Terminating...
> > > > [43000] ERROR at fileconnlist.cpp:318 in recreateShmFileAndMap; 
> > > > REASON='JASSERT(fd != -1 || errno == EEXIST) failed'
> > > >      area.name = /ram/var/run/nscd/dbCEJazi
> > > > dummy.ulfm (43000): Terminating...
> > > >
> > > > The HANDSHAKE error appeared with MPICH, but not with OpenMPI-ULFM.
> > > >
> > > >
> > > > Best Regards,
> > > > Sara
> > > >
> > > > Sara S. Hamouda
> > > > PhD Candidate (Computer Systems Group)
> > > > College of Engineering and Computer Science
> > > > The Australian National University
> > > > ________________________________
> > > > From: Rohan Garg <rohg...@ccs.neu.edu>
> > > > Sent: Friday, October 14, 2016 7:11:12 AM
> > > > To: Sara Salem Hamouda
> > > > Cc: dmtcp-forum@lists.sourceforge.net
> > > > Subject: Re: [Dmtcp-forum] DMTCP MPI restart error on a single node
> > > >
> > > > Hi Sara,
> > > >
> > > > What version of DMTCP were you using? DMTCP-3.0 has some known issues
> > > > with mpich-3.2, as reported by a DMTCP user. I'd recommend trying with
> > > > DMTCP-2.5.
> > > >
> > > > Also, could you try launching your MPI program with mpirun instead of
> > > > mpiexec?
> > > >
> > > > Thanks,
> > > > Rohan
> > > >
> > > > On Wed, Oct 12, 2016 at 11:30:41AM +0000, Sara Salem Hamouda wrote:
> > > > > Dear DMTCP team,
> > > > >
> > > > >   I would appreciate your support regarding the issue below.
> > > > >
> > > > >
> > > > > I am using a single machine to learn DMTCP. The operating system is 
> > > > > "CentOS release 6.8", and it uses a network file system. I run a 
> > > > > simple MPI program (dummy.c), using mpich V3.2.
> > > > >
> > > > >
> > > > > On terminal-1:
> > > > >
> > > > > dmtcp_coordinator
> > > > >
> > > > >
> > > > > On terminal-2:
> > > > >
> > > > > dmtcp_launch mpiexec -n 3 ./dummy.mpich2 10 10000
> > > > >
> > > > >
> > > > > While dummy is running in terminal-2, I move to terminal-1 and press 
> > > > > 'c' , then 'q' to exit.
> > > > >
> > > > >
> > > > > To restart, I run the generated dmtcp_restart_script.sh script, but I 
> > > > > get the error below. Would you please advise on a possible fix for 
> > > > > this issue?
> > > > >
> > > > >
> > > > > (P.S. I tried the same steps on another machine (with Ubuntu 14.04) 
> > > > > that has a local file system, and the restart worked successfully. Is 
> > > > > there a specific configuration I should use with network file 
> > > > > systems?)
> > > > >
> > > > >
> > > > > size = 1
> > > > > [43000] ERROR at fileconnlist.cpp:318 in recreateShmFileAndMap; 
> > > > > REASON='JASSERT(fd != -1 || errno == EEXIST) failed'
> > > > >      area.name = /ram/var/run/nscd/dbbxzrxW
> > > > > dummy.mpich2 (43000): Terminating...
> > > > > [44000] ERROR at fileconnlist.cpp:318 in recreateShmFileAndMap; 
> > > > > REASON='JASSERT(fd != -1 || errno == EEXIST) failed'
> > > > >      area.name = /ram/var/run/nscd/dbbxzrxW
> > > > > dummy.mpich2 (44000): Terminating...
> > > > > [42000] ERROR at fileconnlist.cpp:318 in recreateShmFileAndMap; 
> > > > > REASON='JASSERT(fd != -1 || errno == EEXIST) failed'
> > > > >      area.name = /ram/var/run/nscd/dbbxzrxW
> > > > > dummy.mpich2 (42000): Terminating...
> > > > > [40000] ERROR at connectionidentifier.h:96 in assertValid; 
> > > > > REASON='JASSERT(strcmp(sign, HANDSHAKE_SIGNATURE_MSG) == 0) failed'
> > > > >      sign =
> > > > > Message: read invalid message, signature mismatch. (External socket?)
> > > > > mpiexec.hydra (40000): Terminating...
> > > > > [41000] ERROR at connectionidentifier.h:96 in assertValid; 
> > > > > REASON='JASSERT(strcmp(sign, HANDSHAKE_SIGNATURE_MSG) == 0) failed'
> > > > >      sign =
> > > > > Message: read invalid message, signature mismatch. (External socket?)
> > > > > hydra_pmi_proxy (41000): Terminating...
> > > > >
> > > > >
> > > > >
> > > > > Best Regards,
> > > > > Sara
> > > > >
> > > > > Sara S. Hamouda
> > > > > PhD Candidate (Computer Systems Group)
> > > > > College of Engineering and Computer Science
> > > > > The Australian National University
> > > >
> > > > > ------------------------------------------------------------------------------
> > > > > Check out the vibrant tech community on one of the world's most
> > > > > engaging tech sites, SlashDot.org! http://sdm.link/slashdot
> > > >
> > > > > _______________________________________________
> > > > > Dmtcp-forum mailing list
> > > > > Dmtcp-forum@lists.sourceforge.net
> > > > > https://lists.sourceforge.net/lists/listinfo/dmtcp-forum
> > > >
>
> > // Author: Wes Kendall
> > // Copyright 2013 www.mpitutorial.com
> > // This code is provided freely with the tutorials on mpitutorial.com. Feel
> > // free to modify it for your own use. Any distribution of the code must
> > // either provide a link to www.mpitutorial.com or keep this header intact.
> > //
> > // Program that computes the standard deviation of an array of elements
> > // in parallel using MPI_Reduce.
> > //
> > #include <stdio.h>
> > #include <stdlib.h>
> > #include <mpi.h>
> > #include <math.h>
> > #include <assert.h>
> > #include <unistd.h>
> > #include <time.h>   // for time(), used to seed rand()
> >
> > unsigned int microseconds = 20000;
> >
> > // Creates an array of random numbers. Each number has a value from 0 - 1
> > float *create_rand_nums(int num_elements) {
> >   float *rand_nums = (float *)malloc(sizeof(float) * num_elements);
> >   assert(rand_nums != NULL);
> >   int i;
> >   for (i = 0; i < num_elements; i++) {
> >     rand_nums[i] = (rand() / (float)RAND_MAX);
> >   }
> >   return rand_nums;
> > }
> >
> > int main(int argc, char** argv) {
> >   if (argc != 3) {
> >     fprintf(stderr, "Usage: avg num_elements_per_proc repeat_times\n");
> >     exit(1);
> >   }
> >
> >   int num_elements_per_proc = atoi(argv[1]);
> >   int num_repeat = atoi(argv[2]);
> >   int repeat_id=0;
> >   MPI_Init(NULL, NULL);
> >
> >   int world_rank;
> >   MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);
> >   int world_size;
> >   MPI_Comm_size(MPI_COMM_WORLD, &world_size);
> >
> >   // Create a random array of elements on all processes.
> >   srand(time(NULL) + world_rank); // Seed each process's generator uniquely
> >                                   // (adding, rather than multiplying by,
> >                                   // the rank avoids a zero seed on rank 0)
> >   float *rand_nums = NULL;
> >   rand_nums = create_rand_nums(num_elements_per_proc);
> >
> >   while (repeat_id < num_repeat) {
> >     usleep(microseconds);
> >
> >     if (world_rank == 0)
> >       printf("\rrepeat-%d ", repeat_id);
> >
> >     // Sum the numbers locally
> >     float local_sum = 0;
> >     int i;
> >     for (i = 0; i < num_elements_per_proc; i++) {
> >       local_sum += rand_nums[i];
> >     }
> >
> >     // Reduce all of the local sums into the global sum in order to
> >     // calculate the mean
> >     float global_sum;
> >     MPI_Allreduce(&local_sum, &global_sum, 1, MPI_FLOAT, MPI_SUM,
> >                 MPI_COMM_WORLD);
> >     float mean = global_sum / (num_elements_per_proc * world_size);
> >
> >     // Compute the local sum of the squared differences from the mean
> >     float local_sq_diff = 0;
> >     for (i = 0; i < num_elements_per_proc; i++) {
> >       local_sq_diff += (rand_nums[i] - mean) * (rand_nums[i] - mean);
> >     }
> >
> >     // Reduce the global sum of the squared differences to the root process
> >     // and print off the answer
> >     float global_sq_diff;
> >     MPI_Reduce(&local_sq_diff, &global_sq_diff, 1, MPI_FLOAT, MPI_SUM, 0,
> >              MPI_COMM_WORLD);
> >
> >     // The standard deviation is the square root of the mean of the squared
> >     // differences.
> >     if (world_rank == 0) {
> >       float stddev = sqrt(global_sq_diff /
> >                         (num_elements_per_proc * world_size));
> >       printf("Mean - %f, Standard deviation = %f\n", mean, stddev);
> >     }
> >     repeat_id++;
> >
> >   }
> >
> >   // Clean up
> >   free(rand_nums);
> >
> >   MPI_Barrier(MPI_COMM_WORLD);
> >   MPI_Finalize();
> >   return 0;
> > }
> >
>
>
