Dear Rohan,

   My sincere apologies for my late response.


Regarding your first question: yes, your patch allowed me to checkpoint and 
restart MPICH programs on the CentOS machines.


ULFM still fails on restart on the same machines, with the errors I sent before. 
You can reproduce the problem with the following steps:


1. Install MPI-ULFM dependencies (libtool, autoconf, and flex).

On a Debian machine you can run this command:

sudo apt-get install libtool autoconf flex


2. Create a folder to install MPI-ULFM, say:

mkdir -p /home/rohan/packages/ulfm


3. Download MPI-ULFM:

hg clone https://bitbucket.org/icldistcomp/ulfm


4. The previous step downloads a folder called ulfm. Change directory to that 
folder:

cd ulfm


5. Run the following commands:

./autogen.pl
./configure --prefix=/home/rohan/packages/ulfm \
       --enable-mpi-ext=ftmpi --with-ft=mpi \
       --disable-io-romio --enable-contrib-no-build=vt \
       --with-platform=optimized \
       CC=gcc CXX=g++ F77=gfortran FC=gfortran
make
make install

6. Update the following environment variables:

export MPI=/home/rohan/packages/ulfm
export PATH=$MPI/bin:$PATH
export LD_LIBRARY_PATH=$MPI/lib:$LD_LIBRARY_PATH

7. Compile and run any program using MPI-ULFM. I attached dummy.c, which I use 
for testing. The program repeats an MPI_Allreduce operation the number of times 
given by the second parameter; the first parameter is the array size.


On terminal-1:
dmtcp_coordinator

On terminal-2:
mpicc dummy.c -o dummy.ulfm
dmtcp_launch mpirun -n 3 ./dummy.ulfm 10 10000

8. Take a checkpoint, terminate, then restart.
On terminal-1:
press 'c'
press 'q'
./dmtcp_restart_script.sh
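
Steps 7 and 8 can also be driven from one script instead of two terminals. 
This is only a minimal sketch under these assumptions: the DMTCP tools and the 
ULFM mpirun are on PATH, dummy.ulfm is already compiled, the coordinator uses 
its default port, and a 10-second warm-up is long enough for the ranks to 
start. The function name reproduce and its warm-up parameter are made up for 
illustration; dmtcp_command --bcheckpoint and --quit stand in for pressing 
'c' and 'q'.

```shell
#!/bin/sh
# Sketch: scripted version of steps 7 and 8 (assumptions are marked).

# reproduce [WARMUP_SECONDS]: launch under DMTCP, checkpoint, quit, restart.
reproduce() {
  warmup=${1:-10}   # assumed delay so the ranks are running before the checkpoint

  # Terminal-1 equivalent: start the coordinator in the background.
  dmtcp_coordinator --daemon || return 1

  # Terminal-2 equivalent: same launch line as in step 7.
  dmtcp_launch mpirun -n 3 ./dummy.ulfm 10 10000 &
  launch_pid=$!

  sleep "$warmup"

  # Equivalent of pressing 'c' (blocking variant) and then 'q'.
  dmtcp_command --bcheckpoint || return 1
  dmtcp_command --quit || return 1
  wait "$launch_pid"   # the launched job is expected to end once 'quit' is sent

  # Restart from the generated script, assumed to be in the current
  # directory as in step 8.
  ./dmtcp_restart_script.sh
}
```

Calling reproduce (or reproduce 30 for a longer warm-up) from the directory 
that holds dummy.ulfm should end in the same restart attempt as the manual 
steps.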



Thanks Rohan, and sorry again for my late response.


Best Regards,
Sara
________________________________
From: Rohan Garg <rohg...@ccs.neu.edu>
Sent: Wednesday, October 19, 2016 6:48:30 AM
To: Sara Salem Hamouda
Cc: dmtcp-forum@lists.sourceforge.net
Subject: Re: [Dmtcp-forum] DMTCP MPI restart error on a single node

Just to clarify: you are able to now checkpoint and restart MPICH programs
after the patch?

For ULFM, could you send us the steps to follow to reproduce the problem
locally?


On Mon, Oct 17, 2016 at 02:56:26AM +0000, Sara Salem Hamouda wrote:
> Dear Rohan,
>
>
> Thanks very much for the patch, it fixed the error raised when restarting my 
> MPICH program over the CentOS machines.
>
>
> My OpenMPI-ULFM programs now raise a different error upon restart:
>
> size = 1
> [40000] WARNING at socketconnection.cpp:540 in postRestart; 
> REASON='JWARNING(_real_bind(_fds[0], (sockaddr*) &_bindAddr,_bindAddrlen) == 
> 0) failed'
>      (strerror((*__errno_location ()))) = Address already in use
>      id() = 216034594ce6504-40000-58043957(100860)
> Message: Bind failed.
> [41000] ERROR at connection.cpp:79 in restoreOptions; 
> REASON='JASSERT(fcntl(_fds[0], F_SETFL, (int)_fcntlFlags) == 0) failed'
>      _fds[0] = 13
>      _fcntlFlags = 32770
>      (strerror((*__errno_location ()))) = Bad file descriptor
> dummy.ulfm (41000): Terminating...
> [40000] ERROR at connectionidentifier.h:96 in assertValid; 
> REASON='JASSERT(strcmp(sign, HANDSHAKE_SIGNATURE_MSG) == 0) failed'
>      sign =
> Message: read invalid message, signature mismatch. (External socket?)
> orterun (40000): Terminating...
>
> [43000] ERROR at connection.cpp:79 in restoreOptions; 
> REASON='JASSERT(fcntl(_fds[0], F_SETFL, (int)_fcntlFlags) == 0) failed'
>      _fds[0] = 13
>      _fcntlFlags = 32770
>      (strerror((*__errno_location ()))) = Bad file descriptor
> dummy.ulfm (43000): Terminating...
> [42000] ERROR at connection.cpp:79 in restoreOptions; 
> REASON='JASSERT(fcntl(_fds[0], F_SETFL, (int)_fcntlFlags) == 0) failed'
>      _fds[0] = 13
>      _fcntlFlags = 32770
>      (strerror((*__errno_location ()))) = Bad file descriptor
> dummy.ulfm (42000): Terminating...
>
> Thanks Rohan, I really appreciate your support.
>
>
> Best Regards,
>
> Sara
>
> Sara S. Hamouda
> PhD Candidate (Computer Systems Group)
> College of Engineering and Computer Science
> The Australian National University
> ________________________________
> From: Rohan Garg <rohg...@ccs.neu.edu>
> Sent: Saturday, October 15, 2016 4:25:51 AM
> To: Sara Salem Hamouda
> Cc: dmtcp-forum@lists.sourceforge.net
> Subject: Re: [Dmtcp-forum] DMTCP MPI restart error on a single node
>
> Hi Sara,
>
> Could you please re-try after applying the following patch to
> the DMTCP source?
>
> diff --git a/src/util_misc.cpp b/src/util_misc.cpp
> index f5bc84a..86650cf 100644
> --- a/src/util_misc.cpp
> +++ b/src/util_misc.cpp
> @@ -633,6 +633,7 @@ bool Util::isNscdArea(const ProcMapsArea& area)
>    if (strStartsWith(area.name, "/run/nscd") || // OpenSUSE (newer)
>        strStartsWith(area.name, "/var/run/nscd") || // OpenSUSE (older)
>        strStartsWith(area.name, "/var/cache/nscd") || // Debian/Ubuntu
> +      strStartsWith(area.name, "/ram/var/run/nscd") || // CentOS-6.8
>        strStartsWith(area.name, "/var/db/nscd")) { // RedHat/Fedora
>      return true;
>    }
>
> Thanks,
> Rohan
>
> On Fri, Oct 14, 2016 at 07:02:04AM +0000, Sara Salem Hamouda wrote:
> >
> > Hi Rohan,
> >
> >     I am using the latest release on github, which is DMTCP-2.4.5.  Same 
> > error received with mpirun.
> >
> >
> > I tried another MPI implementation, called OpenMPI-ULFM 
> > (https://bitbucket.org/icldistcomp/ulfm), which I use in my research, and I 
> > got the same error:
> >
> >
> > [40000] ERROR at fileconnlist.cpp:318 in recreateShmFileAndMap; 
> > REASON='JASSERT(fd != -1 || errno == EEXIST) failed'
> >      area.name = /ram/var/run/nscd/dbuYHRnM
> > orterun (40000): Terminating...
> > ssh659@raijin3:~/dmtcp/dir_ckpt$ [41000] ERROR at fileconnlist.cpp:318 in 
> > recreateShmFileAndMap; REASON='JASSERT(fd != -1 || errno == EEXIST) failed'
> >      area.name = /ram/var/run/nscd/dbCEJazi
> > dummy.ulfm (41000): Terminating...
> > [42000] ERROR at fileconnlist.cpp:318 in recreateShmFileAndMap; 
> > REASON='JASSERT(fd != -1 || errno == EEXIST) failed'
> >      area.name = /ram/var/run/nscd/dbCEJazi
> > dummy.ulfm (42000): Terminating...
> > [43000] ERROR at fileconnlist.cpp:318 in recreateShmFileAndMap; 
> > REASON='JASSERT(fd != -1 || errno == EEXIST) failed'
> >      area.name = /ram/var/run/nscd/dbCEJazi
> > dummy.ulfm (43000): Terminating...
> >
> > The HANDSHAKE error appeared with MPICH, but not with OpenMPI-ULFM.
> >
> >
> > Best Regards,
> > Sara
> >
> > Sara S. Hamouda
> > PhD Candidate (Computer Systems Group)
> > College of Engineering and Computer Science
> > The Australian National University
> > ________________________________
> > From: Rohan Garg <rohg...@ccs.neu.edu>
> > Sent: Friday, October 14, 2016 7:11:12 AM
> > To: Sara Salem Hamouda
> > Cc: dmtcp-forum@lists.sourceforge.net
> > Subject: Re: [Dmtcp-forum] DMTCP MPI restart error on a single node
> >
> > Hi Sara,
> >
> > What version of DMTCP were you using? DMTCP-3.0 has some known issues
> > with mpich-3.2, as reported by a DMTCP user. I'd recommend trying with
> > DMTCP-2.5.
> >
> > Also, could you try launching your MPI program with mpirun instead of
> > mpiexec?
> >
> > Thanks,
> > Rohan
> >
> > On Wed, Oct 12, 2016 at 11:30:41AM +0000, Sara Salem Hamouda wrote:
> > > Dear DMTCP team,
> > >
> > >   Appreciate your support regarding the below issue.
> > >
> > >
> > > I am using a single machine to learn DMTCP. The operating system is 
> > > "CentOS release 6.8", and it uses a network file system. I run a simple 
> > > MPI program (dummy.c), using mpich V3.2.
> > >
> > >
> > > On terminal-1:
> > >
> > > dmtcp_coordinator
> > >
> > >
> > > On terminal-2:
> > >
> > > dmtcp_launch mpiexec -n 3 ./dummy.mpich2 10 10000
> > >
> > >
> > > > While dummy is running in terminal-2, I move to terminal-1 and press 'c', 
> > > > then 'q' to exit.
> > >
> > >
> > > > To restart, I run the generated dmtcp_restart_script.sh script, but I get 
> > > > the error below. Would you please advise on a possible fix for this issue?
> > >
> > >
> > > (P.S. I tried the same steps on another machine (with Ubuntu 14.04 OS) 
> > > > that has a local file system, and the restart worked successfully. Is 
> > > > there a specific configuration I should use with network file systems?)
> > >
> > >
> > > size = 1
> > > [43000] ERROR at fileconnlist.cpp:318 in recreateShmFileAndMap; 
> > > REASON='JASSERT(fd != -1 || errno == EEXIST) failed'
> > >      area.name = /ram/var/run/nscd/dbbxzrxW
> > > dummy.mpich2 (43000): Terminating...
> > > [44000] ERROR at fileconnlist.cpp:318 in recreateShmFileAndMap; 
> > > REASON='JASSERT(fd != -1 || errno == EEXIST) failed'
> > >      area.name = /ram/var/run/nscd/dbbxzrxW
> > > dummy.mpich2 (44000): Terminating...
> > > [42000] ERROR at fileconnlist.cpp:318 in recreateShmFileAndMap; 
> > > REASON='JASSERT(fd != -1 || errno == EEXIST) failed'
> > >      area.name = /ram/var/run/nscd/dbbxzrxW
> > > dummy.mpich2 (42000): Terminating...
> > > [40000] ERROR at connectionidentifier.h:96 in assertValid; 
> > > REASON='JASSERT(strcmp(sign, HANDSHAKE_SIGNATURE_MSG) == 0) failed'
> > >      sign =
> > > Message: read invalid message, signature mismatch. (External socket?)
> > > mpiexec.hydra (40000): Terminating...
> > > [41000] ERROR at connectionidentifier.h:96 in assertValid; 
> > > REASON='JASSERT(strcmp(sign, HANDSHAKE_SIGNATURE_MSG) == 0) failed'
> > >      sign =
> > > Message: read invalid message, signature mismatch. (External socket?)
> > > hydra_pmi_proxy (41000): Terminating...
> > >
> > >
> > >
> > > Best Regards,
> > > Sara
> > >
> > > Sara S. Hamouda
> > > PhD Candidate (Computer Systems Group)
> > > College of Engineering and Computer Science
> > > The Australian National University
> >
> > > ------------------------------------------------------------------------------
> > > Check out the vibrant tech community on one of the world's most
> > > engaging tech sites, SlashDot.org! http://sdm.link/slashdot
> >
> > > _______________________________________________
> > > Dmtcp-forum mailing list
> > > Dmtcp-forum@lists.sourceforge.net
> > > https://lists.sourceforge.net/lists/listinfo/dmtcp-forum
> >
// Author: Wes Kendall
// Copyright 2013 www.mpitutorial.com
// This code is provided freely with the tutorials on mpitutorial.com. Feel
// free to modify it for your own use. Any distribution of the code must
// either provide a link to www.mpitutorial.com or keep this header intact.
//
// Program that computes the standard deviation of an array of elements in parallel using
// MPI_Reduce.
//
#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>
#include <math.h>
#include <assert.h>
#include <time.h>    // time(), used to seed rand()
#include <unistd.h>  // usleep()

unsigned int microseconds = 20000;

// Creates an array of random numbers. Each number has a value from 0 - 1
float *create_rand_nums(int num_elements) {
  float *rand_nums = (float *)malloc(sizeof(float) * num_elements);
  assert(rand_nums != NULL);
  int i;
  for (i = 0; i < num_elements; i++) {
    rand_nums[i] = (rand() / (float)RAND_MAX);
  }
  return rand_nums;
}

int main(int argc, char** argv) {
  if (argc != 3) {
    fprintf(stderr, "Usage: avg num_elements_per_proc repeat_times\n");
    exit(1);
  }

  int num_elements_per_proc = atoi(argv[1]);
  int num_repeat = atoi(argv[2]);
  int repeat_id=0;
  MPI_Init(NULL, NULL);

  int world_rank;
  MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);
  int world_size;
  MPI_Comm_size(MPI_COMM_WORLD, &world_size);

  // Create a random array of elements on all processes.
  srand(time(NULL)*world_rank); // Seed the random number generator of processes uniquely
  float *rand_nums = NULL;
  rand_nums = create_rand_nums(num_elements_per_proc);

  while (repeat_id < num_repeat) {
    usleep(microseconds);

    if (world_rank == 0)
      printf("repeat-%d ", repeat_id);

    // Sum the numbers locally
    float local_sum = 0;
    int i;
    for (i = 0; i < num_elements_per_proc; i++) {
      local_sum += rand_nums[i];
    }

    // Reduce all of the local sums into the global sum in order to
    // calculate the mean
    float global_sum;
    MPI_Allreduce(&local_sum, &global_sum, 1, MPI_FLOAT, MPI_SUM,
                MPI_COMM_WORLD);
    float mean = global_sum / (num_elements_per_proc * world_size);

    // Compute the local sum of the squared differences from the mean
    float local_sq_diff = 0;
    for (i = 0; i < num_elements_per_proc; i++) {
      local_sq_diff += (rand_nums[i] - mean) * (rand_nums[i] - mean);
    }

    // Reduce the global sum of the squared differences to the root process
    // and print off the answer
    float global_sq_diff;
    MPI_Reduce(&local_sq_diff, &global_sq_diff, 1, MPI_FLOAT, MPI_SUM, 0,
             MPI_COMM_WORLD);

    // The standard deviation is the square root of the mean of the squared
    // differences.
    if (world_rank == 0) {
      float stddev = sqrt(global_sq_diff /
                        (num_elements_per_proc * world_size));
      printf("Mean - %f, Standard deviation = %f\n", mean, stddev);
    }
    repeat_id++;

  }

  // Clean up
  free(rand_nums);

  MPI_Barrier(MPI_COMM_WORLD);
  MPI_Finalize();
  return 0;
}

_______________________________________________
Dmtcp-forum mailing list
Dmtcp-forum@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/dmtcp-forum