Just to clarify: are you now able to checkpoint and restart MPICH programs
after applying the patch?

For ULFM, could you send us the steps to follow to reproduce the problem
locally?


On Mon, Oct 17, 2016 at 02:56:26AM +0000, Sara Salem Hamouda wrote:
> Dear Rohan,
> 
> 
> Thanks very much for the patch, it fixed the error raised when restarting my 
> MPICH program over the CentOS machines.
> 
> 
> My OpenMPI-ULFM programs now raise a different error upon restart:
> 
> size = 1
> [40000] WARNING at socketconnection.cpp:540 in postRestart; 
> REASON='JWARNING(_real_bind(_fds[0], (sockaddr*) &_bindAddr,_bindAddrlen) == 
> 0) failed'
>      (strerror((*__errno_location ()))) = Address already in use
>      id() = 216034594ce6504-40000-58043957(100860)
> Message: Bind failed.
> [41000] ERROR at connection.cpp:79 in restoreOptions; 
> REASON='JASSERT(fcntl(_fds[0], F_SETFL, (int)_fcntlFlags) == 0) failed'
>      _fds[0] = 13
>      _fcntlFlags = 32770
>      (strerror((*__errno_location ()))) = Bad file descriptor
> dummy.ulfm (41000): Terminating...
> [40000] ERROR at connectionidentifier.h:96 in assertValid; 
> REASON='JASSERT(strcmp(sign, HANDSHAKE_SIGNATURE_MSG) == 0) failed'
>      sign =
> Message: read invalid message, signature mismatch. (External socket?)
> orterun (40000): Terminating...
> 
> [43000] ERROR at connection.cpp:79 in restoreOptions; 
> REASON='JASSERT(fcntl(_fds[0], F_SETFL, (int)_fcntlFlags) == 0) failed'
>      _fds[0] = 13
>      _fcntlFlags = 32770
>      (strerror((*__errno_location ()))) = Bad file descriptor
> dummy.ulfm (43000): Terminating...
> [42000] ERROR at connection.cpp:79 in restoreOptions; 
> REASON='JASSERT(fcntl(_fds[0], F_SETFL, (int)_fcntlFlags) == 0) failed'
>      _fds[0] = 13
>      _fcntlFlags = 32770
>      (strerror((*__errno_location ()))) = Bad file descriptor
> dummy.ulfm (42000): Terminating...
> 
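For context on the two errno values in the log above: "Address already in use" is EADDRINUSE from bind(), and "Bad file descriptor" is EBADF from fcntl(). The sketch below only shows how each errno arises at the system-call level on Linux. It is independent of DMTCP and does not claim to be the cause of the restart failure; the function names errnoFromDoubleBind and errnoFromFcntlOnClosedFd are made up for illustration.

```cpp
#include <cerrno>
#include <fcntl.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <unistd.h>

// Hypothetical helper: binds one socket to a kernel-chosen port, then tries
// to bind a second socket to the same port. Returns the errno of the second
// bind (EADDRINUSE is expected), 0 if it unexpectedly succeeded, or -1 if
// the setup itself failed.
int errnoFromDoubleBind() {
  int first = socket(AF_INET, SOCK_STREAM, 0);
  int second = socket(AF_INET, SOCK_STREAM, 0);
  int result = -1;

  if (first >= 0 && second >= 0) {
    sockaddr_in addr{};
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_ANY);
    addr.sin_port = 0;  // port 0: let the kernel pick a free port

    socklen_t len = sizeof(addr);
    if (bind(first, reinterpret_cast<sockaddr *>(&addr), len) == 0 &&
        getsockname(first, reinterpret_cast<sockaddr *>(&addr), &len) == 0) {
      // addr now holds the port owned by `first`; reuse it for `second`.
      int rc = bind(second, reinterpret_cast<sockaddr *>(&addr), sizeof(addr));
      result = (rc == 0) ? 0 : errno;
    }
  }

  if (first >= 0) close(first);
  if (second >= 0) close(second);
  return result;
}

// Hypothetical helper: reads the flags of a pipe end, closes it, then tries
// to restore the flags on the now-invalid descriptor. Returns the errno of
// the failing fcntl (EBADF is expected), 0 if it unexpectedly succeeded, or
// -1 if the pipe could not be created.
int errnoFromFcntlOnClosedFd() {
  int fds[2];
  if (pipe(fds) != 0) return -1;

  int flags = fcntl(fds[0], F_GETFL);
  close(fds[0]);
  close(fds[1]);

  int rc = fcntl(fds[0], F_SETFL, flags);
  return (rc == 0) ? 0 : errno;
}
```

Calling both functions from a small main() and printing strerror() of the results should give the same two messages that appear in the log.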
> Thanks Rohan, I really appreciate your support.
> 
> 
> Best Regards,
> 
> Sara
> 
> Sara S. Hamouda
> PhD Candidate (Computer Systems Group)
> College of Engineering and Computer Science
> The Australian National University
> ________________________________
> From: Rohan Garg <rohg...@ccs.neu.edu>
> Sent: Saturday, October 15, 2016 4:25:51 AM
> To: Sara Salem Hamouda
> Cc: dmtcp-forum@lists.sourceforge.net
> Subject: Re: [Dmtcp-forum] DMTCP MPI restart error on a single node
> 
> Hi Sara,
> 
> Could you please re-try after applying the following patch to
> the DMTCP source?
> 
> diff --git a/src/util_misc.cpp b/src/util_misc.cpp
> index f5bc84a..86650cf 100644
> --- a/src/util_misc.cpp
> +++ b/src/util_misc.cpp
> @@ -633,6 +633,7 @@ bool Util::isNscdArea(const ProcMapsArea& area)
>    if (strStartsWith(area.name, "/run/nscd") || // OpenSUSE (newer)
>        strStartsWith(area.name, "/var/run/nscd") || // OpenSUSE (older)
>        strStartsWith(area.name, "/var/cache/nscd") || // Debian/Ubuntu
> +      strStartsWith(area.name, "/ram/var/run/nscd") || // CentOS-6.8
>        strStartsWith(area.name, "/var/db/nscd")) { // RedHat/Fedora
>      return true;
>    }
> 
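As a rough illustration of what the patch changes: the check is a list of path prefixes, and the CentOS-6.8 location under /ram/var/run/nscd was not on it. The sketch below is a standalone approximation, not the DMTCP source. startsWith and isNscdPath are hypothetical stand-ins for strStartsWith and Util::isNscdArea, and the function takes the mapping name directly instead of a ProcMapsArea.

```cpp
#include <cstring>

// Hypothetical stand-in for DMTCP's strStartsWith helper.
static bool startsWith(const char *s, const char *prefix) {
  return std::strncmp(s, prefix, std::strlen(prefix)) == 0;
}

// Simplified version of the patched check: true if the mapping name lies
// under one of the known nscd directories.
static bool isNscdPath(const char *name) {
  static const char *const kPrefixes[] = {
    "/run/nscd",          // OpenSUSE (newer)
    "/var/run/nscd",      // OpenSUSE (older)
    "/var/cache/nscd",    // Debian/Ubuntu
    "/ram/var/run/nscd",  // CentOS-6.8 (the prefix added by the patch)
    "/var/db/nscd",       // RedHat/Fedora
  };
  for (const char *prefix : kPrefixes) {
    if (startsWith(name, prefix)) {
      return true;
    }
  }
  return false;
}
```

With the added prefix, a name such as /ram/var/run/nscd/dbbxzrxW from the earlier error output is recognised; without it, none of the other prefixes match because the path starts with /ram.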
> Thanks,
> Rohan
> 
> On Fri, Oct 14, 2016 at 07:02:04AM +0000, Sara Salem Hamouda wrote:
> >
> > Hi Rohan,
> >
> >     I am using the latest release on GitHub, which is DMTCP-2.4.5. I get
> > the same error with mpirun.
> >
> >
> > I tried another MPI implementation, called OpenMPI-ULFM
> > (https://bitbucket.org/icldistcomp/ulfm), which I use in my research, and I
> > got the same error:
> >
> >
> > [40000] ERROR at fileconnlist.cpp:318 in recreateShmFileAndMap; 
> > REASON='JASSERT(fd != -1 || errno == EEXIST) failed'
> >      area.name = /ram/var/run/nscd/dbuYHRnM
> > orterun (40000): Terminating...
> > ssh659@raijin3:~/dmtcp/dir_ckpt$ [41000] ERROR at fileconnlist.cpp:318 in 
> > recreateShmFileAndMap; REASON='JASSERT(fd != -1 || errno == EEXIST) failed'
> >      area.name = /ram/var/run/nscd/dbCEJazi
> > dummy.ulfm (41000): Terminating...
> > [42000] ERROR at fileconnlist.cpp:318 in recreateShmFileAndMap; 
> > REASON='JASSERT(fd != -1 || errno == EEXIST) failed'
> >      area.name = /ram/var/run/nscd/dbCEJazi
> > dummy.ulfm (42000): Terminating...
> > [43000] ERROR at fileconnlist.cpp:318 in recreateShmFileAndMap; 
> > REASON='JASSERT(fd != -1 || errno == EEXIST) failed'
> >      area.name = /ram/var/run/nscd/dbCEJazi
> > dummy.ulfm (43000): Terminating...
> >
> > The HANDSHAKE error appeared with MPICH, but not with OpenMPI-ULFM.
> >
> >
> > Best Regards,
> > Sara
> >
> > Sara S. Hamouda
> > PhD Candidate (Computer Systems Group)
> > College of Engineering and Computer Science
> > The Australian National University
> > ________________________________
> > From: Rohan Garg <rohg...@ccs.neu.edu>
> > Sent: Friday, October 14, 2016 7:11:12 AM
> > To: Sara Salem Hamouda
> > Cc: dmtcp-forum@lists.sourceforge.net
> > Subject: Re: [Dmtcp-forum] DMTCP MPI restart error on a single node
> >
> > Hi Sara,
> >
> > What version of DMTCP were you using? DMTCP-3.0 has some known issues
> > with mpich-3.2, as reported by a DMTCP user. I'd recommend trying with
> > DMTCP-2.5.
> >
> > Also, could you try launching your MPI program with mpirun instead of
> > mpiexec?
> >
> > Thanks,
> > Rohan
> >
> > On Wed, Oct 12, 2016 at 11:30:41AM +0000, Sara Salem Hamouda wrote:
> > > Dear DMTCP team,
> > >
> > >   I would appreciate your help with the issue below.
> > >
> > >
> > > I am using a single machine to learn DMTCP. The operating system is 
> > > "CentOS release 6.8", and it uses a network file system. I run a simple 
> > > MPI program (dummy.c), using mpich V3.2.
> > >
> > >
> > > On terminal-1:
> > >
> > > dmtcp_coordinator
> > >
> > >
> > > On terminal-2:
> > >
> > > dmtcp_launch mpiexec -n 3 ./dummy.mpich2 10 10000
> > >
> > >
> > > While dummy is running in terminal-2, I move to terminal-1 and press 'c'
> > > to checkpoint, then 'q' to exit.
> > >
> > >
> > > To restart, I run the generated dmtcp_restart_script.sh script, but I get
> > > the error below. Could you please advise on a possible fix for this issue?
> > >
> > >
> > > (P.S. I tried the same steps on another machine (running Ubuntu 14.04)
> > > that has a local file system, and the restart succeeded. Is there a
> > > specific configuration I should use with network file systems?)
> > >
> > >
> > > size = 1
> > > [43000] ERROR at fileconnlist.cpp:318 in recreateShmFileAndMap; 
> > > REASON='JASSERT(fd != -1 || errno == EEXIST) failed'
> > >      area.name = /ram/var/run/nscd/dbbxzrxW
> > > dummy.mpich2 (43000): Terminating...
> > > [44000] ERROR at fileconnlist.cpp:318 in recreateShmFileAndMap; 
> > > REASON='JASSERT(fd != -1 || errno == EEXIST) failed'
> > >      area.name = /ram/var/run/nscd/dbbxzrxW
> > > dummy.mpich2 (44000): Terminating...
> > > [42000] ERROR at fileconnlist.cpp:318 in recreateShmFileAndMap; 
> > > REASON='JASSERT(fd != -1 || errno == EEXIST) failed'
> > >      area.name = /ram/var/run/nscd/dbbxzrxW
> > > dummy.mpich2 (42000): Terminating...
> > > [40000] ERROR at connectionidentifier.h:96 in assertValid; 
> > > REASON='JASSERT(strcmp(sign, HANDSHAKE_SIGNATURE_MSG) == 0) failed'
> > >      sign =
> > > Message: read invalid message, signature mismatch. (External socket?)
> > > mpiexec.hydra (40000): Terminating...
> > > [41000] ERROR at connectionidentifier.h:96 in assertValid; 
> > > REASON='JASSERT(strcmp(sign, HANDSHAKE_SIGNATURE_MSG) == 0) failed'
> > >      sign =
> > > Message: read invalid message, signature mismatch. (External socket?)
> > > hydra_pmi_proxy (41000): Terminating...
> > >
> > >
> > >
> > > Best Regards,
> > > Sara
> > >
> > > Sara S. Hamouda
> > > PhD Candidate (Computer Systems Group)
> > > College of Engineering and Computer Science
> > > The Australian National University
> >
> > > ------------------------------------------------------------------------------
> > > Check out the vibrant tech community on one of the world's most
> > > engaging tech sites, SlashDot.org! http://sdm.link/slashdot
> >
> > > _______________________________________________
> > > Dmtcp-forum mailing list
> > > Dmtcp-forum@lists.sourceforge.net
> > > https://lists.sourceforge.net/lists/listinfo/dmtcp-forum
