Dear Rohan,

Thanks very much for the patch. It fixed the error raised when restarting my
MPICH program on the CentOS machines.


My OpenMPI-ULFM programs now raise a different error upon restart:

size = 1
[40000] WARNING at socketconnection.cpp:540 in postRestart; 
REASON='JWARNING(_real_bind(_fds[0], (sockaddr*) &_bindAddr,_bindAddrlen) == 0) 
failed'
     (strerror((*__errno_location ()))) = Address already in use
     id() = 216034594ce6504-40000-58043957(100860)
Message: Bind failed.
[41000] ERROR at connection.cpp:79 in restoreOptions; 
REASON='JASSERT(fcntl(_fds[0], F_SETFL, (int)_fcntlFlags) == 0) failed'
     _fds[0] = 13
     _fcntlFlags = 32770
     (strerror((*__errno_location ()))) = Bad file descriptor
dummy.ulfm (41000): Terminating...
[40000] ERROR at connectionidentifier.h:96 in assertValid; 
REASON='JASSERT(strcmp(sign, HANDSHAKE_SIGNATURE_MSG) == 0) failed'
     sign =
Message: read invalid message, signature mismatch. (External socket?)
orterun (40000): Terminating...

[43000] ERROR at connection.cpp:79 in restoreOptions; 
REASON='JASSERT(fcntl(_fds[0], F_SETFL, (int)_fcntlFlags) == 0) failed'
     _fds[0] = 13
     _fcntlFlags = 32770
     (strerror((*__errno_location ()))) = Bad file descriptor
dummy.ulfm (43000): Terminating...
[42000] ERROR at connection.cpp:79 in restoreOptions; 
REASON='JASSERT(fcntl(_fds[0], F_SETFL, (int)_fcntlFlags) == 0) failed'
     _fds[0] = 13
     _fcntlFlags = 32770
     (strerror((*__errno_location ()))) = Bad file descriptor
dummy.ulfm (42000): Terminating...

Thanks, Rohan. I really appreciate your support.


Best Regards,

Sara

Sara S. Hamouda
PhD Candidate (Computer Systems Group)
College of Engineering and Computer Science
The Australian National University
________________________________
From: Rohan Garg <rohg...@ccs.neu.edu>
Sent: Saturday, October 15, 2016 4:25:51 AM
To: Sara Salem Hamouda
Cc: dmtcp-forum@lists.sourceforge.net
Subject: Re: [Dmtcp-forum] DMTCP MPI restart error on a single node

Hi Sara,

Could you please re-try after applying the following patch to
the DMTCP source?

diff --git a/src/util_misc.cpp b/src/util_misc.cpp
index f5bc84a..86650cf 100644
--- a/src/util_misc.cpp
+++ b/src/util_misc.cpp
@@ -633,6 +633,7 @@ bool Util::isNscdArea(const ProcMapsArea& area)
   if (strStartsWith(area.name, "/run/nscd") || // OpenSUSE (newer)
       strStartsWith(area.name, "/var/run/nscd") || // OpenSUSE (older)
       strStartsWith(area.name, "/var/cache/nscd") || // Debian/Ubuntu
+      strStartsWith(area.name, "/ram/var/run/nscd") || // CentOS-6.8
       strStartsWith(area.name, "/var/db/nscd")) { // RedHat/Fedora
     return true;
   }

Thanks,
Rohan

On Fri, Oct 14, 2016 at 07:02:04AM +0000, Sara Salem Hamouda wrote:
>
> Hi Rohan,
>
>     I am using the latest release on GitHub, which is DMTCP-2.4.5. I get the
> same error with mpirun.
>
>
> I tried another MPI implementation, OpenMPI-ULFM
> (https://bitbucket.org/icldistcomp/ulfm), which I use in my research, and I
> got the same error:
>
>
> [40000] ERROR at fileconnlist.cpp:318 in recreateShmFileAndMap; 
> REASON='JASSERT(fd != -1 || errno == EEXIST) failed'
>      area.name = /ram/var/run/nscd/dbuYHRnM
> orterun (40000): Terminating...
> ssh659@raijin3:~/dmtcp/dir_ckpt$ [41000] ERROR at fileconnlist.cpp:318 in 
> recreateShmFileAndMap; REASON='JASSERT(fd != -1 || errno == EEXIST) failed'
>      area.name = /ram/var/run/nscd/dbCEJazi
> dummy.ulfm (41000): Terminating...
> [42000] ERROR at fileconnlist.cpp:318 in recreateShmFileAndMap; 
> REASON='JASSERT(fd != -1 || errno == EEXIST) failed'
>      area.name = /ram/var/run/nscd/dbCEJazi
> dummy.ulfm (42000): Terminating...
> [43000] ERROR at fileconnlist.cpp:318 in recreateShmFileAndMap; 
> REASON='JASSERT(fd != -1 || errno == EEXIST) failed'
>      area.name = /ram/var/run/nscd/dbCEJazi
> dummy.ulfm (43000): Terminating...
>
> The HANDSHAKE error appeared with MPICH, but not with OpenMPI-ULFM.
>
>
> Best Regards,
> Sara
>
> Sara S. Hamouda
> PhD Candidate (Computer Systems Group)
> College of Engineering and Computer Science
> The Australian National University
> ________________________________
> From: Rohan Garg <rohg...@ccs.neu.edu>
> Sent: Friday, October 14, 2016 7:11:12 AM
> To: Sara Salem Hamouda
> Cc: dmtcp-forum@lists.sourceforge.net
> Subject: Re: [Dmtcp-forum] DMTCP MPI restart error on a single node
>
> Hi Sara,
>
> What version of DMTCP were you using? DMTCP-3.0 has some known issues
> with mpich-3.2, as reported by a DMTCP user. I'd recommend trying with
> DMTCP-2.5.
>
> Also, could you try launching your MPI program with mpirun instead of
> mpiexec?
>
> Thanks,
> Rohan
>
> On Wed, Oct 12, 2016 at 11:30:41AM +0000, Sara Salem Hamouda wrote:
> > Dear DMTCP team,
> >
> >   I would appreciate your support with the issue below.
> >
> >
> > I am using a single machine to learn DMTCP. The operating system is "CentOS 
> > release 6.8", and it uses a network file system. I run a simple MPI program 
> > (dummy.c), using mpich V3.2.
> >
> >
> > On terminal-1:
> >
> > dmtcp_coordinator
> >
> >
> > On terminal-2:
> >
> > dmtcp_launch mpiexec -n 3 ./dummy.mpich2 10 10000
> >
> >
> > While dummy is running in terminal-2, I move to terminal-1 and press 'c',
> > then 'q' to exit.
> >
> >
> > To restart, I run the generated dmtcp_restart_script.sh script, but I get
> > the error below. Could you please advise on a possible fix for this issue?
> >
> >
> > (P.S. I tried the same steps on another machine (with Ubuntu 14.04 OS) that
> > has a local file system, and the restart worked successfully. Is there a
> > specific configuration I should use with network file systems?)
> >
> >
> > size = 1
> > [43000] ERROR at fileconnlist.cpp:318 in recreateShmFileAndMap; 
> > REASON='JASSERT(fd != -1 || errno == EEXIST) failed'
> >      area.name = /ram/var/run/nscd/dbbxzrxW
> > dummy.mpich2 (43000): Terminating...
> > [44000] ERROR at fileconnlist.cpp:318 in recreateShmFileAndMap; 
> > REASON='JASSERT(fd != -1 || errno == EEXIST) failed'
> >      area.name = /ram/var/run/nscd/dbbxzrxW
> > dummy.mpich2 (44000): Terminating...
> > [42000] ERROR at fileconnlist.cpp:318 in recreateShmFileAndMap; 
> > REASON='JASSERT(fd != -1 || errno == EEXIST) failed'
> >      area.name = /ram/var/run/nscd/dbbxzrxW
> > dummy.mpich2 (42000): Terminating...
> > [40000] ERROR at connectionidentifier.h:96 in assertValid; 
> > REASON='JASSERT(strcmp(sign, HANDSHAKE_SIGNATURE_MSG) == 0) failed'
> >      sign =
> > Message: read invalid message, signature mismatch. (External socket?)
> > mpiexec.hydra (40000): Terminating...
> > [41000] ERROR at connectionidentifier.h:96 in assertValid; 
> > REASON='JASSERT(strcmp(sign, HANDSHAKE_SIGNATURE_MSG) == 0) failed'
> >      sign =
> > Message: read invalid message, signature mismatch. (External socket?)
> > hydra_pmi_proxy (41000): Terminating...
> >
> >
> >
> > Best Regards,
> > Sara
> >
> > Sara S. Hamouda
> > PhD Candidate (Computer Systems Group)
> > College of Engineering and Computer Science
> > The Australian National University
>
> > ------------------------------------------------------------------------------
> > Check out the vibrant tech community on one of the world's most
> > engaging tech sites, SlashDot.org! http://sdm.link/slashdot
>
> > _______________________________________________
> > Dmtcp-forum mailing list
> > Dmtcp-forum@lists.sourceforge.net
> > https://lists.sourceforge.net/lists/listinfo/dmtcp-forum