Hi Sara,

Could you please retry after applying the following patch to
the DMTCP source?

diff --git a/src/util_misc.cpp b/src/util_misc.cpp
index f5bc84a..86650cf 100644
--- a/src/util_misc.cpp
+++ b/src/util_misc.cpp
@@ -633,6 +633,7 @@ bool Util::isNscdArea(const ProcMapsArea& area)
   if (strStartsWith(area.name, "/run/nscd") || // OpenSUSE (newer)
       strStartsWith(area.name, "/var/run/nscd") || // OpenSUSE (older)
       strStartsWith(area.name, "/var/cache/nscd") || // Debian/Ubuntu
+      strStartsWith(area.name, "/ram/var/run/nscd") || // CentOS-6.8
       strStartsWith(area.name, "/var/db/nscd")) { // RedHat/Fedora
     return true;
   }
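
For context, here is a minimal standalone sketch of the prefix check that
the patch extends. It is not the DMTCP code: startsWith and
looksLikeNscdArea are hypothetical stand-ins for strStartsWith and
Util::isNscdArea, and the sketch takes a plain path string rather than a
ProcMapsArea. The only point is that a mapping named like the one in your
log (/ram/var/run/nscd/...) does not match any of the existing prefixes,
so it would only be recognized as an nscd area once the new prefix is
added.

```cpp
#include <cstring>
#include <string>

// Hypothetical stand-in for DMTCP's strStartsWith helper.
static bool startsWith(const std::string& s, const char* prefix) {
  // Compares the first strlen(prefix) characters of s against prefix;
  // a string shorter than the prefix never matches.
  return s.compare(0, std::strlen(prefix), prefix) == 0;
}

// Hypothetical, simplified version of the patched check. It takes the
// mapping name directly instead of a ProcMapsArea.
static bool looksLikeNscdArea(const std::string& name) {
  static const char* const kPrefixes[] = {
    "/run/nscd",          // OpenSUSE (newer)
    "/var/run/nscd",      // OpenSUSE (older)
    "/var/cache/nscd",    // Debian/Ubuntu
    "/ram/var/run/nscd",  // CentOS-6.8 (prefix added by the patch)
    "/var/db/nscd",       // RedHat/Fedora
  };
  for (const char* prefix : kPrefixes) {
    if (startsWith(name, prefix)) {
      return true;
    }
  }
  return false;
}
```

With the new prefix in the list, looksLikeNscdArea("/ram/var/run/nscd/dbuYHRnM")
returns true; without it, that path matches none of the other entries,
because each prefix is compared from the start of the string.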

Thanks,
Rohan

On Fri, Oct 14, 2016 at 07:02:04AM +0000, Sara Salem Hamouda wrote:
> 
> Hi Rohan,
> 
>     I am using the latest release on GitHub, which is DMTCP-2.4.5. I get 
> the same error with mpirun.
> 
> 
> I tried another MPI implementation, called OpenMPI-ULFM 
> (https://bitbucket.org/icldistcomp/ulfm), which I use in my research, and I 
> got the same error:
> 
> 
> [40000] ERROR at fileconnlist.cpp:318 in recreateShmFileAndMap; 
> REASON='JASSERT(fd != -1 || errno == EEXIST) failed'
>      area.name = /ram/var/run/nscd/dbuYHRnM
> orterun (40000): Terminating...
> ssh659@raijin3:~/dmtcp/dir_ckpt$ [41000] ERROR at fileconnlist.cpp:318 in 
> recreateShmFileAndMap; REASON='JASSERT(fd != -1 || errno == EEXIST) failed'
>      area.name = /ram/var/run/nscd/dbCEJazi
> dummy.ulfm (41000): Terminating...
> [42000] ERROR at fileconnlist.cpp:318 in recreateShmFileAndMap; 
> REASON='JASSERT(fd != -1 || errno == EEXIST) failed'
>      area.name = /ram/var/run/nscd/dbCEJazi
> dummy.ulfm (42000): Terminating...
> [43000] ERROR at fileconnlist.cpp:318 in recreateShmFileAndMap; 
> REASON='JASSERT(fd != -1 || errno == EEXIST) failed'
>      area.name = /ram/var/run/nscd/dbCEJazi
> dummy.ulfm (43000): Terminating...
> 
> The HANDSHAKE error appeared with MPICH, but not with OpenMPI-ULFM.
> 
> 
> Best Regards,
> Sara
> 
> Sara S. Hamouda
> PhD Candidate (Computer Systems Group)
> College of Engineering and Computer Science
> The Australian National University
> ________________________________
> From: Rohan Garg <rohg...@ccs.neu.edu>
> Sent: Friday, October 14, 2016 7:11:12 AM
> To: Sara Salem Hamouda
> Cc: dmtcp-forum@lists.sourceforge.net
> Subject: Re: [Dmtcp-forum] DMTCP MPI restart error on a single node
> 
> Hi Sara,
> 
> What version of DMTCP were you using? DMTCP-3.0 has some known issues
> with mpich-3.2, as reported by a DMTCP user. I'd recommend trying with
> DMTCP-2.5.
> 
> Also, could you try launching your MPI program with mpirun instead of
> mpiexec?
> 
> Thanks,
> Rohan
> 
> On Wed, Oct 12, 2016 at 11:30:41AM +0000, Sara Salem Hamouda wrote:
> > Dear DMTCP team,
> >
> >   I would appreciate your support with the issue below.
> >
> >
> > I am using a single machine to learn DMTCP. The operating system is "CentOS 
> > release 6.8", and it uses a network file system. I run a simple MPI program 
> > (dummy.c), using mpich V3.2.
> >
> >
> > On terminal-1:
> >
> > dmtcp_coordinator
> >
> >
> > On terminal-2:
> >
> > dmtcp_launch mpiexec -n 3 ./dummy.mpich2 10 10000
> >
> >
> > While dummy is running in terminal-2, I move to terminal-1 and press 'c', 
> > then 'q' to exit.
> >
> >
> > To restart, I run the generated dmtcp_restart_script.sh script, but I get 
> > the error below. Would you please advise on a possible fix for this issue?
> >
> >
> > (P.S. I tried the same steps on another machine (with Ubuntu 14.04 OS) that 
> > has a local file system, and the restart worked successfully. Is there a 
> > specific configuration I should use with network file systems?)
> >
> >
> > size = 1
> > [43000] ERROR at fileconnlist.cpp:318 in recreateShmFileAndMap; 
> > REASON='JASSERT(fd != -1 || errno == EEXIST) failed'
> >      area.name = /ram/var/run/nscd/dbbxzrxW
> > dummy.mpich2 (43000): Terminating...
> > [44000] ERROR at fileconnlist.cpp:318 in recreateShmFileAndMap; 
> > REASON='JASSERT(fd != -1 || errno == EEXIST) failed'
> >      area.name = /ram/var/run/nscd/dbbxzrxW
> > dummy.mpich2 (44000): Terminating...
> > [42000] ERROR at fileconnlist.cpp:318 in recreateShmFileAndMap; 
> > REASON='JASSERT(fd != -1 || errno == EEXIST) failed'
> >      area.name = /ram/var/run/nscd/dbbxzrxW
> > dummy.mpich2 (42000): Terminating...
> > [40000] ERROR at connectionidentifier.h:96 in assertValid; 
> > REASON='JASSERT(strcmp(sign, HANDSHAKE_SIGNATURE_MSG) == 0) failed'
> >      sign =
> > Message: read invalid message, signature mismatch. (External socket?)
> > mpiexec.hydra (40000): Terminating...
> > [41000] ERROR at connectionidentifier.h:96 in assertValid; 
> > REASON='JASSERT(strcmp(sign, HANDSHAKE_SIGNATURE_MSG) == 0) failed'
> >      sign =
> > Message: read invalid message, signature mismatch. (External socket?)
> > hydra_pmi_proxy (41000): Terminating...
> >
> >
> >
> > Best Regards,
> > Sara
> >
> > Sara S. Hamouda
> > PhD Candidate (Computer Systems Group)
> > College of Engineering and Computer Science
> > The Australian National University
> 
> > ------------------------------------------------------------------------------
> > Check out the vibrant tech community on one of the world's most
> > engaging tech sites, SlashDot.org! http://sdm.link/slashdot
> 
> > _______________________________________________
> > Dmtcp-forum mailing list
> > Dmtcp-forum@lists.sourceforge.net
> > https://lists.sourceforge.net/lists/listinfo/dmtcp-forum
