Hi Rohan,

I am using the latest release on GitHub, DMTCP-2.4.5. I get the same error
when launching with mpirun instead of mpiexec.


I also tried another MPI implementation, OpenMPI-ULFM
(https://bitbucket.org/icldistcomp/ulfm), which I use in my research, and got
the same error:


[40000] ERROR at fileconnlist.cpp:318 in recreateShmFileAndMap; 
REASON='JASSERT(fd != -1 || errno == EEXIST) failed'
     area.name = /ram/var/run/nscd/dbuYHRnM
orterun (40000): Terminating...
ssh659@raijin3:~/dmtcp/dir_ckpt$ [41000] ERROR at fileconnlist.cpp:318 in 
recreateShmFileAndMap; REASON='JASSERT(fd != -1 || errno == EEXIST) failed'
     area.name = /ram/var/run/nscd/dbCEJazi
dummy.ulfm (41000): Terminating...
[42000] ERROR at fileconnlist.cpp:318 in recreateShmFileAndMap; 
REASON='JASSERT(fd != -1 || errno == EEXIST) failed'
     area.name = /ram/var/run/nscd/dbCEJazi
dummy.ulfm (42000): Terminating...
[43000] ERROR at fileconnlist.cpp:318 in recreateShmFileAndMap; 
REASON='JASSERT(fd != -1 || errno == EEXIST) failed'
     area.name = /ram/var/run/nscd/dbCEJazi
dummy.ulfm (43000): Terminating...

The HANDSHAKE error appeared with MPICH, but not with OpenMPI-ULFM.
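
One way to narrow down the recreateShmFileAndMap failure outside DMTCP is to test the asserted condition (fd != -1 || errno == EEXIST) directly against the directory shown in area.name. The sketch below is only a diagnostic aid. It assumes that restart attempts an exclusive create of that file, which is inferred from the assertion text and not checked against the DMTCP source. The function name probe_exclusive_create is hypothetical.

```python
import errno
import os


def probe_exclusive_create(path):
    """Try to create `path` exclusively and report the outcome.

    Returns (ok, errno_name). `ok` mirrors the asserted condition
    `fd != -1 || errno == EEXIST`: it is True when the create succeeds
    or when the file already exists, and False for any other error.
    This is an assumption about what the restart code does, not a copy
    of DMTCP's logic.
    """
    try:
        fd = os.open(path, os.O_CREAT | os.O_EXCL | os.O_RDWR, 0o600)
    except OSError as exc:
        # Translate the numeric errno into its symbolic name (e.g. ENOENT).
        name = errno.errorcode.get(exc.errno, str(exc.errno))
        return exc.errno == errno.EEXIST, name
    # The create succeeded: close and remove the probe file again.
    os.close(fd)
    os.unlink(path)
    return True, None
```

Calling it on the restart node with a scratch file name under /ram/var/run/nscd/ (any unused name will do) shows which errno the create would hit. ENOENT would point to a missing directory, and EACCES or EROFS to a permissions or read-only problem.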


Best Regards,
Sara

Sara S. Hamouda
PhD Candidate (Computer Systems Group)
College of Engineering and Computer Science
The Australian National University
________________________________
From: Rohan Garg <rohg...@ccs.neu.edu>
Sent: Friday, October 14, 2016 7:11:12 AM
To: Sara Salem Hamouda
Cc: dmtcp-forum@lists.sourceforge.net
Subject: Re: [Dmtcp-forum] DMTCP MPI restart error on a single node

Hi Sara,

What version of DMTCP were you using? DMTCP-3.0 has some known issues
with mpich-3.2, as reported by a DMTCP user. I'd recommend trying with
DMTCP-2.5.

Also, could you try launching your MPI program with mpirun instead of
mpiexec?

Thanks,
Rohan

On Wed, Oct 12, 2016 at 11:30:41AM +0000, Sara Salem Hamouda wrote:
> Dear DMTCP team,
>
>   I would appreciate your support with the issue below.
>
>
> I am using a single machine to learn DMTCP. The operating system is "CentOS 
> release 6.8", and it uses a network file system. I run a simple MPI program 
> (dummy.c), using mpich V3.2.
>
>
> On terminal-1:
>
> dmtcp_coordinator
>
>
> On terminal-2:
>
> dmtcp_launch mpiexec -n 3 ./dummy.mpich2 10 10000
>
>
> While dummy is running in terminal-2, I move to terminal-1 and press 'c' to
> checkpoint, then 'q' to exit.
>
>
> To restart, I run the generated dmtcp_restart_script.sh script, but I get the 
> error below. Would you please advise on a possible fix for this issue?
>
>
> (P.S. I tried the same steps on another machine (with Ubuntu 14.04) that 
> has a local file system, and the restart worked successfully. Is there a 
> specific configuration I should use with network file systems?)
>
>
> size = 1
> [43000] ERROR at fileconnlist.cpp:318 in recreateShmFileAndMap; 
> REASON='JASSERT(fd != -1 || errno == EEXIST) failed'
>      area.name = /ram/var/run/nscd/dbbxzrxW
> dummy.mpich2 (43000): Terminating...
> [44000] ERROR at fileconnlist.cpp:318 in recreateShmFileAndMap; 
> REASON='JASSERT(fd != -1 || errno == EEXIST) failed'
>      area.name = /ram/var/run/nscd/dbbxzrxW
> dummy.mpich2 (44000): Terminating...
> [42000] ERROR at fileconnlist.cpp:318 in recreateShmFileAndMap; 
> REASON='JASSERT(fd != -1 || errno == EEXIST) failed'
>      area.name = /ram/var/run/nscd/dbbxzrxW
> dummy.mpich2 (42000): Terminating...
> [40000] ERROR at connectionidentifier.h:96 in assertValid; 
> REASON='JASSERT(strcmp(sign, HANDSHAKE_SIGNATURE_MSG) == 0) failed'
>      sign =
> Message: read invalid message, signature mismatch. (External socket?)
> mpiexec.hydra (40000): Terminating...
> [41000] ERROR at connectionidentifier.h:96 in assertValid; 
> REASON='JASSERT(strcmp(sign, HANDSHAKE_SIGNATURE_MSG) == 0) failed'
>      sign =
> Message: read invalid message, signature mismatch. (External socket?)
> hydra_pmi_proxy (41000): Terminating...
>
>
>
> Best Regards,
> Sara
>
> Sara S. Hamouda
> PhD Candidate (Computer Systems Group)
> College of Engineering and Computer Science
> The Australian National University


> _______________________________________________
> Dmtcp-forum mailing list
> Dmtcp-forum@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/dmtcp-forum



