Hi Sara,

What version of DMTCP were you using? DMTCP-3.0 has some known issues
with mpich-3.2, as reported by a DMTCP user. I'd recommend trying with
DMTCP-2.5.

Also, could you try launching your MPI program with mpirun instead of
mpiexec?
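
As a rough sketch of both suggestions (not a tested recipe for your setup), something like the following would print the installed DMTCP version and the launch line with mpirun swapped in. The launch_cmd helper is a hypothetical name used only for illustration; the program name and arguments are copied from your message.

```shell
# Hypothetical helper: print the DMTCP launch line for a given MPI launcher.
# Defaults to mpirun; the program and its arguments come from the original report.
launch_cmd() {
    printf 'dmtcp_launch %s -n 3 ./dummy.mpich2 10 10000\n' "${1:-mpirun}"
}

# Report the DMTCP version, but only if dmtcp_launch is on PATH.
if command -v dmtcp_launch >/dev/null 2>&1; then
    dmtcp_launch --version
fi

# Show the command to run on terminal-2 (dmtcp_coordinator stays on terminal-1).
launch_cmd mpirun
```

The printed line would replace the mpiexec command on terminal-2; the checkpoint and restart steps stay as you described them.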

Thanks,
Rohan

On Wed, Oct 12, 2016 at 11:30:41AM +0000, Sara Salem Hamouda wrote:
> Dear DMTCP team,
> 
>   I would appreciate your support with the issue below.
> 
> 
> I am using a single machine to learn DMTCP. The operating system is "CentOS 
> release 6.8", and it uses a network file system. I run a simple MPI program 
> (dummy.c), using mpich V3.2.
> 
> 
> On terminal-1:
> 
> dmtcp_coordinator
> 
> 
> On terminal-2:
> 
> dmtcp_launch mpiexec -n 3 ./dummy.mpich2 10 10000
> 
> 
> While dummy is running in terminal-2, I move to terminal-1 and press 'c', 
> then 'q' to exit.
> 
> 
> To restart, I run the generated dmtcp_restart_script.sh script, but I get the 
> error below. Would you please advise on a possible fix for this issue?
> 
> 
> (P.S. I tried the same steps on another machine (with Ubuntu 14.04 OS) that 
> has a local file system, and the restart worked successfully. Is there a 
> specific configuration I should use with network file systems?)
> 
> 
> size = 1
> [43000] ERROR at fileconnlist.cpp:318 in recreateShmFileAndMap; 
> REASON='JASSERT(fd != -1 || errno == EEXIST) failed'
>      area.name = /ram/var/run/nscd/dbbxzrxW
> dummy.mpich2 (43000): Terminating...
> [44000] ERROR at fileconnlist.cpp:318 in recreateShmFileAndMap; 
> REASON='JASSERT(fd != -1 || errno == EEXIST) failed'
>      area.name = /ram/var/run/nscd/dbbxzrxW
> dummy.mpich2 (44000): Terminating...
> [42000] ERROR at fileconnlist.cpp:318 in recreateShmFileAndMap; 
> REASON='JASSERT(fd != -1 || errno == EEXIST) failed'
>      area.name = /ram/var/run/nscd/dbbxzrxW
> dummy.mpich2 (42000): Terminating...
> [40000] ERROR at connectionidentifier.h:96 in assertValid; 
> REASON='JASSERT(strcmp(sign, HANDSHAKE_SIGNATURE_MSG) == 0) failed'
>      sign =
> Message: read invalid message, signature mismatch. (External socket?)
> mpiexec.hydra (40000): Terminating...
> [41000] ERROR at connectionidentifier.h:96 in assertValid; 
> REASON='JASSERT(strcmp(sign, HANDSHAKE_SIGNATURE_MSG) == 0) failed'
>      sign =
> Message: read invalid message, signature mismatch. (External socket?)
> hydra_pmi_proxy (41000): Terminating...
> 
> 
> 
> Best Regards,
> Sara
> 
> Sara S. Hamouda
> PhD Candidate (Computer Systems Group)
> College of Engineering and Computer Science
> The Australian National University

> ------------------------------------------------------------------------------
> Check out the vibrant tech community on one of the world's most 
> engaging tech sites, SlashDot.org! http://sdm.link/slashdot

> _______________________________________________
> Dmtcp-forum mailing list
> Dmtcp-forum@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/dmtcp-forum

