Hi Sara,

What version of DMTCP were you using? DMTCP-3.0 has some known issues with mpich-3.2, as reported by a DMTCP user. I'd recommend trying with DMTCP-2.5.
Also, could you try launching your MPI program with mpirun instead of mpiexec?

Thanks,
Rohan

On Wed, Oct 12, 2016 at 11:30:41AM +0000, Sara Salem Hamouda wrote:
> Dear DMTCP team,
>
> I would appreciate your support regarding the issue below.
>
> I am using a single machine to learn DMTCP. The operating system is "CentOS
> release 6.8", and it uses a network file system. I run a simple MPI program
> (dummy.c), using mpich V3.2.
>
> On terminal-1:
>
>     dmtcp_coordinator
>
> On terminal-2:
>
>     dmtcp_launch mpiexec -n 3 ./dummy.mpich2 10 10000
>
> While dummy is running in terminal-2, I move to terminal-1 and press 'c',
> then 'q' to exit.
>
> To restart, I run the generated dmtcp_restart_script.sh script, but I get the
> error below. Would you please advise on a possible fix for this issue?
>
> (P.S. I tried the same steps on another machine (with Ubuntu 14.04 OS) that
> has a local file system, and the restart worked successfully. Is there a
> specific configuration I should use with network file systems?)
>
> size = 1
> [43000] ERROR at fileconnlist.cpp:318 in recreateShmFileAndMap;
> REASON='JASSERT(fd != -1 || errno == EEXIST) failed'
>      area.name = /ram/var/run/nscd/dbbxzrxW
> dummy.mpich2 (43000): Terminating...
> [44000] ERROR at fileconnlist.cpp:318 in recreateShmFileAndMap;
> REASON='JASSERT(fd != -1 || errno == EEXIST) failed'
>      area.name = /ram/var/run/nscd/dbbxzrxW
> dummy.mpich2 (44000): Terminating...
> [42000] ERROR at fileconnlist.cpp:318 in recreateShmFileAndMap;
> REASON='JASSERT(fd != -1 || errno == EEXIST) failed'
>      area.name = /ram/var/run/nscd/dbbxzrxW
> dummy.mpich2 (42000): Terminating...
> [40000] ERROR at connectionidentifier.h:96 in assertValid;
> REASON='JASSERT(strcmp(sign, HANDSHAKE_SIGNATURE_MSG) == 0) failed'
>      sign =
> Message: read invalid message, signature mismatch.  (External socket?)
> mpiexec.hydra (40000): Terminating...
> [41000] ERROR at connectionidentifier.h:96 in assertValid;
> REASON='JASSERT(strcmp(sign, HANDSHAKE_SIGNATURE_MSG) == 0) failed'
>      sign =
> Message: read invalid message, signature mismatch.  (External socket?)
> hydra_pmi_proxy (41000): Terminating...
>
> Best Regards,
> Sara
>
> Sara S. Hamouda
> PhD Candidate (Computer Systems Group)
> College of Engineering and Computer Science
> The Australian National University

_______________________________________________
Dmtcp-forum mailing list
Dmtcp-forum@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/dmtcp-forum