Hi Marcela,

I have a few more questions about your configuration.

1. What version of DMTCP are you using? I'd recommend using the latest
   version from github:
     git clone https://github.com/dmtcp/dmtcp.git
   We've been enhancing the MPI support there, in preparation for the
   next release.
2. Are you using TCP (Ethernet) or InfiniBand? (I'm guessing TCP.)
3. Are you sure that you don't have any older DMTCP coordinators running?
   To be safe, you can do:
     pkill -9 dmtcp
   on each of your two hosts.
4. I assume that you're doing something like:
     dmtcp_launch mpirun a.out
   (without using SLURM or other resource managers). Please let us know
   if it's something different.
5. Have you tried testing with two MPI ranks on a single host?
   (Your hostfile could use "localhost" in this case.)
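For questions 3 and 5, a single-host test could be prepared roughly as follows. This is only a sketch under some assumptions: the hostfile name `hostfile.local` is made up, `slots=2` is Open MPI hostfile syntax for two ranks on one host, and `./a.out` stands in for your own binary. The script lists any leftover DMTCP processes and then prints the launch command instead of running it, so it is safe to try anywhere.

```shell
#!/bin/sh
# Sketch only: file names here are hypothetical, adjust them to your setup.

# Hostfile for a single-host run: two slots (MPI ranks) on localhost.
HOSTFILE="${HOSTFILE:-hostfile.local}"
printf 'localhost slots=2\n' > "$HOSTFILE"

# Question 3: list leftover DMTCP processes, if any.
# pgrep exits non-zero when nothing matches, so the message is skipped then.
if pgrep -l dmtcp; then
    echo "Leftover DMTCP processes found; consider: pkill -9 dmtcp"
fi

# Question 5: the command to try next (printed, not executed).
# -np and --hostfile are standard Open MPI mpirun options.
echo "dmtcp_launch mpirun -np 2 --hostfile $HOSTFILE ./a.out"
```

If the checkpoint works in this single-host setup, the problem is more likely tied to the two-host configuration than to the application itself.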
We'll take this in steps. Jiajun is the member of our team who has been
extending the MPI support (different dialects of MPI, resource managers,
etc.). He'll be back on Thursday. If we can't easily diagnose the bug
remotely, would it be possible to provide a temporary guest account
(or virtual machine snapshot) where we can confirm the bug ourselves?

Best,
- Gene

On Mon, Mar 09, 2015 at 12:05:00PM +0100, Marcela Castro León wrote:
> Hi
> I'm trying to use dmtcp with an open-mpi (1.6.5) application (BT of NAS
> benchmark).
> At the moment I ask for a checkpoint in the coordinator by pressing "c",
> the running application terminates after printing this error message:
>
>
> [40000] ERROR at connectionidentifier.h:96 in assertValid;
> REASON='JASSERT(strcmp(sign, HANDSHAKE_SIGNATURE_MSG) == 0) failed'
> sign =
> Message: read invalid message, signature mismatch. (External socket?)
> orterun (40000): Terminating...
> mcastrol@chubut:~/disconfs/software/NPB3.3.1/NPB3.3-MPI/bin$ [48000] ERROR
> at connectionidentifier.h:96 in assertValid; REASON='JASSERT(strcmp(sign,
> HANDSHAKE_SIGNATURE_MSG) == 0) failed'
> sign =
> Message: read invalid message, signature mismatch. (External socket?)
> bt.A.4 (48000): Terminating...
> [49000] ERROR at connectionidentifier.h:96 in assertValid;
> REASON='JASSERT(strcmp(sign, HANDSHAKE_SIGNATURE_MSG) == 0) failed'
> sign =
> Message: read invalid message, signature mismatch. (External socket?)
> bt.A.4 (49000): Terminating...
>
>
>
> I'm using two identical nodes; they have the same user and the ssh public
> keys (id_dsa.pub) are interchanged. The OS is Ubuntu 12.04, kernel 3.13.0-46.
> I'd appreciate any clue to solve this issue.
> Thank you very much in advance.
> Marcela
> ------------------------------------------------------------------------------
> Dive into the World of Parallel Programming The Go Parallel Website, sponsored
> by Intel and developed in partnership with Slashdot Media, is your hub for all
> things parallel software development, from weekly thought leadership blogs to
> news, videos, case studies, tutorials and more. Take a look and join the
> conversation now. http://goparallel.sourceforge.net/
> _______________________________________________
> Dmtcp-forum mailing list
> Dmtcp-forum@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/dmtcp-forum