Hi Marcela,
    I have a few more questions about your configuration.
1.  What version of DMTCP are you using?  I'd recommend using the latest
    version from GitHub:
      git clone https://github.com/dmtcp/dmtcp.git
    We've been enhancing the MPI support there, in preparation for
    the next release.
2.  Are you using TCP (Ethernet) or InfiniBand?  (I'm guessing TCP.)
3.  Are you sure that you don't have any older DMTCP coordinators running?
    To be safe, you can do:
      pkill -9 dmtcp
    on each of your two hosts.
4.  I assume that you're doing something like:
      dmtcp_launch mpirun a.out
    (without using SLURM or other resource managers).  Please let us know
    if it's something different.
5.  Have you tried testing with two MPI ranks on a single host?
    (Your hostfile could use "localhost" in this case.)
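
    As a rough sketch of the single-host test in item 5 (not a definitive
    recipe), the script below writes a one-line hostfile and prints the
    launch command from item 4.  The hostfile name, the "slots=2" entry
    (Open MPI hostfile syntax), the RUN switch, and the binary name ./a.out
    are assumptions for illustration; substitute your own benchmark binary.

```shell
#!/bin/sh
# Sketch: two MPI ranks on a single host under DMTCP.
# Assumptions: Open MPI's mpirun and DMTCP's dmtcp_launch are on PATH,
# and ./a.out stands in for the real MPI binary.
HOSTFILE="${HOSTFILE:-./hostfile.localhost}"   # hypothetical file name
APP="${APP:-./a.out}"                          # placeholder binary

# One host with two slots, so both ranks land on this machine.
printf 'localhost slots=2\n' > "$HOSTFILE"

# Same shape as "dmtcp_launch mpirun a.out", restricted to localhost.
CMD="dmtcp_launch mpirun -np 2 --hostfile $HOSTFILE $APP"
echo "$CMD"

# Dry run by default; set RUN=1 to actually launch when DMTCP is installed.
if [ "${RUN:-0}" = "1" ] && command -v dmtcp_launch >/dev/null 2>&1; then
    $CMD
fi
```

    If the checkpoint succeeds here but fails with two hosts, that points
    toward the inter-node connections rather than the application itself.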

We'll take this in steps.  Jiajun is the member of our team who has
been extending the MPI support (different dialects of MPI, resource
managers, etc.).  He'll be back on Thursday.

If we can't diagnose the bug easily by email, would it be possible
to provide a temporary guest account (or a virtual machine snapshot)
where we can reproduce the bug ourselves?

Best,
- Gene

On Mon, Mar 09, 2015 at 12:05:00PM +0100, Marcela Castro León wrote:
> Hi
> I'm trying to use DMTCP with an Open MPI (1.6.5) application (BT from the
> NAS benchmarks).
> As soon as I request a checkpoint in the coordinator by pressing "c",
> the running application terminates after printing this error message:
> 
> 
> [40000] ERROR at connectionidentifier.h:96 in assertValid;
> REASON='JASSERT(strcmp(sign, HANDSHAKE_SIGNATURE_MSG) == 0) failed'
>      sign =
> Message: read invalid message, signature mismatch. (External socket?)
> orterun (40000): Terminating...
> mcastrol@chubut:~/disconfs/software/NPB3.3.1/NPB3.3-MPI/bin$ [48000] ERROR
> at connectionidentifier.h:96 in assertValid; REASON='JASSERT(strcmp(sign,
> HANDSHAKE_SIGNATURE_MSG) == 0) failed'
>      sign =
> Message: read invalid message, signature mismatch. (External socket?)
> bt.A.4 (48000): Terminating...
> [49000] ERROR at connectionidentifier.h:96 in assertValid;
> REASON='JASSERT(strcmp(sign, HANDSHAKE_SIGNATURE_MSG) == 0) failed'
>      sign =
> Message: read invalid message, signature mismatch. (External socket?)
> bt.A.4 (49000): Terminating...
> 
> 
> 
> I'm using two identical nodes. They have the same user, and the ssh public
> keys (id_dsa.pub) have been exchanged. The OS is Ubuntu 12.04, kernel
> 3.13.0-46.
> I'd appreciate any clue that helps solve this issue.
> Thank you very much in advance.
> Marcela

> ------------------------------------------------------------------------------
> Dive into the World of Parallel Programming The Go Parallel Website, sponsored
> by Intel and developed in partnership with Slashdot Media, is your hub for all
> things parallel software development, from weekly thought leadership blogs to
> news, videos, case studies, tutorials and more. Take a look and join the 
> conversation now. http://goparallel.sourceforge.net/

> _______________________________________________
> Dmtcp-forum mailing list
> Dmtcp-forum@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/dmtcp-forum

