Hi Gene
I'll answer your questions below.
Thank you very much in advance.
Marcela
2015-03-10 20:44 GMT+01:00 Gene Cooperman <g...@ccs.neu.edu>:
> Hi Marcela,
> I have a few more questions about your configuration.
> 1. What version of DMTCP are you using? I'd recommend using the latest
> version from github:
> git clone https://github.com/dmtcp/dmtcp.git
I've tried dmtcp-2.3.1 and 2.2.1, and now also the latest version you
refer to (dmtcp-master).
>
> We've been enhancing the MPI support there, in preparation for
> the next release.
> 2. Are you using TCP (Ethernet) or InfiniBand? (I'm guessing TCP.)
TCP
>
>
>
> 3. Are you sure that you don't have any older DMTCP coordinators running?
> To be safe, you can do:
> pkill -9 dmtcp
> on each of your two hosts.
>
No, I don't. I usually run dmtcp_coordinator in a different shell,
but on the same node where I execute mpirun.
I'm sure that dmtcp_launch and the coordinator are the same version.
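The check suggested in question 3 could be sketched as below: list any process whose name contains "dmtcp" on each host before deciding whether to run pkill. This is only an illustration. The function name list_dmtcp_procs and the REMOTE_SH override are made up for this sketch, it assumes passwordless ssh between the nodes, and the host names in the usage comment are simply the ones that appear in the logs in this thread.

```shell
# Hypothetical helper (not part of DMTCP): show leftover DMTCP processes
# on each host given as an argument.
# REMOTE_SH is an assumed override for the remote runner; it defaults to ssh.
list_dmtcp_procs() {
    for host in "$@"; do
        echo "== $host =="
        # pgrep -l prints the PID and name of every process whose name
        # matches "dmtcp" (dmtcp_coordinator, dmtcp_launch, ...).
        "${REMOTE_SH:-ssh}" "$host" 'pgrep -l dmtcp || echo "no dmtcp processes found"'
    done
}

# Example usage (host names taken from the logs in this thread):
#   list_dmtcp_procs rionegro chubut
```

If something unexpected is listed, the pkill -9 dmtcp command from the question above can then be run on that host.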
> 4. I assume that you're doing something like:
> dmtcp_launch mpirun a.out
> (without using SLURM or other resource managers). Please let us know
> if it's something different.
>
Yes, I'm doing exactly that
dmtcp_launch mpirun -machinefile mf2 -np 4 bt.A.4
> 5. Have you tried testing with two MPI ranks on a single host?
> (Your hostfile could use "localhost" in this case.)
>
I've tried this option with all the versions and on all the nodes of the
cluster. If I run on only one node (dmtcp_launch mpirun -np 4 bt.A.4),
the checkpoint succeeds, but I can't restart.
I get this error with versions 2.3.1 and 2.2.1:
sh dmtcp_restart_script.sh
dmtcp_coordinator starting...
Host: rionegro (192.168.1.4)
Port: 7779
Checkpoint Interval: disabled (checkpoint manually instead)
Exit on last client: 1
Backgrounding...
[4458] mtcp_restart.c:1296 open_shared_file:
unable to create file /tmp/openmpi-sessions-mcastrol@rionegro_0/52412/1/shared_mem_pool.rionegro
[4459] mtcp_restart.c:1296 open_shared_file:
unable to create file /tmp/openmpi-sessions-mcastrol@rionegro_0/52412/1/shared_mem_pool.rionegro
[4457] mtcp_restart.c:1296 open_shared_file:
unable to create file /tmp/openmpi-sessions-mcastrol@rionegro_0/52412/1/shared_mem_pool.rionegro
[4460] mtcp_restart.c:1296 open_shared_file:
unable to create file /tmp/openmpi-sessions-mcastrol@rionegro_0/52412/1/shared_mem_pool.rionegro
With the dmtcp-master version, the restart hangs in a loop, repeatedly printing this error:
[warn] epoll_wait: Bad file descriptor.
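One thing that might be worth checking for the "unable to create file" error above is whether the directory that should hold shared_mem_pool.rionegro still exists at restart time. This is a guess and has not been confirmed in this thread: it assumes the Open MPI session directory under /tmp was removed when the original mpirun exited. A minimal sketch of recreating the parent directory before restarting follows; the helper name ensure_parent_dir is made up for illustration.

```shell
# Hypothetical helper (not part of DMTCP or Open MPI): make sure the
# directory that should contain a given file exists.
# Assumption: the restart fails only because this directory is missing.
ensure_parent_dir() {
    target=$1
    parent=$(dirname -- "$target")
    if [ ! -d "$parent" ]; then
        # Create the missing directory and any missing ancestors.
        mkdir -p -- "$parent" || return 1
    fi
}

# Example usage, with the path copied from the error message above:
#   ensure_parent_dir /tmp/openmpi-sessions-mcastrol@rionegro_0/52412/1/shared_mem_pool.rionegro \
#       && sh dmtcp_restart_script.sh
```

If the restart still fails with the directory in place, the cause is something else and this workaround does not apply.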
> We'll take this in steps. Jiajun is the member of our team who has
> been extending the MPI support (different dialects of MPI, resource
> managers, etc.). He'll be back on Thursday.
>
> If we can't diagnose the bug easily in this remote way, will it
> be possible to provide a temporary guest account (or virtual machine
> snapshot)
> where we can confirm the bug ourselves?
Yes, it is possible. Let me know.
>
>
> Best,
> - Gene
>
> On Mon, Mar 09, 2015 at 12:05:00PM +0100, Marcela Castro León wrote:
> > Hi
> > I'm trying to use dmtcp with an open-mpi (1.6.5) application (BT from
> > the NAS benchmarks).
> > At the moment I ask for a checkpoint in the coordinator by pressing "c",
> > the running application terminates with this error message:
> >
> >
> > [40000] ERROR at connectionidentifier.h:96 in assertValid;
> > REASON='JASSERT(strcmp(sign, HANDSHAKE_SIGNATURE_MSG) == 0) failed'
> > sign =
> > Message: read invalid message, signature mismatch. (External socket?)
> > orterun (40000): Terminating...
> > mcastrol@chubut:~/disconfs/software/NPB3.3.1/NPB3.3-MPI/bin$ [48000] ERROR
> > at connectionidentifier.h:96 in assertValid; REASON='JASSERT(strcmp(sign,
> > HANDSHAKE_SIGNATURE_MSG) == 0) failed'
> > sign =
> > Message: read invalid message, signature mismatch. (External socket?)
> > bt.A.4 (48000): Terminating...
> > [49000] ERROR at connectionidentifier.h:96 in assertValid;
> > REASON='JASSERT(strcmp(sign, HANDSHAKE_SIGNATURE_MSG) == 0) failed'
> > sign =
> > Message: read invalid message, signature mismatch. (External socket?)
> > bt.A.4 (49000): Terminating...
> >
> >
> >
> > I'm using two identical nodes; they have the same user, and the ssh public
> > keys (id_dsa.pub) have been exchanged. The OS is Ubuntu 12.04, kernel
> > 3.13.0-46.
> > I'd appreciate any clue to solve this issue.
> > Thank you very much in advance.
> > Marcela
> > _______________________________________________
> > Dmtcp-forum mailing list
> > Dmtcp-forum@lists.sourceforge.net
> > https://lists.sourceforge.net/lists/listinfo/dmtcp-forum