Thanks for the detailed report.
Concerning a guest account, you wrote:
>> Yes, it is possible. Let me know.

That would be great.  My office phone number is +1 617 373-8686
if you'd like to use an absolutely secure medium to provide a guest
account.  Otherwise, if you send me a personal e-mail (to me only) at:
  g...@ccs.neu.edu
that should be relatively secure.

If you can provide a guest account, I don't need answers to any of
the issues below.  But in case there is a delay, here are the next steps:

I propose that we begin by diagnosing with the GitHub version ('master'
branch).
Let's also begin by executing on a single node.
Running: dmtcp_launch mpirun -np 4 bt.A.4
with:  open-mpi (1.6.5) application (BT of NAS)

Let's try to do 'gdb attach' on the process that hangs and look at its
stack.  Could you do the following, and send back the stack trace?
RUN with a single node:
  dmtcp_launch mpirun -np 4 bt.A.4
THEN:
  gdb attach <PID>
OR:
  gdb bt.A.4 <PID>
AND THEN:
  (gdb) thread apply all where

Please send back the output.
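
If it is easier, the gdb steps above could be scripted. The following is
only a sketch, under these assumptions: pgrep is installed, and the hung
ranks appear under the process name bt.A.4. The names trace_pid,
dump_stacks, and the GDB variable are hypothetical, introduced here for
illustration. It relies on gdb's -p, -batch, and -ex options to attach to a
process, run one command, and exit.

```shell
#!/bin/sh
# Sketch only. Assumptions: pgrep is available, and the hung MPI ranks are
# visible under the process name passed in (for example bt.A.4).

# gdb binary to use; overridable so the script can be dry-run with GDB=echo.
GDB=${GDB:-gdb}

# Print a header, then the stacks of all threads of one process.
trace_pid() {
    echo "=== PID $1 ==="
    "$GDB" -p "$1" -batch -ex "thread apply all where" 2>&1
}

# Run trace_pid for every process whose name matches the pattern (-x: whole name).
dump_stacks() {
    for pid in $(pgrep -x "$1"); do
        trace_pid "$pid"
    done
}

# Usage: dump_stacks bt.A.4 > stacks.txt
```

On Ubuntu, attaching to a process that is already running may require root
or a relaxed kernel.yama.ptrace_scope setting.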

Thanks,
- Gene

On Wed, Mar 11, 2015 at 01:24:17PM +0100, Marcela Castro León wrote:
> Hi Gene
>  I'll answer your questions below.
> Thank you very much in advance.
> Marcela
> 
> 
> 
> 2015-03-10 20:44 GMT+01:00 Gene Cooperman <g...@ccs.neu.edu>:
> 
> > Hi Marcela,
> >     I have a few more questions about your configuration.
> > 1.  What version of DMTCP are you using?  I'd recommend using the latest
> >     version from github:
> >       git clone https://github.com/dmtcp/dmtcp.git
> 
> I've tried dmtcp-2.3.1 and 2.2.1, and now the latest version that you
> refer to (dmtcp-master).
> 
> >
> >     We've been enhancing the MPI support there, in preparation for
> >     the next release.
> > 2.  Are you using TCP (Ethernet) or InfiniBand?  (I'm guessing TCP.)
> TCP
> 
> > 3.  Are you sure that you don't have any older DMTCP coordinators running?
> >     To be safe, you can do:
> >       pkill -9 dmtcp
> >     on each of your two hosts.
> >
> No, I don't have any. I usually run dmtcp_coordinator in a different shell,
> but on the same node where I'm executing mpirun.
> 
> I'm sure that dmtcp_launch and the coordinator are the same version.
> 
> > 4.  I assume that you're doing something like:
> >       dmtcp_launch mpirun a.out
> >     (without using SLURM or other resource managers).  Please let us know
> >     if it's something different.
> >
> Yes, I'm doing exactly that:
> dmtcp_launch mpirun -machinefile mf2 -np 4 bt.A.4
> 
> 
> > 5.  Have you tried testing with two MPI ranks on a single host?
> >     (Your hostfile could use "localhost" in this case.)
> >
> I've tried this option with all the versions and on all the nodes of the
> cluster. If I execute on only one node
> (dmtcp_launch mpirun -np 4 bt.A.4), the checkpoint succeeds, but I
> couldn't restart.
> I obtained this error with versions 2.3.1 and 2.2.1:
> 
>  sh dmtcp_restart_script.sh
> dmtcp_coordinator starting...
>     Host: rionegro (192.168.1.4)
>     Port: 7779
>     Checkpoint Interval: disabled (checkpoint manually instead)
>     Exit on last client: 1
> Backgrounding...
> [4458] mtcp_restart.c:1296 open_shared_file:
>   unable to create file /tmp/openmpi-sessions-mcastrol@rionegro_0/52412/1/shared_mem_pool.rionegro
> [4459] mtcp_restart.c:1296 open_shared_file:
>   unable to create file /tmp/openmpi-sessions-mcastrol@rionegro_0/52412/1/shared_mem_pool.rionegro
> [4457] mtcp_restart.c:1296 open_shared_file:
>   unable to create file /tmp/openmpi-sessions-mcastrol@rionegro_0/52412/1/shared_mem_pool.rionegro
> [4460] mtcp_restart.c:1296 open_shared_file:
>   unable to create file /tmp/openmpi-sessions-mcastrol@rionegro_0/52412/1/shared_mem_pool.rionegro
> 
> 
> With the dmtcp-master version, the restart hangs in a loop, repeatedly
> printing this message:
> [warn] epoll_wait: Bad file descriptor.
> 
> 
> 
> > We'll take this in steps.  Jiajun is the member of our team who has
> > been extending the MPI support (different dialects of MPI, resource
> > managers, etc.).  He'll be back on Thursday.
> >
> > If we can't diagnose the bug easily in this remote way, will it
> > be possible to provide a temporary guest account (or virtual machine
> > snapshot) where we can confirm the bug ourselves?
> Yes, it is possible. Let me know.
> 
> > Best,
> > - Gene
> >
> > On Mon, Mar 09, 2015 at 12:05:00PM +0100, Marcela Castro León wrote:
> > > Hi
> > > I'm trying to use dmtcp with an open-mpi (1.6.5) application (BT of NAS
> > > benchmark).
> > > At the moment I ask for a checkpoint in the coordinator by pressing "c",
> > > the running application terminates, printing this error message:
> > >
> > >
> > > [40000] ERROR at connectionidentifier.h:96 in assertValid;
> > > REASON='JASSERT(strcmp(sign, HANDSHAKE_SIGNATURE_MSG) == 0) failed'
> > >      sign =
> > > Message: read invalid message, signature mismatch. (External socket?)
> > > orterun (40000): Terminating...
> > > mcastrol@chubut:~/disconfs/software/NPB3.3.1/NPB3.3-MPI/bin$ [48000] ERROR
> > > at connectionidentifier.h:96 in assertValid; REASON='JASSERT(strcmp(sign,
> > > HANDSHAKE_SIGNATURE_MSG) == 0) failed'
> > >      sign =
> > > Message: read invalid message, signature mismatch. (External socket?)
> > > bt.A.4 (48000): Terminating...
> > > [49000] ERROR at connectionidentifier.h:96 in assertValid;
> > > REASON='JASSERT(strcmp(sign, HANDSHAKE_SIGNATURE_MSG) == 0) failed'
> > >      sign =
> > > Message: read invalid message, signature mismatch. (External socket?)
> > > bt.A.4 (49000): Terminating...
> > >
> > >
> > >
> > > I'm using two identical nodes; they have the same user, and the ssh public
> > > keys (id_dsa.pub) have been exchanged. The OS is Ubuntu 12.04, kernel
> > > 3.13.0-46.
> > > I'd appreciate any clue to solve this issue.
> > > Thank you very much in advance.
> > > Marcela
> >
> > >
> > > ------------------------------------------------------------------------------
> > > Dive into the World of Parallel Programming The Go Parallel Website, sponsored
> > > by Intel and developed in partnership with Slashdot Media, is your hub for all
> > > things parallel software development, from weekly thought leadership blogs to
> > > news, videos, case studies, tutorials and more. Take a look and join the
> > > conversation now. http://goparallel.sourceforge.net/
> >
> > > _______________________________________________
> > > Dmtcp-forum mailing list
> > > Dmtcp-forum@lists.sourceforge.net
> > > https://lists.sourceforge.net/lists/listinfo/dmtcp-forum
> >
> >
