Hi all,

I am trying to checkpoint an MVAPICH application. It does not behave as
expected, so maybe you can give me some support.

I have compiled DMTCP with "--enable-infiniband-support" as the only flag. I
have MVAPICH installed.

I can execute a test MPI application on two nodes without DMTCP. I can also
execute the application on a single node with DMTCP. However, if I
execute it on two nodes with DMTCP, only the first one will run.

Below is a series of test commands with a lot of output, together
with the versions of everything.

Any ideas?

Thanks for your help,


Manuel


---
---

# mpichversion

MVAPICH2 Version:     2.2a

MVAPICH2 Release date: Mon Aug 17 20:00:00 EDT 2015

MVAPICH2 Device:      ch3:mrail

MVAPICH2 configure:   --disable-mcast

MVAPICH2 CC:  gcc    -DNDEBUG -DNVALGRIND -O2

MVAPICH2 CXX: g++   -DNDEBUG -DNVALGRIND -O2

MVAPICH2 F77: gfortran -L/lib -L/lib   -O2

MVAPICH2 FC:  gfortran   -O2

# dmtcp_coordinator --version

dmtcp_coordinator (DMTCP) 2.4.1

---

---


I can execute a test MPI application on two nodes (acme11 and acme12) without DMTCP:

---
---
# mpirun_rsh  -n 2  acme11 acme12 ./helloWorldMPI

Process 0 of 2 is on acme11.ciemat.es

Process 1 of 2 is on acme12.ciemat.es

Hello world from process 0 of 2

Hello world from process 1 of 2

Goodbye world from process 0 of 2

Goodbye world from process 1 of 2
---
---

As you can see, it works correctly.


If I try to execute the application with DMTCP, however, it does not.

I run the coordinator on acme11, with port 7779.


I can execute the application on a single node. For example,

---
---

#  dmtcp_launch -h acme11 -p 7779 --ib mpirun_rsh  -n 1  acme12
./helloWorldMPI

[41000] NOTE at ssh.cpp:369 in prepareForExec; REASON='New ssh command'

     newCommand = /home/localsoft/dmtcp/bin/dmtcp_ssh
/home/localsoft/dmtcp/bin/dmtcp_nocheckpoint /usr/bin/ssh -q acme12 cd
/home/slurm/tests;/home/localsoft/dmtcp/bin/dmtcp_launch --coord-host
172.17.29.173 --coord-port 7779 --ckptdir /home/slurm/tests --infiniband
/home/localsoft/dmtcp/bin/dmtcp_sshd /usr/bin/env  MPISPAWN_MPIRUN_MPD=0
USE_LINEAR_SSH=1 MPISPAWN_MPIRUN_HOST=acme11.ciemat.es
MPISPAWN_MPIRUN_HOSTIP=172.17.29.173 MPIRUN_RSH_LAUNCH=1
MPISPAWN_CHECKIN_PORT=33687 MPISPAWN_MPIRUN_PORT=33687 MPISPAWN_NNODES=1
MPISPAWN_GLOBAL_NPROCS=1 MPISPAWN_MPIRUN_ID=40000 MPISPAWN_ARGC=1
MPDMAN_KVS_TEMPLATE=kvs_885_acme11.ciemat.es_40000 MPISPAWN_LOCAL_NPROCS=1
MPISPAWN_ARGV_0='./helloWorldMPI' MPISPAWN_ARGC=1
MPISPAWN_GENERIC_ENV_COUNT=0  MPISPAWN_ID=0
MPISPAWN_WORKING_DIR=/home/slurm/tests MPISPAWN_MPIRUN_RANK_0=0
/usr/local/bin/mpispawn 0

Process 0 of 1 is on acme12.ciemat.es

Hello world from process 0 of 1

Goodbye world from process 0 of 1


COORDINATOR OUTPUT


[3984] NOTE at dmtcp_coordinator.cpp:1079 in onConnect; REASON='worker
connected'

     hello_remote.from = 1d64b124afe30f29-4029-562310a2

[3984] NOTE at dmtcp_coordinator.cpp:867 in onData; REASON='Updating
process Information after exec()'

     progname = mpirun_rsh

     msg.from = 1d64b124afe30f29-52000-562310a2

     client->identity() = 1d64b124afe30f29-4029-562310a2

[3984] NOTE at dmtcp_coordinator.cpp:1079 in onConnect; REASON='worker
connected'

     hello_remote.from = 1d64b124afe30f29-52000-562310a2

[3984] NOTE at dmtcp_coordinator.cpp:858 in onData; REASON='Updating
process Information after fork()'

     client->hostname() = acme11.ciemat.es

     client->progname() = mpirun_rsh_(forked)

     msg.from = 1d64b124afe30f29-53000-562310a2

     client->identity() = 1d64b124afe30f29-52000-562310a2

[3984] NOTE at dmtcp_coordinator.cpp:1079 in onConnect; REASON='worker
connected'

     hello_remote.from = 1d64b124afe30f29-53000-562310a2

[3984] NOTE at dmtcp_coordinator.cpp:858 in onData; REASON='Updating
process Information after fork()'

     client->hostname() = acme11.ciemat.es

     client->progname() = dmtcp_ssh_(forked)

     msg.from = 1d64b124afe30f29-54000-562310a2

     client->identity() = 1d64b124afe30f29-53000-562310a2

[3984] NOTE at dmtcp_coordinator.cpp:917 in onDisconnect; REASON='client
disconnected'

     client->identity() = 1d64b124afe30f29-54000-562310a2

[3984] NOTE at dmtcp_coordinator.cpp:867 in onData; REASON='Updating
process Information after exec()'

     progname = dmtcp_ssh

     msg.from = 1d64b124afe30f29-53000-562310a2

     client->identity() = 1d64b124afe30f29-53000-562310a2

[3984] NOTE at dmtcp_coordinator.cpp:1079 in onConnect; REASON='worker
connected'

     hello_remote.from = 1b69d09fb3238b30-23945-562310a2

[3984] NOTE at dmtcp_coordinator.cpp:867 in onData; REASON='Updating
process Information after exec()'

     progname = dmtcp_sshd

     msg.from = 1b69d09fb3238b30-55000-562310a2

     client->identity() = 1b69d09fb3238b30-23945-562310a2

[3984] NOTE at dmtcp_coordinator.cpp:1079 in onConnect; REASON='worker
connected'

     hello_remote.from = 1b69d09fb3238b30-55000-562310a2

[3984] NOTE at dmtcp_coordinator.cpp:858 in onData; REASON='Updating
process Information after fork()'

     client->hostname() = acme12.ciemat.es

     client->progname() = dmtcp_sshd_(forked)

     msg.from = 1b69d09fb3238b30-56000-562310a2

     client->identity() = 1b69d09fb3238b30-55000-562310a2

[3984] NOTE at dmtcp_coordinator.cpp:1079 in onConnect; REASON='worker
connected'

     hello_remote.from = 1b69d09fb3238b30-56000-562310a2

[3984] NOTE at dmtcp_coordinator.cpp:858 in onData; REASON='Updating
process Information after fork()'

     client->hostname() = acme12.ciemat.es

     client->progname() = mpispawn_(forked)

     msg.from = 1b69d09fb3238b30-57000-562310a2

     client->identity() = 1b69d09fb3238b30-56000-562310a2

[3984] NOTE at dmtcp_coordinator.cpp:867 in onData; REASON='Updating
process Information after exec()'

     progname = env

     msg.from = 1b69d09fb3238b30-56000-562310a2

     client->identity() = 1b69d09fb3238b30-56000-562310a2

[3984] NOTE at dmtcp_coordinator.cpp:867 in onData; REASON='Updating
process Information after exec()'

     progname = mpispawn

     msg.from = 1b69d09fb3238b30-56000-562310a2

     client->identity() = 1b69d09fb3238b30-56000-562310a2

[3984] NOTE at dmtcp_coordinator.cpp:867 in onData; REASON='Updating
process Information after exec()'

     progname = helloWorldMPI

     msg.from = 1b69d09fb3238b30-57000-562310a2

     client->identity() = 1b69d09fb3238b30-57000-562310a2

[3984] NOTE at dmtcp_coordinator.cpp:917 in onDisconnect; REASON='client
disconnected'

     client->identity() = 1b69d09fb3238b30-57000-562310a2

[3984] NOTE at dmtcp_coordinator.cpp:917 in onDisconnect; REASON='client
disconnected'

     client->identity() = 1b69d09fb3238b30-56000-562310a2

[3984] NOTE at dmtcp_coordinator.cpp:917 in onDisconnect; REASON='client
disconnected'

     client->identity() = 1b69d09fb3238b30-55000-562310a2

[3984] NOTE at dmtcp_coordinator.cpp:917 in onDisconnect; REASON='client
disconnected'

     client->identity() = 1d64b124afe30f29-53000-562310a2

[3984] NOTE at dmtcp_coordinator.cpp:917 in onDisconnect; REASON='client
disconnected'

     client->identity() = 1d64b124afe30f29-52000-562310a2


---
---

So we see that it is working correctly on a single node: the processes connect to the coordinator, and so on.
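
To cross-check coordinator output like the one above, a small script can list which programs were reported after exec() on each host. This is only a sketch: it assumes the log format shown in this message (a "progname = ..." line followed by a "msg.from = ..." line), and it assumes the first field of each identity identifies the host, which is consistent with the hostname lines in the log. The function name programs_by_host is made up for illustration.

```python
import re
from collections import defaultdict

# "progname = X" lines appear only in the exec() updates; the fork() updates
# print "client->progname() = X", which this pattern deliberately skips.
PROGNAME_RE = re.compile(r"\s*progname = (\S+)")
# Identity format seen in the log: <first field>-<pid>-<timestamp>.
# Assumption: the first field identifies the host.
MSG_FROM_RE = re.compile(r"\s*msg\.from = ([0-9a-f]+)-\d+-[0-9a-f]+")


def programs_by_host(log_text):
    """Map each host field to the program names reported after exec()."""
    result = defaultdict(list)
    pending = None  # progname still waiting for its msg.from line
    for line in log_text.splitlines():
        prog = PROGNAME_RE.match(line)
        if prog:
            pending = prog.group(1)
            continue
        sender = MSG_FROM_RE.match(line)
        if sender and pending is not None:
            result[sender.group(1)].append(pending)
            pending = None
    return dict(result)


if __name__ == "__main__":
    # Excerpt copied from the single-node coordinator output.
    sample = """\
     progname = mpispawn
     msg.from = 1b69d09fb3238b30-56000-562310a2
     client->identity() = 1b69d09fb3238b30-56000-562310a2
     progname = helloWorldMPI
     msg.from = 1b69d09fb3238b30-57000-562310a2
"""
    for host, programs in programs_by_host(sample).items():
        print(host, programs)
```

Running this on the full coordinator output of each test makes it easy to compare which hosts reported helloWorldMPI after exec().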

However, if I run the application on more than one node, as in the first
example, it crashes. What happens is that the first node on the node list
executes the application, and the rest do not.
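
The symptom can also be checked mechanically from the application output. The sketch below assumes the "Process N of M is on HOST" lines that helloWorldMPI prints in this message and reports which ranks never announced themselves. The name missing_ranks is hypothetical.

```python
import re

# Line format printed by the test program in this message.
RANK_RE = re.compile(r"Process (\d+) of (\d+) is on (\S+)")


def missing_ranks(output_text):
    """Return the sorted ranks that did not print a 'Process N of M' line."""
    seen = set()
    total = 0
    for match in RANK_RE.finditer(output_text):
        seen.add(int(match.group(1)))
        total = max(total, int(match.group(2)))
    return sorted(set(range(total)) - seen)


if __name__ == "__main__":
    # Output of the two-node run under DMTCP: only rank 0 reports in.
    failing = "Process 0 of 2 is on acme11.ciemat.es\n"
    print(missing_ranks(failing))
```

If the output contains no matching line at all, the function returns an empty list, so an empty result should only be read as success when at least one rank line was present.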

----
----

[root@acme11 tests]#  dmtcp_launch -h acme11 -p 7779 --ib mpirun_rsh  -n 2
acme11 acme12 ./helloWorldMPI

[59000] NOTE at ssh.cpp:369 in prepareForExec; REASON='New ssh command'

     newCommand = /home/localsoft/dmtcp/bin/dmtcp_ssh
/home/localsoft/dmtcp/bin/dmtcp_nocheckpoint /usr/bin/ssh -q acme11 cd
/home/slurm/tests;/home/localsoft/dmtcp/bin/dmtcp_launch --coord-host
172.17.29.173 --coord-port 7779 --ckptdir /home/slurm/tests --infiniband
/home/localsoft/dmtcp/bin/dmtcp_sshd /usr/bin/env  MPISPAWN_MPIRUN_MPD=0
USE_LINEAR_SSH=1 MPISPAWN_MPIRUN_HOST=acme11.ciemat.es
MPISPAWN_MPIRUN_HOSTIP=172.17.29.173 MPIRUN_RSH_LAUNCH=1
MPISPAWN_CHECKIN_PORT=34203 MPISPAWN_MPIRUN_PORT=34203 MPISPAWN_NNODES=2
MPISPAWN_GLOBAL_NPROCS=2 MPISPAWN_MPIRUN_ID=58000 MPISPAWN_ARGC=1
MPDMAN_KVS_TEMPLATE=kvs_481_acme11.ciemat.es_58000 MPISPAWN_LOCAL_NPROCS=1
MPISPAWN_ARGV_0='./helloWorldMPI' MPISPAWN_ARGC=1
MPISPAWN_GENERIC_ENV_COUNT=0  MPISPAWN_ID=0
MPISPAWN_WORKING_DIR=/home/slurm/tests MPISPAWN_MPIRUN_RANK_0=0
/usr/local/bin/mpispawn 0

[60000] NOTE at ssh.cpp:369 in prepareForExec; REASON='New ssh command'

     newCommand = /home/localsoft/dmtcp/bin/dmtcp_ssh
/home/localsoft/dmtcp/bin/dmtcp_nocheckpoint /usr/bin/ssh -q acme12 cd
/home/slurm/tests;/home/localsoft/dmtcp/bin/dmtcp_launch --coord-host
172.17.29.173 --coord-port 7779 --ckptdir /home/slurm/tests --infiniband
/home/localsoft/dmtcp/bin/dmtcp_sshd /usr/bin/env  MPISPAWN_MPIRUN_MPD=0
USE_LINEAR_SSH=1 MPISPAWN_MPIRUN_HOST=acme11.ciemat.es
MPISPAWN_MPIRUN_HOSTIP=172.17.29.173 MPIRUN_RSH_LAUNCH=1
MPISPAWN_CHECKIN_PORT=34203 MPISPAWN_MPIRUN_PORT=34203 MPISPAWN_NNODES=2
MPISPAWN_GLOBAL_NPROCS=2 MPISPAWN_MPIRUN_ID=58000 MPISPAWN_ARGC=1
MPDMAN_KVS_TEMPLATE=kvs_481_acme11.ciemat.es_58000 MPISPAWN_LOCAL_NPROCS=1
MPISPAWN_ARGV_0='./helloWorldMPI' MPISPAWN_ARGC=1
MPISPAWN_GENERIC_ENV_COUNT=0  MPISPAWN_ID=1
MPISPAWN_WORKING_DIR=/home/slurm/tests MPISPAWN_MPIRUN_RANK_0=1
/usr/local/bin/mpispawn 0

Process 0 of 2 is on acme11.ciemat.es

Hello world from process 0 of 2

Goodbye world from process 0 of 2

COORDINATOR OUTPUT


[3984] NOTE at dmtcp_coordinator.cpp:1079 in onConnect; REASON='worker
connected'

     hello_remote.from = 1d64b124afe30f29-4070-56231173

[3984] NOTE at dmtcp_coordinator.cpp:867 in onData; REASON='Updating
process Information after exec()'

     progname = mpirun_rsh

     msg.from = 1d64b124afe30f29-58000-56231173

     client->identity() = 1d64b124afe30f29-4070-56231173

[3984] NOTE at dmtcp_coordinator.cpp:1079 in onConnect; REASON='worker
connected'

     hello_remote.from = 1d64b124afe30f29-58000-56231173

[3984] NOTE at dmtcp_coordinator.cpp:1079 in onConnect; REASON='worker
connected'

     hello_remote.from = 1d64b124afe30f29-58000-56231173

[3984] NOTE at dmtcp_coordinator.cpp:858 in onData; REASON='Updating
process Information after fork()'

     client->hostname() = acme11.ciemat.es

     client->progname() = mpirun_rsh_(forked)

     msg.from = 1d64b124afe30f29-59000-56231173

     client->identity() = 1d64b124afe30f29-58000-56231173

[3984] NOTE at dmtcp_coordinator.cpp:858 in onData; REASON='Updating
process Information after fork()'

     client->hostname() = acme11.ciemat.es

     client->progname() = mpirun_rsh_(forked)

     msg.from = 1d64b124afe30f29-60000-56231173

     client->identity() = 1d64b124afe30f29-58000-56231173

[3984] NOTE at dmtcp_coordinator.cpp:1079 in onConnect; REASON='worker
connected'

     hello_remote.from = 1d64b124afe30f29-59000-56231173

[3984] NOTE at dmtcp_coordinator.cpp:1079 in onConnect; REASON='worker
connected'

     hello_remote.from = 1d64b124afe30f29-60000-56231173

[3984] NOTE at dmtcp_coordinator.cpp:858 in onData; REASON='Updating
process Information after fork()'

     client->hostname() = acme11.ciemat.es

     client->progname() = dmtcp_ssh_(forked)

     msg.from = 1d64b124afe30f29-61000-56231173

     client->identity() = 1d64b124afe30f29-59000-56231173

[3984] NOTE at dmtcp_coordinator.cpp:858 in onData; REASON='Updating
process Information after fork()'

     client->hostname() = acme11.ciemat.es

     client->progname() = dmtcp_ssh_(forked)

     msg.from = 1d64b124afe30f29-62000-56231173

     client->identity() = 1d64b124afe30f29-60000-56231173

[3984] NOTE at dmtcp_coordinator.cpp:917 in onDisconnect; REASON='client
disconnected'

     client->identity() = 1d64b124afe30f29-61000-56231173

[3984] NOTE at dmtcp_coordinator.cpp:917 in onDisconnect; REASON='client
disconnected'

     client->identity() = 1d64b124afe30f29-62000-56231173

[3984] NOTE at dmtcp_coordinator.cpp:867 in onData; REASON='Updating
process Information after exec()'

     progname = dmtcp_ssh

     msg.from = 1d64b124afe30f29-59000-56231173

     client->identity() = 1d64b124afe30f29-59000-56231173

[3984] NOTE at dmtcp_coordinator.cpp:867 in onData; REASON='Updating
process Information after exec()'

     progname = dmtcp_ssh

     msg.from = 1d64b124afe30f29-60000-56231173

     client->identity() = 1d64b124afe30f29-60000-56231173

[3984] NOTE at dmtcp_coordinator.cpp:1079 in onConnect; REASON='worker
connected'

     hello_remote.from = 1b69d09fb3238b30-24001-56231173

[3984] NOTE at dmtcp_coordinator.cpp:1079 in onConnect; REASON='worker
connected'

     hello_remote.from = 1d64b124afe30f29-4094-56231173

[3984] NOTE at dmtcp_coordinator.cpp:867 in onData; REASON='Updating
process Information after exec()'

     progname = dmtcp_sshd

     msg.from = 1d64b124afe30f29-64000-56231173

     client->identity() = 1d64b124afe30f29-4094-56231173

[3984] NOTE at dmtcp_coordinator.cpp:867 in onData; REASON='Updating
process Information after exec()'

     progname = dmtcp_sshd

     msg.from = 1b69d09fb3238b30-63000-56231173

     client->identity() = 1b69d09fb3238b30-24001-56231173

[3984] NOTE at dmtcp_coordinator.cpp:1079 in onConnect; REASON='worker
connected'

     hello_remote.from = 1d64b124afe30f29-64000-56231173

[3984] NOTE at dmtcp_coordinator.cpp:1079 in onConnect; REASON='worker
connected'

     hello_remote.from = 1b69d09fb3238b30-63000-56231173

[3984] NOTE at dmtcp_coordinator.cpp:858 in onData; REASON='Updating
process Information after fork()'

     client->hostname() = acme11.ciemat.es

     client->progname() = dmtcp_sshd_(forked)

     msg.from = 1d64b124afe30f29-65000-56231173

     client->identity() = 1d64b124afe30f29-64000-56231173

[3984] NOTE at dmtcp_coordinator.cpp:858 in onData; REASON='Updating
process Information after fork()'

     client->hostname() = acme12.ciemat.es

     client->progname() = dmtcp_sshd_(forked)

     msg.from = 1b69d09fb3238b30-66000-56231173

     client->identity() = 1b69d09fb3238b30-63000-56231173

[3984] NOTE at dmtcp_coordinator.cpp:867 in onData; REASON='Updating
process Information after exec()'

     progname = env

     msg.from = 1d64b124afe30f29-65000-56231173

     client->identity() = 1d64b124afe30f29-65000-56231173

[3984] NOTE at dmtcp_coordinator.cpp:867 in onData; REASON='Updating
process Information after exec()'

     progname = mpispawn

     msg.from = 1d64b124afe30f29-65000-56231173

     client->identity() = 1d64b124afe30f29-65000-56231173

[3984] NOTE at dmtcp_coordinator.cpp:1079 in onConnect; REASON='worker
connected'

     hello_remote.from = 1b69d09fb3238b30-66000-56231173

[3984] NOTE at dmtcp_coordinator.cpp:1079 in onConnect; REASON='worker
connected'

     hello_remote.from = 1d64b124afe30f29-65000-56231173

[3984] NOTE at dmtcp_coordinator.cpp:858 in onData; REASON='Updating
process Information after fork()'

     client->hostname() = acme11.ciemat.es

     client->progname() = mpispawn_(forked)

     msg.from = 1d64b124afe30f29-68000-56231173

     client->identity() = 1d64b124afe30f29-65000-56231173

[3984] NOTE at dmtcp_coordinator.cpp:858 in onData; REASON='Updating
process Information after fork()'

     client->hostname() = acme12.ciemat.es

     client->progname() = mpispawn_(forked)

     msg.from = 1b69d09fb3238b30-67000-56231173

     client->identity() = 1b69d09fb3238b30-66000-56231173

[3984] NOTE at dmtcp_coordinator.cpp:867 in onData; REASON='Updating
process Information after exec()'

     progname = env

     msg.from = 1b69d09fb3238b30-66000-56231173

     client->identity() = 1b69d09fb3238b30-66000-56231173

[3984] NOTE at dmtcp_coordinator.cpp:867 in onData; REASON='Updating
process Information after exec()'

     progname = mpispawn

     msg.from = 1b69d09fb3238b30-66000-56231173

     client->identity() = 1b69d09fb3238b30-66000-56231173

[3984] NOTE at dmtcp_coordinator.cpp:867 in onData; REASON='Updating
process Information after exec()'

     progname = helloWorldMPI

     msg.from = 1d64b124afe30f29-68000-56231173

     client->identity() = 1d64b124afe30f29-68000-56231173

[3984] NOTE at dmtcp_coordinator.cpp:867 in onData; REASON='Updating
process Information after exec()'

     progname = helloWorldMPI

     msg.from = 1b69d09fb3238b30-67000-56231173

     client->identity() = 1b69d09fb3238b30-67000-56231173

[3984] NOTE at dmtcp_coordinator.cpp:917 in onDisconnect; REASON='client
disconnected'

     client->identity() = 1d64b124afe30f29-68000-56231173

[3984] NOTE at dmtcp_coordinator.cpp:917 in onDisconnect; REASON='client
disconnected'

     client->identity() = 1b69d09fb3238b30-67000-56231173

[3984] NOTE at dmtcp_coordinator.cpp:917 in onDisconnect; REASON='client
disconnected'

     client->identity() = 1d64b124afe30f29-65000-56231173

[3984] NOTE at dmtcp_coordinator.cpp:917 in onDisconnect; REASON='client
disconnected'

     client->identity() = 1b69d09fb3238b30-66000-56231173

[3984] NOTE at dmtcp_coordinator.cpp:917 in onDisconnect; REASON='client
disconnected'

     client->identity() = 1d64b124afe30f29-64000-56231173

[3984] NOTE at dmtcp_coordinator.cpp:917 in onDisconnect; REASON='client
disconnected'

     client->identity() = 1b69d09fb3238b30-63000-56231173

[3984] NOTE at dmtcp_coordinator.cpp:917 in onDisconnect; REASON='client
disconnected'

     client->identity() = 1d64b124afe30f29-59000-56231173

[3984] NOTE at dmtcp_coordinator.cpp:917 in onDisconnect; REASON='client
disconnected'

     client->identity() = 1d64b124afe30f29-60000-56231173

[3984] NOTE at dmtcp_coordinator.cpp:917 in onDisconnect; REASON='client
disconnected'

     client->identity() = 1d64b124afe30f29-58000-56231173


----

----







-- 
Dr. Manuel Rodríguez-Pascual
skype: manuel.rodriguez.pascual
phone: (+34) 913466173 // (+34) 679925108

CIEMAT-Moncloa
Edificio 22, desp. 1.25
Avenida Complutense, 40
28040- MADRID
SPAIN