Hi all, I am trying to checkpoint an MVAPICH application. It does not behave as expected, so maybe you can give me some support.
I have compiled DMTCP with "--enable-infiniband-support" as the only flag. I have MVAPICH installed. I can execute a test MPI application on two nodes without DMTCP. I can also execute the application on a single node with DMTCP. However, if I execute it on two nodes with DMTCP, only the first one will run.

Below is a series of test commands with a lot of output, together with the versions of everything.

Any ideas?

Thanks for your help,

Manuel

--- ---
# mpichversion
MVAPICH2 Version: 2.2a
MVAPICH2 Release date: Mon Aug 17 20:00:00 EDT 2015
MVAPICH2 Device: ch3:mrail
MVAPICH2 configure: --disable-mcast
MVAPICH2 CC: gcc -DNDEBUG -DNVALGRIND -O2
MVAPICH2 CXX: g++ -DNDEBUG -DNVALGRIND -O2
MVAPICH2 F77: gfortran -L/lib -L/lib -O2
MVAPICH2 FC: gfortran -O2

# dmtcp_coordinator --version
dmtcp_coordinator (DMTCP) 2.4.1
--- ---

I can execute a test MPI application on two nodes (acme11 and acme12) with

--- ---
# mpirun_rsh -n 2 acme11 acme12 ./helloWorldMPI
Process 0 of 2 is on acme11.ciemat.es
Process 1 of 2 is on acme12.ciemat.es
Hello world from process 0 of 2
Hello world from process 1 of 2
Goodbye world from process 0 of 2
Goodbye world from process 1 of 2
--- ---

As you can see, it works correctly. If I try to execute the application with DMTCP, however, it does not. I run the coordinator on acme11, with port 7779. I can execute the application on a single node.
For example,

--- ---
# dmtcp_launch -h acme11 -p 7779 --ib mpirun_rsh -n 1 acme12 ./helloWorldMPI
[41000] NOTE at ssh.cpp:369 in prepareForExec; REASON='New ssh command'
     newCommand = /home/localsoft/dmtcp/bin/dmtcp_ssh /home/localsoft/dmtcp/bin/dmtcp_nocheckpoint /usr/bin/ssh -q acme12 cd /home/slurm/tests;/home/localsoft/dmtcp/bin/dmtcp_launch --coord-host 172.17.29.173 --coord-port 7779 --ckptdir /home/slurm/tests --infiniband /home/localsoft/dmtcp/bin/dmtcp_sshd /usr/bin/env MPISPAWN_MPIRUN_MPD=0 USE_LINEAR_SSH=1 MPISPAWN_MPIRUN_HOST=acme11.ciemat.es MPISPAWN_MPIRUN_HOSTIP=172.17.29.173 MPIRUN_RSH_LAUNCH=1 MPISPAWN_CHECKIN_PORT=33687 MPISPAWN_MPIRUN_PORT=33687 MPISPAWN_NNODES=1 MPISPAWN_GLOBAL_NPROCS=1 MPISPAWN_MPIRUN_ID=40000 MPISPAWN_ARGC=1 MPDMAN_KVS_TEMPLATE=kvs_885_acme11.ciemat.es_40000 MPISPAWN_LOCAL_NPROCS=1 MPISPAWN_ARGV_0='./helloWorldMPI' MPISPAWN_ARGC=1 MPISPAWN_GENERIC_ENV_COUNT=0 MPISPAWN_ID=0 MPISPAWN_WORKING_DIR=/home/slurm/tests MPISPAWN_MPIRUN_RANK_0=0 /usr/local/bin/mpispawn 0
Process 0 of 1 is on acme12.ciemat.es
Hello world from process 0 of 1
Goodbye world from process 0 of 1

COORDINATOR OUTPUT
[3984] NOTE at dmtcp_coordinator.cpp:1079 in onConnect; REASON='worker connected'
     hello_remote.from = 1d64b124afe30f29-4029-562310a2
[3984] NOTE at dmtcp_coordinator.cpp:867 in onData; REASON='Updating process Information after exec()'
     progname = mpirun_rsh
     msg.from = 1d64b124afe30f29-52000-562310a2
     client->identity() = 1d64b124afe30f29-4029-562310a2
[3984] NOTE at dmtcp_coordinator.cpp:1079 in onConnect; REASON='worker connected'
     hello_remote.from = 1d64b124afe30f29-52000-562310a2
[3984] NOTE at dmtcp_coordinator.cpp:858 in onData; REASON='Updating process Information after fork()'
     client->hostname() = acme11.ciemat.es
     client->progname() = mpirun_rsh_(forked)
     msg.from = 1d64b124afe30f29-53000-562310a2
     client->identity() = 1d64b124afe30f29-52000-562310a2
[3984] NOTE at dmtcp_coordinator.cpp:1079 in onConnect; REASON='worker connected'
     hello_remote.from = 1d64b124afe30f29-53000-562310a2
[3984] NOTE at dmtcp_coordinator.cpp:858 in onData; REASON='Updating process Information after fork()'
     client->hostname() = acme11.ciemat.es
     client->progname() = dmtcp_ssh_(forked)
     msg.from = 1d64b124afe30f29-54000-562310a2
     client->identity() = 1d64b124afe30f29-53000-562310a2
[3984] NOTE at dmtcp_coordinator.cpp:917 in onDisconnect; REASON='client disconnected'
     client->identity() = 1d64b124afe30f29-54000-562310a2
[3984] NOTE at dmtcp_coordinator.cpp:867 in onData; REASON='Updating process Information after exec()'
     progname = dmtcp_ssh
     msg.from = 1d64b124afe30f29-53000-562310a2
     client->identity() = 1d64b124afe30f29-53000-562310a2
[3984] NOTE at dmtcp_coordinator.cpp:1079 in onConnect; REASON='worker connected'
     hello_remote.from = 1b69d09fb3238b30-23945-562310a2
[3984] NOTE at dmtcp_coordinator.cpp:867 in onData; REASON='Updating process Information after exec()'
     progname = dmtcp_sshd
     msg.from = 1b69d09fb3238b30-55000-562310a2
     client->identity() = 1b69d09fb3238b30-23945-562310a2
[3984] NOTE at dmtcp_coordinator.cpp:1079 in onConnect; REASON='worker connected'
     hello_remote.from = 1b69d09fb3238b30-55000-562310a2
[3984] NOTE at dmtcp_coordinator.cpp:858 in onData; REASON='Updating process Information after fork()'
     client->hostname() = acme12.ciemat.es
     client->progname() = dmtcp_sshd_(forked)
     msg.from = 1b69d09fb3238b30-56000-562310a2
     client->identity() = 1b69d09fb3238b30-55000-562310a2
[3984] NOTE at dmtcp_coordinator.cpp:1079 in onConnect; REASON='worker connected'
     hello_remote.from = 1b69d09fb3238b30-56000-562310a2
[3984] NOTE at dmtcp_coordinator.cpp:858 in onData; REASON='Updating process Information after fork()'
     client->hostname() = acme12.ciemat.es
     client->progname() = mpispawn_(forked)
     msg.from = 1b69d09fb3238b30-57000-562310a2
     client->identity() = 1b69d09fb3238b30-56000-562310a2
[3984] NOTE at dmtcp_coordinator.cpp:867 in onData; REASON='Updating process Information after exec()'
     progname = env
     msg.from = 1b69d09fb3238b30-56000-562310a2
     client->identity() = 1b69d09fb3238b30-56000-562310a2
[3984] NOTE at dmtcp_coordinator.cpp:867 in onData; REASON='Updating process Information after exec()'
     progname = mpispawn
     msg.from = 1b69d09fb3238b30-56000-562310a2
     client->identity() = 1b69d09fb3238b30-56000-562310a2
[3984] NOTE at dmtcp_coordinator.cpp:867 in onData; REASON='Updating process Information after exec()'
     progname = helloWorldMPI
     msg.from = 1b69d09fb3238b30-57000-562310a2
     client->identity() = 1b69d09fb3238b30-57000-562310a2
[3984] NOTE at dmtcp_coordinator.cpp:917 in onDisconnect; REASON='client disconnected'
     client->identity() = 1b69d09fb3238b30-57000-562310a2
[3984] NOTE at dmtcp_coordinator.cpp:917 in onDisconnect; REASON='client disconnected'
     client->identity() = 1b69d09fb3238b30-56000-562310a2
[3984] NOTE at dmtcp_coordinator.cpp:917 in onDisconnect; REASON='client disconnected'
     client->identity() = 1b69d09fb3238b30-55000-562310a2
[3984] NOTE at dmtcp_coordinator.cpp:917 in onDisconnect; REASON='client disconnected'
     client->identity() = 1d64b124afe30f29-53000-562310a2
[3984] NOTE at dmtcp_coordinator.cpp:917 in onDisconnect; REASON='client disconnected'
     client->identity() = 1d64b124afe30f29-52000-562310a2
--- ---

So we see that it works correctly: the processes connect to the coordinator and so on. However, if I run the application on more than one node, as in the first example, it fails. What happens is that the first node on the node list executes the application, and the rest do not.
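As a quick way to see which ranks actually reported in, a small shell helper along the following lines could be piped onto either launch command. This is only a sketch: `missing_ranks` is a hypothetical name (it is not part of DMTCP or MVAPICH), and it assumes the test program prints "Hello world from process <rank> of <nprocs>" exactly as in the mpirun_rsh output shown earlier.

```shell
#!/bin/sh
# Hypothetical helper (not part of DMTCP or MVAPICH): reads program output
# on stdin and prints every rank in 0..N-1 that has no "Hello world" line.
#
# Possible usage with the commands from this post:
#   mpirun_rsh -n 2 acme11 acme12 ./helloWorldMPI | missing_ranks 2
#   dmtcp_launch -h acme11 -p 7779 --ib mpirun_rsh -n 2 acme11 acme12 ./helloWorldMPI 2>&1 | missing_ranks 2
missing_ranks() {
    nprocs=$1
    output=$(cat)    # capture the whole program output once
    rank=0
    while [ "$rank" -lt "$nprocs" ]; do
        # Assumed message format: "Hello world from process <rank> of <nprocs>"
        if ! printf '%s\n' "$output" | grep -qF "Hello world from process $rank of $nprocs"; then
            echo "$rank"
        fi
        rank=$((rank + 1))
    done
}
```

An empty result would mean every rank printed its greeting; any number printed is a rank whose output never arrived.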
---- ----
[root@acme11 tests]# dmtcp_launch -h acme11 -p 7779 --ib mpirun_rsh -n 2 acme11 acme12 ./helloWorldMPI
[59000] NOTE at ssh.cpp:369 in prepareForExec; REASON='New ssh command'
     newCommand = /home/localsoft/dmtcp/bin/dmtcp_ssh /home/localsoft/dmtcp/bin/dmtcp_nocheckpoint /usr/bin/ssh -q acme11 cd /home/slurm/tests;/home/localsoft/dmtcp/bin/dmtcp_launch --coord-host 172.17.29.173 --coord-port 7779 --ckptdir /home/slurm/tests --infiniband /home/localsoft/dmtcp/bin/dmtcp_sshd /usr/bin/env MPISPAWN_MPIRUN_MPD=0 USE_LINEAR_SSH=1 MPISPAWN_MPIRUN_HOST=acme11.ciemat.es MPISPAWN_MPIRUN_HOSTIP=172.17.29.173 MPIRUN_RSH_LAUNCH=1 MPISPAWN_CHECKIN_PORT=34203 MPISPAWN_MPIRUN_PORT=34203 MPISPAWN_NNODES=2 MPISPAWN_GLOBAL_NPROCS=2 MPISPAWN_MPIRUN_ID=58000 MPISPAWN_ARGC=1 MPDMAN_KVS_TEMPLATE=kvs_481_acme11.ciemat.es_58000 MPISPAWN_LOCAL_NPROCS=1 MPISPAWN_ARGV_0='./helloWorldMPI' MPISPAWN_ARGC=1 MPISPAWN_GENERIC_ENV_COUNT=0 MPISPAWN_ID=0 MPISPAWN_WORKING_DIR=/home/slurm/tests MPISPAWN_MPIRUN_RANK_0=0 /usr/local/bin/mpispawn 0
[60000] NOTE at ssh.cpp:369 in prepareForExec; REASON='New ssh command'
     newCommand = /home/localsoft/dmtcp/bin/dmtcp_ssh /home/localsoft/dmtcp/bin/dmtcp_nocheckpoint /usr/bin/ssh -q acme12 cd /home/slurm/tests;/home/localsoft/dmtcp/bin/dmtcp_launch --coord-host 172.17.29.173 --coord-port 7779 --ckptdir /home/slurm/tests --infiniband /home/localsoft/dmtcp/bin/dmtcp_sshd /usr/bin/env MPISPAWN_MPIRUN_MPD=0 USE_LINEAR_SSH=1 MPISPAWN_MPIRUN_HOST=acme11.ciemat.es MPISPAWN_MPIRUN_HOSTIP=172.17.29.173 MPIRUN_RSH_LAUNCH=1 MPISPAWN_CHECKIN_PORT=34203 MPISPAWN_MPIRUN_PORT=34203 MPISPAWN_NNODES=2 MPISPAWN_GLOBAL_NPROCS=2 MPISPAWN_MPIRUN_ID=58000 MPISPAWN_ARGC=1 MPDMAN_KVS_TEMPLATE=kvs_481_acme11.ciemat.es_58000 MPISPAWN_LOCAL_NPROCS=1 MPISPAWN_ARGV_0='./helloWorldMPI' MPISPAWN_ARGC=1 MPISPAWN_GENERIC_ENV_COUNT=0 MPISPAWN_ID=1 MPISPAWN_WORKING_DIR=/home/slurm/tests MPISPAWN_MPIRUN_RANK_0=1 /usr/local/bin/mpispawn 0
Process 0 of 2 is on acme11.ciemat.es
Hello world from process 0 of 2
Goodbye world from process 0 of 2

COORDINATOR OUTPUT
[3984] NOTE at dmtcp_coordinator.cpp:1079 in onConnect; REASON='worker connected'
     hello_remote.from = 1d64b124afe30f29-4070-56231173
[3984] NOTE at dmtcp_coordinator.cpp:867 in onData; REASON='Updating process Information after exec()'
     progname = mpirun_rsh
     msg.from = 1d64b124afe30f29-58000-56231173
     client->identity() = 1d64b124afe30f29-4070-56231173
[3984] NOTE at dmtcp_coordinator.cpp:1079 in onConnect; REASON='worker connected'
     hello_remote.from = 1d64b124afe30f29-58000-56231173
[3984] NOTE at dmtcp_coordinator.cpp:1079 in onConnect; REASON='worker connected'
     hello_remote.from = 1d64b124afe30f29-58000-56231173
[3984] NOTE at dmtcp_coordinator.cpp:858 in onData; REASON='Updating process Information after fork()'
     client->hostname() = acme11.ciemat.es
     client->progname() = mpirun_rsh_(forked)
     msg.from = 1d64b124afe30f29-59000-56231173
     client->identity() = 1d64b124afe30f29-58000-56231173
[3984] NOTE at dmtcp_coordinator.cpp:858 in onData; REASON='Updating process Information after fork()'
     client->hostname() = acme11.ciemat.es
     client->progname() = mpirun_rsh_(forked)
     msg.from = 1d64b124afe30f29-60000-56231173
     client->identity() = 1d64b124afe30f29-58000-56231173
[3984] NOTE at dmtcp_coordinator.cpp:1079 in onConnect; REASON='worker connected'
     hello_remote.from = 1d64b124afe30f29-59000-56231173
[3984] NOTE at dmtcp_coordinator.cpp:1079 in onConnect; REASON='worker connected'
     hello_remote.from = 1d64b124afe30f29-60000-56231173
[3984] NOTE at dmtcp_coordinator.cpp:858 in onData; REASON='Updating process Information after fork()'
     client->hostname() = acme11.ciemat.es
     client->progname() = dmtcp_ssh_(forked)
     msg.from = 1d64b124afe30f29-61000-56231173
     client->identity() = 1d64b124afe30f29-59000-56231173
[3984] NOTE at dmtcp_coordinator.cpp:858 in onData; REASON='Updating process Information after fork()'
     client->hostname() = acme11.ciemat.es
     client->progname() = dmtcp_ssh_(forked)
     msg.from = 1d64b124afe30f29-62000-56231173
     client->identity() = 1d64b124afe30f29-60000-56231173
[3984] NOTE at dmtcp_coordinator.cpp:917 in onDisconnect; REASON='client disconnected'
     client->identity() = 1d64b124afe30f29-61000-56231173
[3984] NOTE at dmtcp_coordinator.cpp:917 in onDisconnect; REASON='client disconnected'
     client->identity() = 1d64b124afe30f29-62000-56231173
[3984] NOTE at dmtcp_coordinator.cpp:867 in onData; REASON='Updating process Information after exec()'
     progname = dmtcp_ssh
     msg.from = 1d64b124afe30f29-59000-56231173
     client->identity() = 1d64b124afe30f29-59000-56231173
[3984] NOTE at dmtcp_coordinator.cpp:867 in onData; REASON='Updating process Information after exec()'
     progname = dmtcp_ssh
     msg.from = 1d64b124afe30f29-60000-56231173
     client->identity() = 1d64b124afe30f29-60000-56231173
[3984] NOTE at dmtcp_coordinator.cpp:1079 in onConnect; REASON='worker connected'
     hello_remote.from = 1b69d09fb3238b30-24001-56231173
[3984] NOTE at dmtcp_coordinator.cpp:1079 in onConnect; REASON='worker connected'
     hello_remote.from = 1d64b124afe30f29-4094-56231173
[3984] NOTE at dmtcp_coordinator.cpp:867 in onData; REASON='Updating process Information after exec()'
     progname = dmtcp_sshd
     msg.from = 1d64b124afe30f29-64000-56231173
     client->identity() = 1d64b124afe30f29-4094-56231173
[3984] NOTE at dmtcp_coordinator.cpp:867 in onData; REASON='Updating process Information after exec()'
     progname = dmtcp_sshd
     msg.from = 1b69d09fb3238b30-63000-56231173
     client->identity() = 1b69d09fb3238b30-24001-56231173
[3984] NOTE at dmtcp_coordinator.cpp:1079 in onConnect; REASON='worker connected'
     hello_remote.from = 1d64b124afe30f29-64000-56231173
[3984] NOTE at dmtcp_coordinator.cpp:1079 in onConnect; REASON='worker connected'
     hello_remote.from = 1b69d09fb3238b30-63000-56231173
[3984] NOTE at dmtcp_coordinator.cpp:858 in onData; REASON='Updating process Information after fork()'
     client->hostname() = acme11.ciemat.es
     client->progname() = dmtcp_sshd_(forked)
     msg.from = 1d64b124afe30f29-65000-56231173
     client->identity() = 1d64b124afe30f29-64000-56231173
[3984] NOTE at dmtcp_coordinator.cpp:858 in onData; REASON='Updating process Information after fork()'
     client->hostname() = acme12.ciemat.es
     client->progname() = dmtcp_sshd_(forked)
     msg.from = 1b69d09fb3238b30-66000-56231173
     client->identity() = 1b69d09fb3238b30-63000-56231173
[3984] NOTE at dmtcp_coordinator.cpp:867 in onData; REASON='Updating process Information after exec()'
     progname = env
     msg.from = 1d64b124afe30f29-65000-56231173
     client->identity() = 1d64b124afe30f29-65000-56231173
[3984] NOTE at dmtcp_coordinator.cpp:867 in onData; REASON='Updating process Information after exec()'
     progname = mpispawn
     msg.from = 1d64b124afe30f29-65000-56231173
     client->identity() = 1d64b124afe30f29-65000-56231173
[3984] NOTE at dmtcp_coordinator.cpp:1079 in onConnect; REASON='worker connected'
     hello_remote.from = 1b69d09fb3238b30-66000-56231173
[3984] NOTE at dmtcp_coordinator.cpp:1079 in onConnect; REASON='worker connected'
     hello_remote.from = 1d64b124afe30f29-65000-56231173
[3984] NOTE at dmtcp_coordinator.cpp:858 in onData; REASON='Updating process Information after fork()'
     client->hostname() = acme11.ciemat.es
     client->progname() = mpispawn_(forked)
     msg.from = 1d64b124afe30f29-68000-56231173
     client->identity() = 1d64b124afe30f29-65000-56231173
[3984] NOTE at dmtcp_coordinator.cpp:858 in onData; REASON='Updating process Information after fork()'
     client->hostname() = acme12.ciemat.es
     client->progname() = mpispawn_(forked)
     msg.from = 1b69d09fb3238b30-67000-56231173
     client->identity() = 1b69d09fb3238b30-66000-56231173
[3984] NOTE at dmtcp_coordinator.cpp:867 in onData; REASON='Updating process Information after exec()'
     progname = env
     msg.from = 1b69d09fb3238b30-66000-56231173
     client->identity() = 1b69d09fb3238b30-66000-56231173
[3984] NOTE at dmtcp_coordinator.cpp:867 in onData; REASON='Updating process Information after exec()'
     progname = mpispawn
     msg.from = 1b69d09fb3238b30-66000-56231173
     client->identity() = 1b69d09fb3238b30-66000-56231173
[3984] NOTE at dmtcp_coordinator.cpp:867 in onData; REASON='Updating process Information after exec()'
     progname = helloWorldMPI
     msg.from = 1d64b124afe30f29-68000-56231173
     client->identity() = 1d64b124afe30f29-68000-56231173
[3984] NOTE at dmtcp_coordinator.cpp:867 in onData; REASON='Updating process Information after exec()'
     progname = helloWorldMPI
     msg.from = 1b69d09fb3238b30-67000-56231173
     client->identity() = 1b69d09fb3238b30-67000-56231173
[3984] NOTE at dmtcp_coordinator.cpp:917 in onDisconnect; REASON='client disconnected'
     client->identity() = 1d64b124afe30f29-68000-56231173
[3984] NOTE at dmtcp_coordinator.cpp:917 in onDisconnect; REASON='client disconnected'
     client->identity() = 1b69d09fb3238b30-67000-56231173
[3984] NOTE at dmtcp_coordinator.cpp:917 in onDisconnect; REASON='client disconnected'
     client->identity() = 1d64b124afe30f29-65000-56231173
[3984] NOTE at dmtcp_coordinator.cpp:917 in onDisconnect; REASON='client disconnected'
     client->identity() = 1b69d09fb3238b30-66000-56231173
[3984] NOTE at dmtcp_coordinator.cpp:917 in onDisconnect; REASON='client disconnected'
     client->identity() = 1d64b124afe30f29-64000-56231173
[3984] NOTE at dmtcp_coordinator.cpp:917 in onDisconnect; REASON='client disconnected'
     client->identity() = 1b69d09fb3238b30-63000-56231173
[3984] NOTE at dmtcp_coordinator.cpp:917 in onDisconnect; REASON='client disconnected'
     client->identity() = 1d64b124afe30f29-59000-56231173
[3984] NOTE at dmtcp_coordinator.cpp:917 in onDisconnect; REASON='client disconnected'
     client->identity() = 1d64b124afe30f29-60000-56231173
[3984] NOTE at dmtcp_coordinator.cpp:917 in onDisconnect; REASON='client disconnected'
     client->identity() = 1d64b124afe30f29-58000-56231173
---- ----

--
Dr. Manuel Rodríguez-Pascual
skype: manuel.rodriguez.pascual
phone: (+34) 913466173 // (+34) 679925108
CIEMAT-Moncloa Edificio 22, desp. 1.25
Avenida Complutense, 40
28040- MADRID SPAIN