Hey Jiajun,

Can you take a look at this problem? It is closer to your area of expertise :-).
Best,
Kapil

On Sat, Oct 17, 2015 at 11:31 PM, Manuel Rodríguez Pascual <manuel.rodriguez.pasc...@gmail.com> wrote:

Hi all,

I am trying to checkpoint an MVAPICH application. It does not behave as expected, so maybe you can give me some support.

I have compiled DMTCP with "--enable-infiniband-support" as the only flag. I have MVAPICH installed.

I can execute a test MPI application on two nodes without DMTCP. I can also execute the application on a single node with DMTCP. However, if I execute it on two nodes with DMTCP, only the first one will run.

Below is a series of test commands with a lot of output, together with the versions of everything.

Any ideas?

Thanks for your help,

Manuel

---
# mpichversion
MVAPICH2 Version:       2.2a
MVAPICH2 Release date:  Mon Aug 17 20:00:00 EDT 2015
MVAPICH2 Device:        ch3:mrail
MVAPICH2 configure:     --disable-mcast
MVAPICH2 CC:            gcc -DNDEBUG -DNVALGRIND -O2
MVAPICH2 CXX:           g++ -DNDEBUG -DNVALGRIND -O2
MVAPICH2 F77:           gfortran -L/lib -L/lib -O2
MVAPICH2 FC:            gfortran -O2

# dmtcp_coordinator --version
dmtcp_coordinator (DMTCP) 2.4.1
---

I can execute a test MPI application on two nodes (acme11 and acme12) without DMTCP:

---
# mpirun_rsh -n 2 acme11 acme12 ./helloWorldMPI
Process 0 of 2 is on acme11.ciemat.es
Process 1 of 2 is on acme12.ciemat.es
Hello world from process 0 of 2
Hello world from process 1 of 2
Goodbye world from process 0 of 2
Goodbye world from process 1 of 2
---

As you can see, it works correctly.

If I try to execute the application with DMTCP, however, it does not. I run the coordinator on acme11, on port 7779.
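(The coordinator invocation itself is not shown in this message; with DMTCP 2.4.1, a coordinator listening on port 7779 would typically be started on acme11 along these lines, where --daemon detaches it so it keeps running in the background:)

---
# dmtcp_coordinator --port 7779 --daemon
---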
I can execute the application on a single node. For example:

---
# dmtcp_launch -h acme11 -p 7779 --ib mpirun_rsh -n 1 acme12 ./helloWorldMPI
[41000] NOTE at ssh.cpp:369 in prepareForExec; REASON='New ssh command'
     newCommand = /home/localsoft/dmtcp/bin/dmtcp_ssh /home/localsoft/dmtcp/bin/dmtcp_nocheckpoint /usr/bin/ssh -q acme12 cd /home/slurm/tests;/home/localsoft/dmtcp/bin/dmtcp_launch --coord-host 172.17.29.173 --coord-port 7779 --ckptdir /home/slurm/tests --infiniband /home/localsoft/dmtcp/bin/dmtcp_sshd /usr/bin/env MPISPAWN_MPIRUN_MPD=0 USE_LINEAR_SSH=1 MPISPAWN_MPIRUN_HOST=acme11.ciemat.es MPISPAWN_MPIRUN_HOSTIP=172.17.29.173 MPIRUN_RSH_LAUNCH=1 MPISPAWN_CHECKIN_PORT=33687 MPISPAWN_MPIRUN_PORT=33687 MPISPAWN_NNODES=1 MPISPAWN_GLOBAL_NPROCS=1 MPISPAWN_MPIRUN_ID=40000 MPISPAWN_ARGC=1 MPDMAN_KVS_TEMPLATE=kvs_885_acme11.ciemat.es_40000 MPISPAWN_LOCAL_NPROCS=1 MPISPAWN_ARGV_0='./helloWorldMPI' MPISPAWN_ARGC=1 MPISPAWN_GENERIC_ENV_COUNT=0 MPISPAWN_ID=0 MPISPAWN_WORKING_DIR=/home/slurm/tests MPISPAWN_MPIRUN_RANK_0=0 /usr/local/bin/mpispawn 0
Process 0 of 1 is on acme12.ciemat.es
Hello world from process 0 of 1
Goodbye world from process 0 of 1
---

COORDINATOR OUTPUT:

---
[3984] NOTE at dmtcp_coordinator.cpp:1079 in onConnect; REASON='worker connected'
     hello_remote.from = 1d64b124afe30f29-4029-562310a2
[3984] NOTE at dmtcp_coordinator.cpp:867 in onData; REASON='Updating process Information after exec()'
     progname = mpirun_rsh
     msg.from = 1d64b124afe30f29-52000-562310a2
     client->identity() = 1d64b124afe30f29-4029-562310a2
[3984] NOTE at dmtcp_coordinator.cpp:1079 in onConnect; REASON='worker connected'
     hello_remote.from = 1d64b124afe30f29-52000-562310a2
[3984] NOTE at dmtcp_coordinator.cpp:858 in onData; REASON='Updating process Information after fork()'
     client->hostname() = acme11.ciemat.es
     client->progname() = mpirun_rsh_(forked)
     msg.from = 1d64b124afe30f29-53000-562310a2
     client->identity() = 1d64b124afe30f29-52000-562310a2
[3984] NOTE at dmtcp_coordinator.cpp:1079 in onConnect; REASON='worker connected'
     hello_remote.from = 1d64b124afe30f29-53000-562310a2
[3984] NOTE at dmtcp_coordinator.cpp:858 in onData; REASON='Updating process Information after fork()'
     client->hostname() = acme11.ciemat.es
     client->progname() = dmtcp_ssh_(forked)
     msg.from = 1d64b124afe30f29-54000-562310a2
     client->identity() = 1d64b124afe30f29-53000-562310a2
[3984] NOTE at dmtcp_coordinator.cpp:917 in onDisconnect; REASON='client disconnected'
     client->identity() = 1d64b124afe30f29-54000-562310a2
[3984] NOTE at dmtcp_coordinator.cpp:867 in onData; REASON='Updating process Information after exec()'
     progname = dmtcp_ssh
     msg.from = 1d64b124afe30f29-53000-562310a2
     client->identity() = 1d64b124afe30f29-53000-562310a2
[3984] NOTE at dmtcp_coordinator.cpp:1079 in onConnect; REASON='worker connected'
     hello_remote.from = 1b69d09fb3238b30-23945-562310a2
[3984] NOTE at dmtcp_coordinator.cpp:867 in onData; REASON='Updating process Information after exec()'
     progname = dmtcp_sshd
     msg.from = 1b69d09fb3238b30-55000-562310a2
     client->identity() = 1b69d09fb3238b30-23945-562310a2
[3984] NOTE at dmtcp_coordinator.cpp:1079 in onConnect; REASON='worker connected'
     hello_remote.from = 1b69d09fb3238b30-55000-562310a2
[3984] NOTE at dmtcp_coordinator.cpp:858 in onData; REASON='Updating process Information after fork()'
     client->hostname() = acme12.ciemat.es
     client->progname() = dmtcp_sshd_(forked)
     msg.from = 1b69d09fb3238b30-56000-562310a2
     client->identity() = 1b69d09fb3238b30-55000-562310a2
[3984] NOTE at dmtcp_coordinator.cpp:1079 in onConnect; REASON='worker connected'
     hello_remote.from = 1b69d09fb3238b30-56000-562310a2
[3984] NOTE at dmtcp_coordinator.cpp:858 in onData; REASON='Updating process Information after fork()'
     client->hostname() = acme12.ciemat.es
     client->progname() = mpispawn_(forked)
     msg.from = 1b69d09fb3238b30-57000-562310a2
     client->identity() = 1b69d09fb3238b30-56000-562310a2
[3984] NOTE at dmtcp_coordinator.cpp:867 in onData; REASON='Updating process Information after exec()'
     progname = env
     msg.from = 1b69d09fb3238b30-56000-562310a2
     client->identity() = 1b69d09fb3238b30-56000-562310a2
[3984] NOTE at dmtcp_coordinator.cpp:867 in onData; REASON='Updating process Information after exec()'
     progname = mpispawn
     msg.from = 1b69d09fb3238b30-56000-562310a2
     client->identity() = 1b69d09fb3238b30-56000-562310a2
[3984] NOTE at dmtcp_coordinator.cpp:867 in onData; REASON='Updating process Information after exec()'
     progname = helloWorldMPI
     msg.from = 1b69d09fb3238b30-57000-562310a2
     client->identity() = 1b69d09fb3238b30-57000-562310a2
[3984] NOTE at dmtcp_coordinator.cpp:917 in onDisconnect; REASON='client disconnected'
     client->identity() = 1b69d09fb3238b30-57000-562310a2
[3984] NOTE at dmtcp_coordinator.cpp:917 in onDisconnect; REASON='client disconnected'
     client->identity() = 1b69d09fb3238b30-56000-562310a2
[3984] NOTE at dmtcp_coordinator.cpp:917 in onDisconnect; REASON='client disconnected'
     client->identity() = 1b69d09fb3238b30-55000-562310a2
[3984] NOTE at dmtcp_coordinator.cpp:917 in onDisconnect; REASON='client disconnected'
     client->identity() = 1d64b124afe30f29-53000-562310a2
[3984] NOTE at dmtcp_coordinator.cpp:917 in onDisconnect; REASON='client disconnected'
     client->identity() = 1d64b124afe30f29-52000-562310a2
---

So we see that it is working correctly, connecting and so on.
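(The checkpoint itself is not shown in this message. With the job running under the coordinator, the usual DMTCP workflow would be to request a checkpoint with dmtcp_command and later resume from the dmtcp_restart_script.sh that DMTCP writes into the checkpoint directory, here /home/slurm/tests per the --ckptdir argument above:)

---
# dmtcp_command -h acme11 -p 7779 --checkpoint
# cd /home/slurm/tests && ./dmtcp_restart_script.sh
---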
However, if I run the application on more than one node, as in the first example, it crashes: the first node on the node list executes the application, and the rest do not.

----
[root@acme11 tests]# dmtcp_launch -h acme11 -p 7779 --ib mpirun_rsh -n 2 acme11 acme12 ./helloWorldMPI
[59000] NOTE at ssh.cpp:369 in prepareForExec; REASON='New ssh command'
     newCommand = /home/localsoft/dmtcp/bin/dmtcp_ssh /home/localsoft/dmtcp/bin/dmtcp_nocheckpoint /usr/bin/ssh -q acme11 cd /home/slurm/tests;/home/localsoft/dmtcp/bin/dmtcp_launch --coord-host 172.17.29.173 --coord-port 7779 --ckptdir /home/slurm/tests --infiniband /home/localsoft/dmtcp/bin/dmtcp_sshd /usr/bin/env MPISPAWN_MPIRUN_MPD=0 USE_LINEAR_SSH=1 MPISPAWN_MPIRUN_HOST=acme11.ciemat.es MPISPAWN_MPIRUN_HOSTIP=172.17.29.173 MPIRUN_RSH_LAUNCH=1 MPISPAWN_CHECKIN_PORT=34203 MPISPAWN_MPIRUN_PORT=34203 MPISPAWN_NNODES=2 MPISPAWN_GLOBAL_NPROCS=2 MPISPAWN_MPIRUN_ID=58000 MPISPAWN_ARGC=1 MPDMAN_KVS_TEMPLATE=kvs_481_acme11.ciemat.es_58000 MPISPAWN_LOCAL_NPROCS=1 MPISPAWN_ARGV_0='./helloWorldMPI' MPISPAWN_ARGC=1 MPISPAWN_GENERIC_ENV_COUNT=0 MPISPAWN_ID=0 MPISPAWN_WORKING_DIR=/home/slurm/tests MPISPAWN_MPIRUN_RANK_0=0 /usr/local/bin/mpispawn 0
[60000] NOTE at ssh.cpp:369 in prepareForExec; REASON='New ssh command'
     newCommand = /home/localsoft/dmtcp/bin/dmtcp_ssh /home/localsoft/dmtcp/bin/dmtcp_nocheckpoint /usr/bin/ssh -q acme12 cd /home/slurm/tests;/home/localsoft/dmtcp/bin/dmtcp_launch --coord-host 172.17.29.173 --coord-port 7779 --ckptdir /home/slurm/tests --infiniband /home/localsoft/dmtcp/bin/dmtcp_sshd /usr/bin/env MPISPAWN_MPIRUN_MPD=0 USE_LINEAR_SSH=1 MPISPAWN_MPIRUN_HOST=acme11.ciemat.es MPISPAWN_MPIRUN_HOSTIP=172.17.29.173 MPIRUN_RSH_LAUNCH=1 MPISPAWN_CHECKIN_PORT=34203 MPISPAWN_MPIRUN_PORT=34203 MPISPAWN_NNODES=2 MPISPAWN_GLOBAL_NPROCS=2 MPISPAWN_MPIRUN_ID=58000 MPISPAWN_ARGC=1 MPDMAN_KVS_TEMPLATE=kvs_481_acme11.ciemat.es_58000 MPISPAWN_LOCAL_NPROCS=1 MPISPAWN_ARGV_0='./helloWorldMPI' MPISPAWN_ARGC=1 MPISPAWN_GENERIC_ENV_COUNT=0 MPISPAWN_ID=1 MPISPAWN_WORKING_DIR=/home/slurm/tests MPISPAWN_MPIRUN_RANK_0=1 /usr/local/bin/mpispawn 0
Process 0 of 2 is on acme11.ciemat.es
Hello world from process 0 of 2
Goodbye world from process 0 of 2
----

COORDINATOR OUTPUT:

----
[3984] NOTE at dmtcp_coordinator.cpp:1079 in onConnect; REASON='worker connected'
     hello_remote.from = 1d64b124afe30f29-4070-56231173
[3984] NOTE at dmtcp_coordinator.cpp:867 in onData; REASON='Updating process Information after exec()'
     progname = mpirun_rsh
     msg.from = 1d64b124afe30f29-58000-56231173
     client->identity() = 1d64b124afe30f29-4070-56231173
[3984] NOTE at dmtcp_coordinator.cpp:1079 in onConnect; REASON='worker connected'
     hello_remote.from = 1d64b124afe30f29-58000-56231173
[3984] NOTE at dmtcp_coordinator.cpp:1079 in onConnect; REASON='worker connected'
     hello_remote.from = 1d64b124afe30f29-58000-56231173
[3984] NOTE at dmtcp_coordinator.cpp:858 in onData; REASON='Updating process Information after fork()'
     client->hostname() = acme11.ciemat.es
     client->progname() = mpirun_rsh_(forked)
     msg.from = 1d64b124afe30f29-59000-56231173
     client->identity() = 1d64b124afe30f29-58000-56231173
[3984] NOTE at dmtcp_coordinator.cpp:858 in onData; REASON='Updating process Information after fork()'
     client->hostname() = acme11.ciemat.es
     client->progname() = mpirun_rsh_(forked)
     msg.from = 1d64b124afe30f29-60000-56231173
     client->identity() = 1d64b124afe30f29-58000-56231173
[3984] NOTE at dmtcp_coordinator.cpp:1079 in onConnect; REASON='worker connected'
     hello_remote.from = 1d64b124afe30f29-59000-56231173
[3984] NOTE at dmtcp_coordinator.cpp:1079 in onConnect; REASON='worker connected'
     hello_remote.from = 1d64b124afe30f29-60000-56231173
[3984] NOTE at dmtcp_coordinator.cpp:858 in onData; REASON='Updating process Information after fork()'
     client->hostname() = acme11.ciemat.es
     client->progname() = dmtcp_ssh_(forked)
     msg.from = 1d64b124afe30f29-61000-56231173
     client->identity() = 1d64b124afe30f29-59000-56231173
[3984] NOTE at dmtcp_coordinator.cpp:858 in onData; REASON='Updating process Information after fork()'
     client->hostname() = acme11.ciemat.es
     client->progname() = dmtcp_ssh_(forked)
     msg.from = 1d64b124afe30f29-62000-56231173
     client->identity() = 1d64b124afe30f29-60000-56231173
[3984] NOTE at dmtcp_coordinator.cpp:917 in onDisconnect; REASON='client disconnected'
     client->identity() = 1d64b124afe30f29-61000-56231173
[3984] NOTE at dmtcp_coordinator.cpp:917 in onDisconnect; REASON='client disconnected'
     client->identity() = 1d64b124afe30f29-62000-56231173
[3984] NOTE at dmtcp_coordinator.cpp:867 in onData; REASON='Updating process Information after exec()'
     progname = dmtcp_ssh
     msg.from = 1d64b124afe30f29-59000-56231173
     client->identity() = 1d64b124afe30f29-59000-56231173
[3984] NOTE at dmtcp_coordinator.cpp:867 in onData; REASON='Updating process Information after exec()'
     progname = dmtcp_ssh
     msg.from = 1d64b124afe30f29-60000-56231173
     client->identity() = 1d64b124afe30f29-60000-56231173
[3984] NOTE at dmtcp_coordinator.cpp:1079 in onConnect; REASON='worker connected'
     hello_remote.from = 1b69d09fb3238b30-24001-56231173
[3984] NOTE at dmtcp_coordinator.cpp:1079 in onConnect; REASON='worker connected'
     hello_remote.from = 1d64b124afe30f29-4094-56231173
[3984] NOTE at dmtcp_coordinator.cpp:867 in onData; REASON='Updating process Information after exec()'
     progname = dmtcp_sshd
     msg.from = 1d64b124afe30f29-64000-56231173
     client->identity() = 1d64b124afe30f29-4094-56231173
[3984] NOTE at dmtcp_coordinator.cpp:867 in onData; REASON='Updating process Information after exec()'
     progname = dmtcp_sshd
     msg.from = 1b69d09fb3238b30-63000-56231173
     client->identity() = 1b69d09fb3238b30-24001-56231173
[3984] NOTE at dmtcp_coordinator.cpp:1079 in onConnect; REASON='worker connected'
     hello_remote.from = 1d64b124afe30f29-64000-56231173
[3984] NOTE at dmtcp_coordinator.cpp:1079 in onConnect; REASON='worker connected'
     hello_remote.from = 1b69d09fb3238b30-63000-56231173
[3984] NOTE at dmtcp_coordinator.cpp:858 in onData; REASON='Updating process Information after fork()'
     client->hostname() = acme11.ciemat.es
     client->progname() = dmtcp_sshd_(forked)
     msg.from = 1d64b124afe30f29-65000-56231173
     client->identity() = 1d64b124afe30f29-64000-56231173
[3984] NOTE at dmtcp_coordinator.cpp:858 in onData; REASON='Updating process Information after fork()'
     client->hostname() = acme12.ciemat.es
     client->progname() = dmtcp_sshd_(forked)
     msg.from = 1b69d09fb3238b30-66000-56231173
     client->identity() = 1b69d09fb3238b30-63000-56231173
[3984] NOTE at dmtcp_coordinator.cpp:867 in onData; REASON='Updating process Information after exec()'
     progname = env
     msg.from = 1d64b124afe30f29-65000-56231173
     client->identity() = 1d64b124afe30f29-65000-56231173
[3984] NOTE at dmtcp_coordinator.cpp:867 in onData; REASON='Updating process Information after exec()'
     progname = mpispawn
     msg.from = 1d64b124afe30f29-65000-56231173
     client->identity() = 1d64b124afe30f29-65000-56231173
[3984] NOTE at dmtcp_coordinator.cpp:1079 in onConnect; REASON='worker connected'
     hello_remote.from = 1b69d09fb3238b30-66000-56231173
[3984] NOTE at dmtcp_coordinator.cpp:1079 in onConnect; REASON='worker connected'
     hello_remote.from = 1d64b124afe30f29-65000-56231173
[3984] NOTE at dmtcp_coordinator.cpp:858 in onData; REASON='Updating process Information after fork()'
     client->hostname() = acme11.ciemat.es
     client->progname() = mpispawn_(forked)
     msg.from = 1d64b124afe30f29-68000-56231173
     client->identity() = 1d64b124afe30f29-65000-56231173
[3984] NOTE at dmtcp_coordinator.cpp:858 in onData; REASON='Updating process Information after fork()'
     client->hostname() = acme12.ciemat.es
     client->progname() = mpispawn_(forked)
     msg.from = 1b69d09fb3238b30-67000-56231173
     client->identity() = 1b69d09fb3238b30-66000-56231173
[3984] NOTE at dmtcp_coordinator.cpp:867 in onData; REASON='Updating process Information after exec()'
     progname = env
     msg.from = 1b69d09fb3238b30-66000-56231173
     client->identity() = 1b69d09fb3238b30-66000-56231173
[3984] NOTE at dmtcp_coordinator.cpp:867 in onData; REASON='Updating process Information after exec()'
     progname = mpispawn
     msg.from = 1b69d09fb3238b30-66000-56231173
     client->identity() = 1b69d09fb3238b30-66000-56231173
[3984] NOTE at dmtcp_coordinator.cpp:867 in onData; REASON='Updating process Information after exec()'
     progname = helloWorldMPI
     msg.from = 1d64b124afe30f29-68000-56231173
     client->identity() = 1d64b124afe30f29-68000-56231173
[3984] NOTE at dmtcp_coordinator.cpp:867 in onData; REASON='Updating process Information after exec()'
     progname = helloWorldMPI
     msg.from = 1b69d09fb3238b30-67000-56231173
     client->identity() = 1b69d09fb3238b30-67000-56231173
[3984] NOTE at dmtcp_coordinator.cpp:917 in onDisconnect; REASON='client disconnected'
     client->identity() = 1d64b124afe30f29-68000-56231173
[3984] NOTE at dmtcp_coordinator.cpp:917 in onDisconnect; REASON='client disconnected'
     client->identity() = 1b69d09fb3238b30-67000-56231173
[3984] NOTE at dmtcp_coordinator.cpp:917 in onDisconnect; REASON='client disconnected'
     client->identity() = 1d64b124afe30f29-65000-56231173
[3984] NOTE at dmtcp_coordinator.cpp:917 in onDisconnect; REASON='client disconnected'
     client->identity() = 1b69d09fb3238b30-66000-56231173
[3984] NOTE at dmtcp_coordinator.cpp:917 in onDisconnect; REASON='client disconnected'
     client->identity() = 1d64b124afe30f29-64000-56231173
[3984] NOTE at dmtcp_coordinator.cpp:917 in onDisconnect; REASON='client disconnected'
     client->identity() = 1b69d09fb3238b30-63000-56231173
[3984] NOTE at dmtcp_coordinator.cpp:917 in onDisconnect; REASON='client disconnected'
     client->identity() = 1d64b124afe30f29-59000-56231173
[3984] NOTE at dmtcp_coordinator.cpp:917 in onDisconnect; REASON='client disconnected'
     client->identity() = 1d64b124afe30f29-60000-56231173
[3984] NOTE at dmtcp_coordinator.cpp:917 in onDisconnect; REASON='client disconnected'
     client->identity() = 1d64b124afe30f29-58000-56231173
----

--
Dr. Manuel Rodríguez-Pascual
skype: manuel.rodriguez.pascual
phone: (+34) 913466173 // (+34) 679925108

CIEMAT-Moncloa
Edificio 22, desp. 1.25
Avenida Complutense, 40
28040 - MADRID
SPAIN
_______________________________________________
Dmtcp-forum mailing list
Dmtcp-forum@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/dmtcp-forum