Hi Jiajun, all,

I have been performing some more tests.

When running the code with "mpirun_rsh -n 2 acme11 acme12 ./helloWorldMPI", there are problems if I want to use DMTCP (this is the scenario of the previous mail). Without the --ib flag the problem still persists; I think the error is the same, but you have the output below anyway.

In case it helps with the debugging, some info that might be relevant:
- The application can be executed without DMTCP.
- With DMTCP, only the first node on the list executes the code. But there seem to be exceptions to this:
-- If I execute mpirun_rsh with "-n 2 acme11 acme11" (or whatever node, but the same one twice), it crashes. This does not happen without DMTCP; in that case it works correctly.
-- If I use "-n 3 acme11 acme12 acme11" (three nodes, repeating one), it also crashes. It seems that if you put the same node more than once, it does not work.
-- The first node in the list is the only one that runs. For example, if I use "-n 2 acme11 acme12" then acme11 will execute the code; if I use "-n 2 acme12 acme11", then acme12 will. With three nodes it is identical: only the first one on the list runs.

However, I have seen that if I execute the application with another MPI library, MPICH, everything works as expected: "mpiexec -n 2 acme11 acme12 ./helloWorldMPI" runs correctly, and I can use DMTCP with "dmtcp_launch -h acme11 -p 7779 mpiexec -n 2 acme11 acme12 ./helloWorldMPI" and it succeeds. In this case, it works both with "--ib" and without it. Just in case it helps, that output is below too.

Thanks for your help,

Manuel

----
----

-bash-4.2$ dmtcp_launch -h acme11 -p 7779 mpirun_rsh -n 2 acme11 acme12 ./helloWorldMPI

[126000] NOTE at dmtcpworker.cpp:349 in DmtcpWorker; REASON=' *** InfiniBand library detected.
Please use dmtcp_launch --ib *** ' [127000] NOTE at ssh.cpp:369 in prepareForExec; REASON='New ssh command' newCommand = /home/localsoft/dmtcp/bin/dmtcp_ssh /home/localsoft/dmtcp/bin/dmtcp_nocheckpoint /usr/bin/ssh -q acme11 cd /home/slurm/tests;/home/localsoft/dmtcp/bin/dmtcp_launch --coord-host 172.17.29.173 --coord-port 7779 --ckptdir /home/slurm/tests /home/localsoft/dmtcp/bin/dmtcp_sshd /usr/bin/env MPISPAWN_MPIRUN_MPD=0 USE_LINEAR_SSH=1 MPISPAWN_MPIRUN_HOST=acme11.ciemat.es MPISPAWN_MPIRUN_HOSTIP=172.17.29.173 MPIRUN_RSH_LAUNCH=1 MPISPAWN_CHECKIN_PORT=59589 MPISPAWN_MPIRUN_PORT=59589 MPISPAWN_NNODES=2 MPISPAWN_GLOBAL_NPROCS=2 MPISPAWN_MPIRUN_ID=126000 MPISPAWN_ARGC=1 MPDMAN_KVS_TEMPLATE=kvs_311_acme11.ciemat.es_126000 MPISPAWN_LOCAL_NPROCS=1 MPISPAWN_ARGV_0='./helloWorldMPI' MPISPAWN_ARGC=1 MPISPAWN_GENERIC_ENV_COUNT=0 MPISPAWN_ID=0 MPISPAWN_WORKING_DIR=/home/slurm/tests MPISPAWN_MPIRUN_RANK_0=0 /usr/local/bin/mpispawn 0 [128000] NOTE at ssh.cpp:369 in prepareForExec; REASON='New ssh command' newCommand = /home/localsoft/dmtcp/bin/dmtcp_ssh /home/localsoft/dmtcp/bin/dmtcp_nocheckpoint /usr/bin/ssh -q acme12 cd /home/slurm/tests;/home/localsoft/dmtcp/bin/dmtcp_launch --coord-host 172.17.29.173 --coord-port 7779 --ckptdir /home/slurm/tests /home/localsoft/dmtcp/bin/dmtcp_sshd /usr/bin/env MPISPAWN_MPIRUN_MPD=0 USE_LINEAR_SSH=1 MPISPAWN_MPIRUN_HOST=acme11.ciemat.es MPISPAWN_MPIRUN_HOSTIP=172.17.29.173 MPIRUN_RSH_LAUNCH=1 MPISPAWN_CHECKIN_PORT=59589 MPISPAWN_MPIRUN_PORT=59589 MPISPAWN_NNODES=2 MPISPAWN_GLOBAL_NPROCS=2 MPISPAWN_MPIRUN_ID=126000 MPISPAWN_ARGC=1 MPDMAN_KVS_TEMPLATE=kvs_311_acme11.ciemat.es_126000 MPISPAWN_LOCAL_NPROCS=1 MPISPAWN_ARGV_0='./helloWorldMPI' MPISPAWN_ARGC=1 MPISPAWN_GENERIC_ENV_COUNT=0 MPISPAWN_ID=1 MPISPAWN_WORKING_DIR=/home/slurm/tests MPISPAWN_MPIRUN_RANK_0=1 /usr/local/bin/mpispawn 0 Process 0 of 2 is on acme11.ciemat.es Hello world from process 0 of 2 Goodbye world from process 0 of 2 COORDINATOR OUTPUT [28245] NOTE at 
dmtcp_coordinator.cpp:1079 in onConnect; REASON='worker connected' hello_remote.from = 1d64b124afe30f29-28766-56250e08 [28245] NOTE at dmtcp_coordinator.cpp:867 in onData; REASON='Updating process Information after exec()' progname = mpirun_rsh msg.from = 1d64b124afe30f29-126000-56250e08 client->identity() = 1d64b124afe30f29-28766-56250e08 [28245] NOTE at dmtcp_coordinator.cpp:1079 in onConnect; REASON='worker connected' hello_remote.from = 1d64b124afe30f29-126000-56250e08 [28245] NOTE at dmtcp_coordinator.cpp:1079 in onConnect; REASON='worker connected' hello_remote.from = 1d64b124afe30f29-126000-56250e08 [28245] NOTE at dmtcp_coordinator.cpp:858 in onData; REASON='Updating process Information after fork()' client->hostname() = acme11.ciemat.es client->progname() = mpirun_rsh_(forked) msg.from = 1d64b124afe30f29-127000-56250e08 client->identity() = 1d64b124afe30f29-126000-56250e08 [28245] NOTE at dmtcp_coordinator.cpp:858 in onData; REASON='Updating process Information after fork()' client->hostname() = acme11.ciemat.es client->progname() = mpirun_rsh_(forked) msg.from = 1d64b124afe30f29-128000-56250e08 client->identity() = 1d64b124afe30f29-126000-56250e08 [28245] NOTE at dmtcp_coordinator.cpp:1079 in onConnect; REASON='worker connected' hello_remote.from = 1d64b124afe30f29-127000-56250e08 [28245] NOTE at dmtcp_coordinator.cpp:1079 in onConnect; REASON='worker connected' hello_remote.from = 1d64b124afe30f29-128000-56250e08 [28245] NOTE at dmtcp_coordinator.cpp:858 in onData; REASON='Updating process Information after fork()' client->hostname() = acme11.ciemat.es client->progname() = dmtcp_ssh_(forked) msg.from = 1d64b124afe30f29-129000-56250e08 client->identity() = 1d64b124afe30f29-127000-56250e08 [28245] NOTE at dmtcp_coordinator.cpp:858 in onData; REASON='Updating process Information after fork()' client->hostname() = acme11.ciemat.es client->progname() = dmtcp_ssh_(forked) msg.from = 1d64b124afe30f29-130000-56250e08 client->identity() = 
1d64b124afe30f29-128000-56250e08 [28245] NOTE at dmtcp_coordinator.cpp:917 in onDisconnect; REASON='client disconnected' client->identity() = 1d64b124afe30f29-129000-56250e08 [28245] NOTE at dmtcp_coordinator.cpp:917 in onDisconnect; REASON='client disconnected' client->identity() = 1d64b124afe30f29-130000-56250e08 [28245] NOTE at dmtcp_coordinator.cpp:867 in onData; REASON='Updating process Information after exec()' progname = dmtcp_ssh msg.from = 1d64b124afe30f29-127000-56250e08 client->identity() = 1d64b124afe30f29-127000-56250e08 [28245] NOTE at dmtcp_coordinator.cpp:867 in onData; REASON='Updating process Information after exec()' progname = dmtcp_ssh msg.from = 1d64b124afe30f29-128000-56250e08 client->identity() = 1d64b124afe30f29-128000-56250e08 [28245] NOTE at dmtcp_coordinator.cpp:1079 in onConnect; REASON='worker connected' hello_remote.from = 1d64b124afe30f29-28786-56250e08 [28245] NOTE at dmtcp_coordinator.cpp:1079 in onConnect; REASON='worker connected' hello_remote.from = 1b69d09fb3238b30-12757-56250e08 [28245] NOTE at dmtcp_coordinator.cpp:867 in onData; REASON='Updating process Information after exec()' progname = dmtcp_sshd msg.from = 1d64b124afe30f29-131000-56250e08 client->identity() = 1d64b124afe30f29-28786-56250e08 [28245] NOTE at dmtcp_coordinator.cpp:867 in onData; REASON='Updating process Information after exec()' progname = dmtcp_sshd msg.from = 1b69d09fb3238b30-132000-56250e08 client->identity() = 1b69d09fb3238b30-12757-56250e08 [28245] NOTE at dmtcp_coordinator.cpp:1079 in onConnect; REASON='worker connected' hello_remote.from = 1d64b124afe30f29-131000-56250e08 [28245] NOTE at dmtcp_coordinator.cpp:1079 in onConnect; REASON='worker connected' hello_remote.from = 1b69d09fb3238b30-132000-56250e08 [28245] NOTE at dmtcp_coordinator.cpp:858 in onData; REASON='Updating process Information after fork()' client->hostname() = acme11.ciemat.es client->progname() = dmtcp_sshd_(forked) msg.from = 1d64b124afe30f29-133000-56250e08 client->identity() = 
1d64b124afe30f29-131000-56250e08 [28245] NOTE at dmtcp_coordinator.cpp:858 in onData; REASON='Updating process Information after fork()' client->hostname() = acme12.ciemat.es client->progname() = dmtcp_sshd_(forked) msg.from = 1b69d09fb3238b30-134000-56250e08 client->identity() = 1b69d09fb3238b30-132000-56250e08 [28245] NOTE at dmtcp_coordinator.cpp:1079 in onConnect; REASON='worker connected' hello_remote.from = 1d64b124afe30f29-133000-56250e08 [28245] NOTE at dmtcp_coordinator.cpp:1079 in onConnect; REASON='worker connected' hello_remote.from = 1b69d09fb3238b30-134000-56250e08 [28245] NOTE at dmtcp_coordinator.cpp:858 in onData; REASON='Updating process Information after fork()' client->hostname() = acme11.ciemat.es client->progname() = mpispawn_(forked) msg.from = 1d64b124afe30f29-135000-56250e08 client->identity() = 1d64b124afe30f29-133000-56250e08 [28245] NOTE at dmtcp_coordinator.cpp:858 in onData; REASON='Updating process Information after fork()' client->hostname() = acme12.ciemat.es client->progname() = mpispawn_(forked) msg.from = 1b69d09fb3238b30-136000-56250e08 client->identity() = 1b69d09fb3238b30-134000-56250e08 [28245] NOTE at dmtcp_coordinator.cpp:867 in onData; REASON='Updating process Information after exec()' progname = env msg.from = 1d64b124afe30f29-133000-56250e08 client->identity() = 1d64b124afe30f29-133000-56250e08 [28245] NOTE at dmtcp_coordinator.cpp:867 in onData; REASON='Updating process Information after exec()' progname = mpispawn msg.from = 1d64b124afe30f29-133000-56250e08 client->identity() = 1d64b124afe30f29-133000-56250e08 [28245] NOTE at dmtcp_coordinator.cpp:867 in onData; REASON='Updating process Information after exec()' progname = env msg.from = 1b69d09fb3238b30-134000-56250e08 client->identity() = 1b69d09fb3238b30-134000-56250e08 [28245] NOTE at dmtcp_coordinator.cpp:867 in onData; REASON='Updating process Information after exec()' progname = mpispawn msg.from = 1b69d09fb3238b30-134000-56250e08 client->identity() = 
1b69d09fb3238b30-134000-56250e08 [28245] NOTE at dmtcp_coordinator.cpp:867 in onData; REASON='Updating process Information after exec()' progname = helloWorldMPI msg.from = 1d64b124afe30f29-135000-56250e08 client->identity() = 1d64b124afe30f29-135000-56250e08 [28245] NOTE at dmtcp_coordinator.cpp:867 in onData; REASON='Updating process Information after exec()' progname = helloWorldMPI msg.from = 1b69d09fb3238b30-136000-56250e08 client->identity() = 1b69d09fb3238b30-136000-56250e08 [28245] NOTE at dmtcp_coordinator.cpp:917 in onDisconnect; REASON='client disconnected' client->identity() = 1d64b124afe30f29-135000-56250e08 [28245] NOTE at dmtcp_coordinator.cpp:917 in onDisconnect; REASON='client disconnected' client->identity() = 1b69d09fb3238b30-136000-56250e08 [28245] NOTE at dmtcp_coordinator.cpp:917 in onDisconnect; REASON='client disconnected' client->identity() = 1d64b124afe30f29-133000-56250e08 [28245] NOTE at dmtcp_coordinator.cpp:917 in onDisconnect; REASON='client disconnected' client->identity() = 1d64b124afe30f29-131000-56250e08 [28245] NOTE at dmtcp_coordinator.cpp:917 in onDisconnect; REASON='client disconnected' client->identity() = 1b69d09fb3238b30-134000-56250e08 [28245] NOTE at dmtcp_coordinator.cpp:917 in onDisconnect; REASON='client disconnected' client->identity() = 1b69d09fb3238b30-132000-56250e08 [28245] NOTE at dmtcp_coordinator.cpp:917 in onDisconnect; REASON='client disconnected' client->identity() = 1d64b124afe30f29-127000-56250e08 [28245] NOTE at dmtcp_coordinator.cpp:917 in onDisconnect; REASON='client disconnected' client->identity() = 1d64b124afe30f29-128000-56250e08 [28245] NOTE at dmtcp_coordinator.cpp:917 in onDisconnect; REASON='client disconnected' client->identity() = 1d64b124afe30f29-126000-56250e08 ----- ----- WITH MPIEXEC [root@acme11 tests]# dmtcp_launch --ib mpiexec -f machinefile -n 3 /home/slurm/tests/helloWorldMPI [42000] NOTE at ssh.cpp:369 in prepareForExec; REASON='New ssh command' newCommand = 
/home/localsoft/dmtcp/bin/dmtcp_ssh /home/localsoft/dmtcp/bin/dmtcp_nocheckpoint /usr/bin/ssh -x acme12 /home/localsoft/dmtcp/bin/dmtcp_launch --coord-host 172.17.31.157 --coord-port 7779 --ckptdir /home/slurm/tests --infiniband /home/localsoft/dmtcp/bin/dmtcp_sshd "/home/localsoft/mpich3/bin//hydra_pmi_proxy" --control-port acme11:44279 --rmk user --launcher ssh --demux poll --pgid 0 --retries 10 --usize -2 --proxy-id 1 [43000] NOTE at ssh.cpp:369 in prepareForExec; REASON='New ssh command' newCommand = /home/localsoft/dmtcp/bin/dmtcp_ssh /home/localsoft/dmtcp/bin/dmtcp_nocheckpoint /usr/bin/ssh -x acme13 /home/localsoft/dmtcp/bin/dmtcp_launch --coord-host 172.17.31.157 --coord-port 7779 --ckptdir /home/slurm/tests --infiniband /home/localsoft/dmtcp/bin/dmtcp_sshd "/home/localsoft/mpich3/bin//hydra_pmi_proxy" --control-port acme11:44279 --rmk user --launcher ssh --demux poll --pgid 0 --retries 10 --usize -2 --proxy-id 2 Process 0 of 3 is on acme11.ciemat.es Hello world from process 0 of 3 this is iteration 0 on process 0 of host acme11.ciemat.es Process 2 of 3 is on acme13.ciemat.es Hello world from process 2 of 3 Process 1 of 3 is on acme12.ciemat.es Hello world from process 1 of 3 Goodbye world from process 1 of 3 Goodbye world from process 2 of 3 Goodbye world from process 0 of 3 COORDINATOR [root@acme11 ~]# dmtcp_coordinator dmtcp_coordinator starting... Host: acme11.ciemat.es (172.17.31.157) Port: 7779 Checkpoint Interval: disabled (checkpoint manually instead) Exit on last client: 0 Type '?' for help. 
[21211] NOTE at dmtcp_coordinator.cpp:1661 in updateCheckpointInterval; REASON='CheckpointInterval updated (for this computation only)' oldInterval = 0 theCheckpointInterval = 0 [21211] NOTE at dmtcp_coordinator.cpp:1079 in onConnect; REASON='worker connected' hello_remote.from = 1d64b124afe30f29-21212-56252da6 [21211] NOTE at dmtcp_coordinator.cpp:867 in onData; REASON='Updating process Information after exec()' progname = mpiexec.hydra msg.from = 1d64b124afe30f29-40000-56252da6 client->identity() = 1d64b124afe30f29-21212-56252da6 [21211] NOTE at dmtcp_coordinator.cpp:1079 in onConnect; REASON='worker connected' hello_remote.from = 1d64b124afe30f29-40000-56252da6 [21211] NOTE at dmtcp_coordinator.cpp:858 in onData; REASON='Updating process Information after fork()' client->hostname() = acme11.ciemat.es client->progname() = mpiexec.hydra_(forked) msg.from = 1d64b124afe30f29-41000-56252da6 client->identity() = 1d64b124afe30f29-40000-56252da6 [21211] NOTE at dmtcp_coordinator.cpp:1079 in onConnect; REASON='worker connected' hello_remote.from = 1d64b124afe30f29-40000-56252da6 [21211] NOTE at dmtcp_coordinator.cpp:858 in onData; REASON='Updating process Information after fork()' client->hostname() = acme11.ciemat.es client->progname() = mpiexec.hydra_(forked) msg.from = 1d64b124afe30f29-42000-56252da6 client->identity() = 1d64b124afe30f29-40000-56252da6 [21211] NOTE at dmtcp_coordinator.cpp:1079 in onConnect; REASON='worker connected' hello_remote.from = 1d64b124afe30f29-40000-56252da6 [21211] NOTE at dmtcp_coordinator.cpp:858 in onData; REASON='Updating process Information after fork()' client->hostname() = acme11.ciemat.es client->progname() = mpiexec.hydra_(forked) msg.from = 1d64b124afe30f29-43000-56252da6 client->identity() = 1d64b124afe30f29-40000-56252da6 [21211] NOTE at dmtcp_coordinator.cpp:1079 in onConnect; REASON='worker connected' hello_remote.from = 1d64b124afe30f29-41000-56252da6 [21211] NOTE at dmtcp_coordinator.cpp:858 in onData; REASON='Updating 
process Information after fork()' client->hostname() = acme11.ciemat.es client->progname() = hydra_pmi_proxy_(forked) msg.from = 1d64b124afe30f29-44000-56252da6 client->identity() = 1d64b124afe30f29-41000-56252da6 [21211] NOTE at dmtcp_coordinator.cpp:1079 in onConnect; REASON='worker connected' hello_remote.from = 1d64b124afe30f29-42000-56252da6 [21211] NOTE at dmtcp_coordinator.cpp:858 in onData; REASON='Updating process Information after fork()' client->hostname() = acme11.ciemat.es client->progname() = dmtcp_ssh_(forked) msg.from = 1d64b124afe30f29-45000-56252da6 client->identity() = 1d64b124afe30f29-42000-56252da6 [21211] NOTE at dmtcp_coordinator.cpp:1079 in onConnect; REASON='worker connected' hello_remote.from = 1d64b124afe30f29-43000-56252da6 [21211] NOTE at dmtcp_coordinator.cpp:858 in onData; REASON='Updating process Information after fork()' client->hostname() = acme11.ciemat.es client->progname() = dmtcp_ssh_(forked) msg.from = 1d64b124afe30f29-46000-56252da6 client->identity() = 1d64b124afe30f29-43000-56252da6 [21211] NOTE at dmtcp_coordinator.cpp:917 in onDisconnect; REASON='client disconnected' client->identity() = 1d64b124afe30f29-45000-56252da6 [21211] NOTE at dmtcp_coordinator.cpp:917 in onDisconnect; REASON='client disconnected' client->identity() = 1d64b124afe30f29-46000-56252da6 [21211] NOTE at dmtcp_coordinator.cpp:867 in onData; REASON='Updating process Information after exec()' progname = hydra_pmi_proxy msg.from = 1d64b124afe30f29-41000-56252da6 client->identity() = 1d64b124afe30f29-41000-56252da6 [21211] NOTE at dmtcp_coordinator.cpp:867 in onData; REASON='Updating process Information after exec()' progname = dmtcp_ssh msg.from = 1d64b124afe30f29-42000-56252da6 client->identity() = 1d64b124afe30f29-42000-56252da6 [21211] NOTE at dmtcp_coordinator.cpp:867 in onData; REASON='Updating process Information after exec()' progname = dmtcp_ssh msg.from = 1d64b124afe30f29-43000-56252da6 client->identity() = 1d64b124afe30f29-43000-56252da6 [21211] 
NOTE at dmtcp_coordinator.cpp:867 in onData; REASON='Updating process Information after exec()' progname = helloWorldMPI msg.from = 1d64b124afe30f29-44000-56252da6 client->identity() = 1d64b124afe30f29-44000-56252da6 [21211] NOTE at dmtcp_coordinator.cpp:1079 in onConnect; REASON='worker connected' hello_remote.from = 1b69d09fb3238b30-14428-56252da7 [21211] NOTE at dmtcp_coordinator.cpp:1079 in onConnect; REASON='worker connected' hello_remote.from = 54385264162a2589-10066-56252da7 [21211] NOTE at dmtcp_coordinator.cpp:867 in onData; REASON='Updating process Information after exec()' progname = dmtcp_sshd msg.from = 1b69d09fb3238b30-47000-56252da7 client->identity() = 1b69d09fb3238b30-14428-56252da7 [21211] NOTE at dmtcp_coordinator.cpp:867 in onData; REASON='Updating process Information after exec()' progname = dmtcp_sshd msg.from = 54385264162a2589-48000-56252da7 client->identity() = 54385264162a2589-10066-56252da7 [21211] NOTE at dmtcp_coordinator.cpp:1079 in onConnect; REASON='worker connected' hello_remote.from = 1b69d09fb3238b30-47000-56252da7 [21211] NOTE at dmtcp_coordinator.cpp:1079 in onConnect; REASON='worker connected' hello_remote.from = 54385264162a2589-48000-56252da7 [21211] NOTE at dmtcp_coordinator.cpp:858 in onData; REASON='Updating process Information after fork()' client->hostname() = acme13.ciemat.es client->progname() = dmtcp_sshd_(forked) msg.from = 54385264162a2589-50000-56252da7 client->identity() = 54385264162a2589-48000-56252da7 [21211] NOTE at dmtcp_coordinator.cpp:858 in onData; REASON='Updating process Information after fork()' client->hostname() = acme12.ciemat.es client->progname() = dmtcp_sshd_(forked) msg.from = 1b69d09fb3238b30-49000-56252da7 client->identity() = 1b69d09fb3238b30-47000-56252da7 [21211] NOTE at dmtcp_coordinator.cpp:1079 in onConnect; REASON='worker connected' hello_remote.from = 54385264162a2589-50000-56252da7 [21211] NOTE at dmtcp_coordinator.cpp:1079 in onConnect; REASON='worker connected' hello_remote.from = 
1b69d09fb3238b30-49000-56252da7 [21211] NOTE at dmtcp_coordinator.cpp:858 in onData; REASON='Updating process Information after fork()' client->hostname() = acme13.ciemat.es client->progname() = hydra_pmi_proxy_(forked) msg.from = 54385264162a2589-51000-56252da7 client->identity() = 54385264162a2589-50000-56252da7 [21211] NOTE at dmtcp_coordinator.cpp:858 in onData; REASON='Updating process Information after fork()' client->hostname() = acme12.ciemat.es client->progname() = hydra_pmi_proxy_(forked) msg.from = 1b69d09fb3238b30-52000-56252da7 client->identity() = 1b69d09fb3238b30-49000-56252da7 [21211] NOTE at dmtcp_coordinator.cpp:867 in onData; REASON='Updating process Information after exec()' progname = hydra_pmi_proxy msg.from = 1b69d09fb3238b30-49000-56252da7 client->identity() = 1b69d09fb3238b30-49000-56252da7 [21211] NOTE at dmtcp_coordinator.cpp:867 in onData; REASON='Updating process Information after exec()' progname = hydra_pmi_proxy msg.from = 54385264162a2589-50000-56252da7 client->identity() = 54385264162a2589-50000-56252da7 [21211] NOTE at dmtcp_coordinator.cpp:867 in onData; REASON='Updating process Information after exec()' progname = helloWorldMPI msg.from = 54385264162a2589-51000-56252da7 client->identity() = 54385264162a2589-51000-56252da7 [21211] NOTE at dmtcp_coordinator.cpp:867 in onData; REASON='Updating process Information after exec()' progname = helloWorldMPI msg.from = 1b69d09fb3238b30-52000-56252da7 client->identity() = 1b69d09fb3238b30-52000-56252da7 [21211] NOTE at dmtcp_coordinator.cpp:917 in onDisconnect; REASON='client disconnected' client->identity() = 1b69d09fb3238b30-52000-56252da7 [21211] NOTE at dmtcp_coordinator.cpp:917 in onDisconnect; REASON='client disconnected' client->identity() = 54385264162a2589-51000-56252da7 [21211] NOTE at dmtcp_coordinator.cpp:917 in onDisconnect; REASON='client disconnected' client->identity() = 1d64b124afe30f29-44000-56252da6 [21211] NOTE at dmtcp_coordinator.cpp:917 in onDisconnect; 
REASON='client disconnected' client->identity() = 1b69d09fb3238b30-49000-56252da7
[21211] NOTE at dmtcp_coordinator.cpp:917 in onDisconnect; REASON='client disconnected' client->identity() = 54385264162a2589-50000-56252da7
[21211] NOTE at dmtcp_coordinator.cpp:917 in onDisconnect; REASON='client disconnected' client->identity() = 1d64b124afe30f29-41000-56252da6
[21211] NOTE at dmtcp_coordinator.cpp:917 in onDisconnect; REASON='client disconnected' client->identity() = 1b69d09fb3238b30-47000-56252da7
[21211] NOTE at dmtcp_coordinator.cpp:917 in onDisconnect; REASON='client disconnected' client->identity() = 54385264162a2589-48000-56252da7
[21211] NOTE at dmtcp_coordinator.cpp:917 in onDisconnect; REASON='client disconnected' client->identity() = 1d64b124afe30f29-42000-56252da6
[21211] NOTE at dmtcp_coordinator.cpp:917 in onDisconnect; REASON='client disconnected' client->identity() = 1d64b124afe30f29-43000-56252da6
[21211] NOTE at dmtcp_coordinator.cpp:917 in onDisconnect; REASON='client disconnected' client->identity() = 1d64b124afe30f29-40000-56252da6

2015-10-19 8:26 GMT-07:00 Jiajun Cao <jia...@ccs.neu.edu>:

> Hi Manuel,
>
> The infiniband plugin shouldn't affect application launching. Could you
> try removing the "--ib" flag and see if the application still crashes? This
> can help diagnose whether the issue is in the ib plugin or other dmtcp
> modules.
>
> Best,
> Jiajun
>
> On Sun, Oct 18, 2015 at 10:57 PM, Kapil Arya <ka...@ccs.neu.edu> wrote:
>
>> Hey Jiajun,
>>
>> Can you take a look at this problem as it is closer to your area of
>> expertise :-).
>>
>> Best,
>> Kapil
>>
>> On Sat, Oct 17, 2015 at 11:31 PM, Manuel Rodríguez Pascual <
>> manuel.rodriguez.pasc...@gmail.com> wrote:
>>
>>> Hi all,
>>>
>>> I am trying to checkpoint an MVAPICH application. It does not behave as
>>> expected, so maybe you can give me some support.
>>>
>>> I have compiled DMTCP with "--enable-infiniband-support" as the only flag.
>>> I have MVAPICH installed.
>>>
>>> I can execute a test MPI application in two nodes, without DMTCP. I also
>>> can execute the application in a single node with DMTCP. However, if I
>>> execute it in two nodes with DMTCP, only the first one will run.
>>>
>>> Below there is a series of test commands with a lot of output, together
>>> with the versions of everything.
>>>
>>> Any ideas?
>>>
>>> Thanks for your help,
>>>
>>>
>>> Manuel
>>>
>>>
>>> ---
>>> ---
>>>
>>> # mpichversion
>>>
>>> MVAPICH2 Version: 2.2a
>>>
>>> MVAPICH2 Release date: Mon Aug 17 20:00:00 EDT 2015
>>>
>>> MVAPICH2 Device: ch3:mrail
>>>
>>> MVAPICH2 configure: --disable-mcast
>>>
>>> MVAPICH2 CC: gcc -DNDEBUG -DNVALGRIND -O2
>>>
>>> MVAPICH2 CXX: g++ -DNDEBUG -DNVALGRIND -O2
>>>
>>> MVAPICH2 F77: gfortran -L/lib -L/lib -O2
>>>
>>> MVAPICH2 FC: gfortran -O2
>>>
>>> # dmtcp_coordinator --version
>>>
>>> dmtcp_coordinator (DMTCP) 2.4.1
>>>
>>> ---
>>>
>>> ---
>>>
>>>
>>> I can execute a test MPI application in two nodes (acme11 and 12), with
>>>
>>> ---
>>> ---
>>> # mpirun_rsh -n 2 acme11 acme12 ./helloWorldMPI
>>>
>>> Process 0 of 2 is on acme11.ciemat.es
>>>
>>> Process 1 of 2 is on acme12.ciemat.es
>>>
>>> Hello world from process 0 of 2
>>>
>>> Hello world from process 1 of 2
>>>
>>> Goodbye world from process 0 of 2
>>>
>>> Goodbye world from process 1 of 2
>>> ---
>>> ---
>>>
>>> As you can see, it works correctly.
>>>
>>>
>>> If I try to execute the application with DMTCP, however, it does not.
>>>
>>> I run the coordinator on acme11, with port 7779.
>>>
>>>
>>> I can execute the application on a single node.
For example,
>>>
>>> ---
>>> ---
>>>
>>> # dmtcp_launch -h acme11 -p 7779 --ib mpirun_rsh -n 1 acme12
>>> ./helloWorldMPI
>>>
>>> [41000] NOTE at ssh.cpp:369 in prepareForExec; REASON='New ssh command'
>>>
>>> newCommand = /home/localsoft/dmtcp/bin/dmtcp_ssh
>>> /home/localsoft/dmtcp/bin/dmtcp_nocheckpoint /usr/bin/ssh -q acme12 cd
>>> /home/slurm/tests;/home/localsoft/dmtcp/bin/dmtcp_launch --coord-host
>>> 172.17.29.173 --coord-port 7779 --ckptdir /home/slurm/tests --infiniband
>>> /home/localsoft/dmtcp/bin/dmtcp_sshd /usr/bin/env MPISPAWN_MPIRUN_MPD=0
>>> USE_LINEAR_SSH=1 MPISPAWN_MPIRUN_HOST=acme11.ciemat.es
>>> MPISPAWN_MPIRUN_HOSTIP=172.17.29.173 MPIRUN_RSH_LAUNCH=1
>>> MPISPAWN_CHECKIN_PORT=33687 MPISPAWN_MPIRUN_PORT=33687 MPISPAWN_NNODES=1
>>> MPISPAWN_GLOBAL_NPROCS=1 MPISPAWN_MPIRUN_ID=40000 MPISPAWN_ARGC=1
>>> MPDMAN_KVS_TEMPLATE=kvs_885_acme11.ciemat.es_40000 MPISPAWN_LOCAL_NPROCS=1
>>> MPISPAWN_ARGV_0='./helloWorldMPI' MPISPAWN_ARGC=1
>>> MPISPAWN_GENERIC_ENV_COUNT=0 MPISPAWN_ID=0
>>> MPISPAWN_WORKING_DIR=/home/slurm/tests MPISPAWN_MPIRUN_RANK_0=0
>>> /usr/local/bin/mpispawn 0
>>>
>>> Process 0 of 1 is on acme12.ciemat.es
>>>
>>> Hello world from process 0 of 1
>>>
>>> Goodbye world from process 0 of 1
>>>
>>>
>>> COORDINATOR OUTPUT
>>>
>>>
>>> [3984] NOTE at dmtcp_coordinator.cpp:1079 in onConnect; REASON='worker
>>> connected'
>>>
>>> hello_remote.from = 1d64b124afe30f29-4029-562310a2
>>>
>>> [3984] NOTE at dmtcp_coordinator.cpp:867 in onData; REASON='Updating
>>> process Information after exec()'
>>>
>>> progname = mpirun_rsh
>>>
>>> msg.from = 1d64b124afe30f29-52000-562310a2
>>>
>>> client->identity() = 1d64b124afe30f29-4029-562310a2
>>>
>>> [3984] NOTE at dmtcp_coordinator.cpp:1079 in onConnect; REASON='worker
>>> connected'
>>>
>>> hello_remote.from = 1d64b124afe30f29-52000-562310a2
>>>
>>> [3984] NOTE at dmtcp_coordinator.cpp:858 in onData; REASON='Updating
>>> process Information after fork()'
>>>
>>> client->hostname() = acme11.ciemat.es
>>>
>>> client->progname() = mpirun_rsh_(forked)
>>>
>>> msg.from = 1d64b124afe30f29-53000-562310a2
>>>
>>> client->identity() = 1d64b124afe30f29-52000-562310a2
>>>
>>> [3984] NOTE at dmtcp_coordinator.cpp:1079 in onConnect; REASON='worker
>>> connected'
>>>
>>> hello_remote.from = 1d64b124afe30f29-53000-562310a2
>>>
>>> [3984] NOTE at dmtcp_coordinator.cpp:858 in onData; REASON='Updating
>>> process Information after fork()'
>>>
>>> client->hostname() = acme11.ciemat.es
>>>
>>> client->progname() = dmtcp_ssh_(forked)
>>>
>>> msg.from = 1d64b124afe30f29-54000-562310a2
>>>
>>> client->identity() = 1d64b124afe30f29-53000-562310a2
>>>
>>> [3984] NOTE at dmtcp_coordinator.cpp:917 in onDisconnect; REASON='client
>>> disconnected'
>>>
>>> client->identity() = 1d64b124afe30f29-54000-562310a2
>>>
>>> [3984] NOTE at dmtcp_coordinator.cpp:867 in onData; REASON='Updating
>>> process Information after exec()'
>>>
>>> progname = dmtcp_ssh
>>>
>>> msg.from = 1d64b124afe30f29-53000-562310a2
>>>
>>> client->identity() = 1d64b124afe30f29-53000-562310a2
>>>
>>> [3984] NOTE at dmtcp_coordinator.cpp:1079 in onConnect; REASON='worker
>>> connected'
>>>
>>> hello_remote.from = 1b69d09fb3238b30-23945-562310a2
>>>
>>> [3984] NOTE at dmtcp_coordinator.cpp:867 in onData; REASON='Updating
>>> process Information after exec()'
>>>
>>> progname = dmtcp_sshd
>>>
>>> msg.from = 1b69d09fb3238b30-55000-562310a2
>>>
>>> client->identity() = 1b69d09fb3238b30-23945-562310a2
>>>
>>> [3984] NOTE at dmtcp_coordinator.cpp:1079 in onConnect; REASON='worker
>>> connected'
>>>
>>> hello_remote.from = 1b69d09fb3238b30-55000-562310a2
>>>
>>> [3984] NOTE at dmtcp_coordinator.cpp:858 in onData; REASON='Updating
>>> process Information after fork()'
>>>
>>> client->hostname() = acme12.ciemat.es
>>>
>>> client->progname() = dmtcp_sshd_(forked)
>>>
>>> msg.from = 1b69d09fb3238b30-56000-562310a2
>>>
>>> client->identity() = 1b69d09fb3238b30-55000-562310a2
>>>
>>> [3984] NOTE at dmtcp_coordinator.cpp:1079 in onConnect; REASON='worker
>>> connected'
>>>
>>> hello_remote.from = 1b69d09fb3238b30-56000-562310a2
>>>
>>> [3984] NOTE at dmtcp_coordinator.cpp:858 in onData; REASON='Updating
>>> process Information after fork()'
>>>
>>> client->hostname() = acme12.ciemat.es
>>>
>>> client->progname() = mpispawn_(forked)
>>>
>>> msg.from = 1b69d09fb3238b30-57000-562310a2
>>>
>>> client->identity() = 1b69d09fb3238b30-56000-562310a2
>>>
>>> [3984] NOTE at dmtcp_coordinator.cpp:867 in onData; REASON='Updating
>>> process Information after exec()'
>>>
>>> progname = env
>>>
>>> msg.from = 1b69d09fb3238b30-56000-562310a2
>>>
>>> client->identity() = 1b69d09fb3238b30-56000-562310a2
>>>
>>> [3984] NOTE at dmtcp_coordinator.cpp:867 in onData; REASON='Updating
>>> process Information after exec()'
>>>
>>> progname = mpispawn
>>>
>>> msg.from = 1b69d09fb3238b30-56000-562310a2
>>>
>>> client->identity() = 1b69d09fb3238b30-56000-562310a2
>>>
>>> [3984] NOTE at dmtcp_coordinator.cpp:867 in onData; REASON='Updating
>>> process Information after exec()'
>>>
>>> progname = helloWorldMPI
>>>
>>> msg.from = 1b69d09fb3238b30-57000-562310a2
>>>
>>> client->identity() = 1b69d09fb3238b30-57000-562310a2
>>>
>>> [3984] NOTE at dmtcp_coordinator.cpp:917 in onDisconnect; REASON='client
>>> disconnected'
>>>
>>> client->identity() = 1b69d09fb3238b30-57000-562310a2
>>>
>>> [3984] NOTE at dmtcp_coordinator.cpp:917 in onDisconnect; REASON='client
>>> disconnected'
>>>
>>> client->identity() = 1b69d09fb3238b30-56000-562310a2
>>>
>>> [3984] NOTE at dmtcp_coordinator.cpp:917 in onDisconnect; REASON='client
>>> disconnected'
>>>
>>> client->identity() = 1b69d09fb3238b30-55000-562310a2
>>>
>>> [3984] NOTE at dmtcp_coordinator.cpp:917 in onDisconnect; REASON='client
>>> disconnected'
>>>
>>> client->identity() = 1d64b124afe30f29-53000-562310a2
>>>
>>> [3984] NOTE at dmtcp_coordinator.cpp:917 in onDisconnect; REASON='client
>>> disconnected'
>>>
>>> client->identity() = 1d64b124afe30f29-52000-562310a2
>>>
>>>
>>> ---
>>> ---
>>> So we see that it is working correctly, connecting and so on. However,
>>> if I run the application on more than one core, as in the first example,
>>> it crashes. What happens is that the first node on the node list
>>> executes the application, and the rest do not.
>>>
>>> ----
>>> ----
>>>
>>> [root@acme11 tests]# dmtcp_launch -h acme11 -p 7779 --ib mpirun_rsh -n 2 acme11 acme12 ./helloWorldMPI
>>>
>>> [59000] NOTE at ssh.cpp:369 in prepareForExec; REASON='New ssh command'
>>> newCommand = /home/localsoft/dmtcp/bin/dmtcp_ssh /home/localsoft/dmtcp/bin/dmtcp_nocheckpoint /usr/bin/ssh -q acme11 cd /home/slurm/tests;/home/localsoft/dmtcp/bin/dmtcp_launch --coord-host 172.17.29.173 --coord-port 7779 --ckptdir /home/slurm/tests --infiniband /home/localsoft/dmtcp/bin/dmtcp_sshd /usr/bin/env MPISPAWN_MPIRUN_MPD=0 USE_LINEAR_SSH=1 MPISPAWN_MPIRUN_HOST=acme11.ciemat.es MPISPAWN_MPIRUN_HOSTIP=172.17.29.173 MPIRUN_RSH_LAUNCH=1 MPISPAWN_CHECKIN_PORT=34203 MPISPAWN_MPIRUN_PORT=34203 MPISPAWN_NNODES=2 MPISPAWN_GLOBAL_NPROCS=2 MPISPAWN_MPIRUN_ID=58000 MPISPAWN_ARGC=1 MPDMAN_KVS_TEMPLATE=kvs_481_acme11.ciemat.es_58000 MPISPAWN_LOCAL_NPROCS=1 MPISPAWN_ARGV_0='./helloWorldMPI' MPISPAWN_ARGC=1 MPISPAWN_GENERIC_ENV_COUNT=0 MPISPAWN_ID=0 MPISPAWN_WORKING_DIR=/home/slurm/tests MPISPAWN_MPIRUN_RANK_0=0 /usr/local/bin/mpispawn 0
>>> [60000] NOTE at ssh.cpp:369 in prepareForExec; REASON='New ssh command'
>>> newCommand = /home/localsoft/dmtcp/bin/dmtcp_ssh /home/localsoft/dmtcp/bin/dmtcp_nocheckpoint /usr/bin/ssh -q acme12 cd /home/slurm/tests;/home/localsoft/dmtcp/bin/dmtcp_launch --coord-host 172.17.29.173 --coord-port 7779 --ckptdir /home/slurm/tests --infiniband /home/localsoft/dmtcp/bin/dmtcp_sshd /usr/bin/env MPISPAWN_MPIRUN_MPD=0 USE_LINEAR_SSH=1 MPISPAWN_MPIRUN_HOST=acme11.ciemat.es MPISPAWN_MPIRUN_HOSTIP=172.17.29.173 MPIRUN_RSH_LAUNCH=1 MPISPAWN_CHECKIN_PORT=34203 MPISPAWN_MPIRUN_PORT=34203 MPISPAWN_NNODES=2 MPISPAWN_GLOBAL_NPROCS=2 MPISPAWN_MPIRUN_ID=58000 MPISPAWN_ARGC=1 MPDMAN_KVS_TEMPLATE=kvs_481_acme11.ciemat.es_58000 MPISPAWN_LOCAL_NPROCS=1 MPISPAWN_ARGV_0='./helloWorldMPI' MPISPAWN_ARGC=1 MPISPAWN_GENERIC_ENV_COUNT=0 MPISPAWN_ID=1 MPISPAWN_WORKING_DIR=/home/slurm/tests MPISPAWN_MPIRUN_RANK_0=1 /usr/local/bin/mpispawn 0
>>>
>>> Process 0 of 2 is on acme11.ciemat.es
>>> Hello world from process 0 of 2
>>> Goodbye world from process 0 of 2
>>>
>>> COORDINATOR OUTPUT
>>>
>>> [3984] NOTE at dmtcp_coordinator.cpp:1079 in onConnect; REASON='worker connected'
>>> hello_remote.from = 1d64b124afe30f29-4070-56231173
>>> [3984] NOTE at dmtcp_coordinator.cpp:867 in onData; REASON='Updating process Information after exec()'
>>> progname = mpirun_rsh
>>> msg.from = 1d64b124afe30f29-58000-56231173
>>> client->identity() = 1d64b124afe30f29-4070-56231173
>>> [3984] NOTE at dmtcp_coordinator.cpp:1079 in onConnect; REASON='worker connected'
>>> hello_remote.from = 1d64b124afe30f29-58000-56231173
>>> [3984] NOTE at dmtcp_coordinator.cpp:1079 in onConnect; REASON='worker connected'
>>> hello_remote.from = 1d64b124afe30f29-58000-56231173
>>> [3984] NOTE at dmtcp_coordinator.cpp:858 in onData; REASON='Updating process Information after fork()'
>>> client->hostname() = acme11.ciemat.es
>>> client->progname() = mpirun_rsh_(forked)
>>> msg.from = 1d64b124afe30f29-59000-56231173
>>> client->identity() = 1d64b124afe30f29-58000-56231173
>>> [3984] NOTE at dmtcp_coordinator.cpp:858 in onData; REASON='Updating process Information after fork()'
>>> client->hostname() = acme11.ciemat.es
>>> client->progname() = mpirun_rsh_(forked)
>>> msg.from = 1d64b124afe30f29-60000-56231173
>>> client->identity() = 1d64b124afe30f29-58000-56231173
>>> [3984] NOTE at dmtcp_coordinator.cpp:1079 in onConnect; REASON='worker connected'
>>> hello_remote.from = 1d64b124afe30f29-59000-56231173
>>> [3984] NOTE at dmtcp_coordinator.cpp:1079 in onConnect; REASON='worker connected'
>>> hello_remote.from = 1d64b124afe30f29-60000-56231173
>>> [3984] NOTE at dmtcp_coordinator.cpp:858 in onData; REASON='Updating process Information after fork()'
>>> client->hostname() = acme11.ciemat.es
>>> client->progname() = dmtcp_ssh_(forked)
>>> msg.from = 1d64b124afe30f29-61000-56231173
>>> client->identity() = 1d64b124afe30f29-59000-56231173
>>> [3984] NOTE at dmtcp_coordinator.cpp:858 in onData; REASON='Updating process Information after fork()'
>>> client->hostname() = acme11.ciemat.es
>>> client->progname() = dmtcp_ssh_(forked)
>>> msg.from = 1d64b124afe30f29-62000-56231173
>>> client->identity() = 1d64b124afe30f29-60000-56231173
>>> [3984] NOTE at dmtcp_coordinator.cpp:917 in onDisconnect; REASON='client disconnected'
>>> client->identity() = 1d64b124afe30f29-61000-56231173
>>> [3984] NOTE at dmtcp_coordinator.cpp:917 in onDisconnect; REASON='client disconnected'
>>> client->identity() = 1d64b124afe30f29-62000-56231173
>>> [3984] NOTE at dmtcp_coordinator.cpp:867 in onData; REASON='Updating process Information after exec()'
>>> progname = dmtcp_ssh
>>> msg.from = 1d64b124afe30f29-59000-56231173
>>> client->identity() = 1d64b124afe30f29-59000-56231173
>>> [3984] NOTE at dmtcp_coordinator.cpp:867 in onData; REASON='Updating process Information after exec()'
>>> progname = dmtcp_ssh
>>> msg.from = 1d64b124afe30f29-60000-56231173
>>> client->identity() = 1d64b124afe30f29-60000-56231173
>>> [3984] NOTE at dmtcp_coordinator.cpp:1079 in onConnect; REASON='worker connected'
>>> hello_remote.from = 1b69d09fb3238b30-24001-56231173
>>> [3984] NOTE at dmtcp_coordinator.cpp:1079 in onConnect; REASON='worker connected'
>>> hello_remote.from = 1d64b124afe30f29-4094-56231173
>>> [3984] NOTE at dmtcp_coordinator.cpp:867 in onData; REASON='Updating process Information after exec()'
>>> progname = dmtcp_sshd
>>> msg.from = 1d64b124afe30f29-64000-56231173
>>> client->identity() = 1d64b124afe30f29-4094-56231173
>>> [3984] NOTE at dmtcp_coordinator.cpp:867 in onData; REASON='Updating process Information after exec()'
>>> progname = dmtcp_sshd
>>> msg.from = 1b69d09fb3238b30-63000-56231173
>>> client->identity() = 1b69d09fb3238b30-24001-56231173
>>> [3984] NOTE at dmtcp_coordinator.cpp:1079 in onConnect; REASON='worker connected'
>>> hello_remote.from = 1d64b124afe30f29-64000-56231173
>>> [3984] NOTE at dmtcp_coordinator.cpp:1079 in onConnect; REASON='worker connected'
>>> hello_remote.from = 1b69d09fb3238b30-63000-56231173
>>> [3984] NOTE at dmtcp_coordinator.cpp:858 in onData; REASON='Updating process Information after fork()'
>>> client->hostname() = acme11.ciemat.es
>>> client->progname() = dmtcp_sshd_(forked)
>>> msg.from = 1d64b124afe30f29-65000-56231173
>>> client->identity() = 1d64b124afe30f29-64000-56231173
>>> [3984] NOTE at dmtcp_coordinator.cpp:858 in onData; REASON='Updating process Information after fork()'
>>> client->hostname() = acme12.ciemat.es
>>> client->progname() = dmtcp_sshd_(forked)
>>> msg.from = 1b69d09fb3238b30-66000-56231173
>>> client->identity() = 1b69d09fb3238b30-63000-56231173
>>> [3984] NOTE at dmtcp_coordinator.cpp:867 in onData; REASON='Updating process Information after exec()'
>>> progname = env
>>> msg.from = 1d64b124afe30f29-65000-56231173
>>> client->identity() = 1d64b124afe30f29-65000-56231173
>>> [3984] NOTE at dmtcp_coordinator.cpp:867 in onData; REASON='Updating process Information after exec()'
>>> progname = mpispawn
>>> msg.from = 1d64b124afe30f29-65000-56231173
>>> client->identity() = 1d64b124afe30f29-65000-56231173
>>> [3984] NOTE at dmtcp_coordinator.cpp:1079 in onConnect; REASON='worker connected'
>>> hello_remote.from = 1b69d09fb3238b30-66000-56231173
>>> [3984] NOTE at dmtcp_coordinator.cpp:1079 in onConnect; REASON='worker connected'
>>> hello_remote.from = 1d64b124afe30f29-65000-56231173
>>> [3984] NOTE at dmtcp_coordinator.cpp:858 in onData; REASON='Updating process Information after fork()'
>>> client->hostname() = acme11.ciemat.es
>>> client->progname() = mpispawn_(forked)
>>> msg.from = 1d64b124afe30f29-68000-56231173
>>> client->identity() = 1d64b124afe30f29-65000-56231173
>>> [3984] NOTE at dmtcp_coordinator.cpp:858 in onData; REASON='Updating process Information after fork()'
>>> client->hostname() = acme12.ciemat.es
>>> client->progname() = mpispawn_(forked)
>>> msg.from = 1b69d09fb3238b30-67000-56231173
>>> client->identity() = 1b69d09fb3238b30-66000-56231173
>>> [3984] NOTE at dmtcp_coordinator.cpp:867 in onData; REASON='Updating process Information after exec()'
>>> progname = env
>>> msg.from = 1b69d09fb3238b30-66000-56231173
>>> client->identity() = 1b69d09fb3238b30-66000-56231173
>>> [3984] NOTE at dmtcp_coordinator.cpp:867 in onData; REASON='Updating process Information after exec()'
>>> progname = mpispawn
>>> msg.from = 1b69d09fb3238b30-66000-56231173
>>> client->identity() = 1b69d09fb3238b30-66000-56231173
>>> [3984] NOTE at dmtcp_coordinator.cpp:867 in onData; REASON='Updating process Information after exec()'
>>> progname = helloWorldMPI
>>> msg.from = 1d64b124afe30f29-68000-56231173
>>> client->identity() = 1d64b124afe30f29-68000-56231173
>>> [3984] NOTE at dmtcp_coordinator.cpp:867 in onData; REASON='Updating process Information after exec()'
>>> progname = helloWorldMPI
>>> msg.from = 1b69d09fb3238b30-67000-56231173
>>> client->identity() = 1b69d09fb3238b30-67000-56231173
>>> [3984] NOTE at dmtcp_coordinator.cpp:917 in onDisconnect; REASON='client disconnected'
>>> client->identity() = 1d64b124afe30f29-68000-56231173
>>> [3984] NOTE at dmtcp_coordinator.cpp:917 in onDisconnect; REASON='client disconnected'
>>> client->identity() = 1b69d09fb3238b30-67000-56231173
>>> [3984] NOTE at dmtcp_coordinator.cpp:917 in onDisconnect; REASON='client disconnected'
>>> client->identity() = 1d64b124afe30f29-65000-56231173
>>> [3984] NOTE at dmtcp_coordinator.cpp:917 in onDisconnect; REASON='client disconnected'
>>> client->identity() = 1b69d09fb3238b30-66000-56231173
>>> [3984] NOTE at dmtcp_coordinator.cpp:917 in onDisconnect; REASON='client disconnected'
>>> client->identity() = 1d64b124afe30f29-64000-56231173
>>> [3984] NOTE at dmtcp_coordinator.cpp:917 in onDisconnect; REASON='client disconnected'
>>> client->identity() = 1b69d09fb3238b30-63000-56231173
>>> [3984] NOTE at dmtcp_coordinator.cpp:917 in onDisconnect; REASON='client disconnected'
>>> client->identity() = 1d64b124afe30f29-59000-56231173
>>> [3984] NOTE at dmtcp_coordinator.cpp:917 in onDisconnect; REASON='client disconnected'
>>> client->identity() = 1d64b124afe30f29-60000-56231173
>>> [3984] NOTE at dmtcp_coordinator.cpp:917 in onDisconnect; REASON='client disconnected'
>>> client->identity() = 1d64b124afe30f29-58000-56231173
>>>
>>> ----
>>> ----

--
Dr. Manuel Rodríguez-Pascual
skype: manuel.rodriguez.pascual
phone: (+34) 913466173 // (+34) 679925108

CIEMAT-Moncloa
Edificio 22, desp. 1.25
Avenida Complutense, 40
28040 - MADRID
SPAIN
------------------------------------------------------------------------------
_______________________________________________
Dmtcp-forum mailing list
Dmtcp-forum@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/dmtcp-forum