Hi Jiajun, all,

I have been performing some more tests.

When running the code with "mpirun_rsh -n 2 acme11 acme12 ./helloWorldMPI"
under DMTCP, there are problems (this is the scenario from the previous
mail). Without the --ib flag the problem still persists. I think the error
is the same, but the output is below anyway.

In case it helps with debugging, some info that might be relevant:

- The application can be executed without DMTCP.
- With DMTCP, only the first node on the list executes the code. But there
seem to be exceptions to this:
    -- if I execute mpirun_rsh with "-n 2 acme11 acme11" (or any other
node, but the same one twice): it crashes. This does not happen without
DMTCP; in that case it works correctly.
    -- if I use "-n 3 acme11 acme12 acme11" (three entries, repeating one):
it also crashes. It seems that if you list the same node more than once, it
does not work.
    -- the first node in the list is the only one that runs. For example,
if I use "-n 2 acme11 acme12"  then acme11 will execute the code. If I use
"-n 2 acme12 acme11", then acme12 will.  With three nodes it is identical,
just the first one on the list.
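
The host-list cases above can be reproduced systematically. Below is a
minimal sketch, in Python, that only builds and prints the command lines. It
uses just the flags already shown in this thread (-h, -p, --ib, -n). The
helper names (build_command, all_commands) and the HOST_LISTS table are made
up for illustration, and the coordinator is assumed to be running on acme11,
port 7779.

```python
import shlex

# Assumptions taken from this thread: coordinator on acme11:7779,
# test binary ./helloWorldMPI, launcher mpirun_rsh (MVAPICH2).
COORD_HOST = "acme11"
COORD_PORT = "7779"
APP = "./helloWorldMPI"

# Host lists described in the list above (hypothetical table name).
HOST_LISTS = [
    ["acme11", "acme12"],            # only acme11 runs
    ["acme12", "acme11"],            # only acme12 runs
    ["acme11", "acme11"],            # same node twice: crashes
    ["acme11", "acme12", "acme11"],  # one node repeated: crashes
]


def build_command(hosts, use_ib=False):
    """Return the dmtcp_launch + mpirun_rsh argument list for one host list."""
    cmd = ["dmtcp_launch", "-h", COORD_HOST, "-p", COORD_PORT]
    if use_ib:
        cmd.append("--ib")
    cmd += ["mpirun_rsh", "-n", str(len(hosts)), *hosts, APP]
    return cmd


def all_commands():
    """Every host list, first without and then with --ib."""
    return [build_command(h, ib) for h in HOST_LISTS for ib in (False, True)]


if __name__ == "__main__":
    for cmd in all_commands():
        # Print only; nothing is executed by this sketch.
        print(shlex.join(cmd))
```

Each printed line can then be run by hand to see which host lists fail with
and without --ib.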



However, I have seen that if I execute the application with another MPI
library, MPICH,

"mpiexec -n 2 acme11 acme12  ./helloWorldMPI"

everything works as expected. I can use DMTCP with

"dmtcp_launch  -h acme11 -p 7779  mpiexec -n 2 acme11 acme12
./helloWorldMPI"

and it succeeds. In this case, it works both with "--ib" and without it.
Just in case it helps, output is below too.
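
Since both coordinator logs are long, one way to compare them is to tally,
per program name, which hosts reported an exec(). The sketch below assumes
the log format shown in the output that follows. It also assumes that the
first field of "msg.from" (e.g. 1d64b124afe30f29) identifies the host, which
is what the fork() records with client->hostname() suggest. The function
name execs_by_program is hypothetical, not part of DMTCP.

```python
import re
from collections import defaultdict

# Matches "     progname = helloWorldMPI" but not "client->progname() = ...".
PROGNAME = re.compile(r"^\s*progname = (\S+)")
# Matches "     msg.from = <hostid>-<pid>-<timestamp>"; captures <hostid>.
MSG_FROM = re.compile(r"^\s*msg\.from = ([0-9a-f]+)-\d+-[0-9a-f]+")

# Two exec() records copied from the mpirun_rsh coordinator output.
SAMPLE = """\
[28245] NOTE at dmtcp_coordinator.cpp:867 in onData; REASON='Updating
process Information after exec()'

     progname = helloWorldMPI

     msg.from = 1d64b124afe30f29-135000-56250e08

     client->identity() = 1d64b124afe30f29-135000-56250e08

[28245] NOTE at dmtcp_coordinator.cpp:867 in onData; REASON='Updating
process Information after exec()'

     progname = helloWorldMPI

     msg.from = 1b69d09fb3238b30-136000-56250e08

     client->identity() = 1b69d09fb3238b30-136000-56250e08
"""


def execs_by_program(log_text):
    """Map each exec()'d program name to the set of host-id prefixes seen."""
    result = defaultdict(set)
    pending = None  # program name waiting for its msg.from line
    for line in log_text.splitlines():
        m = PROGNAME.match(line)
        if m:
            pending = m.group(1)
            continue
        m = MSG_FROM.match(line)
        if m and pending is not None:
            result[pending].add(m.group(1))
            pending = None
    return dict(result)


if __name__ == "__main__":
    for prog, hosts in sorted(execs_by_program(SAMPLE).items()):
        print(prog, sorted(hosts))
```

If helloWorldMPI showed up under only one prefix, that would suggest the
remote process was never started. If it shows up under both, the launch
reached both hosts and the problem is probably later in the run.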




Thanks for your help,

Manuel




----
----

-bash-4.2$ dmtcp_launch -h acme11 -p 7779  mpirun_rsh  -n 2  acme11 acme12
./helloWorldMPI

[126000] NOTE at dmtcpworker.cpp:349 in DmtcpWorker; REASON='


*** InfiniBand library detected.  Please use dmtcp_launch --ib ***

'

[127000] NOTE at ssh.cpp:369 in prepareForExec; REASON='New ssh command'

     newCommand = /home/localsoft/dmtcp/bin/dmtcp_ssh
/home/localsoft/dmtcp/bin/dmtcp_nocheckpoint /usr/bin/ssh -q acme11 cd
/home/slurm/tests;/home/localsoft/dmtcp/bin/dmtcp_launch --coord-host
172.17.29.173 --coord-port 7779 --ckptdir /home/slurm/tests
/home/localsoft/dmtcp/bin/dmtcp_sshd /usr/bin/env  MPISPAWN_MPIRUN_MPD=0
USE_LINEAR_SSH=1 MPISPAWN_MPIRUN_HOST=acme11.ciemat.es
MPISPAWN_MPIRUN_HOSTIP=172.17.29.173 MPIRUN_RSH_LAUNCH=1
MPISPAWN_CHECKIN_PORT=59589 MPISPAWN_MPIRUN_PORT=59589 MPISPAWN_NNODES=2
MPISPAWN_GLOBAL_NPROCS=2 MPISPAWN_MPIRUN_ID=126000 MPISPAWN_ARGC=1
MPDMAN_KVS_TEMPLATE=kvs_311_acme11.ciemat.es_126000 MPISPAWN_LOCAL_NPROCS=1
MPISPAWN_ARGV_0='./helloWorldMPI' MPISPAWN_ARGC=1
MPISPAWN_GENERIC_ENV_COUNT=0  MPISPAWN_ID=0
MPISPAWN_WORKING_DIR=/home/slurm/tests MPISPAWN_MPIRUN_RANK_0=0
/usr/local/bin/mpispawn 0

[128000] NOTE at ssh.cpp:369 in prepareForExec; REASON='New ssh command'

     newCommand = /home/localsoft/dmtcp/bin/dmtcp_ssh
/home/localsoft/dmtcp/bin/dmtcp_nocheckpoint /usr/bin/ssh -q acme12 cd
/home/slurm/tests;/home/localsoft/dmtcp/bin/dmtcp_launch --coord-host
172.17.29.173 --coord-port 7779 --ckptdir /home/slurm/tests
/home/localsoft/dmtcp/bin/dmtcp_sshd /usr/bin/env  MPISPAWN_MPIRUN_MPD=0
USE_LINEAR_SSH=1 MPISPAWN_MPIRUN_HOST=acme11.ciemat.es
MPISPAWN_MPIRUN_HOSTIP=172.17.29.173 MPIRUN_RSH_LAUNCH=1
MPISPAWN_CHECKIN_PORT=59589 MPISPAWN_MPIRUN_PORT=59589 MPISPAWN_NNODES=2
MPISPAWN_GLOBAL_NPROCS=2 MPISPAWN_MPIRUN_ID=126000 MPISPAWN_ARGC=1
MPDMAN_KVS_TEMPLATE=kvs_311_acme11.ciemat.es_126000 MPISPAWN_LOCAL_NPROCS=1
MPISPAWN_ARGV_0='./helloWorldMPI' MPISPAWN_ARGC=1
MPISPAWN_GENERIC_ENV_COUNT=0  MPISPAWN_ID=1
MPISPAWN_WORKING_DIR=/home/slurm/tests MPISPAWN_MPIRUN_RANK_0=1
/usr/local/bin/mpispawn 0

Process 0 of 2 is on acme11.ciemat.es

Hello world from process 0 of 2

Goodbye world from process 0 of 2


COORDINATOR OUTPUT

[28245] NOTE at dmtcp_coordinator.cpp:1079 in onConnect; REASON='worker
connected'

     hello_remote.from = 1d64b124afe30f29-28766-56250e08

[28245] NOTE at dmtcp_coordinator.cpp:867 in onData; REASON='Updating
process Information after exec()'

     progname = mpirun_rsh

     msg.from = 1d64b124afe30f29-126000-56250e08

     client->identity() = 1d64b124afe30f29-28766-56250e08

[28245] NOTE at dmtcp_coordinator.cpp:1079 in onConnect; REASON='worker
connected'

     hello_remote.from = 1d64b124afe30f29-126000-56250e08

[28245] NOTE at dmtcp_coordinator.cpp:1079 in onConnect; REASON='worker
connected'

     hello_remote.from = 1d64b124afe30f29-126000-56250e08

[28245] NOTE at dmtcp_coordinator.cpp:858 in onData; REASON='Updating
process Information after fork()'

     client->hostname() = acme11.ciemat.es

     client->progname() = mpirun_rsh_(forked)

     msg.from = 1d64b124afe30f29-127000-56250e08

     client->identity() = 1d64b124afe30f29-126000-56250e08

[28245] NOTE at dmtcp_coordinator.cpp:858 in onData; REASON='Updating
process Information after fork()'

     client->hostname() = acme11.ciemat.es

     client->progname() = mpirun_rsh_(forked)

     msg.from = 1d64b124afe30f29-128000-56250e08

     client->identity() = 1d64b124afe30f29-126000-56250e08

[28245] NOTE at dmtcp_coordinator.cpp:1079 in onConnect; REASON='worker
connected'

     hello_remote.from = 1d64b124afe30f29-127000-56250e08

[28245] NOTE at dmtcp_coordinator.cpp:1079 in onConnect; REASON='worker
connected'

     hello_remote.from = 1d64b124afe30f29-128000-56250e08

[28245] NOTE at dmtcp_coordinator.cpp:858 in onData; REASON='Updating
process Information after fork()'

     client->hostname() = acme11.ciemat.es

     client->progname() = dmtcp_ssh_(forked)

     msg.from = 1d64b124afe30f29-129000-56250e08

     client->identity() = 1d64b124afe30f29-127000-56250e08

[28245] NOTE at dmtcp_coordinator.cpp:858 in onData; REASON='Updating
process Information after fork()'

     client->hostname() = acme11.ciemat.es

     client->progname() = dmtcp_ssh_(forked)

     msg.from = 1d64b124afe30f29-130000-56250e08

     client->identity() = 1d64b124afe30f29-128000-56250e08

[28245] NOTE at dmtcp_coordinator.cpp:917 in onDisconnect; REASON='client
disconnected'

     client->identity() = 1d64b124afe30f29-129000-56250e08

[28245] NOTE at dmtcp_coordinator.cpp:917 in onDisconnect; REASON='client
disconnected'

     client->identity() = 1d64b124afe30f29-130000-56250e08

[28245] NOTE at dmtcp_coordinator.cpp:867 in onData; REASON='Updating
process Information after exec()'

     progname = dmtcp_ssh

     msg.from = 1d64b124afe30f29-127000-56250e08

     client->identity() = 1d64b124afe30f29-127000-56250e08

[28245] NOTE at dmtcp_coordinator.cpp:867 in onData; REASON='Updating
process Information after exec()'

     progname = dmtcp_ssh

     msg.from = 1d64b124afe30f29-128000-56250e08

     client->identity() = 1d64b124afe30f29-128000-56250e08

[28245] NOTE at dmtcp_coordinator.cpp:1079 in onConnect; REASON='worker
connected'

     hello_remote.from = 1d64b124afe30f29-28786-56250e08

[28245] NOTE at dmtcp_coordinator.cpp:1079 in onConnect; REASON='worker
connected'

     hello_remote.from = 1b69d09fb3238b30-12757-56250e08

[28245] NOTE at dmtcp_coordinator.cpp:867 in onData; REASON='Updating
process Information after exec()'

     progname = dmtcp_sshd

     msg.from = 1d64b124afe30f29-131000-56250e08

     client->identity() = 1d64b124afe30f29-28786-56250e08

[28245] NOTE at dmtcp_coordinator.cpp:867 in onData; REASON='Updating
process Information after exec()'

     progname = dmtcp_sshd

     msg.from = 1b69d09fb3238b30-132000-56250e08

     client->identity() = 1b69d09fb3238b30-12757-56250e08

[28245] NOTE at dmtcp_coordinator.cpp:1079 in onConnect; REASON='worker
connected'

     hello_remote.from = 1d64b124afe30f29-131000-56250e08

[28245] NOTE at dmtcp_coordinator.cpp:1079 in onConnect; REASON='worker
connected'

     hello_remote.from = 1b69d09fb3238b30-132000-56250e08

[28245] NOTE at dmtcp_coordinator.cpp:858 in onData; REASON='Updating
process Information after fork()'

     client->hostname() = acme11.ciemat.es

     client->progname() = dmtcp_sshd_(forked)

     msg.from = 1d64b124afe30f29-133000-56250e08

     client->identity() = 1d64b124afe30f29-131000-56250e08

[28245] NOTE at dmtcp_coordinator.cpp:858 in onData; REASON='Updating
process Information after fork()'

     client->hostname() = acme12.ciemat.es

     client->progname() = dmtcp_sshd_(forked)

     msg.from = 1b69d09fb3238b30-134000-56250e08

     client->identity() = 1b69d09fb3238b30-132000-56250e08

[28245] NOTE at dmtcp_coordinator.cpp:1079 in onConnect; REASON='worker
connected'

     hello_remote.from = 1d64b124afe30f29-133000-56250e08

[28245] NOTE at dmtcp_coordinator.cpp:1079 in onConnect; REASON='worker
connected'

     hello_remote.from = 1b69d09fb3238b30-134000-56250e08

[28245] NOTE at dmtcp_coordinator.cpp:858 in onData; REASON='Updating
process Information after fork()'

     client->hostname() = acme11.ciemat.es

     client->progname() = mpispawn_(forked)

     msg.from = 1d64b124afe30f29-135000-56250e08

     client->identity() = 1d64b124afe30f29-133000-56250e08

[28245] NOTE at dmtcp_coordinator.cpp:858 in onData; REASON='Updating
process Information after fork()'

     client->hostname() = acme12.ciemat.es

     client->progname() = mpispawn_(forked)

     msg.from = 1b69d09fb3238b30-136000-56250e08

     client->identity() = 1b69d09fb3238b30-134000-56250e08

[28245] NOTE at dmtcp_coordinator.cpp:867 in onData; REASON='Updating
process Information after exec()'

     progname = env

     msg.from = 1d64b124afe30f29-133000-56250e08

     client->identity() = 1d64b124afe30f29-133000-56250e08

[28245] NOTE at dmtcp_coordinator.cpp:867 in onData; REASON='Updating
process Information after exec()'

     progname = mpispawn

     msg.from = 1d64b124afe30f29-133000-56250e08

     client->identity() = 1d64b124afe30f29-133000-56250e08

[28245] NOTE at dmtcp_coordinator.cpp:867 in onData; REASON='Updating
process Information after exec()'

     progname = env

     msg.from = 1b69d09fb3238b30-134000-56250e08

     client->identity() = 1b69d09fb3238b30-134000-56250e08

[28245] NOTE at dmtcp_coordinator.cpp:867 in onData; REASON='Updating
process Information after exec()'

     progname = mpispawn

     msg.from = 1b69d09fb3238b30-134000-56250e08

     client->identity() = 1b69d09fb3238b30-134000-56250e08

[28245] NOTE at dmtcp_coordinator.cpp:867 in onData; REASON='Updating
process Information after exec()'

     progname = helloWorldMPI

     msg.from = 1d64b124afe30f29-135000-56250e08

     client->identity() = 1d64b124afe30f29-135000-56250e08

[28245] NOTE at dmtcp_coordinator.cpp:867 in onData; REASON='Updating
process Information after exec()'

     progname = helloWorldMPI

     msg.from = 1b69d09fb3238b30-136000-56250e08

     client->identity() = 1b69d09fb3238b30-136000-56250e08

[28245] NOTE at dmtcp_coordinator.cpp:917 in onDisconnect; REASON='client
disconnected'

     client->identity() = 1d64b124afe30f29-135000-56250e08

[28245] NOTE at dmtcp_coordinator.cpp:917 in onDisconnect; REASON='client
disconnected'

     client->identity() = 1b69d09fb3238b30-136000-56250e08

[28245] NOTE at dmtcp_coordinator.cpp:917 in onDisconnect; REASON='client
disconnected'

     client->identity() = 1d64b124afe30f29-133000-56250e08

[28245] NOTE at dmtcp_coordinator.cpp:917 in onDisconnect; REASON='client
disconnected'

     client->identity() = 1d64b124afe30f29-131000-56250e08

[28245] NOTE at dmtcp_coordinator.cpp:917 in onDisconnect; REASON='client
disconnected'

     client->identity() = 1b69d09fb3238b30-134000-56250e08

[28245] NOTE at dmtcp_coordinator.cpp:917 in onDisconnect; REASON='client
disconnected'

     client->identity() = 1b69d09fb3238b30-132000-56250e08

[28245] NOTE at dmtcp_coordinator.cpp:917 in onDisconnect; REASON='client
disconnected'

     client->identity() = 1d64b124afe30f29-127000-56250e08

[28245] NOTE at dmtcp_coordinator.cpp:917 in onDisconnect; REASON='client
disconnected'

     client->identity() = 1d64b124afe30f29-128000-56250e08

[28245] NOTE at dmtcp_coordinator.cpp:917 in onDisconnect; REASON='client
disconnected'

     client->identity() = 1d64b124afe30f29-126000-56250e08



-----

-----

WITH MPIEXEC

[root@acme11 tests]# dmtcp_launch --ib mpiexec -f machinefile -n 3
/home/slurm/tests/helloWorldMPI

[42000] NOTE at ssh.cpp:369 in prepareForExec; REASON='New ssh command'

     newCommand = /home/localsoft/dmtcp/bin/dmtcp_ssh
/home/localsoft/dmtcp/bin/dmtcp_nocheckpoint /usr/bin/ssh -x acme12
/home/localsoft/dmtcp/bin/dmtcp_launch --coord-host 172.17.31.157
--coord-port 7779 --ckptdir /home/slurm/tests --infiniband
/home/localsoft/dmtcp/bin/dmtcp_sshd
"/home/localsoft/mpich3/bin//hydra_pmi_proxy" --control-port acme11:44279
--rmk user --launcher ssh --demux poll --pgid 0 --retries 10 --usize -2
--proxy-id 1

[43000] NOTE at ssh.cpp:369 in prepareForExec; REASON='New ssh command'

     newCommand = /home/localsoft/dmtcp/bin/dmtcp_ssh
/home/localsoft/dmtcp/bin/dmtcp_nocheckpoint /usr/bin/ssh -x acme13
/home/localsoft/dmtcp/bin/dmtcp_launch --coord-host 172.17.31.157
--coord-port 7779 --ckptdir /home/slurm/tests --infiniband
/home/localsoft/dmtcp/bin/dmtcp_sshd
"/home/localsoft/mpich3/bin//hydra_pmi_proxy" --control-port acme11:44279
--rmk user --launcher ssh --demux poll --pgid 0 --retries 10 --usize -2
--proxy-id 2

Process 0 of 3 is on acme11.ciemat.es

Hello world from process 0 of 3

this is iteration 0 on process 0 of host acme11.ciemat.es

Process 2 of 3 is on acme13.ciemat.es

Hello world from process 2 of 3

Process 1 of 3 is on acme12.ciemat.es

Hello world from process 1 of 3

Goodbye world from process 1 of 3

Goodbye world from process 2 of 3

Goodbye world from process 0 of 3


COORDINATOR


[root@acme11 ~]# dmtcp_coordinator

dmtcp_coordinator starting...

    Host: acme11.ciemat.es (172.17.31.157)

    Port: 7779

    Checkpoint Interval: disabled (checkpoint manually instead)

    Exit on last client: 0

Type '?' for help.


[21211] NOTE at dmtcp_coordinator.cpp:1661 in updateCheckpointInterval;
REASON='CheckpointInterval updated (for this computation only)'

     oldInterval = 0

     theCheckpointInterval = 0

[21211] NOTE at dmtcp_coordinator.cpp:1079 in onConnect; REASON='worker
connected'

     hello_remote.from = 1d64b124afe30f29-21212-56252da6

[21211] NOTE at dmtcp_coordinator.cpp:867 in onData; REASON='Updating
process Information after exec()'

     progname = mpiexec.hydra

     msg.from = 1d64b124afe30f29-40000-56252da6

     client->identity() = 1d64b124afe30f29-21212-56252da6

[21211] NOTE at dmtcp_coordinator.cpp:1079 in onConnect; REASON='worker
connected'

     hello_remote.from = 1d64b124afe30f29-40000-56252da6

[21211] NOTE at dmtcp_coordinator.cpp:858 in onData; REASON='Updating
process Information after fork()'

     client->hostname() = acme11.ciemat.es

     client->progname() = mpiexec.hydra_(forked)

     msg.from = 1d64b124afe30f29-41000-56252da6

     client->identity() = 1d64b124afe30f29-40000-56252da6

[21211] NOTE at dmtcp_coordinator.cpp:1079 in onConnect; REASON='worker
connected'

     hello_remote.from = 1d64b124afe30f29-40000-56252da6

[21211] NOTE at dmtcp_coordinator.cpp:858 in onData; REASON='Updating
process Information after fork()'

     client->hostname() = acme11.ciemat.es

     client->progname() = mpiexec.hydra_(forked)

     msg.from = 1d64b124afe30f29-42000-56252da6

     client->identity() = 1d64b124afe30f29-40000-56252da6

[21211] NOTE at dmtcp_coordinator.cpp:1079 in onConnect; REASON='worker
connected'

     hello_remote.from = 1d64b124afe30f29-40000-56252da6

[21211] NOTE at dmtcp_coordinator.cpp:858 in onData; REASON='Updating
process Information after fork()'

     client->hostname() = acme11.ciemat.es

     client->progname() = mpiexec.hydra_(forked)

     msg.from = 1d64b124afe30f29-43000-56252da6

     client->identity() = 1d64b124afe30f29-40000-56252da6

[21211] NOTE at dmtcp_coordinator.cpp:1079 in onConnect; REASON='worker
connected'

     hello_remote.from = 1d64b124afe30f29-41000-56252da6

[21211] NOTE at dmtcp_coordinator.cpp:858 in onData; REASON='Updating
process Information after fork()'

     client->hostname() = acme11.ciemat.es

     client->progname() = hydra_pmi_proxy_(forked)

     msg.from = 1d64b124afe30f29-44000-56252da6

     client->identity() = 1d64b124afe30f29-41000-56252da6

[21211] NOTE at dmtcp_coordinator.cpp:1079 in onConnect; REASON='worker
connected'

     hello_remote.from = 1d64b124afe30f29-42000-56252da6

[21211] NOTE at dmtcp_coordinator.cpp:858 in onData; REASON='Updating
process Information after fork()'

     client->hostname() = acme11.ciemat.es

     client->progname() = dmtcp_ssh_(forked)

     msg.from = 1d64b124afe30f29-45000-56252da6

     client->identity() = 1d64b124afe30f29-42000-56252da6

[21211] NOTE at dmtcp_coordinator.cpp:1079 in onConnect; REASON='worker
connected'

     hello_remote.from = 1d64b124afe30f29-43000-56252da6

[21211] NOTE at dmtcp_coordinator.cpp:858 in onData; REASON='Updating
process Information after fork()'

     client->hostname() = acme11.ciemat.es

     client->progname() = dmtcp_ssh_(forked)

     msg.from = 1d64b124afe30f29-46000-56252da6

     client->identity() = 1d64b124afe30f29-43000-56252da6

[21211] NOTE at dmtcp_coordinator.cpp:917 in onDisconnect; REASON='client
disconnected'

     client->identity() = 1d64b124afe30f29-45000-56252da6

[21211] NOTE at dmtcp_coordinator.cpp:917 in onDisconnect; REASON='client
disconnected'

     client->identity() = 1d64b124afe30f29-46000-56252da6

[21211] NOTE at dmtcp_coordinator.cpp:867 in onData; REASON='Updating
process Information after exec()'

     progname = hydra_pmi_proxy

     msg.from = 1d64b124afe30f29-41000-56252da6

     client->identity() = 1d64b124afe30f29-41000-56252da6

[21211] NOTE at dmtcp_coordinator.cpp:867 in onData; REASON='Updating
process Information after exec()'

     progname = dmtcp_ssh

     msg.from = 1d64b124afe30f29-42000-56252da6

     client->identity() = 1d64b124afe30f29-42000-56252da6

[21211] NOTE at dmtcp_coordinator.cpp:867 in onData; REASON='Updating
process Information after exec()'

     progname = dmtcp_ssh

     msg.from = 1d64b124afe30f29-43000-56252da6

     client->identity() = 1d64b124afe30f29-43000-56252da6

[21211] NOTE at dmtcp_coordinator.cpp:867 in onData; REASON='Updating
process Information after exec()'

     progname = helloWorldMPI

     msg.from = 1d64b124afe30f29-44000-56252da6

     client->identity() = 1d64b124afe30f29-44000-56252da6

[21211] NOTE at dmtcp_coordinator.cpp:1079 in onConnect; REASON='worker
connected'

     hello_remote.from = 1b69d09fb3238b30-14428-56252da7

[21211] NOTE at dmtcp_coordinator.cpp:1079 in onConnect; REASON='worker
connected'

     hello_remote.from = 54385264162a2589-10066-56252da7

[21211] NOTE at dmtcp_coordinator.cpp:867 in onData; REASON='Updating
process Information after exec()'

     progname = dmtcp_sshd

     msg.from = 1b69d09fb3238b30-47000-56252da7

     client->identity() = 1b69d09fb3238b30-14428-56252da7

[21211] NOTE at dmtcp_coordinator.cpp:867 in onData; REASON='Updating
process Information after exec()'

     progname = dmtcp_sshd

     msg.from = 54385264162a2589-48000-56252da7

     client->identity() = 54385264162a2589-10066-56252da7

[21211] NOTE at dmtcp_coordinator.cpp:1079 in onConnect; REASON='worker
connected'

     hello_remote.from = 1b69d09fb3238b30-47000-56252da7

[21211] NOTE at dmtcp_coordinator.cpp:1079 in onConnect; REASON='worker
connected'

     hello_remote.from = 54385264162a2589-48000-56252da7

[21211] NOTE at dmtcp_coordinator.cpp:858 in onData; REASON='Updating
process Information after fork()'

     client->hostname() = acme13.ciemat.es

     client->progname() = dmtcp_sshd_(forked)

     msg.from = 54385264162a2589-50000-56252da7

     client->identity() = 54385264162a2589-48000-56252da7

[21211] NOTE at dmtcp_coordinator.cpp:858 in onData; REASON='Updating
process Information after fork()'

     client->hostname() = acme12.ciemat.es

     client->progname() = dmtcp_sshd_(forked)

     msg.from = 1b69d09fb3238b30-49000-56252da7

     client->identity() = 1b69d09fb3238b30-47000-56252da7

[21211] NOTE at dmtcp_coordinator.cpp:1079 in onConnect; REASON='worker
connected'

     hello_remote.from = 54385264162a2589-50000-56252da7

[21211] NOTE at dmtcp_coordinator.cpp:1079 in onConnect; REASON='worker
connected'

     hello_remote.from = 1b69d09fb3238b30-49000-56252da7

[21211] NOTE at dmtcp_coordinator.cpp:858 in onData; REASON='Updating
process Information after fork()'

     client->hostname() = acme13.ciemat.es

     client->progname() = hydra_pmi_proxy_(forked)

     msg.from = 54385264162a2589-51000-56252da7

     client->identity() = 54385264162a2589-50000-56252da7

[21211] NOTE at dmtcp_coordinator.cpp:858 in onData; REASON='Updating
process Information after fork()'

     client->hostname() = acme12.ciemat.es

     client->progname() = hydra_pmi_proxy_(forked)

     msg.from = 1b69d09fb3238b30-52000-56252da7

     client->identity() = 1b69d09fb3238b30-49000-56252da7

[21211] NOTE at dmtcp_coordinator.cpp:867 in onData; REASON='Updating
process Information after exec()'

     progname = hydra_pmi_proxy

     msg.from = 1b69d09fb3238b30-49000-56252da7

     client->identity() = 1b69d09fb3238b30-49000-56252da7

[21211] NOTE at dmtcp_coordinator.cpp:867 in onData; REASON='Updating
process Information after exec()'

     progname = hydra_pmi_proxy

     msg.from = 54385264162a2589-50000-56252da7

     client->identity() = 54385264162a2589-50000-56252da7

[21211] NOTE at dmtcp_coordinator.cpp:867 in onData; REASON='Updating
process Information after exec()'

     progname = helloWorldMPI

     msg.from = 54385264162a2589-51000-56252da7

     client->identity() = 54385264162a2589-51000-56252da7

[21211] NOTE at dmtcp_coordinator.cpp:867 in onData; REASON='Updating
process Information after exec()'

     progname = helloWorldMPI

     msg.from = 1b69d09fb3238b30-52000-56252da7

     client->identity() = 1b69d09fb3238b30-52000-56252da7

[21211] NOTE at dmtcp_coordinator.cpp:917 in onDisconnect; REASON='client
disconnected'

     client->identity() = 1b69d09fb3238b30-52000-56252da7

[21211] NOTE at dmtcp_coordinator.cpp:917 in onDisconnect; REASON='client
disconnected'

     client->identity() = 54385264162a2589-51000-56252da7

[21211] NOTE at dmtcp_coordinator.cpp:917 in onDisconnect; REASON='client
disconnected'

     client->identity() = 1d64b124afe30f29-44000-56252da6

[21211] NOTE at dmtcp_coordinator.cpp:917 in onDisconnect; REASON='client
disconnected'

     client->identity() = 1b69d09fb3238b30-49000-56252da7

[21211] NOTE at dmtcp_coordinator.cpp:917 in onDisconnect; REASON='client
disconnected'

     client->identity() = 54385264162a2589-50000-56252da7

[21211] NOTE at dmtcp_coordinator.cpp:917 in onDisconnect; REASON='client
disconnected'

     client->identity() = 1d64b124afe30f29-41000-56252da6

[21211] NOTE at dmtcp_coordinator.cpp:917 in onDisconnect; REASON='client
disconnected'

     client->identity() = 1b69d09fb3238b30-47000-56252da7

[21211] NOTE at dmtcp_coordinator.cpp:917 in onDisconnect; REASON='client
disconnected'

     client->identity() = 54385264162a2589-48000-56252da7

[21211] NOTE at dmtcp_coordinator.cpp:917 in onDisconnect; REASON='client
disconnected'

     client->identity() = 1d64b124afe30f29-42000-56252da6

[21211] NOTE at dmtcp_coordinator.cpp:917 in onDisconnect; REASON='client
disconnected'

     client->identity() = 1d64b124afe30f29-43000-56252da6

[21211] NOTE at dmtcp_coordinator.cpp:917 in onDisconnect; REASON='client
disconnected'

     client->identity() = 1d64b124afe30f29-40000-56252da6





2015-10-19 8:26 GMT-07:00 Jiajun Cao <jia...@ccs.neu.edu>:

> Hi Manuel,
>
> The infiniband plugin shouldn't affect application launching. Could you
> try removing the "--ib" flag and see if the application still crashes? This
> can help diagnose whether the issue is in the ib plugin or other dmtcp
> modules.
>
> Best,
> Jiajun
>
>
> On Sun, Oct 18, 2015 at 10:57 PM, Kapil Arya <ka...@ccs.neu.edu> wrote:
>
>> Hey Jiajun,
>>
>> Can you take a look at this problem as it is closer to your area of
>> expertise :-).
>>
>> Best,
>> Kapil
>>
>> On Sat, Oct 17, 2015 at 11:31 PM, Manuel Rodríguez Pascual <
>> manuel.rodriguez.pasc...@gmail.com> wrote:
>>
>>> Hi all,
>>>
>>> I am trying to checkpoint an MVAPICH application. It does not behave as
>>> expected, so maybe you can give me some support.
>>>
>>> I have compiled DMTCP with "--enable-infiniband-support" as the only flag.
>>> I have MVAPICH installed.
>>>
>>> I can execute a test MPI application on two nodes without DMTCP. I can
>>> also execute the application on a single node with DMTCP. However, if I
>>> execute it on two nodes with DMTCP, only the first one will run.
>>>
>>> Below there is a series of test commands with a lot of output, together
>>> with the versions of everything.
>>>
>>> Any ideas?
>>>
>>> thanks for your help,
>>>
>>>
>>> Manuel
>>>
>>>
>>> ---
>>> ---
>>>
>>> # mpichversion
>>>
>>> MVAPICH2 Version:     2.2a
>>>
>>> MVAPICH2 Release date: Mon Aug 17 20:00:00 EDT 2015
>>>
>>> MVAPICH2 Device:      ch3:mrail
>>>
>>> MVAPICH2 configure:   --disable-mcast
>>>
>>> MVAPICH2 CC:  gcc    -DNDEBUG -DNVALGRIND -O2
>>>
>>> MVAPICH2 CXX: g++   -DNDEBUG -DNVALGRIND -O2
>>>
>>> MVAPICH2 F77: gfortran -L/lib -L/lib   -O2
>>>
>>> MVAPICH2 FC:  gfortran   -O2
>>>
>>> # dmtcp_coordinator --version
>>>
>>> dmtcp_coordinator (DMTCP) 2.4.1
>>>
>>> ---
>>>
>>> ---
>>>
>>>
>>> I can execute a test MPI application in two nodes (acme11 and 12), with
>>>
>>> ---
>>> ---
>>> # mpirun_rsh  -n 2  acme11 acme12 ./helloWorldMPI
>>>
>>> Process 0 of 2 is on acme11.ciemat.es
>>>
>>> Process 1 of 2 is on acme12.ciemat.es
>>>
>>> Hello world from process 0 of 2
>>>
>>> Hello world from process 1 of 2
>>>
>>> Goodbye world from process 0 of 2
>>>
>>> Goodbye world from process 1 of 2
>>> ---
>>> ---
>>>
>>> As you can see, it works correctly.
>>>
>>>
>>> If I try to execute the application with DMTCP, however, it does not.
>>>
>>> I run the coordinator on acme11, with port 7779.
>>>
>>>
>>> I can execute the application on a single node. For example,
>>>
>>> ---
>>> ---
>>>
>>> #  dmtcp_launch -h acme11 -p 7779 --ib mpirun_rsh  -n 1  acme12
>>> ./helloWorldMPI
>>>
>>> [41000] NOTE at ssh.cpp:369 in prepareForExec; REASON='New ssh command'
>>>
>>>      newCommand = /home/localsoft/dmtcp/bin/dmtcp_ssh
>>> /home/localsoft/dmtcp/bin/dmtcp_nocheckpoint /usr/bin/ssh -q acme12 cd
>>> /home/slurm/tests;/home/localsoft/dmtcp/bin/dmtcp_launch --coord-host
>>> 172.17.29.173 --coord-port 7779 --ckptdir /home/slurm/tests --infiniband
>>> /home/localsoft/dmtcp/bin/dmtcp_sshd /usr/bin/env  MPISPAWN_MPIRUN_MPD=0
>>> USE_LINEAR_SSH=1 MPISPAWN_MPIRUN_HOST=acme11.ciemat.es
>>> MPISPAWN_MPIRUN_HOSTIP=172.17.29.173 MPIRUN_RSH_LAUNCH=1
>>> MPISPAWN_CHECKIN_PORT=33687 MPISPAWN_MPIRUN_PORT=33687 MPISPAWN_NNODES=1
>>> MPISPAWN_GLOBAL_NPROCS=1 MPISPAWN_MPIRUN_ID=40000 MPISPAWN_ARGC=1
>>> MPDMAN_KVS_TEMPLATE=kvs_885_acme11.ciemat.es_40000 MPISPAWN_LOCAL_NPROCS=1
>>> MPISPAWN_ARGV_0='./helloWorldMPI' MPISPAWN_ARGC=1
>>> MPISPAWN_GENERIC_ENV_COUNT=0  MPISPAWN_ID=0
>>> MPISPAWN_WORKING_DIR=/home/slurm/tests MPISPAWN_MPIRUN_RANK_0=0
>>> /usr/local/bin/mpispawn 0
>>>
>>> Process 0 of 1 is on acme12.ciemat.es
>>>
>>> Hello world from process 0 of 1
>>>
>>> Goodbye world from process 0 of 1
>>>
>>>
>>> COORDINATOR OUTPUT
>>>
>>>
>>> [3984] NOTE at dmtcp_coordinator.cpp:1079 in onConnect; REASON='worker
>>> connected'
>>>
>>>      hello_remote.from = 1d64b124afe30f29-4029-562310a2
>>>
>>> [3984] NOTE at dmtcp_coordinator.cpp:867 in onData; REASON='Updating
>>> process Information after exec()'
>>>
>>>      progname = mpirun_rsh
>>>
>>>      msg.from = 1d64b124afe30f29-52000-562310a2
>>>
>>>      client->identity() = 1d64b124afe30f29-4029-562310a2
>>>
>>> [3984] NOTE at dmtcp_coordinator.cpp:1079 in onConnect; REASON='worker
>>> connected'
>>>
>>>      hello_remote.from = 1d64b124afe30f29-52000-562310a2
>>>
>>> [3984] NOTE at dmtcp_coordinator.cpp:858 in onData; REASON='Updating
>>> process Information after fork()'
>>>
>>>      client->hostname() = acme11.ciemat.es
>>>
>>>      client->progname() = mpirun_rsh_(forked)
>>>
>>>      msg.from = 1d64b124afe30f29-53000-562310a2
>>>
>>>      client->identity() = 1d64b124afe30f29-52000-562310a2
>>>
>>> [3984] NOTE at dmtcp_coordinator.cpp:1079 in onConnect; REASON='worker
>>> connected'
>>>
>>>      hello_remote.from = 1d64b124afe30f29-53000-562310a2
>>>
>>> [3984] NOTE at dmtcp_coordinator.cpp:858 in onData; REASON='Updating
>>> process Information after fork()'
>>>
>>>      client->hostname() = acme11.ciemat.es
>>>
>>>      client->progname() = dmtcp_ssh_(forked)
>>>
>>>      msg.from = 1d64b124afe30f29-54000-562310a2
>>>
>>>      client->identity() = 1d64b124afe30f29-53000-562310a2
>>>
>>> [3984] NOTE at dmtcp_coordinator.cpp:917 in onDisconnect; REASON='client
>>> disconnected'
>>>
>>>      client->identity() = 1d64b124afe30f29-54000-562310a2
>>>
>>> [3984] NOTE at dmtcp_coordinator.cpp:867 in onData; REASON='Updating
>>> process Information after exec()'
>>>
>>>      progname = dmtcp_ssh
>>>
>>>      msg.from = 1d64b124afe30f29-53000-562310a2
>>>
>>>      client->identity() = 1d64b124afe30f29-53000-562310a2
>>>
>>> [3984] NOTE at dmtcp_coordinator.cpp:1079 in onConnect; REASON='worker
>>> connected'
>>>
>>>      hello_remote.from = 1b69d09fb3238b30-23945-562310a2
>>>
>>> [3984] NOTE at dmtcp_coordinator.cpp:867 in onData; REASON='Updating
>>> process Information after exec()'
>>>
>>>      progname = dmtcp_sshd
>>>
>>>      msg.from = 1b69d09fb3238b30-55000-562310a2
>>>
>>>      client->identity() = 1b69d09fb3238b30-23945-562310a2
>>>
>>> [3984] NOTE at dmtcp_coordinator.cpp:1079 in onConnect; REASON='worker
>>> connected'
>>>
>>>      hello_remote.from = 1b69d09fb3238b30-55000-562310a2
>>>
>>> [3984] NOTE at dmtcp_coordinator.cpp:858 in onData; REASON='Updating
>>> process Information after fork()'
>>>
>>>      client->hostname() = acme12.ciemat.es
>>>
>>>      client->progname() = dmtcp_sshd_(forked)
>>>
>>>      msg.from = 1b69d09fb3238b30-56000-562310a2
>>>
>>>      client->identity() = 1b69d09fb3238b30-55000-562310a2
>>>
>>> [3984] NOTE at dmtcp_coordinator.cpp:1079 in onConnect; REASON='worker
>>> connected'
>>>
>>>      hello_remote.from = 1b69d09fb3238b30-56000-562310a2
>>>
>>> [3984] NOTE at dmtcp_coordinator.cpp:858 in onData; REASON='Updating
>>> process Information after fork()'
>>>
>>>      client->hostname() = acme12.ciemat.es
>>>
>>>      client->progname() = mpispawn_(forked)
>>>
>>>      msg.from = 1b69d09fb3238b30-57000-562310a2
>>>
>>>      client->identity() = 1b69d09fb3238b30-56000-562310a2
>>>
>>> [3984] NOTE at dmtcp_coordinator.cpp:867 in onData; REASON='Updating
>>> process Information after exec()'
>>>
>>>      progname = env
>>>
>>>      msg.from = 1b69d09fb3238b30-56000-562310a2
>>>
>>>      client->identity() = 1b69d09fb3238b30-56000-562310a2
>>>
>>> [3984] NOTE at dmtcp_coordinator.cpp:867 in onData; REASON='Updating
>>> process Information after exec()'
>>>
>>>      progname = mpispawn
>>>
>>>      msg.from = 1b69d09fb3238b30-56000-562310a2
>>>
>>>      client->identity() = 1b69d09fb3238b30-56000-562310a2
>>>
>>> [3984] NOTE at dmtcp_coordinator.cpp:867 in onData; REASON='Updating
>>> process Information after exec()'
>>>
>>>      progname = helloWorldMPI
>>>
>>>      msg.from = 1b69d09fb3238b30-57000-562310a2
>>>
>>>      client->identity() = 1b69d09fb3238b30-57000-562310a2
>>>
>>> [3984] NOTE at dmtcp_coordinator.cpp:917 in onDisconnect; REASON='client
>>> disconnected'
>>>
>>>      client->identity() = 1b69d09fb3238b30-57000-562310a2
>>>
>>> [3984] NOTE at dmtcp_coordinator.cpp:917 in onDisconnect; REASON='client
>>> disconnected'
>>>
>>>      client->identity() = 1b69d09fb3238b30-56000-562310a2
>>>
>>> [3984] NOTE at dmtcp_coordinator.cpp:917 in onDisconnect; REASON='client
>>> disconnected'
>>>
>>>      client->identity() = 1b69d09fb3238b30-55000-562310a2
>>>
>>> [3984] NOTE at dmtcp_coordinator.cpp:917 in onDisconnect; REASON='client
>>> disconnected'
>>>
>>>      client->identity() = 1d64b124afe30f29-53000-562310a2
>>>
>>> [3984] NOTE at dmtcp_coordinator.cpp:917 in onDisconnect; REASON='client
>>> disconnected'
>>>
>>>      client->identity() = 1d64b124afe30f29-52000-562310a2
>>>
>>>
>>> ---
>>> ---
>>>
>>> So we can see that it is working correctly: the workers connect and
>>> disconnect as expected.
>>>
>>> However, if I run the application on more than one core, as in the first
>>> example, it crashes. What happens is that only the first node in the node
>>> list executes the application; the rest do not.
>>>
>>> ----
>>> ----
>>>
>>> [root@acme11 tests]#  dmtcp_launch -h acme11 -p 7779 --ib mpirun_rsh
>>> -n 2  acme11 acme12 ./helloWorldMPI
>>>
>>> [59000] NOTE at ssh.cpp:369 in prepareForExec; REASON='New ssh command'
>>>
>>>      newCommand = /home/localsoft/dmtcp/bin/dmtcp_ssh
>>> /home/localsoft/dmtcp/bin/dmtcp_nocheckpoint /usr/bin/ssh -q acme11 cd
>>> /home/slurm/tests;/home/localsoft/dmtcp/bin/dmtcp_launch --coord-host
>>> 172.17.29.173 --coord-port 7779 --ckptdir /home/slurm/tests --infiniband
>>> /home/localsoft/dmtcp/bin/dmtcp_sshd /usr/bin/env  MPISPAWN_MPIRUN_MPD=0
>>> USE_LINEAR_SSH=1 MPISPAWN_MPIRUN_HOST=acme11.ciemat.es
>>> MPISPAWN_MPIRUN_HOSTIP=172.17.29.173 MPIRUN_RSH_LAUNCH=1
>>> MPISPAWN_CHECKIN_PORT=34203 MPISPAWN_MPIRUN_PORT=34203 MPISPAWN_NNODES=2
>>> MPISPAWN_GLOBAL_NPROCS=2 MPISPAWN_MPIRUN_ID=58000 MPISPAWN_ARGC=1
>>> MPDMAN_KVS_TEMPLATE=kvs_481_acme11.ciemat.es_58000 MPISPAWN_LOCAL_NPROCS=1
>>> MPISPAWN_ARGV_0='./helloWorldMPI' MPISPAWN_ARGC=1
>>> MPISPAWN_GENERIC_ENV_COUNT=0  MPISPAWN_ID=0
>>> MPISPAWN_WORKING_DIR=/home/slurm/tests MPISPAWN_MPIRUN_RANK_0=0
>>> /usr/local/bin/mpispawn 0
>>>
>>> [60000] NOTE at ssh.cpp:369 in prepareForExec; REASON='New ssh command'
>>>
>>>      newCommand = /home/localsoft/dmtcp/bin/dmtcp_ssh
>>> /home/localsoft/dmtcp/bin/dmtcp_nocheckpoint /usr/bin/ssh -q acme12 cd
>>> /home/slurm/tests;/home/localsoft/dmtcp/bin/dmtcp_launch --coord-host
>>> 172.17.29.173 --coord-port 7779 --ckptdir /home/slurm/tests --infiniband
>>> /home/localsoft/dmtcp/bin/dmtcp_sshd /usr/bin/env  MPISPAWN_MPIRUN_MPD=0
>>> USE_LINEAR_SSH=1 MPISPAWN_MPIRUN_HOST=acme11.ciemat.es
>>> MPISPAWN_MPIRUN_HOSTIP=172.17.29.173 MPIRUN_RSH_LAUNCH=1
>>> MPISPAWN_CHECKIN_PORT=34203 MPISPAWN_MPIRUN_PORT=34203 MPISPAWN_NNODES=2
>>> MPISPAWN_GLOBAL_NPROCS=2 MPISPAWN_MPIRUN_ID=58000 MPISPAWN_ARGC=1
>>> MPDMAN_KVS_TEMPLATE=kvs_481_acme11.ciemat.es_58000 MPISPAWN_LOCAL_NPROCS=1
>>> MPISPAWN_ARGV_0='./helloWorldMPI' MPISPAWN_ARGC=1
>>> MPISPAWN_GENERIC_ENV_COUNT=0  MPISPAWN_ID=1
>>> MPISPAWN_WORKING_DIR=/home/slurm/tests MPISPAWN_MPIRUN_RANK_0=1
>>> /usr/local/bin/mpispawn 0
>>>
>>> Process 0 of 2 is on acme11.ciemat.es
>>>
>>> Hello world from process 0 of 2
>>>
>>> Goodbye world from process 0 of 2
>>>
>>> COORDINATOR OUTPUT
>>>
>>>
>>> [3984] NOTE at dmtcp_coordinator.cpp:1079 in onConnect; REASON='worker
>>> connected'
>>>
>>>      hello_remote.from = 1d64b124afe30f29-4070-56231173
>>>
>>> [3984] NOTE at dmtcp_coordinator.cpp:867 in onData; REASON='Updating
>>> process Information after exec()'
>>>
>>>      progname = mpirun_rsh
>>>
>>>      msg.from = 1d64b124afe30f29-58000-56231173
>>>
>>>      client->identity() = 1d64b124afe30f29-4070-56231173
>>>
>>> [3984] NOTE at dmtcp_coordinator.cpp:1079 in onConnect; REASON='worker
>>> connected'
>>>
>>>      hello_remote.from = 1d64b124afe30f29-58000-56231173
>>>
>>> [3984] NOTE at dmtcp_coordinator.cpp:1079 in onConnect; REASON='worker
>>> connected'
>>>
>>>      hello_remote.from = 1d64b124afe30f29-58000-56231173
>>>
>>> [3984] NOTE at dmtcp_coordinator.cpp:858 in onData; REASON='Updating
>>> process Information after fork()'
>>>
>>>      client->hostname() = acme11.ciemat.es
>>>
>>>      client->progname() = mpirun_rsh_(forked)
>>>
>>>      msg.from = 1d64b124afe30f29-59000-56231173
>>>
>>>      client->identity() = 1d64b124afe30f29-58000-56231173
>>>
>>> [3984] NOTE at dmtcp_coordinator.cpp:858 in onData; REASON='Updating
>>> process Information after fork()'
>>>
>>>      client->hostname() = acme11.ciemat.es
>>>
>>>      client->progname() = mpirun_rsh_(forked)
>>>
>>>      msg.from = 1d64b124afe30f29-60000-56231173
>>>
>>>      client->identity() = 1d64b124afe30f29-58000-56231173
>>>
>>> [3984] NOTE at dmtcp_coordinator.cpp:1079 in onConnect; REASON='worker
>>> connected'
>>>
>>>      hello_remote.from = 1d64b124afe30f29-59000-56231173
>>>
>>> [3984] NOTE at dmtcp_coordinator.cpp:1079 in onConnect; REASON='worker
>>> connected'
>>>
>>>      hello_remote.from = 1d64b124afe30f29-60000-56231173
>>>
>>> [3984] NOTE at dmtcp_coordinator.cpp:858 in onData; REASON='Updating
>>> process Information after fork()'
>>>
>>>      client->hostname() = acme11.ciemat.es
>>>
>>>      client->progname() = dmtcp_ssh_(forked)
>>>
>>>      msg.from = 1d64b124afe30f29-61000-56231173
>>>
>>>      client->identity() = 1d64b124afe30f29-59000-56231173
>>>
>>> [3984] NOTE at dmtcp_coordinator.cpp:858 in onData; REASON='Updating
>>> process Information after fork()'
>>>
>>>      client->hostname() = acme11.ciemat.es
>>>
>>>      client->progname() = dmtcp_ssh_(forked)
>>>
>>>      msg.from = 1d64b124afe30f29-62000-56231173
>>>
>>>      client->identity() = 1d64b124afe30f29-60000-56231173
>>>
>>> [3984] NOTE at dmtcp_coordinator.cpp:917 in onDisconnect; REASON='client
>>> disconnected'
>>>
>>>      client->identity() = 1d64b124afe30f29-61000-56231173
>>>
>>> [3984] NOTE at dmtcp_coordinator.cpp:917 in onDisconnect; REASON='client
>>> disconnected'
>>>
>>>      client->identity() = 1d64b124afe30f29-62000-56231173
>>>
>>> [3984] NOTE at dmtcp_coordinator.cpp:867 in onData; REASON='Updating
>>> process Information after exec()'
>>>
>>>      progname = dmtcp_ssh
>>>
>>>      msg.from = 1d64b124afe30f29-59000-56231173
>>>
>>>      client->identity() = 1d64b124afe30f29-59000-56231173
>>>
>>> [3984] NOTE at dmtcp_coordinator.cpp:867 in onData; REASON='Updating
>>> process Information after exec()'
>>>
>>>      progname = dmtcp_ssh
>>>
>>>      msg.from = 1d64b124afe30f29-60000-56231173
>>>
>>>      client->identity() = 1d64b124afe30f29-60000-56231173
>>>
>>> [3984] NOTE at dmtcp_coordinator.cpp:1079 in onConnect; REASON='worker
>>> connected'
>>>
>>>      hello_remote.from = 1b69d09fb3238b30-24001-56231173
>>>
>>> [3984] NOTE at dmtcp_coordinator.cpp:1079 in onConnect; REASON='worker
>>> connected'
>>>
>>>      hello_remote.from = 1d64b124afe30f29-4094-56231173
>>>
>>> [3984] NOTE at dmtcp_coordinator.cpp:867 in onData; REASON='Updating
>>> process Information after exec()'
>>>
>>>      progname = dmtcp_sshd
>>>
>>>      msg.from = 1d64b124afe30f29-64000-56231173
>>>
>>>      client->identity() = 1d64b124afe30f29-4094-56231173
>>>
>>> [3984] NOTE at dmtcp_coordinator.cpp:867 in onData; REASON='Updating
>>> process Information after exec()'
>>>
>>>      progname = dmtcp_sshd
>>>
>>>      msg.from = 1b69d09fb3238b30-63000-56231173
>>>
>>>      client->identity() = 1b69d09fb3238b30-24001-56231173
>>>
>>> [3984] NOTE at dmtcp_coordinator.cpp:1079 in onConnect; REASON='worker
>>> connected'
>>>
>>>      hello_remote.from = 1d64b124afe30f29-64000-56231173
>>>
>>> [3984] NOTE at dmtcp_coordinator.cpp:1079 in onConnect; REASON='worker
>>> connected'
>>>
>>>      hello_remote.from = 1b69d09fb3238b30-63000-56231173
>>>
>>> [3984] NOTE at dmtcp_coordinator.cpp:858 in onData; REASON='Updating
>>> process Information after fork()'
>>>
>>>      client->hostname() = acme11.ciemat.es
>>>
>>>      client->progname() = dmtcp_sshd_(forked)
>>>
>>>      msg.from = 1d64b124afe30f29-65000-56231173
>>>
>>>      client->identity() = 1d64b124afe30f29-64000-56231173
>>>
>>> [3984] NOTE at dmtcp_coordinator.cpp:858 in onData; REASON='Updating
>>> process Information after fork()'
>>>
>>>      client->hostname() = acme12.ciemat.es
>>>
>>>      client->progname() = dmtcp_sshd_(forked)
>>>
>>>      msg.from = 1b69d09fb3238b30-66000-56231173
>>>
>>>      client->identity() = 1b69d09fb3238b30-63000-56231173
>>>
>>> [3984] NOTE at dmtcp_coordinator.cpp:867 in onData; REASON='Updating
>>> process Information after exec()'
>>>
>>>      progname = env
>>>
>>>      msg.from = 1d64b124afe30f29-65000-56231173
>>>
>>>      client->identity() = 1d64b124afe30f29-65000-56231173
>>>
>>> [3984] NOTE at dmtcp_coordinator.cpp:867 in onData; REASON='Updating
>>> process Information after exec()'
>>>
>>>      progname = mpispawn
>>>
>>>      msg.from = 1d64b124afe30f29-65000-56231173
>>>
>>>      client->identity() = 1d64b124afe30f29-65000-56231173
>>>
>>> [3984] NOTE at dmtcp_coordinator.cpp:1079 in onConnect; REASON='worker
>>> connected'
>>>
>>>      hello_remote.from = 1b69d09fb3238b30-66000-56231173
>>>
>>> [3984] NOTE at dmtcp_coordinator.cpp:1079 in onConnect; REASON='worker
>>> connected'
>>>
>>>      hello_remote.from = 1d64b124afe30f29-65000-56231173
>>>
>>> [3984] NOTE at dmtcp_coordinator.cpp:858 in onData; REASON='Updating
>>> process Information after fork()'
>>>
>>>      client->hostname() = acme11.ciemat.es
>>>
>>>      client->progname() = mpispawn_(forked)
>>>
>>>      msg.from = 1d64b124afe30f29-68000-56231173
>>>
>>>      client->identity() = 1d64b124afe30f29-65000-56231173
>>>
>>> [3984] NOTE at dmtcp_coordinator.cpp:858 in onData; REASON='Updating
>>> process Information after fork()'
>>>
>>>      client->hostname() = acme12.ciemat.es
>>>
>>>      client->progname() = mpispawn_(forked)
>>>
>>>      msg.from = 1b69d09fb3238b30-67000-56231173
>>>
>>>      client->identity() = 1b69d09fb3238b30-66000-56231173
>>>
>>> [3984] NOTE at dmtcp_coordinator.cpp:867 in onData; REASON='Updating
>>> process Information after exec()'
>>>
>>>      progname = env
>>>
>>>      msg.from = 1b69d09fb3238b30-66000-56231173
>>>
>>>      client->identity() = 1b69d09fb3238b30-66000-56231173
>>>
>>> [3984] NOTE at dmtcp_coordinator.cpp:867 in onData; REASON='Updating
>>> process Information after exec()'
>>>
>>>      progname = mpispawn
>>>
>>>      msg.from = 1b69d09fb3238b30-66000-56231173
>>>
>>>      client->identity() = 1b69d09fb3238b30-66000-56231173
>>>
>>> [3984] NOTE at dmtcp_coordinator.cpp:867 in onData; REASON='Updating
>>> process Information after exec()'
>>>
>>>      progname = helloWorldMPI
>>>
>>>      msg.from = 1d64b124afe30f29-68000-56231173
>>>
>>>      client->identity() = 1d64b124afe30f29-68000-56231173
>>>
>>> [3984] NOTE at dmtcp_coordinator.cpp:867 in onData; REASON='Updating
>>> process Information after exec()'
>>>
>>>      progname = helloWorldMPI
>>>
>>>      msg.from = 1b69d09fb3238b30-67000-56231173
>>>
>>>      client->identity() = 1b69d09fb3238b30-67000-56231173
>>>
>>> [3984] NOTE at dmtcp_coordinator.cpp:917 in onDisconnect; REASON='client
>>> disconnected'
>>>
>>>      client->identity() = 1d64b124afe30f29-68000-56231173
>>>
>>> [3984] NOTE at dmtcp_coordinator.cpp:917 in onDisconnect; REASON='client
>>> disconnected'
>>>
>>>      client->identity() = 1b69d09fb3238b30-67000-56231173
>>>
>>> [3984] NOTE at dmtcp_coordinator.cpp:917 in onDisconnect; REASON='client
>>> disconnected'
>>>
>>>      client->identity() = 1d64b124afe30f29-65000-56231173
>>>
>>> [3984] NOTE at dmtcp_coordinator.cpp:917 in onDisconnect; REASON='client
>>> disconnected'
>>>
>>>      client->identity() = 1b69d09fb3238b30-66000-56231173
>>>
>>> [3984] NOTE at dmtcp_coordinator.cpp:917 in onDisconnect; REASON='client
>>> disconnected'
>>>
>>>      client->identity() = 1d64b124afe30f29-64000-56231173
>>>
>>> [3984] NOTE at dmtcp_coordinator.cpp:917 in onDisconnect; REASON='client
>>> disconnected'
>>>
>>>      client->identity() = 1b69d09fb3238b30-63000-56231173
>>>
>>> [3984] NOTE at dmtcp_coordinator.cpp:917 in onDisconnect; REASON='client
>>> disconnected'
>>>
>>>      client->identity() = 1d64b124afe30f29-59000-56231173
>>>
>>> [3984] NOTE at dmtcp_coordinator.cpp:917 in onDisconnect; REASON='client
>>> disconnected'
>>>
>>>      client->identity() = 1d64b124afe30f29-60000-56231173
>>>
>>> [3984] NOTE at dmtcp_coordinator.cpp:917 in onDisconnect; REASON='client
>>> disconnected'
>>>
>>>      client->identity() = 1d64b124afe30f29-58000-56231173
>>>
>>>
>>> ----
>>>
>>> ----
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>> --
>>> Dr. Manuel Rodríguez-Pascual
>>> skype: manuel.rodriguez.pascual
>>> phone: (+34) 913466173 // (+34) 679925108
>>>
>>> CIEMAT-Moncloa
>>> Edificio 22, desp. 1.25
>>> Avenida Complutense, 40
>>> 28040- MADRID
>>> SPAIN
>>>
>>>
>>> ------------------------------------------------------------------------------
>>>
>>> _______________________________________________
>>> Dmtcp-forum mailing list
>>> Dmtcp-forum@lists.sourceforge.net
>>> https://lists.sourceforge.net/lists/listinfo/dmtcp-forum
>>>
>>>
>>
>

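A second small sketch, again only illustrative: it compares the NAME=value assignments in the two 'New ssh command' lines that dmtcp_launch prints, to check that the launch commands for the two nodes differ only where expected. The helper names env_assignments and env_diff are hypothetical. In the two commands quoted above, the only assignments that differ are MPISPAWN_ID and MPISPAWN_MPIRUN_RANK_0 (0 for acme11, 1 for acme12); the ssh target host also differs, but it is not a NAME=value token, so this sketch does not report it.

```python
import re

# NAME=value tokens where NAME is upper-case, as in the MPISPAWN_* settings.
ASSIGN = re.compile(r"\b([A-Z][A-Z0-9_]*)=(\S+)")


def env_assignments(command):
    """Return {NAME: [values...]} for every NAME=value token in `command`.

    Values are kept in a list because a name can appear more than once
    (MPISPAWN_ARGC does in the commands quoted above).
    """
    env = {}
    for name, value in ASSIGN.findall(command):
        env.setdefault(name, []).append(value)
    return env


def env_diff(cmd_a, cmd_b):
    """Return {NAME: (values_in_a, values_in_b)} for names whose values differ."""
    a, b = env_assignments(cmd_a), env_assignments(cmd_b)
    return {
        name: (a.get(name), b.get(name))
        for name in sorted(set(a) | set(b))
        if a.get(name) != b.get(name)
    }
```

Pasting the two newCommand strings (line wraps and ">>>" quote markers included) into env_diff should work, since tokens are matched individually rather than by line.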

-- 
Dr. Manuel Rodríguez-Pascual
skype: manuel.rodriguez.pascual
phone: (+34) 913466173 // (+34) 679925108

CIEMAT-Moncloa
Edificio 22, desp. 1.25
Avenida Complutense, 40
28040- MADRID
SPAIN
