Dear all,
I ran dmtcp_coordinator on the head node and then tried to run dmtcp_launch
on another node (compute-node) using the following command:
dmtcp_launch --coord-host head-node --coord-port 7779 mpirun -np 24 -hostfile machines ./mm.o
However, the MPI application does not run. In the dmtcp_coordinator
output log, the last two REASON entries say "client disconnected".
Why are these clients disconnected? Any idea how to fix this? Thank you in
advance.
This is the output of dmtcp_coordinator:
[25572] NOTE at dmtcp_coordinator.cpp:1664 in updateCheckpointInterval;
REASON='CheckpointInterval updated (for this computation only)'
oldInterval = 0
theCheckpointInterval = 0
[25572] NOTE at dmtcp_coordinator.cpp:1079 in onConnect; REASON='worker
connected'
hello_remote.from = 3537527e5a992df8-12706-571c48a6
[25572] NOTE at dmtcp_coordinator.cpp:867 in onData; REASON='Updating
process Information after exec()'
progname = mpiexec.hydra
msg.from = 3537527e5a992df8-40000-571c48a6
client->identity() = 3537527e5a992df8-12706-571c48a6
[25572] NOTE at dmtcp_coordinator.cpp:1079 in onConnect; REASON='worker
connected'
hello_remote.from = 3537527e5a992df8-40000-571c48a6
[25572] NOTE at dmtcp_coordinator.cpp:858 in onData; REASON='Updating
process Information after fork()'
client->hostname() = compute-node
client->progname() = mpiexec.hydra_(forked)
msg.from = 3537527e5a992df8-41000-571c48a6
client->identity() = 3537527e5a992df8-40000-571c48a6
[25572] NOTE at dmtcp_coordinator.cpp:1079 in onConnect; REASON='worker
connected'
hello_remote.from = 3537527e5a992df8-40000-571c48a6
[25572] NOTE at dmtcp_coordinator.cpp:1079 in onConnect; REASON='worker
connected'
hello_remote.from = 3537527e5a992df8-40000-571c48a6
[25572] NOTE at dmtcp_coordinator.cpp:858 in onData; REASON='Updating
process Information after fork()'
client->hostname() = compute-node
client->progname() = mpiexec.hydra_(forked)
msg.from = 3537527e5a992df8-42000-571c48a6
client->identity() = 3537527e5a992df8-40000-571c48a6
[25572] NOTE at dmtcp_coordinator.cpp:858 in onData; REASON='Updating
process Information after fork()'
client->hostname() = compute-node
client->progname() = mpiexec.hydra_(forked)
msg.from = 3537527e5a992df8-43000-571c48a6
client->identity() = 3537527e5a992df8-40000-571c48a6
[25572] NOTE at dmtcp_coordinator.cpp:1079 in onConnect; REASON='worker
connected'
hello_remote.from = 3537527e5a992df8-41000-571c48a6
[25572] NOTE at dmtcp_coordinator.cpp:858 in onData; REASON='Updating
process Information after fork()'
client->hostname() = compute-node
client->progname() = dmtcp_ssh_(forked)
msg.from = 3537527e5a992df8-44000-571c48a6
client->identity() = 3537527e5a992df8-41000-571c48a6
[25572] NOTE at dmtcp_coordinator.cpp:1079 in onConnect; REASON='worker
connected'
hello_remote.from = 3537527e5a992df8-43000-571c48a6
[25572] NOTE at dmtcp_coordinator.cpp:917 in onDisconnect; REASON='client
disconnected'
client->identity() = 3537527e5a992df8-44000-571c48a6
client->progname() = dmtcp_ssh_(forked)
[25572] NOTE at dmtcp_coordinator.cpp:858 in onData; REASON='Updating
process Information after fork()'
client->hostname() = compute-node
client->progname() = dmtcp_ssh_(forked)
msg.from = 3537527e5a992df8-45000-571c48a6
client->identity() = 3537527e5a992df8-43000-571c48a6
[25572] NOTE at dmtcp_coordinator.cpp:917 in onDisconnect; REASON='client
disconnected'
client->identity() = 3537527e5a992df8-45000-571c48a6
client->progname() = dmtcp_ssh_(forked)
[25572] NOTE at dmtcp_coordinator.cpp:1079 in onConnect; REASON='worker
connected'
hello_remote.from = 3537527e5a992df8-42000-571c48a6
[25572] NOTE at dmtcp_coordinator.cpp:858 in onData; REASON='Updating
process Information after fork()'
client->hostname() = compute-node
client->progname() = hydra_pmi_proxy_(forked)
msg.from = 3537527e5a992df8-46000-571c48a6
client->identity() = 3537527e5a992df8-42000-571c48a6
[25572] NOTE at dmtcp_coordinator.cpp:1079 in onConnect; REASON='worker
connected'
hello_remote.from = 3537527e5a992df8-42000-571c48a6
[25572] NOTE at dmtcp_coordinator.cpp:858 in onData; REASON='Updating
process Information after fork()'
client->hostname() = compute-node
client->progname() = hydra_pmi_proxy_(forked)
msg.from = 3537527e5a992df8-47000-571c48a6
client->identity() = 3537527e5a992df8-42000-571c48a6
[25572] NOTE at dmtcp_coordinator.cpp:1079 in onConnect; REASON='worker
connected'
hello_remote.from = 3537527e5a992df8-42000-571c48a6
[25572] NOTE at dmtcp_coordinator.cpp:858 in onData; REASON='Updating
process Information after fork()'
client->hostname() = compute-node
client->progname() = hydra_pmi_proxy_(forked)
msg.from = 3537527e5a992df8-48000-571c48a6
client->identity() = 3537527e5a992df8-42000-571c48a6
[25572] NOTE at dmtcp_coordinator.cpp:1079 in onConnect; REASON='worker
connected'
hello_remote.from = 3537527e5a992df8-42000-571c48a6
[25572] NOTE at dmtcp_coordinator.cpp:858 in onData; REASON='Updating
process Information after fork()'
client->hostname() = compute-node
client->progname() = hydra_pmi_proxy_(forked)
msg.from = 3537527e5a992df8-49000-571c48a6
client->identity() = 3537527e5a992df8-42000-571c48a6
[25572] NOTE at dmtcp_coordinator.cpp:1079 in onConnect; REASON='worker
connected'
hello_remote.from = 3537527e5a992df8-42000-571c48a6
[25572] NOTE at dmtcp_coordinator.cpp:858 in onData; REASON='Updating
process Information after fork()'
client->hostname() = compute-node
client->progname() = hydra_pmi_proxy_(forked)
msg.from = 3537527e5a992df8-50000-571c48a6
client->identity() = 3537527e5a992df8-42000-571c48a6
[25572] NOTE at dmtcp_coordinator.cpp:1079 in onConnect; REASON='worker
connected'
hello_remote.from = 3537527e5a992df8-42000-571c48a6
[25572] NOTE at dmtcp_coordinator.cpp:858 in onData; REASON='Updating
process Information after fork()'
client->hostname() = compute-node
client->progname() = hydra_pmi_proxy_(forked)
msg.from = 3537527e5a992df8-51000-571c48a6
client->identity() = 3537527e5a992df8-42000-571c48a6
[25572] NOTE at dmtcp_coordinator.cpp:1079 in onConnect; REASON='worker
connected'
hello_remote.from = 3537527e5a992df8-42000-571c48a6
[25572] NOTE at dmtcp_coordinator.cpp:858 in onData; REASON='Updating
process Information after fork()'
client->hostname() = compute-node
client->progname() = hydra_pmi_proxy_(forked)
msg.from = 3537527e5a992df8-52000-571c48a6
client->identity() = 3537527e5a992df8-42000-571c48a6
[25572] NOTE at dmtcp_coordinator.cpp:1079 in onConnect; REASON='worker
connected'
hello_remote.from = 3537527e5a992df8-42000-571c48a6
[25572] NOTE at dmtcp_coordinator.cpp:858 in onData; REASON='Updating
process Information after fork()'
client->hostname() = compute-node
client->progname() = hydra_pmi_proxy_(forked)
msg.from = 3537527e5a992df8-53000-571c48a6
client->identity() = 3537527e5a992df8-42000-571c48a6
[25572] NOTE at dmtcp_coordinator.cpp:867 in onData; REASON='Updating
process Information after exec()'
progname = dmtcp_ssh
msg.from = 3537527e5a992df8-41000-571c48a6
client->identity() = 3537527e5a992df8-41000-571c48a6
[25572] NOTE at dmtcp_coordinator.cpp:867 in onData; REASON='Updating
process Information after exec()'
progname = hydra_pmi_proxy
msg.from = 3537527e5a992df8-42000-571c48a6
client->identity() = 3537527e5a992df8-42000-571c48a6
[25572] NOTE at dmtcp_coordinator.cpp:867 in onData; REASON='Updating
process Information after exec()'
progname = dmtcp_ssh
msg.from = 3537527e5a992df8-43000-571c48a6
client->identity() = 3537527e5a992df8-43000-571c48a6
[25572] NOTE at dmtcp_coordinator.cpp:867 in onData; REASON='Updating
process Information after exec()'
progname = mm.o
msg.from = 3537527e5a992df8-46000-571c48a6
client->identity() = 3537527e5a992df8-46000-571c48a6
[25572] NOTE at dmtcp_coordinator.cpp:867 in onData; REASON='Updating
process Information after exec()'
progname = mm.o
msg.from = 3537527e5a992df8-47000-571c48a6
client->identity() = 3537527e5a992df8-47000-571c48a6
[25572] NOTE at dmtcp_coordinator.cpp:867 in onData; REASON='Updating
process Information after exec()'
progname = mm.o
msg.from = 3537527e5a992df8-48000-571c48a6
client->identity() = 3537527e5a992df8-48000-571c48a6
[25572] NOTE at dmtcp_coordinator.cpp:867 in onData; REASON='Updating
process Information after exec()'
progname = mm.o
msg.from = 3537527e5a992df8-49000-571c48a6
client->identity() = 3537527e5a992df8-49000-571c48a6
[25572] NOTE at dmtcp_coordinator.cpp:867 in onData; REASON='Updating
process Information after exec()'
progname = mm.o
msg.from = 3537527e5a992df8-50000-571c48a6
client->identity() = 3537527e5a992df8-50000-571c48a6
[25572] NOTE at dmtcp_coordinator.cpp:867 in onData; REASON='Updating
process Information after exec()'
progname = mm.o
msg.from = 3537527e5a992df8-51000-571c48a6
client->identity() = 3537527e5a992df8-51000-571c48a6
[25572] NOTE at dmtcp_coordinator.cpp:867 in onData; REASON='Updating
process Information after exec()'
progname = mm.o
msg.from = 3537527e5a992df8-52000-571c48a6
client->identity() = 3537527e5a992df8-52000-571c48a6
[25572] NOTE at dmtcp_coordinator.cpp:867 in onData; REASON='Updating
process Information after exec()'
progname = mm.o
msg.from = 3537527e5a992df8-53000-571c48a6
client->identity() = 3537527e5a992df8-53000-571c48a6
[25572] NOTE at dmtcp_coordinator.cpp:917 in onDisconnect; REASON='client
disconnected'
client->identity() = 3537527e5a992df8-41000-571c48a6
client->progname() = dmtcp_ssh
[25572] NOTE at dmtcp_coordinator.cpp:917 in onDisconnect; REASON='client
disconnected'
client->identity() = 3537527e5a992df8-43000-571c48a6
client->progname() = dmtcp_ssh
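
In case it helps when reading the log above, here is a minimal sketch of how the "client disconnected" notes could be pulled out of a saved copy of the coordinator output. It is not part of DMTCP: the function name disconnected_clients is hypothetical, and it assumes the NOTE format is exactly as pasted above (including the line wrapping added by the mail client).

```python
import re

# Pattern for one "client disconnected" note after whitespace has been
# normalised. Assumes the identity line is followed by the progname line,
# as in the log pasted above.
_DISCONNECT = re.compile(
    r"REASON='client disconnected' "
    r"client->identity\(\) = (\S+) "
    r"client->progname\(\) = (\S+)"
)

def disconnected_clients(log_text):
    """Return a list of (identity, progname) pairs, one per disconnect note."""
    # Long lines were wrapped (e.g. "REASON='client" / "disconnected'"),
    # so collapse every run of whitespace into a single space first.
    flat = " ".join(log_text.split())
    return _DISCONNECT.findall(flat)
```

For example, with the log saved to a file (the name coordinator.log is only an illustration), disconnected_clients(open("coordinator.log").read()) returns one pair per disconnect. In the log above there are four such notes in total, and all of them come from dmtcp_ssh or dmtcp_ssh_(forked).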
Regards,
Husen