Hi Jiajun,
Thank you for your reply, and sorry for the late response.
Unfortunately, I'm unable to give you access to the cluster because I'm
not its administrator.
I use MPICH-3.2 as the MPI implementation and Slurm-15.08.10 as the
resource management software.
The interconnect is Fast Ethernet.
I have tried running the job through the resource manager, but that
doesn't work.
Regards,
Husen
On Fri, Apr 29, 2016 at 2:09 AM, Jiajun Cao <jia...@ccs.neu.edu> wrote:
> Hi Husen,
>
> There can be multiple reasons a client disconnects. Is it possible to give
> us access to your cluster? This should be the fastest way to diagnose the
> problem. Also, to have some initial guess, could you please provide the
> following info:
>
> 1. MPI version;
> 2. What resource management software is used;
> 3. What interconnect is used in the cluster.
>
> In principle, when resource management is used, submitting jobs using job
> scripts is recommended. You can find some job examples
> in plugin/batch-queue/job_examples. However, running applications
> interactively is also supported. In your case, if the configuration is
> correct, it may be a bug in DMTCP, and we'll help you fix that.
>
> Also, if InfiniBand is used as the interconnect, you'll need to enable
> the IB plugin of DMTCP by adding the --ib option to dmtcp_launch.
>
>
> Best,
> Jiajun
>
> On Sun, Apr 24, 2016 at 12:34 AM, Husen R <hus...@gmail.com> wrote:
>
>> Dear all,
>>
>> I ran dmtcp_coordinator on the head node (head-node) and then tried to run
>> dmtcp_launch on another node (compute-node) using the following command:
>>
>> dmtcp_launch --coord-host head-node --coord-port 7779 mpirun -np 24
>> -hostfile machines ./mm.o
>>
>> However, the MPI application is not executed. When I look at the
>> dmtcp_coordinator output log, the last two REASONs say "client
>> disconnected".
>> Why is the client disconnected? Any idea how to fix this? Thank you in
>> advance.
>>
>> This is the output of dmtcp_coordinator:
>>
>>
>> [25572] NOTE at dmtcp_coordinator.cpp:1664 in updateCheckpointInterval;
>> REASON='CheckpointInterval updated (for this computation only)'
>> oldInterval = 0
>> theCheckpointInterval = 0
>> [25572] NOTE at dmtcp_coordinator.cpp:1079 in onConnect; REASON='worker
>> connected'
>> hello_remote.from = 3537527e5a992df8-12706-571c48a6
>> [25572] NOTE at dmtcp_coordinator.cpp:867 in onData; REASON='Updating
>> process Information after exec()'
>> progname = mpiexec.hydra
>> msg.from = 3537527e5a992df8-40000-571c48a6
>> client->identity() = 3537527e5a992df8-12706-571c48a6
>> [25572] NOTE at dmtcp_coordinator.cpp:1079 in onConnect; REASON='worker
>> connected'
>> hello_remote.from = 3537527e5a992df8-40000-571c48a6
>> [25572] NOTE at dmtcp_coordinator.cpp:858 in onData; REASON='Updating
>> process Information after fork()'
>> client->hostname() = compute-node
>> client->progname() = mpiexec.hydra_(forked)
>> msg.from = 3537527e5a992df8-41000-571c48a6
>> client->identity() = 3537527e5a992df8-40000-571c48a6
>> [25572] NOTE at dmtcp_coordinator.cpp:1079 in onConnect; REASON='worker
>> connected'
>> hello_remote.from = 3537527e5a992df8-40000-571c48a6
>> [25572] NOTE at dmtcp_coordinator.cpp:1079 in onConnect; REASON='worker
>> connected'
>> hello_remote.from = 3537527e5a992df8-40000-571c48a6
>> [25572] NOTE at dmtcp_coordinator.cpp:858 in onData; REASON='Updating
>> process Information after fork()'
>> client->hostname() = compute-node
>> client->progname() = mpiexec.hydra_(forked)
>> msg.from = 3537527e5a992df8-42000-571c48a6
>> client->identity() = 3537527e5a992df8-40000-571c48a6
>> [25572] NOTE at dmtcp_coordinator.cpp:858 in onData; REASON='Updating
>> process Information after fork()'
>> client->hostname() = compute-node
>> client->progname() = mpiexec.hydra_(forked)
>> msg.from = 3537527e5a992df8-43000-571c48a6
>> client->identity() = 3537527e5a992df8-40000-571c48a6
>> [25572] NOTE at dmtcp_coordinator.cpp:1079 in onConnect; REASON='worker
>> connected'
>> hello_remote.from = 3537527e5a992df8-41000-571c48a6
>> [25572] NOTE at dmtcp_coordinator.cpp:858 in onData; REASON='Updating
>> process Information after fork()'
>> client->hostname() = compute-node
>> client->progname() = dmtcp_ssh_(forked)
>> msg.from = 3537527e5a992df8-44000-571c48a6
>> client->identity() = 3537527e5a992df8-41000-571c48a6
>> [25572] NOTE at dmtcp_coordinator.cpp:1079 in onConnect; REASON='worker
>> connected'
>> hello_remote.from = 3537527e5a992df8-43000-571c48a6
>> [25572] NOTE at dmtcp_coordinator.cpp:917 in onDisconnect; REASON='client
>> disconnected'
>> client->identity() = 3537527e5a992df8-44000-571c48a6
>> client->progname() = dmtcp_ssh_(forked)
>> [25572] NOTE at dmtcp_coordinator.cpp:858 in onData; REASON='Updating
>> process Information after fork()'
>> client->hostname() = compute-node
>> client->progname() = dmtcp_ssh_(forked)
>> msg.from = 3537527e5a992df8-45000-571c48a6
>> client->identity() = 3537527e5a992df8-43000-571c48a6
>> [25572] NOTE at dmtcp_coordinator.cpp:917 in onDisconnect; REASON='client
>> disconnected'
>> client->identity() = 3537527e5a992df8-45000-571c48a6
>> client->progname() = dmtcp_ssh_(forked)
>> [25572] NOTE at dmtcp_coordinator.cpp:1079 in onConnect; REASON='worker
>> connected'
>> hello_remote.from = 3537527e5a992df8-42000-571c48a6
>> [25572] NOTE at dmtcp_coordinator.cpp:858 in onData; REASON='Updating
>> process Information after fork()'
>> client->hostname() = compute-node
>> client->progname() = hydra_pmi_proxy_(forked)
>> msg.from = 3537527e5a992df8-46000-571c48a6
>> client->identity() = 3537527e5a992df8-42000-571c48a6
>> [25572] NOTE at dmtcp_coordinator.cpp:1079 in onConnect; REASON='worker
>> connected'
>> hello_remote.from = 3537527e5a992df8-42000-571c48a6
>> [25572] NOTE at dmtcp_coordinator.cpp:858 in onData; REASON='Updating
>> process Information after fork()'
>> client->hostname() = compute-node
>> client->progname() = hydra_pmi_proxy_(forked)
>> msg.from = 3537527e5a992df8-47000-571c48a6
>> client->identity() = 3537527e5a992df8-42000-571c48a6
>> [25572] NOTE at dmtcp_coordinator.cpp:1079 in onConnect; REASON='worker
>> connected'
>> hello_remote.from = 3537527e5a992df8-42000-571c48a6
>> [25572] NOTE at dmtcp_coordinator.cpp:858 in onData; REASON='Updating
>> process Information after fork()'
>> client->hostname() = compute-node
>> client->progname() = hydra_pmi_proxy_(forked)
>> msg.from = 3537527e5a992df8-48000-571c48a6
>> client->identity() = 3537527e5a992df8-42000-571c48a6
>> [25572] NOTE at dmtcp_coordinator.cpp:1079 in onConnect; REASON='worker
>> connected'
>> hello_remote.from = 3537527e5a992df8-42000-571c48a6
>> [25572] NOTE at dmtcp_coordinator.cpp:858 in onData; REASON='Updating
>> process Information after fork()'
>> client->hostname() = compute-node
>> client->progname() = hydra_pmi_proxy_(forked)
>> msg.from = 3537527e5a992df8-49000-571c48a6
>> client->identity() = 3537527e5a992df8-42000-571c48a6
>> [25572] NOTE at dmtcp_coordinator.cpp:1079 in onConnect; REASON='worker
>> connected'
>> hello_remote.from = 3537527e5a992df8-42000-571c48a6
>> [25572] NOTE at dmtcp_coordinator.cpp:858 in onData; REASON='Updating
>> process Information after fork()'
>> client->hostname() = compute-node
>> client->progname() = hydra_pmi_proxy_(forked)
>> msg.from = 3537527e5a992df8-50000-571c48a6
>> client->identity() = 3537527e5a992df8-42000-571c48a6
>> [25572] NOTE at dmtcp_coordinator.cpp:1079 in onConnect; REASON='worker
>> connected'
>> hello_remote.from = 3537527e5a992df8-42000-571c48a6
>> [25572] NOTE at dmtcp_coordinator.cpp:858 in onData; REASON='Updating
>> process Information after fork()'
>> client->hostname() = compute-node
>> client->progname() = hydra_pmi_proxy_(forked)
>> msg.from = 3537527e5a992df8-51000-571c48a6
>> client->identity() = 3537527e5a992df8-42000-571c48a6
>> [25572] NOTE at dmtcp_coordinator.cpp:1079 in onConnect; REASON='worker
>> connected'
>> hello_remote.from = 3537527e5a992df8-42000-571c48a6
>> [25572] NOTE at dmtcp_coordinator.cpp:858 in onData; REASON='Updating
>> process Information after fork()'
>> client->hostname() = compute-node
>> client->progname() = hydra_pmi_proxy_(forked)
>> msg.from = 3537527e5a992df8-52000-571c48a6
>> client->identity() = 3537527e5a992df8-42000-571c48a6
>> [25572] NOTE at dmtcp_coordinator.cpp:1079 in onConnect; REASON='worker
>> connected'
>> hello_remote.from = 3537527e5a992df8-42000-571c48a6
>> [25572] NOTE at dmtcp_coordinator.cpp:858 in onData; REASON='Updating
>> process Information after fork()'
>> client->hostname() = compute-node
>> client->progname() = hydra_pmi_proxy_(forked)
>> msg.from = 3537527e5a992df8-53000-571c48a6
>> client->identity() = 3537527e5a992df8-42000-571c48a6
>> [25572] NOTE at dmtcp_coordinator.cpp:867 in onData; REASON='Updating
>> process Information after exec()'
>> progname = dmtcp_ssh
>> msg.from = 3537527e5a992df8-41000-571c48a6
>> client->identity() = 3537527e5a992df8-41000-571c48a6
>> [25572] NOTE at dmtcp_coordinator.cpp:867 in onData; REASON='Updating
>> process Information after exec()'
>> progname = hydra_pmi_proxy
>> msg.from = 3537527e5a992df8-42000-571c48a6
>> client->identity() = 3537527e5a992df8-42000-571c48a6
>> [25572] NOTE at dmtcp_coordinator.cpp:867 in onData; REASON='Updating
>> process Information after exec()'
>> progname = dmtcp_ssh
>> msg.from = 3537527e5a992df8-43000-571c48a6
>> client->identity() = 3537527e5a992df8-43000-571c48a6
>> [25572] NOTE at dmtcp_coordinator.cpp:867 in onData; REASON='Updating
>> process Information after exec()'
>> progname = mm.o
>> msg.from = 3537527e5a992df8-46000-571c48a6
>> client->identity() = 3537527e5a992df8-46000-571c48a6
>> [25572] NOTE at dmtcp_coordinator.cpp:867 in onData; REASON='Updating
>> process Information after exec()'
>> progname = mm.o
>> msg.from = 3537527e5a992df8-47000-571c48a6
>> client->identity() = 3537527e5a992df8-47000-571c48a6
>> [25572] NOTE at dmtcp_coordinator.cpp:867 in onData; REASON='Updating
>> process Information after exec()'
>> progname = mm.o
>> msg.from = 3537527e5a992df8-48000-571c48a6
>> client->identity() = 3537527e5a992df8-48000-571c48a6
>> [25572] NOTE at dmtcp_coordinator.cpp:867 in onData; REASON='Updating
>> process Information after exec()'
>> progname = mm.o
>> msg.from = 3537527e5a992df8-49000-571c48a6
>> client->identity() = 3537527e5a992df8-49000-571c48a6
>> [25572] NOTE at dmtcp_coordinator.cpp:867 in onData; REASON='Updating
>> process Information after exec()'
>> progname = mm.o
>> msg.from = 3537527e5a992df8-50000-571c48a6
>> client->identity() = 3537527e5a992df8-50000-571c48a6
>> [25572] NOTE at dmtcp_coordinator.cpp:867 in onData; REASON='Updating
>> process Information after exec()'
>> progname = mm.o
>> msg.from = 3537527e5a992df8-51000-571c48a6
>> client->identity() = 3537527e5a992df8-51000-571c48a6
>> [25572] NOTE at dmtcp_coordinator.cpp:867 in onData; REASON='Updating
>> process Information after exec()'
>> progname = mm.o
>> msg.from = 3537527e5a992df8-52000-571c48a6
>> client->identity() = 3537527e5a992df8-52000-571c48a6
>> [25572] NOTE at dmtcp_coordinator.cpp:867 in onData; REASON='Updating
>> process Information after exec()'
>> progname = mm.o
>> msg.from = 3537527e5a992df8-53000-571c48a6
>> client->identity() = 3537527e5a992df8-53000-571c48a6
>> [25572] NOTE at dmtcp_coordinator.cpp:917 in onDisconnect; REASON='client
>> disconnected'
>> client->identity() = 3537527e5a992df8-41000-571c48a6
>> client->progname() = dmtcp_ssh
>> [25572] NOTE at dmtcp_coordinator.cpp:917 in onDisconnect; REASON='client
>> disconnected'
>> client->identity() = 3537527e5a992df8-43000-571c48a6
>> client->progname() = dmtcp_ssh
>>
>>
>> Regards,
>>
>>
>> Husen
>>
>>
>>
>> ------------------------------------------------------------------------------
>> Find and fix application performance issues faster with Applications
>> Manager
>> Applications Manager provides deep performance insights into multiple
>> tiers of
>> your business applications. It resolves application problems quickly and
>> reduces your MTTR. Get your free trial!
>> https://ad.doubleclick.net/ddm/clk/302982198;130105516;z
>> _______________________________________________
>> Dmtcp-forum mailing list
>> Dmtcp-forum@lists.sourceforge.net
>> https://lists.sourceforge.net/lists/listinfo/dmtcp-forum
>>
>>
>
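
A note on reading the coordinator log quoted above: the "client disconnected" notes can be picked out mechanically rather than by eye. The following is a minimal Python sketch and is not part of DMTCP. The function name disconnected_clients is hypothetical, and the code assumes only the log layout visible in this thread: each note contains " NOTE at ", disconnect notes mention "in onDisconnect;", and lines may carry email quote markers and be wrapped by the mail client.

```python
import re


def disconnected_clients(log_text):
    """Return (progname, identity) pairs, one per 'client disconnected' note."""
    # Drop leading email quote markers such as "> " or ">> " from every line.
    lines = [re.sub(r"^(?:>\s?)+", "", ln).strip() for ln in log_text.splitlines()]
    # Rejoin into one string, since the mail client wrapped notes across lines.
    text = " ".join(lines)
    result = []
    # Each chunk after the split holds the body of one coordinator note.
    for note in text.split(" NOTE at "):
        if "in onDisconnect;" not in note:
            continue
        ident = re.search(r"client->identity\(\) = (\S+)", note)
        prog = re.search(r"client->progname\(\) = (\S+)", note)
        if ident and prog:
            result.append((prog.group(1), ident.group(1)))
    return result


# Sample taken from the log quoted in this thread.
sample = """\
>> [25572] NOTE at dmtcp_coordinator.cpp:917 in onDisconnect; REASON='client
>> disconnected'
>> client->identity() = 3537527e5a992df8-41000-571c48a6
>> client->progname() = dmtcp_ssh
>> [25572] NOTE at dmtcp_coordinator.cpp:1079 in onConnect; REASON='worker
>> connected'
>> hello_remote.from = 3537527e5a992df8-42000-571c48a6
"""

for progname, identity in disconnected_clients(sample):
    print(progname, identity)
```

Run against the full log in this thread, the same function would list every onDisconnect note together with the program name the coordinator recorded for that client, which makes it easier to see which processes dropped their connection and which stayed connected.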