Hi all,
I am new to DMTCP. I am trying to checkpoint my MPI application, but so far I
have not been able to. I am probably doing something obvious wrong, but I
cannot find the problem on my own.
Some info:
- I am using MPICH 3.1.4 and DMTCP 2.4.0-rc4.
- MPICH works fine on its own, and DMTCP works fine for serial applications.
- My code is a simple MPI hello world (sketched right after this list).
- For now I am running 3 MPI tasks on a single node, to keep things simple.
- The coordinator is started with "dmtcp_coordinator", and my application
with "dmtcp_launch mpirun -n 3 /home/slurm/helloWorldMPI". Both are run as
user "slurm" (the full sequence is condensed into a transcript below).
- Passwordless SSH is enabled. Tested with:
    $ ssh slurm-master which dmtcp_launch
    /usr/local/bin/dmtcp_launch
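For completeness, helloWorldMPI is essentially the following (re-typed from
memory, so treat it as a sketch; the sleep just keeps the job alive long
enough to request a checkpoint). It is built with
"mpicc -o helloWorldMPI helloWorldMPI.c":

    #include <mpi.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(int argc, char **argv)
    {
        int rank, size, len;
        char name[MPI_MAX_PROCESSOR_NAME];

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);
        MPI_Get_processor_name(name, &len);

        /* each rank reports once */
        printf("Hello from rank %d of %d on %s\n", rank, size, name);
        fflush(stdout);

        /* keep the job alive long enough to checkpoint it */
        sleep(60);

        MPI_Finalize();
        return 0;
    }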
I don't know where the error might be, so I have included quite a lot of
information below to help.
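To make the steps concrete, here is the whole sequence condensed into a
single transcript (two terminals on slurm-master):

    # terminal 1: start the coordinator (defaults to port 7779)
    $ dmtcp_coordinator

    # terminal 2: launch the MPI job under DMTCP control
    $ dmtcp_launch mpirun -n 3 /home/slurm/helloWorldMPI

    # terminal 1: press 'c' to request a checkpoint
    # terminal 2: Ctrl+C to kill the running job

    # then restart from the generated script
    $ sh /tmp/dmtcp_restart_script.sh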
Thanks for your help,
Manuel
---
---
COORDINATOR OUTPUT
[root@slurm-master tmp]# dmtcp_coordinator
dmtcp_coordinator starting...
Host: slurm-master (192.168.1.10)
Port: 7779
Checkpoint Interval: disabled (checkpoint manually instead)
Exit on last client: 0
Type '?' for help.
[20427] NOTE at dmtcp_coordinator.cpp:1079 in onConnect; REASON='worker
connected'
hello_remote.from = 6db90f3d5a9dd200-20448-55536cf2
[20427] NOTE at dmtcp_coordinator.cpp:864 in onData; REASON='Updating
process Information after exec()'
progname = mpiexec.hydra
msg.from = 6db90f3d5a9dd200-40000-55536cf3
client->identity() = 6db90f3d5a9dd200-20448-55536cf2
[20427] NOTE at dmtcp_coordinator.cpp:1079 in onConnect; REASON='worker
connected'
hello_remote.from = 6db90f3d5a9dd200-40000-55536cf3
[20427] NOTE at dmtcp_coordinator.cpp:855 in onData; REASON='Updating
process Information after fork()'
client->hostname() = slurm-master
client->progname() = mpiexec.hydra_(forked)
msg.from = 6db90f3d5a9dd200-41000-55536cf3
client->identity() = 6db90f3d5a9dd200-40000-55536cf3
[20427] NOTE at dmtcp_coordinator.cpp:864 in onData; REASON='Updating
process Information after exec()'
progname = hydra_pmi_proxy
msg.from = 6db90f3d5a9dd200-41000-55536cf3
client->identity() = 6db90f3d5a9dd200-41000-55536cf3
[20427] NOTE at dmtcp_coordinator.cpp:1079 in onConnect; REASON='worker
connected'
hello_remote.from = 6db90f3d5a9dd200-41000-55536cf3
[20427] NOTE at dmtcp_coordinator.cpp:1079 in onConnect; REASON='worker
connected'
hello_remote.from = 6db90f3d5a9dd200-41000-55536cf3
[20427] NOTE at dmtcp_coordinator.cpp:855 in onData; REASON='Updating
process Information after fork()'
client->hostname() = slurm-master
client->progname() = hydra_pmi_proxy_(forked)
msg.from = 6db90f3d5a9dd200-42000-55536cf4
client->identity() = 6db90f3d5a9dd200-41000-55536cf3
[20427] NOTE at dmtcp_coordinator.cpp:1079 in onConnect; REASON='worker
connected'
hello_remote.from = 6db90f3d5a9dd200-41000-55536cf3
[20427] NOTE at dmtcp_coordinator.cpp:855 in onData; REASON='Updating
process Information after fork()'
client->hostname() = slurm-master
client->progname() = hydra_pmi_proxy_(forked)
msg.from = 6db90f3d5a9dd200-43000-55536cf4
client->identity() = 6db90f3d5a9dd200-41000-55536cf3
[20427] NOTE at dmtcp_coordinator.cpp:855 in onData; REASON='Updating
process Information after fork()'
client->hostname() = slurm-master
client->progname() = hydra_pmi_proxy_(forked)
msg.from = 6db90f3d5a9dd200-44000-55536cf4
client->identity() = 6db90f3d5a9dd200-41000-55536cf3
[20427] NOTE at dmtcp_coordinator.cpp:864 in onData; REASON='Updating
process Information after exec()'
progname = helloWorldMPI
msg.from = 6db90f3d5a9dd200-42000-55536cf4
client->identity() = 6db90f3d5a9dd200-42000-55536cf4
[20427] NOTE at dmtcp_coordinator.cpp:864 in onData; REASON='Updating
process Information after exec()'
progname = helloWorldMPI
msg.from = 6db90f3d5a9dd200-44000-55536cf4
client->identity() = 6db90f3d5a9dd200-44000-55536cf4
[20427] NOTE at dmtcp_coordinator.cpp:864 in onData; REASON='Updating
process Information after exec()'
progname = helloWorldMPI
msg.from = 6db90f3d5a9dd200-43000-55536cf4
client->identity() = 6db90f3d5a9dd200-43000-55536cf4
---
---
I press 'c' in the coordinator console to request a checkpoint
---
---
[20427] NOTE at dmtcp_coordinator.cpp:1291 in startCheckpoint;
REASON='starting checkpoint, suspending all nodes'
s.numPeers = 5
[20427] NOTE at dmtcp_coordinator.cpp:1293 in startCheckpoint;
REASON='Incremented Generation'
compId.generation() = 1
[20427] NOTE at dmtcp_coordinator.cpp:654 in updateMinimumState;
REASON='locking all nodes'
[20427] NOTE at dmtcp_coordinator.cpp:660 in updateMinimumState;
REASON='draining all nodes'
[20427] NOTE at dmtcp_coordinator.cpp:666 in updateMinimumState;
REASON='checkpointing all nodes'
[20427] NOTE at dmtcp_coordinator.cpp:680 in updateMinimumState;
REASON='building name service database'
[20427] NOTE at dmtcp_coordinator.cpp:696 in updateMinimumState;
REASON='entertaining queries now'
[20427] NOTE at dmtcp_coordinator.cpp:701 in updateMinimumState;
REASON='refilling all nodes'
[20427] NOTE at dmtcp_coordinator.cpp:732 in updateMinimumState;
REASON='restarting all nodes'
---
---
Then I cancel the running application with Ctrl+C
---
---
COORDINATOR
[20427] NOTE at dmtcp_coordinator.cpp:914 in onDisconnect; REASON='client
disconnected'
client->identity() = 6db90f3d5a9dd200-41000-55536cf3
[20427] NOTE at dmtcp_coordinator.cpp:914 in onDisconnect; REASON='client
disconnected'
client->identity() = 6db90f3d5a9dd200-40000-55536cf3
APPLICATION
[mpiexec@slurm-master] Sending Ctrl-C to processes as requested
[mpiexec@slurm-master] Press Ctrl-C again to force abort
Ctrl-C caught... cleaning up processes
[proxy:0:0@slurm-master] HYD_pmcd_pmip_control_cmd_cb
(/root/mpich-3.1.4/src/pm/hydra/pm/pmiserv/pmip_cb.c:885): assert (!closed)
failed
[proxy:0:0@slurm-master] HYDT_dmxu_poll_wait_for_event
(/root/mpich-3.1.4/src/pm/hydra/tools/demux/demux_poll.c:76): callback
returned error status
[proxy:0:0@slurm-master] main
(/root/mpich-3.1.4/src/pm/hydra/pm/pmiserv/pmip.c:206): demux engine error
waiting for event
^C[mpiexec@slurm-master] HYDT_bscu_wait_for_completion
(/root/mpich-3.1.4/src/pm/hydra/tools/bootstrap/utils/bscu_wait.c:76): one
of the processes terminated badly; aborting
[mpiexec@slurm-master] HYDT_bsci_wait_for_completion
(/root/mpich-3.1.4/src/pm/hydra/tools/bootstrap/src/bsci_wait.c:23):
launcher returned error waiting for completion
[mpiexec@slurm-master] HYD_pmci_wait_for_completion
(/root/mpich-3.1.4/src/pm/hydra/pm/pmiserv/pmiserv_pmci.c:218): launcher
returned error waiting for completion
[mpiexec@slurm-master] main
(/root/mpich-3.1.4/src/pm/hydra/ui/mpich/mpiexec.c:344): process manager
error waiting for completion
---
---
Everything up to this point looks OK; I get the same kind of output with
serial applications.
Then I try to restart:
---
---
APPLICATION
[slurm@slurm-master ~]$ sh /tmp/dmtcp_restart_script.sh
[21164] ERROR at coordinatorapi.cpp:514 in sendRecvHandshake;
REASON='JASSERT(msg.type == DMT_ACCEPT) failed'
dmtcp_restart (21164): Terminating...
COORDINATOR
[20427] NOTE at dmtcp_coordinator.cpp:1143 in
validateRestartingWorkerProcess; REASON='Computation not in RESTARTING or
CHECKPOINTED state. Reject incoming computation process requesting
restart.'
compId = 6db90f3d5a9dd200-40000-55536cf2
hello_remote.compGroup = 6db90f3d5a9dd200-40000-55536cf2
minimumState() = WorkerState::RUNNING
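If I am reading that last coordinator message correctly, the coordinator
still considers the original computation to be RUNNING, and therefore rejects
the restarting processes. Should the restart be directed at a fresh
coordinator instead? I was thinking of something like the following (assuming
the restart script honors the DMTCP_COORD_HOST/DMTCP_COORD_PORT environment
variables mentioned in its usage text; please correct me if that is not the
right mechanism):

    # start a second coordinator on another port
    $ dmtcp_coordinator --port 7780

    # point the restart script at the new coordinator
    $ DMTCP_COORD_HOST=slurm-master DMTCP_COORD_PORT=7780 \
          sh /tmp/dmtcp_restart_script.sh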
---
---
--
Dr. Manuel Rodríguez-Pascual
skype: manuel.rodriguez.pascual
phone: (+34) 913466173 // (+34) 679925108
CIEMAT-Moncloa
Edificio 22, desp. 1.25
Avenida Complutense, 40
28040- MADRID
SPAIN