Hi Rohan,

Sorry to burden you with this, but with Jiajun and Artem on trips, you're our main expert right now. Could you answer this one? Manuel is running MPICH. He's using DMTCP 2.4.0-rc4, which should be reasonably up to date. Three processes on one node. The checkpoint succeeds, but on restart he gets the following error:
> [20427] NOTE at dmtcp_coordinator.cpp:1143 in validateRestartingWorkerProcess;
>   REASON='Computation not in RESTARTING or CHECKPOINTED state. Reject incoming
>   computation process requesting restart.'
>   compId = 6db90f3d5a9dd200-40000-55536cf2
>   hello_remote.compGroup = 6db90f3d5a9dd200-40000-55536cf2
>   minimumState() = WorkerState::RUNNING

Rohan,

Even if it turns out that there's no bug and that Manuel was not using the commands correctly, I'd like to consider this a "documentation bug". Our documentation for DMTCP/MPI has changed in recent months, and it's still not properly polished. Could you also keep in mind where our documentation for running DMTCP with MPI is lacking, and then put up a pull request to improve it?

Thanks,
- Gene

On Thu, May 14, 2015 at 01:02:42PM +0200, Manuel Rodríguez Pascual wrote:
> Hi all,
>
> I am a newbie in DMTCP. I am trying to checkpoint my MPI application but
> still haven't been able to. I am pretty sure that I'm doing something
> obvious wrong, but cannot find the problem on my own.
>
> Some info:
>
> - I am employing MPICH 3.1.4, and my DMTCP version is 2.4.0-rc4.
> - MPICH is working OK. DMTCP is working OK for serial applications too.
> - My code is a simple hello world.
> - At this moment I am running 3 MPI tasks on the same node, to simplify
>   things.
> - The coordinator is started with "dmtcp_coordinator", and my application
>   with "dmtcp_launch mpirun -n 3 /home/slurm/helloWorldMPI". Both are run
>   as user "slurm".
> - Passwordless SSH is enabled. Tested with:
>
>       ssh slurm-master which dmtcp_launch
>       /usr/local/bin/dmtcp_launch
>
> I don't know where the error can be, so I have included quite a lot of
> information below to help.
>
> Thanks for your help,
>
> Manuel
>
> ---
> ---
> COORDINATOR OUTPUT
>
> [root@slurm-master tmp]# dmtcp_coordinator
> dmtcp_coordinator starting...
>     Host: slurm-master (192.168.1.10)
>     Port: 7779
>     Checkpoint Interval: disabled (checkpoint manually instead)
>     Exit on last client: 0
> Type '?' for help.
>
> [20427] NOTE at dmtcp_coordinator.cpp:1079 in onConnect; REASON='worker connected'
>   hello_remote.from = 6db90f3d5a9dd200-20448-55536cf2
> [20427] NOTE at dmtcp_coordinator.cpp:864 in onData; REASON='Updating process Information after exec()'
>   progname = mpiexec.hydra
>   msg.from = 6db90f3d5a9dd200-40000-55536cf3
>   client->identity() = 6db90f3d5a9dd200-20448-55536cf2
> [20427] NOTE at dmtcp_coordinator.cpp:1079 in onConnect; REASON='worker connected'
>   hello_remote.from = 6db90f3d5a9dd200-40000-55536cf3
> [20427] NOTE at dmtcp_coordinator.cpp:855 in onData; REASON='Updating process Information after fork()'
>   client->hostname() = slurm-master
>   client->progname() = mpiexec.hydra_(forked)
>   msg.from = 6db90f3d5a9dd200-41000-55536cf3
>   client->identity() = 6db90f3d5a9dd200-40000-55536cf3
> [20427] NOTE at dmtcp_coordinator.cpp:864 in onData; REASON='Updating process Information after exec()'
>   progname = hydra_pmi_proxy
>   msg.from = 6db90f3d5a9dd200-41000-55536cf3
>   client->identity() = 6db90f3d5a9dd200-41000-55536cf3
> [20427] NOTE at dmtcp_coordinator.cpp:1079 in onConnect; REASON='worker connected'
>   hello_remote.from = 6db90f3d5a9dd200-41000-55536cf3
> [20427] NOTE at dmtcp_coordinator.cpp:1079 in onConnect; REASON='worker connected'
>   hello_remote.from = 6db90f3d5a9dd200-41000-55536cf3
> [20427] NOTE at dmtcp_coordinator.cpp:855 in onData; REASON='Updating process Information after fork()'
>   client->hostname() = slurm-master
>   client->progname() = hydra_pmi_proxy_(forked)
>   msg.from = 6db90f3d5a9dd200-42000-55536cf4
>   client->identity() = 6db90f3d5a9dd200-41000-55536cf3
> [20427] NOTE at dmtcp_coordinator.cpp:1079 in onConnect; REASON='worker connected'
>   hello_remote.from = 6db90f3d5a9dd200-41000-55536cf3
> [20427] NOTE at dmtcp_coordinator.cpp:855 in onData; REASON='Updating process Information after fork()'
>   client->hostname() = slurm-master
>   client->progname() = hydra_pmi_proxy_(forked)
>   msg.from = 6db90f3d5a9dd200-43000-55536cf4
>   client->identity() = 6db90f3d5a9dd200-41000-55536cf3
> [20427] NOTE at dmtcp_coordinator.cpp:855 in onData; REASON='Updating process Information after fork()'
>   client->hostname() = slurm-master
>   client->progname() = hydra_pmi_proxy_(forked)
>   msg.from = 6db90f3d5a9dd200-44000-55536cf4
>   client->identity() = 6db90f3d5a9dd200-41000-55536cf3
> [20427] NOTE at dmtcp_coordinator.cpp:864 in onData; REASON='Updating process Information after exec()'
>   progname = helloWorldMPI
>   msg.from = 6db90f3d5a9dd200-42000-55536cf4
>   client->identity() = 6db90f3d5a9dd200-42000-55536cf4
> [20427] NOTE at dmtcp_coordinator.cpp:864 in onData; REASON='Updating process Information after exec()'
>   progname = helloWorldMPI
>   msg.from = 6db90f3d5a9dd200-44000-55536cf4
>   client->identity() = 6db90f3d5a9dd200-44000-55536cf4
> [20427] NOTE at dmtcp_coordinator.cpp:864 in onData; REASON='Updating process Information after exec()'
>   progname = helloWorldMPI
>   msg.from = 6db90f3d5a9dd200-43000-55536cf4
>   client->identity() = 6db90f3d5a9dd200-43000-55536cf4
>
> ---
> ---
>
> I press letter 'c' to checkpoint.
>
> ---
> ---
>
> [20427] NOTE at dmtcp_coordinator.cpp:1291 in startCheckpoint; REASON='starting checkpoint, suspending all nodes'
>   s.numPeers = 5
> [20427] NOTE at dmtcp_coordinator.cpp:1293 in startCheckpoint; REASON='Incremented Generation'
>   compId.generation() = 1
> [20427] NOTE at dmtcp_coordinator.cpp:654 in updateMinimumState; REASON='locking all nodes'
> [20427] NOTE at dmtcp_coordinator.cpp:660 in updateMinimumState; REASON='draining all nodes'
> [20427] NOTE at dmtcp_coordinator.cpp:666 in updateMinimumState; REASON='checkpointing all nodes'
> [20427] NOTE at dmtcp_coordinator.cpp:680 in updateMinimumState; REASON='building name service database'
> [20427] NOTE at dmtcp_coordinator.cpp:696 in updateMinimumState; REASON='entertaining queries now'
> [20427] NOTE at dmtcp_coordinator.cpp:701 in updateMinimumState; REASON='refilling all nodes'
> [20427] NOTE at dmtcp_coordinator.cpp:732 in updateMinimumState; REASON='restarting all nodes'
>
> ---
> ---
>
> I cancel the running application with CTRL+C.
>
> ---
> ---
>
> COORDINATOR
>
> [20427] NOTE at dmtcp_coordinator.cpp:914 in onDisconnect; REASON='client disconnected'
>   client->identity() = 6db90f3d5a9dd200-41000-55536cf3
> [20427] NOTE at dmtcp_coordinator.cpp:914 in onDisconnect; REASON='client disconnected'
>   client->identity() = 6db90f3d5a9dd200-40000-55536cf3
>
> APPLICATION
>
> [mpiexec@slurm-master] Sending Ctrl-C to processes as requested
> [mpiexec@slurm-master] Press Ctrl-C again to force abort
>
> Ctrl-C caught... cleaning up processes
> [proxy:0:0@slurm-master] HYD_pmcd_pmip_control_cmd_cb (/root/mpich-3.1.4/src/pm/hydra/pm/pmiserv/pmip_cb.c:885): assert (!closed) failed
> [proxy:0:0@slurm-master] HYDT_dmxu_poll_wait_for_event (/root/mpich-3.1.4/src/pm/hydra/tools/demux/demux_poll.c:76): callback returned error status
> [proxy:0:0@slurm-master] main (/root/mpich-3.1.4/src/pm/hydra/pm/pmiserv/pmip.c:206): demux engine error waiting for event
> ^C[mpiexec@slurm-master] HYDT_bscu_wait_for_completion (/root/mpich-3.1.4/src/pm/hydra/tools/bootstrap/utils/bscu_wait.c:76): one of the processes terminated badly; aborting
> [mpiexec@slurm-master] HYDT_bsci_wait_for_completion (/root/mpich-3.1.4/src/pm/hydra/tools/bootstrap/src/bsci_wait.c:23): launcher returned error waiting for completion
> [mpiexec@slurm-master] HYD_pmci_wait_for_completion (/root/mpich-3.1.4/src/pm/hydra/pm/pmiserv/pmiserv_pmci.c:218): launcher returned error waiting for completion
> [mpiexec@slurm-master] main (/root/mpich-3.1.4/src/pm/hydra/ui/mpich/mpiexec.c:344): process manager error waiting for completion
>
> ---
> ---
>
> Everything until now seems to be OK.
> I get the same output with serial applications.
>
> Then I try to restart:
>
> ---
> ---
>
> APPLICATION
>
> [slurm@slurm-master ~]$ sh /tmp/dmtcp_restart_script.sh
> [21164] ERROR at coordinatorapi.cpp:514 in sendRecvHandshake; REASON='JASSERT(msg.type == DMT_ACCEPT) failed'
> dmtcp_restart (21164): Terminating...
>
> COORDINATOR
>
> [20427] NOTE at dmtcp_coordinator.cpp:1143 in validateRestartingWorkerProcess;
>   REASON='Computation not in RESTARTING or CHECKPOINTED state. Reject incoming
>   computation process requesting restart.'
>   compId = 6db90f3d5a9dd200-40000-55536cf2
>   hello_remote.compGroup = 6db90f3d5a9dd200-40000-55536cf2
>   minimumState() = WorkerState::RUNNING
>
> ---
> ---
>
> --
> Dr. Manuel Rodríguez-Pascual
> skype: manuel.rodriguez.pascual
> phone: (+34) 913466173 // (+34) 679925108
>
> CIEMAT-Moncloa
> Edificio 22, desp. 1.25
> Avenida Complutense, 40
> 28040- MADRID
> SPAIN
>
> _______________________________________________
> Dmtcp-forum mailing list
> Dmtcp-forum@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/dmtcp-forum
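For the documentation PR: the end-to-end workflow Manuel is attempting can be sketched as below. This is only a sketch, not a verified recipe: the port number and checkpoint-directory layout are illustrative, and the exact option spellings should be double-checked against `--help` for the DMTCP version being documented.

```shell
# Terminal 1: start the coordinator (port 7779 is DMTCP's default).
dmtcp_coordinator --port 7779

# Terminal 2: launch the whole MPI job under DMTCP. Wrapping mpirun means
# mpiexec.hydra, the hydra_pmi_proxy helper, and all three MPI ranks
# register with the coordinator -- the five peers seen in the log above.
dmtcp_launch --coord-port 7779 mpirun -n 3 /home/slurm/helloWorldMPI

# Checkpoint from a third terminal (equivalent to typing 'c' in the
# coordinator's window).
dmtcp_command --coord-port 7779 --checkpoint

# Kill the computation through the coordinator (equivalent to typing 'k')
# rather than sending Ctrl-C to mpirun, so every client disconnects cleanly.
dmtcp_command --coord-port 7779 --kill

# Restart from the script written next to the checkpoint images.
sh ./dmtcp_restart_script.sh
```

Using `dmtcp_command --kill` instead of Ctrl-C is worth calling out in the docs, since an interrupted mpirun can leave the coordinator's view of the computation in an inconsistent state.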
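One likely reading of the failure, which the docs could address directly: `minimumState() = WorkerState::RUNNING` means the coordinator on port 7779 still counts clients of the old computation as running, so it refuses the handshake from dmtcp_restart. A workaround worth documenting is restarting against a fresh coordinator. A sketch (the port is arbitrary, and the option and environment-variable names should be confirmed against `dmtcp_restart --help` for this version):

```shell
# Start a clean coordinator on an unused port; it has no computation in
# RUNNING state, so it can accept the restarting processes.
dmtcp_coordinator --port 7780 --daemon

# Point the restart at the new coordinator via DMTCP's environment
# variables, which dmtcp_restart consults.
DMTCP_COORD_HOST=slurm-master DMTCP_COORD_PORT=7780 \
    sh /tmp/dmtcp_restart_script.sh

# Equivalently, run dmtcp_restart directly on the checkpoint images:
# dmtcp_restart --coord-host slurm-master --coord-port 7780 ckpt_*.dmtcp
```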