Thank you! A VM image sounds great! I'm assuming it'll have MPICH pre-configured.
----- Original Message -----
From: "Manuel Rodríguez Pascual" <manuel.rodriguez.pasc...@gmail.com>
To: "Rohan Garg" <rohg...@ccs.neu.edu>
Cc: "dmtcp-forum" <dmtcp-forum@lists.sourceforge.net>
Sent: Friday, May 15, 2015 11:49:36 AM
Subject: Re: checkpointing MPI applications

Hi,

My environment is not fixed at all, so if you have any suggestions I have no
problem changing it :) My only requirement is Slurm; the MPI and checkpoint
libraries can be changed freely. Or, if you prefer, I can keep this setup and
help you with the debugging process. Everything is virtualized, so I can send
you the image if you want to use it for your tests.

Thanks for your help,

Manuel

On Friday, May 15, 2015, Rohan Garg <rohg...@ccs.neu.edu> wrote:

> Hi Manuel,
>
> Sorry for the delayed response. I don't see any obvious problems
> in the logs you shared, or in your methodology. It seems that
> DMTCP failed to restore one or more processes on restart.
>
> I believe we support MPICH, but I'll have to go back and check whether
> we have regressed. I'll try to reproduce this issue locally and
> report back to you. Is there anything specific about your run-time
> environment that I should keep in mind?
>
> Thanks,
> Rohan
>
> ----- Original Message -----
> From: "gene" <g...@ccs.neu.edu>
> To: "Manuel Rodríguez Pascual" <manuel.rodriguez.pasc...@gmail.com>
> Cc: "dmtcp-forum" <dmtcp-forum@lists.sourceforge.net>
> Sent: Thursday, May 14, 2015 11:05:23 AM
> Subject: Re: [Dmtcp-forum] checkpointing MPI applications
>
> Hi Rohan,
>
> Sorry to burden you with this, but with Jiajun and Artem on trips,
> you're our main expert right now. Could you answer this one?
> Manuel is running MPICH. He's using DMTCP 2.4.0-rc4, which should
> be reasonably up to date. Three processes on one node. Checkpoint
> succeeds.
> But on restart, he gets the following error:
>
> > [20427] NOTE at dmtcp_coordinator.cpp:1143 in
> > validateRestartingWorkerProcess; REASON='Computation not in RESTARTING or
> > CHECKPOINTED state. Reject incoming computation process requesting
> > restart.'
> > compId = 6db90f3d5a9dd200-40000-55536cf2
> > hello_remote.compGroup = 6db90f3d5a9dd200-40000-55536cf2
> > minimumState() = WorkerState::RUNNING
>
> Rohan,
>
> Even if it turns out that there's no bug and that Manuel was not
> using the commands correctly, I'd like to treat this as a "documentation
> bug". Our documentation for DMTCP/MPI has changed in recent months, and
> it's still not properly polished. Could you also note where our
> documentation for running DMTCP with MPI is lacking, and then put up a
> pull request to improve it?
>
> Thanks,
> - Gene
>
> On Thu, May 14, 2015 at 01:02:42PM +0200, Manuel Rodríguez Pascual wrote:
> > Hi all,
> >
> > I am a newbie with DMTCP. I am trying to checkpoint my MPI application
> > but haven't managed to yet. I am pretty sure that I'm doing something
> > obvious wrong, but I cannot find the problem on my own.
> >
> > Some info:
> >
> > - I am using MPICH 3.1.4, and my DMTCP version is 2.4.0-rc4.
> > - MPICH is working OK. DMTCP is working OK for serial applications too.
> > - My code is a simple hello world.
> > - At the moment I am running 3 MPI tasks on the same node, to simplify
> >   things.
> > - The coordinator is started with "dmtcp_coordinator", and my
> >   application with "dmtcp_launch mpirun -n 3 /home/slurm/helloWorldMPI".
> >   Both are run as user "slurm".
> > - Passwordless SSH is enabled. Tested with:
> >
> >     ssh slurm-master which dmtcp_launch
> >     /usr/local/bin/dmtcp_launch
> >
> > I don't know where the error is, so I have included quite a lot of
> > information below to help.
> >
> > Thanks for your help,
> >
> > Manuel
> >
> > ---
> > ---
> > COORDINATOR OUTPUT
> >
> > [root@slurm-master tmp]# dmtcp_coordinator
> > dmtcp_coordinator starting...
> >     Host: slurm-master (192.168.1.10)
> >     Port: 7779
> >     Checkpoint Interval: disabled (checkpoint manually instead)
> >     Exit on last client: 0
> > Type '?' for help.
> >
> > [20427] NOTE at dmtcp_coordinator.cpp:1079 in onConnect; REASON='worker connected'
> >   hello_remote.from = 6db90f3d5a9dd200-20448-55536cf2
> > [20427] NOTE at dmtcp_coordinator.cpp:864 in onData; REASON='Updating process Information after exec()'
> >   progname = mpiexec.hydra
> >   msg.from = 6db90f3d5a9dd200-40000-55536cf3
> >   client->identity() = 6db90f3d5a9dd200-20448-55536cf2
> > [20427] NOTE at dmtcp_coordinator.cpp:1079 in onConnect; REASON='worker connected'
> >   hello_remote.from = 6db90f3d5a9dd200-40000-55536cf3
> > [20427] NOTE at dmtcp_coordinator.cpp:855 in onData; REASON='Updating process Information after fork()'
> >   client->hostname() = slurm-master
> >   client->progname() = mpiexec.hydra_(forked)
> >   msg.from = 6db90f3d5a9dd200-41000-55536cf3
> >   client->identity() = 6db90f3d5a9dd200-40000-55536cf3
> > [20427] NOTE at dmtcp_coordinator.cpp:864 in onData; REASON='Updating process Information after exec()'
> >   progname = hydra_pmi_proxy
> >   msg.from = 6db90f3d5a9dd200-41000-55536cf3
> >   client->identity() = 6db90f3d5a9dd200-41000-55536cf3
> > [20427] NOTE at dmtcp_coordinator.cpp:1079 in onConnect; REASON='worker connected'
> >   hello_remote.from = 6db90f3d5a9dd200-41000-55536cf3
> > [20427] NOTE at dmtcp_coordinator.cpp:1079 in onConnect; REASON='worker connected'
> >   hello_remote.from = 6db90f3d5a9dd200-41000-55536cf3
> > [20427] NOTE at dmtcp_coordinator.cpp:855 in onData; REASON='Updating process Information after fork()'
> >   client->hostname() = slurm-master
> >   client->progname() = hydra_pmi_proxy_(forked)
> >   msg.from = 6db90f3d5a9dd200-42000-55536cf4
> >   client->identity() = 6db90f3d5a9dd200-41000-55536cf3
> > [20427] NOTE at dmtcp_coordinator.cpp:1079 in onConnect; REASON='worker connected'
> >   hello_remote.from = 6db90f3d5a9dd200-41000-55536cf3
> > [20427] NOTE at dmtcp_coordinator.cpp:855 in onData; REASON='Updating process Information after fork()'
> >   client->hostname() = slurm-master
> >   client->progname() = hydra_pmi_proxy_(forked)
> >   msg.from = 6db90f3d5a9dd200-43000-55536cf4
> >   client->identity() = 6db90f3d5a9dd200-41000-55536cf3
> > [20427] NOTE at dmtcp_coordinator.cpp:855 in onData; REASON='Updating process Information after fork()'
> >   client->hostname() = slurm-master
> >   client->progname() = hydra_pmi_proxy_(forked)
> >   msg.from = 6db90f3d5a9dd200-44000-55536cf4
> >   client->identity() = 6db90f3d5a9dd200-41000-55536cf3
> > [20427] NOTE at dmtcp_coordinator.cpp:864 in onData; REASON='Updating process Information after exec()'
> >   progname = helloWorldMPI
> >   msg.from = 6db90f3d5a9dd200-42000-55536cf4
> >   client->identity() = 6db90f3d5a9dd200-42000-55536cf4
> > [20427] NOTE at dmtcp_coordinator.cpp:864 in onData; REASON='Updating process Information after exec()'
> >   progname = helloWorldMPI
> >   msg.from = 6db90f3d5a9dd200-44000-55536cf4
> >   client->identity() = 6db90f3d5a9dd200-44000-55536cf4
> > [20427] NOTE at dmtcp_coordinator.cpp:864 in onData; REASON='Updating process Information after exec()'
> >   progname = helloWorldMPI
> >   msg.from = 6db90f3d5a9dd200-43000-55536cf4
> >   client->identity() = 6db90f3d5a9dd200-43000-55536cf4
> >
> > ---
> > ---
> >
> > I press the letter C to checkpoint:
> >
> > ---
> > ---
> >
> > [20427] NOTE at dmtcp_coordinator.cpp:1291 in startCheckpoint; REASON='starting checkpoint, suspending all nodes'
> >   s.numPeers = 5
> > [20427] NOTE at dmtcp_coordinator.cpp:1293 in startCheckpoint; REASON='Incremented Generation'
> >   compId.generation() = 1
> > [20427] NOTE at dmtcp_coordinator.cpp:654 in updateMinimumState; REASON='locking all nodes'
> > [20427] NOTE at dmtcp_coordinator.cpp:660 in updateMinimumState; REASON='draining all nodes'
> > [20427] NOTE at dmtcp_coordinator.cpp:666 in updateMinimumState; REASON='checkpointing all nodes'
> > [20427] NOTE at dmtcp_coordinator.cpp:680 in updateMinimumState; REASON='building name service database'
> > [20427] NOTE at dmtcp_coordinator.cpp:696 in updateMinimumState; REASON='entertaining queries now'
> > [20427] NOTE at dmtcp_coordinator.cpp:701 in updateMinimumState; REASON='refilling all nodes'
> > [20427] NOTE at dmtcp_coordinator.cpp:732 in updateMinimumState; REASON='restarting all nodes'
> >
> > ---
> > ---
> >
> > I cancel the running application with CTRL+C:
> >
> > ---
> > ---
> >
> > COORDINATOR
> >
> > [20427] NOTE at dmtcp_coordinator.cpp:914 in onDisconnect; REASON='client disconnected'
> >   client->identity() = 6db90f3d5a9dd200-41000-55536cf3
> > [20427] NOTE at dmtcp_coordinator.cpp:914 in onDisconnect; REASON='client disconnected'
> >   client->identity() = 6db90f3d5a9dd200-40000-55536cf3
> >
> > APPLICATION
> >
> > [mpiexec@slurm-master] Sending Ctrl-C to processes as requested
> > [mpiexec@slurm-master] Press Ctrl-C again to force abort
> >
> > Ctrl-C caught... cleaning up processes
> > [proxy:0:0@slurm-master] HYD_pmcd_pmip_control_cmd_cb (/root/mpich-3.1.4/src/pm/hydra/pm/pmiserv/pmip_cb.c:885): assert (!closed) failed
> > [proxy:0:0@slurm-master] HYDT_dmxu_poll_wait_for_event (/root/mpich-3.1.4/src/pm/hydra/tools/demux/demux_poll.c:76): callback returned error status
> > [proxy:0:0@slurm-master] main (/root/mpich-3.1.4/src/pm/hydra/pm/pmiserv/pmip.c:206): demux engine error waiting for event
> > ^C[mpiexec@slurm-master] HYDT_bscu_wait_for_completion (/root/mpich-3.1.4/src/pm/hydra/tools/bootstrap/utils/bscu_wait.c:76): one of the processes terminated badly; aborting
> > [mpiexec@slurm-master] HYDT_bsci_wait_for_completion (/root/mpich-3.1.4/src/pm/hydra/tools/bootstrap/src/bsci_wait.c:23): launcher returned error waiting for completion
> > [mpiexec@slurm-master] HYD_pmci_wait_for_completion (/root/mpich-3.1.4/src/pm/hydra/pm/pmiserv/pmiserv_pmci.c:218): launcher returned error waiting for completion
> > [mpiexec@slurm-master] main (/root/mpich-3.1.4/src/pm/hydra/ui/mpich/mpiexec.c:344): process manager error waiting for completion
> >
> > ---
> > ---
> >
> > Everything until now seems to be OK. I get the same output with serial
> > applications.
> >
> > Then I try to restart:
> >
> > ---
> > ---
> >
> > APPLICATION
> >
> > [slurm@slurm-master ~]$ sh /tmp/dmtcp_restart_script.sh
> > [21164] ERROR at coordinatorapi.cpp:514 in sendRecvHandshake; REASON='JASSERT(msg.type == DMT_ACCEPT) failed'
> > dmtcp_restart (21164): Terminating...
> >
> > COORDINATOR
> >
> > [20427] NOTE at dmtcp_coordinator.cpp:1143 in validateRestartingWorkerProcess; REASON='Computation not in RESTARTING or CHECKPOINTED state. Reject incoming computation process requesting restart.'
> >   compId = 6db90f3d5a9dd200-40000-55536cf2
> >   hello_remote.compGroup = 6db90f3d5a9dd200-40000-55536cf2
> >   minimumState() = WorkerState::RUNNING
> >
> > ---
> > ---
> >
> > --
> > Dr. Manuel Rodríguez-Pascual
> > skype: manuel.rodriguez.pascual
> > phone: (+34) 913466173 // (+34) 679925108
> >
> > CIEMAT-Moncloa
> > Edificio 22, desp. 1.25
> > Avenida Complutense, 40
> > 28040 - MADRID
> > SPAIN
> >
> > _______________________________________________
> > Dmtcp-forum mailing list
> > Dmtcp-forum@lists.sourceforge.net
> > https://lists.sourceforge.net/lists/listinfo/dmtcp-forum
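[For readers who want to reproduce the setup in this thread, the command
sequence can be sketched as a single script. This is a minimal sketch, not an
official DMTCP recipe: the binary path, port numbers, the sleep duration, and
the use of `dmtcp_command --checkpoint` (instead of typing 'c' in the
coordinator window) are assumptions, and flag spellings can differ across
DMTCP versions. Note that the coordinator rejects a restart while it still
considers a computation to be RUNNING (the error above), so one hypothetical
workaround is to point the restart script at a freshly started coordinator.]

```shell
#!/bin/sh
# Sketch of the workflow described in this thread (paths/ports are examples).

# 1. Start a coordinator in the background on a known port.
dmtcp_coordinator --daemon --port 7779

# 2. Launch the MPI job under DMTCP control.
dmtcp_launch mpirun -n 3 /home/slurm/helloWorldMPI &

# 3. Later, request a checkpoint (equivalent to typing 'c' at the coordinator).
sleep 10
dmtcp_command --port 7779 --checkpoint

# 4. Shut down the old coordinator (and its clients) so a later restart does
#    not hit a coordinator that still sees the computation as RUNNING.
dmtcp_command --port 7779 --quit

# 5. Restart from the generated script (written to the checkpoint directory,
#    /tmp in this thread), pointing it at a fresh coordinator.
dmtcp_coordinator --daemon --port 7780
sh /tmp/dmtcp_restart_script.sh --host localhost --port 7780
```

[Whether the generated restart script accepts `--host`/`--port` may depend on
the DMTCP version; if not, the coordinator host and port can usually be set
via the environment before running it.]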