Thanks very much, Manuel. We'll be using this case study to improve our documentation, so that others don't fall into the same trap.

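For the new documentation, it would also be worth showing Manuel's recipe (quoted below) in a scripted, non-interactive form, so that users never have to type single-letter commands at the coordinator terminal. Roughly along these lines (only a sketch; the dmtcp_coordinator and dmtcp_command option names should be verified against the current release before this goes into the docs):

    # Start a coordinator in the background on a known port (7779 is the default).
    dmtcp_coordinator --daemon -p 7779

    # Launch the MPI job under DMTCP, pointing it at that coordinator.
    dmtcp_launch -p 7779 mpirun -n 3 /home/slurm/helloWorldMPI &

    # ...let the job run for a while, then request a checkpoint of all
    # connected processes (the same as typing 'c' at the coordinator).
    dmtcp_command -p 7779 --checkpoint

    # Kill the computation cleanly (the same as typing 'k'); never CTRL+C on mpirun.
    dmtcp_command -p 7779 --kill

    # Restart the whole computation from the checkpoint images.
    dmtcp_restart -p 7779 ckpt_*
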
Rohan,

In addition to pointing out that users should press 'k' instead of typing '^C', we also need to describe how a user can recover if they do accidentally type '^C'. This is a very easy trap to fall into, and even random system glitches could produce the same failure. Could you create a pull request with a draft fix for the user documentation?

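For the accidental '^C' section, the key point is probably exactly what Manuel hit: after CTRL+C the coordinator can still list stale clients in the RUNNING state, and it will then reject dmtcp_restart with the 'Computation not in RESTARTING or CHECKPOINTED state' message quoted further down in this thread. A rough sketch of the recovery steps (again, please double-check the exact option names before this goes in):

    # See whether clients from the interrupted run are still attached
    # (the same as typing 'l' at the coordinator).
    dmtcp_command -p 7779 --list

    # Clear any leftover clients (the same as typing 'k' at the coordinator).
    dmtcp_command -p 7779 --kill

    # With the old computation gone, the coordinator will accept the restart.
    dmtcp_restart -p 7779 ckpt_*

    # Alternatively, start a fresh coordinator on a different port and point
    # dmtcp_restart at it with -p (and -h if it runs on another host).
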
Thanks,
- Gene

On Wed, May 20, 2015 at 10:57:13AM +0200, Manuel Rodríguez Pascual wrote:
> Hi all,
>
> Thanks to Rohan Garg's support, the problem is solved now. The issue was
> that I was killing the job with CTRL+C instead of pressing 'k' in the
> coordinator, and that created some issues.
>
> For the sake of completeness, below is attached the whole set of tests,
> hoping it helps someone in my same situation.
>
>
> Best regards,
>
> Manuel
>
> ---
> ---
>
> [slurm@localhost dmtcp]$ ./bin/dmtcp_coordinator
> dmtcp_coordinator starting...
> Host: localhost.localdomain (127.0.0.1)
> Port: 7779
> Checkpoint Interval: disabled (checkpoint manually instead)
> Exit on last client: 0
> Type '?' for help.
>
> - Ran the helloWorldMPI program under DMTCP from a separate terminal:
>
> [slurm@localhost dmtcp]$ ./bin/dmtcp_launch mpirun -n 3 /home/slurm/helloWorldMPI
> Process 2 of 3 is on localhost.localdomain
> Hello world from process 2 of 3
> 2: 2
> Process 0 of 3 is on localhost.localdomain
> Hello world from process 0 of 3
> 0: 2
> Process 1 of 3 is on localhost.localdomain
> Hello world from process 1 of 3
> ...
>
> - Checked the processes connected to the coordinator by pressing 'l' at the
> coordinator terminal:
>
> l <enter>
> Client List:
> #, PROG[virtPID:realPID]@HOST, DMTCP-UNIQUEPID, STATE
> 1, mpiexec.hydra[40000:23455]@localhost.localdomain, 5c0352704509e818-40000-555abc19, RUNNING
> 2, hydra_pmi_proxy[41000:23459]@localhost.localdomain, 5c0352704509e818-41000-555abc19, RUNNING
> 3, helloWorldMPI[42000:23462]@localhost.localdomain, 5c0352704509e818-42000-555abc19, RUNNING
> 4, helloWorldMPI[43000:23463]@localhost.localdomain, 5c0352704509e818-43000-555abc19, RUNNING
> 5, helloWorldMPI[44000:23464]@localhost.localdomain, 5c0352704509e818-44000-555abc19, RUNNING
>
> - Entered 'c' to issue a checkpoint command at the coordinator terminal, and
> then 'l' again to verify that the processes are still running:
>
> c <enter>
> [23454] NOTE at dmtcp_coordinator.cpp:1291 in startCheckpoint; REASON='starting checkpoint, suspending all nodes'
> s.numPeers = 5
> [23454] NOTE at dmtcp_coordinator.cpp:1293 in startCheckpoint; REASON='Incremented Generation'
> compId.generation() = 1
> [23454] NOTE at dmtcp_coordinator.cpp:654 in updateMinimumState; REASON='locking all nodes'
> [23454] NOTE at dmtcp_coordinator.cpp:660 in updateMinimumState; REASON='draining all nodes'
> [23454] NOTE at dmtcp_coordinator.cpp:666 in updateMinimumState; REASON='checkpointing all nodes'
> [23454] NOTE at dmtcp_coordinator.cpp:680 in updateMinimumState; REASON='building name service database'
> [23454] NOTE at dmtcp_coordinator.cpp:696 in updateMinimumState; REASON='entertaining queries now'
> [23454] NOTE at dmtcp_coordinator.cpp:701 in updateMinimumState; REASON='refilling all nodes'
> [23454] NOTE at dmtcp_coordinator.cpp:732 in updateMinimumState; REASON='restarting all nodes'
> l <enter>
> Client List:
> #, PROG[virtPID:realPID]@HOST, DMTCP-UNIQUEPID, STATE
> 1, mpiexec.hydra[40000:23455]@localhost.localdomain, 5c0352704509e818-40000-555abc19, RUNNING
> 2, hydra_pmi_proxy[41000:23459]@localhost.localdomain, 5c0352704509e818-41000-555abc19, RUNNING
> 3, helloWorldMPI[42000:23462]@localhost.localdomain, 5c0352704509e818-42000-555abc19, RUNNING
> 4, helloWorldMPI[43000:23463]@localhost.localdomain, 5c0352704509e818-43000-555abc19, RUNNING
> 5, helloWorldMPI[44000:23464]@localhost.localdomain, 5c0352704509e818-44000-555abc19, RUNNING
>
> - Next, I killed the computation by pressing 'k' at the coordinator terminal:
>
> k
> [23454] NOTE at dmtcp_coordinator.cpp:588 in handleUserCommand; REASON='Killing all connected Peers...'
> [23454] NOTE at dmtcp_coordinator.cpp:914 in onDisconnect; REASON='client disconnected'
> client->identity() = 5c0352704509e818-40000-555abc19
> [23454] NOTE at dmtcp_coordinator.cpp:914 in onDisconnect; REASON='client disconnected'
> client->identity() = 5c0352704509e818-41000-555abc19
> [23454] NOTE at dmtcp_coordinator.cpp:914 in onDisconnect; REASON='client disconnected'
> client->identity() = 5c0352704509e818-42000-555abc19
> [23454] NOTE at dmtcp_coordinator.cpp:914 in onDisconnect; REASON='client disconnected'
> client->identity() = 5c0352704509e818-44000-555abc19
> [23454] NOTE at dmtcp_coordinator.cpp:914 in onDisconnect; REASON='client disconnected'
> client->identity() = 5c0352704509e818-43000-555abc19
>
> - Finally, I tried to restart from the checkpoint images, and it worked:
>
> [slurm@localhost dmtcp]$ ./bin/dmtcp_restart ckpt_*
> [23491] mtcp_restart.c:1204 read_shared_memory_area_from_file: mapping /dev/shm/mpich_shar_tmpAEfhnO with data from ckpt image
> [23490] mtcp_restart.c:1204 read_shared_memory_area_from_file: mapping /dev/shm/mpich_shar_tmpAEfhnO with data from ckpt image
> 0: 7
> 2: 7
> 1: 7
>
> 2015-05-15 18:10 GMT+02:00 Rohan Garg <rohg...@ccs.neu.edu>:
> > Thank you! A VM image sounds great! I'm assuming it'll have the
> > pre-configured MPICH.
> >
> > ----- Original Message -----
> > From: "Manuel Rodríguez Pascual" <manuel.rodriguez.pasc...@gmail.com>
> > To: "Rohan Garg" <rohg...@ccs.neu.edu>
> > Cc: "dmtcp-forum" <dmtcp-forum@lists.sourceforge.net>
> > Sent: Friday, May 15, 2015 11:49:36 AM
> > Subject: Re: checkpointing MPI applications
> >
> > Hi,
> >
> > My environment is not fixed at all, so if you have any suggestion I have
> > no problem changing it :) My only requirement is Slurm, but the MPI and
> > checkpoint libraries can be modified freely.
> >
> > Or if you prefer, I can keep this one and help you with the debugging
> > process. Everything is virtualized, so I can send you the image if you
> > want to use it for your tests.
> >
> > Thanks for your help,
> >
> >
> > Manuel
> >
> > On Friday, May 15, 2015, Rohan Garg <rohg...@ccs.neu.edu> wrote:
> >
> >> Hi Manuel,
> >>
> >> Sorry for the delayed response. I don't see any obvious problems
> >> from the logs that you shared, or your methodology. It seems like
> >> DMTCP failed to restore one or more processes on restart.
> >>
> >> I believe we support MPICH, but I'll have to go back and check if
> >> we have regressed. I'll try to reproduce this issue locally and
> >> report back to you. Is there anything specific about your run-time
> >> environment that I should keep in mind?
> >>
> >> Thanks,
> >> Rohan
> >>
> >> ----- Original Message -----
> >> From: "gene" <g...@ccs.neu.edu>
> >> To: "Manuel Rodríguez Pascual" <manuel.rodriguez.pasc...@gmail.com>
> >> Cc: "dmtcp-forum" <dmtcp-forum@lists.sourceforge.net>
> >> Sent: Thursday, May 14, 2015 11:05:23 AM
> >> Subject: Re: [Dmtcp-forum] checkpointing MPI applications
> >>
> >> Hi Rohan,
> >> Sorry to burden you with this, but with Jiajun and Artem on trips,
> >> you're our main expert right now. Could you answer this one?
> >> Manuel is running MPICH. He's using DMTCP 2.4.0-rc4, which should
> >> be reasonably up to date. Three processes on one node. Checkpoint
> >> succeeds. But on restart, he gets the following error:
> >>
> >> > [20427] NOTE at dmtcp_coordinator.cpp:1143 in validateRestartingWorkerProcess; REASON='Computation not in RESTARTING or CHECKPOINTED state. Reject incoming computation process requesting restart.'
> >> > compId = 6db90f3d5a9dd200-40000-55536cf2
> >> > hello_remote.compGroup = 6db90f3d5a9dd200-40000-55536cf2
> >> > minimumState() = WorkerState::RUNNING
> >>
> >> Rohan,
> >> Even if it turns out that there's no bug and that Manuel was not
> >> using the commands correctly, I'd like to consider this a
> >> "documentation bug". Our documentation for DMTCP/MPI has changed in
> >> recent months, and it's still not properly polished.
> >> Could you also keep in mind where our documentation for running
> >> DMTCP with MPI is lacking, and then put up a pull request to improve
> >> our documentation?
> >>
> >> Thanks,
> >> - Gene
> >>
> >>
> >> On Thu, May 14, 2015 at 01:02:42PM +0200, Manuel Rodríguez Pascual wrote:
> >> > Hi all,
> >> >
> >> > I am a newbie in DMTCP. I am trying to checkpoint my MPI application
> >> > but still haven't been able to. I am pretty sure that I'm doing
> >> > something obviously wrong, but I cannot find the problem on my own.
> >> >
> >> > Some info:
> >> >
> >> > - I am employing MPICH 3.1.4, and my DMTCP version is 2.4.0-rc4.
> >> > - MPICH is working OK. DMTCP is working OK for serial applications too.
> >> > - My code is a simple hello world.
> >> > - At this moment I am running 3 MPI tasks on the same node, to
> >> > simplify things.
> >> > - The coordinator is started with "dmtcp_coordinator", and my
> >> > application with "dmtcp_launch mpirun -n 3 /home/slurm/helloWorldMPI".
> >> > Both are run as user "slurm".
> >> >
> >> > - Passwordless SSH is enabled. Tested with:
> >> >
> >> > ssh slurm-master which dmtcp_launch
> >> > /usr/local/bin/dmtcp_launch
> >> >
> >> >
> >> > I don't know where the error can be, so I have included quite a lot
> >> > of information below to help.
> >> >
> >> > Thanks for your help,
> >> >
> >> >
> >> > Manuel
> >> > ---
> >> > ---
> >> > COORDINATOR OUTPUT
> >> > [root@slurm-master tmp]# dmtcp_coordinator
> >> > dmtcp_coordinator starting...
> >> > Host: slurm-master (192.168.1.10)
> >> > Port: 7779
> >> > Checkpoint Interval: disabled (checkpoint manually instead)
> >> > Exit on last client: 0
> >> > Type '?' for help.
> >> >
> >> > [20427] NOTE at dmtcp_coordinator.cpp:1079 in onConnect; REASON='worker connected'
> >> > hello_remote.from = 6db90f3d5a9dd200-20448-55536cf2
> >> > [20427] NOTE at dmtcp_coordinator.cpp:864 in onData; REASON='Updating process Information after exec()'
> >> > progname = mpiexec.hydra
> >> > msg.from = 6db90f3d5a9dd200-40000-55536cf3
> >> > client->identity() = 6db90f3d5a9dd200-20448-55536cf2
> >> > [20427] NOTE at dmtcp_coordinator.cpp:1079 in onConnect; REASON='worker connected'
> >> > hello_remote.from = 6db90f3d5a9dd200-40000-55536cf3
> >> > [20427] NOTE at dmtcp_coordinator.cpp:855 in onData; REASON='Updating process Information after fork()'
> >> > client->hostname() = slurm-master
> >> > client->progname() = mpiexec.hydra_(forked)
> >> > msg.from = 6db90f3d5a9dd200-41000-55536cf3
> >> > client->identity() = 6db90f3d5a9dd200-40000-55536cf3
> >> > [20427] NOTE at dmtcp_coordinator.cpp:864 in onData; REASON='Updating process Information after exec()'
> >> > progname = hydra_pmi_proxy
> >> > msg.from = 6db90f3d5a9dd200-41000-55536cf3
> >> > client->identity() = 6db90f3d5a9dd200-41000-55536cf3
> >> > [20427] NOTE at dmtcp_coordinator.cpp:1079 in onConnect; REASON='worker connected'
> >> > hello_remote.from = 6db90f3d5a9dd200-41000-55536cf3
> >> > [20427] NOTE at dmtcp_coordinator.cpp:1079 in onConnect; REASON='worker connected'
> >> > hello_remote.from = 6db90f3d5a9dd200-41000-55536cf3
> >> > [20427] NOTE at dmtcp_coordinator.cpp:855 in onData; REASON='Updating process Information after fork()'
> >> > client->hostname() = slurm-master
> >> > client->progname() = hydra_pmi_proxy_(forked)
> >> > msg.from = 6db90f3d5a9dd200-42000-55536cf4
> >> > client->identity() = 6db90f3d5a9dd200-41000-55536cf3
> >> > [20427] NOTE at dmtcp_coordinator.cpp:1079 in onConnect; REASON='worker connected'
> >> > hello_remote.from = 6db90f3d5a9dd200-41000-55536cf3
> >> > [20427] NOTE at dmtcp_coordinator.cpp:855 in onData; REASON='Updating process Information after fork()'
> >> > client->hostname() = slurm-master
> >> > client->progname() = hydra_pmi_proxy_(forked)
> >> > msg.from = 6db90f3d5a9dd200-43000-55536cf4
> >> > client->identity() = 6db90f3d5a9dd200-41000-55536cf3
> >> > [20427] NOTE at dmtcp_coordinator.cpp:855 in onData; REASON='Updating process Information after fork()'
> >> > client->hostname() = slurm-master
> >> > client->progname() = hydra_pmi_proxy_(forked)
> >> > msg.from = 6db90f3d5a9dd200-44000-55536cf4
> >> > client->identity() = 6db90f3d5a9dd200-41000-55536cf3
> >> > [20427] NOTE at dmtcp_coordinator.cpp:864 in onData; REASON='Updating process Information after exec()'
> >> > progname = helloWorldMPI
> >> > msg.from = 6db90f3d5a9dd200-42000-55536cf4
> >> > client->identity() = 6db90f3d5a9dd200-42000-55536cf4
> >> > [20427] NOTE at dmtcp_coordinator.cpp:864 in onData; REASON='Updating process Information after exec()'
> >> > progname = helloWorldMPI
> >> > msg.from = 6db90f3d5a9dd200-44000-55536cf4
> >> > client->identity() = 6db90f3d5a9dd200-44000-55536cf4
> >> > [20427] NOTE at dmtcp_coordinator.cpp:864 in onData; REASON='Updating process Information after exec()'
> >> > progname = helloWorldMPI
> >> > msg.from = 6db90f3d5a9dd200-43000-55536cf4
> >> > client->identity() = 6db90f3d5a9dd200-43000-55536cf4
> >> >
> >> > ---
> >> > ---
> >> >
> >> > I press the letter 'c' to checkpoint
> >> >
> >> > ---
> >> > ---
> >> >
> >> > [20427] NOTE at dmtcp_coordinator.cpp:1291 in startCheckpoint; REASON='starting checkpoint, suspending all nodes'
> >> > s.numPeers = 5
> >> > [20427] NOTE at dmtcp_coordinator.cpp:1293 in startCheckpoint; REASON='Incremented Generation'
> >> > compId.generation() = 1
> >> > [20427] NOTE at dmtcp_coordinator.cpp:654 in updateMinimumState; REASON='locking all nodes'
> >> > [20427] NOTE at dmtcp_coordinator.cpp:660 in updateMinimumState; REASON='draining all nodes'
> >> > [20427] NOTE at dmtcp_coordinator.cpp:666 in updateMinimumState; REASON='checkpointing all nodes'
> >> > [20427] NOTE at dmtcp_coordinator.cpp:680 in updateMinimumState; REASON='building name service database'
> >> > [20427] NOTE at dmtcp_coordinator.cpp:696 in updateMinimumState; REASON='entertaining queries now'
> >> > [20427] NOTE at dmtcp_coordinator.cpp:701 in updateMinimumState; REASON='refilling all nodes'
> >> > [20427] NOTE at dmtcp_coordinator.cpp:732 in updateMinimumState; REASON='restarting all nodes'
> >> > ---
> >> > ---
> >> >
> >> > I cancel the running application with CTRL+C
> >> >
> >> > ---
> >> > ---
> >> > COORDINATOR
> >> > [20427] NOTE at dmtcp_coordinator.cpp:914 in onDisconnect; REASON='client disconnected'
> >> > client->identity() = 6db90f3d5a9dd200-41000-55536cf3
> >> > [20427] NOTE at dmtcp_coordinator.cpp:914 in onDisconnect; REASON='client disconnected'
> >> > client->identity() = 6db90f3d5a9dd200-40000-55536cf3
> >> >
> >> > APPLICATION
> >> >
> >> > [mpiexec@slurm-master] Sending Ctrl-C to processes as requested
> >> > [mpiexec@slurm-master] Press Ctrl-C again to force abort
> >> >
> >> > Ctrl-C caught... cleaning up processes
> >> > [proxy:0:0@slurm-master] HYD_pmcd_pmip_control_cmd_cb (/root/mpich-3.1.4/src/pm/hydra/pm/pmiserv/pmip_cb.c:885): assert (!closed) failed
> >> > [proxy:0:0@slurm-master] HYDT_dmxu_poll_wait_for_event (/root/mpich-3.1.4/src/pm/hydra/tools/demux/demux_poll.c:76): callback returned error status
> >> > [proxy:0:0@slurm-master] main (/root/mpich-3.1.4/src/pm/hydra/pm/pmiserv/pmip.c:206): demux engine error waiting for event
> >> > ^C[mpiexec@slurm-master] HYDT_bscu_wait_for_completion (/root/mpich-3.1.4/src/pm/hydra/tools/bootstrap/utils/bscu_wait.c:76): one of the processes terminated badly; aborting
> >> > [mpiexec@slurm-master] HYDT_bsci_wait_for_completion (/root/mpich-3.1.4/src/pm/hydra/tools/bootstrap/src/bsci_wait.c:23): launcher returned error waiting for completion
> >> > [mpiexec@slurm-master] HYD_pmci_wait_for_completion (/root/mpich-3.1.4/src/pm/hydra/pm/pmiserv/pmiserv_pmci.c:218): launcher returned error waiting for completion
> >> > [mpiexec@slurm-master] main (/root/mpich-3.1.4/src/pm/hydra/ui/mpich/mpiexec.c:344): process manager error waiting for completion
> >> >
> >> > ---
> >> > ---
> >> >
> >> > Everything until now seems to be OK. I get the same output with serial
> >> > applications.
> >> >
> >> > Then I try to restart:
> >> >
> >> > ---
> >> > ---
> >> >
> >> > APPLICATION
> >> > [slurm@slurm-master ~]$ sh /tmp/dmtcp_restart_script.sh
> >> > [21164] ERROR at coordinatorapi.cpp:514 in sendRecvHandshake; REASON='JASSERT(msg.type == DMT_ACCEPT) failed'
> >> > dmtcp_restart (21164): Terminating...
> >> >
> >> > COORDINATOR
> >> >
> >> > [20427] NOTE at dmtcp_coordinator.cpp:1143 in validateRestartingWorkerProcess; REASON='Computation not in RESTARTING or CHECKPOINTED state. Reject incoming computation process requesting restart.'
> >> > compId = 6db90f3d5a9dd200-40000-55536cf2
> >> > hello_remote.compGroup = 6db90f3d5a9dd200-40000-55536cf2
> >> > minimumState() = WorkerState::RUNNING
> >> > ---
> >> > ---
> >> >
> >> >
> >> > --
> >> > Dr. Manuel Rodríguez-Pascual
> >> > skype: manuel.rodriguez.pascual
> >> > phone: (+34) 913466173 // (+34) 679925108
> >> >
> >> > CIEMAT-Moncloa
> >> > Edificio 22, desp. 1.25
> >> > Avenida Complutense, 40
> >> > 28040- MADRID
> >> > SPAIN
> >>
> >
> > --
> > Dr. Manuel Rodríguez-Pascual
> > skype: manuel.rodriguez.pascual
> > phone: (+34) 913466173 // (+34) 679925108
> >
> > CIEMAT-Moncloa
> > Edificio 22, desp. 1.25
> > Avenida Complutense, 40
> > 28040- MADRID
> > SPAIN
>
> --
> Dr. Manuel Rodríguez-Pascual
> skype: manuel.rodriguez.pascual
> phone: (+34) 913466173 // (+34) 679925108
>
> CIEMAT-Moncloa
> Edificio 22, desp. 1.25
> Avenida Complutense, 40
> 28040- MADRID
> SPAIN