Thank you! A VM image sounds great. I'm assuming it will have MPICH
pre-configured.

----- Original Message -----
From: "Manuel Rodríguez Pascual" <manuel.rodriguez.pasc...@gmail.com>
To: "Rohan Garg" <rohg...@ccs.neu.edu>
Cc: "dmtcp-forum" <dmtcp-forum@lists.sourceforge.net>
Sent: Friday, May 15, 2015 11:49:36 AM
Subject: Re: checkpointing MPI applications

Hi,

My environment is not fixed at all, so if you have any suggestions I have no
problem changing it :) My only requirement is Slurm; the MPI and
checkpoint libraries can be changed freely.

Or, if you prefer, I can keep this setup and help you with the debugging
process. Everything is virtualized, so I can send you the image if you want
to use it for your tests.

Thanks for your help,


Manuel

On Friday, May 15, 2015, Rohan Garg <rohg...@ccs.neu.edu> wrote:

> Hi Manuel,
>
> Sorry for the delayed response. I don't see any obvious problems
> from the logs that you shared, or your methodology. It seems like
> DMTCP failed to restore one or more processes on restart.
>
> I believe we support MPICH, but I'll have to go back and check if
> we have regressed. I'll try to reproduce this issue locally and
> report back to you. Is there anything specific about your run-time
> environment that I should keep in mind?
>
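> In the meantime, one workaround worth trying (just a sketch, untested
> on my end): the "minimumState() = WorkerState::RUNNING" line suggests
> that the old coordinator still had live clients attached when the
> restart script connected, so restarting under a fresh coordinator on
> another port may help:
>
>     # hypothetical port; any free port should work
>     dmtcp_coordinator --port 7780 &
>     DMTCP_COORD_PORT=7780 sh /tmp/dmtcp_restart_script.sh
>
> (DMTCP_COORD_HOST and DMTCP_COORD_PORT are the standard coordinator
> environment variables; if your copy of the restart script ignores
> them, it also accepts host/port options, see its --help.)
>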
> Thanks,
> Rohan
>
> ----- Original Message -----
> From: "gene" <g...@ccs.neu.edu <javascript:;>>
> To: "Manuel Rodríguez Pascual" <manuel.rodriguez.pasc...@gmail.com
> <javascript:;>>
> Cc: "dmtcp-forum" <dmtcp-forum@lists.sourceforge.net <javascript:;>>
> Sent: Thursday, May 14, 2015 11:05:23 AM
> Subject: Re: [Dmtcp-forum] checkpointing MPI applications
>
> Hi Rohan,
>     Sorry to burden you with this, but with Jiajun and Artem on trips,
> you're our main expert right now.  Could you answer this one?
>     Manuel is running MPICH.  He's using DMTCP 2.4.0-rc4, which should
> be reasonably up to date.  Three processes on one node.  Checkpoint
> succeeds.  But on restart, he gets the following error:
>
> > [20427] NOTE at dmtcp_coordinator.cpp:1143 in
> > validateRestartingWorkerProcess; REASON='Computation not in RESTARTING or
> > CHECKPOINTED state.  Reject incoming computation process requesting
> > restart.'
> >      compId = 6db90f3d5a9dd200-40000-55536cf2
> >      hello_remote.compGroup = 6db90f3d5a9dd200-40000-55536cf2
> >      minimumState() = WorkerState::RUNNING
>
> Rohan,
>     Even if it turns out that there's no bug and that Manuel was not
> using the commands correctly, I'd like to consider this a "documentation
> bug".
> Our documentation for DMTCP/MPI has changed in recent months, and it's
> still not properly polished.
>     Could you also keep in mind where our documentation for running
> DMTCP with MPI is lacking, and then put up a pull request to improve
> our documentation?
>
> Thanks,
> - Gene
>
>
> On Thu, May 14, 2015 at 01:02:42PM +0200, Manuel Rodríguez Pascual wrote:
> > Hi all,
> >
> > I am a newbie with DMTCP. I am trying to checkpoint my MPI application
> > but haven't managed to yet. I am pretty sure I'm doing something
> > obviously wrong, but I cannot find the problem on my own.
> >
> > Some info:
> >
> > - I am using MPICH 3.1.4, and my DMTCP version is 2.4.0-rc4.
> > - MPICH is working OK. DMTCP is working OK for serial applications too.
> > - My code is a simple hello world.
> > - At the moment I am running 3 MPI tasks on the same node, to simplify
> >   things.
> > - The coordinator is started with "dmtcp_coordinator", and my application
> >   with "dmtcp_launch mpirun -n 3 /home/slurm/helloWorldMPI". Both are run
> >   as user "slurm".
> >
> > - Passwordless SSH is enabled. Tested with:
> >
> > ssh slurm-master  which dmtcp_launch
> > /usr/local/bin/dmtcp_launch
> >
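> > For reference, the full session (a sketch of exactly what I run; the
> > paths and port are from my setup) is:
> >
> >     # terminal 1: start the coordinator (default port 7779)
> >     dmtcp_coordinator
> >
> >     # terminal 2: launch the MPI job under DMTCP, as user "slurm"
> >     dmtcp_launch mpirun -n 3 /home/slurm/helloWorldMPI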
> >
> > I don't know where the error might be, so I have included quite a lot of
> > information below to help.
> >
> > Thanks for your help,
> >
> >
> > Manuel
> > ---
> > ---
> > COORDINATOR OUTPUT
> > [root@slurm-master tmp]# dmtcp_coordinator
> > dmtcp_coordinator starting...
> >     Host: slurm-master (192.168.1.10)
> >     Port: 7779
> >     Checkpoint Interval: disabled (checkpoint manually instead)
> >     Exit on last client: 0
> > Type '?' for help.
> >
> > [20427] NOTE at dmtcp_coordinator.cpp:1079 in onConnect; REASON='worker
> > connected'
> >      hello_remote.from = 6db90f3d5a9dd200-20448-55536cf2
> > [20427] NOTE at dmtcp_coordinator.cpp:864 in onData; REASON='Updating
> > process Information after exec()'
> >      progname = mpiexec.hydra
> >      msg.from = 6db90f3d5a9dd200-40000-55536cf3
> >      client->identity() = 6db90f3d5a9dd200-20448-55536cf2
> > [20427] NOTE at dmtcp_coordinator.cpp:1079 in onConnect; REASON='worker
> > connected'
> >      hello_remote.from = 6db90f3d5a9dd200-40000-55536cf3
> > [20427] NOTE at dmtcp_coordinator.cpp:855 in onData; REASON='Updating
> > process Information after fork()'
> >      client->hostname() = slurm-master
> >      client->progname() = mpiexec.hydra_(forked)
> >      msg.from = 6db90f3d5a9dd200-41000-55536cf3
> >      client->identity() = 6db90f3d5a9dd200-40000-55536cf3
> > [20427] NOTE at dmtcp_coordinator.cpp:864 in onData; REASON='Updating
> > process Information after exec()'
> >      progname = hydra_pmi_proxy
> >      msg.from = 6db90f3d5a9dd200-41000-55536cf3
> >      client->identity() = 6db90f3d5a9dd200-41000-55536cf3
> > [20427] NOTE at dmtcp_coordinator.cpp:1079 in onConnect; REASON='worker
> > connected'
> >      hello_remote.from = 6db90f3d5a9dd200-41000-55536cf3
> > [20427] NOTE at dmtcp_coordinator.cpp:1079 in onConnect; REASON='worker
> > connected'
> >      hello_remote.from = 6db90f3d5a9dd200-41000-55536cf3
> > [20427] NOTE at dmtcp_coordinator.cpp:855 in onData; REASON='Updating
> > process Information after fork()'
> >      client->hostname() = slurm-master
> >      client->progname() = hydra_pmi_proxy_(forked)
> >      msg.from = 6db90f3d5a9dd200-42000-55536cf4
> >      client->identity() = 6db90f3d5a9dd200-41000-55536cf3
> > [20427] NOTE at dmtcp_coordinator.cpp:1079 in onConnect; REASON='worker
> > connected'
> >      hello_remote.from = 6db90f3d5a9dd200-41000-55536cf3
> > [20427] NOTE at dmtcp_coordinator.cpp:855 in onData; REASON='Updating
> > process Information after fork()'
> >      client->hostname() = slurm-master
> >      client->progname() = hydra_pmi_proxy_(forked)
> >      msg.from = 6db90f3d5a9dd200-43000-55536cf4
> >      client->identity() = 6db90f3d5a9dd200-41000-55536cf3
> > [20427] NOTE at dmtcp_coordinator.cpp:855 in onData; REASON='Updating
> > process Information after fork()'
> >      client->hostname() = slurm-master
> >      client->progname() = hydra_pmi_proxy_(forked)
> >      msg.from = 6db90f3d5a9dd200-44000-55536cf4
> >      client->identity() = 6db90f3d5a9dd200-41000-55536cf3
> > [20427] NOTE at dmtcp_coordinator.cpp:864 in onData; REASON='Updating
> > process Information after exec()'
> >      progname = helloWorldMPI
> >      msg.from = 6db90f3d5a9dd200-42000-55536cf4
> >      client->identity() = 6db90f3d5a9dd200-42000-55536cf4
> > [20427] NOTE at dmtcp_coordinator.cpp:864 in onData; REASON='Updating
> > process Information after exec()'
> >      progname = helloWorldMPI
> >      msg.from = 6db90f3d5a9dd200-44000-55536cf4
> >      client->identity() = 6db90f3d5a9dd200-44000-55536cf4
> > [20427] NOTE at dmtcp_coordinator.cpp:864 in onData; REASON='Updating
> > process Information after exec()'
> >      progname = helloWorldMPI
> >      msg.from = 6db90f3d5a9dd200-43000-55536cf4
> >      client->identity() = 6db90f3d5a9dd200-43000-55536cf4
> >
> > ---
> > ---
> >
> > I press 'c' in the coordinator to checkpoint.
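> > (Equivalently, a checkpoint can be requested from another shell with
> > "dmtcp_command --checkpoint"; this assumes the coordinator is
> > listening on its default port.)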
> >
> > ---
> > ---
> >
> > [20427] NOTE at dmtcp_coordinator.cpp:1291 in startCheckpoint;
> > REASON='starting checkpoint, suspending all nodes'
> >      s.numPeers = 5
> > [20427] NOTE at dmtcp_coordinator.cpp:1293 in startCheckpoint;
> > REASON='Incremented Generation'
> >      compId.generation() = 1
> > [20427] NOTE at dmtcp_coordinator.cpp:654 in updateMinimumState;
> > REASON='locking all nodes'
> > [20427] NOTE at dmtcp_coordinator.cpp:660 in updateMinimumState;
> > REASON='draining all nodes'
> > [20427] NOTE at dmtcp_coordinator.cpp:666 in updateMinimumState;
> > REASON='checkpointing all nodes'
> > [20427] NOTE at dmtcp_coordinator.cpp:680 in updateMinimumState;
> > REASON='building name service database'
> > [20427] NOTE at dmtcp_coordinator.cpp:696 in updateMinimumState;
> > REASON='entertaining queries now'
> > [20427] NOTE at dmtcp_coordinator.cpp:701 in updateMinimumState;
> > REASON='refilling all nodes'
> > [20427] NOTE at dmtcp_coordinator.cpp:732 in updateMinimumState;
> > REASON='restarting all nodes'
> > ---
> > ---
> >
> >
> > I cancel the running application with Ctrl+C.
> >
> > ---
> > ---
> > COORDINATOR
> > [20427] NOTE at dmtcp_coordinator.cpp:914 in onDisconnect; REASON='client
> > disconnected'
> >      client->identity() = 6db90f3d5a9dd200-41000-55536cf3
> > [20427] NOTE at dmtcp_coordinator.cpp:914 in onDisconnect; REASON='client
> > disconnected'
> >      client->identity() = 6db90f3d5a9dd200-40000-55536cf3
> >
> > APPLICATION
> >
> > [mpiexec@slurm-master] Sending Ctrl-C to processes as requested
> > [mpiexec@slurm-master] Press Ctrl-C again to force abort
> >
> > Ctrl-C caught... cleaning up processes
> > [proxy:0:0@slurm-master] HYD_pmcd_pmip_control_cmd_cb
> > (/root/mpich-3.1.4/src/pm/hydra/pm/pmiserv/pmip_cb.c:885): assert
> > (!closed) failed
> > [proxy:0:0@slurm-master] HYDT_dmxu_poll_wait_for_event
> > (/root/mpich-3.1.4/src/pm/hydra/tools/demux/demux_poll.c:76): callback
> > returned error status
> > [proxy:0:0@slurm-master] main
> > (/root/mpich-3.1.4/src/pm/hydra/pm/pmiserv/pmip.c:206): demux engine
> > error waiting for event
> > ^C[mpiexec@slurm-master] HYDT_bscu_wait_for_completion
> > (/root/mpich-3.1.4/src/pm/hydra/tools/bootstrap/utils/bscu_wait.c:76):
> > one of the processes terminated badly; aborting
> > [mpiexec@slurm-master] HYDT_bsci_wait_for_completion
> > (/root/mpich-3.1.4/src/pm/hydra/tools/bootstrap/src/bsci_wait.c:23):
> > launcher returned error waiting for completion
> > [mpiexec@slurm-master] HYD_pmci_wait_for_completion
> > (/root/mpich-3.1.4/src/pm/hydra/pm/pmiserv/pmiserv_pmci.c:218): launcher
> > returned error waiting for completion
> > [mpiexec@slurm-master] main
> > (/root/mpich-3.1.4/src/pm/hydra/ui/mpich/mpiexec.c:344): process manager
> > error waiting for completion
> >
> > ---
> > ---
> >
> > Everything up to this point seems to be OK; I get the same output with
> > serial applications.
> >
> > Then I try to restart:
> >
> > ---
> > ---
> >
> > APPLICATION
> > [slurm@slurm-master ~]$ sh /tmp/dmtcp_restart_script.sh
> > [21164] ERROR at coordinatorapi.cpp:514 in sendRecvHandshake;
> > REASON='JASSERT(msg.type == DMT_ACCEPT) failed'
> > dmtcp_restart (21164): Terminating...
> >
> > COORDINATOR
> >
> > [20427] NOTE at dmtcp_coordinator.cpp:1143 in
> > validateRestartingWorkerProcess; REASON='Computation not in RESTARTING or
> > CHECKPOINTED state.  Reject incoming computation process requesting
> > restart.'
> >      compId = 6db90f3d5a9dd200-40000-55536cf2
> >      hello_remote.compGroup = 6db90f3d5a9dd200-40000-55536cf2
> >      minimumState() = WorkerState::RUNNING
> > ---
> > ---
> >
> >
> >
> >
> >
> > --
> > Dr. Manuel Rodríguez-Pascual
> > skype: manuel.rodriguez.pascual
> > phone: (+34) 913466173 // (+34) 679925108
> >
> > CIEMAT-Moncloa
> > Edificio 22, desp. 1.25
> > Avenida Complutense, 40
> > 28040- MADRID
> > SPAIN
>
>


-- 
Dr. Manuel Rodríguez-Pascual
skype: manuel.rodriguez.pascual
phone: (+34) 913466173 // (+34) 679925108

CIEMAT-Moncloa
Edificio 22, desp. 1.25
Avenida Complutense, 40
28040- MADRID
SPAIN

_______________________________________________
Dmtcp-forum mailing list
Dmtcp-forum@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/dmtcp-forum
