Terry -- 

Please add this to the agenda for Oct 18.  I'd like to invite Alex and his 
advisor to the Oct 18 teleconf to discuss.


On Oct 6, 2011, at 2:58 AM, Alex Brick wrote:

> WHAT: Bring in the mtcp CRS component
> 
> WHY: Add support for the MTCP checkpoint/restart service
> 
> WHERE: opal/mca/crs/mtcp
> 
> TIMEOUT: Tuesday teleconf, 2011-10-18 (about 2 weeks from now)
> 
> -------------------------------------------
> What is MTCP?
> 
> DMTCP (Distributed MultiThreaded CheckPointing, http://dmtcp.sourceforge.net) 
> is a mature open source (LGPL) checkpointing package that has been under 
> development for seven years. It operates entirely in user space, with no 
> kernel modules and no modifications to the target application.  In the 
> simplest case, it is used as follows:
> 
> dmtcp_checkpoint ./a.out
> dmtcp_command --checkpoint
> dmtcp_restart ckpt_a.out_*.dmtcp
> 
> DMTCP is contagious.  Any calls to fork(), pthread_create(), or "ssh" are
> recognized by DMTCP, which keeps the resulting threads and the local and
> remote processes under checkpoint control.  At checkpoint time, it also
> generates a script, dmtcp_restart_script.sh, that can restart a distributed 
> computation.  As a sign of its maturity, it can also checkpoint Open MPI 
> "from on top":  dmtcp_checkpoint mpirun hello_mpi
> 
> The MTCP component of DMTCP is the single-process component.  It is used
> both internally by DMTCP and directly by users interested only in
> checkpointing a single process.  This second feature was used to 
> develop an Open MPI module for the Open MPI checkpoint-restart service 
> similar to BLCR, except that no kernel modules are required.
> 
> DMTCP is currently a Debian package (Debian testing), and is also planned for 
> Fedora and openSUSE.  These packages also provide the MTCP component for Open 
> MPI.
> 
> -------------------------------------------
> More details:
> 
> Open MPI MTCP integration implementation available at:
> 
>  https://bitbucket.org/jsquyres/ompi-dmtcp2
> 
> The DMTCP parent project website is below:
> 
>  http://dmtcp.sourceforge.net/
> 
> The Distributed MultiThreaded CheckPointing (DMTCP) Project supports 
> user-level, transparent checkpoint/restart of a variety of sequential and 
> parallel programs.  In Open MPI terms, this contribution is an alternative to 
> the BLCR CRS module, meaning that users can use DMTCP to checkpoint their 
> applications instead of BLCR.
> 
> The MTCP component is currently restricted to supporting communication over 
> sockets and shared memory.  In an effort to support a wider range of networks 
> (e.g., InfiniBand, Myrinet), the DMTCP authors have created a CRS component to 
> hook into Open MPI's checkpoint/restart infrastructure.  The MTCP user-level 
> checkpoint/restart service is the single-process checkpoint kernel of the 
> DMTCP project, and it is this kernel that the mtcp CRS component uses.
> 
> Jeff Squyres and Josh Hursey have been working with the DMTCP authors (based 
> at Northeastern University in Boston, MA, USA) for quite a while 
> and feel that this component is ready to be brought into the Open MPI main 
> line for inclusion in the 1.7.x series (and possibly the 1.5.x series?).  The 
> authors have submitted OMPI 3rd party contribution agreements.
> _______________________________________________
> devel mailing list
> de...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/devel


-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/

