Yes, we were trying to give some background on the project and use consistent branding. Our package is called DMTCP, and it includes two components: DMTCP (a distributed checkpointer) and MTCP (a single-process checkpointer, which can be used both standalone and internally by DMTCP).
This RFC is for a CRS module that uses only the MTCP component.

-- Alex

Josh Hursey <jjhur...@open-mpi.org> wrote:

>From what I have seen during development, this RFC integrates the MTCP
>single process checkpointer into the C/R infrastructure of Open MPI.
>The MTCP component of the DMTCP project can be used in isolation,
>which is what they are integrating. So they can use DMTCP to
>checkpoint/restart an unmodified Open MPI, but only over certain
>networks. By integrating the MTCP checkpointer as a CRS component they
>use Open MPI to coordinate across processes, and gain support for a
>larger number of networks (e.g., IB, MX).
>
>Alex, does that sound about right?
>
>-- Josh
>
>
>On Thu, Oct 6, 2011 at 4:33 PM, George Bosilca <bosi...@eecs.utk.edu> wrote:
>> Alex,
>>
>> It looks like there is a mismatch between what you propose to achieve and
>> the text in your RFC. You propose to add a new single-process
>> checkpoint-restart mechanism (MTCP), to the ones already provided in Open
>> MPI. However, most of the text in your RFC is about DMTCP, which is another
>> layer on top of MTCP capable of checkpoint/restarting distributed
>> applications.
>>
>> I would like to understand what this RFC is really about: MTCP or DMTCP?
>>
>> george.
>>
>> On Oct 6, 2011, at 02:58 , Alex Brick wrote:
>>
>>> WHAT: Bring in the mtcp CRS component
>>>
>>> WHY: Add support for the MTCP checkpoint/restart service
>>>
>>> WHERE: opal/mca/crs/mtcp
>>>
>>> TIMEOUT: Tuesday teleconf, 2011-10-18 (about 2 weeks from now)
>>>
>>> -------------------------------------------
>>> What is MTCP?
>>>
>>> DMTCP (Distributed MultiThreaded CheckPointing,
>>> http://dmtcp.sourceforge.net) is a mature open source (LGPL) checkpointing
>>> package that has been under development for seven years. It operates
>>> entirely in user space, with no kernel modules, or modifications to the
>>> target application.
>>> If used in the simplest possible way, it works as:
>>>
>>>   dmtcp_checkpoint ./a.out
>>>   dmtcp_command --checkpoint
>>>   dmtcp_restart ckpt_a.out_*.dmtcp
>>>
>>> DMTCP is contagious. Any calls to fork(), pthread_create(), or "ssh",
>>> are recognized by DMTCP, and it maintains those threads, and local and
>>> remote processes under checkpoint control. At checkpoint time, it also
>>> generates a script, dmtcp_restart_script.sh, that can restart a distributed
>>> computation. As a sign of its maturity, it can also checkpoint Open MPI
>>> "from on top": dmtcp_checkpoint mpirun hello_mpi
>>>
>>> The MTCP component of DMTCP is the single-process component. It is used
>>> both internally by DMTCP as well as directly by users only interested in
>>> checkpointing a single process. This second feature was used in order to
>>> develop an Open MPI module for the Open MPI checkpoint-restart service
>>> similar to BLCR, except that no kernel modules are required.
>>>
>>> DMTCP is currently a Debian package (Debian testing), and is planned also
>>> for Fedora and openSUSE. These packages also provide the MTCP component
>>> for Open MPI.
>>>
>>> -------------------------------------------
>>> More details:
>>>
>>> Open MPI MTCP integration implementation available at:
>>>
>>> https://bitbucket.org/jsquyres/ompi-dmtcp2
>>>
>>> The DMTCP parent project website is below:
>>>
>>> http://dmtcp.sourceforge.net/
>>>
>>> The Distributed MultiThreaded CheckPointing (DMTCP) Project supports
>>> user-level, transparent checkpoint/restart of a variety of sequential and
>>> parallel programs. In Open MPI terms, this contribution is an alternative
>>> to the BLCR CRS module, meaning that users can use DMTCP to checkpoint
>>> their applications instead of BLCR.
>>>
>>> The MTCP component is currently restricted to supporting communication over
>>> sockets and shared memory.
>>> In an effort to support a wider range of
>>> networks (e.g., InfiniBand, Myrinet), they have created a CRS component to
>>> hook into Open MPI's checkpoint/restart infrastructure. The MTCP user-level
>>> checkpoint/restart service is the single-process checkpoint kernel of the
>>> DMTCP project. The MTCP kernel is what is used in the mtcp CRS component.
>>>
>>> Jeff Squyres and Josh Hursey have been working with the DMTCP authors
>>> (based at Northeastern University in Boston, MA, USA) for quite
>>> a while and feel that this component is ready to be brought into the Open
>>> MPI main line for inclusion in the 1.7.x series (and possibly the 1.5.x
>>> series?). The authors have submitted OMPI 3rd party contribution
>>> agreements.
>>> _______________________________________________
>>> devel mailing list
>>> de...@open-mpi.org
>>> http://www.open-mpi.org/mailman/listinfo.cgi/devel
>>
>>
>
>
>
>--
>Joshua Hursey
>Postdoctoral Research Associate
>Oak Ridge National Laboratory
>http://users.nccs.gov/~jjhursey