From what I have seen during development, this RFC integrates the MTCP single-process checkpointer into the C/R infrastructure of Open MPI. The MTCP component of the DMTCP project can be used in isolation, which is what they are integrating. So they can use DMTCP to checkpoint/restart an unmodified Open MPI, but only over certain networks. By integrating the MTCP checkpointer as a CRS component they use Open MPI to coordinate across processes, and gain support for a larger number of networks (e.g., IB, MX).
Alex, does that sound about right?

-- Josh

On Thu, Oct 6, 2011 at 4:33 PM, George Bosilca <bosi...@eecs.utk.edu> wrote:
> Alex,
>
> It looks like there is a mismatch between what you propose to achieve and the
> text in your RFC. You propose to add a new single-process checkpoint-restart
> mechanism (MTCP), to the ones already provided in Open MPI. However, most of
> the text in your RFC is about DMTCP, which is another layer on top of MTCP
> capable of checkpoint/restarting distributed application.
>
> I would like to understand what this RFC is really about: MTCP or DMTCP?
>
> george.
>
> On Oct 6, 2011, at 02:58 , Alex Brick wrote:
>
>> WHAT: Bring in the mtcp CRS component
>>
>> WHY: Add support for the MTCP checkpoint/restart service
>>
>> WHERE: opal/mca/crs/mtcp
>>
>> TIMEOUT: Tuesday teleconf, 2011-10-18 (about 2 weeks from now)
>>
>> -------------------------------------------
>> What is MTCP?
>>
>> DMTCP (Distributed MultiThreaded CheckPointing,
>> http://dmtcp.sourceforge.net) is a mature open source (LGPL) checkpointing
>> package that has been under development for seven years. It operates
>> entirely in user space, with no kernel modules, or modifications to the
>> target application. If used in the simplest possible way, it works as:
>>
>>   dmtcp_checkpoint ./a.out
>>   dmtcp_command --checkpoint
>>   dmtcp_restart ckpt_a.out_*.dmtcp
>>
>> DMTCP is contagious. Any calls to fork(), pthread_create(), or "ssh",
>> are recognized by DMTCP, and it maintains those threads, and local and
>> remote processes under checkpoint control. At checkpoint time, it also
>> generates a script, dmtcp_restart_script.sh, that can restart a distributed
>> computation. As a sign of its maturity, it can also checkpoint Open MPI
>> "from on top": dmtcp_checkpoint mpirun hello_mpi
>>
>> The MTCP component of DMTCP is the single-process component. It is used
>> both internally by DMTCP as well as directly by users only interested in
>> checkpointing a single process. This second feature was used in order to
>> develop an Open MPI module for the Open MPI checkpoint-restart service
>> similar to BLCR, except that no kernel modules are required.
>>
>> DMTCP is currently a Debian package (Debian testing), and is planned also
>> for Fedora and openSuSe. These packages also provide the MTCP component for
>> Open MPI.
>>
>> -------------------------------------------
>> More details:
>>
>> Open MPI MTCP integration implementation available at:
>>
>> https://bitbucket.org/jsquyres/ompi-dmtcp2
>>
>> The DMTCP parent project website is below:
>>
>> http://dmtcp.sourceforge.net/
>>
>> The Distributed MultiThreaded CheckPointing (DMTCP) Project supports
>> user-level, transparent checkpoint/restart of a variety of sequential and
>> parallel programs. In Open MPI terms, this contribution is an alternative
>> to the BLCR CRS module, meaning that users can use DMTCP to checkpoint their
>> applications instead of BLCR.
>>
>> The MTCP component is currently restricted to supporting communication over
>> sockets and shared memory. In an effort to support a wider range of
>> networks (e.g., InfiniBand, Myrinet), they have created a CRS component to
>> hook into Open MPI's checkpoint/restart infrastructure. The MTCP user-level
>> checkpoint/restart service is the single process checkpoint kernel of the
>> DMTCP project. The MTCP kernel is what is used in the mtcp CRS component.
>>
>> Jeff Squyres and Josh Hursey have been working with the DMTCP authors (based
>> out of the US Northeastern University in Boston, MA, USA) for quite a while
>> and feel that this component is ready to be brought into the Open MPI main
>> line for inclusion in the 1.7.x series (and possibly the 1.5.x series?).
>> The authors have submitted OMPI 3rd party contribution agreements.
>> _______________________________________________
>> devel mailing list
>> de...@open-mpi.org
>> http://www.open-mpi.org/mailman/listinfo.cgi/devel
>

-- 
Joshua Hursey
Postdoctoral Research Associate
Oak Ridge National Laboratory
http://users.nccs.gov/~jjhursey