Thanks Alex. Can you answer George's other question about "hand waving"?

On Oct 7, 2011, at 3:59 PM, Alex Brick wrote:

> Yes, we were trying to give some background on the project and use consistent
> branding. Our package is called DMTCP, which includes two components: DMTCP
> (a distributed checkpointer), and MTCP (a single process checkpointer, which
> can be used both standalone and internally by DMTCP).
>
> This RFC is for a CRS module that uses only the MTCP component.
>
>
> -- Alex
>
> Josh Hursey <jjhur...@open-mpi.org> wrote:
>
>> From what I have seen during development, this RFC integrates the MTCP
>> single process checkpointer into the C/R infrastructure of Open MPI.
>> The MTCP component of the DMTCP project can be used in isolation,
>> which is what they are integrating. So they can use DMTCP to
>> checkpoint/restart an unmodified Open MPI, but only over certain
>> networks. By integrating the MTCP checkpointer as a CRS component they
>> use Open MPI to coordinate across processes, and gain support for a
>> larger number of networks (e.g., IB, MX).
>>
>> Alex, does that sound about right?
>>
>> -- Josh
>>
>>
>> On Thu, Oct 6, 2011 at 4:33 PM, George Bosilca <bosi...@eecs.utk.edu> wrote:
>>> Alex,
>>>
>>> It looks like there is a mismatch between what you propose to achieve and
>>> the text in your RFC. You propose to add a new single-process
>>> checkpoint-restart mechanism (MTCP), to the ones already provided in Open
>>> MPI. However, most of the text in your RFC is about DMTCP, which is another
>>> layer on top of MTCP capable of checkpoint/restarting distributed
>>> application.
>>>
>>> I would like to understand what this RFC is really about: MTCP or DMTCP?
>>>
>>> george.
>>>
>>> On Oct 6, 2011, at 02:58 , Alex Brick wrote:
>>>
>>>> WHAT: Bring in the mtcp CRS component
>>>>
>>>> WHY: Add support for the MTCP checkpoint/restart service
>>>>
>>>> WHERE: opal/mca/crs/mtcp
>>>>
>>>> TIMEOUT: Tuesday teleconf, 2011-10-18 (about 2 weeks from now)
>>>>
>>>> -------------------------------------------
>>>> What is MTCP?
>>>>
>>>> DMTCP (Distributed MultiThreaded CheckPointing,
>>>> http://dmtcp.sourceforge.net) is a mature open source (LGPL) checkpointing
>>>> package that has been under development for seven years. It operates
>>>> entirely in user space, with no kernel modules, or modifications to the
>>>> target application. If used in the simplest possible way, it works as:
>>>>
>>>>   dmtcp_checkpoint ./a.out
>>>>   dmtcp_command --checkpoint
>>>>   dmtcp_restart ckpt_a.out_*.dmtcp
>>>>
>>>> DMTCP is contagious. Any calls to fork(), pthread_create(), or "ssh",
>>>> are recognized by DMTCP, and it maintains those threads, and local and
>>>> remote processes under checkpoint control. At checkpoint time, it also
>>>> generates a script, dmtcp_restart_script.sh, that can restart a
>>>> distributed computation. As a sign of its maturity, it can also
>>>> checkpoint Open MPI "from on top": dmtcp_checkpoint mpirun hello_mpi
>>>>
>>>> The MTCP component of DMTCP is the single-process component. It is used
>>>> both internally by DMTCP as well as directly by users only interested in
>>>> checkpointing a single process. This second feature was used in order to
>>>> develop an Open MPI module for the Open MPI checkpoint-restart service
>>>> similar to BLCR, except that no kernel modules are required.
>>>>
>>>> DMTCP is currently a Debian package (Debian testing), and is planned also
>>>> for Fedora and openSuSe. These packages also provide the MTCP component
>>>> for Open MPI.
>>>>
>>>> -------------------------------------------
>>>> More details:
>>>>
>>>> Open MPI MTCP integration implementation available at:
>>>>
>>>> https://bitbucket.org/jsquyres/ompi-dmtcp2
>>>>
>>>> The DMTCP parent project website is below:
>>>>
>>>> http://dmtcp.sourceforge.net/
>>>>
>>>> The Distributed MultiThreaded CheckPointing (DMTCP) Project supports
>>>> user-level, transparent checkpoint/restart of a variety of sequential and
>>>> parallel programs. In Open MPI terms, this contribution is an alternative
>>>> to the BLCR CRS module, meaning that users can use DMTCP to checkpoint
>>>> their applications instead of BLCR.
>>>>
>>>> The MTCP component is currently restricted to supporting communication
>>>> over sockets and shared memory. In an effort to support a wider range of
>>>> networks (e.g., InfiniBand, Myrinet), they have created a CRS component to
>>>> hook into Open MPI's checkpoint/restart infrastructure. The MTCP
>>>> user-level checkpoint/restart service is the single process checkpoint
>>>> kernel of the DMTCP project. The MTCP kernel is what is used in the mtcp
>>>> CRS component.
>>>>
>>>> Jeff Squyres and Josh Hursey have been working with the DMTCP authors
>>>> (based out of the US Northeastern University in Boston, MA, USA) for quite
>>>> a while and feel that this component is ready to be brought into the Open
>>>> MPI main line for inclusion in the 1.7.x series (and possibly the 1.5.x
>>>> series?). The authors have submitted OMPI 3rd party contribution
>>>> agreements.
>>>> _______________________________________________
>>>> devel mailing list
>>>> de...@open-mpi.org
>>>> http://www.open-mpi.org/mailman/listinfo.cgi/devel
>>
>> --
>> Joshua Hursey
>> Postdoctoral Research Associate
>> Oak Ridge National Laboratory
>> http://users.nccs.gov/~jjhursey

--
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/