From what I have seen during development, this RFC integrates the MTCP
single-process checkpointer into the C/R infrastructure of Open MPI.
The MTCP component of the DMTCP project can be used in isolation,
which is what they are integrating. So they can use DMTCP to
checkpoint/restart an unmodified Open MPI, but only over certain
networks. By integrating the MTCP checkpointer as a CRS component they
use Open MPI to coordinate across processes, and gain support for a
larger number of networks (e.g., IB, MX).

Alex, does that sound about right?

-- Josh


On Thu, Oct 6, 2011 at 4:33 PM, George Bosilca <bosi...@eecs.utk.edu> wrote:
> Alex,
>
> It looks like there is a mismatch between what you propose to achieve and the
> text in your RFC. You propose to add a new single-process checkpoint-restart
> mechanism (MTCP) to the ones already provided in Open MPI. However, most of
> the text in your RFC is about DMTCP, which is another layer on top of MTCP
> capable of checkpointing and restarting distributed applications.
>
> I would like to understand what this RFC is really about: MTCP or DMTCP?
>
>  george.
>
> On Oct 6, 2011, at 02:58 , Alex Brick wrote:
>
>> WHAT: Bring in the mtcp CRS component
>>
>> WHY: Add support for the MTCP checkpoint/restart service
>>
>> WHERE: opal/mca/crs/mtcp
>>
>> TIMEOUT: Tuesday teleconf, 2011-10-18 (about 2 weeks from now)
>>
>> -------------------------------------------
>> What is MTCP?
>>
>> DMTCP (Distributed MultiThreaded CheckPointing, 
>> http://dmtcp.sourceforge.net) is a mature open source (LGPL) checkpointing 
>> package that has been under development for seven years. It operates
>> entirely in user space, with no kernel modules and no modifications to the
>> target application.  If used in the simplest possible way, it works as follows:
>>
>> dmtcp_checkpoint ./a.out
>> dmtcp_command --checkpoint
>> dmtcp_restart ckpt_a.out_*.dmtcp
>>
>> DMTCP is contagious.  Any calls to fork(), pthread_create(), or "ssh"
>> are recognized by DMTCP, which keeps the resulting threads and the local and
>> remote processes under checkpoint control.  At checkpoint time, it also
>> generates a script, dmtcp_restart_script.sh, that can restart a distributed 
>> computation.  As a sign of its maturity, it can also checkpoint Open MPI 
>> "from on top":  dmtcp_checkpoint mpirun hello_mpi
>>
>> The MTCP component of DMTCP is the single-process component.  It is used
>> both internally by DMTCP as well as directly by users only interested in
>> checkpointing a single process.  This second feature was used in order to 
>> develop an Open MPI module for the Open MPI checkpoint-restart service 
>> similar to BLCR, except that no kernel modules are required.
>>
>> DMTCP is currently a Debian package (Debian testing), and is also planned
>> for Fedora and openSUSE.  These packages also provide the MTCP component for
>> Open MPI.
>>
>> -------------------------------------------
>> More details:
>>
>> Open MPI MTCP integration implementation available at:
>>
>>  https://bitbucket.org/jsquyres/ompi-dmtcp2
>>
>> The DMTCP parent project website is below:
>>
>>  http://dmtcp.sourceforge.net/
>>
>> The Distributed MultiThreaded CheckPointing (DMTCP) Project supports 
>> user-level, transparent checkpoint/restart of a variety of sequential and 
>> parallel programs.  In Open MPI terms, this contribution is an alternative 
>> to the BLCR CRS module, meaning that users can use DMTCP to checkpoint their 
>> applications instead of BLCR.
>>
>> The MTCP component is currently restricted to supporting communication over 
>> sockets and shared memory.  In an effort to support a wider range of 
>> networks (e.g., InfiniBand, Myrinet), they have created a CRS component to 
>> hook into Open MPI's checkpoint/restart infrastructure. The MTCP user-level 
>> checkpoint/restart service is the single-process checkpoint kernel of the
>> DMTCP project.  The MTCP kernel is what is used in the mtcp CRS component.
>>
>> Jeff Squyres and Josh Hursey have been working with the DMTCP authors (based
>> at Northeastern University in Boston, MA, USA) for quite a while
>> and feel that this component is ready to be brought into the Open MPI main 
>> line for inclusion in the 1.7.x series (and possibly the 1.5.x series?).  
>> The authors have submitted OMPI 3rd party contribution agreements.
>> _______________________________________________
>> devel mailing list
>> de...@open-mpi.org
>> http://www.open-mpi.org/mailman/listinfo.cgi/devel



-- 
Joshua Hursey
Postdoctoral Research Associate
Oak Ridge National Laboratory
http://users.nccs.gov/~jjhursey
