Terry -- Please add this to the agenda for Oct 18. I'd like to invite Alex and his advisor to the Oct 18 teleconf to discuss.

On Oct 6, 2011, at 2:58 AM, Alex Brick wrote:

> WHAT: Bring in the mtcp CRS component
>
> WHY: Add support for the MTCP checkpoint/restart service
>
> WHERE: opal/mca/crs/mtcp
>
> TIMEOUT: Tuesday teleconf, 2011-10-18 (about 2 weeks from now)
>
> -------------------------------------------
> What is MTCP?
>
> DMTCP (Distributed MultiThreaded CheckPointing, http://dmtcp.sourceforge.net)
> is a mature open source (LGPL) checkpointing package that has been under
> development for seven years. It operates entirely in user space, with no
> kernel modules and no modifications to the target application. If used in
> the simplest possible way, it works as:
>
>   dmtcp_checkpoint ./a.out
>   dmtcp_command --checkpoint
>   dmtcp_restart ckpt_a.out_*.dmtcp
>
> DMTCP is contagious. Any calls to fork(), pthread_create(), or "ssh"
> are recognized by DMTCP, and it keeps the resulting threads and local and
> remote processes under checkpoint control. At checkpoint time, it also
> generates a script, dmtcp_restart_script.sh, that can restart a distributed
> computation. As a sign of its maturity, it can also checkpoint Open MPI
> "from on top":
>
>   dmtcp_checkpoint mpirun hello_mpi
>
> The MTCP component of DMTCP is the single-process component. It is used
> both internally by DMTCP and directly by users who are only interested in
> checkpointing a single process. This second mode of use was the basis for
> developing an Open MPI module for the Open MPI checkpoint-restart service
> similar to BLCR, except that no kernel modules are required.
>
> DMTCP is currently a Debian package (Debian testing), and is also planned
> for Fedora and openSUSE. These packages also provide the MTCP component for
> Open MPI.
>
> -------------------------------------------
> More details:
>
> Open MPI MTCP integration implementation available at:
>
> https://bitbucket.org/jsquyres/ompi-dmtcp2
>
> The DMTCP parent project website is below:
>
> http://dmtcp.sourceforge.net/
>
> The Distributed MultiThreaded CheckPointing (DMTCP) Project supports
> user-level, transparent checkpoint/restart of a variety of sequential and
> parallel programs. In Open MPI terms, this contribution is an alternative to
> the BLCR CRS module, meaning that users can use DMTCP to checkpoint their
> applications instead of BLCR.
>
> The MTCP component is currently restricted to supporting communication over
> sockets and shared memory. In an effort to support a wider range of networks
> (e.g., InfiniBand, Myrinet), the DMTCP authors have created a CRS component
> to hook into Open MPI's checkpoint/restart infrastructure. The MTCP
> user-level checkpoint/restart service is the single-process checkpoint
> kernel of the DMTCP project. The MTCP kernel is what is used in the mtcp CRS
> component.
>
> Jeff Squyres and Josh Hursey have been working with the DMTCP authors (based
> at Northeastern University in Boston, MA, USA) for quite a while and feel
> that this component is ready to be brought into the Open MPI main line for
> inclusion in the 1.7.x series (and possibly the 1.5.x series?). The authors
> have submitted OMPI 3rd party contribution agreements.

--
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/