Ken,

At UTK we focus on developing two generic frameworks for scalable 
fault-tolerant approaches: one based on uncoordinated checkpoint/restart, the 
other on application-level fault handling.

1) Uncoordinated C/R based on message logging. Such approaches are fully 
automatic, rely on an external checkpoint/restart mechanism (currently BLCR), 
and do not require any synchronization. A process restarts independently and 
catches up with the others; during its recovery, the other processes continue 
their execution undisturbed. To our knowledge, the framework developed by UTK 
is currently used by two other teams to implement different uncoordinated 
mechanisms.

Redesigning the Message Logging Model for High Performance, Aurelien 
Bouteiller, G. Bosilca, and J. Dongarra, accepted in Concurrency and 
Computation: Practice and Experience, January 2010 
(http://www.netlib.org/netlib/utk/people/JackDongarra/PAPERS/isc-cppe-final.pdf)

Dodging the Cost of Unavoidable Memory Copies in Message Logging Protocols, 
George Bosilca, Aurelien Bouteiller, Thomas Herault, Pierre Lemarinier, and 
Jack Dongarra, Euro MPI 2010 
(http://icl.cs.utk.edu/news_pub/submissions/hpc-ml.pdf)

Reasons to be Pessimist or Optimist for Failure Recovery in High Performance 
Clusters, Aurelien Bouteiller, Thomas Ropars, George Bosilca, Christine Morin, 
and Jack Dongarra, Cluster 2009 
(http://www.netlib.org/netlib/utk/people/JackDongarra/PAPERS/msglog.final.pdf)
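For readers unfamiliar with message logging, the replay idea behind this kind 
of recovery can be sketched in a few lines of Python. This is only an 
illustration with hypothetical names, not part of the UTK framework: each 
process logs the messages it receives, so that after an independent restart it 
can replay its log deterministically while the surviving processes keep 
running undisturbed.

```python
# Toy sketch of uncoordinated recovery via message logging (hypothetical
# names; real implementations log inside the MPI library, not the app).
# Receptions are the nondeterministic events: logging them lets a restarted
# process replay its past without involving the other processes.

class LoggedChannel:
    """Wraps a message source and logs receptions for deterministic replay."""
    def __init__(self):
        self.log = []   # payloads in reception order
        self.pos = 0    # number of messages delivered so far

    def recv(self, live_source):
        if self.pos < len(self.log):      # recovering: replay from the log
            msg = self.log[self.pos]
        else:                             # normal run: real reception, logged
            msg = next(live_source)
            self.log.append(msg)
        self.pos += 1
        return msg

    def restart(self):
        """Simulate an independent restart: rewind and rebuild from the log."""
        self.pos = 0


def compute(channel, source, n):
    """A deterministic computation driven entirely by received messages."""
    total = 0
    for _ in range(n):
        total += channel.recv(source)
    return total
```

After a restart, `compute` reaches the same state from the log alone: the live 
source is never consulted, which is why the other processes need not roll back 
or resend anything.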

2) Application level. We developed a framework allowing distinct 
application-level responses to faults. In other words, the error is reported 
up to the application level, which becomes responsible for handling it. The 
"still alive" processes in the MPI application, as well as the whole runtime 
system, remain fully functional and can continue their work without 
interruption. On top of this generic framework we implemented an approach very 
similar to FT-MPI, with some additions (such as support for the MPI 2.0 
standard).

Extending the MPI Specification for Process Fault Tolerance on High Performance 
Computing Systems, Graham E. Fagg, Edgar Gabriel, George Bosilca, Thara 
Angskun, Zizhong Chen, Jelena Pjesivac-Grbovic, Kevin London and Jack J. 
Dongarra, Proceedings of the ISC2004 meeting Heidelberg, June 23, 2004. 
(http://www.netlib.org/utk/people/JackDongarra/PAPERS/isc2004-FT-MPI.pdf)
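To make the "error reported up to the application" idea concrete, here is a 
toy Python sketch in the FT-MPI spirit. All names are hypothetical; a real MPI 
program would instead install an error handler such as MPI_ERRORS_RETURN on 
the communicator. The point is only the control flow: a failed peer surfaces 
as an error the application can catch, and the survivors shrink the group and 
keep working.

```python
# Toy sketch of application-level fault handling (hypothetical names).
# Instead of aborting the job, a peer failure is reported to the caller,
# which decides how to continue -- here, by dropping the failed rank.

class PeerFailed(Exception):
    """Raised when a communication targets a dead process."""
    def __init__(self, rank):
        super().__init__(f"peer {rank} failed")
        self.rank = rank

class Communicator:
    """Reports peer failures to the application instead of crashing it."""
    def __init__(self, ranks):
        self.alive = set(ranks)

    def send(self, dest, payload):
        if dest not in self.alive:
            raise PeerFailed(dest)   # error surfaces to the application
        return payload

    def fail(self, rank):
        """Simulate the failure of one process."""
        self.alive.discard(rank)

def broadcast(comm, ranks, payload):
    """Application-level response: drop failed peers and keep going."""
    delivered = []
    for r in list(ranks):
        try:
            comm.send(r, payload)
            delivered.append(r)
        except PeerFailed as e:
            ranks.remove(e.rank)     # shrink the group, continue working
    return delivered
```

The surviving ranks complete the operation without interruption; only the 
group membership changes, which mirrors the shrink-and-continue recovery style 
that FT-MPI made available to applications.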

Hope this helps,
  george.


On Apr 22, 2011, at 15:03 , Joshua Hursey wrote:

> 
> On Apr 22, 2011, at 1:20 PM, N.M. Maclaren wrote:
> 
>> On Apr 22 2011, Ralph Castain wrote:
>> 
>>> Several of us are. Josh and George (plus teammates), and some other outside 
>>> folks, are working the MPI side of it.
>>> 
>>> I'm working only the ORTE side of the problem.
>>> 
>>> Quite a bit of capability is already in the trunk, but there is always more 
>>> to do :-)
>> 
>> Is there a specification of what objectives are covered by 'fault-tolerant'?
> 
> We do not really have a website to point folks to at the moment. Some of the 
> existing and planned functionality for Open MPI has been announced and 
> documented, but not uniformly or in a central place. We have a developers 
> meeting in a couple weeks and this is a topic I am planning on covering:
> am planning on covering:
>  https://svn.open-mpi.org/trac/ompi/wiki/May11Meeting
> Once something is available, we'll post to the users/developers lists so that 
> people know where to look for developments.
> 
> I am responsible for two fault tolerance features in Open MPI: 
> Checkpoint/Restart and MPI Forum's Fault Tolerance Working Group proposals. 
> The Checkpoint/Restart support is documented here:
>  http://osl.iu.edu/research/ft/ompi-cr/
> 
> Most of my attention is focused on the MPI Forum's Fault Tolerance Working 
> Group proposals, which aim to enable fault-tolerant applications to be 
> developed on top of MPI (i.e., non-transparent fault tolerance). The Open MPI 
> prototype is not yet publicly available, but soon. Information about the 
> semantics and interfaces of that project can be found at the links below:
>  https://svn.mpi-forum.org/trac/mpi-forum-web/wiki/FaultToleranceWikiPage
>  https://svn.mpi-forum.org/trac/mpi-forum-web/wiki/ft/run_through_stabilization
> 
> That is what I have been up to regarding fault tolerance. Others can probably 
> elaborate on what they are working on if they wish.
> 
> -- Josh
> 
>> 
>> Regards,
>> Nick Maclaren.
>> 
>> _______________________________________________
>> devel mailing list
>> de...@open-mpi.org
>> http://www.open-mpi.org/mailman/listinfo.cgi/devel
> 
> 

George Bosilca
Research Assistant Professor
Innovative Computing Laboratory
Department of Electrical Engineering and Computer Science
University of Tennessee, Knoxville
http://web.eecs.utk.edu/~bosilca/

