Committed in r23587

:)

On Jul 31, 2010, at 12:51 PM, Joshua Hursey wrote:

> WHAT:
> Checkpoint/Restart-based automatic recovery and process migration, advanced 
> checkpoint storage, C/R-enabled debugging, MPI Extension API for C/R, and 
> some bug fixes.
> 
> WHY:
> This commit includes a variety of checkpoint/restart advancements that have 
> been pending on a temporary branch for a long while. Users have been waiting 
> on many of these bug fixes and advancements for a while now. More details 
> below.
> 
> WHERE:
>  http://bitbucket.org/jjhursey/ompi-cr-recos
> Last sync'ed to trunk in r23536 (July 31, 2010)
> 
> WHEN:
> Move into the trunk in the next two weeks. Then into the 1.5 series with the 
> ORTE refresh (Ticket #2471).
> 
> TIMEOUT:
> Aug 10, 2010 @ teleconf (commit at COB)
> 
> DOCUMENTATION
> Following public site will be fully updated upon commit:
>  http://osl.iu.edu/research/ft
> Temporary documentation site (will be taken down upon commit):
>  http://osl.iu.edu/~jjhursey/research/ft-www-preview
> Man page documentation will be updated soon.
> 
> ----------------------------------------------------------------------------
> The changes may seem large but are isolated to the C/R components and 
> frameworks except where they are wired into the infrastructure.
> 
> This commit brings in a variety of pending features and bug fixes that have 
> been accumulating over the past 8-12 months. Highlights are below (full 
> change log at bottom):
> * Added C/R-enabled Debugging Support
> * Added a Stable Storage framework for advanced checkpoint storage techniques
> * Added checkpoint caching and compression support
> * Added two C/R-based recovery policies
>   * C/R-based Process Migration (API and ompi-migrate tool activated)
>   * C/R-based Automatic Recovery
> * Added a variety of C/R MPI Extensions functions (e.g., Checkpoint, Restart, 
> Migrate)
> * Added C/R progress meters to File Movement (FileM), Stable Storage 
> (SStore), and Snapshot Coordination (SnapC) frameworks
> 
> While this RFC is pending I plan to clean up the man page documentation for 
> these features and update copyrights in the code base.
> 
> 
> 
> Change Log:
> -----------
> Major Changes:
> --------------
> * Added C/R-enabled Debugging support.
>   Enabled with the --enable-crdebug flag. See the following website for more 
> information:
>   http://osl.iu.edu/research/ft/crdebug/
> * Added Stable Storage (SStore) framework for checkpoint storage
>   * 'central' component does a direct to central storage save
>   * 'stage' component stages checkpoints to central storage while the 
> application continues execution.
>     * 'stage' supports offline compression of checkpoints before moving 
> (sstore_stage_compress)
>     * 'stage' supports local caching of checkpoints to improve automatic 
> recovery (sstore_stage_caching)
> * Added Compression (compress) framework to support
> * Add two new ErrMgr recovery policies
>   * {{{crmig}}} C/R Process Migration
>   * {{{autor}}} C/R Automatic Recovery
> * Added the {{{ompi-migrate}}} command line tool to support the {{{crmig}}} 
> ErrMgr component
> * Added CR MPI Ext functions (enable them with {{{--enable-mpi-ext=cr}}} 
> configure option)
>   * {{{OMPI_CR_Checkpoint}}} (Fixes #2342)
>   * {{{OMPI_CR_Restart}}}
>   * {{{OMPI_CR_Migrate}}} (may need some more work for mapping rules)
>   * {{{OMPI_CR_INC_register_callback}}} (Fixes #2192)
>   * {{{OMPI_CR_Quiesce_start}}}
>   * {{{OMPI_CR_Quiesce_checkpoint}}}
>   * {{{OMPI_CR_Quiesce_end}}}
>   * {{{OMPI_CR_self_register_checkpoint_callback}}}
>   * {{{OMPI_CR_self_register_restart_callback}}}
>   * {{{OMPI_CR_self_register_continue_callback}}}
> * The ErrMgr predicted_fault() interface has been changed to take an 
> opal_list_t of ErrMgr defined types. This will allow us to better support a 
> wider range of fault prediction services in the future.
> * Add a progress meter to:
>   * FileM rsh (filem_rsh_process_meter)
>   * SnapC full (snapc_full_progress_meter)
>   * SStore stage (sstore_stage_progress_meter)
> * Added 2 new command line options to ompi-restart
>   * --showme : Display the full command line that would have been exec'ed.
>   * --mpirun_opts : Command line options to pass directly to mpirun. (Fixes 
> #2413)
> * Deprecated some MCA params:
>   * crs_base_snapshot_dir deprecated, use sstore_stage_local_snapshot_dir
>   * snapc_base_global_snapshot_dir deprecated, use 
> sstore_base_global_snapshot_dir
>   * snapc_base_global_shared deprecated, use sstore_stage_global_is_shared
>   * snapc_base_store_in_place deprecated, replaced with different components 
> of SStore
>   * snapc_base_global_snapshot_ref deprecated, use 
> sstore_base_global_snapshot_ref
>   * snapc_base_establish_global_snapshot_dir deprecated, never well supported
>   * snapc_full_skip_filem deprecated, use sstore_stage_skip_filem
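
For anyone updating an MCA parameter file, the deprecations listed above can be sketched as a lookup table. This is only an illustration under stated assumptions: `DEPRECATED_MCA_PARAMS` and `migrate_mca_params` are hypothetical names, not part of Open MPI, and the two parameters for which the change log names no one-to-one replacement are simply dropped with a warning.

```python
# Hypothetical helper, NOT part of Open MPI. The mapping is copied from the
# "Deprecated some MCA params" list in the change log above.
DEPRECATED_MCA_PARAMS = {
    "crs_base_snapshot_dir": "sstore_stage_local_snapshot_dir",
    "snapc_base_global_snapshot_dir": "sstore_base_global_snapshot_dir",
    "snapc_base_global_shared": "sstore_stage_global_is_shared",
    "snapc_base_global_snapshot_ref": "sstore_base_global_snapshot_ref",
    "snapc_full_skip_filem": "sstore_stage_skip_filem",
    # The change log names no one-to-one replacement for these two.
    "snapc_base_store_in_place": None,
    "snapc_base_establish_global_snapshot_dir": None,
}


def migrate_mca_params(params):
    """Translate a dict of MCA param name -> value to the new names.

    Returns (migrated, warnings). Parameters that are not deprecated pass
    through unchanged, and an explicitly set new-style name always wins
    over a value carried across from its deprecated equivalent.
    """
    migrated = {}
    warnings = []
    for name, value in params.items():
        if name not in DEPRECATED_MCA_PARAMS:
            migrated[name] = value
            continue
        replacement = DEPRECATED_MCA_PARAMS[name]
        if replacement is None:
            warnings.append(f"{name} is deprecated with no direct replacement; dropped")
        else:
            # setdefault keeps a value already given under the new name.
            migrated.setdefault(replacement, value)
            warnings.append(f"{name} is deprecated; use {replacement}")
    return migrated, warnings
```

Calling `migrate_mca_params({"crs_base_snapshot_dir": "/tmp/ckpt"})` would return the value under `sstore_stage_local_snapshot_dir` along with one warning string.
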
> 
> Minor Changes:
> --------------
> * Fixes #1924 : {{{ompi-restart}}} now recognizes path prefixed checkpoint 
> handles and does the right thing.
> * Fixes #2097 : {{{ompi-info}}} should now report all available CRS components
> * Fixes #2161 : Manual checkpoint movement. A user can 'mv' a checkpoint 
> directory from the original location to another and still restart from it.
> * Fixes #2208 : Honor various TMPDIR variables instead of forcing {{{/tmp}}}
> * Move {{{ompi_cr_continue_like_restart}}} to 
> {{{orte_cr_continue_like_restart}}} to be more flexible in where this should 
> be set.
> * opal_crs_base_metadata_write* functions have been moved to SStore to 
> support a wider range of metadata handling functionality.
> * Cleanup the CRS framework and components to work with the SStore framework.
> * Cleanup the SnapC framework and components to work with the SStore 
> framework (cleans up these code paths considerably).
> * Add 'quiesce' hook to CRCP for a future enhancement.
> * We now require a BLCR version that supports {{{cr_request_file()}}} or 
> {{{cr_request_checkpoint()}}} in order to make the code more maintainable. 
> Note that {{{cr_request_file}}} has been deprecated since 0.7.0, so we prefer 
> to use {{{cr_request_checkpoint()}}}.
> * Add optional application level INC callbacks (registered through the CR MPI 
> Ext interface).
> * Increase the {{{opal_cr_thread_sleep_wait}}} parameter to 1000 microseconds 
> to make the C/R thread less aggressive.
> * {{{opal-restart}}} now looks for cache directories before falling back on 
> stable storage when asked.
> * {{{opal-restart}}} also supports local decompression before restarting
> * {{{orte-checkpoint}}} now uses the SStore framework to work with the 
> metadata
> * {{{orte-restart}}} now uses the SStore framework to work with the metadata
> * Remove the {{{orte-restart}}} preload option. This was removed since the 
> user only needs to select the 'stage' component in order to support this 
> functionality.
> * Since the '-am' parameter is saved in the metadata, {{{ompi-restart}}} no 
> longer hard codes {{{-am ft-enable-cr}}}.
> * Fix {{{hnp}}} ErrMgr so that if a previous component in the stack has 
> 'fixed' the problem, then it should be skipped.
> * Make sure to decrement the number of 'num_local_procs' in the orted when 
> one goes away.
> * odls now checks the SStore framework to see if it needs to load any 
> checkpoint files before launching (to support 'stage'). This separates the 
> SStore logic from the --preload-[binary|files] options.
> * Add unique IDs to the named pipes established between the orted and the app 
> in SnapC. This is to better support migration and automatic recovery 
> activities.
> * Improve the checks for the 'already checkpointing' error path.
> * Add a recovery output timer, to show how long it takes to restart a job
> * Do a better job of cleaning up the old session directory on restart.
> * Add a local module to the autor and crmig ErrMgr components. These small 
> modules prevent the 'orted' component from attempting a local recovery (which 
> does not work for MPI apps at the moment).
> * Add a fix for bounding the checkpointable region between MPI_Init and 
> MPI_Finalize. 
> 
> 
> _______________________________________________
> devel mailing list
> de...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/devel

