Committed in r23587 :)
On Jul 31, 2010, at 12:51 PM, Joshua Hursey wrote:

> WHAT:
> Checkpoint/Restart-based automatic recovery and process migration, advanced
> checkpoint storage, C/R-enabled debugging, MPI Extension API for C/R, and
> some bug fixes.
>
> WHY:
> This commit includes a variety of checkpoint/restart advancements that have
> been pending on a temporary branch for a long while. Users have been waiting
> on many of these bug fixes and advancements for a while now. More details
> below.
>
> WHERE:
> http://bitbucket.org/jjhursey/ompi-cr-recos
> Last sync'ed to trunk in r23536 (July 31, 2010)
>
> WHEN:
> Move into the trunk in the next two weeks. Then into the 1.5 series with the
> ORTE refresh (Ticket #2471).
>
> TIMEOUT:
> Aug 10, 2010 @ teleconf (commit at COB)
>
> DOCUMENTATION:
> Following public site will be fully updated upon commit:
> http://osl.iu.edu/research/ft
> Temporary documentation site (will be taken down upon commit):
> http://osl.iu.edu/~jjhursey/research/ft-www-preview
> Man page documentation will be updated soon.
>
> ----------------------------------------------------------------------------
> The changes may seem large but are isolated to C/R components and
> frameworks except where they are wired into the infrastructure.
>
> This commit brings in a variety of pending features and bug fixes that have
> been accumulating over the past 8-12 months.
> Highlights are below (full change log at bottom):
> * Added C/R-enabled Debugging Support
> * Added a Stable Storage framework for advanced checkpoint storage techniques
> * Added checkpoint caching and compression support
> * Added two C/R-based recovery policies
>   * C/R-based Process Migration (API and ompi-migrate tool activated)
>   * C/R-based Automatic Recovery
> * Added a variety of C/R MPI Extensions functions (e.g., Checkpoint, Restart,
>   Migrate)
> * Added C/R progress meters to File Movement (FileM), Stable Storage
>   (SStore), and Snapshot Coordination (SnapC) frameworks
>
> While this RFC is pending I plan to clean up the man page documentation for
> these features and update copyrights in the code base.
>
> Change Log:
> -----------
> Major Changes:
> --------------
> * Added C/R-enabled Debugging support.
>   Enabled with the --enable-crdebug flag. See the following website for more
>   information:
>   http://osl.iu.edu/research/ft/crdebug/
> * Added Stable Storage (SStore) framework for checkpoint storage
>   * 'central' component does a direct to central storage save
>   * 'stage' component stages checkpoints to central storage while the
>     application continues execution.
>   * 'stage' supports offline compression of checkpoints before moving
>     (sstore_stage_compress)
>   * 'stage' supports local caching of checkpoints to improve automatic
>     recovery (sstore_stage_caching)
> * Added Compression (compress) framework to support
> * Add two new ErrMgr recovery policies
>   * {{{crmig}}} C/R Process Migration
>   * {{{autor}}} C/R Automatic Recovery
> * Added the {{{ompi-migrate}}} command line tool to support the {{{crmig}}}
>   ErrMgr component
> * Added CR MPI Ext functions (enable them with {{{--enable-mpi-ext=cr}}}
>   configure option)
>   * {{{OMPI_CR_Checkpoint}}} (Fixes #2342)
>   * {{{OMPI_CR_Restart}}}
>   * {{{OMPI_CR_Migrate}}} (may need some more work for mapping rules)
>   * {{{OMPI_CR_INC_register_callback}}} (Fixes #2192)
>   * {{{OMPI_CR_Quiesce_start}}}
>   * {{{OMPI_CR_Quiesce_checkpoint}}}
>   * {{{OMPI_CR_Quiesce_end}}}
>   * {{{OMPI_CR_self_register_checkpoint_callback}}}
>   * {{{OMPI_CR_self_register_restart_callback}}}
>   * {{{OMPI_CR_self_register_continue_callback}}}
> * The ErrMgr predicted_fault() interface has been changed to take an
>   opal_list_t of ErrMgr defined types. This will allow us to better support a
>   wider range of fault prediction services in the future.
> * Add a progress meter to:
>   * FileM rsh (filem_rsh_process_meter)
>   * SnapC full (snapc_full_progress_meter)
>   * SStore stage (sstore_stage_progress_meter)
> * Added 2 new command line options to ompi-restart
>   * --showme : Display the full command line that would have been exec'ed.
>   * --mpirun_opts : Command line options to pass directly to mpirun.
>     (Fixes #2413)
> * Deprecated some MCA params:
>   * crs_base_snapshot_dir deprecated, use sstore_stage_local_snapshot_dir
>   * snapc_base_global_snapshot_dir deprecated, use
>     sstore_base_global_snapshot_dir
>   * snapc_base_global_shared deprecated, use sstore_stage_global_is_shared
>   * snapc_base_store_in_place deprecated, replaced with different components
>     of SStore
>   * snapc_base_global_snapshot_ref deprecated, use
>     sstore_base_global_snapshot_ref
>   * snapc_base_establish_global_snapshot_dir deprecated, never well supported
>   * snapc_full_skip_filem deprecated, use sstore_stage_skip_filem
>
> Minor Changes:
> --------------
> * Fixes #1924 : {{{ompi-restart}}} now recognizes path prefixed checkpoint
>   handles and does the right thing.
> * Fixes #2097 : {{{ompi-info}}} should now report all available CRS components
> * Fixes #2161 : Manual checkpoint movement. A user can 'mv' a checkpoint
>   directory from the original location to another and still restart from it.
> * Fixes #2208 : Honor various TMPDIR variables instead of forcing {{{/tmp}}}
> * Move {{{ompi_cr_continue_like_restart}}} to
>   {{{orte_cr_continue_like_restart}}} to be more flexible in where this should
>   be set.
> * opal_crs_base_metadata_write* functions have been moved to SStore to
>   support a wider range of metadata handling functionality.
> * Cleanup the CRS framework and components to work with the SStore framework.
> * Cleanup the SnapC framework and components to work with the SStore
>   framework (cleans up these code paths considerably).
> * Add 'quiesce' hook to CRCP for a future enhancement.
> * We now require a BLCR version that supports {{{cr_request_file()}}} or
>   {{{cr_request_checkpoint()}}} in order to make the code more maintainable.
>   Note that {{{cr_request_file}}} has been deprecated since 0.7.0, so we prefer
>   to use {{{cr_request_checkpoint()}}}.
> * Add optional application level INC callbacks (registered through the CR MPI
>   Ext interface).
> * Increase the {{{opal_cr_thread_sleep_wait}}} parameter to 1000 microseconds
>   to make the C/R thread less aggressive.
> * {{{opal-restart}}} now looks for cache directories before falling back on
>   stable storage when asked.
> * {{{opal-restart}}} also supports local decompression before restarting
> * {{{orte-checkpoint}}} now uses the SStore framework to work with the
>   metadata
> * {{{orte-restart}}} now uses the SStore framework to work with the metadata
> * Remove the {{{orte-restart}}} preload option. This was removed since the
>   user only needs to select the 'stage' component in order to support this
>   functionality.
> * Since the '-am' parameter is saved in the metadata, {{{ompi-restart}}} no
>   longer hard codes {{{-am ft-enable-cr}}}.
> * Fix {{{hnp}}} ErrMgr so that if a previous component in the stack has
>   'fixed' the problem, then it should be skipped.
> * Make sure to decrement the number of 'num_local_procs' in the orted when
>   one goes away.
> * odls now checks the SStore framework to see if it needs to load any
>   checkpoint files before launching (to support 'stage'). This separates the
>   SStore logic from the --preload-[binary|files] options.
> * Add unique IDs to the named pipes established between the orted and the app
>   in SnapC. This is to better support migration and automatic recovery
>   activities.
> * Improve the checks for 'already checkpointing' error path.
> * Add a recovery output timer, to show how long it takes to restart a job
> * Do a better job of cleaning up the old session directory on restart.
> * Add a local module to the autor and crmig ErrMgr components. These small
>   modules prevent the 'orted' component from attempting a local recovery
>   (which does not work for MPI apps at the moment).
> * Add a fix for bounding the checkpointable region between MPI_Init and
>   MPI_Finalize.
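
For anyone carrying old settings forward, the deprecated MCA params listed in the change log can be checked for mechanically. Below is a minimal sketch in Python, not part of the commit: the mapping is copied from the "Deprecated some MCA params" list above, while the function name find_deprecated, the sample file contents, and the assumed "name = value" layout with "#" comments are my own illustration.

```python
# Mapping copied from the "Deprecated some MCA params" list in the change log.
# None means the change log names no one-to-one replacement parameter.
DEPRECATED_MCA_PARAMS = {
    "crs_base_snapshot_dir": "sstore_stage_local_snapshot_dir",
    "snapc_base_global_snapshot_dir": "sstore_base_global_snapshot_dir",
    "snapc_base_global_shared": "sstore_stage_global_is_shared",
    "snapc_base_store_in_place": None,  # replaced by choice of SStore component
    "snapc_base_global_snapshot_ref": "sstore_base_global_snapshot_ref",
    "snapc_base_establish_global_snapshot_dir": None,  # never well supported
    "snapc_full_skip_filem": "sstore_stage_skip_filem",
}


def find_deprecated(conf_text):
    """Return (line_number, old_name, replacement) for each deprecated param.

    Hypothetical helper: assumes one "name = value" setting per line and
    treats everything after "#" as a comment.
    """
    hits = []
    for lineno, raw in enumerate(conf_text.splitlines(), start=1):
        line = raw.split("#", 1)[0].strip()
        if "=" not in line:
            continue
        name = line.split("=", 1)[0].strip()
        if name in DEPRECATED_MCA_PARAMS:
            hits.append((lineno, name, DEPRECATED_MCA_PARAMS[name]))
    return hits


if __name__ == "__main__":
    # Made-up file contents, for illustration only.
    sample = (
        "# hypothetical mca-params.conf contents\n"
        "crs_base_snapshot_dir = /tmp/ckpt\n"
        "snapc_base_store_in_place = 0\n"
        "btl = tcp,self\n"
    )
    for lineno, old, new in find_deprecated(sample):
        advice = f"use {new}" if new else "no direct replacement, see change log"
        print(f"line {lineno}: {old} is deprecated ({advice})")
```

The two entries mapped to None are the ones the change log describes without naming a replacement parameter, so the sketch only flags them and leaves the decision to the user.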