When you configure with '--with-ft=cr', this enables the C/R fault tolerance frameworks, tools, and code paths. One of those code paths is the component selection logic you cited below. When you run an application compiled with Open MPI and pass the '-am ft-enable-cr' or '-am ft-enable-cr-recovery' option, this activates that logic, which picks only those components that have self-identified as 'checkpoint ready'. 'Checkpoint ready' means different things for different frameworks: some frameworks do not need to do anything (e.g., timer), while others require much more work (e.g., BTLs).
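To make the selection step concrete, here is a minimal, self-contained sketch of the idea. It is not the Open MPI source: the struct, the function name, and the flag value SKETCH_PARAM_CHECKPOINT are hypothetical stand-ins (the real code tests MCA_BASE_METADATA_PARAM_CHECKPOINT against each component's param_field and removes items from a list rather than compacting an array).

```c
#include <stddef.h>
#include <stdio.h>

/* Hypothetical flag bit for this sketch; the real code uses
 * MCA_BASE_METADATA_PARAM_CHECKPOINT, whose value is not assumed here. */
#define SKETCH_PARAM_CHECKPOINT 0x1u

/* Simplified stand-in for a component descriptor (not the real MCA struct). */
struct sketch_component {
    const char  *framework;    /* framework type name, e.g. "btl" */
    const char  *name;         /* component name */
    unsigned int param_field;  /* metadata bits the component advertises */
};

/*
 * If a checkpoint-enabled run was requested (the checkpoint bit is set in
 * open_only_flags), keep only the components that advertise the same bit.
 * The array is compacted in place; the number of components kept is returned.
 * Without the request bit, nothing is filtered.
 */
size_t sketch_filter_components(struct sketch_component *comps, size_t count,
                                unsigned int open_only_flags)
{
    size_t kept = 0;

    if (!(open_only_flags & SKETCH_PARAM_CHECKPOINT)) {
        return count;  /* normal run: every component stays eligible */
    }

    for (size_t i = 0; i < count; ++i) {
        if (comps[i].param_field & SKETCH_PARAM_CHECKPOINT) {
            comps[kept++] = comps[i];  /* self-identified as checkpoint ready */
        } else {
            fprintf(stderr,
                    "(%s) Component %s is *NOT* Checkpointable - Disabled\n",
                    comps[i].framework, comps[i].name);
        }
    }
    return kept;
}
```

Calling sketch_filter_components() with open_only_flags set to 0 leaves the array untouched, which mirrors a normal run. Passing SKETCH_PARAM_CHECKPOINT drops every entry whose param_field lacks that bit, which is the effect of the '-am' options described above.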
There are some components that have not been verified to work well under C/R scenarios, and they are not selected when you pass the '-am' parameters cited above.

The Shared Memory BTL -is- checkpoint ready, and -will- be selected (on the current 1.4, 1.5 and trunk branches). See the code below (Line 94):
  https://svn.open-mpi.org/trac/ompi/browser/trunk/ompi/mca/btl/sm/btl_sm_component.c#L94

The shared memory collective module [also called 'sm'] is -not- checkpoint ready (line 77), because it has not been tested enough. For the same reason it is not enabled under normal use either (Line 89 in coll_sm_component.c):
  https://svn.open-mpi.org/trac/ompi/browser/trunk/ompi/mca/coll/sm/coll_sm_component.c#L76

So shared memory communication support has been available for checkpoint/restart functionality for a couple of years now. The shared memory collective has not matured or been tested enough to be active even under non-C/R circumstances. Once it is ready, we can consider supporting it under C/R-enabled runs.

I hope that clarifies what is going on.

-- Josh

On Aug 23, 2010, at 12:50 PM, <ananda.mu...@wipro.com> wrote:

> Hi
>
> In the file "mca_base_components_open.c", the following code checks for the
> components that are checkpointable. If I configure the Open MPI library with
> the "--enable-cr" option, I was under the assumption that all components
> would be checkpointable. However, I see that quite a few components are not
> checkpointable, and that list includes "Shared Memory (sm)". Do I have to add
> any other options to the "configure" command so that all components are
> checkpointable? Thanks
>
>             /*
>              * If the user asked for a checkpoint enabled run
>              * then only load checkpoint enabled components.
>              */
>             if( MCA_BASE_METADATA_PARAM_CHECKPOINT & open_only_flags) {
>                 if( MCA_BASE_METADATA_PARAM_CHECKPOINT & dummy->data.param_field) {
>                     opal_output_verbose(10, output_id,
>                                         "mca: base: components_open: "
>                                         "(%s) Component %s is Checkpointable",
>                                         type_name,
>                                         dummy->version.mca_component_name);
>                 }
>                 else {
>                     opal_output_verbose(10, output_id,
>                                         "mca: base: components_open: "
>                                         "(%s) Component %s is *NOT* Checkpointable - Disabled",
>                                         type_name,
>                                         dummy->version.mca_component_name);
>                     opal_list_remove_item(&components_found, item);
>                 }
>             }
>         }
>     }
>
> Thanks
>
> Ananda
>
> Ananda B Mudar, PMP
> Senior Technical Architect
> Wipro Technologies

------------------------------------
Joshua Hursey
Postdoctoral Research Associate
Oak Ridge National Laboratory
http://www.cs.indiana.edu/~jjhursey