Configuring with '--with-ft=cr' enables the C/R fault tolerance frameworks, 
tools, and code paths. One of those code paths is the component selection 
logic you cited below. When you run an application compiled with Open MPI and 
pass the '-am ft-enable-cr' or '-am ft-enable-cr-recovery' option, that logic 
is activated and selects only those components that have self-identified as 
'checkpoint ready'. What 'checkpoint ready' means differs from framework to 
framework: some frameworks do not need to do anything (e.g., timer), while 
others require much more work (e.g., BTLs).
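
As a rough illustration of that selection rule, here is a small self-contained 
model in C. It is only a sketch and not the Open MPI source: the names 
PARAM_NONE, PARAM_CHECKPOINT, struct component, and filter_components are 
hypothetical stand-ins, and the flag values are made up.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical stand-ins for the MCA metadata flags (values are illustrative). */
enum {
    PARAM_NONE       = 0x00,
    PARAM_CHECKPOINT = 0x01
};

/* Simplified, hypothetical component descriptor. */
struct component {
    const char *framework;    /* e.g. "btl", "coll" */
    const char *name;         /* e.g. "sm" */
    unsigned    param_field;  /* metadata flags the component declares */
};

/*
 * Store pointers to the components that survive selection in `kept`
 * and return how many were kept.
 *
 * If the run did not ask for checkpointing, every component is kept.
 * If it did, only components declaring PARAM_CHECKPOINT are kept.
 */
size_t filter_components(const struct component *found, size_t n,
                         unsigned open_only_flags,
                         const struct component **kept)
{
    size_t n_kept = 0;

    assert(found != NULL && kept != NULL);

    for (size_t i = 0; i < n; ++i) {
        if ((open_only_flags & PARAM_CHECKPOINT) &&
            !(found[i].param_field & PARAM_CHECKPOINT)) {
            /* Checkpoint-enabled run, component not flagged: drop it. */
            fprintf(stderr, "(%s) Component %s is *NOT* Checkpointable - Disabled\n",
                    found[i].framework, found[i].name);
            continue;
        }
        kept[n_kept++] = &found[i];
    }
    return n_kept;
}
```

Calling filter_components() with open_only_flags set to PARAM_NONE models a 
normal run and keeps every component. Passing PARAM_CHECKPOINT models a run 
started with one of the '-am' options above and keeps only the flagged 
components.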

Some components have not been verified to work well under C/R scenarios, and 
they are not selected when you pass the '-am' parameters cited above. The 
Shared Memory BTL -is- checkpoint ready, and -will- be selected (on the 
current 1.4, 1.5 and trunk branches). See the code below (Line 94):
  
https://svn.open-mpi.org/trac/ompi/browser/trunk/ompi/mca/btl/sm/btl_sm_component.c#L94

The shared memory collective module [also called 'sm'] (which is not enabled 
under normal use because it needs more testing - Line 89 in 
coll_sm_component.c) is -not- checkpoint ready (line 77), also because it 
needs more testing:
  
https://svn.open-mpi.org/trac/ompi/browser/trunk/ompi/mca/coll/sm/coll_sm_component.c#L76
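
To make the distinction concrete, here is a rough sketch of how two components 
that share the name 'sm' but live in different frameworks can each declare 
their own checkpoint flag. This is a simplified model, not the actual 
component structures: META_NONE, META_CHECKPOINT, struct component_descriptor, 
the two example descriptors, and is_checkpoint_ready are all hypothetical 
names chosen for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical metadata flags (illustrative values, not the real constants). */
enum {
    META_NONE       = 0x00,
    META_CHECKPOINT = 0x01
};

/* Hypothetical descriptor: each component carries its own metadata field. */
struct component_descriptor {
    const char *framework;    /* framework the component belongs to */
    const char *name;         /* component name within that framework */
    unsigned    param_field;  /* flags the component declares about itself */
};

/* The shared memory BTL declares itself checkpoint ready... */
const struct component_descriptor btl_sm_example = {
    "btl", "sm", META_CHECKPOINT
};

/* ...while the shared memory collective, also named "sm", does not. */
const struct component_descriptor coll_sm_example = {
    "coll", "sm", META_NONE
};

/* A component is checkpoint ready only if it set the flag itself. */
bool is_checkpoint_ready(const struct component_descriptor *c)
{
    assert(c != NULL);
    return (c->param_field & META_CHECKPOINT) != 0;
}
```

The point of the sketch is that the flag belongs to each component, not to the 
name 'sm', so seeing one 'sm' component disabled says nothing about the other.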

So shared memory communication support has been available for 
checkpoint/restart functionality for a couple of years now. The shared memory 
collective has not matured or been tested enough to be active even under 
non-C/R circumstances. Once it is ready, we can consider supporting it under 
C/R enabled activities.

I hope that clarifies what is going on.

-- Josh

On Aug 23, 2010, at 12:50 PM, <ananda.mu...@wipro.com> wrote:

> Hi
>  
> In the file “mca_base_components_open.c”, the following code checks for the 
> components that are checkpointable. If I configure the Open MPI library with 
> the “--enable-cr” option, I was under the assumption that all components 
> would be checkpointable. However, I see that quite a few components are not 
> checkpointable, and that list includes “Shared Memory (sm)”. Do I have to add 
> any other options to the “configure” command so that all components are 
> checkpointable? Thanks
>  
>         /*
>          * If the user asked for a checkpoint enabled run
>          * then only load checkpoint enabled components.
>          */
>         if( MCA_BASE_METADATA_PARAM_CHECKPOINT & open_only_flags) {
>             if( MCA_BASE_METADATA_PARAM_CHECKPOINT & dummy->data.param_field) {
>                 opal_output_verbose(10, output_id,
>                                     "mca: base: components_open: "
>                                     "(%s) Component %s is Checkpointable",
>                                     type_name,
>                                     dummy->version.mca_component_name);
>             }
>             else {
>                 opal_output_verbose(10, output_id,
>                                     "mca: base: components_open: "
>                                     "(%s) Component %s is *NOT* Checkpointable - Disabled",
>                                     type_name,
>                                     dummy->version.mca_component_name);
>                 opal_list_remove_item(&components_found, item);
>             }
>         }
>     }
> }
>  
> Thanks
> 
> Ananda
> 
>  
> Ananda B Mudar, PMP
> Senior Technical Architect
> Wipro Technologies

------------------------------------
Joshua Hursey
Postdoctoral Research Associate
Oak Ridge National Laboratory
http://www.cs.indiana.edu/~jjhursey