I am now trying to run orte-restart. As far as I understand it,
orte-restart analyzes the checkpoint metadata and then tries to exec()
mpirun, which then starts opal-restart. During the startup of
opal-restart (in initialize()), detection of the best CRS module is
disabled:

    /* 
     * Turn off the selection of the CRS component,
     * we need to do that later
     */
    (void) mca_base_var_env_name("crs_base_do_not_select", &tmp_env_var);
    opal_setenv(tmp_env_var,
                "1", /* turn off the selection */
                true, &environ);
    free(tmp_env_var);
    tmp_env_var = NULL;

This seems to work. Later, when actually selecting the correct CRS module
to restart the checkpointed process, the selection is enabled again:

    /* Re-enable the selection of the CRS component, so we can choose the
     * right one */
    (void) mca_base_var_env_name("crs_base_do_not_select", &tmp_env_var);
    opal_setenv(tmp_env_var,
                "0", /* turn on the selection */
                true, &environ);
    free(tmp_env_var);
    tmp_env_var = NULL;

This does not seem to have any effect, and the reason is fairly obvious.
The MCA variable crs_base_do_not_select is registered in
opal_crs_base_register() and written to the bool variable
opal_crs_base_do_not_select only once, during registration. Later,
opal_crs_base_select() checks this bool to decide whether selection
should run, and since it is only set during registration, it never
changes. So from the code flow it cannot work; this is probably the
result of one of the rewrites since C/R was introduced.
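
To illustrate what I mean, here is a minimal stand-alone sketch of the
pattern. It is not Open MPI code: all demo_* names and the environment
variable name are made up, and plain setenv()/getenv() stand in for
opal_setenv() and the MCA variable system. It only shows that a flag
copied from the environment once is not affected by a later change to
the environment.

```c
#define _POSIX_C_SOURCE 200112L   /* for setenv() */
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical environment variable name, not the real MCA one. */
#define DEMO_ENV_NAME "DEMO_CRS_BASE_DO_NOT_SELECT"

/* Hypothetical stand-in for the bool opal_crs_base_do_not_select. */
static bool demo_do_not_select = false;

/* Read the environment right now; only the string "1" counts as true. */
static bool demo_env_is_set(void)
{
    const char *val = getenv(DEMO_ENV_NAME);
    return val != NULL && strcmp(val, "1") == 0;
}

/* Models the register step: copy the environment value into the flag once. */
static void demo_register(void)
{
    demo_do_not_select = demo_env_is_set();
}

/* Models the select step: it consults only the cached flag. */
static bool demo_select_runs(void)
{
    return !demo_do_not_select;
}

/* Replays the sequence from opal-restart; asserts each step, then returns 0. */
static int demo_sequence(void)
{
    setenv(DEMO_ENV_NAME, "1", 1);   /* first setenv: turn selection off */
    demo_register();                 /* the flag is cached as true here */
    assert(!demo_select_runs());

    setenv(DEMO_ENV_NAME, "0", 1);   /* second setenv: try to turn it on */
    assert(!demo_env_is_set());      /* the environment did change ... */
    assert(!demo_select_runs());     /* ... but the cached flag did not */

    demo_do_not_select = false;      /* writing the flag directly works */
    assert(demo_select_runs());
    return 0;
}
```

Calling demo_sequence() from a small main() runs through all three
steps. The second setenv() changes the environment but not the cached
flag, which is the behaviour I am seeing.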

To fix this I am trying to read the value of the MCA variable
opal_crs_base_do_not_select during opal_crs_base_select() like this:

    idx = mca_base_var_find("opal", "crs", "base", "do_not_select");
    mca_base_var_get_value(idx, &value, NULL, NULL);

This also seems to work, because the value I read changes when I change
the first opal_setenv() in initialize(). The problem I am seeing is that
the second opal_setenv() (back to 0) cannot be detected using
mca_base_var_get_value().

So my question is: what is the preferred way to read and write MCA
variables so that they can be accessed from different modules? Is the
existing code still correct? There is also mca_base_var_set_value();
should I use that instead to set 'opal_crs_base_do_not_select'? I was,
however, not able to use mca_base_var_set_value() without a segfault.
There are not many uses of mca_base_var_set_value() in the existing
code, and none of them use a bool variable.

I also discovered that I can just access the global C variable
'opal_crs_base_do_not_select' directly from opal-restart.c as well as
from opal_crs_base_select(). This also works, and it would solve my
problem with setting and reading MCA variables.

                Adrian
