I am now trying to run orte-restart. As far as I understand it, orte-restart analyzes the checkpoint metadata and then tries to exec() mpirun, which then starts opal-restart. During the startup of opal-restart (in initialize()), detection of the best CRS module is disabled:

    /*
     * Turn off the selection of the CRS component,
     * we need to do that later
     */
    (void) mca_base_var_env_name("crs_base_do_not_select", &tmp_env_var);
    opal_setenv(tmp_env_var,
                "1", /* turn off the selection */
                true, &environ);
    free(tmp_env_var);
    tmp_env_var = NULL;

This seems to work. Later, when actually selecting the correct CRS module to restart the checkpointed process, the selection is enabled again:

    /* Re-enable the selection of the CRS component, so we can choose the right one */
    (void) mca_base_var_env_name("crs_base_do_not_select", &tmp_env_var);
    opal_setenv(tmp_env_var,
                "0", /* turn on the selection */
                true, &environ);
    free(tmp_env_var);
    tmp_env_var = NULL;

This does not seem to have an effect. One reason why it does not work is pretty obvious: the MCA variable crs_base_do_not_select is registered during opal_crs_base_register() and written to the bool variable opal_crs_base_do_not_select only once (during register). Later, opal_crs_base_select() checks this bool variable to decide whether select should run, and as it is only set during register it never changes. So from the code flow it cannot work, and this is probably the result of one of the rewrites since C/R was introduced.

To fix this I am trying to read the value of the MCA variable opal_crs_base_do_not_select during opal_crs_base_select() like this:

    idx = mca_base_var_find("opal", "crs", "base", "do_not_select");
    mca_base_var_get_value(idx, &value, NULL, NULL);

This also seems to work, because the value I read changes if I change the first opal_setenv() during initialize(). The problem I am seeing is that the second opal_setenv() (back to 0) cannot be detected using mca_base_var_get_value().

So my question is: what is the preferred way to read and write MCA variables so that they can be accessed from the different modules? Is the existing code still correct? There is also mca_base_var_set_value(); should I rather use this to set 'opal_crs_base_do_not_select'?

So far I have not been able to use mca_base_var_set_value() without a segfault. There are not many uses of mca_base_var_set_value() in the existing code, and none of them uses a bool variable.

I also discovered that I can just access the global C variable 'opal_crs_base_do_not_select' from opal-restart.c as well as from opal_crs_base_select(). This also works, and it would solve my problem of setting and reading MCA variables.

Adrian