I'm not entirely sure what you are doing here. The orte_job_t, orte_node_t, and orte_proc_t objects are used only in mpirun; the arrays built from those objects are defined only in mpirun itself, not in any orted.

When you say "orte daemon which acts as HNP", are you implying that you have some orted out there that is trying to behave like an HNP? Or do you really mean mpirun itself?

I suspect the reason you are seeing a difference is that orte-ps only gets its info from mpirun, and you are somehow storing the modified data on an orted instead.
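
To illustrate the point, here is a minimal sketch using hypothetical stand-in types (demo_name_t, demo_proc_t and the demo_* functions are invented for this example and are not ORTE code). It assumes the update ran in a process other than mpirun, and uses fork() only to show that a second process changes its own copy of the data while the first process keeps seeing the initial invalid value:

```c
#include <assert.h>
#include <stdint.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical stand-ins for illustration only; not the real ORTE types. */
#define DEMO_INVALID UINT32_MAX

typedef struct {
    uint32_t jobid;
    uint32_t vpid;
} demo_name_t;

typedef struct {
    demo_name_t name;
    demo_name_t protector;   /* starts out invalid */
} demo_proc_t;

/* Per-process table, playing the role of the job data held by mpirun. */
static demo_proc_t demo_procs[2] = {
    { {1, 0}, {DEMO_INVALID, DEMO_INVALID} },
    { {1, 1}, {DEMO_INVALID, DEMO_INVALID} },
};

/* Update the protector field in the calling process's copy only. */
void demo_set_protector(uint32_t vpid, demo_name_t protector)
{
    assert(vpid < 2);
    demo_procs[vpid].protector = protector;
}

/*
 * Fork a child (the "daemon" role) that updates its own copy, then check
 * what the parent (the "mpirun" role) still holds.
 * Returns 1 if the parent's copy is still invalid, 0 if it changed,
 * and -1 on error.
 */
int demo_parent_still_invalid(void)
{
    pid_t pid = fork();
    if (pid < 0) {
        return -1;
    }
    if (pid == 0) {
        /* Child: the update succeeds here, in the child's memory. */
        demo_name_t p = {1, 1};
        demo_set_protector(0, p);
        _exit(demo_procs[0].protector.vpid == 1 ? 0 : 1);
    }
    int status = 0;
    if (waitpid(pid, &status, 0) < 0) {
        return -1;
    }
    if (!WIFEXITED(status) || WEXITSTATUS(status) != 0) {
        return -1;
    }
    /* Parent: its own table was never touched by the child. */
    return demo_procs[0].protector.vpid == DEMO_INVALID;
}
```

Calling demo_parent_still_invalid() on a POSIX system returns 1: the child logs a correct value from its own copy, but any tool that asks the parent sees the untouched entry.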

Did you modify orted itself to create and store an orte_job_t array? This would not be a good idea as a significant amount of code in the system expects that array to only exist inside of mpirun. You could run into some really strange behavior in various scenarios.

Ralph


On Oct 1, 2008, at 9:09 AM, Leonardo Fialho wrote:

Hi All,

I have a question about how to update the orte_proc structure.

I have modified the orte_proc structure to include another field (of type orte_process_name_t) that describes the node which stores my checkpoints and logs:

struct orte_proc_t {
...
#if OPAL_ENABLE_FT_RADIC == 1
  /* protector node */
  orte_process_name_t protector;
#endif
};

So I added code to orted_comm.c that I think should update the job structure:
/* Update the structure */
if (NULL == (jdata = orte_get_job_data_object(sender_jobid))) {
  ORTE_ERROR_LOG(ORTE_ERR_NOT_FOUND);
  goto CLEANUP;
}
procs = (orte_proc_t**)jdata->procs->addr;
if (NULL == procs[sender_vpid] ) {
  ORTE_ERROR_LOG(ORTE_ERR_NOT_FOUND);
  goto CLEANUP;
}
procs[sender_vpid]->protector.jobid = protector_jobid;
procs[sender_vpid]->protector.vpid  = protector_vpid;
opal_output(0, "%s is the protector of %s", ORTE_NAME_PRINT(&procs[sender_vpid]->name), ORTE_NAME_PRINT(&procs[sender_vpid]->protector));

In the log of the orte daemon which acts as HNP I can see that the correct information was added to the orte_proc structure, but when I use my modified version of orte-ps I get incorrect information ([[INVALID],INVALID]). Below is the code I used in orte-ps:

#if OPAL_ENABLE_FT_RADIC == 1
      protector = orte_util_print_name_args(&vpid->protector);
      printf("%*s |",   len_protector, protector);
#endif

The question is: why does the HNP show the correct information while orte-ps doesn't?

Thanks
--

Leonardo Fialho
Computer Architecture and Operating Systems Department - CAOS
Universidad Autonoma de Barcelona - UAB
ETSE, Edifcio Q, QC/3088
http://www.caos.uab.es
Phone: +34-93-581-2888
Fax: +34-93-581-2478


