I'm not entirely sure what you are doing here. The orte_job_t,
orte_node_t, and orte_proc_t objects are only used on mpirun - the
arrays built from those objects are only defined on mpirun itself, not
on any orted.
When you say "orte daemon which acts as HNP", are you implying that
you have some orted out there that is trying to behave like an HNP? Or
do you really mean mpirun itself?
I suspect the reason you are seeing a difference is that orte-ps only
gets its info from mpirun, and you are somehow storing the modified
data on an orted instead.
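To illustrate what I suspect is happening, here is a toy sketch in plain C. It is not ORTE code: every toy_* name is made up for the example, and I am assuming the new field starts out as INVALID. The only point is that mpirun and each orted are separate processes, each with its own private table, so an update applied to one copy is never visible through the other.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

#define TOY_INVALID UINT32_MAX  /* stand-in for an "INVALID" jobid/vpid */
#define TOY_NPROCS  4           /* arbitrary table size for the sketch */

/* Hypothetical, simplified name type: NOT the real orte_process_name_t. */
typedef struct {
    uint32_t jobid;
    uint32_t vpid;
} toy_name_t;

/* Hypothetical per-process record carrying the extra "protector" field. */
typedef struct {
    toy_name_t name;
    toy_name_t protector;
} toy_proc_t;

/* One table per OS process: mpirun has one, and a modified orted another. */
typedef struct {
    toy_proc_t procs[TOY_NPROCS];
} toy_table_t;

/* Fill in the names and mark every protector as not yet known. */
void toy_table_init(toy_table_t *t, uint32_t jobid)
{
    for (uint32_t i = 0; i < TOY_NPROCS; i++) {
        t->procs[i].name.jobid = jobid;
        t->procs[i].name.vpid = i;
        t->procs[i].protector.jobid = TOY_INVALID;
        t->procs[i].protector.vpid = TOY_INVALID;
    }
}

/* Record a protector in ONE table only; other copies are untouched. */
void toy_set_protector(toy_table_t *t, uint32_t vpid, toy_name_t protector)
{
    assert(vpid < TOY_NPROCS);
    t->procs[vpid].protector = protector;
}

/* Returns 1 if this table holds a valid protector for vpid, else 0. */
int toy_protector_known(const toy_table_t *t, uint32_t vpid)
{
    assert(vpid < TOY_NPROCS);
    const toy_name_t *p = &t->procs[vpid].protector;
    return TOY_INVALID != p->jobid && TOY_INVALID != p->vpid;
}

/* Format the protector the way a reporting tool might display it. */
const char *toy_print_protector(const toy_table_t *t, uint32_t vpid,
                                char *buf, size_t len)
{
    if (toy_protector_known(t, vpid)) {
        const toy_name_t *p = &t->procs[vpid].protector;
        snprintf(buf, len, "[%u,%u]", (unsigned)p->jobid, (unsigned)p->vpid);
    } else {
        snprintf(buf, len, "[[INVALID],INVALID]");
    }
    return buf;
}
```

If you initialize two tables for the same job, call toy_set_protector() on the "orted" one only, and then print from the "mpirun" one, you get [[INVALID],INVALID]. That is the same symptom you would see if orte-ps is asking a different process than the one you updated.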
Did you modify orted itself to create and store an orte_job_t array?
This would not be a good idea as a significant amount of code in the
system expects that array to only exist inside of mpirun. You could
run into some really strange behavior in various scenarios.
Ralph
On Oct 1, 2008, at 9:09 AM, Leonardo Fialho wrote:
Hi All,
I have a question about how to update the orte_proc_t structure.
I have modified the structure to include another field (of type
orte_process_name_t) that identifies the node which stores my
checkpoints and logs:
struct orte_proc_t {
    ...
#if OPAL_ENABLE_FT_RADIC == 1
    /* protector node */
    orte_process_name_t protector;
#endif
};
I then added some code to orted_comm.c which I think should update
the job structure:
/* Update the structure */
if (NULL == (jdata = orte_get_job_data_object(sender_jobid))) {
    ORTE_ERROR_LOG(ORTE_ERR_NOT_FOUND);
    goto CLEANUP;
}
procs = (orte_proc_t**)jdata->procs->addr;
if (NULL == procs[sender_vpid]) {
    ORTE_ERROR_LOG(ORTE_ERR_NOT_FOUND);
    goto CLEANUP;
}
procs[sender_vpid]->protector.jobid = protector_jobid;
procs[sender_vpid]->protector.vpid = protector_vpid;
opal_output(0, "%s is the protector of %s",
            ORTE_NAME_PRINT(&procs[sender_vpid]->name),
            ORTE_NAME_PRINT(&procs[sender_vpid]->protector));
In the log of the orte daemon which acts as HNP I can see the correct
information that was added to the orte_proc structure, but when I
use my modified version of orte-ps I get incorrect information
([[INVALID],INVALID]). Below is the code I used in orte-ps:
#if OPAL_ENABLE_FT_RADIC == 1
    protector = orte_util_print_name_args(&vpid->protector);
    printf("%*s |", len_protector, protector);
#endif
The question is: why does the HNP show the correct information while
orte-ps doesn't?
Thanks
--
Leonardo Fialho
Computer Architecture and Operating Systems Department - CAOS
Universidad Autonoma de Barcelona - UAB
ETSE, Edifcio Q, QC/3088
http://www.caos.uab.es
Phone: +34-93-581-2888
Fax: +34-93-581-2478