On Mar 21, 2011, at 2:51 PM, Hugo Meyer wrote:

> Thanks Ralph for your reply.
> 
> 2011/3/21 Ralph Castain <r...@open-mpi.org>
> You should never access a pointer array's data area that way (i.e., by index 
> against the raw data). You really should do:
> 
> if (NULL == (proc = (orte_proc_t*)opal_pointer_array_get_item(jdata->procs, 
> vpid))) {
>       /* error report */
> }
> 
> 
> About this: I've changed my code accordingly, but I'm getting the same 
> result: NULL when asking about a dead process.
>  
> The errmgr generally doesn't remove a process object upon failure - it just 
> sets its state to some appropriate value. However, depending upon where you 
> are trying to do this, and the history that got you down this code path, it 
> is possible.
> 
> I'm writing this code into the errmgr_orted.c, and it is executed when a 
> process fails. 
>  

There's your problem - that module is run in the daemon, where the 
orte_job_data pointer array isn't used. You have to use the orte_local_jobdata 
and orte_local_children lists instead. So once the HNP replies with the jobid, 
you look up the orte_odls_job_t for that job from the orte_local_jobdata list.

If you want to find a particular proc, though, you would look under 
orte_local_children - search the list for a child whose jobid and vpid both 
match.

Note that you will not find that child process -unless- the child is under that 
daemon.
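
To make the two lookups concrete, here is a rough, self-contained sketch. It 
uses simplified stand-in types rather than the real ORTE structures: 
local_job_t, local_child_t, find_local_job, find_local_child, and lookup_demo 
are all made up for illustration, and the plain "next" pointers only model 
the orte_local_jobdata and orte_local_children lists. In the daemon you 
would walk the actual lists with the normal list iteration instead.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-in types: simplified models, NOT the real ORTE definitions. */
typedef uint32_t jobid_t;   /* stands in for orte_jobid_t */
typedef uint32_t vpid_t;    /* stands in for orte_vpid_t  */

/* Models one entry on the daemon's local job list (orte_local_jobdata). */
typedef struct local_job {
    struct local_job *next;
    jobid_t jobid;
} local_job_t;

/* Models one entry on the daemon's local child list (orte_local_children). */
typedef struct local_child {
    struct local_child *next;
    jobid_t jobid;
    vpid_t vpid;
    int alive;              /* hypothetical flag, for illustration only */
} local_child_t;

/* Return the local job record for jobid, or NULL if this daemon has none. */
local_job_t *find_local_job(local_job_t *head, jobid_t jobid)
{
    for (local_job_t *j = head; j != NULL; j = j->next) {
        if (j->jobid == jobid) {
            return j;
        }
    }
    return NULL;
}

/* Return the child whose jobid AND vpid both match, or NULL if that
 * process is not one of this daemon's own children. */
local_child_t *find_local_child(local_child_t *head, jobid_t jobid, vpid_t vpid)
{
    for (local_child_t *c = head; c != NULL; c = c->next) {
        if (c->jobid == jobid && c->vpid == vpid) {
            return c;
        }
    }
    return NULL;
}

/* Usage: a daemon hosting vpids 0 and 1 of job 1. */
int lookup_demo(void)
{
    local_child_t c1 = { NULL, 1, 1, 0 };   /* vpid 1 has failed */
    local_child_t c0 = { &c1, 1, 0, 1 };
    local_job_t job = { NULL, 1 };

    assert(find_local_job(&job, 1) == &job);
    assert(find_local_child(&c0, 1, 1) == &c1);   /* local child: found */
    assert(find_local_child(&c0, 1, 7) == NULL);  /* not under this daemon */
    return 0;
}
```

The NULL return in the last check is the case described above: the process 
exists in the job, but it is not under this daemon, so the search of the 
local child list cannot find it.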

I'm not sure what you are trying to accomplish, so I can't give further advice. 
Note that daemons have limited knowledge of application processes that are not 
their own immediate children. What little they know regarding processes other 
than their own is stored in the nidmap/pidmap arrays - limited to location, 
local rank, and node rank. They have no storage currently allocated for things 
like the state of a non-local process.


> 
> Also, remember that if you are in a daemon, then the jdata objects are not 
> populated. The daemons work exclusively from the orte_local_jobdata and 
> orte_local_children lists, so you would have to find your process there.
> 
> That's why I'm asking the HNP for the jdata using 
> ORTE_DAEMON_REPORT_JOB_INFO_CMD; I assume that it has the information about 
> the dead process.

Only after the daemon reports it.

> 
> Any idea?
> 
> Best regards.
> 
> Hugo Meyer
> _______________________________________________
> devel mailing list
> de...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/devel
