You really don't want to do it that way - you'll create major confusion in
mpirun and the other daemons about which proc is running where. Have you
looked at the code in orte/mca/errmgr/hnp/errmgr_hnp.c, line 1573 and following?

The ability to relocate a failed child process is already in the trunk. If you
don't need the checkpoint/restart support, you only have to turn it on with the
--enable-recovery flag at runtime. If you do need C/R, you can use that too
(it just requires some configure flags).

At the least, the cited code should provide guidance on how to correctly 
restart procs if you need your own errmgr module for other reasons.
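
A side note on the lookup in the quoted snippet below: as written, jobdat is
left pointing at the last list entry when no jobid matches (and is
uninitialized if the list is empty), so the apps[] access can silently use the
wrong job. Here is a minimal sketch of a safer pattern. It is not ORTE code:
job_rec_t, find_job, find_app and demo_lookup are hypothetical stand-ins made
up for illustration, and the real orte_odls_job_t and opal_list_t types look
different.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a per-job record; NOT the real orte_odls_job_t. */
typedef struct job_rec {
    struct job_rec *next;   /* singly linked list of known jobs */
    uint32_t jobid;         /* job identifier */
    const char **apps;      /* app names, indexed by app_idx */
    size_t num_apps;        /* number of entries in apps */
} job_rec_t;

/* Return the record for jobid, or NULL if the job is unknown. */
job_rec_t *find_job(job_rec_t *head, uint32_t jobid)
{
    job_rec_t *cur;
    for (cur = head; NULL != cur; cur = cur->next) {
        if (cur->jobid == jobid) {
            return cur;
        }
    }
    return NULL;
}

/* Return the app for (jobid, app_idx), or NULL if either one is invalid. */
const char *find_app(job_rec_t *head, uint32_t jobid, size_t app_idx)
{
    job_rec_t *job = find_job(head, jobid);
    if (NULL == job || app_idx >= job->num_apps) {
        return NULL;
    }
    return job->apps[app_idx];
}

/* Small self-check of the two helpers; returns 0 on success. */
int demo_lookup(void)
{
    const char *apps_a[] = { "solver", "io_helper" };
    const char *apps_b[] = { "viewer" };
    job_rec_t job_b = { NULL, 7, apps_b, 1 };
    job_rec_t job_a = { &job_b, 3, apps_a, 2 };

    assert(find_job(&job_a, 7) == &job_b);
    assert(find_job(&job_a, 99) == NULL);       /* unknown job */
    assert(find_app(&job_a, 3, 1) == apps_a[1]);
    assert(find_app(&job_a, 3, 5) == NULL);     /* index out of range */
    return 0;
}
```

The point is only that a failed search yields NULL and the caller checks it
before indexing into apps. Calling demo_lookup() from a main() exercises both
the found and not-found paths.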

On Mar 24, 2011, at 7:56 AM, Hugo Meyer wrote:

> Hello all.
> 
> I'm trying to restart a child that has failed. Right now I catch the failed
> child in the errmgr, then pack the child and send it to another node, which
> has to "adopt" it. Is there any way to do this with the current
> implementation, something like an add_child? I ask because I will then have
> to do something like:
> 
> opal_list_item_t *item;
> orte_odls_job_t *jobdat = NULL;
> orte_app_context_t *app = NULL;
> /* find the job data entry for the failed child's job */
> for (item = opal_list_get_first(&orte_local_jobdata);
>      item != opal_list_get_end(&orte_local_jobdata);
>      item = opal_list_get_next(item)) {
>     if (((orte_odls_job_t*)item)->jobid == child->name->jobid) {
>         jobdat = (orte_odls_job_t*)item;
>         break;
>     }
> }
> /* jobdat stays NULL if no matching job was found */
> if (NULL != jobdat) {
>     app = jobdat->apps[child->app_idx];
> }
> 
> In order to do this, I need to have the child in the jobdat. If no such
> thing is implemented, could someone give me advice on how to do this?
> 
> Best Regards.
> 
> Hugo Meyer

