On Mar 24, 2011, at 3:57 PM, Hugo Meyer wrote:

> 2011/3/24 Ralph Castain <r...@open-mpi.org>
> > You really don't want to do it that way - you'll create major confusion in 
> > mpirun and the other daemons about who is where. Have you looked at the code 
> > in orte/mca/errmgr/hnp/errmgr_hnp.c, line 1573 and following?
> 
> I did not look at that, but I will do it right now.
> 
> > The ability to relocate a failed child process is already in the trunk - it 
> > only requires turning "on" with an --enable-recovery flag at runtime if you 
> > don't need the checkpoint/restart support. If you do need C/R, you can use 
> > that too (just requires some configure flags).
> 
> About this, I do need C/R support, because what I'm trying to do is 
> restart a process on another node (as a child of the orted residing there) 
> from a previous checkpoint. I will take a look at the relocation feature that 
> you mention and try to use it.

From what you've described before, I suspect all you'll need to do is add some 
code in orte/mca/odls/base/odls_base_default_fns.c that (a) checks to see if a 
process in the launch message is being relocated (the construct_child_list code 
does that already), and then (b) sends the required info to all local child 
processes so they can take appropriate action.

Failure detection, re-launch, etc. have all been taken care of for you.
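
To make steps (a) and (b) concrete, here is a rough, self-contained sketch in C. It does not use the real ORTE data structures or the actual code in odls_base_default_fns.c: launch_entry_t, local_child_t, find_relocated, notify_children and MAX_RANKS are all hypothetical names invented for illustration, and "sending info" is modeled as updating a per-child table rather than real messaging.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

#define MAX_RANKS 8  /* illustrative job-size limit, not an ORTE constant */

/* Hypothetical stand-in for one process entry in a launch message. */
typedef struct {
    int  rank;       /* rank of the process within the job */
    int  node;       /* node the process is being launched on */
    bool relocated;  /* true if this entry restarts a failed process */
} launch_entry_t;

/* Hypothetical stand-in for a local child and its view of its peers. */
typedef struct {
    int rank;                  /* rank of this local child */
    int peer_node[MAX_RANKS];  /* node where the child believes each rank lives */
} local_child_t;

/* Step (a): collect pointers to the relocated entries of the launch message.
 * Returns how many were stored in out (at most cap). */
size_t find_relocated(const launch_entry_t *msg, size_t n,
                      const launch_entry_t **out, size_t cap)
{
    size_t found = 0;
    assert(msg != NULL && out != NULL);
    for (size_t i = 0; i < n && found < cap; i++) {
        if (msg[i].relocated) {
            out[found++] = &msg[i];
        }
    }
    return found;
}

/* Step (b): tell every local child about every relocated peer.
 * Returns the number of notifications delivered. */
size_t notify_children(local_child_t *children, size_t nchildren,
                       const launch_entry_t **moved, size_t nmoved)
{
    size_t sent = 0;
    assert(children != NULL && moved != NULL);
    for (size_t c = 0; c < nchildren; c++) {
        for (size_t m = 0; m < nmoved; m++) {
            int r = moved[m]->rank;
            if (r == children[c].rank || r < 0 || r >= MAX_RANKS) {
                continue;  /* skip the child itself and out-of-range ranks */
            }
            children[c].peer_node[r] = moved[m]->node;
            printf("child %d: rank %d now on node %d\n",
                   children[c].rank, r, moved[m]->node);
            sent++;
        }
    }
    return sent;
}
```

A caller would pass the parsed launch message to find_relocated and then hand the result, together with its list of local children, to notify_children. In the real code base the notification would presumably go over the daemon-to-child messaging path instead of a direct table update.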

> 
> > At the least, the cited code should provide guidance on how to correctly 
> > restart procs if you need your own errmgr module for other reasons.
> 
> Again thanks Ralph, you have been very helpful.
> 
> Best regards.
> 
> Hugo Meyer
> _______________________________________________
> devel mailing list
> de...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/devel
