On Mar 24, 2011, at 3:57 PM, Hugo Meyer wrote:

> 2011/3/24 Ralph Castain <r...@open-mpi.org>
>
> You really don't want to do it that way - you'll create a major confusion in
> mpirun and the other daemons about who is where. Have you looked at the code
> in orte/mca/errmgr/hnp/errmgr_hnp.c, line 1573 and following?
>
> I did not look at that, but I will do it right now.
>
> The ability to relocate a failed child process is already in the trunk - it
> only requires turning "on" with an --enable-recovery flag at runtime if you
> don't need the checkpoint/restart support. If you do need C/R, you can use
> that too (just requires some configure flags).
>
> About this, I do need C/R support, because what I'm trying to do is
> restart a process on another node (as a child of the orted residing there)
> from a previous checkpoint. I will take a look at the relocation feature that
> you are mentioning and try to use it.

From what you've described before, I suspect all you'll need to do is add some code in orte/mca/odls/base/odls_base_default_fns.c that (a) checks to see if a process in the launch message is being relocated (the construct_child_list code does that already), and then (b) sends the required info to all local child processes so they can take appropriate action. Failure detection, re-launch, etc. have all been taken care of for you.

> At the least, the cited code should provide guidance on how to correctly
> restart procs if you need your own errmgr module for other reasons.
>
> Again thanks Ralph, you have been very helpful.
>
> Best regards.
>
> Hugo Meyer
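
A rough sketch of steps (a) and (b) above, in standalone C. Everything in it is hypothetical: launch_entry_t, relocation_notice_t, collect_relocations, notify_local_children and count_delivery are illustrative stand-ins, not ORTE types or functions, and the actual code in odls_base_default_fns.c will look different. It only shows the shape of the logic: scan the launch message for relocated procs, then pass that info to every local child.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for one proc entry in a launch message. */
typedef struct {
    int  rank;       /* rank of the proc within the job */
    int  node;       /* node the launch message assigns it to */
    bool relocated;  /* true if the proc is being restarted elsewhere */
} launch_entry_t;

/* Hypothetical payload handed to each local child. */
typedef struct {
    int rank;        /* rank that moved */
    int node;        /* node it now lives on */
} relocation_notice_t;

/* Hypothetical delivery callback; real code would send a message instead. */
typedef void (*notify_fn)(int child_rank,
                          const relocation_notice_t *notice,
                          void *ctx);

/* Step (a): collect the entries that the launch message marks as relocated.
 * Returns the number of notices written to out (at most out_cap). */
size_t collect_relocations(const launch_entry_t *entries, size_t n,
                           relocation_notice_t *out, size_t out_cap)
{
    size_t count = 0;
    for (size_t i = 0; i < n && count < out_cap; i++) {
        if (entries[i].relocated) {
            out[count].rank = entries[i].rank;
            out[count].node = entries[i].node;
            count++;
        }
    }
    return count;
}

/* Step (b): hand every notice to every local child.
 * Returns the number of deliveries made. */
size_t notify_local_children(const int *local_children, size_t n_children,
                             const relocation_notice_t *notices,
                             size_t n_notices,
                             notify_fn deliver, void *ctx)
{
    assert(deliver != NULL);
    size_t sent = 0;
    for (size_t c = 0; c < n_children; c++) {
        for (size_t k = 0; k < n_notices; k++) {
            deliver(local_children[c], &notices[k], ctx);
            sent++;
        }
    }
    return sent;
}

/* Example callback: counts deliveries in the int that ctx points to. */
void count_delivery(int child_rank, const relocation_notice_t *notice,
                    void *ctx)
{
    (void)child_rank;
    (void)notice;
    (*(int *)ctx)++;
}
```

The callback is only there to keep the sketch independent of any messaging layer; in the daemon, the delivery step would use whatever channel it already has to its local children.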