Ok Ralph. Thanks a lot for your help. I will do as you said and then let you know how it goes.
Best Regards.

Hugo Meyer

2011/3/25 Ralph Castain <r...@open-mpi.org>

> On Mar 25, 2011, at 10:48 AM, Hugo Meyer wrote:
>
> > > From what you've described before, I suspect all you'll need to do is add
> > > some code in orte/mca/odls/base/odls_base_default_fns.c that (a) checks to
> > > see if a process in the launch message is being relocated (the
> > > construct_child_list code does that already), and then (b) sends the
> > > required info to all local child processes so they can take appropriate
> > > action.
> > >
> > > Failure detection, re-launch, etc. have all been taken care of for you.
> >
> > I looked at the code that you mentioned and I realized that I have two
> > possible options, which I'm going to share with you to get your opinion.
> >
> > First of all, I will let you know the current state of my
> > implementation. I'm working on a fault-tolerant system, but using
> > uncoordinated checkpointing: I take checkpoints of all my processes at
> > different times and store them on the machine where they reside, but
> > I also send these checkpoints to another node (let's call it the
> > protector), so if this node fails, its processes should be restarted on
> > the protector that has their checkpoints.
> >
> > Right now I'm detecting the failure of a process and I know where this
> > process should be restarted, and I also have the checkpoint on the
> > protector. And I also have the child information, of course.
> >
> > So, my options are:
> >
> > *First Option*
> >
> > I detect the failure, and then I use
> > orte_errmgr_hnp_base_global_update_state() with some modifications and the
> > hnp_relocate, but changing the spawning to make a restart from a checkpoint.
> > I suppose that, using this, the migration of the process to another node
> > will be updated and everyone will know about it, because it is the HNP
> > that is going to do this (is this ok?).
>
> This is the option I would use. The other one is much, much more work. In
> this option, you only have to:
>
> (a) modify the mapper so you can specify the location of the proc being
> restarted. The resilient mapper module will be handling the restart - if you
> look at orte/mca/rmaps/resilient/rmaps_resilient.c, you can see the code
> doing the "replacement" and modify accordingly.
>
> (b) add any required info about your checkpoint to the launch message. This
> gets created in orte/mca/odls/base/odls_base_default_fns.c, the
> "get_add_procs_data" function (at the top of the file).
>
> (c) modify the launch code to handle your checkpoint, if required - see the
> file in (b), the "construct_child" and "launch" functions.
>
> HTH
> Ralph
>
> > *Second Option*
> >
> > Modify one of the spawn variations (probably the remote_spawn from rsh) in
> > the PLM framework and then use the orted_comm to command a remote_spawn on
> > the protector, but here I don't know how to update the info so everyone
> > knows about the change, or how this is managed.
> >
> > I might be very wrong in what I said; my apologies if so.
> >
> > Thanks a lot for all the help.
> >
> > Best regards.
> >
> > Hugo Meyer
>
> _______________________________________________
> devel mailing list
> de...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/devel
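
For reference, the protector scheme discussed in this thread (each process has a node that also stores its checkpoints, and the process is restarted there if its own node fails) can be sketched as a small bookkeeping table. This is only an illustrative model, not ORTE code: proc_entry_t, NO_NODE, relocate_from_failed_node() and current_node_of() are hypothetical names invented for this sketch. In the real code base the equivalent decisions would live in the mapper and launch paths that Ralph points to. Clearing the protector after a relocation is also an assumption of the sketch.

```c
#include <assert.h>
#include <stddef.h>

#define NO_NODE (-1)   /* hypothetical marker: no node assigned */

/* Hypothetical per-process record; NOT an ORTE data structure. */
typedef struct {
    int rank;       /* process rank */
    int node;       /* node currently hosting the process */
    int protector;  /* node holding a copy of its checkpoints, or NO_NODE */
    int restarts;   /* number of restarts from a checkpoint */
} proc_entry_t;

/*
 * Relocate every process hosted on failed_node to its protector.
 * A process with no usable protector (NO_NODE, or the failed node
 * itself) is left untouched, since it cannot be recovered here.
 * Returns the number of processes relocated.
 */
int relocate_from_failed_node(proc_entry_t *procs, size_t nprocs,
                              int failed_node)
{
    int moved = 0;

    assert(procs != NULL || nprocs == 0);

    for (size_t i = 0; i < nprocs; ++i) {
        proc_entry_t *p = &procs[i];

        if (p->node != failed_node) {
            continue;   /* process not affected by this failure */
        }
        if (p->protector == NO_NODE || p->protector == failed_node) {
            continue;   /* no checkpoint copy survives elsewhere */
        }
        p->node = p->protector;  /* restart where the checkpoint lives */
        p->protector = NO_NODE;  /* assumption: a new protector is picked later */
        p->restarts++;
        moved++;
    }
    return moved;
}

/* Look up where a rank currently lives; NO_NODE if the rank is unknown. */
int current_node_of(const proc_entry_t *procs, size_t nprocs, int rank)
{
    for (size_t i = 0; i < nprocs; ++i) {
        if (procs[i].rank == rank) {
            return procs[i].node;
        }
    }
    return NO_NODE;
}
```

The point of keeping a single table is the same one raised under the first option: if one place (the HNP in this thread) updates the mapping, every other participant can look up the new location instead of tracking relocations independently.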