Ok Ralph.

Thanks a lot for your help. I will do as you said and let you know how
it goes.

Best Regards.

Hugo Meyer

2011/3/25 Ralph Castain <r...@open-mpi.org>

>
> On Mar 25, 2011, at 10:48 AM, Hugo Meyer wrote:
>
>> From what you've described before, I suspect all you'll need to do is add
>> some code in orte/mca/odls/base/odls_base_default_fns.c that (a) checks to
>> see if a process in the launch message is being relocated (the
>> construct_child_list code does that already), and then (b) sends the
>> required info to all local child processes so they can take appropriate
>> action.
>>
>> Failure detection, re-launch, etc. have all been taken care of for you.
>>
>
>
> I looked at the code you mentioned and realized that I have two possible
> options, which I would like to share with you to get your opinion.
>
> First of all, let me describe the current state of my implementation. I'm
> working on a fault-tolerant system that uses uncoordinated checkpointing, so
> I take checkpoints of each process at different times and store them on the
> machine where the process resides. I also send these checkpoints to another
> node (let's call it the protector), so that if a node fails, its processes
> can be restarted on the protector that holds their checkpoints.
>
> Right now I can detect the failure of a process, I know where the process
> should be restarted, and the checkpoint is available on the protector. I
> also have the child information, of course.
>
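The protector scheme described above can be illustrated with a minimal C sketch. It is not Open MPI code: `proc_info_t`, `protector_of`, and `relocate_procs` are hypothetical names, and the pairing rule (node i is protected by node (i + 1) mod N) is an assumption, since the thread does not say how a protector is chosen.

```c
#include <stddef.h>

/* Hypothetical record of where a process currently lives. */
typedef struct {
    int rank;   /* process rank */
    int node;   /* index of the node hosting the process */
} proc_info_t;

/* Assumed pairing rule: node i is protected by node (i + 1) % num_nodes.
 * Returns -1 when no valid protector exists. */
int protector_of(int node, int num_nodes)
{
    if (num_nodes < 2 || node < 0 || node >= num_nodes) {
        return -1;
    }
    return (node + 1) % num_nodes;
}

/* Reassign every process hosted on failed_node to that node's protector.
 * Returns the number of processes moved, or -1 if there is no protector. */
int relocate_procs(proc_info_t *procs, size_t nprocs,
                   int failed_node, int num_nodes)
{
    int target = protector_of(failed_node, num_nodes);
    int moved = 0;

    if (target < 0) {
        return -1;
    }
    for (size_t i = 0; i < nprocs; i++) {
        if (procs[i].node == failed_node) {
            procs[i].node = target;  /* restart location = protector */
            moved++;
        }
    }
    return moved;
}
```

With three nodes and ranks 0 and 2 on node 0, calling `relocate_procs(procs, 3, 0, 3)` would move both ranks to node 1, which under this assumed rule is where their checkpoints were already sent.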
> So, my options are:
> *First Option*
>
> I detect the failure and then use
> orte_errmgr_hnp_base_global_update_state() with some modifications, together
> with hnp_relocate, but changing the spawn so that it restarts the process
> from a checkpoint. I suppose that this way the migration of the process to
> another node will be recorded and everyone will know about it, because the
> HNP is the one doing it (is this correct?).
>
>
> This is the option I would use. The other one is much, much more work. In
> this option, you only have to:
>
> (a) modify the mapper so you can specify the location of the proc being
> restarted. The resilient mapper module will be handling the restart - if you
> look at orte/mca/rmaps/resilient/rmaps_resilient.c, you can see the code
> doing the "replacement" and modify accordingly.
>
> (b) add any required info about your checkpoint to the launch message. This
> gets created in orte/mca/odls/base/odls_base_default_fns.c, the
> "get_add_procs_data" function (at the top of the file).
>
> (c) modify the launch code to handle your checkpoint, if required - see the
> file in (b), the "construct_child" and "launch" functions.
>
> HTH
> Ralph
>
>
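Steps (b) and (c) above can be pictured with a small, self-contained C sketch. It does not use the real ORTE data structures or functions: `launch_entry_t`, `launch_action_t`, and `choose_action` are hypothetical names, and the idea that the launch message carries a "relocated" flag plus a checkpoint path is an assumption made only for illustration.

```c
#include <stdbool.h>

/* Hypothetical per-process entry in a launch message (step b):
 * the usual placement info plus the checkpoint details. */
typedef struct {
    int  rank;            /* process rank */
    int  target_node;     /* node chosen by the mapper (step a) */
    bool relocated;       /* true if this proc is being restarted elsewhere */
    char ckpt_path[256];  /* checkpoint location; empty if none */
} launch_entry_t;

typedef enum {
    ACTION_SKIP,     /* entry is for another node */
    ACTION_LAUNCH,   /* normal fresh launch */
    ACTION_RESTART   /* restart from the checkpoint (step c) */
} launch_action_t;

/* Decide what the local daemon would do with one entry. */
launch_action_t choose_action(const launch_entry_t *e, int local_node)
{
    if (e->target_node != local_node) {
        return ACTION_SKIP;
    }
    if (e->relocated && e->ckpt_path[0] != '\0') {
        return ACTION_RESTART;
    }
    return ACTION_LAUNCH;
}
```

In this sketch a relocated process with no checkpoint path falls back to a fresh launch. Whether that fallback is acceptable depends on the fault-tolerance design and is not something the thread specifies.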
>
> *Second Option*
>
> Modify one of the spawn variations (probably remote_spawn from rsh) in the
> PLM framework, and then use orted_comm to command a remote_spawn on the
> protector. Here, though, I don't know how to update the info so that
> everyone knows about the change, or how this is managed.
>
> I might be very wrong in what I said, my apologies if so.
>
> Thanks a lot for all the help.
>
> Best regards.
>
> Hugo Meyer
>
> _______________________________________________
> devel mailing list
> de...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/devel
>
