Hello again. I'm working on the launch code to handle my checkpoints, but I'm a little stuck on how to set the path to my checkpoint and the executable (ompi_blcr_context.PID). I took a look at the code in odls_base_default_fns.c and this piece of code caught my attention:

#if OPAL_ENABLE_FT_CR == 1
    /*
     * OPAL CRS components need the opportunity to take action before a process
     * is forked.
     * Needs access to:
     *   - Environment
     *   - Rank/ORTE Name
     *   - Binary to exec
     */
    if( NULL != opal_crs.crs_prelaunch ) {
        if( OPAL_SUCCESS != (rc = opal_crs.crs_prelaunch(child->name->vpid,
                                                         orte_sstore_base_prelaunch_location,
                                                         &(app->app),
                                                         &(app->cwd),
                                                         &(app->argv),
                                                         &(app->env) ) ) ) {
            ORTE_ERROR_LOG(rc);
            goto CLEANUP;
        }
    }
#endif

But I couldn't find out how to set orte_sstore_base_prelaunch_location; I know that it is initially set in sstore_base_open. For example, as I'm transferring my checkpoint from one node to another, I store the checkpoint that has to be restored in /tmp/1/ and it has a name like ompi_blcr_context.PID. Is there any function that I didn't see that allows me to do this? I'm asking because I do not want to change the signature of the functions to pass the details of the checkpoint and the PID.

Best Regards.

Hugo Meyer

2011/3/30 Hugo Meyer <meyer.h...@gmail.com>

> Thanks Ralph.
> I have finished point (a), and now it's working; now I have to work on
> relaunching from my checkpoint as you said.
>
> Best regards.
>
> Hugo Meyer
>
>
> 2011/3/29 Ralph Castain <r...@open-mpi.org>
>
>> The resilient mapper -only- works on procs being restarted - it cannot map
>> a job for its initial launch. You shouldn't set any rmaps flag and things
>> will work correctly - the default round-robin mapper will map the initial
>> launch, and then the resilient mapper will handle restarts.
>>
>>
>> On Mar 29, 2011, at 5:18 AM, Hugo Meyer wrote:
>>
>> Ralph.
>>
>> I'm having a problem when I try to select the resilient rmaps component:
>>
>> /home/hmeyer/desarrollo/ompi-code/binarios/bin/mpirun -v -np 4 --hostfile
>> ../hostfile --bynode -mca rmaps resilient -mca vprotocol receiver -mca plm
>> rsh -mca routed cm ./coll 6 10 2>out.txt
>>
>>
>> I get this error:
>>
>> [clus9:25568] [[53334,0],0] hostfile: checking hostfile ../hostfile for
>> nodes
>> --------------------------------------------------------------------------
>> Your job failed to map. Either no mapper was available, or none
>> of the available mappers was able to perform the requested
>> mapping operation. This can happen if you request a map type
>> (e.g., loadbalance) and the corresponding mapper was not built.
>>
>> --------------------------------------------------------------------------
>> [clus9:25568] errmgr:hnp:update_state() [[53334,0],0]) ------- App.
>> Process state updated for process NULL
>> [clus9:25568] [[53334,0],0] errmgr:hnp: job [53334,0] reported state NEVER
>> LAUNCHED for proc NULL state UNDEFINED pid 0 exit_code 1
>> [clus9:25568] [[53334,0],0] errmgr:hnp: job [53334,0] reported state NEVER
>> LAUNCHED
>> [clus9:25568] [[53334,0],0] errmgr:hnp: abort called on job [53334,0] with
>> status 1
>>
>>
>> Is there a flag that I'm not turning on, or a component that I should have
>> selected?
>>
>> Thanks again.
>>
>> Hugo Meyer
>>
>>
>> 2011/3/26 Hugo Meyer <meyer.h...@gmail.com>
>>
>>> Ok Ralph.
>>>
>>> Thanks a lot for your help, I will do as you said and then let you know
>>> how it goes.
>>>
>>> Best Regards.
>>>
>>> Hugo Meyer
>>>
>>>
>>> 2011/3/25 Ralph Castain <r...@open-mpi.org>
>>>
>>>>
>>>> On Mar 25, 2011, at 10:48 AM, Hugo Meyer wrote:
>>>>
>>>>> From what you've described before, I suspect all you'll need to do is
>>>>> add some code in orte/mca/odls/base/odls_base_default_fns.c that (a)
>>>>> checks
>>>>> to see if a process in the launch message is being relocated (the
>>>>> construct_child_list code does that already), and then (b) sends the
>>>>> required info to all local child processes so they can take appropriate
>>>>> action.
>>>>>
>>>>> Failure detection, re-launch, etc. have all been taken care of for you.
>>>>>
>>>>
>>>>
>>>> I looked at the code that you mentioned and I realize that I have
>>>> two possible options, which I'm going to share with you to get your
>>>> opinion.
>>>>
>>>> First of all I will let you know my current situation with the
>>>> implementation. As I'm working on a fault-tolerant system, but using
>>>> uncoordinated checkpointing, I'm taking checkpoints of all my processes
>>>> at different times and storing them on the machine where they reside,
>>>> but I also send these checkpoints to another node (let's call it the
>>>> protector), so if a node fails its processes should be restarted on the
>>>> protector that has their checkpoints.
>>>>
>>>> Right now I'm detecting the failure of a process and I know where this
>>>> process should be restarted, and I also have the checkpoint on the
>>>> protector. And I also have the child information, of course.
>>>>
>>>> So, my options are:
>>>> *First Option*
>>>>
>>>> I detect the failure, and then I use
>>>> orte_errmgr_hnp_base_global_update_state() with some modifications and
>>>> the hnp_relocate, but changing the spawning to make a restart from a
>>>> checkpoint. I suppose that, using this, the migration of the process to
>>>> another node will be updated and everyone will know about it, because it
>>>> is the HNP that is going to do this (is this ok?).
>>>>
>>>>
>>>> This is the option I would use. The other one is much, much more work.
>>>> In this option, you only have to:
>>>>
>>>> (a) modify the mapper so you can specify the location of the proc being
>>>> restarted. The resilient mapper module will be handling the restart - if
>>>> you
>>>> look at orte/mca/rmaps/resilient/rmaps_resilient.c, you can see the code
>>>> doing the "replacement" and modify accordingly.
>>>>
>>>> (b) add any required info about your checkpoint to the launch message.
>>>> This gets created in orte/mca/odls/base/odls_base_default_fns.c, the
>>>> "get_add_procs_data" function (at the top of the file).
>>>>
>>>> (c) modify the launch code to handle your checkpoint, if required - see
>>>> the file in (b), the "construct_child" and "launch" functions.
>>>>
>>>> HTH
>>>> Ralph
>>>>
>>>>
>>>>
>>>> *Second Option*
>>>>
>>>> Modify one of the spawn variations (probably the remote_spawn from rsh)
>>>> in the PLM framework and then use the orted_comm to command a
>>>> remote_spawn on the protector, but here I don't know how to update the
>>>> info so everyone knows about the change, or how this is managed.
>>>>
>>>> I might be very wrong in what I said, my apologies if so.
>>>>
>>>> Thanks a lot for all the help.
>>>>
>>>> Best regards.
>>>>
>>>> Hugo Meyer
>>>>
>>>> _______________________________________________
>>>> devel mailing list
>>>> de...@open-mpi.org
>>>> http://www.open-mpi.org/mailman/listinfo.cgi/devel
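
On the path question at the top of this thread: the following is a minimal sketch, not taken from Open MPI, of composing the full checkpoint file name from the directory and the PID. The helper name build_ckpt_path is hypothetical, and the sketch assumes the context file is always named ompi_blcr_context.<PID> inside the given directory (for example /tmp/1/). It does not show how orte_sstore_base_prelaunch_location gets set; it only illustrates how the path could be rebuilt on the receiving node without changing any function signatures.

```c
#include <stdio.h>
#include <string.h>
#include <sys/types.h>

/*
 * Hypothetical helper (NOT an Open MPI function): build the full path of a
 * BLCR context file from the directory that holds it and the PID of the
 * checkpointed process.
 *
 * Assumption: the file is named "ompi_blcr_context.<PID>".
 *
 * Returns 0 on success, -1 on bad arguments or if buf is too small.
 */
int build_ckpt_path(char *buf, size_t buflen, const char *dir, pid_t pid)
{
    if (NULL == buf || NULL == dir || 0 == buflen) {
        return -1;
    }

    size_t dlen = strlen(dir);
    if (0 == dlen) {
        return -1;  /* refuse an empty directory rather than write to "/" */
    }

    /* Add a separator only if the directory does not already end in '/'. */
    const char *sep = ('/' == dir[dlen - 1]) ? "" : "/";

    int n = snprintf(buf, buflen, "%s%sompi_blcr_context.%ld",
                     dir, sep, (long)pid);
    if (n < 0 || (size_t)n >= buflen) {
        return -1;  /* encoding error or truncated output */
    }
    return 0;
}
```

Called with the directory "/tmp/1/" (or "/tmp/1") and PID 1234, the buffer ends up holding /tmp/1/ompi_blcr_context.1234. Whether that string is then handed over through the launch message, as in point (b) above, or by some other route is left open here.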