Thanks, Ralph. I found a set_lifeline call, and with that I think I solved that error, but now I'm dealing with another one.
[clus3:32001] [[44269,0],2] -> [[44269,1],1] (node: node3) oob-tcp: Number of attempts to create TCP connection has been exceeded. Can not communicate with peer
Open MPI Error Report:[32001]: While communicating to proc [[44269,1],1] on node node3, proc [[44269,0],2] on node clus3 encountered an error 'Communication failure':OOB Connection retries exceeded. Can not communicate with peer

I think this occurs because the daemon [[44269,0],2] doesn't know on which
port and address the proc has been restored. I will look for a way to update
this information.

Best regards.

Hugo

2011/4/6 Ralph Castain <r...@open-mpi.org>

> Looks like the lifeline is still pointing to its old daemon instead of
> being updated to the new one. Look in orte/mca/routed/cm/routed_cm.c -
> should be something in there that updates the lifeline during restart of a
> checkpoint.
>
>
> On Apr 6, 2011, at 7:50 AM, Hugo Meyer wrote:
>
> Hi all.
>
> I corrected the error with the port. The problem was that the ports are
> static, so when the process was started again it was taking a port where
> an app was already running.
>
> Initially, the process was running on [[65478,0],1] and then it moved
> to [[65478,0],2].
>
> So now I get the socket bound, but I'm getting a communication failure
> in [[65478,0],1]. I'm attaching my debug output (some of it is in Spanish,
> but the default Open MPI debug output is still there), where you can see
> everything from the moment I kill the process running on *clus5* to the
> moment it is restored on *clus3*. Then I get a TERMINATED WITHOUT SYNC
> for the restarted proc:
>
> *[clus3:21615] [[65478,0],2] errmgr:orted got state TERMINATED WITHOUT SYNC for proc [[65478,1],1] pid 21705*
>
> Here is my stdout after the socket is bound again when the process
> restarts.
>
> [1,1]<stdout>:SOCKET BINDED
> [1,1]<stdout>:[clus5:19425] App) notify_response: Waiting for final handshake.
> [1,1]<stdout>:[clus5:19425] App) update_status: Update checkpoint status (13, /tmp/radic/1) for [[65478,1],1]
> [1,0]<stdout>:INICIEI O BROADCAST (6)
> [1,0]<stdout>:FINALIZEI O BROADCAST (6)
> [1,0]<stdout>:INICIEI O BROADCAST
> [1,3]<stdout>:INICIEI O BROADCAST (6)
> [1,2]<stdout>:INICIEI O BROADCAST (6)
> [1,3]<stdout>:FINALIZEI O BROADCAST (6)
> [1,3]<stdout>:INICIEI O BROADCAST
> [1,2]<stdout>:FINALIZEI O BROADCAST (6)
> [1,2]<stdout>:INICIEI O BROADCAST
> [1,1]<stdout>:[clus5:19425] [[65478,1],1] errmgr:app: job [65478,0] reported state COMMUNICATION FAILURE for proc [[65478,0],1] state COMMUNICATION FAILURE exit_code 1
> [1,1]<stdout>:[clus5:19425] [[65478,1],1] routed:cm: Connection to lifeline [[65478,0],1] lost
> [1,1]<stdout>:[[65478,1],1] assigned port 31256
>
> Any help on how to solve this error, or how to interpret it, will be
> greatly appreciated.
>
> Best regards.
>
> Hugo
>
> 2011/4/5 Hugo Meyer <meyer.h...@gmail.com>
>
>> Hello Ralph and all.
>>
>> Ralph, following your recommendations, I have already restarted the
>> process on another node from its checkpoint. But now I'm having a small
>> problem with the oob_tcp.
>> Here is the output:
>>
>> odls_base_default_fns:SETEANDO BLCR CONTEXT
>> CKPT-FILE: /tmp/radic/1/ompi_blcr_context.13374
>> ODLS_BASE_DEFAULT_FNS: REINICIO PROCESO EN [[34224,0],2]
>> [1,1]<stdout>:INICIEI O BROADCAST (2)
>> [1,1]<stdout>:[clus5:13374] snapc:single:app do_checkpoint: RESTART (3)
>> *[1,1]<stdout>:[clus5:13374] mca_oob_tcp_init: creating listen socket*
>> *[1,1]<stdout>:[clus5:13374] mca_oob_tcp_init: unable to create IPv4 listen socket: Unable to open a TCP socket for out-of-band communications*
>> [1,1]<stdout>:[clus5:13374] App) notify_response: Waiting for final handshake.
>> [1,1]<stdout>:[clus5:13374] App) update_status: Update checkpoint status (13, /tmp/radic/1) for [[34224,1],1]
>> [1,0]<stdout>:INICIEI O BROADCAST (6)
>> [1,0]<stdout>:FINALIZEI O BROADCAST (6)
>> [1,0]<stdout>:INICIEI O BROADCAST
>> [1,3]<stdout>:INICIEI O BROADCAST (6)
>> [1,3]<stdout>:FINALIZEI O BROADCAST (6)
>> [1,3]<stdout>:INICIEI O BROADCAST
>> *[1,1]<stdout>:[clus5:13374] [[34224,1],1] errmgr:app: job [34224,0] reported state COMMUNICATION FAILURE for proc [[34224,0],1] state COMMUNICATION FAILURE exit_code 1*
>> *[1,1]<stdout>:[clus5:13374] [[34224,1],1] routed:cm: Connection to lifeline [[34224,0],1] lost*
>>
>>
>> I think this error occurs because the process wants to create the
>> socket using the port that was previously assigned to it. So, if I
>> want to restart it using another port or something similar, how will the
>> other daemons and processes find out about this? Is this a good choice?
>>
>> Best regards.
>>
>> Hugo Meyer
>>
>> 2011/3/31 Hugo Meyer <meyer.h...@gmail.com>
>>
>>> Ok Ralph.
>>> Thanks a lot, I will resend this message with a new subject.
>>>
>>> Best Regards.
>>>
>>> Hugo
>>>
>>>
>>> 2011/3/31 Ralph Castain <r...@open-mpi.org>
>>>
>>>> Sorry - should have included the devel list when I sent this.
>>>>
>>>>
>>>> On Mar 30, 2011, at 6:11 PM, Ralph Castain wrote:
>>>>
>>>> I'm not the expert on this area - Josh is, so I'll defer to him. I did
>>>> take a quick glance at the sstore framework, though, and it looks like
>>>> there are some params you could set that might help.
>>>>
>>>> "ompi_info --param sstore all"
>>>>
>>>> should tell you what's available. Also, note that Josh created a man
>>>> page to explain how sstore works. It's in section 7, looks like "man
>>>> orte_sstore" should get it.
>>>>
>>>>
>>>> On Mar 30, 2011, at 3:09 PM, Hugo Meyer wrote:
>>>>
>>>> Hello again.
>>>>
>>>> I'm working on the launch code to handle my checkpoints, but I'm a
>>>> little stuck on how to set the path to my checkpoint and the executable
>>>> (ompi_blcr_context.PID). I took a look at the code in
>>>> odls_base_default_fns.c and this piece of code caught my attention:
>>>>
>>>> #if OPAL_ENABLE_FT_CR == 1
>>>>     /*
>>>>      * OPAL CRS components need the opportunity to take action before a process
>>>>      * is forked.
>>>>      * Needs access to:
>>>>      *   - Environment
>>>>      *   - Rank/ORTE Name
>>>>      *   - Binary to exec
>>>>      */
>>>>     if( NULL != opal_crs.crs_prelaunch ) {
>>>>         if( OPAL_SUCCESS != (rc = opal_crs.crs_prelaunch(child->name->vpid,
>>>>                                                          orte_sstore_base_prelaunch_location,
>>>>                                                          &(app->app),
>>>>                                                          &(app->cwd),
>>>>                                                          &(app->argv),
>>>>                                                          &(app->env) ) ) ) {
>>>>             ORTE_ERROR_LOG(rc);
>>>>             goto CLEANUP;
>>>>         }
>>>>     }
>>>> #endif
>>>>
>>>>
>>>> But I couldn't find out how to set orte_sstore_base_prelaunch_location; I
>>>> know that initially this is set in sstore_base_open. For example, as I'm
>>>> transferring my checkpoint from one node to another, I store the checkpoint
>>>> that has to be restored in /tmp/1/ and it has a name
>>>> like ompi_blcr_context.PID.
>>>>
>>>> Is there any function that I didn't see that allows me to do this?
>>>> I'm asking this because I do not want to change the signature of the
>>>> functions to pass the details of the checkpoint and the PID.
>>>>
>>>> Best Regards.
>>>>
>>>> Hugo Meyer
>>>>
>>>> 2011/3/30 Hugo Meyer <meyer.h...@gmail.com>
>>>>
>>>>> Thanks Ralph.
>>>>> I have finished point (a), and it is working now. Next I have to work
>>>>> on relaunching from my checkpoint, as you said.
>>>>>
>>>>> Best regards.
>>>>>
>>>>> Hugo Meyer
>>>>>
>>>>>
>>>>> 2011/3/29 Ralph Castain <r...@open-mpi.org>
>>>>>
>>>>>> The resilient mapper -only- works on procs being restarted - it
>>>>>> cannot map a job for its initial launch. You shouldn't set any rmaps flag
>>>>>> and things will work correctly - the default round-robin mapper will map
>>>>>> the initial launch, and then the resilient mapper will handle restarts.
>>>>>>
>>>>>>
>>>>>> On Mar 29, 2011, at 5:18 AM, Hugo Meyer wrote:
>>>>>>
>>>>>> Ralph.
>>>>>>
>>>>>> I'm having a problem when I try to select the resilient rmaps
>>>>>> component:
>>>>>>
>>>>>> /home/hmeyer/desarrollo/ompi-code/binarios/bin/mpirun -v -np 4 --hostfile ../hostfile --bynode -mca rmaps resilient -mca vprotocol receiver -mca plm rsh -mca routed cm ./coll 6 10 2>out.txt
>>>>>>
>>>>>>
>>>>>> I get this error:
>>>>>>
>>>>>> [clus9:25568] [[53334,0],0] hostfile: checking hostfile ../hostfile for nodes
>>>>>> --------------------------------------------------------------------------
>>>>>> Your job failed to map. Either no mapper was available, or none
>>>>>> of the available mappers was able to perform the requested
>>>>>> mapping operation. This can happen if you request a map type
>>>>>> (e.g., loadbalance) and the corresponding mapper was not built.
>>>>>> --------------------------------------------------------------------------
>>>>>> [clus9:25568] errmgr:hnp:update_state() [[53334,0],0]) ------- App.
>>>>>> Process state updated for process NULL
>>>>>> [clus9:25568] [[53334,0],0] errmgr:hnp: job [53334,0] reported state NEVER LAUNCHED for proc NULL state UNDEFINED pid 0 exit_code 1
>>>>>> [clus9:25568] [[53334,0],0] errmgr:hnp: job [53334,0] reported state NEVER LAUNCHED
>>>>>> [clus9:25568] [[53334,0],0] errmgr:hnp: abort called on job [53334,0] with status 1
>>>>>>
>>>>>>
>>>>>> Is there a flag that I'm not turning on, or a component that I should
>>>>>> have selected?
>>>>>>
>>>>>> Thanks again.
>>>>>>
>>>>>> Hugo Meyer
>>>>>>
>>>>>>
>>>>>> 2011/3/26 Hugo Meyer <meyer.h...@gmail.com>
>>>>>>
>>>>>>> Ok Ralph.
>>>>>>>
>>>>>>> Thanks a lot for your help, I will do as you said and then let you
>>>>>>> know how it goes.
>>>>>>>
>>>>>>> Best Regards.
>>>>>>>
>>>>>>> Hugo Meyer
>>>>>>>
>>>>>>>
>>>>>>> 2011/3/25 Ralph Castain <r...@open-mpi.org>
>>>>>>>
>>>>>>>>
>>>>>>>> On Mar 25, 2011, at 10:48 AM, Hugo Meyer wrote:
>>>>>>>>
>>>>>>>>> From what you've described before, I suspect all you'll need to do
>>>>>>>>> is add some code in orte/mca/odls/base/odls_base_default_fns.c that
>>>>>>>>> (a) checks to see if a process in the launch message is being
>>>>>>>>> relocated (the construct_child_list code does that already), and
>>>>>>>>> then (b) sends the required info to all local child processes so
>>>>>>>>> they can take appropriate action.
>>>>>>>>>
>>>>>>>>> Failure detection, re-launch, etc. have all been taken care of for
>>>>>>>>> you.
>>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> I looked at the code that you mentioned and I realized that I
>>>>>>>> have two possible options, which I'm going to share with you to get
>>>>>>>> your opinion.
>>>>>>>>
>>>>>>>> First of all, I will let you know my current situation with the
>>>>>>>> implementation.
>>>>>>>> As I'm working on a fault-tolerant system that uses
>>>>>>>> uncoordinated checkpointing, I'm taking checkpoints of all my
>>>>>>>> processes at different times and storing them on the machine where
>>>>>>>> they reside, but I also send these checkpoints to another node (let's
>>>>>>>> call it the protector), so if a node fails, its processes should be
>>>>>>>> restarted on the protector that has their checkpoints.
>>>>>>>>
>>>>>>>> Right now I'm detecting the failure of a process and I know where
>>>>>>>> this process should be restarted, and I also have the checkpoint on
>>>>>>>> the protector. And I also have the child information, of course.
>>>>>>>>
>>>>>>>> So, my options are:
>>>>>>>> *First Option*
>>>>>>>>
>>>>>>>> I detect the failure, and then I use
>>>>>>>> orte_errmgr_hnp_base_global_update_state() with some modifications,
>>>>>>>> and hnp_relocate, but changing the spawning to do a restart from a
>>>>>>>> checkpoint. I suppose that, using this, the migration of the process
>>>>>>>> to another node will be updated and everyone will know about it,
>>>>>>>> because it is the hnp that is going to do this (is this ok?).
>>>>>>>>
>>>>>>>>
>>>>>>>> This is the option I would use. The other one is much, much more
>>>>>>>> work. In this option, you only have to:
>>>>>>>>
>>>>>>>> (a) modify the mapper so you can specify the location of the proc
>>>>>>>> being restarted. The resilient mapper module will be handling the
>>>>>>>> restart - if you look at orte/mca/rmaps/resilient/rmaps_resilient.c,
>>>>>>>> you can see the code doing the "replacement" and modify accordingly.
>>>>>>>>
>>>>>>>> (b) add any required info about your checkpoint to the launch
>>>>>>>> message. This gets created in
>>>>>>>> orte/mca/odls/base/odls_base_default_fns.c,
>>>>>>>> the "get_add_procs_data" function (at the top of the file).
>>>>>>>>
>>>>>>>> (c) modify the launch code to handle your checkpoint, if required -
>>>>>>>> see the file in (b), the "construct_child" and "launch" functions.
>>>>>>>>
>>>>>>>> HTH
>>>>>>>> Ralph
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> *Second Option*
>>>>>>>>
>>>>>>>> Modify one of the spawn variations (probably the remote_spawn from
>>>>>>>> rsh) in the PLM framework and then use the orted_comm to command a
>>>>>>>> remote_spawn on the protector, but here I don't know how to update
>>>>>>>> the info so everyone knows about the change, or how this is managed.
>>>>>>>>
>>>>>>>> I might be very wrong in what I said; my apologies if so.
>>>>>>>>
>>>>>>>> Thanks a lot for all the help.
>>>>>>>>
>>>>>>>> Best regards.
>>>>>>>>
>>>>>>>> Hugo Meyer
>>>>>>>>
>>>>>>>> _______________________________________________
>>>>>>>> devel mailing list
>>>>>>>> de...@open-mpi.org
>>>>>>>> http://www.open-mpi.org/mailman/listinfo.cgi/devel
>
> <out>
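
A note on the port conflict discussed in the 2011/4/5 and 2011/4/6 messages: the restarted process could not create its listen socket because the static port it had been assigned was already taken on the new node. A minimal sketch of one way around that, assuming plain POSIX sockets rather than Open MPI's oob_tcp code: try the preferred port first, fall back to an OS-assigned port if it is in use, and report the port actually bound. The helper name bind_with_fallback is hypothetical and is not an Open MPI function.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <string.h>
#include <unistd.h>
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>

/* Hypothetical helper (not part of Open MPI): create a listening TCP socket
 * on `preferred_port`; if that port is already in use, fall back to a port
 * chosen by the OS. The port actually bound is stored in `*out_port` so the
 * caller can tell its peers where to connect.
 * Returns the socket descriptor, or -1 on error. */
int bind_with_fallback(uint16_t preferred_port, uint16_t *out_port)
{
    assert(out_port != NULL);

    int fd = socket(AF_INET, SOCK_STREAM, 0);
    if (fd < 0) {
        return -1;
    }

    struct sockaddr_in addr;
    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_ANY);
    addr.sin_port = htons(preferred_port);

    if (bind(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0) {
        if (errno != EADDRINUSE) {
            close(fd);
            return -1;
        }
        /* Preferred port is taken: port 0 asks the OS to pick a free one. */
        addr.sin_port = htons(0);
        if (bind(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0) {
            close(fd);
            return -1;
        }
    }

    /* Read back the port that was really bound (needed when port 0 was used). */
    socklen_t len = sizeof(addr);
    if (getsockname(fd, (struct sockaddr *)&addr, &len) < 0 ||
        listen(fd, 16) < 0) {
        close(fd);
        return -1;
    }

    *out_port = ntohs(addr.sin_port);
    return fd;
}
```

The caller would still have to publish the returned port to the daemons and the other procs; the thread does not settle how that update is done in ORTE.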