Hi all.
I corrected the error with the port. The mistake was that, on restart, the process was started again and, because the ports are static, it took a port where an app was already running. Initially the process was running on [[65478,0],1], and then it moved to [[65478,0],2].

So now I get the socket bound, but I'm getting a communication failure in [[65478,0],1]. I'm sending my debug output as an attachment (some of it is in Spanish, but the default Open MPI debug output is still there), where you can see everything from the moment I kill the process running on clus5 to the moment it is restored on clus3. After that I get a TERMINATED WITHOUT SYNC for the restarted proc:

[clus3:21615] [[65478,0],2] errmgr:orted got state TERMINATED WITHOUT SYNC for proc [[65478,1],1] pid 21705

Here is my stdout after the socket is bound again when the process restarts:

[1,1]<stdout>:SOCKET BINDED
[1,1]<stdout>:[clus5:19425] App) notify_response: Waiting for final handshake.
[1,1]<stdout>:[clus5:19425] App) update_status: Update checkpoint status (13, /tmp/radic/1) for [[65478,1],1]
[1,0]<stdout>:INICIEI O BROADCAST (6)
[1,0]<stdout>:FINALIZEI O BROADCAST (6)
[1,0]<stdout>:INICIEI O BROADCAST
[1,3]<stdout>:INICIEI O BROADCAST (6)
[1,2]<stdout>:INICIEI O BROADCAST (6)
[1,3]<stdout>:FINALIZEI O BROADCAST (6)
[1,3]<stdout>:INICIEI O BROADCAST
[1,2]<stdout>:FINALIZEI O BROADCAST (6)
[1,2]<stdout>:INICIEI O BROADCAST
[1,1]<stdout>:[clus5:19425] [[65478,1],1] errmgr:app: job [65478,0] reported state COMMUNICATION FAILURE for proc [[65478,0],1] state COMMUNICATION FAILURE exit_code 1
[1,1]<stdout>:[clus5:19425] [[65478,1],1] routed:cm: Connection to lifeline [[65478,0],1] lost
[1,1]<stdout>:[[65478,1],1] assigned port 31256

Any help on how to solve this error, or how to interpret it, will be greatly appreciated.

Best regards.

Hugo

2011/4/5 Hugo Meyer <meyer.h...@gmail.com>

> Hello Ralph and @ll.
>
> Ralph, by following your recomendations i've already restart the process in
> another node from his checkpoint. But now i'm having a small problem with
> the oob_tcp. There is the output:
>
> odls_base_default_fns:SETEANDO BLCR CONTEXT
> CKPT-FILE: /tmp/radic/1/ompi_blcr_context.13374
> ODLS_BASE_DEFAULT_FNS: REINICIO PROCESO EN [[34224,0],2]
> [1,1]<stdout>:INICIEI O BROADCAST (2)
> [1,1]<stdout>:[clus5:13374] snapc:single:app do_checkpoint: RESTART (3)
> *[1,1]<stdout>:[clus5:13374] mca_oob_tcp_init: creating listen socket*
> *[1,1]<stdout>:[clus5:13374] mca_oob_tcp_init: unable to create IPv4 listen socket: Unable to open a TCP socket for out-of-band communications*
> [1,1]<stdout>:[clus5:13374] App) notify_response: Waiting for final handshake.
> [1,1]<stdout>:[clus5:13374] App) update_status: Update checkpoint status (13, /tmp/radic/1) for [[34224,1],1]
> [1,0]<stdout>:INICIEI O BROADCAST (6)
> [1,0]<stdout>:FINALIZEI O BROADCAST (6)
> [1,0]<stdout>:INICIEI O BROADCAST
> [1,3]<stdout>:INICIEI O BROADCAST (6)
> [1,3]<stdout>:FINALIZEI O BROADCAST (6)
> [1,3]<stdout>:INICIEI O BROADCAST
> *[1,1]<stdout>:[clus5:13374] [[34224,1],1] errmgr:app: job [34224,0] reported state COMMUNICATION FAILURE for proc [[34224,0],1] state COMMUNICATION FAILURE exit_code 1*
> *[1,1]<stdout>:[clus5:13374] [[34224,1],1] routed:cm: Connection to lifeline [[34224,0],1] lost*
>
>
> I'm thinking that this error ocurrs because the process want to create the
> socket using the port that was previously assigned to it. So, if i want to
> restart it using another port or something how the other daemons and process
> will find out about this? Is this a good choice?
>
> Best regards.
>
> Hugo Meyer
>
> 2011/3/31 Hugo Meyer <meyer.h...@gmail.com>
>
>> Ok Ralph.
>> Thanks a lot, i will resend this message with a new subject.
>>
>> Best Regards.
>>
>> Hugo
>>
>>
>> 2011/3/31 Ralph Castain <r...@open-mpi.org>
>>
>>> Sorry - should have included the devel list when I sent this.
>>>
>>>
>>> On Mar 30, 2011, at 6:11 PM, Ralph Castain wrote:
>>>
>>> I'm not the expert on this area - Josh is, so I'll defer to him. I did
>>> take a quick glance at the sstore framework, though, and it looks like there
>>> are some params you could set that might help.
>>>
>>> "ompi_info --param sstore all"
>>>
>>> should tell you what's available. Also, note that Josh created a man page
>>> to explain how sstore works. It's in section 7, looks like "man orte_sstore"
>>> should get it.
>>>
>>>
>>> On Mar 30, 2011, at 3:09 PM, Hugo Meyer wrote:
>>>
>>> Hello again.
>>>
>>> I'm working in the launch code to handle my checkpoints, but i'm a little
>>> stuck in how to set the path to my checkpoint and the executable
>>> (ompi_blcr_context.PID). I take a look at the code in
>>> odls_base_default_fns.c and this piece of code took my attention:
>>>
>>> #if OPAL_ENABLE_FT_CR == 1
>>>     /*
>>>      * OPAL CRS components need the opportunity to take action before a process
>>>      * is forked.
>>>      * Needs access to:
>>>      *   - Environment
>>>      *   - Rank/ORTE Name
>>>      *   - Binary to exec
>>>      */
>>>     if( NULL != opal_crs.crs_prelaunch ) {
>>>         if( OPAL_SUCCESS != (rc = opal_crs.crs_prelaunch(child->name->vpid,
>>>                                       orte_sstore_base_prelaunch_location,
>>>                                       &(app->app),
>>>                                       &(app->cwd),
>>>                                       &(app->argv),
>>>                                       &(app->env) ) ) ) {
>>>             ORTE_ERROR_LOG(rc);
>>>             goto CLEANUP;
>>>         }
>>>     }
>>> #endif
>>>
>>>
>>> But i didn't find out how to set orte_sstore_base_prelaunch_location, i
>>> now that initially this is set in the sstore_base_open. For example, as i'm
>>> transfering my checkpoint from one node to another, i store the checkpoint
>>> that has to be restore in /tmp/1/ and it has a name
>>> like ompi_blcr_context.PID.
>>>
>>> Is there any function that i didn't see that allows me to do this? I'm
>>> asking this because I do not want to change the signature of the
>>> functions to pass the details of the checkpoint and the PID.
>>>
>>> Best Regards.
>>>
>>> Hugo Meyer
>>>
>>> 2011/3/30 Hugo Meyer <meyer.h...@gmail.com>
>>>
>>>> Thanks Ralph.
>>>> I have finished the (a) point, and now its working, now i have to work
>>>> to relaunch from my checkpoint as you said.
>>>>
>>>> Best regards.
>>>>
>>>> Hugo Meyer
>>>>
>>>>
>>>> 2011/3/29 Ralph Castain <r...@open-mpi.org>
>>>>
>>>>> The resilient mapper -only- works on procs being restarted - it cannot
>>>>> map a job for its initial launch. You shouldn't set any rmaps flag and
>>>>> things will work correctly - the default round-robin mapper will map the
>>>>> initial launch, and then the resilient mapper will handle restarts.
>>>>>
>>>>>
>>>>> On Mar 29, 2011, at 5:18 AM, Hugo Meyer wrote:
>>>>>
>>>>> Ralph.
>>>>>
>>>>> I'm having a problem when i try to select the rmaps resilient to be
>>>>> used:
>>>>>
>>>>> /home/hmeyer/desarrollo/ompi-code/binarios/bin/mpirun -v -np 4 --hostfile ../hostfile --bynode -mca rmaps resilient -mca vprotocol receiver -mca plm rsh -mca routed cm ./coll 6 10 2>out.txt
>>>>>
>>>>>
>>>>> I get this as error:
>>>>>
>>>>> [clus9:25568] [[53334,0],0] hostfile: checking hostfile ../hostfile for nodes
>>>>>
>>>>> --------------------------------------------------------------------------
>>>>> Your job failed to map. Either no mapper was available, or none
>>>>> of the available mappers was able to perform the requested
>>>>> mapping operation. This can happen if you request a map type
>>>>> (e.g., loadbalance) and the corresponding mapper was not built.
>>>>>
>>>>>
>>>>> --------------------------------------------------------------------------
>>>>> [clus9:25568] errmgr:hnp:update_state() [[53334,0],0]) ------- App.
>>>>> Process state updated for process NULL
>>>>> [clus9:25568] [[53334,0],0] errmgr:hnp: job [53334,0] reported state NEVER LAUNCHED for proc NULL state UNDEFINED pid 0 exit_code 1
>>>>> [clus9:25568] [[53334,0],0] errmgr:hnp: job [53334,0] reported state NEVER LAUNCHED
>>>>> [clus9:25568] [[53334,0],0] errmgr:hnp: abort called on job [53334,0] with status 1
>>>>>
>>>>>
>>>>> Is there a flag that i'm not turning on? or a component that i should
>>>>> have selected?
>>>>>
>>>>> Thanks again.
>>>>>
>>>>> Hugo Meyer
>>>>>
>>>>>
>>>>> 2011/3/26 Hugo Meyer <meyer.h...@gmail.com>
>>>>>
>>>>>> Ok Ralph.
>>>>>>
>>>>>> Thanks a lot for your help, i will do as you said and then let you
>>>>>> know how it goes.
>>>>>>
>>>>>> Best Regards.
>>>>>>
>>>>>> Hugo Meyer
>>>>>>
>>>>>>
>>>>>> 2011/3/25 Ralph Castain <r...@open-mpi.org>
>>>>>>
>>>>>>>
>>>>>>> On Mar 25, 2011, at 10:48 AM, Hugo Meyer wrote:
>>>>>>>
>>>>>>>> From what you've described before, I suspect all you'll need to do is
>>>>>>>> add some code in orte/mca/odls/base/odls_base_default_fns.c that (a)
>>>>>>>> checks to see if a process in the launch message is being relocated
>>>>>>>> (the construct_child_list code does that already), and then (b) sends
>>>>>>>> the required info to all local child processes so they can take
>>>>>>>> appropriate action.
>>>>>>>>
>>>>>>>> Failure detection, re-launch, etc. have all been taken care of for
>>>>>>>> you.
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> I looked at the code that you mentioned me and i realize that i have
>>>>>>> two possible options, that i'm going to share with you to know your
>>>>>>> opinion.
>>>>>>>
>>>>>>> First of all i will let you know my actual situation with the
>>>>>>> implementation.
>>>>>>> As i'm working in a Fault Tolerant system, but using
>>>>>>> uncoordinated checkpoint i'm taking checkpoints of all my process at
>>>>>>> different time and storing them on the machine where there are
>>>>>>> residing, but i also send this checkpoints to another node (lets call
>>>>>>> it protector), so if this node fails his process should be restarted
>>>>>>> in the protector that have his checkpoints.
>>>>>>>
>>>>>>> Right now i'm detecting the failure of a process and i know where
>>>>>>> this process should be restarted, and also i have the checkpoint in the
>>>>>>> protector. And i also have the child information of course.
>>>>>>>
>>>>>>> So, my options are:
>>>>>>> *First Option*
>>>>>>>
>>>>>>> I detect the failure, and then i use
>>>>>>> orte_errmgr_hnp_base_global_update_state() with some modifications and
>>>>>>> the hnp_relocate but changing the spawning to make a restart from a
>>>>>>> checkpoint, i suposse that using this, the migration of the process to
>>>>>>> another node will be updated and everyone will know it, because is the
>>>>>>> hnp who is going to do this (is this ok?).
>>>>>>>
>>>>>>>
>>>>>>> This is the option I would use. The other one is much, much more
>>>>>>> work. In this option, you only have to:
>>>>>>>
>>>>>>> (a) modify the mapper so you can specify the location of the proc
>>>>>>> being restarted. The resilient mapper module will be handling the
>>>>>>> restart - if you look at orte/mca/rmaps/resilient/rmaps_resilient.c,
>>>>>>> you can see the code doing the "replacement" and modify accordingly.
>>>>>>>
>>>>>>> (b) add any required info about your checkpoint to the launch
>>>>>>> message. This gets created in orte/mca/odls/base/odls_base_default_fns.c,
>>>>>>> the "get_add_procs_data" function (at the top of the file).
>>>>>>>
>>>>>>> (c) modify the launch code to handle your checkpoint, if required -
>>>>>>> see the file in (b), the "construct_child" and "launch" functions.
>>>>>>>
>>>>>>> HTH
>>>>>>> Ralph
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> *Second Option*
>>>>>>>
>>>>>>> Modify one of the spawn variations(probably the remote_spawn from
>>>>>>> rsh) in the PLM framework and then use the orted_comm to command a
>>>>>>> remote_spawn in the protector, but i don't know here how to update the
>>>>>>> info so everyone knows about the change or how this is managed.
>>>>>>>
>>>>>>> I might be very wrong in what I said, my apologies if so.
>>>>>>>
>>>>>>> Thanks a lot for all the help.
>>>>>>>
>>>>>>> Best regards.
>>>>>>>
>>>>>>> Hugo Meyer
>>>>>>>
>>>>>>> _______________________________________________
>>>>>>> devel mailing list
>>>>>>> de...@open-mpi.org
>>>>>>> http://www.open-mpi.org/mailman/listinfo.cgi/devel
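
P.S. On the question quoted above about restarting with a port other than the one previously assigned: below is a minimal sketch in plain POSIX C, not the oob_tcp code, of how a restarted process could ask the kernel for any free port and then read back which one it got. The helper name listen_on_any_port is hypothetical and only for illustration. It does not answer how the other daemons and processes learn about the new port; that part would still have to go through whatever contact-info update the runtime provides.

```c
#include <assert.h>
#include <arpa/inet.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Hypothetical helper (not part of Open MPI): create a listening TCP socket
 * on a kernel-assigned port and report which port was chosen.
 * Returns the socket fd on success, -1 on failure. */
int listen_on_any_port(int *out_port)
{
    struct sockaddr_in addr;
    socklen_t len = sizeof(addr);
    int fd;

    assert(out_port != NULL);

    fd = socket(AF_INET, SOCK_STREAM, 0);
    if (fd < 0) {
        return -1;
    }

    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_ANY);
    addr.sin_port = htons(0);   /* port 0: let the kernel pick a free port */

    /* Bind, start listening, then ask the kernel which port it assigned. */
    if (bind(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0 ||
        listen(fd, 8) < 0 ||
        getsockname(fd, (struct sockaddr *)&addr, &len) < 0) {
        close(fd);
        return -1;
    }

    *out_port = ntohs(addr.sin_port);
    return fd;
}
```

With something like this, a bind on restart cannot collide with a port that is still held by another app on the new node, which is the failure described at the top of this mail. The port written to *out_port is the value that would then need to be advertised to the peers.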