Thanks Paul,
It's working fine with PRELINK=NO.
Leonardo
Paul H. Hargrove wrote:
Leonardo,
As you say, there is the possibility that moving from one node to
another has caused problems due to different shared libraries. The
result of this could be a segmentation fault, an illegal instruction
or even a bus error. In all three cases, however, the failure
generates a signal (SIGSEGV, SIGILL or SIGBUS). So, it is possible
that you are seeing the failure mode that you were expecting.
There are at least two ways you can deal with heterogeneous libraries.
The first is that if the libs differ only because of prelinking, you
can undo the prelinking as described in the BLCR FAQ
(http://mantis.lbl.gov/blcr/doc/html/FAQ.html#prelink).
The second would be to include the shared libraries in the checkpoint
itself. While this is very costly in terms of storage, you may find
it lets you restart in cases where you might not otherwise be able
to. The trick is to add --save-private or --save-all to the
checkpoint command that Open MPI uses to checkpoint the application
processes.
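For illustration only, here is a minimal C sketch of issuing such a
checkpoint with the libraries included. I have not checked how your
modified orted actually drives BLCR, and the target PID, context file
path and the -f option are written from memory, so double-check the
exact cr_checkpoint options your BLCR version accepts:

  /* Sketch: run BLCR's cr_checkpoint with --save-all so shared library
   * pages are written into the context file.  PID and output path are
   * placeholders, not what Open MPI really uses. */
  #include <stdio.h>
  #include <stdlib.h>
  #include <unistd.h>
  #include <sys/types.h>
  #include <sys/wait.h>

  static int checkpoint_with_libs(pid_t target, const char *ctx_file)
  {
      pid_t child = fork();
      if (child < 0)
          return -1;
      if (child == 0) {
          char pid_str[32];
          snprintf(pid_str, sizeof(pid_str), "%d", (int) target);
          /* --save-all embeds private and shared pages (incl. libraries);
           * --save-private would store only the non-shared ones. */
          execlp("cr_checkpoint", "cr_checkpoint",
                 "--save-all", "-f", ctx_file, pid_str, (char *) NULL);
          _exit(127);                      /* exec failed */
      }
      int status;
      waitpid(child, &status, 0);
      return (WIFEXITED(status) && WEXITSTATUS(status) == 0) ? 0 : -1;
  }

  int main(int argc, char **argv)
  {
      if (argc != 3) {
          fprintf(stderr, "usage: %s <pid> <context-file>\n", argv[0]);
          return 1;
      }
      return checkpoint_with_libs((pid_t) atoi(argv[1]), argv[2]) ? 1 : 0;
  }

The context file gets bigger, but the restart is then less dependent
on the libraries installed on the restarting node.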
-Paul
Leonardo Fialho wrote:
Hi All,
I'm trying to implement my FT architecture in Open MPI. Right now I
need to restart a faulty process from a checkpoint. I saw that Josh
uses orte-restart, which calls opal-restart through an ordinary
mpirun call. That is not good for me, because in that case the
restarted process ends up in a new job. I need to restart the process
from its checkpoint in the same job, on another node, under an
existing orted. The checkpoints are taken without the "--term" option.
My modified orted receives a "restart request" from my modified
heartbeat mechanism. I first tried to restart using the BLCR
cr_restart command. It does not work, I think because
stdin/stdout/stderr were not handled by the opal environment. So, I
tried to restart the checkpoint by forking the orted and doing an
execvp to opal-restart. It recovers the checkpoint, but after
"opal_cr_init" it dies (*** Process received signal ***).
The job structure (from ompi-ps) after the fault is as follows:
Process Name | ORTE Name    | Local Rank | PID   | Node   | State   | HB Dest.     |
------------------------------------------------------------------------------------
orterun      | [[8002,0],0] | 65535      | 30434 | aoclsb | Running |              |
orted        | [[8002,0],1] | 65535      | 30435 | nodo1  | Running | [[8002,0],3] |
orted        | [[8002,0],2] | 65535      | 30438 | nodo2  | Faulty  | [[8002,0],3] |
orted        | [[8002,0],3] | 65535      | 30441 | nodo3  | Running | [[8002,0],4] |
orted        | [[8002,0],4] | 65535      | 30444 | nodo4  | Running | [[8002,0],1] |
Process Name | ORTE Name    | Local Rank | PID  | Node  | State     | Ckpt State | Ckpt Loc     | Protector    |
-----------------------------------------------------------------------------------------------------------------
./ping/wait  | [[8002,1],0] | 0          | 9069 | nodo1 | Running   | Finished   | /tmp/radic/0 | [[8002,0],2] |
./ping/wait  | [[8002,1],1] | 0          | 6086 | nodo2 | Restoring | Finished   | /tmp/radic/1 | [[8002,0],3] |
./ping/wait  | [[8002,1],2] | 0          | 5864 | nodo3 | Running   | Finished   | /tmp/radic/2 | [[8002,0],4] |
./ping/wait  | [[8002,1],3] | 0          | 7405 | nodo4 | Running   | Finished   | /tmp/radic/3 | [[8002,0],1] |
The orted running on "nodo2" dies. This was detected by the orted
[[8002,0],1] running on "nodo1" and reported to the HNP. The HNP
updates the procs structure, looks for processes that were running on
the faulty node, and sends a restart request to the orted which holds
the checkpoints of the faulty processes.
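Just to make that decision logic explicit, the idea is the following
(the types and the send_restore_request() helper below are
hypothetical names for this sketch, not the real ORTE structures; the
data comes from the ompi-ps tables above):

  /* Hypothetical sketch of the HNP-side recovery decision; names do
   * not correspond to real ORTE data structures or functions. */
  #include <stdio.h>
  #include <stddef.h>

  /* Only the {job, vpid} part of an ORTE name [[8002,job],vpid]. */
  typedef struct { int jobid, vpid; } proc_name_t;

  typedef struct {
      proc_name_t name;          /* application process              */
      proc_name_t node_daemon;   /* orted it was running under       */
      proc_name_t protector;     /* orted holding its checkpoint     */
      const char *ckpt_loc;      /* e.g. /tmp/radic/1                */
  } proc_entry_t;

  /* Stub for the sketch: in reality this is an RML message. */
  static void send_restore_request(proc_name_t protector,
                                   const proc_entry_t *p)
  {
      printf("restore app vpid %d from %s via protector daemon vpid %d\n",
             p->name.vpid, p->ckpt_loc, protector.vpid);
  }

  static void handle_daemon_failure(proc_entry_t *procs, size_t n,
                                    proc_name_t faulty)
  {
      for (size_t i = 0; i < n; i++) {
          if (procs[i].node_daemon.jobid == faulty.jobid &&
              procs[i].node_daemon.vpid  == faulty.vpid) {
              /* This process was on the dead node: ask its protector,
               * which stores the checkpoint, to restart it. */
              send_restore_request(procs[i].protector, &procs[i]);
          }
      }
  }

  int main(void)
  {
      proc_entry_t procs[] = {
          /* name    daemon   protector  ckpt dir        */
          { {1, 0},  {0, 1},  {0, 2},    "/tmp/radic/0" },
          { {1, 1},  {0, 2},  {0, 3},    "/tmp/radic/1" },
          { {1, 2},  {0, 3},  {0, 4},    "/tmp/radic/2" },
          { {1, 3},  {0, 4},  {0, 1},    "/tmp/radic/3" },
      };
      proc_name_t faulty = { 0, 2 };   /* the orted on nodo2 */
      handle_daemon_failure(procs, 4, faulty);
      return 0;
  }

Running this on the table above selects [[8002,1],1] and its protector
[[8002,0],3], which matches the log that follows.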
Below is the log generated:
[aoclsb:30434] [[8002,0],0] orted_recv: update state request from [[8002,0],3]
[aoclsb:30434] [[8002,0],0] orted_update_state: updating state (17) for orted process (vpid=2)
[aoclsb:30434] [[8002,0],0] orted_update_state: found process [[8002,1],1] on node nodo2, requesting recovery task for that
[aoclsb:30434] [[8002,0],0] orted_update_state: sending restore ([[8002,1],1] process) request to [[8002,0],3]
[nodo3:05841] [[8002,0],3] orted_recv: restore checkpoint request from [[8002,0],0]
[nodo3:05841] [[8002,0],3] orted_restore_checkpoint: restarting process from checkpoint file (/tmp/radic/1/ompi_blcr_context.6086)
[nodo3:05841] [[8002,0],3] orted_restore_checkpoint: executing restart (opal-restart -mca crs_base_snapshot_dir /tmp/radic/1 .)
[nodo3:05924] opal_cr: init: Verbose Level: 1024
[nodo3:05924] opal_cr: init: FT Enabled: 1
[nodo3:05924] opal_cr: init: Is a tool program: 1
[nodo3:05924] opal_cr: init: Checkpoint Signal: 10
[nodo3:05924] opal_cr: init: Debug SIGPIPE: 0 (False)
[nodo3:05924] opal_cr: init: Temp Directory: /tmp
[nodo2:05965] *** Process received signal ***
The orted which receives the restart request forks and calls execvp
for opal-restart, and then, unfortunately, the restarted process
dies. I know that the restarted process should generate errors
because the URI of its daemon is incorrect, as are all the other
environment variables, but that should produce a communication error,
or some other kind of error, not a process kill. My question is:
1) Why does this process die? I suspect that the checkpoint has
pointers to libraries which are not loaded, or are loaded at a
different memory position (because this checkpoint comes from another
node). In that case the error should be a "segmentation fault" or
something like that, no?
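To check this, I could make the parent orted wait for the forked
opal-restart and print which signal terminated it (plain POSIX,
continuing the launch sketch above):

  /* Sketch: report how the opal-restart child ended, to confirm
   * whether it was SIGSEGV, SIGILL or SIGBUS. */
  #include <stdio.h>
  #include <string.h>
  #include <sys/types.h>
  #include <sys/wait.h>

  static void report_child_exit(pid_t child)
  {
      int status;
      if (waitpid(child, &status, 0) < 0) {
          perror("waitpid");
          return;
      }
      if (WIFSIGNALED(status)) {
          fprintf(stderr, "opal-restart killed by signal %d (%s)\n",
                  WTERMSIG(status), strsignal(WTERMSIG(status)));
      } else if (WIFEXITED(status)) {
          fprintf(stderr, "opal-restart exited with status %d\n",
                  WEXITSTATUS(status));
      }
  }

If it really is a mismatched-library crash, the signal should show up
here (keeping in mind that Open MPI's own handler may re-raise the
signal after printing the "*** Process received signal ***" message).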
If somebody has some information or can give me some help with this
error, I'll be grateful.
Thanks,
--
Leonardo Fialho
Computer Architecture and Operating Systems Department - CAOS
Universidad Autonoma de Barcelona - UAB
ETSE, Edifcio Q, QC/3088
http://www.caos.uab.es
Phone: +34-93-581-2888
Fax: +34-93-581-2478
--
Leonardo Fialho
Computer Architecture and Operating Systems Department - CAOS
Universidad Autonoma de Barcelona - UAB
ETSE, Edifcio Q, QC/3088
http://www.caos.uab.es
Phone: +34-93-581-2888
Fax: +34-93-581-2478