The ompi-checkpoint command connects to the Global Coordinator in the SnapC 'full' component. The 'full' component places the Global Coordinator in the HNP (mpirun/orterun). As a result, to start a checkpoint, ompi-checkpoint must connect to the HNP.

From the user's standpoint, ompi-checkpoint is typically run on the same machine where mpirun was started. So it made the most sense to have these two connect to each other, especially if we ask the user to provide the PID of the mpirun process to checkpoint.

That being said, with the proper changes to 'full' (or with a new SnapC component), ompi-checkpoint could issue the checkpoint request to any process in the MPI job [orterun, orted, application processes] and have the correct things happen.
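
To illustrate the idea, here is a toy sketch in Python. It is not Open MPI code: the Process class and request_checkpoint method are hypothetical names made up for this example. It only models a request that may arrive at any process in the job and is relayed to the HNP, where the Global Coordinator lives.

```python
from dataclasses import dataclass, field


@dataclass
class Process:
    """Toy stand-in for a process in the job (orterun, orted, or application)."""
    name: str
    is_hnp: bool = False
    hnp: "Process | None" = None  # route to the HNP; None on the HNP itself
    log: list = field(default_factory=list)

    def request_checkpoint(self) -> str:
        """Return the name of the process that ends up handling the request."""
        if self.is_hnp:
            # Only the HNP hosts the global coordinator in this model.
            self.log.append("global coordinator: start checkpoint")
            return self.name
        if self.hnp is None:
            raise RuntimeError(f"{self.name} has no route to the HNP")
        # Any other process relays the request instead of handling it locally.
        self.log.append("forward request to HNP")
        return self.hnp.request_checkpoint()


hnp = Process("orterun", is_hnp=True)
orted = Process("orted-1", hnp=hnp)
app = Process("app-rank-0", hnp=hnp)

# The request enters at an application process but is handled by the HNP.
handled_by = app.request_checkpoint()
```

In the current 'full' component only the direct path exists (ompi-checkpoint to HNP); the relay step is the part that would need to be added.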

I have received one request for this functionality, but have not had the time yet to dig into it.

Does that help?

Cheers,
Josh

On Jan 31, 2008, at 9:51 AM, Leonardo Fialho wrote:

Hi all (and Josh),

Why does ompi-checkpoint have to contact the HNP specifically? If I use
another process to start the snapshot coordinator, it apparently
works fine, no?

PS: I prefer to send this message to the list, to keep it in the
history for future reference.

--
Leonardo Fialho
Computer Architecture and Operating Systems Department - CAOS
Universidad Autonoma de Barcelona - UAB
ETSE, Edifcio Q, QC/3088
http://www.caos.uab.es
Phone: +34-93-581-2888
Fax: +34-93-581-2478


