Just replied to your other email before seeing this. Take a look at those
comments and let me know if that helps differentiate those interfaces.
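
For what it's worth, the two ways of emerging from opal_crs.checkpoint()
described in the quoted messages below can be sketched roughly as follows.
This is only an illustration under stated assumptions, not the actual Open MPI
code: every sketch_* name is hypothetical, the real enum lives in
opal/mca/crs/crs.h and may look different, and the "outcome" argument merely
simulates what a real checkpointer would discover on its own.

```c
/* Hypothetical stand-ins for the OPAL_CRS_* states discussed in this thread.
 * Assumption: the real definitions in opal/mca/crs/crs.h may differ. */
typedef enum {
    SKETCH_CRS_CONTINUE,  /* returned in the original process */
    SKETCH_CRS_RESTART,   /* returned in a process rebuilt from a checkpoint */
    SKETCH_CRS_ERROR      /* the checkpoint attempt failed */
} sketch_crs_state_t;

/* Hypothetical input that simulates how the call came back. A real
 * checkpointer (e.g. BLCR) determines this itself. */
typedef enum {
    SKETCH_SAVED_OK,            /* state captured, same process keeps running */
    SKETCH_RESUMED_FROM_IMAGE,  /* execution resumed from the saved state */
    SKETCH_FAILED               /* capture failed */
} sketch_outcome_t;

/* Toy checkpoint(): one call site, but the state it sets on the way out
 * tells the caller which of the scenarios it emerged in. */
static int sketch_checkpoint(sketch_outcome_t outcome,
                             sketch_crs_state_t *state)
{
    switch (outcome) {
    case SKETCH_SAVED_OK:
        *state = SKETCH_CRS_CONTINUE;
        return 0;
    case SKETCH_RESUMED_FROM_IMAGE:
        *state = SKETCH_CRS_RESTART;
        return 0;
    default:
        *state = SKETCH_CRS_ERROR;
        return -1;
    }
}

/* What the caller does next depends only on the state that was set. */
static const char *sketch_after_checkpoint(sketch_crs_state_t state)
{
    switch (state) {
    case SKETCH_CRS_CONTINUE:
        return "continue";  /* carry on in the original process */
    case SKETCH_CRS_RESTART:
        return "restart";   /* re-initialize in the restarted process */
    default:
        return "error";
    }
}
```

A caller would invoke sketch_checkpoint() once and then branch on the state,
which mirrors the idea that CONTINUE and RESTART are set by the checkpointer
rather than chosen by the caller.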


On Tue, Feb 18, 2014 at 5:28 AM, Jeff Squyres (jsquyres) <jsquy...@cisco.com> wrote:

> opal_crs.checkpoint() is not used to restart the process, but it does
> return in two different cases:
>
> - in the "continue" case, opal_crs.checkpoint() returns in the original
> process and keeps executing the same process and then, IIRC, invokes
> opal_crs.continue().
>
> - in the "restart" case, opal_crs.checkpoint() returns into a new process
> and then, IIRC, invokes opal_crs.restart().
>
>
> On Feb 18, 2014, at 5:29 AM, Adrian Reber <adr...@lisas.de> wrote:
>
> > I should have read this email before answering the other.
> >
> > So opal_crs.checkpoint() is used to checkpoint the process as well as
> > restart the process? I would have expected opal_crs.restart() is used
> > for restart. I am confused. Looking at CRS/BLCR, checkpoint() seems to
> > only checkpoint and restart() seems to only restart. The comment in
> > opal/mca/crs/crs.h says the same as you say.
> >
> >
> > On Mon, Feb 17, 2014 at 03:43:08PM -0600, Josh Hursey wrote:
> >> These values indicate the current state of the checkpointing lifecycle.
> >> In particular CONTINUE/RESTART are set by the checkpointer in the CRS
> >> (all others are used by the INC mechanism). In the opal_crs.checkpoint()
> >> call the checkpointer will capture the program state and it is possible
> >> to emerge from this function in one of two scenarios. Either we are
> >> continuing execution in the original process (Continue state), or we are
> >> resuming execution from a checkpointed state (Restart state).
> >>
> >> So if the checkpoint was successful, and you are not restarting the
> >> process then you want OPAL_CRS_CONTINUE.
> >>
> >> If the process is being restarted from a checkpoint file, then we should
> >> emerge from this function setting the state to OPAL_CRS_RESTART.
> >>
> >> The OPAL_CR_CHECKPOINT state is used in the INC mechanism to notify all
> >> of the components to prepare for checkpoint (we probably should have
> >> called it OPAL_CR_PREPARE_FOR_CKPT). So not really used by the CRS
> >> mechanisms at all. You can see it used in the opal_cr_inc_core_prep()
> >> function in opal/runtime/opal_cr.c
> >>
> >> -- Josh
> >>
> >>
> >>
> >> On Mon, Feb 17, 2014 at 9:28 AM, Adrian Reber <adr...@lisas.de> wrote:
> >>
> >>> This is probably for Josh. What is the meaning of the OPAL_CRS_* enums?
> >>>
> >>> They are probably used to communicate the state of the CRS modules.
> >>> OPAL_CRS_ERROR seems to be used in case an error happened. What is the
> >>> CRS module supposed to set this to if the checkpoint was successful?
> >>>
> >>> OPAL_CRS_CONTINUE or OPAL_CRS_CHECKPOINT?
> >>>
> >>>                Adrian
> >>> _______________________________________________
> >>> devel mailing list
> >>> de...@open-mpi.org
> >>> http://www.open-mpi.org/mailman/listinfo.cgi/devel
> >>>
> >>
> >>
> >>
> >> --
> >> Joshua Hursey
> >> Assistant Professor of Computer Science
> >> University of Wisconsin-La Crosse
> >> http://cs.uwlax.edu/~jjhursey
> >
>
>
> --
> Jeff Squyres
> jsquy...@cisco.com
> For corporate legal information go to:
> http://www.cisco.com/web/about/doing_business/legal/cri/
>
>



-- 
Joshua Hursey
Assistant Professor of Computer Science
University of Wisconsin-La Crosse
http://cs.uwlax.edu/~jjhursey
