opal_crs.checkpoint() is not used to restart the process, but it does return in 
two different cases:

- in the "continue" case, opal_crs.checkpoint() returns in the original 
process, which keeps executing and then, IIRC, invokes opal_crs.continue().

- in the "restart" case, opal_crs.checkpoint() returns into a new process and 
then, IIRC, invokes opal_crs.restart().
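
To make the "one call, two ways of returning" idea concrete, here is a rough
analogy in plain C using setjmp()/longjmp(). It is only a sketch: it is not how
Open MPI or BLCR implement checkpointing, and every demo_* name is hypothetical.
Here setjmp() stands in for the checkpoint call and longjmp() stands in for
resuming from the saved image.

```c
#include <setjmp.h>

/* Hypothetical state names, loosely modeled on the CONTINUE/RESTART
 * states discussed in this thread; not the real OPAL enum. */
typedef enum { DEMO_CONTINUE, DEMO_RESTART } demo_state_t;

static jmp_buf demo_image;            /* stands in for the checkpoint image */
static int demo_continue_calls = 0;   /* how often the continue hook ran */
static int demo_restart_calls  = 0;   /* how often the restart hook ran */

static void demo_continue_hook(void) { demo_continue_calls++; }
static void demo_restart_hook(void)  { demo_restart_calls++; }

static void demo_run(void)
{
    /* volatile: these must keep their values across longjmp() */
    volatile int resumed = 0;
    volatile demo_state_t state;

    /* "checkpoint": returns 0 on the direct call (continue case) and
     * non-zero when we come back via longjmp() (restart case). */
    if (setjmp(demo_image) == 0) {
        state = DEMO_CONTINUE;
    } else {
        state = DEMO_RESTART;
    }

    /* The caller looks at the state it emerged with and runs the
     * matching hook. */
    if (state == DEMO_CONTINUE) {
        demo_continue_hook();
    } else {
        demo_restart_hook();
    }

    /* Simulate a later restart from the saved image, exactly once. */
    if (!resumed) {
        resumed = 1;
        longjmp(demo_image, 1);
    }
}
```

Calling demo_run() once from main() passes through the code after the
"checkpoint" twice: first in the continue case, then in the restart case. The
real thing differs in that the restart case comes up in a new process.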


On Feb 18, 2014, at 5:29 AM, Adrian Reber <adr...@lisas.de> wrote:

> I should have read this email before answering the other.
> 
> So opal_crs.checkpoint() is used to checkpoint the process as well as
> restart the process? I would have expected opal_crs.restart() is used
> for restart. I am confused. Looking at CRS/BLCR checkpoint() seems to
> only checkpoint and restart() seems to only restart. The comment in
> opal/mca/crs/crs.h says the same as you say.
> 
> 
> On Mon, Feb 17, 2014 at 03:43:08PM -0600, Josh Hursey wrote:
>> These values indicate the current state of the checkpointing lifecycle. In
>> particular CONTINUE/RESTART are set by the checkpointer in the CRS (all
>> others are used by the INC mechanism). In the opal_crs.checkpoint() call
>> the checkpointer will capture the program state and it is possible to
>> emerge from this function in one of two scenarios. Either we are continuing
>> execution in the original process (Continue state), or we are resuming
>> execution from a checkpointed state (Restart state).
>> 
>> So if the checkpoint was successful, and you are not restarting the process
>> then you want OPAL_CRS_CONTINUE.
>> 
>> If the process is being restarted from a checkpoint file, then we should
>> emerge from this function setting the state to OPAL_CRS_RESTART.
>> 
>> The OPAL_CRS_CHECKPOINT state is used in the INC mechanism to notify all of
>> the components to prepare for checkpoint (we probably should have called it
>> OPAL_CR_PREPARE_FOR_CKPT). So not really used by the CRS mechanisms at all.
>> You can see it used in the opal_cr_inc_core_prep() function in
>> opal/runtime/opal_cr.c
>> 
>> -- Josh
>> 
>> 
>> 
>> On Mon, Feb 17, 2014 at 9:28 AM, Adrian Reber <adr...@lisas.de> wrote:
>> 
>>> This is probably for Josh. What is the meaning of the OPAL_CRS_* enums?
>>> 
>>> They are probably used to communicate the state of the CRS modules.
>>> OPAL_CRS_ERROR seems to be used in case an error happened. What is the
>>> CRS module supposed to set this to if the checkpoint was successful?
>>> 
>>> OPAL_CRS_CONTINUE or OPAL_CRS_CHECKPOINT?
>>> 
>>>                Adrian
>>> _______________________________________________
>>> devel mailing list
>>> de...@open-mpi.org
>>> http://www.open-mpi.org/mailman/listinfo.cgi/devel
>>> 
>> 
>> 
>> 
>> -- 
>> Joshua Hursey
>> Assistant Professor of Computer Science
>> University of Wisconsin-La Crosse
>> http://cs.uwlax.edu/~jjhursey
> 


-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: 
http://www.cisco.com/web/about/doing_business/legal/cri/
