Re: [caiman-discuss] Checkpoint DOC node proposal

Darren Kenny Wed, 26 May 2010 11:23:30 -0700

On 05/26/10 07:12 PM, jean.mccormack wrote:
> OK. So now how as a checkpoint do I get my information? I would like 
> something similar to what the DOC provided.
> Some object that I define that will hold the information? That object 
> handle would then be passed to me? Does that
> idea work with pause/resume? Any other thoughts?


I think the registration of a Checkpoint could still provide information into
the DOC, so that arguments are available as you describe. What I'm saying is
that the Engine won't rely on the data in there. In effect, it would most
likely clear that data after doing a resume, although the Engine would have to
be the one to put the information in the DOC so that the checkpoints
themselves can find it.

So we're in effect back to the notion of having a CheckpointData (or
CheckpointArgs) object to contain these arguments - but this doesn't in any way
imply the order in which the Engine would execute them.
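
As a rough sketch of what I mean, and not a proposed implementation: every
name below (DataObjectCache, Engine, register_checkpoint, resume, the
"checkpoint_args" key) is hypothetical and only illustrates the idea. The
Engine keeps its own list of registered checkpoints, so a roll-back of the DOC
cannot clobber it, and it re-publishes the CheckpointData objects into the DOC
after the roll-back so that the checkpoints can find their arguments.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Tuple


@dataclass
class CheckpointData:
    """Hypothetical container for one checkpoint's arguments."""
    name: str
    args: Tuple[Any, ...] = ()
    kwargs: Dict[str, Any] = field(default_factory=dict)


class DataObjectCache:
    """Toy stand-in for the DOC: a dict that can be snapshotted and rolled back."""

    def __init__(self) -> None:
        self.data: Dict[str, Any] = {}

    def snapshot(self) -> Dict[str, Any]:
        return dict(self.data)

    def rollback(self, snapshot: Dict[str, Any]) -> None:
        # A roll-back discards everything stored since the snapshot was taken.
        self.data = dict(snapshot)


class Engine:
    """Hypothetical engine that does not rely on the DOC for its own state."""

    def __init__(self, doc: DataObjectCache) -> None:
        self.doc = doc
        # Registration order is held here, outside the DOC.
        self._registered: List[CheckpointData] = []

    def register_checkpoint(self, name: str, *args: Any, **kwargs: Any) -> None:
        self._registered.append(CheckpointData(name, args, kwargs))

    def resume(self, snapshot: Dict[str, Any]) -> None:
        self.doc.rollback(snapshot)
        # Replace any stale checkpoint data restored from the snapshot with
        # the arguments registered during this invocation.
        self.doc.data["checkpoint_args"] = {cp.name: cp for cp in self._registered}

    def execution_order(self) -> List[str]:
        return [cp.name for cp in self._registered]


# Snapshot left behind by an earlier run, including stale checkpoint data.
old_snapshot = {"target": "disk0", "checkpoint_args": {"c": CheckpointData("c")}}

engine = Engine(DataObjectCache())
for name in ("a", "b", "b1"):
    engine.register_checkpoint(name)
engine.resume(old_snapshot)
# The engine still knows a, b, b1; the DOC holds their arguments, not c's.
```

Note that the dictionary placed in the DOC is only a lookup for arguments; the
order of execution comes from the Engine's own list.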

Does that make sense?

Darren.

> 
> Jean
> 
> On 05/26/10 12:08 PM, Darren Kenny wrote:
>> On 05/26/10 06:34 PM, Karen Tung wrote:
>>    
>>> Hi Darren,
>>>
>>> Please see my response inline.
>>>
>>> On 05/26/10 04:20 AM, Darren Kenny wrote:
>>>      
>>>> Hi Karen,
>>>>
>>>> On 05/25/10 09:04 PM, Karen Tung wrote:
>>>>
>>>>        
>>>>> Hi Darren,
>>>>>
>>>>> Thanks for the detailed explanation.  I understand
>>>>> we are proposing to have all checkpoint information
>>>>> be stored in the DOC at all times.
>>>>> This proposal does not work for supporting
>>>>> resume in the engine.
>>>>>
>>>>> The proposal calls for the application/engine to create
>>>>> these checkpoint nodes for storing all the checkpoint
>>>>> information, and store them in DOC upon registration time.
>>>>> Then, at execute time, engine will get the list of checkpoints
>>>>> from the DOC and do its work.  This will work for the
>>>>> non-resume case.  However, it will not work for the resume case.
>>>>>
>>>>> Below is an example of a resume case that does not work.  This assumes
>>>>> the application registers the checkpoints via the engine,
>>>>> and engine creates the checkpoint nodes and stores the
>>>>> information there immediately upon registration.  The engine
>>>>> will not keep a copy of what's already stored in the DOC since
>>>>> it can just "look it up" when it needs the info.
>>>>>
>>>>> * First invocation of application:
>>>>>
>>>>> - application registers checkpoints a, b, c, d, e; these checkpoints
>>>>> are stored in the DOC immediately.
>>>>> - application runs all the checkpoints successfully.  The "completed"
>>>>> flag in the checkpoint is set.  This implies that
>>>>> all the checkpoints are resumable.
>>>>> - Application exits
>>>>>
>>>>> * 2nd invocation of application:
>>>>> - application registers checkpoints a, b, b1, b2, b3, b4; these
>>>>> checkpoints are stored in the DOC immediately.
>>>>> - application calls engine.restore(latest-snapshot-from-previous-run).
>>>>> - At this time, the DOC is restored back to the state it was in when
>>>>> the first invocation of the application ended.  Information about all
>>>>> the checkpoints registered during this invocation of the application
>>>>> is now lost!
>>>>>
>>>>>          
>>>> I'm fairly sure that I mentioned that the restore needs to be done
>>>> *first*, and then the application registers the checkpoints - in this
>>>> case it would have to be done as a merge, i.e. the application inserts
>>>> the new checkpoints where it wants them to be.
>>>>
>>>>        
>>> The restore can not be done *first*.  In order to resume,
>>> the application specifies which checkpoint it wants to resume
>>> to.  Without first registering the checkpoints, how would the engine
>>> know which checkpoints exist?  Therefore, the checkpoint registration
>>> has to be done before the restore, and the checkpoint information
>>> can not be stored in the DOC, since the rollback might clobber it.
>>>
>>>      
>>>> It would *never* work if you loaded a snapshot *after* putting anything
>>>> into it - it's a roll-back to that snapshot, which automatically implies
>>>> that any update done since that snapshot was taken will be lost.
>>>>
>>>>
>>>>        
>>> I totally agree that the rollback will destroy the existing content of
>>> the DOC.  Therefore, the checkpoint information can not be stored
>>> in the DOC after it is registered.  In addition, in my proposed
>>> design of the engine, I specify that before the engine rolls
>>> back the DOC, it makes sure that it is empty.  If not, it will raise
>>> an exception.
>>>
>>> Thanks,
>>>
>>> --Karen
>>>
>>>
>>>      
>> Based on Sarah's response, the need to use the DOC for storing checkpoints
>> is no longer present - especially if there is no need to generate a DC
>> manifest.
>>
>> And as you pointed out, this doesn't seem to be working for checkpoints given
>> the need to know the checkpoints before it can load a snapshot to resume.
>>
>> Given that, it would seem to make sense that the Engine would remain as
>> independent of the DOC as possible for general operation to avoid such issues
>> with snapshot and roll-back.
>>
>> Thanks,
>>
>> Darren.
>>    
> 
_______________________________________________
caiman-discuss mailing list
[email protected]
http://mail.opensolaris.org/mailman/listinfo/caiman-discuss
