Re: [caiman-discuss] Checkpoint DOC node proposal

jean.mccormack Wed, 26 May 2010 11:27:50 -0700

On 05/26/10 12:22 PM, Darren Kenny wrote:

On 05/26/10 07:12 PM, jean.mccormack wrote:

OK. So now, as a checkpoint, how do I get my information? I would like
something similar to what the DOC provided.
Some object that I define that will hold the information? Would a handle
to that object then be passed to me? Does that idea work with
pause/resume? Any other thoughts?

I think that it would still be possible for the registration of the Checkpoint
to provide information into the DOC, to provide arguments as you describe, but
what I'm saying is that the Engine won't be relying on the data in there - in
effect it would most likely clear it after doing a resume, although it would
have to be the one to put the information in the DOC so that checkpoints
themselves can find it.


So we're in effect back to the notion of having a CheckpointData (or
CheckpointArgs) object to contain these arguments - but this doesn't in any way
imply the order in which the Engine would execute them.


I think so. So we still have this (not meant to be valid XML):

<checkpoint>
        name
        other args specific to the checkpoint.
</checkpoint>

Each checkpoint would have one of these in the DOC, and the checkpoints know
which one is theirs via the name. Not really a change for me.
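
A minimal sketch of that lookup by name, in Python. The CheckpointData
class, the find_my_data helper, and the checkpoint names and arguments are
all hypothetical and only illustrate the idea; none of them come from the
actual engine or DOC code.

```python
from dataclasses import dataclass, field


@dataclass
class CheckpointData:
    """Hypothetical holder for one checkpoint's name and its arguments."""
    name: str
    args: dict = field(default_factory=dict)


def find_my_data(doc_nodes, my_name):
    # A checkpoint identifies its own node purely by name.
    for node in doc_nodes:
        if node.name == my_name:
            return node
    return None


# Stand-in for the nodes held in the DOC (names and args are made up).
doc_nodes = [
    CheckpointData("target-discovery", {"disk": "c0t0d0"}),
    CheckpointData("transfer", {"source": "/mnt/image"}),
]
mine = find_my_data(doc_nodes, "transfer")
```

The position of a node in doc_nodes carries no meaning here, which matches
the point that these objects do not imply execution order.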

Jean

Does that make sense?

Darren.

Jean

On 05/26/10 12:08 PM, Darren Kenny wrote:

On 05/26/10 06:34 PM, Karen Tung wrote:

Hi Darren,

Please see my response inline.

On 05/26/10 04:20 AM, Darren Kenny wrote:

Hi Karen,

On 05/25/10 09:04 PM, Karen Tung wrote:

Hi Darren,

Thanks for the detailed explanation.  I understand
we are proposing to have all checkpoint information
be stored in the DOC at all times.
This proposal does not work for supporting
resume in the engine.

The proposal calls for the application/engine to create
these checkpoint nodes for storing all the checkpoint
information, and store them in DOC upon registration time.
Then, at execute time, the engine will get the list of checkpoints
from the DOC and do its work.  This will work for the
non-resume case.  However, it will not work for the resume case.

Below is an example of a resume case that does not work.  This assumes
the application registers the checkpoints via the engine,
and engine creates the checkpoint nodes and stores the
information there immediately upon registration.  The engine
will not keep a copy of what's already stored in the DOC since
it can just "look it up" when it needs the info.

* First invocation of application:

- application registers checkpoints a, b, c, d, e, these checkpoints are
stored in the DOC immediately.
- application runs all the checkpoints successfully.  The "completed"
flag in the checkpoint is set.  This implies that
all the checkpoints are resumable.
- Application exits

* 2nd invocation of application:
- application registers checkpoints a, b, b1, b2, b3, b4, these checkpoints
are stored in the DOC immediately.
- application calls engine.restore(latest-snapshot-from-previous-run).
- At this time, the DOC is restored back to the state when the first
invocation of the application ends.  Information about all the checkpoints
registered during this invocation of the application is now lost!
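
The two invocations above can be reproduced with a toy model. This is only
a sketch: ToyDOC, register, and the "checkpoints" key are hypothetical
stand-ins, and it assumes a rollback replaces the whole DOC content with
the snapshot.

```python
import copy


class ToyDOC:
    """Hypothetical stand-in for the DOC: a dict that supports snapshots."""

    def __init__(self):
        self.data = {}

    def snapshot(self):
        # Deep copy so later changes do not leak into the snapshot.
        return copy.deepcopy(self.data)

    def rollback(self, snap):
        # Rollback replaces the whole content, discarding anything newer.
        self.data = copy.deepcopy(snap)


def register(doc, names):
    # Registration that stores the checkpoint list directly in the DOC.
    doc.data["checkpoints"] = list(names)


# First invocation: register a..e, run them, snapshot at exit.
doc = ToyDOC()
register(doc, ["a", "b", "c", "d", "e"])
snap = doc.snapshot()

# Second invocation: register a different list, then restore the snapshot.
doc = ToyDOC()
register(doc, ["a", "b", "b1", "b2", "b3", "b4"])
doc.rollback(snap)

after_restore = doc.data["checkpoints"]
print(after_restore)
```

After the rollback the DOC holds the first invocation's list again, so
b1 through b4 are gone.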

I'm fairly sure that I mentioned that the restore needs to be done *first*, and
then the application registers the checkpoints - in this case it would have to
be done as a merge, i.e. the application inserts the new checkpoints where it
wants them to be.

The restore cannot be done *first*.  In order to resume,
the application specifies which checkpoint it wants to resume
to.  Without first registering the checkpoints, how would the engine
know which checkpoints exist?  Therefore, the checkpoint registration
has to be done before the restore, and the checkpoint information
cannot be stored in the DOC, since the rollback might clobber it.

It would *never* work if you loaded a snapshot *after* putting anything into it:
it's a roll-back to that snapshot, which automatically implies that any update
done since that snapshot was taken will be lost.

I totally agree that the rollback will destroy the existing content of
the DOC.  Therefore, the checkpoint information can not be stored
in the DOC after it is registered.  In addition, in my proposed
design of the engine, I specify that before the engine rolls
back the DOC, it makes sure that it is empty.  If not, it will raise
an exception.
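
A rough sketch of that behaviour, not the actual engine design: the
registration list lives in the engine rather than in the DOC, and the
rollback is refused when the DOC is not empty. SketchEngine,
DOCNotEmptyError, and resume_from are hypothetical names, and the sketch
assumes the snapshot for a checkpoint is taken after it completes, so
execution continues with the checkpoints that follow it.

```python
class DOCNotEmptyError(Exception):
    """Hypothetical error raised when a rollback would clobber DOC content."""


class SketchEngine:
    def __init__(self):
        self.doc = {}          # stand-in for the DOC
        self.checkpoints = []  # registration lives in the engine, not the DOC

    def register_checkpoint(self, name):
        self.checkpoints.append(name)

    def resume_from(self, name, snapshots):
        # The engine must already know the checkpoint in order to resume to it.
        if name not in self.checkpoints:
            raise ValueError("unknown checkpoint: " + name)
        # Refuse to roll back over existing DOC content.
        if self.doc:
            raise DOCNotEmptyError("DOC must be empty before rollback")
        self.doc = dict(snapshots[name])
        # The checkpoints after the resume point are the ones left to run.
        idx = self.checkpoints.index(name)
        return self.checkpoints[idx + 1:]


engine = SketchEngine()
for cp in ["a", "b", "b1", "b2"]:
    engine.register_checkpoint(cp)

# Snapshot content is made up for illustration.
remaining = engine.resume_from("b", {"b": {"target": "disk0"}})
```

Because the list is held outside the DOC, the rollback leaves the newly
registered checkpoints intact.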

Thanks,

--Karen

Based on Sarah's response, the need to use the DOC for storing checkpoints is no
longer present - especially if there is no need to generate a DC manifest.

And as you pointed out, this doesn't seem to be working for checkpoints given
the need to know the checkpoints before it can load a snapshot to resume.

Given that, it would seem to make sense that the Engine would remain as
independent of the DOC as possible for general operation to avoid such issues
with snapshot and roll-back.

Thanks,

Darren.


_______________________________________________
caiman-discuss mailing list
[email protected]
http://mail.opensolaris.org/mailman/listinfo/caiman-discuss
