Re: [caiman-discuss] Checkpoint DOC node proposal

Karen Tung Wed, 26 May 2010 10:35:51 -0700

Hi Darren,

Please see my response inline.


On 05/26/10 04:20 AM, Darren Kenny wrote:

Hi Karen,

On 05/25/10 09:04 PM, Karen Tung wrote:

Hi Darren,

Thanks for the detailed explanation.  I understand
we are proposing to have all checkpoint information
be stored in the DOC at all times.
This proposal does not work for supporting
resume in the engine.

The proposal calls for the application/engine to create
these checkpoint nodes for storing all the checkpoint
information, and store them in DOC upon registration time.
Then, at execute time, engine will get the list of checkpoints
from the DOC and do its work.  This will work for the
non-resume case.  However, it will not work for the resume case.

Below is an example of a resume case that does not work.  This assumes
the application registers the checkpoints via the engine,
and engine creates the checkpoint nodes and stores the
information there immediately upon registration.  The engine
will not keep a copy of what's already stored in the DOC since
it can just "look it up" when it needs the info.

* First invocation of application:

- application registers checkpoints a, b, c, d, e, these checkpoints are
stored in the DOC immediately.
- application runs all the checkpoints successfully.  The "completed"
flag in the checkpoint is set.  This implies that
all the checkpoints are resumable.
- Application exits

* 2nd invocation of application:
- application registers checkpoints a, b, b1, b2, b3, b4, these checkpoints
are stored in the DOC immediately.
- application calls engine.restore(latest-snapshot-from-previous-run).
- At this time, the DOC is restored back to the state when the first
invocation of the application ends.  Information about all the checkpoints
registered during this invocation of the application is now lost!
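
The two invocations above can be sketched in a few lines of Python.  This is
only an illustration of the failure mode: the `Doc` class and its
`snapshot`/`rollback` methods are hypothetical stand-ins, not the actual DOC
API, and the assumption is that a rollback replaces the entire DOC content.

```python
import copy


class Doc:
    """Hypothetical stand-in for the DOC: a dict with snapshot/rollback."""

    def __init__(self):
        self.checkpoints = {}

    def snapshot(self):
        return copy.deepcopy(self.checkpoints)

    def rollback(self, snap):
        # Assumption: rollback replaces everything, so later updates are lost.
        self.checkpoints = copy.deepcopy(snap)


def register(doc, names):
    """Store checkpoint info in the DOC immediately upon registration."""
    for name in names:
        doc.checkpoints[name] = {"completed": False}


# First invocation: register, run everything, snapshot at exit.
doc = Doc()
register(doc, ["a", "b", "c", "d", "e"])
for info in doc.checkpoints.values():
    info["completed"] = True
saved = doc.snapshot()

# Second invocation: register a different list, then restore.
doc = Doc()
register(doc, ["a", "b", "b1", "b2", "b3", "b4"])
doc.rollback(saved)

# The checkpoints registered in this invocation are gone.
lost = [n for n in ["b1", "b2", "b3", "b4"] if n not in doc.checkpoints]
print(sorted(doc.checkpoints))  # ['a', 'b', 'c', 'd', 'e']
print(lost)                     # ['b1', 'b2', 'b3', 'b4']
```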

I'm fairly sure that I mentioned that the restore needs to be done *first*, and
then the application registers the checkpoints - in this case it would have to
be done as a merge, i.e. the application inserts the new checkpoints where it
wants them to be.

The restore cannot be done *first*.  In order to resume,
the application specifies which checkpoint it wants to resume
to.  Without first registering the checkpoints, how would the engine
know which checkpoints exist?  Therefore, the checkpoint registration
has to be done before the restore, and the checkpoint information
cannot be stored in the DOC, since the rollback might clobber it.

It would *never* work if you loaded a snapshot *after* putting anything into it
- it's a roll-back to that snapshot which automatically implies that any update
done since that snapshot was taken will be lost.

I totally agree that the rollback will destroy the existing content of
the DOC.  Therefore, the checkpoint information can not be stored
in the DOC after it is registered.  In addition, in my proposed
design of the engine, I specify that before the engine rolls
back the DOC, it makes sure that it is empty.  If not, it will raise
an exception.
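
As a rough sketch of that guard (the function name, the exception class, and
the use of a plain dict for the DOC content are all assumptions made for
illustration, not the engine's actual interface):

```python
class DocNotEmptyError(Exception):
    """Hypothetical error: a rollback would clobber existing DOC content."""


def rollback_doc(doc_content, snapshot):
    """Return the DOC content after rollback; refuse if the DOC is not empty."""
    if doc_content:
        raise DocNotEmptyError("DOC must be empty before rollback")
    # Copy so the caller's snapshot is not shared with the live DOC.
    return dict(snapshot)
```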

Thanks,

--Karen

Therefore, I have always wanted to keep all information about registered
checkpoints for each particular invocation of the application as *PRIVATE*
data in the engine.  This data will not be stored in the DOC unless the
engine chooses to.  All the data the engine eventually decides to store
in DOC will only be used for computing what's resumable in the next
invocation of the application.  Therefore, only data regarding
successfully executed checkpoints will be stored.
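
A minimal sketch of that separation, assuming a hypothetical `Engine` class
(the method names and the `ok` flag are invented for illustration and are not
the proposed engine API):

```python
class Engine:
    """Hypothetical engine that keeps registrations private to this run."""

    def __init__(self):
        self._registered = []    # private: never touched by a DOC rollback
        self._completed = set()

    def register(self, name):
        self._registered.append(name)

    def run(self, name, ok=True):
        # 'ok' stands in for whether the checkpoint executed successfully.
        if name not in self._registered:
            raise KeyError(name)
        if ok:
            self._completed.add(name)

    def data_for_doc(self):
        # Only successfully executed checkpoints are written out, in
        # registration order, for computing what is resumable next time.
        return [n for n in self._registered if n in self._completed]


engine = Engine()
for name in ["a", "b", "b1", "b2"]:
    engine.register(name)
engine.run("a")
engine.run("b")
engine.run("b1", ok=False)
print(engine.data_for_doc())  # ['a', 'b']
```

Because the registration list lives in the engine rather than in the DOC, a
rollback of the DOC leaves it untouched.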

I don't agree, and see no reason for it to be separate to the DOC.

Am I missing something?

Thanks,


Darren.


_______________________________________________
caiman-discuss mailing list
[email protected]
http://mail.opensolaris.org/mailman/listinfo/caiman-discuss
