[caiman-discuss] Stop/resume feature in new install engine

Karen Tung Fri, 26 Mar 2010 16:21:35 -0700

I uploaded the install engine feature highlight slides:
http://hub.opensolaris.org/bin/download/Project+caiman/CUE_docs/install-engine-feature-highlights.pdf


While going over these slides, we discussed how stop/pause and resume
would work. Keith suggested resuming from a snapshot of the data cache
in /tmp, and I initially thought it would be OK to add the feature.
Upon further consideration, I don't think it is a good idea to support
it. I would like to get your opinion on the issue.

In summary, I proposed that stop and resume work as follows in the engine:

- After each checkpoint executes successfully, a snapshot of the data
  cache will be taken.
- If the install target ZFS dataset is available, the data cache
  snapshot will be stored in the ZFS dataset.
- If the install target ZFS dataset is not available, the data cache
  snapshot will be stored in /tmp.
- For resumes without terminating the app, resumes are allowed from
  any previously successfully executed checkpoint.
- For an application that terminates and resumes upon restarting,
  resumes are only allowed from checkpoints that have a data cache
  snapshot saved in the install target ZFS dataset.
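
The rules above can be sketched in a few lines of Python. This is only
an illustration of the proposed policy, not the engine's interface: the
Engine class, its method names, and the checkpoint names used in the
example are all hypothetical.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Dict, List


class SnapshotLocation(Enum):
    ZFS_DATASET = "zfs"  # stored in the install target ZFS dataset
    TMP = "tmp"          # stored in /tmp


@dataclass
class Engine:
    """Hypothetical model of the proposed policy, not the real engine API."""

    checkpoints: List[str]
    # checkpoint name -> where its data cache snapshot was stored
    snapshots: Dict[str, SnapshotLocation] = field(default_factory=dict)

    def checkpoint_succeeded(self, name: str, zfs_available: bool) -> None:
        # A snapshot is taken after every successful checkpoint. It goes
        # to the ZFS dataset when that exists, otherwise to /tmp.
        if zfs_available:
            self.snapshots[name] = SnapshotLocation.ZFS_DATASET
        else:
            self.snapshots[name] = SnapshotLocation.TMP

    def resumable(self, app_restarted: bool) -> List[str]:
        if app_restarted:
            # After a restart, only snapshots kept in the ZFS dataset count.
            return [c for c in self.checkpoints
                    if self.snapshots.get(c) is SnapshotLocation.ZFS_DATASET]
        # Within the same run, any successfully executed checkpoint works.
        return [c for c in self.checkpoints if c in self.snapshots]


# Example with made-up checkpoint names: the first checkpoint ran before
# the ZFS dataset existed, the second one after.
engine = Engine(["target_discovery", "target_instantiation", "transfer"])
engine.checkpoint_succeeded("target_discovery", zfs_available=False)
engine.checkpoint_succeeded("target_instantiation", zfs_available=True)
```

In this example, a resume within the same run could start from either
completed checkpoint, while a resume after restarting the application
could only start from the one whose snapshot is in the ZFS dataset.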

See slides 10-17 for more details.

During the discussion, Keith suggested allowing resume even if the data
cache snapshot is not stored in the ZFS dataset, since the data cache
snapshot stored in /tmp can be used. During the meeting, I thought it
would be OK to support that as well. However, after further
consideration, I thought of a couple of reasons to oppose supporting it.

1) If we allow a copy of a snapshot to be provided to the engine for
resume, we need to provide an interface for the user/app to determine
which snapshot file belongs to which process. Arguably, one can guess
based on timestamps, knowledge of the engine, etc. However, all of
those are implementation specific, and depending on how things evolve,
they are not "official interfaces" that users/applications can count on.

2) The current design will remove all snapshot copies in /tmp when the
application/engine terminates. If we want to allow resume from a
snapshot copy in /tmp, we can't clean up those snapshots, and over
time we will clutter up /tmp with a lot of snapshot files.
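
To make the cleanup behavior concrete, here is a minimal sketch of
removing the /tmp snapshots at termination. It assumes the snapshots
live in a private temporary directory; the TmpSnapshotStore class, the
directory prefix, and the ".snapshot" suffix are hypothetical and not
taken from the actual design.

```python
import os
import shutil
import tempfile


class TmpSnapshotStore:
    """Hypothetical holder for data cache snapshots kept under /tmp."""

    def __init__(self) -> None:
        # mkdtemp creates a private directory under the system temp
        # location (normally /tmp). The prefix is an invented name.
        self.directory = tempfile.mkdtemp(prefix="engine_snapshots_")

    def save(self, checkpoint: str, data: bytes) -> str:
        # Write one snapshot file per checkpoint and return its path.
        path = os.path.join(self.directory, checkpoint + ".snapshot")
        with open(path, "wb") as snapshot_file:
            snapshot_file.write(data)
        return path

    def cleanup(self) -> None:
        # Remove every snapshot together with the directory. Calling
        # this more than once is harmless.
        shutil.rmtree(self.directory, ignore_errors=True)
```

An application could call cleanup() from its shutdown path, or register
it with atexit.register(). Supporting resume from /tmp would mean
skipping this step, which is where the clutter would come from.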

Based on this, I don't think it is a good idea to "officially" support
resuming from a data cache snapshot in /tmp. I can probably leave a
backdoor in the code to enable it somehow if needed.

I would like to hear your thoughts on this.

If you want to discuss or comment on anything else in the highlight
slides, please feel free to post your comments too.

Thanks,

--Karen
