On 03/30/10 08:53 AM, Dave Miner wrote:
On 03/26/10 07:21 PM, Karen Tung wrote:
I uploaded the install engine feature highlight slides:
http://hub.opensolaris.org/bin/download/Project+caiman/CUE_docs/install-engine-feature-highlights.pdf



During the discussion of these slides, we talked about how stop/pause
and resume would work. Keith suggested resuming from a snapshot of the
data cache in /tmp, and I initially thought it would be OK to add the
feature. Upon further consideration, I don't think it is a good idea to
support it. I would like to get your opinion on the issue.

In summary, I proposed that stop and resume work as follows in the
engine (a rough sketch of the flow follows the list):

- After successful execution of each checkpoint, a snapshot of the
  data cache is taken.
- If the install target ZFS dataset is available, the data cache
  snapshot is stored in that dataset.
- If the install target ZFS dataset is not available, the data cache
  snapshot is stored in /tmp.
- If the application has not terminated, it may resume from any
  previously successfully executed checkpoint.
- If the application terminates and is restarted, it may resume only
  from checkpoints whose data cache snapshots are saved in the install
  target ZFS dataset.

See slides 10-17 for more details.
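
To make the flow concrete, here is a rough sketch of the logic I have
in mind. All of the names (run_checkpoint, take_snapshot, etc.) are
illustrative only, not the actual engine API:

    import os
    import shutil

    def run_checkpoint(checkpoint, data_cache, zfs_mountpoint=None):
        """Execute one checkpoint; snapshot the data cache on success.

        zfs_mountpoint is where the install target ZFS dataset is
        mounted, or None if the target is not yet available.
        """
        checkpoint.execute(data_cache)              # raises on failure

        snapshot_file = data_cache.take_snapshot()  # serialized cache
        if zfs_mountpoint is not None:
            # Persists across application restarts, so resume from
            # this checkpoint is allowed even after termination.
            dest_dir = os.path.join(zfs_mountpoint, ".install_snapshots")
        else:
            # /tmp copy: resume allowed only within the same engine run.
            dest_dir = "/tmp/install_engine"
        if not os.path.isdir(dest_dir):
            os.makedirs(dest_dir)
        shutil.move(snapshot_file,
                    os.path.join(dest_dir, checkpoint.name + ".cache"))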

During the discussion, Keith suggested allowing resume even if the data
cache snapshot is not stored in the ZFS dataset, since the snapshot
stored in /tmp can be used instead. During the meeting I thought it
would be OK to support that as well. After further consideration,
however, I see a couple of reasons to oppose it.

1) If we allow a copy of the snapshot to be provided to the engine for
resume, we need to provide an interface for the user/application to
determine which snapshot file belongs to which process. Arguably, one
can guess based on timestamps, knowledge of the engine, and so on.
However, those are all implementation specific; depending on how things
evolve, they are not "official interfaces" that users and applications
can count on.
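
For example, about the best a caller could do today is guess by
filename convention and modification time, along these lines (purely
illustrative; nothing here is a supported interface):

    import glob
    import os

    def guess_latest_snapshot(tmpdir="/tmp"):
        # Relies on the engine's private naming convention and on
        # mtime ordering -- both implementation details that could
        # change in any release.
        candidates = glob.glob(os.path.join(tmpdir,
                                            "install_engine*.cache"))
        if not candidates:
            return None
        return max(candidates, key=os.path.getmtime)

If two installer processes ran on the same system, a guess like this
can silently pick up the wrong process's snapshot.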


I'd argue, though, that you have this same problem with respect to the
ZFS snapshots; we've just deliberately ignored it. Or am I missing
something about how you're expecting to tag them to record the
"ownership"?

I think that in today's world with multiple DC runs, we have the snapshot in the user's DC build area, so we know that the latest is correct. So, in this case we know that the user's DC process is the one that owns the snapshots.

In thinking about this for other install applications, we could have multiple ZFS snapshots on a system that we could resume from, so I suppose we would have to use the latest in that case?

For other installers, such as AI, if we stop the AI process and have multiple snapshots of the data cache in /tmp, it isn't likely that there are other users of AI on that system who would have written to /tmp. Karen, I know we had this conversation, but I am wondering whether the issues, as Dave points out, are the same for snapshots in /tmp and ZFS snapshots?

thanks,
sarah
*****

2) The current design removes all copies of the snapshot in /tmp when
the application/engine terminates. If we want to allow resume from a
copy of the snapshot in /tmp, we can't clean up those snapshots, and
over time /tmp becomes cluttered with a lot of snapshot files.
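
The cleanup itself is simple; roughly this, assuming the engine
registers an exit handler (names illustrative):

    import atexit
    import glob
    import os

    def _cleanup_tmp_snapshots(tmpdir="/tmp"):
        # Current design: on application/engine exit, remove every
        # data cache snapshot that was written to /tmp.
        for path in glob.glob(os.path.join(tmpdir,
                                           "install_engine*.cache")):
            os.remove(path)

    atexit.register(_cleanup_tmp_snapshots)

Supporting resume from /tmp means skipping this step, which is exactly
what would leave /tmp cluttered over time.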


Well, this is likely to be true in the applications as well - DC leaves
snapshots around deliberately, but the installers probably won't on
successful completion. In any event, I think you'll need to make the
cleanup behavior controllable in some way, at the very least for debug
purposes. So, if they're going to be around...
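
Something as simple as an environment knob would do; the name here is
purely hypothetical:

    import os

    def should_clean_snapshots():
        # Debug override: set SI_KEEP_SNAPSHOTS in the environment to
        # keep data cache snapshots around after the engine exits.
        return "SI_KEEP_SNAPSHOTS" not in os.environ

The exit handler would just consult this before removing anything.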

Based on this, I don't think it is a good idea to "officially" support
resuming from a data cache snapshot in /tmp. I can probably leave a
backdoor in the code to enable it somehow if needed.

I would like to hear your thoughts on this.


I think I'd consider a little more closely why this would or would not
be useful to support, as I'm not sure the issues you've raised are all
that unique. For example, I'd think target discovery could be relatively
expensive on more complex storage topologies such that it may be
convenient to have restart capability post-TD.

Dave

_______________________________________________
caiman-discuss mailing list
[email protected]
http://mail.opensolaris.org/mailman/listinfo/caiman-discuss
