[caiman-discuss] Stop/resume feature in new install engine

Karen Tung Tue, 30 Mar 2010 16:47:50 -0700

Hi Dave,

Please see my responses inline.


On 03/30/10 13:57, Dave Miner wrote:
> On 03/30/10 03:52 PM, Karen Tung wrote:
>> Hi Dave,
>>
>> Please see my responses inline.
>>
>> On 03/30/10 07:53, Dave Miner wrote:
>>> On 03/26/10 07:21 PM, Karen Tung wrote:
>>>> I uploaded the install engine feature highlight slides:
>>>> http://hub.opensolaris.org/bin/download/Project+caiman/CUE_docs/install-engine-feature-highlights.pdf
>>>>
>>>> During the discussion of these slides, we discussed how stop/pause and
>>>> resume would work. Keith suggested resuming from a snapshot of the data
>>>> cache in /tmp, and I initially thought it would be OK to add the feature.
>>>> Upon further consideration, I don't think it is a good idea to support
>>>> it. I would like to get your opinion on the issue.
>>>>
>>>> In summary, I proposed that stop and resume work as follows in the
>>>> engine:
>>>>
>>>> - After each checkpoint executes successfully, a snapshot of the data
>>>> cache will be taken.
>>>> - If the install target ZFS dataset is available, the data cache
>>>> snapshot will be stored in the ZFS dataset.
>>>> - If the install target ZFS dataset is not available, the data cache
>>>> snapshot will be stored in /tmp.
>>>> - For resumes without terminating the app, resumes are allowed from
>>>> any previously successfully executed checkpoint.
>>>> - For an application that terminates and resumes upon restarting,
>>>> resumes are only allowed from checkpoints that have a data cache
>>>> snapshot saved in the install target ZFS dataset.
>>>>
>>>> See slides 10-17 for more details.
>>>>
>>>> During the discussion, Keith suggested allowing resume to happen even
>>>> if the data cache snapshot is not stored in the ZFS dataset, since the
>>>> data cache snapshot stored in /tmp can be used. During the meeting, I
>>>> thought it would be OK to support that as well.
>>>> However, after further consideration, I thought of a couple of reasons
>>>> to oppose supporting it.
>>>>
>>>> 1) If we allow a copy of a snapshot to be provided to the engine for
>>>> resume, we need to provide an interface for the user/app to determine
>>>> which snapshot file belongs to which process.
>>>> Arguably, one can guess based on the timestamp, knowledge of the
>>>> engine, etc.
>>>> However, all of those are implementation specific and, depending on how
>>>> things evolve, they are not "official interfaces" that
>>>> users/applications can count on.
>>>>
>>>
>>> I'd argue, though, that you have this same problem with respect to the
>>> ZFS snapshots, we've just deliberately ignored it. Or am I missing
>>> something about how you're expecting to tag them to record the
>>> "ownership"?
>>
>> IMO, the snapshot files from the data cache and the ZFS snapshots are
>> implementation details that should not be exposed to the user. The
>> "official interface" for an application to resume, as currently designed,
>> is that it will supply the ZFS dataset that is used as the installation
>> target. Each application does "own" that, and it is a well-defined value.
>> The engine is just taking advantage of it and storing its own
>> bookkeeping information there.
>
> I agree that snapshots and such are implementation details; even in DC 
> right now we do not expose those as an interface, but resumption is 
> specified to the application using a checkpoint identifier.  So I'm 
> still not seeing much of a difference.  It really seems to be about 
> the naming of what you're storing.

Right.  To me, the engine asking the application to provide the name of
the install target ZFS dataset is a pretty well-defined thing.  On the
other hand, having the app somehow figure out the name of a file generated
by the engine, when that name is different every time, is not so well
defined.

Since we want to provide an interface in the engine to support resuming
from a supplied snapshot of the data cache, as noted below, I will just
leave it up to the app to figure out which data cache snapshot to use.
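
To make the distinction concrete, here is a minimal Python sketch of the
resume rules as proposed earlier in this thread. It is not the engine's
code: `Checkpoint`, `snapshot_location`, `resumable_checkpoints`, and the
checkpoint names other than target discovery are hypothetical names used
only for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Checkpoint:
    """A successfully executed checkpoint (hypothetical record type)."""
    name: str
    # True if the data cache snapshot for this checkpoint was stored in
    # the install target ZFS dataset rather than in a scratch directory.
    snapshot_in_zfs: bool


def snapshot_location(target_dataset_available: bool) -> str:
    """Pick where a data cache snapshot goes after a checkpoint succeeds."""
    return "zfs" if target_dataset_available else "tmp"


def resumable_checkpoints(completed: List[Checkpoint],
                          same_process: bool) -> List[str]:
    """Return the names of checkpoints that a resume may start from.

    Within the same running application, any successfully executed
    checkpoint qualifies. After the application has terminated and been
    restarted, only checkpoints whose snapshot lives in the install
    target ZFS dataset qualify.
    """
    if same_process:
        return [c.name for c in completed]
    return [c.name for c in completed if c.snapshot_in_zfs]


# Hypothetical run: target discovery finishes before the install target
# dataset exists, so its snapshot could not be stored in ZFS.
completed = [
    Checkpoint("target_discovery", snapshot_in_zfs=False),
    Checkpoint("target_instantiation", snapshot_in_zfs=True),
    Checkpoint("transfer", snapshot_in_zfs=True),
]
```

Under these rules the restarted application only needs to name its
install target dataset; it never has to locate an engine-generated file.
Supporting resume from a supplied data cache snapshot would relax the
second rule, with the app responsible for choosing the snapshot.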
>
>>>
>>>> 2) The current design will remove all copies of snapshots in /tmp when
>>>> the application/engine terminates. If we want to allow resume from a
>>>> copy of a snapshot in /tmp, we won't be cleaning up those snapshots,
>>>> and over time we will clutter up /tmp with a lot of snapshot files.
>>>>
>>>
>>> Well, this is likely to be true in the applications as well - DC
>>> leaves snapshots around deliberately, but the installers probably
>>> won't on successful completion. In any event, I think you'll need to
>>> make the cleanup behavior controllable in some way, at the very least
>>> for debug purposes. So, if they're going to be around...
>>
>> Yes, they will be around, but since each application will write to its
>> own install target, all the snapshots will belong to that application,
>> and there will only be a fixed number of snapshots in one install target
>> regardless of how many times you run the application.
>>
>> On the other hand, all processes that use the engine will also use /tmp,
>> and to make sure one process does not overwrite the files from another,
>> we will probably be naming the files with the pid or something. So, every
>> time the program is run, new copies of the files are created. If we don't
>> clean them up when each process exits, /tmp might get very cluttered.
>
> A nit: /var/run, not /tmp, for privileged processes, which is going to 
> be the primary case here.
Good point, will update the design with this.
>
>>>
>>>> Based on this, I don't think it is a good idea to "officially" support
>>>> resuming from a data
>>>> cache snapshot in /tmp. I can probably leave a backdoor in the code to
>>>> enable it
>>>> somehow if needed.
>>>>
>>>> I would like to hear your thoughts on this.
>>>>
>>>
>>> I think I'd consider a little more closely why this would or would not
>>> be useful to support, as I'm not sure the issues you've raised are all
>>> that unique. For example, I'd think target discovery could be
>>> relatively expensive on more complex storage topologies such that it
>>> may be convenient to have restart capability post-TD.
>>
>> I do agree with you on this point, that's why I was considering
>> providing a backdoor in the code,
>> such as setting some debugging env variable or some such thing which
>> will preserve the snapshot
>> files in /tmp. However, by default, everything in /tmp will be 
>> cleaned up.
>>
>
> I would recommend you make it a formal part of the engine interface 
> and leave it to the applications how it might be used/exposed (by 
> environment variable or whatever).
>
Sounds good.
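
One way to picture cleanup as a formal, controllable part of the engine
interface is the rough Python sketch below. It is only an illustration
under stated assumptions: the `SnapshotStore` class, its `preserve` flag,
and the `engine.<pid>.<checkpoint>.snapshot` file naming are all
hypothetical and not part of any existing design.

```python
import os
import tempfile
from typing import List, Optional


class SnapshotStore:
    """Per-process store for data cache snapshot files (hypothetical)."""

    def __init__(self, base_dir: Optional[str] = None,
                 preserve: bool = False) -> None:
        # A privileged process would pass /var/run here; the system
        # temporary directory is only a fallback for this sketch.
        self.base_dir = base_dir or tempfile.gettempdir()
        # When True, cleanup() leaves the files in place so a later run
        # (or a developer debugging a failure) can still use them.
        self.preserve = preserve
        self.paths: List[str] = []

    def path_for(self, checkpoint_name: str) -> str:
        # Include the pid so concurrent processes do not overwrite
        # each other's snapshot files.
        filename = f"engine.{os.getpid()}.{checkpoint_name}.snapshot"
        return os.path.join(self.base_dir, filename)

    def save(self, checkpoint_name: str, data: bytes) -> str:
        path = self.path_for(checkpoint_name)
        with open(path, "wb") as f:
            f.write(data)
        self.paths.append(path)
        return path

    def cleanup(self) -> List[str]:
        """Remove this process's snapshot files unless preserve is set.

        Returns the list of paths that were left on disk.
        """
        if self.preserve:
            return list(self.paths)
        for path in self.paths:
            if os.path.exists(path):
                os.remove(path)
        self.paths = []
        return []
```

An application could then decide for itself how to expose the flag, for
example by reading an environment variable of its own choosing and
passing the result as `preserve`.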

Thanks for all your comments and suggestions.

--Karen
