Hi Sarah and Dave,

Please see my comments inline.


On 03/30/10 12:16, Dave Miner wrote:
> On 03/30/10 11:42 AM, Sarah Jelinek wrote:
>>
>>
>> On 03/30/10 08:53 AM, Dave Miner wrote:
>>> On 03/26/10 07:21 PM, Karen Tung wrote:
>>>> I uploaded the install engine feature highlight slides:
>>>> http://hub.opensolaris.org/bin/download/Project+caiman/CUE_docs/install-engine-feature-highlights.pdf
>>>>
>>>> During the discussion of these slides, we talked about how stop/pause
>>>> and resume would work.  Keith suggested resuming from a snapshot of the
>>>> data cache in /tmp, and I initially thought it would be OK to add the
>>>> feature.  Upon further consideration, I don't think it is a good idea
>>>> to support it.  I would like to get your opinion on the issue.
>>>>
>>>> In summary, I proposed for stop and resume to work as follows in the
>>>> engine:
>>>>
>>>> - After successful execution of each checkpoint, a snapshot of the
>>>> data cache will be taken.
>>>> - If the install target ZFS dataset is available, the data cache
>>>> snapshot will be stored in the ZFS dataset.
>>>> - If the install target ZFS dataset is not available, the data cache
>>>> snapshot will be stored in /tmp.
>>>> - For resumes without terminating the app, resumes are allowed from
>>>> any previously successfully executed checkpoint.
>>>> - For applications that terminate and resume upon restarting, resumes
>>>> are only allowed from checkpoints that have a data cache snapshot
>>>> saved in the install target ZFS dataset.
>>>>
>>>> See slides 10-17 for more details.
>>>>
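To restate the resume rules above in compact form (purely illustrative;
none of these names are the engine's actual API):

    def resumable_checkpoints(executed, app_restarted, snapshots_on_target):
        # Hypothetical helper, not engine code.  'executed' is the ordered
        # list of checkpoints that completed successfully;
        # 'snapshots_on_target' is the subset whose data cache snapshot was
        # written to the install target ZFS dataset.
        if not app_restarted:
            # App never terminated: any completed checkpoint is resumable.
            return list(executed)
        # App terminated and restarted: only checkpoints whose snapshot
        # made it onto the target dataset are resumable.
        return [cp for cp in executed if cp in snapshots_on_target]
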
>>>> During the discussion, Keith suggested allowing resume to happen even
>>>> if the data cache snapshot is not stored in the ZFS dataset, since the
>>>> data cache snapshot stored in /tmp can be used.  During the meeting I
>>>> thought it would be OK to support that as well.  However, after further
>>>> consideration, I came up with a couple of reasons for opposing it.
>>>>
>>>> 1) If we allow a copy of the snapshot to be provided to the engine for
>>>> resume, we need to provide an interface for the user/app to determine
>>>> which copy of the snapshot file belongs to which process.
>>>> Arguably, one can guess based on the timestamp, knowledge of the
>>>> engine, etc.
>>>> However, all of those are implementation specific, and depending on how
>>>> things evolve, they are not "official interfaces" that
>>>> users/applications can count on.
>>>>
>>>
>>> I'd argue, though, that you have this same problem with respect to the
>>> ZFS snapshots; we've just deliberately ignored it. Or am I missing
>>> something about how you're expecting to tag them to record the
>>> "ownership"?
I think I was not clear enough in the previous email, but after seeing this
exchange between you and Sarah, I see your point.

These snapshots that I am talking about are snapshots of the data cache
that are taken after each checkpoint is run.  Before ZFS is available,
these snapshots need to be stored somewhere.  I chose to store them in
/tmp since it is an area all applications can write to.  Since /tmp is
writable by all applications, I will need to create these data cache
snapshots with something unique in the name, like the pid.  That does not
mean that particular application really "owns" the files.  Any subsequent
run of the application, even by a different user, can use those files.
The unique name is needed so that multiple processes running at the same
time, each storing their copies of the data cache in /tmp, don't overwrite
each other's data.  They don't really "own" those files.
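
Concretely, I am thinking of something along these lines (the function
name, the file name, and the use of pickle are only for illustration, not
a committed interface):

    import os
    import pickle

    def save_cache_snapshot(data_cache, checkpoint_name, target_mountpoint=None):
        # If the install target ZFS dataset is mounted, store the snapshot
        # there; before that point, fall back to /tmp.
        if target_mountpoint and os.path.isdir(target_mountpoint):
            dest_dir = target_mountpoint
        else:
            dest_dir = "/tmp"
        # Embed the pid so concurrent runs don't overwrite each other's
        # files; the pid does not imply ownership of the file.
        path = os.path.join(dest_dir, ".data_cache.%s.%d"
                            % (checkpoint_name, os.getpid()))
        with open(path, "wb") as f:
            pickle.dump(data_cache, f)
        return path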

If these snapshots of the data cache are left there, they can be used by
other processes.  The problem is, how do the applications identify which
snapshot belongs to which application, and which snapshot file corresponds
to the last step executed for that application?
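
To make the problem concrete: a later run that wants to resume from /tmp
can only guess, e.g. by globbing for the files and picking the newest
(file names here follow the sketch above, and none of this is a supported
interface):

    import glob
    import os

    candidates = glob.glob("/tmp/.data_cache.*")        # files from every run
    if candidates:
        newest = max(candidates, key=os.path.getmtime)   # guess: newest == latest
    # The file naming, the timestamps, even the directory are implementation
    # details that may change, so applications cannot rely on any of this.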


>>
>> I think that in today's world with multiple DC runs, we have the
>> snapshot in the user's DC build area, so we know that the latest is
>> correct. So, in this case we know that the user's DC process is the one
>> that owns the snapshots.
>>
>
> I don't think we actually do.  DC is always run as root, and it's run 
> off of a particular manifest.  All we know is that the manifest 
> corresponds to the build area by comparison of the manifest supplied 
> against the copy stored in the snapshot.  That appears to tell us 
> nothing about the actual user.
>
>> In thinking about this for other install applications, we could have
>> multiple ZFS snapshots that we could resume from on a system, so I
>> suppose we would have to use the latest in this case?
>>
>
> I think you could easily associate a particular snapshot with the name 
> of the checkpoint.  That's really all the ZFS snapshots do.
Yes, that's what's being planned.  That's why the checkpoint name is 
required to be unique.
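
Roughly, the ZFS snapshot is simply named after the checkpoint, which is
why unique checkpoint names matter.  A sketch (the dataset name is just an
example, and this is not the engine's actual code):

    import subprocess

    def snapshot_after_checkpoint(dataset, checkpoint_name):
        # e.g. snapshot_after_checkpoint("rpool/ai_install", "target-instantiation")
        # creates the snapshot "rpool/ai_install@target-instantiation"
        snap = "%s@%s" % (dataset, checkpoint_name)
        subprocess.check_call(["/usr/sbin/zfs", "snapshot", snap])
        return snap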

Thanks,

--Karen

>
>> For other installers, such as AI, if we stop the AI process and have
>> multiple snapshots of the data cache in /tmp, it isn't likely that there
>> are other users of AI on that system that would have written to /tmp.
>> Karen, I know we had this conversation, but I am wondering if the
>> issues, as Dave points out, are the same for snapshots in /tmp and ZFS
>> snapshots?
>>
>
> When using ZFS snapshots, we don't have to figure out which snapshot 
> is latest or corresponds to a particular checkpoint, as the file 
> system takes care of that, but otherwise I just don't see a difference.
> Dave
