Hi Jack,
Thank you very much for reviewing the document. Please see my responses
inline.
On 06/15/10 12:32 PM, Jack Schwartz wrote:
Hi Karen.
This document is pretty comprehensive and complete. Here are my
comments, submitted late with your permission:
Section 1:
Readers may find it useful to make the connection that DC can be seen
as an "installer" in the sense that it assembles a target image.
I am not clear on which part of section 1 you are referring to, but I
assume it is the last sentence of the first paragraph in section 1.
Do you think it would be clearer to change "building an image" to
"assemble a target image"?
Section 2:
In the same vein as for section 1, instead of saying "executing an
installation", does saying "executing an image-build" make more sense,
as that includes DC?
Actually, I think that would be more confusing. I think it is OK to say
"executing an installation" here, because section 1 already makes the
point that constructing an image is similar to executing an
installation.
Section 5.2:
Not sure it adds value to the doc to list the whole class here. Maybe
the method signatures with a brief (e.g. 1-line) description?
Since the whole class is very simple, I figured I would list it.
However, based on comments from others, I have added more description
to each of the functions, so it is clearer what each of them is
supposed to do.
6.3.4:
checkpoint_obj: I concur with other respondents that this would be
better called checkpoint_class or checkpoint_class_name.
Yep, changed to checkpoint_class_name.
args: I think this has to be a list, but the doc doesn't say that
explicitly. Also, if there is only one arg, does it have to be
specified as a one-item list? Is an empty list OK?
It is not a list. If you specify a list, the list will be passed as a
single argument. Python allows you to specify as many positional
arguments as you like; the *args mechanism will take care of them.
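To illustrate the point about *args with plain Python (the register-style function below is a made-up stand-in, not the engine's actual register_checkpoint signature):

```python
# Hypothetical function using *args, just to show the collection rules.
def register(name, *args):
    # args is always a tuple of the extra positional arguments passed in.
    return len(args)

# Three separate arguments -> *args collects three items.
assert register("transfer", "a", "b", "c") == 3

# A single list is passed through as one argument, not expanded.
assert register("transfer", ["a", "b", "c"]) == 1

# The * unpacking operator expands a list into separate arguments.
assert register("transfer", *["a", "b", "c"]) == 3
```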
checkpoint log level: This paragraph is confusing to me. Which two
log levels are different from each other? Do you mean the application
wants to use a different log level than specified in this argument?
Isn't it the application that calls register_checkpoint when it sets
up the engine? Why would a keyword arg be needed if the log level is
specified here already? Since each checkpoint is registered
separately, each can already have its own level.
Yes, each checkpoint can have its own log level.
The overall application/engine/checkpoints have a log level. If the
application wants to run a checkpoint with a different log level, it
can specify the checkpoint's log level at registration time.
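As a rough sketch of what "a log level at registration time" could look like (the function and registry here are illustrative assumptions, not the engine's real API):

```python
# Hypothetical per-checkpoint log-level override at registration time.
import logging

OVERALL_LEVEL = logging.INFO  # level used by application/engine/checkpoints

registry = {}

def register_checkpoint(name, log_level=None):
    # With no explicit level, the checkpoint inherits the overall level;
    # otherwise it runs with its own level.
    registry[name] = log_level if log_level is not None else OVERALL_LEVEL

register_checkpoint("target-discovery")
register_checkpoint("transfer", log_level=logging.DEBUG)
```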
6.6.1: cleanup_checkpoint(): I would change the name to
cleanup_checkpoints() since it cleans up all checkpoints that have
been executed in the current run, not just one.
Yes, will change.
7.1.1: So to be sure I understand, a checkpoint can be interactive and
can register or change subsequent checkpoints based on input, right?
No, a checkpoint is not supposed to be aware of any other checkpoints;
it is supposed to operate by itself. For interactive installers, the
application will run one or more checkpoints, then pause, interact
with the user, then continue executing the remaining checkpoints. The
application is the one that can register additional checkpoints based
on input.
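That flow can be sketched roughly as follows (the Engine class, method names, and checkpoint names are all illustrative assumptions, not the actual engine API):

```python
# Hypothetical sketch: the application, not the checkpoints, drives the
# pause/interact/register-more cycle.
class Engine:
    def __init__(self):
        self.pending = []
        self.executed = []

    def register_checkpoint(self, name):
        self.pending.append(name)

    def execute_checkpoints(self, pause_before=None):
        # Run registered checkpoints in order, stopping at the pause point.
        while self.pending:
            if self.pending[0] == pause_before:
                return
            self.executed.append(self.pending.pop(0))

engine = Engine()
for name in ["target-discovery", "target-selection"]:
    engine.register_checkpoint(name)

# Application runs some checkpoints, then pauses to interact with the user.
engine.execute_checkpoints(pause_before="target-selection")
user_choice = "zfs-root-pool"  # would come from the UI in practice

# Based on the input, the application registers an additional checkpoint;
# the checkpoints themselves never know about one another.
engine.register_checkpoint("transfer-" + user_choice)
engine.execute_checkpoints()
```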
7.2:
- I think the first sentence is trying to say that some limitations
exist because ZFS snapshots are used. Is this correct?
Yes
- In the first bullet, ZFS and data cache snapshots are mentioned. Is
the data cache snapshot also ZFS? If not, isn't it not limited by ZFS
limitations? If it is ZFS, how can the second bullet be true?
Taking data cache snapshots doesn't require ZFS. However, for
out-of-process resume to work, data cache snapshots must be stored in
a ZFS dataset. Data cache snapshots taken before the ZFS dataset is
available are not stored anywhere, so we cannot resume at those
checkpoints: the engine will not know where to find the DOC snapshot
corresponding to that checkpoint.
7.3.2.2:
- Termites -> terminates (have you been talking to Sue, lately?)
:-) Obviously, should be terminates.
- Lead sentence talks of finding out which checkpoints are resumable,
but the first bullet talks about registering checkpoints, which is
different. Perhaps for the lead sentence you mean something like this:
"For an application that terminates, a subsequent invocation of the
application might want to resume. That application would have to do
one of the following to establish resumable checkpoints:"
Actually, the two bullets are the two steps an application must take
to find out which checkpoints are resumable. They don't do just one of
the steps; they must do both.
7.4:
- So the DataObjectCache snapshot will be stored in multiple places?
It will be in /tmp (or /var/run or wherever) as well as stored as part
of the ZFS dataset? If there is no ZFS dataset, the DOC snapshot in
"/tmp" will be used?
If there is no ZFS dataset, the DOC snapshot will be stored in
/var/run. When a ZFS dataset is available, DOC snapshots will be
stored in the ZFS dataset. When the engine terminates, all DOC
snapshots in /var/run are removed.
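A small sketch of that storage rule (the path layout and file names below are invented for illustration; only the /var/run-versus-dataset split comes from the discussion):

```python
# Hypothetical helper choosing where a DOC snapshot would be written.
import os

def doc_snapshot_path(checkpoint, zfs_dataset_mountpoint=None):
    if zfs_dataset_mountpoint is None:
        # No ZFS dataset yet: snapshot lives in volatile storage and is
        # removed when the engine terminates, so no out-of-process resume.
        return os.path.join("/var/run", "doc-%s.snapshot" % checkpoint)
    # Dataset available: store the snapshot with it so a later invocation
    # of the application can find it and resume at this checkpoint.
    return os.path.join(zfs_dataset_mountpoint, ".doc",
                        "%s.snapshot" % checkpoint)
```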
- Last PP: It says "the engine verifies the DOC is empty before
rolling back to a DOC snapshot." Wouldn't the normal case be that the
DOC isn't empty on resume? (See 7.4.1 #3.) If so, no rollbacks would
ever occur. I'm missing something here...
Why would the DOC not be empty? If an application wants to resume,
that request should happen immediately after all the checkpoints are
registered, before anything is executed. Registering checkpoints does
not put anything in the DOC, so when the engine receives a resume
request, the DOC should be empty.
7.5: resume_execute_checkpoint Description PP: Won't rollback be to the
state *before* the named checkpoint is completed, rather than *after*?
Yes, changed.
10.1: I'm not sure a standardized machine is needed nor feasible.
(Eventually that machine would become obsolete and unavailable; then
what?) I suggest creating a program against which checkpoint times
can be standardized. For example, regardless of the machine the test
program runs on, let's say it will take 1000 units of time to run.
On the same machine it will take checkpoint One an amount of X units
of time to run. Then when you run on a faster machine, both test and
checkpoint programs will run proportionately faster. (I know I'm
oversimplifying this and different things (e.g. CPU intensive ops vs
network intensive ops) run faster or slower on different machines, but
this is to get an approximation. If some of all kinds of ops are
built into the test program it will be more normalized to the
different machines.)
Then each checkpoint could return its number of time units to perform
its task, and have a method inside it to return the % done.
So this program would still need to be executed on a "standard"
machine, right? Then we run the checkpoint on that same "standard"
machine, and derive the unit value from what we get from the program
and what we get from the checkpoint? This still doesn't solve the
problem of the "standard" machine becoming obsolete. When our
"standard" machine becomes obsolete, all the checkpoints would have to
be re-run on the new standard machine, and the
get_progress_estimate() function updated to return the new values.
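The normalization Jack proposes reduces to simple arithmetic; here is a sketch with invented timings (the 1000-unit reference is from his suggestion, everything else is made up for illustration):

```python
# The reference program is *defined* to take 1000 units on any machine,
# so a checkpoint's unit count is its wall time scaled by the reference
# program's wall time on the same machine. All timings are invented.
REFERENCE_UNITS = 1000

def checkpoint_units(checkpoint_seconds, reference_seconds):
    # Machine-independent to the extent that both programs speed up or
    # slow down proportionately on a given machine.
    return REFERENCE_UNITS * checkpoint_seconds / reference_seconds

# On the "standard" machine: reference program takes 50s, checkpoint 20s.
units = checkpoint_units(20.0, 50.0)  # 400 units

# On a machine twice as fast, both halve their times; units are unchanged.
assert checkpoint_units(10.0, 25.0) == units
```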
General: Tomorrow when I'm back in the office, I'll turn over my
hardcopy which has grammatical corrections, etc, since as my
officemate you are conveniently located :) .
Thanks again, Jack, for the review.
--Karen
_______________________________________________
caiman-discuss mailing list
[email protected]
http://mail.opensolaris.org/mailman/listinfo/caiman-discuss