Hi Karen.
On 06/15/10 03:58 PM, Karen Tung wrote:
Hi Jack,
Thank you very much for reviewing the document. Please see my responses
inline.
On 06/15/10 12:32 PM, Jack Schwartz wrote:
Hi Karen.
This document is pretty comprehensive and complete. Here are my
comments, submitted late with your permission:
Section 1:
Readers may find it useful to make the connection that DC can be seen
as an "installer" in the sense that it assembles a target image.
I am not clear on which part of section 1 you are referring to, but I
assume it is the last sentence of the first paragraph of section 1. Do
you think it is clearer to change "building an image" to "assemble a
target image"?
Actually, you could replace that sentence with "Both DC and the
installer are similar to each other in that they both assemble an image."
Section 2:
In the same vein as for section 1, instead of saying "executing an
installation", does saying "executing an image-build" make more
sense, as that includes DC?
Actually, I think that would be more confusing. I think it is OK to say
"executing an installation" here because section 1 already makes the
point that constructing an image is similar to executing an installation.
OK
Section 5.2:
Not sure it adds value to the doc to list the whole class here.
Maybe the method signatures with a brief (e.g. 1-line) description?
Since the whole class is very simple, I figured I would list it.
However, based on others' comments, I have added more description to
each of the functions, so it is clearer what each of the functions is
supposed to do.
OK. Method descriptions are fine, even needed. However, I don't think
showing the "raise NotImplementedError" statements adds value, since
these methods will be overridden anyway (which is why those statements
are there).
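For readers following along, this is the pattern being discussed: the base
class's methods raise NotImplementedError, and every concrete checkpoint
overrides them. A minimal sketch (class and method names here are
hypothetical, not taken from the document):

```python
class AbstractCheckpoint:
    """Hypothetical base class; subclasses must override execute()."""
    def execute(self):
        # This stub exists only to fail loudly if a subclass forgets
        # to override it, which is why listing it in the doc adds little.
        raise NotImplementedError

class TransferCheckpoint(AbstractCheckpoint):
    """Hypothetical concrete checkpoint overriding the stub."""
    def execute(self):
        return "transferred"

# The override is what actually runs; the stub is never reached.
assert TransferCheckpoint().execute() == "transferred"

# Calling the base class directly raises, as intended.
try:
    AbstractCheckpoint().execute()
except NotImplementedError:
    pass
```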
6.3.4:
checkpoint_obj: I concur with other respondents that this would be
better called checkpoint_class or checkpoint_class_name.
Yep, changed to checkpoint_class_name.
args: I think this has to be a list, but the doc doesn't say that
explicitly. Also, if there is only one arg, does it have to be
specified as a one-item list? Is an empty list OK?
It is not a list. If you specify a list, the list will be passed as one
single argument. Python allows you to specify as many arguments as you
want; the *args will take care of them.
OK, thanks. Forgot that *args is a pythonism that means an iterable
containing multiple positional args.
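To make the *args behavior concrete, here is a small sketch. The function
name and signature below are illustrative assumptions, not the document's
actual API; only the *args semantics are the point:

```python
def register_checkpoint(name, checkpoint_class_name, *args, **kwargs):
    """Hypothetical signature: *args collects the extra positional
    arguments into a tuple, however many (or few) are passed."""
    return args

# Three separate arguments arrive as a 3-item tuple.
assert register_checkpoint("tr", "TransferCheckpoint", 1, 2, 3) == (1, 2, 3)

# A list is NOT unpacked; it is passed through as one single argument.
assert register_checkpoint("tr", "TransferCheckpoint", [1, 2, 3]) == ([1, 2, 3],)

# No extra arguments is fine too: an empty tuple.
assert register_checkpoint("tr", "TransferCheckpoint") == ()
```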
checkpoint log level: This paragraph is confusing to me. Which
two log levels are different from each other? Do you mean the
application wants to use a different log level than specified in this
argument? Isn't it the application that calls register_checkpoint
when it sets up the engine? Why would a keyword arg be needed if the
log level is specified here already? Since each checkpoint is
registered separately, each can already have its own level.
Yes, each checkpoint can have its own log level.
The overall application/engine/checkpoints will have a log level.
If the application wants to run a checkpoint with a different log level,
the application can specify the checkpoint's log level at registration
time.
The log_level argument's description says "if the app wants to use a
different log level for the checkpoint, it can use a keyword argument
log_level." I see now that you mean *this* argument, not one of the
**kwargs. I think this can be clarified by saying "Use this argument if
the app wants to set a different log level for the checkpoint than was
specified at engine __init__ time."
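The override being clarified can be sketched as follows. The Engine class,
its attributes, and the registration bookkeeping below are all hypothetical
illustrations of the described behavior, not the document's implementation:

```python
import logging

class Engine:
    """Hypothetical engine with an application-wide default log level."""
    def __init__(self, default_log_level=logging.INFO):
        self.default_log_level = default_log_level
        self.checkpoints = {}

    def register_checkpoint(self, name, checkpoint_class_name, *args,
                            log_level=None, **kwargs):
        # Use the engine-wide level unless the application overrides it
        # for this particular checkpoint at registration time.
        level = log_level if log_level is not None else self.default_log_level
        self.checkpoints[name] = (checkpoint_class_name, level)

engine = Engine()
engine.register_checkpoint("transfer", "TransferCheckpoint")
engine.register_checkpoint("ict", "ICTCheckpoint", log_level=logging.DEBUG)

# "transfer" inherits the engine's default; "ict" got its own level.
assert engine.checkpoints["transfer"][1] == logging.INFO
assert engine.checkpoints["ict"][1] == logging.DEBUG
```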
6.6.1: cleanup_checkpoint(): I would change the name to
cleanup_checkpoints() since it cleans up all checkpoints that have
been executed in the current run, not just one.
Yes, will change.
7.1.1: So to be sure I understand, a checkpoint can be interactive
and can register or change subsequent checkpoints based on input, right?
No, a checkpoint is not supposed to be aware of any other checkpoints;
it is supposed to operate by itself. For interactive installers, the
application will run one or more checkpoints, then pause, interact with
the user, then continue executing the other checkpoints. The application
is the one that can register additional checkpoints based on input.
OK. Thanks.
7.2:
- I think the first sentence is trying to say that some limitations
exist because ZFS snapshots are used. Is this correct?
Yes
- In the first bullet, ZFS and data cache snapshots are mentioned.
Is the data cache snapshot also ZFS? If not, is it then not subject to
ZFS limitations? If it is ZFS, how can the second bullet be true?
Taking data cache snapshots doesn't require ZFS. For out-of-process
resume to work, data cache snapshots must be stored in a ZFS dataset.
So, all the data cache snapshots taken before the ZFS dataset is
available are not stored anywhere. Therefore, we cannot resume at those
checkpoints, because the engine will not know where to find the DOC
snapshot corresponding to that checkpoint.
OK. I like this explanation. I was getting confused because:
- the first bullet isn't really a limitation of ZFS.
- It also wasn't clear to me that the DOC snapshot would be put into the
ZFS snapshot if a ZFS snapshot was available, because that is covered in
a later section. (You may wish to move the implementation section (7.4)
before the limitations section.)
- The word "snapshot" is a loaded word, and when I see it I think of ZFS
snapshots.
7.3.2.2:
- Termites -> terminates (have you been talking to Sue, lately?)
:-) Obviously, should be terminates.
- Lead sentence talks of finding out which checkpoints are resumable,
but the first bullet talks about registering checkpoints, which is
different. Perhaps for the lead sentence you mean something like this:
"For an application that terminates, a subsequent invocation of the
application might want to resume. That application would have to do
one of the following to establish resumable checkpoints:"
Actually, the two bullets are the two steps an application must take to
find out which checkpoints are resumable. So, they don't do just one of
the steps; they must do both.
7.4:
- So the DataObjectCache snapshot will be stored in multiple places?
The third bullet says "... and place a copy of DOC snapshot there" so I
thought there were at least two copies.
It will be in /tmp (or /var/run or wherever) as well as stored as
part of the ZFS dataset? If there is no ZFS dataset, the DOC
snapshot in "/tmp" will be used?
If there's no ZFS dataset, the DOC snapshot will be stored in /var/run.
When a ZFS dataset is available, DOC snapshots will be stored in the
ZFS datasets. When the engine terminates, all DOC snapshots in /var/run
will be removed.
OK.
- Last PP: It says "the engine verifies the DOC is empty before
rolling back to a DOC snapshot." Wouldn't the normal case be that
the DOC isn't empty on resume? (See 7.4.1 #3.) If so, no rollbacks
would ever occur. I'm missing something here...
Why would the DOC not be empty? If an application wants to resume, that
request should happen immediately after all the checkpoints are
registered, before anything is executed. Registering checkpoints does
not put anything in the DOC. So, when the engine receives the request,
the DOC should be empty.
OK. I was getting the DOC confused with the latest DOC snapshot. Of
course when the engine resumes, before a DOC snapshot is loaded, the DOC
is empty.
So this check is that no DOC rollbacks can occur in the middle of a run.
7.5: resume_execute_checkpoint Description PP: Won't rollback be to
state *before* the named checkpoint is completed, rather than *after*?
Yes, changed.
10.1: I'm not sure a standardized machine is needed nor feasible.
(Eventually that machine would become obsolete and unavailable; then
what?) I suggest creating a program against which checkpoint times
can be standardized. For example, regardless of the machine the test
program runs on, let's say it will take 1000 units of time to run.
On the same machine it will take checkpoint One an amount of X units
of time to run. Then when you run on a faster machine, both test and
checkpoint programs will run proportionately faster. (I know I'm
oversimplifying this, and different things (e.g. CPU-intensive ops vs.
network-intensive ops) run faster or slower on different machines, but
this is to get an approximation. If some of all kinds of ops are built
into the test program, it will be better normalized across different
machines.)
Then each checkpoint could return its number of time units to perform
its task, and have a method inside it to return the % done.
So, this program will still need to be executed on a "standard"
machine, right?
No.
Let's say we invent a new unit of time, the "mut" (My Unit of Time :)
). Let's define that the test program takes 1000 muts to run. (This is
regardless of the test system it runs on. The muts will represent
shorter amounts of realtime when run on faster systems.) Suppose that
on your favorite computer, the test program takes 10 seconds to run.
Checkpoints would return their estimated run time in terms of muts.
Suppose you run your new checkpoint on the same system and takes 5
seconds. That translates to 500 muts. The checkpoint would be set up
to return to the engine 500 muts as its estimated run time.
If you were to run the test program and same checkpoint on some other
system, the ratio of durations may vary a little, but will be
approximately the same. By extension, this also means that the relative
durations (or duration ratios) of all checkpoints to each other will be
approximately the same, regardless of the system they are run on. Using
muts weights the times and normalizes them for any computer. This is
what we want for the purposes of calculating progress and time estimates.
Now suppose another brand-spankin-new checkpoint is created half-way
around the world by another person. The same test program is run and
takes 2 seconds to run. On that system, 1000 muts is 2 seconds. The
new checkpoint is run and takes 4 seconds; that checkpoint would be set
up to return 2000 muts. Its mut value would fit well with those of the
other existing checkpoints.
The beauty of this is its portability and maintainability. No need to
worry about a "standard" system going EOL and having to rerun
periodically on a different system. No need to acquire a "standard"
system to get a time estimate.
One could say that using muts sacrifices accuracy, but I don't think
that's true. For example, system A may have a very fast network and
slow CPU and system B may have a mediocre network and fast CPU, so
taking measurements for different checkpoints on different computers
will skew the numbers. However, different system characteristics
between a "standard" test system and the one being installed (where
progress numbers are needed) will also skew duration numbers. So
nothing is compromised by using a test program instead of a "standard"
system. Furthermore, if the test program includes significant amounts
of all operations (e.g. network intensive, CPU intensive, disk
intensive, etc), differences between systems will be accounted for to
some extent.
Then, we run the checkpoint on that same "standard" machine, and we
derive the unit based on the value we get from the program and the
value we get from the checkpoint?
This still doesn't solve the problem of the "standard" machine
becoming obsolete.
When our "standard" machine becomes obsolete, all the checkpoints can
be re-run on the new standard machine, and the get_progress_estimate()
function updated to return all the updated values.
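The mut arithmetic Jack walks through above can be sketched in a few lines.
The function name is a made-up illustration; only the numbers (a 1000-mut
test program, the 10 s/5 s machine, and the 2 s/4 s machine) come from his
examples:

```python
REFERENCE_MUTS = 1000  # the test program is defined to take 1000 muts

def muts_for_checkpoint(test_program_seconds, checkpoint_seconds):
    """Convert a checkpoint's measured duration into muts, using the
    test program's duration on the *same* machine as the reference."""
    seconds_per_mut = test_program_seconds / REFERENCE_MUTS
    return checkpoint_seconds / seconds_per_mut

# Jack's first machine: test program takes 10 s, checkpoint takes 5 s.
assert muts_for_checkpoint(10.0, 5.0) == 500.0

# The other developer's machine: test takes 2 s, new checkpoint takes 4 s.
assert muts_for_checkpoint(2.0, 4.0) == 2000.0
```

Because every checkpoint is measured against the test program on its own
machine, the resulting mut values are comparable without any shared
"standard" hardware.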
General: Tomorrow when I'm back in the office, I'll turn over my
hardcopy which has grammatical corrections, etc, since as my
officemate you are conveniently located :) .
Thanks again, Jack, for the review.
--Karen
Thanks,
Jack
_______________________________________________
caiman-discuss mailing list
[email protected]
http://mail.opensolaris.org/mailman/listinfo/caiman-discuss