Hi Darren,
Responses inline. Will there be a differential and/or v2 webrev available?
On 07/23/10 09:02 AM, Darren Kenny wrote:
Hi Keith,
Thanks for the detailed review...
On 07/22/10 01:04 AM, Keith Mitchell wrote:
Hi Darren,
I tried to avoid repeating Drew's comments, in particular his general
PEP 8 notes; I'll just add an addendum on use scenarios for "is None" (to
reiterate: use "if <variable>:" or "if <variable> is None:" depending on
context, but never "if <variable> == None:"). Additionally, I started
this review before your first round of edits - I'll try to remove
anything that was fixed in the intermediate update, but if I've left
something in that you've already fixed, or the line numbers are slightly
off, that's why.
Understood :)
As with my review of the logging service, I'm trying to be extra
detailed here since this is a CUD component that will be around for,
ideally, a long while.
General: Naming - Looking at the layout of the vendor-packages files
added, the file/package names seem rather verbose. I suggest the following:
Rename data_object_cache to simply "cache" or "data_cache".
Move all of cache/data_object.py into cache/__init__.py. This allows one
to import the cache using simply:
"import osol_install.cache"
which would be preferable to "import
osol_install.data_object_cache.data_object_cache", which is not only
redundant, but causes Python import problems in some cases -
specifically, statements along the lines of "import
data_object_cache.xyz" can confuse the import mechanism (does it refer
to osol_install.doc.xyz, or osol_install.doc.doc.xyz?).
I agree with you somewhat - it has become a bit redundant. I'm not sure about
the correct naming, though.
I'm also not sure about putting everything in __init__.py - I've not seen any
precedent for doing that with any other modules. That doesn't rule it out, but
I'm wondering then if I should just put all three files in there, not just
DataObject.py.
There's precedent in several modules in the Python standard library. For
example, see /usr/lib/python2.6/logging/__init__.py (also the json,
ctypes, curses modules, to name a few more).
First off, if I rename the directory, then one option is:
osol_install.cache.data_object
osol_install.cache.data_object_cache  <- double-cache :-(
osol_install.cache.data_object_dict
In reality what I would personally prefer would be:
from osol_install.data_cache import DataObject, DataObjectCache, \
    DataObjectDict
So I suppose to achieve this I would have to put everything into __init__.py,
I just am not a fan of the "one big file" for all sources - it's harder to
maintain, and gets messy over time.
How would people feel about doing this in general?
Yeah, I agree that "one big file" is not a terribly clean approach.
Some options to consider:
solaris_install.data_object # For DataObject
solaris_install.data_object.cache
solaris_install.data_object.dict
Or,
solaris_install.cache # DataObject AND DataObjectCache in __init__
solaris_install.cache.dict
solaris_install.cache.some_other_subclass # Other subclasses of DataObject get separate files
I like the first one better, because the names read really well when
compared to the names of the classes that will be getting imported. Of
course, open to other options as well.
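As an aside, supporting that import style doesn't require one big file -
the package's __init__.py can simply re-export from submodules. A minimal
sketch, with package and module names purely illustrative:

    # solaris_install/data_object/__init__.py
    # DataObject itself is defined here; the rest is just re-exported:
    from solaris_install.data_object.cache import DataObjectCache
    from solaris_install.data_object.dict import DataObjectDict

so consumers can still write "from solaris_install.data_object import
DataObject, DataObjectCache" while each class keeps its own source file.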
Additionally, regardless of if/how the names change, please use absolute
imports: "import osol_install.data_object_cache.data_object as
data_object" rather than "import data_object" for example.
I don't see many instances of this - in general I would have statements like:
from data_object import ...
which aren't fully qualified - these I've fixed.
There were only a few. Thanks.
usr/src/lib/Makefile:
Copyright needs updating.
52: Please alphabetize.
Done
prototype_com:
No need to bother updating
data_object.py:
36, 41, 46: These comments don't add much value
Removed.
37-47: It would be of value to define a common base exception class for
the DOC, and have these subclass that.
Yes, I think that makes sense - so I've created DataObjectException as a
base-class.
194: Nit: I think this should simply be a method, has_children(), not a
property. It "feels" more like a computed value than an attribute.
Honestly I feel that there is little to be gained by changing this right now,
but if there is consensus then I will make the change - I suppose this is more
of a style thing, and I feel the attribute style is more natural than forcing
the addition of a () to the end.
Yes, I think it's probably fine either way.
207: Having the default for "class_type" be DataObject (or have the
function set class_type = DataObject if it is passed in as None) would
simplify some of the logic on line 250.
Defining the parameter here with a default of "class_type=DataObject"
results in an error!
i.e. the following fails:
class MyClass(object):
    def test_default(value=MyClass):  # NameError: 'MyClass' isn't bound yet
        print "Hello"                 # when the default is evaluated
So I've gone with the setting it later in the code and simplifying the check.
I figured it might not work directly.
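For reference, the usual idiom looks like this - a sketch, with the method
body purely illustrative:

    def get_children(self, name=None, class_type=None):
        # The class name isn't bound when the 'def' line is evaluated,
        # so it can't be a default value - but by call time it is:
        if class_type is None:
            class_type = DataObject
        return [c for c in self._children if isinstance(c, class_type)]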
242: It'd be simpler to just assert that the length of the returned list
is > 0. However, I'd argue that it's not an error to look for
something by name or class and not find any - I think returning an empty
list in those scenarios would be appropriate.
I've done the assertion based on length.
I don't mind whether there's an exception or an empty list returned here -
does anyone else wish to comment?
243-251: Storing the children in a dictionary (where key = name, and
value = [list of DataObjects]) rather than a flat list would speed up
the look-ups by name. (Or, storing a set of references in a dictionary,
if global ordering is important - however, I wonder how important
ordering is for anything other than DataObjects grouped by name?)
Ordering is important - specifically for checkpoints, but possibly others
too; this is why we allow for insert before/after. Also, currently
there is no requirement that the name be unique, which would rule out the
use of a dictionary.
Ah, you're right - didn't quite think that through all the way. The lack
of unique names doesn't rule it out (again, the value of the dictionary
key-value pair could be a list of all objects stored with that key), but
checkpoints are keyed by unique names rather than listed under the same key.
Makes me wish for the OrderedDict data structure in Python 2.7.
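To sketch what I meant by the dict-of-lists index (names hypothetical; a
real version would also have to maintain the index in the before/after
insertion paths and on delete):

    class DataObject(object):
        def __init__(self, name):
            self.name = name
            self._children = []    # flat list: preserves global ordering
            self._name_index = {}  # name -> [children], fast name lookup

        def insert_child(self, child):
            self._children.append(child)
            self._name_index.setdefault(child.name, []).append(child)

        def get_children_by_name(self, name):
            # Return a copy so callers can't corrupt the index.
            return list(self._name_index.get(name, []))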
259: If this is kept (and for future reference), the trailing '\'
characters are unnecessary (within parentheses), and the casts to
string on line 260 are also unnecessary (the %s conversion automatically
calls str() on the object to be formatted). I'd also re-break it more
like this:
    raise ObjectNotFound("No matching objects found: name = '%s'"
                         " and class_type = '%s'" %
                         (name, class_type))
Changed.
Also changed '% \' to not use the ' \' when not needed.
264: It might be worth it to wrap a call to get_children(name,
class_type) and then just return the 1st item from that list (if not,
simply return child - no need to have a "found_child" variable).
Additionally, it seems more likely that a call to get_first_child would
raise an ObjectNotFound exception than a call to get_children. Finally,
it could be argued that not providing a class or name parameter to this
function is not particularly valuable (depending on your thoughts on
whether ordering really matters for anything other than items with the
same name). If only name is pertinent for ordering, and the data layout
is reorganized to be a dictionary, then this code becomes nearly
trivial: 'return self._children[name][0]'
A call to get_children is potentially more expensive for locating a single
object that matches the criteria - which is why I didn't go that route.
As mentioned before ordering is important, and not just by name.
You're correct, the get_children call is far more expensive. I still
wonder if there's a way to combine the logic - would it be worth the
effort to have an optional "count" keyword arg to get_children, that
returns the first "count" children?
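Something along these lines - a sketch, with names and the exact exception
behaviour from the rest of this review elided:

    def get_children(self, name=None, class_type=None, max_count=None):
        results = []
        for child in self._children:
            if name is not None and child.name != name:
                continue
            if class_type is not None and \
               not isinstance(child, class_type):
                continue
            results.append(child)
            if max_count is not None and len(results) == max_count:
                break  # stop early - no full scan for "first match"
        return results

    def get_first_child(self, name=None, class_type=None):
        children = self.get_children(name, class_type, max_count=1)
        return children[0] if children else None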
306: Again, this is redundant code compared to get_children. In
particular, get_children could simply be a call to get_descendants(...,
max_depth=1)
I did as you suggested, and re-used get_descendants, but had to special case
the condition where no parameters are passed, which is fine in get_children,
but not in get_descendants.
Seems reasonable.
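Roughly, the delegation would look like this (a sketch; the real signature
may differ):

    def get_children(self, name=None, class_type=None):
        # "No criteria" is fine for children but not for descendants,
        # hence the special case before delegating:
        if name is None and class_type is None:
            return list(self._children)
        return self.get_descendants(name=name, class_type=class_type,
                                    max_depth=1)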
410: LOGGER.exception dumps a stack trace, which may or may not be
desired. I think LOGGER.error() would be better here (something higher
up can determine when/where to dump the stack trace - to the log, to the
console, etc.)
Ah, didn't realise that, changed to use .error()...
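The distinction in miniature (an illustrative snippet):

    import logging
    LOGGER = logging.getLogger("example")  # logger name illustrative

    try:
        raise ValueError("boom")
    except ValueError:
        LOGGER.error("operation failed")       # message only
        # LOGGER.exception("operation failed") # message + stack trace,
        #                                      # also at ERROR level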
414: Function should check that caller doesn't pass in both before and
after.
Done + wrote unit test.
439: Set it to "None" to indicate a "not yet determined value". Negative
values are valid parameters to insert, so setting to "-1" could cause
trouble later. (Alternatively, default to 0, and remove lines 467-468.)
Set to 0, and removed the lines.
442-444: Unnecessary - let the __check_object_type call catch this
(NoneType is not an instance of DataObject, so __check_object_type will
fail).
Done.
447, 456: Not strictly needed - if they're not DataObject sub-classes,
they won't be children, so the calls at 449/458 will fail.
Agreed, but I think that it's more useful to report an incorrect data type
error here than to just fail on a "not found" error.
Fair enough.
470-481: I'd recommend having separate functions for "insert one" and
"insert all items in an iterable". (See list.append vs. list.extend). It
provides a cleaner disjunction and allows for one to create and insert
"list like" DataObjects.
Do you mean that I should have two DataObject.insert_children methods - maybe
insert_child and insert_children?
Do you really think it's likely for there to be a DataObject sub-class that
would also be a list (when I say list here, I mean would match
isinstance(.., list) - not just allow [] type access).
Consider the reverse - what happens when I pass in a tuple of
DataObjects? Or any iterable that isn't strictly a subclass of list?
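One possible shape for the append/extend-style split - a sketch, with the
before/after mechanics simplified:

    def insert_child(self, child, before=None, after=None):
        if not isinstance(child, DataObject):
            raise TypeError("child must be a DataObject")
        if before is not None and after is not None:
            raise ValueError("cannot give both before and after")
        if before is not None:
            idx = self._children.index(before)
        elif after is not None:
            idx = self._children.index(after) + 1
        else:
            idx = len(self._children)
        self._children.insert(idx, child)
        child._parent = self

    def insert_children(self, children, before=None, after=None):
        # Any iterable of DataObjects works: list, tuple, generator...
        for child in children:
            self.insert_child(child, before=before, after=after)
            before, after = None, child  # keeps the group in given order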
543, 569: This seem in conflict. Is it an error to call delete when
there's nothing to delete, or isn't it? I don't see a problem with
calling delete and having nothing to delete.
They aren't really in conflict. The first one deletes all children, while the
second deletes children matching specific criteria - for this one I felt it
was useful for the consumer to know if anything matched or not. This is
especially useful for a developer using the API, who may have a typo or
logic error.
It's the Python equivalent of a boolean_t return in C for many methods...
The flip side is that now it's a programming error to forget to check to
make sure there are matching items before calling delete.
For example, say you're looking at the code for setting up the partition
targets when the installer wants to use the whole disk (i.e., blow away
everything that's already there; either for UI/display purposes or
during actual installation). With this check, I now have to know the
current state of the disk - are there already partitions on it before I
call disk.delete_children(type=Partition)?
Essentially, every call to delete_children will require a prior call to
get_children to first check that it's safe to perform the action, or be
encased in a try/except block.
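i.e. every such call ends up looking like this (illustrative):

    try:
        disk.delete_children(class_type=Partition)
    except ObjectNotFound:
        pass  # an empty disk is not actually an error here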
I'd also expect programmers to be passing in classes and names via
variables/constants (for which a typo is unlikely). Hard-coding the
string key of an object stored in the DOC at scattered points throughout
the code is a maintenance issue.
With all that said, the more I think about it, the more I feel it could
go either way. I will add the caveat that the logic will need to be
re-worked if it's left as an error, as an ObjectNotFound error won't be
raised if there are no children (lines 531-533), which is inconsistent.
586: This only removes "self" from local scope - so it's not really needed.
Removed.
591, 606: I don't believe we ever reached resolution on why this
couldn't be implemented via the __deepcopy__ and __copy__ special
methods. It seems rather unintuitive to explicitly avoid them - my first
thought when needing to copy/deepcopy a Python object is going to be to
"import copy; copy.deepcopy(my_object)"
See attached tree_class.py to see a tree-like object that takes
advantage of __setstate__ and __getstate__ to copy the right portions of
the tree and fix things as appropriate. (I didn't verify the shallow
copy specifically, but it shouldn't require much additional work, if any).
In looking at your response to Drew's question in this area, I think
this resolves 1 & 2. In regards to 3, any subclass implementors can skip
overriding deepcopy - and if they do override it, like any subclass
overriding a parent definition, it needs to be aware of whether or not
it should call super().__deepcopy__() as part of the implementation -
that's a common problem with any OO subclass. And I don't believe it
will interfere with the pickling, either.
Yourself and Drew will be glad to know that I finally managed to resolve this
to my satisfaction, and yours!
Thanks for the sample - it helped a lot (with me still relatively new to
Python) - I have made this work, and enhanced the unit tests to verify it
too.
Basically, with a minor tweak, your code worked for the deepcopy, but not
the copy - so I had to provide a __copy__ implementation to solve the copy
one.
(The minor tweak was to copy the state in __getstate__ rather than modify
the passed-in dict, which seemed to modify the original.)
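For the archives, the shape of that fix - a sketch, with attribute names
approximate:

    def __getstate__(self):
        # Copy, don't mutate - modifying self.__dict__ in place would
        # also detach the *original* object from its parent.
        state = self.__dict__.copy()
        state['_parent'] = None     # a copied subtree starts detached
        return state

    def __setstate__(self, state):
        self.__dict__.update(state)
        for child in self._children:
            child._parent = self    # re-point children at the new copy

This covers the deepcopy side (copy.deepcopy deep-copies the state dict
before handing it to __setstate__); as noted, shallow copy still needs its
own __copy__.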
Now I still have the utility methods copy and deepcopy - do you want me to
remove these? I feel they are still useful in the API, but maybe others feel
they are not?
I'd prefer they get removed.
701-702: In general, I'd discourage having heavyweight functions map to
properties. Attributes and properties should be lightweight and quick to
access. For example, on line 714 you call self.xml_tree, and assign it
to xml. Then, 2 lines later, you use self.xml_tree again, instead of
referencing the xml variable - causing the entire tree to be
regenerated. Changing this to a function would make it obvious that a
caller should try to avoid retrieving the value multiple times in a row.
Fixed the duplicate call.
Also changed the API to be get_xml_tree() rather than the property.
720: Rather than using string concatenation, which is slow, add all
items to a list, and then use "\n".join(the_list). Although since this
recurses, the performance gain may be minimal.
I don't really expect this to be used that often - as such I wouldn't expect
it to need to be very performant. Since most of the work is done using
recursion, the most likely performance issue is depth rather than breadth,
which the list wouldn't really help with here.
I agree, ignore it.
729, 738: This use of repr() breaks convention. See:
http://docs.python.org/reference/datamodel.html#object.__repr__
http://docs.python.org/reference/datamodel.html#object.__str__
I don't particularly see the value of defining __repr__ for the DOC.
I agree the __repr__() doesn't provide much here.
I've removed the definition of __repr__() here - but kept the __str__()
implementation; people can override __repr__() to change what's output in
the __str__() formatted output, and I've added this information to the
comment of __str__().
I suppose that'll be fine. It's unconventional, but it's more for
internal / debug use anyway.
741: The Python collections define class hierarchies such that mutable
classes inherit from immutable classes. I think we might want the same.
In this case, that means having "DataObjectNoChildManipulation" be the
base class, and DataObject be a subclass of that. The bonus there is
that then it's simply a matter of *not* defining
insert_children/delete_children for the base class, and defining it for
the subclass. That would make it more transparent for users of the
classes as well (it's far easier to determine that a method simply
doesn't exist for an object than it is to determine that the method
exists, but will always raise an exception).
Agreed, that makes more sense. I've updated to have a hierarchy that is like:
DataObjectBase
 +-- DataObject
 +-- DataObjectCache
 +-- DataObjectDict
I didn't like the name NoChildManipulation ;)
The only difference is that the ability to add/delete children is missing in
DataObjectBase right now - but who knows, over time that might not be all it
lacks.
This doesn't change the expected use by everyone else, as can be seen from
the minimal changes in the unit tests, where all I had to change was the
exception to look for (AttributeError instead of OperationNotSupported, which
isn't used anymore, so I removed it too).
Sounds great.
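For the archives, the hierarchy in code - a sketch, method signatures
approximate:

    class DataObjectBase(object):
        '''Read-only node: navigation, XML support, etc.'''
        def __init__(self, name):
            self.name = name
            self._parent = None
            self._children = []

    class DataObject(DataObjectBase):
        '''Adds manipulation simply by *defining* the extra methods.'''
        def insert_children(self, children, before=None, after=None):
            pass  # real insertion logic goes here

    # A base-only node now fails naturally, with no special-casing:
    #   DataObjectBase("x").insert_children([])  -> AttributeError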
data_object_cache.py:
128: Seems like it would be simpler to just check "if __instance is None"
Not changing for now, since this will be removed eventually...
144, 146: Should probably be "single-underscored" private variables. The
name-mangling is more to "hide" from sub-classes, and, should there ever
be a subclass of DataObjectCache, I imagine it'd want to access the
persistent and volatile trees. Ditto for __instance -> We wouldn't want
a subclass creating two separate name-mangled instances.
Agreed, changed to single underscore.
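The subclass pitfall in miniature (class names illustrative):

    class DataObjectCache(object):
        def __init__(self):
            self.__tree = []    # stored as self._DataObjectCache__tree

    class MyCache(DataObjectCache):
        def peek(self):
            return self.__tree  # looks up self._MyCache__tree:
                                # AttributeError!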
148: This makes it obvious as to why the class hierarchy is the way it
is. By the same token, it also makes it obvious that it's not really
possible to enforce "NoChildManipulation" as such. This block of code
could just as easily be "self._children = [self.__persistent_tree,
self.__volatile_tree]", if following the above suggestion for adjusting
the class hierarchy.
Don't forget that to maintain the tree I need to set their parent values ;)
Changed...
177: I'd have expected this to be the "delete()" method.
Nope, the important distinction here is that it is seen as clearing the whole
tree, apart from the DOC's direct children - delete would be different...
Ah, I see.
190: This is a good candidate for a property ("self.empty")
I've made it a property, but I prefer the "is_empty" to just empty.
Ok.
229: The isinstance check is not needed - all Python objects are
instances of "object".
I didn't realise that everything was an object - that even primitive types
like 'int' were objects - but it turns out they are... ;)
Python surprises me constantly, yet all of its surprises make sense...
238: I think we'd want to close regardless. I'm not sure of the advantage
of leaving this file object open for writing. Closing it prevents the
caller from trying to write garbage to the end of our pickled DOC.
I thought the same originally, but in the case of buffers (e.g. StringIO),
doing a close will free the memory in the buffer - which isn't ideal, but a
fact of the implementation.
...except for that surprise.
Hence the reason that I put this in - but I also feel it's more correct to
assume that if it's not a file we open, then we shouldn't close it, and leave
it to the opener of the file to do it. For example in C, if you pass a
file-descriptor to a function you don't expect it to close it...
Good point.
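A sketch of that ownership rule (method body simplified):

    import cPickle as pickle

    def take_snapshot(self, output):
        # 'output' may be a path or an open file-like object (StringIO...)
        opened_here = False
        if isinstance(output, basestring):
            output = open(output, 'wb')
            opened_here = True
        try:
            pickle.dump(self, output)
        finally:
            if opened_here:
                output.close()  # close only what we opened; a caller's
                                # buffer stays usable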
241: Nit: I find this method name a touch verbose. "load" seems fine to
me, but it's a mere suggestion and completely up to you.
I understand what you're saying, but I feel that the "from_snapshot" is needed
to ensure that someone doesn't incorrectly assume that it can be used to load
from XML.
Ok.
293: In opposition to Drew's complaints about "cls", PEP8 suggests using
"cls" as the name for variables representing classes:
- If your public attribute name collides with a reserved keyword, append
a single trailing underscore to your attribute name. This is
preferable to an abbreviation or corrupted spelling. (However,
notwithstanding this rule, 'cls' is the preferred spelling for any
variable or argument which is known to be a class, especially the
first argument to a class method.)
And "requires" it for classmethods:
Always use 'cls' for the first argument to class methods.
(Sorry Drew!)
Got it, and changed all relevant locations.
295: "del" doesn't actually "free" anything here. It simply removes that
one reference to that object from the scope. As an example, consider:
>>> a = object()
>>> b = a
>>> a
<object object at 0x8080480>
>>> b
<object object at 0x8080480>
>>> del a
>>> a
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
NameError: name 'a' is not defined
>>> b
<object object at 0x8080480>
To truly free an object, *all* references to it need to be "lost," which
in this case will happen automatically once the function returns.
Yep - it's historical code I forgot to remove.
338: Recommend appending to a list and then using "\n".join(the_list)
Done.
data_object_dict.py:
69-74: The ValueError thrown by etree.Element provides an almost
identical error string; as such, this function seems unnecessary -
especially since it seems to be used only to protect against sub-classes.
It is very necessary to protect against sub-classes here - since that's the
only way that you can change the tag and sub-tag correctly. Hence I still feel
this is necessary, even if all it does is call the etree.Element() method.
If you look in the unit tests for this class you will see varied sub-classes,
some with invalid tags :)
I'm of the opinion that if a subclass wants to shoot itself in the foot,
it should be allowed to.
This method for checking has problems, in that the check is done for
each instance of the class created. That's rather heavyweight for
something that really doesn't need to happen more than once. Since it
needs to be validated at runtime, and an object needs to be created for
it to run the check, I don't see anything wrong with letting the
developer discover that when they try to run the to_xml method.
If this does stay in, I'd like to see the validate_tag method be
single-underscored, so that subclasses can properly override it if they
need to.
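One way to keep the check but pay for it only once per class - a sketch;
the cache attribute, TAG_NAME, and the base class here are all hypothetical
names:

    from lxml import etree

    class DataObjectDict(object):     # the real class subclasses DataObject
        TAG_NAME = "data_dictionary"  # subclasses override this
        _tag_checked = {}             # per-class validation cache

        def _validate_tag(self):
            cls = type(self)
            if cls not in DataObjectDict._tag_checked:
                etree.Element(cls.TAG_NAME)  # ValueError on a bad tag
                DataObjectDict._tag_checked[cls] = True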
104: There's no need for generate_xml to be a property; a simple
attribute will suffice. No need to assert that it's a bool - any object
in Python can be evaluated in a boolean context.
While what you say about any object being allowed to be evaluated in a boolean
context is true, I don't agree that we should just make it an attribute - I do
think that it's worth ensuring that it is a boolean value for a cleaner API.
If this is indeed the preference - to allow anything to be stored - then I'm
happy to change it to that, but I feel this should be part of a 'style' guide
to be consistent throughout CUD.
It is my preference - if only because I'd rather not have to check to
make sure that a variable I'm passing in is in fact a strict bool (and
not "None", for example, which could easily slip in as a "False" value).
Python is not a strictly typed language; adding overly strict type
enforcement, particularly in the case of bool, will probably cause more
problems than it solves. Most likely what would end up happening is all
consumers start casting generate_xml to a bool prior to calling:
    DataObjectDict(the_name, a_dictionary,
                   generate_xml=bool(some_val_that_might_not_be_a_bool))
In short, there's no real gain from forcing it to be a strict bool.
174: This "del" call isn't needed.
Done
Tests:
I'm going to be a touch more lenient in terms of style/implementation
here. For reference, can you generate a code coverage report and include
it in your reply to this email? (Your tests seem to run fine via the
slim_test runner in the cud_dc gate, so you should be able to just run
that with a --with-cover option to get the results. Optionally, check
"slim_test -h" for info on how to generate an HTML coverage report, and
include *that* instead of the text version).
Summary coverage HTML attached :)
The coverage scores look great - though in the future, perhaps they
could just be added to the webrev before uploading. Especially since a
few of the files are missing.
test_data_object_cache_children.py:
39 1/2, elsewhere in this file and others: To eliminate odd issues
between tests, you should reset the DOC instance as part of the tearDown
method, by setting DataObjectCache._DataObjectCache__instance = None
While that is probably true - I'm hoping to remove the singleton code soon, so
I don't see the need to do this right now.
Even better.
53-57, elsewhere: Use "self.assertRaises(<Error>, <function>, [arg1 [,
arg2, ...]])" instead of a try/except block
Didn't notice this assertion mechanism! Changed to use this where possible...
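Side by side, with the method and exception names from this review (usage
illustrative):

    # Before: several lines of try/except...
    try:
        self.doc.delete_children(name="no_such_child")
        self.fail("expected ObjectNotFound")
    except ObjectNotFound:
        pass

    # ...after: one line (positional and keyword args pass through):
    self.assertRaises(ObjectNotFound, self.doc.delete_children,
                      name="no_such_child")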
35: Could simply use the "object" class.
Assuming this is the file data_object_cache_registration.py, then I agree -
removed it.
80: Typo: Priority should be 100 here, based on test name / failure message.
Yep, fixed.
EOF: Missing test-cases:
register_class: Assigning multiple classes the same priority level
find_class_to_handle: Need a case to ensure that if two classes can both
handle something, the one with the higher priority is chosen. (Create 2
classes whose "can_handle" methods always return True to test this in the
simplest fashion possible.)
find_class_to_handle: Case to ensure that if no classes are found, None
is returned.
All done :)
test_data_object_cache_snapshots.py:
I'd imagine that only one test will truly require a self.temp_dir/file -
there should only be a need to explicitly test against the filesystem
once. For the rest, comparison of StringIO objects should suffice.
There is one that tests a successful write, and one that tests a successful
read - but other than that, nothing else uses the file system.
I think I noticed that and simply forgot to remove this comment, sorry.
I'd suggest moving the DOC generation to a separate function that can be
called explicitly by tests that need a complex DOC (it shouldn't be
strictly required for more than one or two tests). No need to store the
self.child_* variables either (and thus, no need to clean them up) since
none of the tests reference them. This function could be importable by
any test files that need to generate a reasonable complex cache.
The DOCs created in each of the test files are not exactly the same - some
have more duplicate names, etc. Some of this is done.
And while in THIS file you are correct in that there is no reference to the
self.child* values, in most of the others, there is.
I'll have a think about this, but I'm not convinced that a common file will
work in all cases.
If it can be done easily, great, if not, no point in spending too much
time making unit test code excessively modular.
Most or all of the tests here should both take_snapshot() and
load_from_snapshot(), and compare the DOC (persistent) strings from
before and after.
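i.e. roughly this round-trip in each test (a sketch - the exact accessor
for the persistent tree's string form is a guess, and load_from_snapshot's
signature may differ):

    import StringIO

    buf = StringIO.StringIO()
    before = str(self.doc.persistent)  # accessor name hypothetical
    self.doc.take_snapshot(buf)

    buf.seek(0)                        # rewind before reading it back
    self.doc.load_from_snapshot(buf)
    self.assertEqual(before, str(self.doc.persistent))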
229: There's no guarantee that I haven't created such a directory on my
machine. This test can probably just be removed - it's not really
testing DOC as much as it's testing the Python 'open()' built-in.
I think it's still worth testing that we firstly recognise it as a string,
then try to use it as a file, which then fails.
To ensure that it never exists, I'm now prefixing it with the value of
self.temp_dir, which is created using mkdtemp() - that should guarantee the
path doesn't exist.
Sounds good.
241: In essence, this tests the same thing as 234.
True removed one...
test_data_object_cache_xml.py:
235: Typo: geneerates -> generates
Fixed.
test_data_object_deletion.py:
95: Typo in fail message (should be child_2)
Fixed.
114-115: I'd check for assertNotEqual(child, self.child_3) to be
slightly more precise here.
I do...
I wonder what I was thinking, oops.
I've run out of energy to read through test code, and the tests so far
look really good, so I'll wait for the coverage report to finish looking.
Me too - it took the best part of 2 days to get things done and formulate a
response!
I may not get around to reviewing the remaining test_* files in detail;
but from everything I've seen so far (including the coverage reports), I
expect they'll be great. Thanks for setting a great example with the
unit tests!
- Keith
Phew,
Thanks,
Darren.