Hi Keith,
Thanks for the detailed review...
On 07/22/10 01:04 AM, Keith Mitchell wrote:
> Hi Darren,
>
> I tried to avoid repeating Drew's comments. In addition to his general
> PEP 8 notes, and my addendum on use scenarios for "is None" (to
> reiterate: use "if <variable>:" or "if <var> is None:" depending on
> context, but never "if <variable> == None:"). Additionally, I started
> this review before your first round of edits - I'll try to remove
> anything that was fixed in the intermediate update, but if I've left
> something in that you've already fixed, or the line numbers are slightly
> off, that's why.
Understood :)
>
> As with my review of the logging service, I'm trying to be extra
> detailed here since this is a CUD component that will be around for,
> ideally, a long while.
>
> General: Naming - Looking at the layout of the vendor-packages files
> added, the file/package names seem rather verbose. I suggest the following:
>
> Rename data_object_cache to simply "cache" or "data_cache".
> Move all of cache/data_object.py into cache/__init__.py. This allows one
> to import the cache using simply:
> "import osol_install.cache"
> which would be preferably to "import
> osol_install.data_object_cache.data_object_cache" which is not only
> redundant, but causes python import problems in some cases -
> specifically, statements along the lines of "import
> data_object_cache.xyz" can confuse the import mechanism (does it refer
> to osol_install.doc.xyz, or osol_install.doc.doc.xyz?).
I agree with you somewhat - it has become a bit redundant, though I'm not sure
about the correct naming.
I'm also not sure about putting everything in __init__.py - I've not seen a
precedent for that in any other modules, though that doesn't rule it out. I'm
wondering, then, whether I should put all three files in there, not just
DataObject.py.
First off, if I rename the directory, then one option is:
osol_install.cache.data_object
osol_install.cache.data_object_cache <- double-cache :-(
osol_install.cache.data_object_dict
In reality what I would personally prefer would be:
from osol_install.data_cache import DataObject, DataObjectCache, \
    DataObjectDict
So I suppose to achieve this I would have to put everything into __init__.py.
I'm just not a fan of one big file for all sources - it's harder to maintain,
and gets messy over time.
How would people feel about doing this in general?
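For what it's worth, there is a middle ground: keep each class in its own module and have the package's __init__.py re-export the public names. A minimal sketch (the package and module names here are hypothetical stand-ins for data_cache/data_object.py), built in a temp directory so it runs standalone:

```python
import os
import sys
import tempfile

# Create a throwaway package: data_cache/ with one submodule and an
# __init__.py that re-exports the submodule's public class.
pkg_root = tempfile.mkdtemp()
pkg = os.path.join(pkg_root, "data_cache")
os.mkdir(pkg)

with open(os.path.join(pkg, "data_object.py"), "w") as f:
    f.write("class DataObject(object):\n    pass\n")

with open(os.path.join(pkg, "__init__.py"), "w") as f:
    # Re-export: the implementation stays in its own file, but
    # consumers can import straight from the package.
    f.write("from data_cache.data_object import DataObject\n")

sys.path.insert(0, pkg_root)
from data_cache import DataObject  # single, flat import for consumers

# The class still lives in its own module, avoiding "one big file".
print(DataObject.__module__)
```

This gives the `from osol_install.data_cache import DataObject, ...` style without folding all three source files into __init__.py.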
>
> Additionally, regardless of if/how the names change, please use absolute
> imports: "import osol_install.data_object_cache.data_object as
> data_object" rather than "import data_object" for example.
I don't see many instances of this - in general I have statements like:
from data_object import ...
which aren't fully qualified - these I've fixed.
>
> usr/src/lib/Makefile:
> Copyright needs updating.
> 52: Please alphabetize.
Done
>
> prototype_com:
> No need to bother updating
>
> data_object.py:
> 36, 41, 46: These comments don't add much value
Removed.
>
> 37-47: It would be of value to define a common base exception class for
> the DOC, and have these subclass that.
Yes, I think that makes sense - so I've created DataObjectException as a
base-class.
>
> 194: Nit: I think this should simply be a method, has_children(), not a
> property. It "feels" more like a computed value rather than an attribute.
Honestly, I feel there is little to be gained by changing this right now, but
if there is consensus then I will make the change. I suppose this is more of a
style thing, and I feel the attribute style is more natural than forcing the
addition of a () to the end.
>
> 207: Having the default for "class_type" be DataObject (or have the
> function set class_type = DataObject if it is passed in as None) would
> simplify some of the logic on line 250.
Defining the parameter with a default of "class_type=DataObject" here results
in an error - i.e. the following fails with a NameError:

class MyClass(object):
    def test_default(value=MyClass):
        print "Hello"

So I've gone with setting it later in the code and simplifying the check.
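The reason is that default values are evaluated when the `def` statement runs, which happens while the class body is still executing - before the class name is bound. A small sketch of both the failure and the default-to-None workaround (class and method names here are simplified stand-ins):

```python
caught = None
try:
    class MyClass(object):
        # The default is evaluated at "def" time, while the class body
        # is still running, so MyClass is not yet defined.
        def bad_default(self, value=MyClass):
            return value
except NameError as exc:
    caught = str(exc)

print("NameError:", caught)

# The usual workaround: default to None, resolve inside the body, where
# the class name is guaranteed to exist by call time.
class DataObject(object):
    def get_children(self, class_type=None):
        if class_type is None:
            class_type = DataObject
        return class_type

print(DataObject().get_children() is DataObject)
```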
>
> 242: It'd be simpler to just assert that the length of the returned list
> were > 0. However, I'd argue that it's not an error to look for
> something by name or class and not find any - I think returning an empty
> list in those scenarios would be appropriate.
Done the assertion based on length.
I don't mind whether there's an exception or just return an empty-list here,
does anyone else wish to comment?
>
> 243-251: Storing the children in a dictionary (where key = name, and
> value = [list of DataObjects]) rather than a flat list would speed up
> the look-ups by name. (Or, storing a set of references in a dictionary,
> if global ordering is important - however, I wonder how important
> ordering is for anything other than DataObjects grouped by name?)
Ordering is important - specifically for checkpoints, but possibly others too;
this is why we allow for insert before/after. Also, there is currently no
requirement that the name be unique, which would rule out the use of a
dictionary.
>
> 259: If this is kept (and for future reference), the trailing '\'
> characters are unnecessary (within a parentheses), and the casts to
> string on line 260 are also unnecessary (the %s action automatically
> calls str() on the object to be formatted). I'd also re-break it more
> like this:
>     raise ObjectNotFound("No matching objects found: name = '%s'"
>                          " and class_type = '%s'" %
>                          (name, class_type))
Changed.
Also changed '% \' to not use the ' \' when not needed.
>
> 264: It might be worth it to wrap a call to get_children(name,
> class_type) and then just return the 1st item from that list (if not,
> simply return child - no need to have a "found_child" variable).
> Additionally, it seems more likely that a call to get_first_child would
> raise an ObjectNotFound exception than a call to get_children. Finally,
> it could be argued that not providing a class or name parameter to this
> function is not particularly valuable (depending on your thoughts on
> whether ordering really matters for anything other than items with the
> same name). If only name is pertinent for ordering, and the data layout
> is reorganized to be a dictionary, then this code becomes nearly
> trivial: 'return self._children[name][0]'
A call to get_children is potentially more expensive to locate an object that
matches criteria - which is why I didn't go that route.
As mentioned before ordering is important, and not just by name.
>
> 306: Again, this is redundant code compared to get_children. In
> particular, get_children could simply be a call to get_descendants(...,
> max_depth=1)
I did as you suggested and re-used get_descendants, but had to special-case
the condition where no parameters are passed, which is fine in get_children
but not in get_descendants.
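The shape of that re-use can be sketched as follows (a toy Node class with made-up internals, not the real DataObject implementation):

```python
class Node(object):
    """Minimal tree node to illustrate get_children via get_descendants."""

    def __init__(self, name, children=()):
        self.name = name
        self._children = list(children)

    def get_descendants(self, max_depth=None):
        # Collect children depth-first; max_depth=None means unbounded.
        found = []
        for child in self._children:
            found.append(child)
            if max_depth is None or max_depth > 1:
                next_depth = None if max_depth is None else max_depth - 1
                found.extend(child.get_descendants(max_depth=next_depth))
        return found

    def get_children(self):
        # Re-use get_descendants, limited to the immediate level.
        return self.get_descendants(max_depth=1)


root = Node("root", [Node("a", [Node("a1")]), Node("b")])
print([n.name for n in root.get_children()])
print([n.name for n in root.get_descendants()])
```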
>
> 410: LOGGER.exception dumps a stack trace, which may or may not be
> desired. I think LOGGER.error() would be better here (something higher
> up can determine when/where to dump the stack trace - to the log, to the
> console, etc.)
Ah, didn't realise that, changed to use .error()...
>
> 414: Function should check that caller doesn't pass in both before and
> after.
Done + wrote unit test.
>
> 439: Set it to "None" to indicate a "not yet determined value". Negative
> values are valid parameters to insert, so setting to "-1" could cause
> trouble later. (Alternatively, default to 0, and remove lines 467-468)o
Set to 0, and removed the lines.
>
> 442-444: Unnecessary - let the __check_object_type call catch this
> (NoneType is not an instance of DataObject, so __check_object_type will
> fail).
Done.
>
> 447, 456: Not strictly needed - if they're not DataObject sub-classes,
> they won't be children, so the calls at 449/458 will fail.
Agreed, but I think it's more useful to report an incorrect data type error
here than to just fail with a "not found" error.
>
> 470-481: I'd recommend having separate functions for "insert one" and
> "insert all items in an iterable". (See list.append vs. list.extend). It
> provides a cleaner disjunction and allows for one to create and insert
> "list like" DataObjects.
>
Do you mean that I should have two methods - say insert_child and
insert_children?
Do you really think it's likely for there to be a DataObject sub-class that
would also be a list (by which I mean one matching isinstance(.., list), not
just allowing [] style access)?
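If the split were made, it would presumably mirror list.append vs. list.extend, along these lines (a simplified sketch with hypothetical names, not the actual DOC code):

```python
class DataObject(object):
    def __init__(self, name):
        self.name = name
        self._children = []

    def insert_child(self, child):
        # cf. list.append: exactly one object.
        if not isinstance(child, DataObject):
            raise TypeError("not a DataObject: %r" % (child,))
        self._children.append(child)

    def insert_children(self, children):
        # cf. list.extend: any iterable of objects, each validated.
        for child in children:
            self.insert_child(child)


root = DataObject("root")
root.insert_child(DataObject("a"))
root.insert_children([DataObject("b"), DataObject("c")])
print([c.name for c in root._children])
```

The disjunction means a caller never has to wrap a single child in a list, and an iterable-like DataObject subclass can't be silently unpacked by mistake.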
> 543, 569: This seem in conflict. Is it an error to call delete when
> there's nothing to delete, or isn't it? I don't see a problem with
> calling delete and having nothing to delete.
They aren't really in conflict. The first one deletes all children, while the
second deletes children matching specific criteria - for the latter I felt it
was useful for the consumer to know whether anything matched, which is
especially helpful for a developer using the API who may have a typo or logic
error.
It's the Python equivalent of a boolean_t return in C for many methods...
>
> 586: This only removes "self" from local scope - so it's not really needed.
Removed.
>
> 591, 606: I don't believe we ever reached resolution on why this
> couldn't be implemented via the __deepcopy__ and __copy__ special
> methods. It seems rather unintuitive to explicitly avoid them - my first
> thought when needing to copy/deepcopy a Python object is going to be to
> "import copy; copy.deepcopy(my_object)"
>
> See attached tree_class.py to see a tree-like object that takes
> advantage of __setstate__ and __getstate__ to copy the right portions of
> the tree and fix things as appropriate. (I didn't verify the shallow
> copy specifically, but it shouldn't require much additional work, if any).
>
> In looking at your response to Drew's question in this area, I think
> this resolves 1 & 2. In regards to 3, any subclass implementors can skip
> overriding deepcopy - and if they do override it, like any subclass
> overriding a parent definition, it needs to be aware of whether or not
> it should call super().__deepcopy__() as part of the implementation -
> that's a common problem with any OO subclass. And I don't believe it
> will interfere with the pickling, either.
Yourself and Drew will be glad to know that I finally managed to resolve this
to my satisfaction, and yours!
Thanks for the sample - it helped a lot (I'm still relatively new to Python).
I have made this work, and enhanced the unit tests to verify the behaviour
too.
Basically, with a minor tweak, your code worked for the deepcopy, but not
the copy - so I had to provide a __copy__ implementation to solve the copy
one.
(The minor tweak was to copy the state in __getstate__ rather than modify the
passed in one, which seemed to modify the original).
Now, I still have the utility methods copy and deepcopy - do you want me to
remove these? I feel they are still useful in the API, but maybe others feel
they are not?
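For anyone following along, the tweak described above can be sketched like this (a toy Node class, not the real DOC code): __getstate__ must return a *copy* of __dict__, because mutating __dict__ itself would corrupt the original object mid-deepcopy.

```python
import copy

class Node(object):
    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent

    def __getstate__(self):
        # Copy the state; modifying self.__dict__ directly would change
        # the original object, which was the bug being described.
        state = self.__dict__.copy()
        state["parent"] = None  # detach the copy from the tree
        return state

    def __setstate__(self, state):
        self.__dict__.update(state)


root = Node("root")
child = Node("child", parent=root)
dup = copy.deepcopy(child)

print(dup.name, dup.parent)        # the copy is detached
print(child.parent is root)        # the original is untouched
```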
>
> 701-702: In general, I'd discourage having heavyweight functions map to
> properties. Attributes and properties should be lightweight and quick to
> access. For example, on line 714 you call self.xml_tree, and assign it
> to xml. Then, 2 lines later, you use self.xml_tree again, instead of
> referencing the xml variable - causing the entire tree to be
> regenerated. Changing this to a function would make it obvious that a
> caller should try to avoid retrieving the value multiple times in a row.
Fixed the duplicate call.
Also changed the API to be get_xml_tree() rather than the property.
>
> 720: Rather than using string concatenation, which is slow, add all
> items to a list, and then use "\n".join(the_list). Although since this
> recurses, the performance gain may be minimal.
I don't really expect this to be used that often, so I wouldn't expect it to
need to be very performant. Since most of the work is done via recursion, any
performance issue is more likely to come from depth than breadth, which the
list wouldn't really help with here.
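For reference, the pattern Keith suggests is simply to accumulate pieces in a list and join once at the end, rather than repeated `+=` concatenation:

```python
# Accumulate lines, then join once - avoids building a new intermediate
# string on every "+=".
lines = []
for name in ("root", "child_1", "child_2"):
    lines.append("Node: %s" % name)
text = "\n".join(lines)
print(text)
```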
>
> 729, 738: This use of repr() breaks convention. See:
> http://docs.python.org/reference/datamodel.html#object.__repr__
> http://docs.python.org/reference/datamodel.html#object.__str__
> I don't particularly see the value of defining __repr__ for the DOC.
I agree the __repr__() doesn't provide much here.
I've removed the definition of __repr__() here - but maintaining the __str__()
implementation and people can override __repr__() to change what's output in
the __str__() formatted output, and added to the comment of __str__() to
specify this information.
>
> 741: The Python collections define class hierarchies such that a mutable
> classes inherit from immutable classes. I think we might want the same.
> In this case, that means having "DataObjectNoChildManipulation" be the
> base class, and DataObject be a subclass of that. The bonus there is
> that then it's simply a matter of *not* defining
> insert_children/delete_children for the base class, and defining it for
> the subclass. That would make it more transparent for users of the
> classes as well (it's far easier to determine that a method simply
> doesn't exist for an object than it is to determine that the method
> exists, but will always raise an exception).
Agreed, that makes more sense. I've updated to have a hierarchy that is like:
DataObjectBase
 +-- DataObject
 +-- DataObjectCache
 +-- DataObjectDict
I didn't like the name NoChildManipulation ;)
The only difference is that the ability to add/delete children is missing from
DataObjectBase right now - though who knows, over time that might not be all
it omits.
This doesn't change the expected use by everyone else, as can be seen from the
minimal changes in the unit tests: all I had to change was the exception to
look for (AttributeError instead of OperationNotSupported, which isn't used
anymore, so I removed it too).
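The payoff of the hierarchy is exactly the transparency Keith described: an unsupported operation simply doesn't exist, so Python raises AttributeError for free. A simplified sketch (the class internals here are illustrative, not the real implementation):

```python
class DataObjectBase(object):
    """Base class: children are readable, but not modifiable."""

    def __init__(self, name):
        self.name = name
        self._children = []

    def get_children(self):
        return list(self._children)


class DataObject(DataObjectBase):
    """Adds child manipulation - the methods exist only here."""

    def insert_children(self, children):
        self._children.extend(children)


class DataObjectDict(DataObjectBase):
    pass  # no insert_children/delete_children at all


writable = DataObject("root")
writable.insert_children([DataObjectBase("leaf")])
print(len(writable.get_children()))

failed = False
try:
    DataObjectDict("d").insert_children([])
except AttributeError:
    failed = True  # no custom OperationNotSupported needed
print(failed)
```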
>
> data_object_cache.py:
> 128: Seems like it would be simpler to just check "if __instance is None"
Not changing for now, since this will be removed eventually...
>
> 144, 146: Should probably be "single-underscored" private variables. The
> name-mangling is more to "hide" from sub-classes, and, should there ever
> be a subclass of DataObjectCache, I imagine it'd want to access the
> persistent and volatile trees. Ditto for __instance -> We wouldn't want
> a subclass creating two separate name-mangled instances.
Agreed, changed to single underscore.
>
> 148: This makes it obvious as to why the class hierarchy is the way it
> is. By the same token, it also makes it obvious that it's not really
> possible to enforce "NoChildManipulation" as such. This block of code
> could just as easily be "self._children = [self.__persistent_tree,
> self.__volatile_tree]", if following the above suggestion for adjusting
> the class hierarchy.
Don't forget that to maintain the tree I need to set their parent values ;)
Changed...
>
> 177: I'd have expected this to be the "delete()" method.
Nope, the important distinction here is that it is seen as clearing the whole
tree, apart from the DOC's direct children - delete would be different...
>
> 190: This is a good candidate for a property ("self.empty")
I've made it a property, but I prefer "is_empty" to just "empty".
>
> 229: The isinstance check is not needed - all python objects are
> instances of "object".
I didn't realise that everything was an object - I assumed primitive types
like 'int' weren't objects, but it turns out they are... ;)
>
> 238: I think we'd want to close regardless. I'm not sure the advantage
> of leaving this file object open for writing. Closing it prevents the
> caller from trying to write garbage to the end of our pickled DOC.
I thought the same originally, but in the case of buffers (e.g. StringIO),
doing a close will free the memory in the buffer - which isn't ideal, but is a
fact of the implementation.
Hence the reason I put this in - and I also feel it's more correct to assume
that if it's not a file we opened, then we shouldn't close it, leaving that to
whoever opened it. For example in C, if you pass a file descriptor to a
function you don't expect it to close it...
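The convention being described - close only what you opened - can be sketched like this (function and parameter names are hypothetical, not the real take_snapshot signature):

```python
from io import StringIO

def take_snapshot(dest, data):
    # If given a path, we open (and so must close) the file ourselves.
    # If given an already-open file-like object, leave it open: closing
    # a StringIO would discard its buffer contents.
    opened_here = isinstance(dest, str)
    output = open(dest, "w") if opened_here else dest
    try:
        output.write(data)
    finally:
        if opened_here:
            output.close()


buf = StringIO()
take_snapshot(buf, "pickled DOC")
print(buf.closed)        # still open for the caller
print(buf.getvalue())    # contents still readable
```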
>
> 241: Nit: I find this method name a touch verbose. "load" seems fine to
> me, but it's a mere suggestion and completely up to you.
I understand what you're saying, but I feel that the "from_snapshot" is needed
to ensure that someone doesn't incorrectly assume that it can be used to load
from XML.
>
> 293: In opposition to Drew's complaints about "cls", PEP8 suggests using
> "cls" as the name for variables representing classes:
>
>> - If your public attribute name collides with a reserved keyword, append
>>   a single trailing underscore to your attribute name. This is
>>   preferable to an abbreviation or corrupted spelling. (However,
>>   notwithstanding this rule, 'cls' is the preferred spelling for any
>>   variable or argument which is known to be a class, especially the
>>   first argument to a class method.)
>
> And "requires" it for classmethods:
>
>> Always use 'cls' for the first argument to class methods.
> (Sorry Drew!)
>
Got it, and changed all relevant locations.
>
> 295: "del" doesn't actually "free" anything here. It simply removes that
> one reference to that object from the scope. As an example, consider:
> >>> a = object()
> >>> b = a
> >>> a
> <object object at 0x8080480>
> >>> b
> <object object at 0x8080480>
> >>> del a
> >>> a
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
> NameError: name 'a' is not defined
> >>> b
> <object object at 0x8080480>
> To truly free an object, *all* references to it need to be "lost," which
> in this case will happen automatically once the function returns.
Yep - it's historical code I forgot to remove.
>
> 338: Recommend appending to a list and then using "\n".join(the_list)
Done.
>
> data_object_dict.py:
>
> 69-74: The ValueError thrown by etree.Element provides an almost
> identical error string; as such, this function seems unnecessary -
> especially since it seems to be used only to protect against sub-classes.
It is very necessary to protect against sub-classes here - since that's the
only way that you can change the tag and sub-tag correctly. Hence I still feel
this is necessary, even if all it does is call the etree.Element() method.
If you look in the unit tests for this class you will see varied sub-classes,
some with invalid tags :)
>
> 104: There's no need for generate_xml to be a property; a simple
> attribute will suffice. No need to assert that it's a bool - any object
> in Python can be evaluated in a boolean context.
While what you say about any object being allowed to be evaluated in a boolean
context is true, I don't agree that we should just make it an attribute - I do
think that it's worth ensuring that it is a boolean value for a cleaner API.
If this is indeed the preference - to allow anything to be stored - then I'm
happy to change it to that, but I feel this should be part of a 'style' guide
to be consistent throughout CUD.
>
> 174: This "del" call isn't needed.
Done
>
> Tests:
> I'm going to be a touch more lenient in terms of style/implementation
> here. For reference, can you generate a code coverage report and include
> it in your reply to this email? (Your tests seem to run fine via the
> slim_test runner in the cud_dc gate, so you should be able to just run
> that with a --with-cover option to get the results. Optionally, check
> "slim_test -h" for info on how to generate an HTML coverage report, and
> include *that* instead of the text version).
Summary coverage HTML attached :)
>
> test_data_object_cache_children.py:
> 39 1/2, elsewhere in this file and others: To eliminate odd issues
> between tests, you should reset the DOC instance as part of the tearDown
> method, by setting DataObjectCache._DataObjectCache__instance = None
While that is probably true - I'm hoping to remove the singleton code soon, so
I don't see the need to do this right now.
>
> 53-57, elsewhere: Use "self.assertRaises(<Error>, <function>, [arg1 [,
> arg2, ...]])" instead of a try/except block
I hadn't noticed this assertion mechanism! Changed to use it where possible...
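For reference, the assertRaises form takes the exception class, the callable, and the arguments separately, replacing the whole try/except/else-self.fail() pattern with one line (test and function names below are hypothetical):

```python
import unittest

def insert_children(child):
    # Stand-in for DataObject.insert_children: rejects bad types.
    raise TypeError("not a DataObject: %r" % (child,))


class InsertionTest(unittest.TestCase):
    def test_insert_rejects_bad_type(self):
        # One line instead of a try/except block with self.fail().
        self.assertRaises(TypeError, insert_children, "bad value")


suite = unittest.TestLoader().loadTestsFromTestCase(InsertionTest)
result = unittest.TextTestRunner(verbosity=0).run(suite)
print(result.wasSuccessful())
```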
>
> 35: Could simply use the "object" class.
Assuming this is the file data_object_cache_registration.py, then I agree -
removed it.
>
> 80: Typo: Priority should be 100 here, based on test name / failure message.
Yep, fixed.
>
> EOF: Missing test-cases:
> register_class: Assigning multiple classes the same priority level
> find_class_to_handle: Need case to ensure that if two classes can both
> handle something, the one with the higher priority is chosen. (Create 2
> classes whose "can_handle" method always return True to test this in the
> simplest fashion possible)
> find_class_to_handle: Case to ensure that if no classes are found, None
> is returned.
All done :)
>
> test_data_object_cache_snapshots.py:
>
> I'd imagine that only one test will truly require a self.temp_dir/file -
> there should only be a need to explicitly test against the filesystem
> once. For the rest, comparison of StringIO objects should suffice.
There is one that tests a successful write, and one that tests a successful
read - but other than that, nothing else uses the file system.
>
> I'd suggest moving the DOC generation to a separate function that can be
> called explicitly by tests that need a complex DOC (it shouldn't be
> strictly required for more than one or two tests). No need to store the
> self.child_* variables either (and thus, no need to clean them up) since
> none of the tests reference them. This function could be importable by
> any test files that need to generate a reasonable complex cache.
The test files where a DOC is created are not exactly the same - some have
more duplicate names, etc. Some of this is done.
And while in THIS file you are correct that there is no reference to the
self.child* values, in most of the others there is.
I'll have a think about this, but I'm not convinced that a common file will
work in all cases.
>
> Most or all of the tests here should both take_snapshot() and
> load_from_snapshot(), and compare the DOC (persistent) strings from
> before and after.
>
> 229: There's no guarantee that I haven't created such a directory on my
> machine. This test can probably just be removed - it's not really
> testing DOC as much as it's testing the Python 'open()' built-in.
I think it's still worth testing that we first recognise it as a string, then
try to use it as a file, which then fails.
To ensure that it never exists, I'm now prefixing with the value of
self.temp_dir, which is created using mkdtemp() and so should ensure the path
doesn't exist.
>
> 241: In essence, this tests the same thing as 234.
True, removed one...
>
> test_data_object_cache_xml.py:
> 235: Typo: geneerates -> generates
Fixed.
>
> test_data_object_deletion.py:
> 95: Typo in fail message (should be child_2)
Fixed.
> 114-115: I'd check for assertNotEqual(child, self.child_3) to be
> slightly more precise here.
I do...
>
> I've run out of energy to read through test code, and the tests so far
> look really good, so I'll wait for the coverage report to finish looking.
Me too - it took the best part of 2 days to get things done and formulate a
response!
Phew,
Thanks,
Darren.
_______________________________________________
caiman-discuss mailing list
[email protected]
http://mail.opensolaris.org/mailman/listinfo/caiman-discuss