On 05/25/10 07:59 PM, Keith Mitchell wrote:
>>> 3.3: Can you clarify the reasons for the use of copy() and deepcopy()
>>> instead of the Python standard __copy__() and __deepcopy__()?
>>>
>> The main reason for this is that I wanted to have more control over the copy
>> from DataObject, and simply provide it as part of the object itself rather
>> than expecting people to do other things like using the "copy" module
>> themselves.
>>
>> But one other point is that if you simply call copy.deepcopy() to copy the
>> tree, then you end up copying the WHOLE tree, since deepcopy recurses back up
>> the tree using the parent reference...
>>
>
> copy.copy() uses an object's __copy__() method, and similarly,
> copy.deepcopy() uses the __deepcopy__() method. I'd anticipate that one
> could define those methods in such a way to get the desired behavior,
> unless I misunderstand the problem or the copy module.
>

I believe that you can only get part of the way there - but maybe I need to
look at it more closely, since when I last tried I just couldn't get it to
work as I wanted.

As I understand it, I was able to prevent the parent from being copied
wholesale, but not to copy just the parent reference itself without recursing
up the tree. But I could be mistaken.
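
The memo-seeding route I'd want to try again looks something like this (a
hypothetical Node standing in for DataObject; an untested sketch):

import copy

class Node(object):
    '''Hypothetical stand-in for DataObject.'''
    def __init__(self, name):
        self.name = name
        self.parent = None
        self.children = []

    def __deepcopy__(self, memo):
        # Copy this node without following the parent reference at all.
        new = Node(self.name)
        memo[id(self)] = new
        for child in self.children:
            new_child = copy.deepcopy(child, memo)
            new_child.parent = new   # point at the new copy, not the old
            new.children.append(new_child)
        return new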

>> This is messy to avoid since I do want __deepcopy__() to maintain the
>> process of copying the parent reference, but if I exclude it from the copy
>> mechanism, then it won't. As such, the DataObject.deepcopy() method does a
>> copy of itself and then a copy.deepcopy() of the children to avoid this
>> recursion back up the tree.
>>
>
> Why would a deepcopy have the same parent? By the definition of
> deepcopy, I'd expect the resultant object to have no relation to the
> original object. I'd actually expect it to have no parent at all, and
> need to be explicitly added back to the DOC (or added as a child to an
> object in the DOC) in order to re-establish a parent.

It doesn't have the same parent - because it first does a copy of itself,
which effectively unsets its parent reference, and then it does a deep copy of
its children. This results in a tree, from the starting node downwards, that
is totally disconnected from the original.

The parent references that I'm referring to in the quoted text are those of
the children (and their children). When these are copied, they are
automatically modified by copy.deepcopy() to point to the new copy of the
parent object.

Does that make sense?
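
In code terms, the sequence is roughly this (a sketch - the copy() semantics
and the _children attribute are assumed):

import copy

def deepcopy(self):
    new_obj = self.copy()        # shallow copy; parent reference cleared
    # Pre-seed the memo so that the children's parent references resolve
    # to new_obj, rather than dragging the original parent (and everything
    # above it) into the copy.
    memo = {id(self): new_obj}
    new_obj._children = copy.deepcopy(self._children, memo)
    return new_obj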

>>
>>> Side-note on properties: I meant to bring this up during prototyping,
>>> but I didn't quite get around to it. A simple attribute does not need to
>>> be defined as a property initially, but can simply be an attribute until
>>> a later point in time when the property concept is needed. As such, does
>>> "name" need to be a property?
>>>
>> The main reason that it's a property is that I want it to be read-only; if
>> people access it as a plain attribute then I cannot enforce that it stays
>> read-only.
>>
>> Aside - I'm really against people directly accessing the attributes of an
>> object (maybe it's the Java/C++ developer in me - or maybe it's just me
>> being a control-freak... ;) ).
>>
>
> It's a Java/C++ developer thing - break that habit ;)
> Seriously, in Python, if an attribute is read/write, then it should be a
> straight attribute. However, since name is read-only, having it as a
> property makes sense.
>

I'll get there eventually ;)
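
For reference, a minimal sketch of name as a read-only property (the
underlying attribute name is assumed):

class DataObject(object):
    def __init__(self, name):
        self._name = name

    @property
    def name(self):
        # No setter is defined, so "obj.name = ..." raises AttributeError.
        return self._name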

>>
>>> 3.4.1.2: I'm not sure there's necessarily a need for both an "insert
>>> before" and an "insert after" method. One should be sufficient?
>>> Comparing this to 'simple' lists in Python, for example - most simply
>>> have an "append()" and "insert()" method.
>>>
>> Yes, but in that case you would have to be directly manipulating the children
>> property - knowing that it's a list.
>>
>> But, in my experience, it's not good to expose an object's implementation to
>> its consumers, so I don't want people relying on the children being a list -
>> hence I like to hide the fact that it is. For support purposes, this leaves
>> it open for future changes to the internal implementation without affecting
>> consumers of the API.
>>
>
> Actually, I wasn't implying that literally, simply as a point of
> comparison. I'm simply wondering about a need for *both* an "insert
> before" and an "insert after" method.
>

I think there is, purely for usability of the API:

- Inserting at the start would otherwise require insert_before() to be called
  on the first element.
- insert_after() is needed to be able to say "insert my checkpoint directly
  after TI". Of course you could step forward one element and insert before
  that, but it would require extra effort from the API consumer - discovering
  what the next checkpoint after TI is. (See the sketch below.)
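
To make that concrete, a toy sketch (the class and method bodies here are
made up, not the real DataObject API):

class Node(object):
    def __init__(self, name):
        self.name = name
        self.parent = None
        self.children = []

    def add(self, child):
        child.parent = self
        self.children.append(child)
        return child

    def insert_before(self, new):
        new.parent = self.parent
        siblings = self.parent.children
        siblings.insert(siblings.index(self), new)

    def insert_after(self, new):
        new.parent = self.parent
        siblings = self.parent.children
        siblings.insert(siblings.index(self) + 1, new)

root = Node("root")
ti = root.add(Node("TI"))
root.add(Node("transfer"))

root.children[0].insert_before(Node("first"))   # insert at the very start
ti.insert_after(Node("my-checkpoint"))          # "directly after TI"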

>>
>>> copying - Can you clarify what you mean by the "parent will be cleared"?
>>> It sounds as if the copy()/deepcopy() methods are more akin to a "move"
>>> operation.
>>>
>> In a way they are - if you take a copy of a node, it should be an independent
>> copy - meaning that it's not tied to its parent, or to its children either
>> (the latter being a deepcopy()).
>>
>> If you don't do this then you would be incorrectly assuming that the parent
>> of the node is correct, when there is no reciprocal relationship between the
>> parent and the child.
>>
>> I also don't believe that the term "move" is correct, since the original
>> node is still in place.
>>
>
> That last sentence is what I was confused about. The wording at first
> made it sound like the original node would be deleted.
>
> This also seems the opposite of what was stated above about a deepcopy
> retaining a reference to the parent.
>

It only maintains the parent references in the child nodes, and those
references point to the new copy, not to the old one.

>>
>>
>>> 3.4.1.3
>>>
>>> to_xml: What's the reasoning behind not having the object itself make
>>> the recursive call into to_xml()? In particular, the restriction it
>>> creates - "it is possible... for the to_xml() method to generate the
>>> complete tree... In this case the child objects' to_xml() methods must
>>> return None" - seems very unintuitive.
>>>
>> The main reason is to remove that burden from the implementors.
>>
>> But, as in my response to Dave Miner yesterday, it does make sense to have a
>> mechanism to prevent further recursion - so that the parent node can decide
>> to prevent it if desired - though by default it would descend the tree.
>>
>
> A 'for' loop around its children doesn't seem terribly burdensome; on
> the other hand the additional complexity of supporting two options here
> seems like a hassle to me.
>
> If there's a need to support two paths, splitting up the functions would
> be smoother:
>
> def to_xml(self):
>      '''Never overridden.'''
>      my_xml = self.as_xml()
>      my_xml.append(self.children_as_xml())
>      return my_xml
>
> def as_xml(self):
>      '''Abstract method; subclasses return an XML object
>      representing themselves.'''
>      raise NotImplementedError
>
> def children_as_xml(self):
>      '''Only overridden by a subclass if it wants to take control of
>      adding its children to the XML.'''
>      children_xml = SomeXMLObject()
>      for child in self.children:
>          # Note this is a call into 'to_xml', not 'as_xml'.
>          children_xml.append(child.to_xml())
>      return children_xml
>
> Splitting it up means that the subclasses have the option for
> controlling, but don't have to worry about anything if they don't choose
> to go that route. This is just one pseudo-codish example, of course - it
> could be designed so that the opposite is the default as well. (I
> realize this pseudo-code is probably not crystal clear; I'd be happy to
> clarify or add detail individually)
>

Sure, that's another possible solution, but the one I discussed with Dave Miner
seems simpler: it's just a boolean method that returns False by default. If a
consumer wants to handle the children itself, it simply redefines that method
to return True and has its to_xml() implementation do whatever it wants (see
the sketch below).

I'm open to either solution...
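
For reference, roughly what I have in mind (names assumed; just a sketch,
with an ElementTree-style append() on the XML elements):

class DataObject(object):
    def __init__(self, name):
        self.name = name
        self.children = []

    def handles_children(self):
        # Default: the framework recurses over the children itself.
        # A subclass returns True when its to_xml() generates the XML
        # for its children as well.
        return False

    def to_xml(self):
        raise NotImplementedError

def tree_to_xml(node):
    element = node.to_xml()
    if not node.handles_children():
        for child in node.children:
            element.append(tree_to_xml(child))
    return element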

>>
>>> can_handle: I can imagine that the evaluation process this would require
>>> could get rather lengthy. There's also concern about
>>> complexity/confusion down the line given that can_handle() could be
>>> evaluating more than simply a high level tag. Would restraining it to
>>> simply the top level tag and its attributes ensure uniqueness? If
>>> so, that could at least simplify things slightly. At minimum, I think
>>> this setup might require some concept of ordering amongst classes, such
>>> that if two classes happen to non-uniquely decide upon ownership of a
>>> certain node, we can still get a consistent result back.
>>>
>> I don't believe that the evaluation *should* be all that expensive - in the
>> majority of cases it's going to be a 1-1 mapping between a tag and a class
>> that can handle it, without a need to look further down the tree.
>>
>> As a possible way of making it simpler for the consumer, we could look at a
>> "matching" mechanism similar to XPath - but I really think that's over-kill
>> here.
>>
>
> Agreed.
>
>> Certainly ordering may be worth considering if people think it's required,
>> but again, I think it's just as simple to be more specific about what you
>> match - or make the other one less specific.
>>
>
> For the simple case of the high level tag, maybe a couple attributes,
> it's simple. But if the intent is to ever support anything more complex,
> I think we'll want a way to be able to explicitly determine the
> evaluation order, so that we can ensure consistent results.
>
>> This could be as simple as a number assigned when doing the registering of a
>> class with the DOC.
>>
>
> Yup, exactly.
>

I think we can discuss this as part of a separate discussion on registration
in general.
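
As a strawman for that discussion, it could be as simple as (all names
hypothetical):

_registered = []

def register_class(cls, priority=100):
    # Lower numbers are consulted first; ties keep registration order
    # since list.sort() is stable.
    _registered.append((priority, cls))
    _registered.sort(key=lambda entry: entry[0])

def find_handler(xml_node):
    for _priority, cls in _registered:
        if cls.can_handle(xml_node):
            return cls
    return None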

>>
>>> from_xml:
>>> Nits on the pseudo-code: 4th line in should be "from_xml"
>>>
>>> static vs dynamic: I believe the best solution would be a static
>>> situation, and then simply require that any caller into this function be
>>> sure to register all needed classes prior to calling from_xml(). For
>>> example, ManifestParser could read in an XML, recognize that the schema
>>> is for an AI schema, import all necessary AI related classes, then load
>>> the DOC from the XML DOM. The import of the classes could easily be
>>> set-up to trigger registration with the cache, as was hinted at in this
>>> section.
>>>
>> Heh, you suggest static, Dave Miner suggests dynamic ;)
>>
>> This is part of the reason that I've not yet decided since I think it would
>> warrant its own discussion...
>>
>
> That was a typo - I meant 'dynamic,' as implied by the context. The
> important piece is that it's up to the caller to ensure that all
> required classes get imported or registered in that dynamic fashion.
>

Gotcha ... I think that we certainly need a separate discussion solely on the
static/dynamic point...

>>
>>> 3.4.2.4:
>>>
>>> dump: Would the __str__ or __repr__ functions provide sufficient
>>> functionality here? Or does the indented parameter imply otherwise?
>>>
>> We're using str() for each DataObject, so you're suggesting simply
>> overriding __str__() to recurse the tree - I think that would be doable.
>> We originally avoided that because it places the burden on any class that
>> overrides __str__() to ensure it recurses down the tree - which wouldn't
>> be obvious to people.
>>
>
> I see your point. This places the opposite burden however - a subclass
> of DataObject can't recurse in its __str__ method, or the dump() output
> will have duplicate data. I'd rather see __str__ recurse, as then I can
> be sure that str(<DataObject>), print <DataObject>, and string
> formatting of a data object behave in a reasonable manner. However, I
> wouldn't be opposed to having __str__ print the object, and __repr__
> print the object + children.
>

After I sent this, I gave it some more thought.

I think it is certainly best to remove the dump() method and simply use
__str__() and __repr__() as you describe above.
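
Something along these lines (a rough sketch; the _tree_repr helper name is
made up):

class DataObject(object):
    def __init__(self, name):
        self.name = name
        self.children = []

    def __str__(self):
        # Just this node, no recursion.
        return "<%s: name=%s>" % (self.__class__.__name__, self.name)

    def __repr__(self):
        # This node plus all of its children, indented by depth.
        return self._tree_repr(0)

    def _tree_repr(self, level):
        lines = ["  " * level + str(self)]
        for child in self.children:
            lines.append(child._tree_repr(level + 1))
        return "\n".join(lines)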

>>
>>
>>> 3.4.3.1
>>> 1/2: Since NameValue is simply a single dictionary entry, what is the
>>> value-added of the NameValue class?
>>>
>> Mainly that it's a single object with a name and value - so the XML
>> representation might be different to that desired for something with multiple
>> values. But certainly it could be merged and simply behave differently based
>> on whether there is a single value or several... I think that would work.
>>
>
> Precisely my thoughts - particularly bearing in mind that the XML tree
> can deviate to some extent from the DOC tree, correct?

Yep
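
For what it's worth, a merged class might behave something like this (the
tag and attribute names are only illustrative):

from xml.etree import ElementTree as etree

class NameValues(object):
    '''Sketch - in practice this would be a DataObject subclass.'''
    def __init__(self, name, values):
        self.name = name
        self.values = list(values)

    def to_xml(self):
        element = etree.Element("name_value", name=self.name)
        if len(self.values) == 1:
            # A single value keeps the flat form.
            element.set("value", self.values[0])
        else:
            # Several values are nested - the XML tree deviating
            # slightly from the DOC tree, as noted above.
            for value in self.values:
                etree.SubElement(element, "value").text = value
        return element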

Thanks,

Darren.
_______________________________________________
caiman-discuss mailing list
[email protected]
http://mail.opensolaris.org/mailman/listinfo/caiman-discuss
