Re: [caiman-discuss] Request for review of Data Object Cache design document (0.5)

Darren Kenny Thu, 27 May 2010 03:46:32 -0700

Hi Ethan,

Thanks for the feedback, comments inline below...

On 05/26/10 07:01 PM, Ethan Quach wrote:
> Hi Darren,
>
> Sorry for being a little late on these questions/comments.  If I've got
> dups, just say so and I'll go looking at past threads.
>
> thanks,
> -ethan
>
>
> page 5 - 2.2 - Are there any assumptions on applications' reliance on the
> order of interleaved elements as they appear in their xml instance files?
> Or, to the point, it seems that order isn't tracked once imported into
> the DataObjectCache.  I think this is actually the correct thing to do and
> keeps the design much simpler, but I think it should be noted here.
> Reason is that currently, AI does take order of appearance to have
> semantic meaning to it, but that just needs to be fixed in AI.

I'm not sure what you mean when you refer to interleaved elements here, but we
do actually have some order to things in the DOC - specifically items will
remain in the order that they were inserted, by default,  but there is also
the ability to put an object in before or after another in the list of
children.

Similarly, traversal of the tree is usually depth-first, with each parent
visited before its children, so a simple tree might result in items being
traversed in an order similar to the following:

                                  1
                                /    \
                             2         6
                           / | \      /  \
                          3  4  5   7     11
                                  / | \
                                 8  9  10

BUT, what we do not have is the ability to provide an ordering/sort function
to apply an order automatically to the insertion of an element - it is purely
based on the position of insertion, like a linked list.
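
To make that concrete, here is a rough sketch of the ordering behaviour. It is
not the DOC code: the Node class and its insert()/walk() methods are names made
up purely for illustration.

```python
class Node:
    """Hypothetical stand-in for a DOC object; not the real DataObject API."""

    def __init__(self, name):
        self.name = name
        self.children = []

    def insert(self, child, before=None, after=None):
        # Default is append, so children stay in insertion order.
        if before is not None:
            self.children.insert(self.children.index(before), child)
        elif after is not None:
            self.children.insert(self.children.index(after) + 1, child)
        else:
            self.children.append(child)
        return child

    def walk(self):
        # Parent first, then each child's subtree, left to right.
        yield self
        for child in self.children:
            yield from child.walk()


root = Node("1")
n2 = root.insert(Node("2"))
n6 = root.insert(Node("6"))
n3 = n2.insert(Node("3"))
n5 = n2.insert(Node("5"))
n2.insert(Node("4"), after=n3)          # positional insert between 3 and 5
n11 = n6.insert(Node("11"))
n7 = n6.insert(Node("7"), before=n11)   # positional insert ahead of 11
for label in ("8", "9", "10"):
    n7.insert(Node(label))

order = [node.name for node in root.walk()]
print(order)
```

The walk visits the nodes in the same 1..11 order as the diagram, and the only
way to influence that order is where you insert, not a sort function.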

Does that make sense?

> page 5 2.2 - Should  high-level -> hierarchical ?

No, the intention here was to mean that we would only look at high-level
elements in an XML tree - meaning maybe to a depth of 2 or 3 - after that it's
assumed that objects will handle the generation of the XML correctly.

It's actually looking like we might move towards the use of XSLTs to handle
this rather than coding it directly in python, which should provide more
flexibility when it comes to arranging things in a generated XML file.

>
> page 7 - 3.2 - Can you expound on what this statement is saying:
>
>          "For example, difficulties can arise when cache data contains
>           objects created from dynamically executed code."

I responded to a similar question like this in my response to Jack, and it
essentially said:

    During the prototype, where we had code in the Engine where a checkpoint
    was stored as :

    - path (of python file)
    - class
    - arguments

    Later then we dynamically instantiated that object, and held a reference
    to that new object.

    When we tried to pickle, and then unpickle it, we ran in to issues since
    the "class name" was different (e.g. MyClass, as opposed to the full name
    like mymodule.MyClass) and as such the un-pickling failed since it
    couldn't create such an object.
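
    A small sketch of the same kind of problem, using current Python 3 rather
    than the prototype code. The module name below is made up, and here the
    failure shows up when pickling rather than when un-pickling, which may
    differ from what the prototype saw. The underlying cause is the same:
    pickle stores a class by module and name, and has to be able to import it.

```python
import pickle
import sys
import types

SOURCE = """
class MyClass:
    def __init__(self, value):
        self.value = value
"""
MODULE_NAME = "hypothetical_checkpoint_module"  # made-up name for this sketch

# Case 1: class created from dynamically executed code. Its module name
# cannot be imported, so pickle cannot record where the class lives.
namespace = {"__name__": MODULE_NAME}
exec(SOURCE, namespace)
obj = namespace["MyClass"](42)
try:
    pickle.dumps(obj)
    pickle_failed = False
except (pickle.PicklingError, AttributeError, ImportError) as err:
    pickle_failed = True
    print("pickle failed:", err)

# Case 2: the same code, but registered as a module under its full name,
# so the class can be found again as MODULE_NAME.MyClass.
module = types.ModuleType(MODULE_NAME)
exec(SOURCE, module.__dict__)
sys.modules[MODULE_NAME] = module
restored = pickle.loads(pickle.dumps(module.MyClass(42)))
print(restored.value)
```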

>
> page 9 - 3.4.1.1 - "name".  Does the statement "... used to identify a
> specific object" mean a specific instance of an object, or specific class
> of an object?  In other words, will "name" be the same for all objects
> of the same type, or are they unique for each instance of the object?
> If the latter, then could you rephrase to say "... a specific instance of an
> object."

The name is just a property of an object's instance, it may have some specific
use (e.g. a disk would probably have the name like "c0t0d0") but it will
generally be context based, in other words the name for a disk would be
different to the name of a finalizer.

You can locate objects in the DOC based on their name and/or class type if you
wish, so it's in the consumer's best interest to use something sensible.

Also, it's very likely that the name will be used in the XML generated for a
manifest (as in the disk example above), but it's up to the implementor of an
object to decide that.

>
> page 9 - 3.4.1.1 - "children".  I assume here that the children property
> represents only direct children, and not all hierarchical descendants?

Yes, direct children. If you want to go deeper you need to traverse the tree,
by looking at the children's children, and so on...

>
> page 9 - 3.4.1.2 - I guess this goes back to my first comment.  Is the
> capability offered by this method here an attempt to allow applications
> to dictate order?

It does provide for some basic ordering, yes, but whether this order is
important will really be up to the users of the list of children - some will
care, and others may not.

>
> page 10 - 3.4.1.2 - delete_children() - Just a clarification, but deleting
> a child would in turn delete all of its children as well correct?  If not,
> what happens to its children?

That is correct - if you delete a child node, you're effectively breaking the
link between any children of it and its parent node.

If you wish to keep a reference to such a tree you should fetch it from the
DOC first, before deleting it, e.g.:

    my_ref = doc.get_child( name="MyObj" )
    doc.delete_child( my_ref )
    # my_ref is still valid here, removing really only decrements the
    # reference count of an object.
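
A self-contained version of the same idea, as a sketch only. The DataObject
class here is a minimal stand-in written for this example (insert_child,
get_child and delete_child are assumed names and signatures, not the final
API):

```python
class DataObject:
    """Minimal hypothetical stand-in, not the DOC implementation."""

    def __init__(self, name):
        self.name = name
        self.parent = None
        self.children = []

    def insert_child(self, child):
        child.parent = self
        self.children.append(child)
        return child

    def get_child(self, name):
        for child in self.children:
            if child.name == name:
                return child
        raise KeyError(name)

    def delete_child(self, child):
        # Only the link to the parent is broken; the subtree is untouched.
        self.children.remove(child)
        child.parent = None


doc = DataObject("root")
my_obj = doc.insert_child(DataObject("MyObj"))
my_obj.insert_child(DataObject("grandchild"))

my_ref = doc.get_child("MyObj")
doc.delete_child(my_ref)

remaining = [child.name for child in doc.children]
kept = [child.name for child in my_ref.children]
print(remaining, kept)
```

After the delete the DOC no longer has the child, but my_ref still holds the
whole subtree for as long as you keep the reference around.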

>
> page 10 - 3.4.1.2 - copy() - Perhaps it's just my apprehensions from living
> in a non-garbage collected pointerful world the past few years, but the
> behavior of this method seems like it could get hairy for both the consumer
> and maintainer of this code.  If its main use is for deepcopy(), have you
> considered not exposing it?  Or perhaps change the public exposure of it
> to just copying the object at hand and explicitly not carry over children
> references?

You are correct, copying an object shouldn't copy its children or parent
pointers - if you want to copy the children you should use deepcopy.

I'll change that.
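
Roughly what I have in mind, as a sketch using Python's standard copy module.
The DataObject class below is a made-up minimal stand-in, and the updated
design may well end up looking different:

```python
import copy


class DataObject:
    """Minimal hypothetical stand-in, not the DOC implementation."""

    def __init__(self, name):
        self.name = name
        self.parent = None
        self.children = []

    def insert_child(self, child):
        child.parent = self
        self.children.append(child)
        return child

    def __copy__(self):
        # Shallow copy: the object's own data only, no parent, no children.
        return DataObject(self.name)

    def __deepcopy__(self, memo):
        # Deep copy: the object plus a fresh copy of everything below it.
        clone = DataObject(self.name)
        memo[id(self)] = clone
        for child in self.children:
            clone.insert_child(copy.deepcopy(child, memo))
        return clone


disk = DataObject("c0t0d0")
disk.insert_child(DataObject("slice0"))

shallow = copy.copy(disk)
deep = copy.deepcopy(disk)
print(len(shallow.children), len(deep.children))
```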

>
> page 10 - 3.4.1.2 - get_first_child() - Why is this method needed?  Couldn't
> one just do get_children()[0] or some such to achieve this?

If you're just looking for the first child of all the children, then yes.

But if it's the first child matching specific criteria (i.e. using name or type
to fetch) then it would be faster than calling get_children()[0], since it
would stop searching after finding the first match, whereas get_children()
keeps going through all the children.
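
As an illustration of the difference (a sketch only: the name/class_type
keyword arguments and the checks counter are assumptions made for this
example, not the documented signatures):

```python
class DataObject:
    """Minimal hypothetical stand-in, not the DOC implementation."""

    checks = 0  # counts comparisons, only here to show the difference

    def __init__(self, name):
        self.name = name
        self.children = []

    def _matches(self, child, name, class_type):
        DataObject.checks += 1
        return ((name is None or child.name == name) and
                (class_type is None or isinstance(child, class_type)))

    def get_children(self, name=None, class_type=None):
        # Always looks at every child.
        return [c for c in self.children
                if self._matches(c, name, class_type)]

    def get_first_child(self, name=None, class_type=None):
        # Returns as soon as one child matches.
        for child in self.children:
            if self._matches(child, name, class_type):
                return child
        return None


class Disk(DataObject):
    pass


root = DataObject("root")
root.children = [Disk("c0t0d0")]
root.children += [DataObject("other%d" % i) for i in range(999)]

DataObject.checks = 0
first = root.get_first_child(class_type=Disk)
checks_first = DataObject.checks

DataObject.checks = 0
also_first = root.get_children(class_type=Disk)[0]
checks_all = DataObject.checks

print(checks_first, checks_all)
```

Both calls return the same object, but the first one only had to look at a
single child while the second looked at all of them.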

>
> page 11 - 3.4.1.3 - to_xml() - What is the purpose or envisioned use case
> for an object's to_xml() method to do the full tree underneath it?
> I'm assuming here that the recursion method to write out the cache
> to an xml instance isn't the to_xml() method itself.  If it is, then there
> shouldn't be an option for to_xml() not to traverse its children.

In the current design the traversal isn't part of the to_xml(), but in a
response to Dave Miner, we discussed having the ability to stop the traversal
to allow an object to generate the full tree, if desired.

Keith also provided another alternative to this, which you might want to also
look at.

>
> Also, upon transforming to XML, is there any validation done on the
> resultant instance document, if so when or where does that happen.

This is certainly something we would like to have - and it would make sense to
apply the DTD/Schema to it after generation to ensure it's valid.

>
> page 11 - 3.4.1.3 - can_handle() - "If a class doesn't generate XML, it
> should simply return False from this method."   Do we not expect that
> we'd ever import Data from XML that doesn't ever go to_xml()?  I'm just
> curious as to whether this decision is made simply by choice or because
> of some technical limitation.

While what you say is true, multiple objects may handle the same tag type, and
as such we need to be sure we're using the correct one, for example:

    <transfer name="ips1" type="ips" ..../>
    <transfer name="cpio1" type="cpio" ..../>

In this case can_handle() would look at the tag and the type, and decide
whether it can generate an object instance from it - so a TransferIPS class
would pick up the type="ips" and the TransferCPIO class would pick up the
other.

But also, this depends on the XML Manifest's schema, since it could have
different tags - "transfer_ips", and "transfer_cpio".

The use of a can_handle() method allows for more flexibility in this, and
doesn't result in limiting how the XML Manifest Schema is designed.
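
A sketch of how that dispatch could look. This uses the standard library's
ElementTree rather than lxml to keep it self-contained, and the registry list,
create_object() helper and constructor signature are all made up for
illustration (other attributes of the transfer elements are left out):

```python
import xml.etree.ElementTree as ET


class TransferIPS:
    def __init__(self, name):
        self.name = name

    @classmethod
    def can_handle(cls, element):
        # Same tag as TransferCPIO, so the type attribute decides.
        return element.tag == "transfer" and element.get("type") == "ips"


class TransferCPIO:
    def __init__(self, name):
        self.name = name

    @classmethod
    def can_handle(cls, element):
        return element.tag == "transfer" and element.get("type") == "cpio"


REGISTERED_CLASSES = [TransferIPS, TransferCPIO]  # hypothetical registry


def create_object(element):
    """Hand the element to the first class that says it can handle it."""
    for cls in REGISTERED_CLASSES:
        if cls.can_handle(element):
            return cls(element.get("name"))
    return None


manifest = ET.fromstring(
    '<manifest>'
    '<transfer name="ips1" type="ips"/>'
    '<transfer name="cpio1" type="cpio"/>'
    '</manifest>'
)
objects = [create_object(element) for element in manifest]
print([(type(obj).__name__, obj.name) for obj in objects])
```

If the schema used "transfer_ips" and "transfer_cpio" tags instead, only the
can_handle() bodies would need to change.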

>
> "... the class implementation should look at Element passed, and by
> looking at the tag, attributes, and possibly child nodes, it should decide
> whether it is able to handle the given element."   I suspect that this check
> could possibly give false positives in certain scenarios if a view of the
> parent of the given Element isn't made available as part of the check.
> If you disagree I can try to cons up an example to demonstrate.

I don't disagree - the Element object passed in has a reference to its
parent, if you desire that information. I'll update the text to mention that
too:

  http://codespeak.net/lxml/api/lxml.etree._Element-class.html#getparent

>
> page 12 - So what's blocking the choice of going 'Dynamically' here?
> I don't see any disadvantages listed, and I can only think that it would
> be advantageous to do it dynamically.

I agree - it's more about how dynamic it should be really - I'm going to write
something up about this soon.

>
> General comment: It would seem that an application is free to create a
> tree of DataObjects that may not conform to what their elemental
> representations are in xml form, and if so, are there any preventative
> measures?  Are applications just supposed to manage their own view of
> what the DataObjectCache tree currently looks like?  I sense that this
> might be difficult to code toward.  Also, the statement on page 5 - 2.2
> says "It is the job of the DOC to keep track of the high-level XML
> structure, so that a properly structured XML manifest can be produced from
> the installation data."  But how is the DOC tracking this?

Yes, an application is free to create a tree of DataObjects as it desires,
but the constraints will primarily come from the design of the Engine and
Checkpoints - being the main consumers.

Each Checkpoint will define its input and output parameters (a contract for
want of a better word), which will all be passed via the DOC - so if they
aren't where these Checkpoints want them, things will fail.

It is expected that most Checkpoints, where it's possible to have data in XML
format, will remain fairly close to this structure in terms of DataObjects
too - it's the most obvious thing to do and simplest from an XML import/output
point of view.

As for keeping track of the high-level structure, I think I answered that
earlier in the e-mail.

>
> page 14 - 3.4.2.2 - load_snapshot() - Just for my clarification, this method
> throws away any current tree data in the DataObjectCache being loaded
> to, correct?  Or perhaps, are there limitations as to when this method
> can be called?

Yes, it totally replaces the existing contents of the DOC - thus performing a
roll-back to a specific point in time.

We don't currently have any limitations as to when it is called, but we would
expect that the Application would be the main consumer of it, and the Engine
the main consumer of the snapshot() method, calling it between Checkpoints.
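
In sketch form, the roll-back behaviour is along these lines. This is an
assumption-laden illustration only: the class below keeps the snapshot as
pickled bytes in memory, and the method signatures are made up rather than
taken from the design document.

```python
import pickle


class DataObjectCache:
    """Hypothetical stand-in; the children list represents the whole tree."""

    def __init__(self):
        self.children = []

    def snapshot(self):
        # Capture the current contents of the cache.
        return pickle.dumps(self.children)

    def load_snapshot(self, data):
        # Replace, not merge: whatever is in the cache now is thrown away.
        self.children = pickle.loads(data)


doc = DataObjectCache()
doc.children.append("target-discovery-data")

saved = doc.snapshot()                      # e.g. between two Checkpoints
doc.children.append("transfer-data")        # a later Checkpoint adds data

doc.load_snapshot(saved)                    # roll back to the saved point
print(doc.children)
```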

>
> page 14 - 3.4.2.3 - import_from_manifest_xml() - Same question.

This method is primarily intended to be used by the ManifestParser checkpoint
to populate the DOC from the manifest - one main difference here is that it's
not a snapshot/roll-back case - so it may not overwrite anything, and is more
likely to do a merge.

>
> page 15 - 3.4.2.4 - Did you mean DataObjectCache here?

Yep

>
> page 15 - 3.4.3.1 - DataObjectDict Class - how does the schema definition
> for this look??  I would think the 'keynameX' shouldn't be element names
> here, but instead something like:
>
> <ddata name="DICTIONARY NAME">
> <data key="keyname1">value1</data>
> <data key="keyname2">value2</data>
>          ....

At the moment, from the prototype code, it looks like:

  <data name="dictionary1">
    <a>1234</a>
    <pi>3.14159265</pi>
    <bee>-500</bee>
  </data>

But I think that you are correct, and it may be better to use <data ...> as a
structure, or at least allow for people to influence it one way or the other.
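
To compare the two layouts side by side, here is a sketch using the standard
library's ElementTree. The two helper functions are made up for this example
and are not the prototype's code:

```python
import xml.etree.ElementTree as ET


def dict_to_xml_tags(name, values):
    """Prototype-style layout: each key becomes an element name."""
    root = ET.Element("data", name=name)
    for key, value in values.items():
        ET.SubElement(root, key).text = str(value)
    return root


def dict_to_xml_keyed(name, values):
    """Suggested layout: fixed <data> elements with a key attribute."""
    root = ET.Element("ddata", name=name)
    for key, value in values.items():
        ET.SubElement(root, "data", key=key).text = str(value)
    return root


values = {"a": 1234, "pi": 3.14159265, "bee": -500}

tags_xml = ET.tostring(dict_to_xml_tags("dictionary1", values),
                       encoding="unicode")
keyed_xml = ET.tostring(dict_to_xml_keyed("dictionary1", values),
                        encoding="unicode")
print(tags_xml)
print(keyed_xml)
```

One practical point in favour of the keyed form is that a key no longer has to
be a valid XML element name, and the schema can describe a fixed set of tags.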

Thanks,

Darren.