Hi Dave, more below ...

On 05/24/10 09:40 PM, Dave Miner wrote:
> On 05/24/10 10:40 AM, Darren Kenny wrote:
>> On 05/21/10 07:54 PM, Dave Miner wrote:
>>> On 05/19/10 10:34 AM, Darren Kenny wrote:
>>>> Hi,
>>>>
>>>> We would like to ask people to please review the Data Object Cache design
>>>> document, which can be found at:
>>>>
>>>>
>>>> http://hub.opensolaris.org/bin/download/Project+caiman/DataObjectCache/DataObjectCache%2DDesign%2D0.5.pdf
>>>>
>>>
>>> Overall, a very well-written document.  But, of course, I have comments :-)
>>
>> Thanks, and I'm not surprised there are comments...
>>>
>>> 2.1, third bullet: s/system/application/
>>
>> Sure, I just used the term "system" to refer to the complete mechanism, but
>> "application" probably makes more sense.
>>
>>>
>>> 2.2, last bullet: Would it more correct to say that it's the job of the
>>> application to define the object hierarchy, and hence the data tree, and
>>> that DOC is really just providing the infrastructure for the application
>>> to do that?  As written the DOC seems to be perhaps more omniscient than
>>> you mean for it.
>>
>> Well, not really, mainly since it was thought that the DOC could make some 
>> use
>> of the DTD / Schema (at a high level at least) to correctly structure the XML
>> generated from the DOC. At least this is something we discussed with Sarah.
>>
>> The Application doesn't really know as much about the XML any more, but
>> instead this is being deferred to each object in the DOC to know about their
>> own area - e.g. Targets know how to map Targets from/to XML, etc.
>>
>> By the DOC making use of the Schema knowledge it is able to put the XML
>> generated by various elements of the DOC into the correct places in the
>> Manifest... At least that's the theory...
>>
>
> So how is the DOC notified of the specific schema it is supposed to use?
>   I didn't see a mechanism for that.

Well, it was just going to be implemented to use the latest schema/DTD at the
time - do you think we would need to consider supporting older schemas? I
think that would be better served by using XSLT outside of the DOC.

>
>> This is largely being facilitated by the move towards DC and AI sharing the
>> same Schema, but if there are differences we are considering passing flags to
>> the to_xml(), and maybe from_xml() (but I think this should be less 
>> necessary)
>> to allow for differences e.g.:
>>
>>      DataObject.to_xml( manifest_type = MANIFEST_TYPE_AI )
>>
>
> I think that sort of approach has potential of pushing detailed
> application-specific knowledge into leaf objects.  That seems
> problematic to me.

Hmm, maybe it could have some effect, but it still seems to me to be the
correct place to make the decision, since each object knows best where it
fits in the overall scheme of things.

The only other alternative I can think of is to allow everything to generate
XML as if it's going into the overall schema, and then have the DOC run an
XSLT or similar over the generated XML tree to produce an AI or DC variant,
removing or modifying nodes as appropriate...
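
To make the flag approach concrete, a rough sketch of how an object might
branch on the manifest_type argument quoted above (the constants, class and
element names here are all hypothetical, and I'm using lxml purely for
illustration):

     from lxml import etree

     MANIFEST_TYPE_AI = "ai"    # hypothetical constants
     MANIFEST_TYPE_DC = "dc"

     class MyObject(object):
         def to_xml(self, manifest_type=MANIFEST_TYPE_AI):
             # Generate the common structure once...
             elem = etree.Element("my_element")
             # ...and branch only where the DC manifest differs from AI.
             if manifest_type == MANIFEST_TYPE_DC:
                 etree.SubElement(elem, "dc_only_child")
             return elem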

>
>>>
>>> 3.2 (sub-bullet of bullet 4) regarding the pickling requirements, is
>>> there a specific reference or more concrete example that you could
>>> provide to help ensure our object implementors get this right?
>>
>> The document at:
>>
>>      
>> http://docs.python.org/library/pickle.html#what-can-be-pickled-and-unpickled
>>
>> explains what can and can't be pickled out of the box, but it essentially
>> boils down to:
>>
>>      The following types can be pickled:
>>
>>          - None, True, and False
>>          - integers, long integers, floating point numbers, complex numbers
>>          - normal and Unicode strings
>>          - tuples, lists, sets, and dictionaries containing only
>>            picklable objects
>>          - functions defined at the top level of a module
>>          - built-in functions defined at the top level of a module
>>          - classes that are defined at the top level of a module
>>          - instances of such classes whose __dict__ or the result of
>>            calling __getstate__() is picklable.
>>
>> So essentially if you stick to normal Python types you should be fairly safe.
>>
>
> Please put the reference and clarification in your next revision.
>

Will do...
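
Something like this trivial sketch (class name hypothetical) shows the
point - an object built from plain Python types pickles out of the box:

     import pickle

     class DiskData(object):        # defined at the top level of a module
         def __init__(self, name, size):
             self.name = name       # plain string
             self.size = size       # plain integer

     disk = DiskData("c0t0d0", 1000)
     copy = pickle.loads(pickle.dumps(disk))    # round-trips cleanly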

>>>
>>> 3.2 bullet 5: Are there things we can do to help object implementers
>>> meet the schema consistency and re-creation constraints?
>>
>> The hope is that the person implementing a given object will be the most
>> familiar with what the schema dictates for its own XML equivalent, so it
>> shouldn't be that much of an issue if they focus only on their specific area.
>>
>> Of course that doesn't mean there isn't something we could do to help, but
>> I'm honestly not sure what we could provide other than a possible breakdown
>> of the Schema for each area / checkpoint.
>>
>
> I'm concerned that we're going to have developers stumbling around here
> trying to figure out how to get it right, just hoping we had a set of
> basic practices that would limit that.

Do you have any suggestions on how we might provide such things? I need to
think a little more about it before I can come up with something.

Certainly we could document some examples, taking a snippet of the schema and
showing how to generate the XML for it from the DOC - would that suffice?
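
For instance, something along these lines (purely illustrative - the element
names are invented rather than taken from the real schema):

     from lxml import etree

     # Schema snippet being documented (invented for illustration):
     #
     #     <!ELEMENT slice (name, size)>

     class Slice(object):
         def __init__(self, name, size):
             self.name = name
             self.size = size

         def to_xml(self):
             # Produce exactly the structure the snippet dictates.
             elem = etree.Element("slice")
             etree.SubElement(elem, "name").text = self.name
             etree.SubElement(elem, "size").text = str(self.size)
             return elem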

>
>>
>>>
>>> 3.3 Is there a particular reason we need to constrain these to
>>> Consolidation Private?  (In reality, since install is not really
>>> intended to be a separate consolidation, I'd prefer we avoided
>>> consolidation and went with either Project Private or one of the public
>>> levels).  Are you intending a later update (once you've gone further
>>> into implementation) with an imported interface table?
>>
>> There isn't really any good reason to constrain it to consolidation private -
>> I wasn't really sure what was best for something like this.
>>
>> Project Private would seem TOO constrained, in that it's been mentioned that
>> it might be something people outside of the install team would want to
>> utilize for adding another checkpoint (maybe?).
>>
>
> That seems to be an application-specific issue for DC, and the interface
> there is a manifest, not the APIs here.  Overall, I believe that at the
> moment these are best regarded as private.  We can open them further as
> we get experience and understand what should be stable.

OK.

>
>> Public would seem TOO open, possibly restricting things going forward.
>>
>> Maybe Uncommitted would be a better middle ground to require contracts for
>> anyone wishing to use the interface outside of the project.
>>
>> I would be fearful that having it Committed would be promising too much at
>> this point in time, but I would hope it would get to that point eventually
>> after a couple of iterations.
>>
>> I'm totally open to people's preferences here - but PSARC-approved
>> interfaces would seem to prefer the Committed option to Uncommitted.
>>
>> As for imported interfaces, it wasn't in the original document template we
>> had, so it really didn't cross our minds - but we should be able to add it
>> for what we know so far, though until implementation nothing is really set
>> in stone.
>>
>
> That seems no different than anything else here :-)

True...

>
>>>
>>> 3.4.1.1  Is there a reason not to require that names within a class be
>>> unique?  Not being able to depend on this seems to make some of the
>>> other interfaces where retrieval/deletion can use names less useful.
>>
>> One reason that we didn't restrict this is in the case of Targets, where you
>> may have something like:
>>
>>      Targets
>>          TargetDisk  [c0d0]
>>              Partition [p0]
>>              Partition [p1]
>>                  Slice [s0]
>>                  Slice [s1]
>>                  ...
>>                  Slice [s7]
>>          TargetDisk  [c2d0]
>>              Partition [p0]
>>                  Slice [s0]
>>                  Slice [s1]
>>                  ...
>>                  Slice [s7]
>>
>> As you can see the names in this case (partitions/slices) wouldn't be unique
>> in themselves, but would only be considered unique if you include the context
>> of the parents, i.e. c2d0/p0/s1.
>>
>
> I'm somewhat doubtful of that suggested taxonomy.  A slice (or
> partition) seems dependent on its parent device/partition, so I would
> expect the names to be fully-qualified.

I don't believe that would be the case at the moment in the schema design:

    ...
        <target_device>
            <type>
                <ctd>
                    <name>c1t0d0</name>
                    <slice>
                        <name>0</name>
                        <action>
                            <create>
                                <size>1000</size>
                            </create>
                        </action>
                    </slice>
                </ctd>
            </type>
        </target_device>
    ...

I think that the fully qualified name is certainly fetchable (e.g. by calling
slice.get_device_name()), but I don't think it should be necessary for a child
node to qualify its own name, as in:

     Targets
         TargetDisk  [c0d0]
             Partition [c0d0p0]
             Partition [c0d0p1]
                 Slice [c0d0s0]
                 Slice [c0d0s1]
                 ...
                 Slice [c0d0s7]
         TargetDisk  [c2d0]
             Partition [c2d0p0]
                 Slice [c2d0s0]
                 Slice [c2d0s1]
                 ...
                 Slice [c2d0s7]

this seems like redundant information being repeated unnecessarily, when it's
possible to derive it using full_name = parent.name + "/" + name ...
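
A minimal sketch of that derivation (only get_device_name() is a name from
the thread; the class and the "/" separator are illustrative):

     class DataObjectBase(object):
         def __init__(self, name, parent=None):
             self.name = name
             self.parent = parent

         def get_device_name(self):
             # Walk up the tree prefixing each ancestor's name, so a Slice
             # "s1" under Partition "p0" under Disk "c2d0" yields
             # "c2d0/p0/s1".
             if self.parent is None:
                 return self.name
             return self.parent.get_device_name() + "/" + self.name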

>
>> I suppose we could ask that the name be unique in any given child
>> list - but I don't think we could ask for it to be the case in the complete
>> tree of objects. This could also open up the ability to refer to children
>> using a dictionary, which might be useful...
>>
>
> That seems like a theoretical use, but it seems to compromise an
> immediate, practical use of the Name to allow it, so I guess I'm
> skeptical still.

And entitled to be ;)

>>> 3.4.1.3 to_xml().  I like the potential choice to have a parent generate
>>> a tree for its children, but I'm not sure how a general child class
>>> would know to return None if it were implemented to normally provide its
>>> own representation; especially if the parent would like to use the
>>> child's to_xml() to assist in its aggregation.  Should it perhaps be the
>>> case that to_xml() also returns a boolean that indicates whether descent
>>> within this object's subtree should continue?  Should this also apply to
>>> can_handle()/from_xml() so that the behavior can be fully symmetric?
>>
>> This is certainly possible to do. I'm honestly still delving into this area
>> in more depth to see what the best solution would be.
>>
>> But my thinking on it is that if the parent object is likely to do the XML
>> generation to include its children, then it's most probably the case that
>> the child wouldn't ever generate XML by itself.
>>
>
> I think that assumes a very static object hierarchy, and that's not an
> assumption I'm all that comfortable with at this point.  I'm also
> imagining parent objects that might wish to reprocess the children's xml
> for readability or something, but admittedly I don't have a very good
> case there to suggest right now.

Understood, but until there is a specific case I don't know that I can really
plan for it.

Do you really think that the hierarchy isn't going to be that static? Fine,
during the development cycle I can see it being very dynamic, but in the end I
would expect changes to be minimal, and if not, quite localized.

We originally made to_xml() work as it does here (and I've said this in
another e-mail to Keith too) as a convenience, to avoid requiring every
implementor of to_xml() to include a for-each-child loop by default - but
maybe it's too convenient?

I think it's simple enough to override this using the
"generates_xml_for_children()" mechanism below, if you really want more
control over the descent of the tree - if you return True there, then you are
free to do the descent yourself in a more managed way, allowing for the
reprocessing of the children's XML, if desired.
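
To be concrete, the default would be something like this sketch (only
generates_xml_for_children() itself is from the proposal), with the DOC-side
traversal as in the snippet quoted below:

     class DataObject(object):
         def generates_xml_for_children(self):
             # Default: the DOC descends into the children itself. An
             # object that wants to manage the descent (e.g. to reprocess
             # its children's XML) overrides this to return True.
             return False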

>
>> Of course, there's always some exception - I've just not thought of one 
>> yet...
>>
>> If we're to allow for such a case, it may be better to have a method
>> like "generates_xml_for_children()" which returns a boolean - I just don't
>> like methods that return tuples of values as an interface. So this would
>> make it more like:
>>
>>      if not obj.generates_xml_for_children():
>>          for child in obj.children():
>>              ...
>>
>> The default implementation would return False - and thus always traverse the
>> children.
>>
>
> The multi-return seems pretty Pythonic, which is why I suggested it, but
> either way would work, I guess.

Sure it is, but I just feel it's too generic, and could cause more programming
errors in the end.

>
>>> Finally, can you expand on the factors that are important to consider in
>>> the decision between static and dynamic registration?  My assumption
>>> would be to lean strongly towards dynamic for flexibility of this
>>> infrastructure, but I'm guessing there are factors that I'm not considering.
>>
>> I'm still thinking about this, but I think the main issue with static
>> registration is that it means you need access to the source code to update
>> the static registration file, which may not always be possible.
>>
>
> Yup.
>
>> I would certainly prefer dynamic myself; the question then is just how
>> dynamic we should be.
>>
>> One case I've been looking at is where, on import of a module, its
>> __init__.py would call something like:
>>
>>      DataObjectCache.register_class( MyClass )
>>
>> for each class that can be used to import XML in a module.
>>
>> This works quite well (I've source code that tests it), but the main issue is
>> that something needs to import the module... But I'm thinking that the
>> Application, in most cases, will already be doing this, but maybe there are
>> cases where it doesn't...
>>
>> In this case, we would need to consider something like a "signature" for a
>> module which says that it's an install module with objects that can import
>> XML. This would then require us to traverse the PYTHONPATH to find such a
>> signature.
>>
>> This latter option introduces a time penalty at start-up, but this may be
>> offset by the flexibility it provides.
>>
>> A signature that I would be thinking of is a special file like:
>>
>>      __install__init__.py
>>
>> which we could search for, and if it's found we would load that file and
>> execute it - it would then contain the register_class() methods...
>>
>> I've still to look into this in more depth, and was intending on doing it as
>> part of the implementation, but maybe I should pick one now...
>>
>
> Seems important for your consumers to know which way to go sooner than
> later.

Sure, and I'm working on a section about this...
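
In the meantime, the per-module variant quoted above would look roughly like
this (the module names and import path are placeholders, not the real ones):

     # my_module/__init__.py
     from data_object_cache import DataObjectCache    # placeholder path

     from my_module.targets import TargetDisk, Slice

     # Register each class in this module that can import XML.
     DataObjectCache.register_class(TargetDisk)
     DataObjectCache.register_class(Slice)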

>
>>>
>>> 3.4.2.1  A singleton here seems somewhat controversial to me.  Why isn't
>>> it the application's responsibility to control this?  An alternate
>>> formulation that I think accomplishes the goals here is to have the
>>> application provide the Engine with an instance, and the Checkpoint
>>> objects can always get it from their associated Engine.  Are there cases
>>> that this doesn't work for?  (I didn't attempt to map this to all the
>>> use cases so I'm not necessarily asserting it will, but it seems the
>>> more natural solution to me so I'm wondering if you considered it).
>>
>> During the prototype phase we did try something like this, having the engine
>> as the central point for getting a reference to the DOC, Logging, etc. but it
>> presented more problems where everything had to access the engine to get a
>> pointer to an instance, etc. - so instead we came up with each of
>> these being singletons in their own right, so that they could be accessed 
>> from
>> anywhere simply by calling XXX.get_instance() - but with the one caveat that
>> something (in this case the Application) needs to create the first instance
>> specifically - to ensure that ordering, etc is correct.
>>
>
> I guess I don't quite understand the problems it created; if anything I
> would expect Engine to be a natural singleton and using it to access the
> elements that are part of its environment seems pretty obvious to me.
> My feeling is that it's limiting the DataObjectCache, which seems to be
> a more generic component than an installation engine.  I can more easily
> imagine an application such as an authoring tool where I might want to
> use two different caches and move data from one as a starting point to
> another, so that's what I'm a bit stuck on here.

We tried using the Engine to access the Logger at the time, and I think that
was where we first encountered issues: the DOC also used the Logger, so to get
it we had to access the Engine, but the Engine in turn used the DOC. As a
result, circular dependencies were created on import, and they didn't appear
to be simple to remove.
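
For reference, the resulting get_instance() pattern is roughly this sketch
(the error handling is illustrative, not from the document):

     class DataObjectCache(object):
         _instance = None

         def __init__(self):
             # The Application creates the first instance explicitly, so
             # initialization order stays under its control.
             DataObjectCache._instance = self

         @classmethod
         def get_instance(cls):
             if cls._instance is None:
                 raise RuntimeError("DataObjectCache not created yet")
             return cls._instance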

>
>>>
>>> 3.4.2.4  Wouldn't clear() be useful in applications like the interactive
>>> installers where the user might go back to the first step in the parade
>>> of screens (might be useful as a variant of Use Case 1)?  Also, I didn't
>>> grok what "dump( indented )" is supposed to mean?
>>
>> True, clear() could have many uses...
>>
>> As for dump(), it's mainly for use in development or debugging, to generate
>> a "simple view" of the DOC at a given point in time. So you would get
>> something like:
>>
>>      DataObjectCache [root]
>>          Targets [DISCOVERED]
>>              Disk [c0t0d0]
>>                  Partition [p0]
>>                      ....
>>
>> and it uses the str() function to generate this, so an object may better
>> represent its output if it wishes.
>
> I didn't understand specifically what the "indent" argument was meant to
> do, though.  No indication was given what the developer would do with it.

Agreed, we could probably remove the indent argument.

>>> page 25, first paragraph regarding writing out checkpoints to the
>>> manifest: Seems like we need a mechanism for the application to inform
>>> each checkpoint of whether it's to be written out or not.  Not sure
>>> where that falls architecture-wise.
>>
>> We see it as the checkpoints themselves knowing this (see above) - they
>> simply need to return None if they are not going to generate anything.
>>
>
> I'm doubtful of this model yet.  Say we're going to have AI dump out a
> processed manifest for potential use as input to another AI; why would
> any checkpoints be included in that manifest?  Many, if not most, of its
> checkpoints are common with another application such as DC where the
> checkpoint may well need to be dumped.  So I'm again wondering how the
> checkpoint knows this itself without baking in knowledge of the
> containing app, which I find objectionable.

I can see where you're coming from, and maybe the correct approach is to
always generate the XML for nodes that have an XML equivalent, and allow the
Application to decide if it wants to pluck elements out of the generated XML -
so DC might leave it in, and AI would remove it. The use of XSLTs makes sense
here...
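
As a sketch of that (using lxml's XSLT support for illustration; the
"checkpoint" element name and the full_manifest variable are placeholders):

     from lxml import etree

     # Identity transform that drops <checkpoint> elements.
     strip_checkpoints = etree.XSLT(etree.XML("""
     <xsl:stylesheet version="1.0"
                     xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
       <xsl:template match="@*|node()">
         <xsl:copy><xsl:apply-templates select="@*|node()"/></xsl:copy>
       </xsl:template>
       <xsl:template match="checkpoint"/>
     </xsl:stylesheet>
     """))

     ai_manifest = strip_checkpoints(full_manifest)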

Thanks,

Darren.

