A few comments inline..
On 05/25/10 07:36 AM, Darren Kenny wrote:
Hi Dave, more below ...
On 05/24/10 09:40 PM, Dave Miner wrote:
On 05/24/10 10:40 AM, Darren Kenny wrote:
On 05/21/10 07:54 PM, Dave Miner wrote:
On 05/19/10 10:34 AM, Darren Kenny wrote:
Hi,
We would like to ask people to please review the Data Object Cache design
document, which can be found at:
http://hub.opensolaris.org/bin/download/Project+caiman/DataObjectCache/DataObjectCache%2DDesign%2D0.5.pdf
Overall, a very well-written document. But, of course, I have comments :-)
Thanks, and I'm not surprised there are comments...
2.1, third bullet: s/system/application/
Sure, I just used the term system to refer to the complete mechanism, but
application probably makes more sense.
2.2, last bullet: Would it more correct to say that it's the job of the
application to define the object hierarchy, and hence the data tree, and
that DOC is really just providing the infrastructure for the application
to do that? As written the DOC seems to be perhaps more omniscient than
you mean for it.
Well, not really, mainly since it was thought that the DOC could make some use
of the DTD / Schema (at a high level at least) to correctly structure the XML
generated from the DOC. At least this is something we discussed with Sarah.
The Application doesn't really know as much about the XML any more; instead
this is deferred to each object in the DOC, which knows about its own area -
e.g. Targets know how to map Targets from/to XML, etc.
By making use of the Schema knowledge, the DOC is able to put the XML
generated by various elements of the DOC into the correct places in the
Manifest... At least that's the theory...
So how is the DOC notified of the specific schema it is supposed to use?
I didn't see a mechanism for that.
Well, it was just going to be implemented to use the latest schema/DTD at the
time - do you think we would need to consider supporting older schemas? I
think that would be better serviced by using XSLT outside of the DOC.
Somewhere we have to understand the schema(s) we are operating under so
we can successfully dump an AI manifest. The DOC has to have this data
provided to it. My assumption was that for any run of AI we would have
validated the AI manifest against a schema and this schema would be used
to drive the manifest output in the DOC.
This is largely being facilitated by the move towards DC and AI sharing the
same Schema, but if there are differences we are considering passing flags to
to_xml(), and maybe from_xml() (though I think that should be less necessary),
to allow for differences, e.g.:
DataObject.to_xml( manifest_type = MANIFEST_TYPE_AI )
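To make that concrete, here's a minimal sketch of how such a flag might look
on an object's to_xml() - the constant values and the lxml usage here are my
own assumptions for illustration, not the documented API:

    from lxml import etree

    MANIFEST_TYPE_AI = "ai"
    MANIFEST_TYPE_DC = "dc"

    class TargetDisk(object):
        def __init__(self, name):
            self.name = name

        def to_xml(self, manifest_type=MANIFEST_TYPE_AI):
            # Common representation shared by both manifest types.
            elem = etree.Element("target_device")
            etree.SubElement(elem, "name").text = self.name
            # Diverge only where the two schemas actually differ
            # (this attribute is purely hypothetical).
            if manifest_type == MANIFEST_TYPE_DC:
                elem.set("origin", "distro-constructor")
            return elem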
I think that sort of approach has potential of pushing detailed
application-specific knowledge into leaf objects. That seems
problematic to me.
Hmm, maybe it could have some effect, but it still seems to me to be the
correct place to have the decision made, since each object would best know
where it fits in the overall scheme of things.
I had thought that the object wouldn't have to know it is
translating itself to a specific manifest. My thought was that any
object could translate itself to xml, and that the manifest comes from
the fact that we have the schema and we know the order in which elements
and attributes must appear. Is this not possible?
The only other alternative I can think of is to allow everything to generate
XML as if it's going into the overall schema, and then have the DOC later run
an XSLT or similar over the generated XML tree to produce an AI or DC variant,
removing or modifying nodes as appropriate...
If the above isn't possible then I think that we have to consider
something like this. That is every object translates itself and we use
xslt to transform this data into a valid AI manifest.
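Something along these lines, perhaps - a rough lxml sketch, where the
stylesheet contents and the element name it strips are purely illustrative:

    from lxml import etree

    # Identity transform that copies everything, plus one template that
    # drops elements an AI manifest should not contain.
    STRIP_DC_ONLY = etree.XML("""
    <xsl:stylesheet version="1.0"
        xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
      <xsl:template match="@*|node()">
        <xsl:copy><xsl:apply-templates select="@*|node()"/></xsl:copy>
      </xsl:template>
      <xsl:template match="dc_checkpoint"/>
    </xsl:stylesheet>
    """)

    def to_ai_manifest(doc_tree):
        transform = etree.XSLT(STRIP_DC_ONLY)
        return transform(doc_tree)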
3.2 (sub-bullet of bullet 4) regarding the pickling requirements, is
there a specific reference or more concrete example that you could
provide to help ensure our object implementors get this right?
The document at:
http://docs.python.org/library/pickle.html#what-can-be-pickled-and-unpickled
explains what can and cannot be pickled out of the box, but it essentially
boils down to this:
The following types can be pickled:
- None, True, and False
- integers, long integers, floating point numbers, complex numbers
- normal and Unicode strings
- tuples, lists, sets, and dictionaries containing only picklable
objects
- functions defined at the top level of a module
- built-in functions defined at the top level of a module
- classes that are defined at the top level of a module
- instances of such classes whose __dict__ or the result of calling
__getstate__() is picklable.
So essentially if you stick to normal Python types you should be fairly safe.
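A trivial illustration of the constraint - a top-level class whose attributes
are plain Python types round-trips cleanly:

    import pickle

    class Slice(object):            # defined at the top level of a module
        def __init__(self, name, size):
            self.name = name        # string
            self.size = size        # integer

    s = pickle.loads(pickle.dumps(Slice("s0", 1000)))
    assert s.name == "s0" and s.size == 1000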
Please put the reference and clarification in your next revision.
Will do...
3.2 bullet 5: Are there things we can do to help object implementers
meet the schema consistency and re-creation constraints?
The hope is that the person implementing a given object will be the most
familiar with what the schema dictates for its own XML equivalent, so it
shouldn't be that much of an issue if they focus only on their specific area.
Of course that doesn't mean there isn't something we could do to help, but I'm
honestly not sure what we could provide other than a possible breakdown of the
Schema for each area / checkpoint.
I'm concerned that we're going to have developers stumbling around here
trying to figure out how to get it right; I was just hoping we'd have a set of
basic practices that would limit that.
Do you have any suggestions on how we might provide such things? I need to
think a little more about it before I can come up with something.
Certainly we could document some examples - taking a snippet of the schema and
showing how to generate its XML from the DOC. Would that suffice?
I would think that a developer should be able to take the specific
object and map it to the portion of the schema that it is part of. The
way the schemas are defined - "transfer", "target", "execution",
"configuration" - means that the objects representing these are contained
in those schemas, and from my perspective it should be relatively
straightforward to map the object to xml. The developer can also
validate the xml against the specific sub-schema for which they are
dumping xml (see the sketch below). They would of course have to validate
multiple objects' output, in the correct order, but it seems reasonable
to have them do this since the schemas are going to be modular.
Providing examples would be helpful, and developers can use the xml
instance document examples I have in the soon to be released schema
design document to guide them as well.
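As a sketch of that per-object validation - assuming lxml and a RELAX NG
fragment for the target section; the file name and the target_obj variable
are invented for illustration:

    from lxml import etree

    # Validate one object's output against just the sub-schema it maps to.
    relaxng = etree.RelaxNG(etree.parse("target.rng"))
    xml = target_obj.to_xml()
    if not relaxng.validate(etree.ElementTree(xml)):
        print(relaxng.error_log)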
If an object doesn't provide its own xml translation, the only other
alternative I can think of is that the DOC knows how to do this for
each object that is part of a valid manifest. My concern about having
the DOC do this is that when things change in the objects, the DOC has
to track this separately, as opposed to having the developer make the
changes directly in the object itself.
3.3 Is there a particular reason we need to constrain these to
Consolidation Private? (In reality, since install is not really
intended to be a separate consolidation, I'd prefer we avoided
consolidation and went with either Project Private or one of the public
levels). Are you intending a later update (once you've gone further
into implementation) with an imported interface table?
There isn't really any good reason to constrain it to Consolidation Private; I
wasn't really sure what was best for something like this.
Project Private would seem TOO constrained, in that it's been mentioned that
it might be something people outside of the install team would want to use
for adding another checkpoint (maybe?).
That seems to be an application-specific issue for DC, and the interface
there is a manifest, not the API's here. Overall, I believe that at the
moment these are best regarded as private. We can open them further as
we get experience and understand what should be stable.
OK.
Public would seem TOO open, possibly restricting things going forward.
Maybe Uncommitted would be a better middle ground, requiring contracts for
anyone wishing to use the interface outside of the project.
I would be fearful that having it Committed would be promising too much at
this point in time, but I would hope it would get to that point eventually
after a couple of iterations.
I'm totally open to people's preferences here - but PSARC-approved interfaces
would seem to favour the Committed option over Uncommitted.
As for imported interfaces, it wasn't in the original document template we
had, so it really didn't cross our minds - but we should be able to add it for
what we know so far, though until implementation nothing is really set in
stone.
That seems no different than anything else here :-)
True...
3.4.1.1 Is there a reason not to require that names within a class be
unique? Not being able to depend on this seems to make some of the
other interfaces where retrieval/deletion can use names less useful.
One reason that we didn't restrict this is in the case of Targets, where you
may have something like:
Targets
  TargetDisk [c0d0]
    Partition [p0]
    Partition [p1]
      Slice [s0]
      Slice [s1]
      ...
      Slice [s7]
  TargetDisk [c2d0]
    Partition [p0]
      Slice [s0]
      Slice [s1]
      ...
      Slice [s7]
As you can see the names in this case (partitions/slices) wouldn't be unique
in themselves, but would only be considered unique if you include the context
of the parents, i.e. c2d0/p0/s1.
I'm somewhat doubtful of that suggested taxonomy. A slice (or
partition) seems dependent on its parent device/partition, so I would
expect the names to be fully-qualified.
I don't believe that would be the case at the moment in the schema design:
...
<target_device>
  <type>
    <ctd>
      <name>c1t0d0</name>
      <slice>
        <name>0</name>
        <action>
          <create>
            <size>1000</size>
          </create>
        </action>
      </slice>
    </ctd>
  </type>
</target_device>
...
I think that the fully qualified name is certainly fetchable (e.g. by calling
slice.get_device_name()), but I don't think it should be necessary for a child
node to qualify its name in itself, as in:
Targets
  TargetDisk [c0d0]
    Partition [c0d0p0]
    Partition [c0d0p1]
      Slice [c0d0s0]
      Slice [c0d0s1]
      ...
      Slice [c0d0s7]
  TargetDisk [c2d0]
    Partition [c2d0p0]
      Slice [c2d0s0]
      Slice [c2d0s1]
      ...
      Slice [c2d0s7]
this seems like redundant information being repeated unnecessarily, when it's
possible to derive it using full_name = parent.name + name ...
The current schemas do not have children that have fully qualified
names. The plan moving forward was to keep this the same, and associate
a child with the parent, and get the full name that way. I agree with
Darren that storing the info in the child seems redundant.
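Something like this sketch is what I'd imagine - deriving the full name by
walking the parents rather than storing it in each child (the attribute and
method names are assumptions):

    class DataObjectBase(object):
        def __init__(self, name, parent=None):
            self.name = name
            self.parent = parent

        def get_device_name(self):
            # Walk up to the root, then join: Slice "s1" under
            # Partition "p0" under Disk "c2d0" -> "c2d0p0s1".
            parts = []
            node = self
            while node is not None:
                parts.append(node.name)
                node = node.parent
            return "".join(reversed(parts))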
I do have a question though... the delete_child() method indicates that
this will remove a specific child class. Is this method on the parent
class so that you know which parent to use?
I suppose we could ask that the name be unique in any given child
list - but I don't think we could ask for it to be the case in the complete
tree of objects. This could also open up the ability to refer to children
using a dictionary, which might be useful...
That seems like a theoretical use, but allowing it seems to compromise an
immediate, practical use of the Name, so I guess I'm still skeptical.
And entitled to be ;)
3.4.1.3 to_xml(). I like the potential choice to have a parent generate
a tree for its children, but I'm not sure how a general child class
would know to return None if it were implemented to normally provide its
own representation; especially if the parent would like to use the
child's to_xml() to assist in its aggregation. Should it perhaps be the
case that to_xml() also returns a boolean that indicates whether descent
within this object's subtree should continue? Should this also apply to
can_handle()/from_xml() so that the behavior can be fully symmetric?
This is certainly possible to do. I'm honestly still delving into this area in
more depth to see what the best solution would be.
But my thinking on it is that if it's likely that the parent object will do
the XML generation to include its children, then it's most probably the case
that the child wouldn't ever generate XML by itself.
I would think that if a parent is generating the xml for itself and its
children, it would still rely on each child to provide its xml
representation in order to aggregate the data into a tree. Is this not
possible or desirable for some reason? Why would we want the parent to
generate the xml for itself and its children without traversing the children?
thanks,
sarah
I think that assumes a very static object hierarchy, and that's not an
assumption I'm all that comfortable with at this point. I'm also
imagining parent objects that might wish to reprocess the children's xml
for readability or something, but admittedly I don't have a very good
case there to suggest right now.
Understood, but I think that until there is a specific case I don't know if I
can really plan for it.
Do you really think that the hierarchy isn't going to be that static? During
the development cycle I can see it being very dynamic, but in the end I would
think changes will be minimal and, if not, quite localized.
We originally made to_xml() work as it does here (and I've said this in
another e-mail to Keith too) to avoid requiring that every implementor of
to_xml() always include a for-each-child statement - a convenience, but maybe
it's too convenient?
I think it's simple enough to override this using the
"generates_xml_for_children()" mechanism below, if you really want more
control over the descent of the tree - if you return True here, then you are
certainly free to do the descent yourself in a more managed way, thus
allowing for the reprocessing of the XML from children, if desired.
Of course, there's always some exception - I've just not thought of one yet...
If we're to allow for such a case, it may be better to have a method
like "generates_xml_for_children()" which returns a boolean - I just don't
like methods that return tuples of values as an interface. So this would make
it more like:

    if not obj.generates_xml_for_children():
        for child in obj.children():
            ...

The default implementation would return False - and this always traverses the
children.
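In sketch form, the default descent the DOC would do might look like this
(children() and to_xml() are assumed to be provided by each object):

    class DataObject(object):
        def generates_xml_for_children(self):
            return False    # default: the DOC does the descent

    def build_xml_tree(obj):
        elem = obj.to_xml()
        if elem is not None and not obj.generates_xml_for_children():
            for child in obj.children():
                child_elem = build_xml_tree(child)
                if child_elem is not None:
                    elem.append(child_elem)
        return elem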
The multi-return seems pretty Pythonic, which is why I suggested it, but
either way would work, I guess.
Sure it is, but I just feel it's too generic, and could cause more programming
errors in the end.
Finally, can you expand on the factors that are important to consider in
the decision between static and dynamic registration? My assumption
would be to lean strongly towards dynamic for flexibility of this
infrastructure, but I'm guessing there are factors that I'm not considering.
I'm still thinking about this, but I think the main issue with static
registration is that it means you need access to the source code to update the
static registration file, which may not always be possible.
Yup.
I would certainly prefer dynamic myself; the question then is just how dynamic
we should be.
One case I've been looking at is where, on import of a module, its __init__.py
would call something like:
DataObjectCache.register_class( MyClass )
for each class in the module that can be used to import XML.
This works quite well (I have source code that tests it), but the main issue
is that something needs to import the module... I'm thinking that the
Application, in most cases, will already be doing this, but maybe there are
cases where it doesn't...
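For example, a module's __init__.py might do no more than this (the module
and import paths are invented):

    # mymodule/__init__.py
    from data_object_cache import DataObjectCache
    from mymodule.target import TargetDisk, Partition, Slice

    for cls in (TargetDisk, Partition, Slice):
        DataObjectCache.register_class(cls)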
In this case, we would need to consider something like a "signature" for a
module which says that it's an install module with objects that can import
XML. This would then require us to traverse the PYTHONPATH to find such a
signature.
This latter option introduces a time penalty at start-up, but this may be
offset by the flexibility it provides.
The signature I have in mind is a special file like:
__install__init__.py
which we could search for; if it's found we would load that file and
execute it - it would then contain the register_class() calls...
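A rough sketch of that search, with sys.path standing in for PYTHONPATH and
error handling omitted:

    import os
    import sys

    SIGNATURE = "__install__init__.py"

    def load_install_signatures():
        # Walk each path entry looking for signature files and execute
        # any we find; this is the start-up cost mentioned above.
        for entry in sys.path:
            for dirpath, dirnames, filenames in os.walk(entry):
                if SIGNATURE in filenames:
                    path = os.path.join(dirpath, SIGNATURE)
                    source = open(path).read()
                    exec(compile(source, path, "exec"), {})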
I've still to look into this in more depth, and was intending on doing it as
part of the implementation, but maybe I should pick one now...
Seems important for your consumers to know which way to go sooner than
later.
Sure, and I'm working on a section about this...
3.4.2.1 A singleton here seems somewhat controversial to me. Why isn't
it the application's responsibility to control this? An alternate
formulation that I think accomplishes the goals here is to have the
application provide the Engine with an instance, and the Checkpoint
objects can always get it from their associated Engine. Are there cases
that this doesn't work for? (I didn't attempt to map this to all the
use cases so I'm not necessarily asserting it will, but it seems the
more natural solution to me so I'm wondering if you considered it).
During the prototype phase we did try something like this, having the engine
as the central point for getting a reference to the DOC, Logging, etc., but it
presented more problems where everything had to access the engine to get a
pointer to an instance. So instead we made each of these a singleton in its
own right, so that they could be accessed from anywhere simply by calling
XXX.get_instance() - with the one caveat that something (in this case the
Application) needs to create the first instance explicitly, to ensure that
ordering, etc. is correct.
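In outline, the pattern is just this (a sketch - the real class would do
more):

    class DataObjectCache(object):
        _instance = None

        def __init__(self):
            # The Application creates the first (and only) instance.
            if DataObjectCache._instance is not None:
                raise RuntimeError("DataObjectCache already exists")
            DataObjectCache._instance = self

        @classmethod
        def get_instance(cls):
            # Checkpoints, the engine, etc. call this from anywhere.
            if cls._instance is None:
                raise RuntimeError("Application must create the cache first")
            return cls._instance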
I guess I don't quite understand the problems it created; if anything I
would expect Engine to be a natural singleton and using it to access the
elements that are part of its environment seems pretty obvious to me.
My feeling is that it's limiting the DataObjectCache, which seems to be
a more generic component than an installation engine. I can more easily
imagine an application such as an authoring tool where I might want to
use two different caches and move data from one as a starting point to
another, so that's what I'm a bit stuck on here.
We tried using the Engine to access the Logger at the time, and I think that
was where we first encountered issues: the DOC also used the Logger, so to get
it we had to access the Engine, but the Engine in turn used the DOC, and as a
result circular dependencies were created on import which didn't appear to be
simple to remove.
3.4.2.4 Wouldn't clear() be useful in applications like the interactive
installers where the user might go back to the first step in the parade
of screens (might be useful as a variant of Use Case 1)? Also, I didn't
grok what "dump( indented )" is supposed to mean?
True, clear() could have many uses...
As for dump(), it's mainly for use in development or debugging, to generate a
"simple view" of the DOC at a given point in time. So you would get something
like:
DataObjectCache [root]
  Targets [DISCOVERED]
    Disk [c0t0d0]
      Partition [p0]
      ....
and it uses the str() function to generate this, so an object may better
represent its output if it wishes.
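A sketch of how dump() might be built on str() - the children() method is an
assumption:

    def dump(obj, depth=0):
        # Indent each level by two spaces; str() gives the per-object line.
        text = "  " * depth + str(obj) + "\n"
        for child in obj.children():
            text += dump(child, depth + 1)
        return text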
I didn't understand specifically what the "indent" argument was meant to
do, though. No indication was given what the developer would do with it.
Agreed, we could probably remove the indent argument.
page 25, first paragraph regarding writing out checkpoints to the
manifest: Seems like we need a mechanism for the application to inform
each checkpoint of whether it's to be written out or not. Not sure
where that falls architecture-wise.
We see this as the checkpoints themselves knowing it (see above) - they simply
need to return None if they are not going to generate anything.
I'm doubtful of this model yet. Say we're going to have AI dump out a
processed manifest for potential use as input to another AI; why would
any checkpoints be included in that manifest? Many, if not most, of its
checkpoints are common with another application such as DC where the
checkpoint may well need to be dumped. So I'm again wondering how the
checkpoint knows this itself without baking in knowledge of the
containing app, which I find objectionable.
I can see where you're coming from, and maybe the correct approach is to
always generate the XML for nodes that have an XML equivalent, and allow the
Application to decide if it wants to pluck elements out of the generated XML -
so DC might leave it in, and AI would remove it - the use of XSLTs makes sense
here...
Thanks,
Darren.
_______________________________________________
caiman-discuss mailing list
[email protected]
http://mail.opensolaris.org/mailman/listinfo/caiman-discuss