On 05/25/10 08:53 AM, Dave Miner wrote:
On 05/25/10 09:36 AM, Darren Kenny wrote:
Hi Dave, more below ...
And yet more in-line from me...
On 05/24/10 09:40 PM, Dave Miner wrote:
On 05/24/10 10:40 AM, Darren Kenny wrote:
On 05/21/10 07:54 PM, Dave Miner wrote:
On 05/19/10 10:34 AM, Darren Kenny wrote:
Hi,
We would like to ask people to please review the Data Object
Cache design
document, which can be found at:
http://hub.opensolaris.org/bin/download/Project+caiman/DataObjectCache/DataObjectCache%2DDesign%2D0.5.pdf
Overall, a very well-written document. But, of course, I have
comments :-)
Thanks, and I'm not surprised there are comments...
2.1, third bullet: s/system/application/
Sure, I just used the term system as referring to the complete mechanism, but application probably makes more sense.
2.2, last bullet: Would it more correct to say that it's the job
of the
application to define the object hierarchy, and hence the data
tree, and
that DOC is really just providing the infrastructure for the
application
to do that? As written the DOC seems to be perhaps more
omniscient than
you mean for it.
Well, not really, mainly since it was thought that the DOC could
make some use
of the DTD / Schema (at a high level at least) to correctly
structure the XML
generated from the DOC. At least this is something we discussed
with Sarah.
The Application doesn't really know as much about the XML any more,
but
instead this is being deferred to each object in the DOC to know
about their
own area - e.g. Targets know how to map Targets from/to XML, etc.
By the DOC making use of the Schema knowledge it is able to put the
XML
generated by various elements of the DOC into the correct places in
the
Manifest... At least that's the theory...
So how is the DOC notified of the specific schema it is supposed to
use?
I didn't see a mechanism for that.
Well, it was just going to be implemented to use the latest
schema/DTD at the
time - do you think we would need to consider supporting older
schemas? I
think that this would be better serviced by using XSLT outside of it.
I would think that some of the compatibility scenarios might require
using an older schema. This seems simple to allow for, at any rate.
This is largely being facilitated by the move towards DC and AI sharing the same Schema, but if there are differences we are considering passing flags to to_xml(), and maybe from_xml() (though I think this should be less necessary), to allow for differences, e.g.:
DataObject.to_xml( manifest_type = MANIFEST_TYPE_AI )
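A minimal sketch of what such a flag might look like (the MANIFEST_TYPE_* values and the element names here are hypothetical, not settled design):

```python
import xml.etree.ElementTree as ET

# Hypothetical flag values; the real names are not defined yet.
MANIFEST_TYPE_AI = "ai"
MANIFEST_TYPE_DC = "dc"

class DataObject:
    def __init__(self, name):
        self.name = name

    def to_xml(self, manifest_type=MANIFEST_TYPE_AI):
        elem = ET.Element("data_object", name=self.name)
        # Each object varies its own output per application here,
        # which is exactly the coupling being debated.
        if manifest_type == MANIFEST_TYPE_DC:
            elem.set("dc_only", "true")
        return elem

obj = DataObject("example")
print(ET.tostring(obj.to_xml(MANIFEST_TYPE_DC)).decode())
```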
I think that sort of approach has potential of pushing detailed
application-specific knowledge into leaf objects. That seems
problematic to me.
Hmm, maybe it could have some effect, but it still seems to me to be the correct place to have the decision made, since each object would best know where it fits in the overall scheme of things.
The only other alternative I can think of is to allow everything to generate XML as if it's going into the overall schema, and then have the DOC later run an XSLT or similar over the generated XML tree to produce an AI or DC variant, removing nodes or modifying them as appropriate...
I would think that it would be more correct for the objects to be
application-agnostic in general, with specific applications
implementing subclasses if needed. Why wouldn't that be the preferred
solution?
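A sketch of that alternative, with hypothetical class names: the base object emits one canonical form, and an application-specific subclass overrides only what differs:

```python
import xml.etree.ElementTree as ET

class Slice:
    """Application-agnostic base class: one canonical XML form."""
    def __init__(self, name):
        self.name = name

    def to_xml(self):
        return ET.Element("slice", name=self.name)

class AISlice(Slice):
    """AI-specific subclass; overrides only the AI differences."""
    def to_xml(self):
        elem = super().to_xml()
        elem.set("action", "create")  # hypothetical AI-only attribute
        return elem

print(ET.tostring(Slice("0").to_xml()).decode())
print(ET.tostring(AISlice("0").to_xml()).decode())
```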
3.2 (sub-bullet of bullet 4) regarding the pickling requirements, is
there a specific reference or more concrete example that you could
provide to help ensure our object implementors get this right?
The document at:
http://docs.python.org/library/pickle.html#what-can-be-pickled-and-unpickled
explains what can't be pickled out of the box, but it essentially boils down to this - the following types can be pickled:
- None, True, and False
- integers, long integers, floating point numbers, complex numbers
- normal and Unicode strings
- tuples, lists, sets, and dictionaries containing only picklable objects
- functions defined at the top level of a module
- built-in functions defined at the top level of a module
- classes that are defined at the top level of a module
- instances of such classes whose __dict__ or __setstate__() is picklable
So essentially if you stick to normal Python types you should be
fairly safe.
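To make that concrete, a quick illustration of the boundary - plain types round-trip through pickle, while something like a lambda does not:

```python
import pickle

# Plain Python types from the list above round-trip cleanly.
data = {"name": "c0d0", "slices": ["s0", "s1"], "size": 1000, "root": True}
restored = pickle.loads(pickle.dumps(data))
print(restored == data)  # → True

# A lambda is not a function defined at the top level of a
# module, so pickling it fails:
try:
    pickle.dumps(lambda s: s.upper())
    print("pickled")
except (pickle.PicklingError, AttributeError, TypeError) as exc:
    print("not picklable:", type(exc).__name__)
```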
Please put the reference and clarification in your next revision.
Will do...
3.2 bullet 5: Are there things we can do to help object implementers
meet the schema consistency and re-creation constraints?
The hope is that the person implementing a given object will be the most familiar with what the schema dictates for its own XML equivalent, so it shouldn't be that much of an issue if they focus only on their specific area.
Of course that doesn't mean there isn't something we could do to
help, but I'm
honestly not sure what we could provide other than a possible
breakdown of the
Schema for each area / checkpoint.
I'm concerned that we're going to have developers stumbling around here trying to figure out how to get it right; I was just hoping we had a set of basic practices that would limit that.
Do you have any suggestions on how we might provide such things? I need to think a little more about it before I can come up with something.
Certainly we could document some examples - taking a snippet of the schema and showing how to generate it in XML from the DOC. Would that suffice?
It would certainly be a start, at least.
3.3 Is there a particular reason we need to constrain these to
Consolidation Private? (In reality, since install is not really
intended to be a separate consolidation, I'd prefer we avoided
consolidation and went with either Project Private or one of the
public
levels). Are you intending a later update (once you've gone further
into implementation) with an imported interface table?
There isn't really any good reason to constrain it to Consolidation Private; I wasn't really sure what was best for something like this.
Project Private would seem TOO constrained, in that it's been mentioned that it might be something people outside of the install team would want to utilize for adding another checkpoint (maybe?).
That seems to be an application-specific issue for DC, and the interface there is a manifest, not the APIs here. Overall, I believe that at the moment these are best regarded as private. We can open them further as we get experience and understand what should be stable.
OK.
Public would seem TOO open, possibly restricting things going forward.
Maybe Uncommitted would be a better middle ground, requiring contracts for anyone wishing to use the interface outside of the project.
I would be fearful that having it Committed would be promising too much at this point in time, but I would hope it would get to that point eventually after a couple of iterations.
I'm totally open to people's preferences here - but PSARC-approved i/fs would seem to prefer the Committed option to Uncommitted.
As for imported interfaces, it wasn't in the original document template we had, so it really didn't cross our minds - but we should be able to add it for what we know so far, though until implementation nothing really is set in stone.
That seems no different than anything else here :-)
True...
3.4.1.1 Is there a reason not to require that names within a
class be
unique? Not being able to depend on this seems to make some of the
other interfaces where retrieval/deletion can use names less useful.
One reason that we didn't restrict this is in the case of Targets,
where you
may have something like:
Targets
    TargetDisk [c0d0]
        Partition [p0]
        Partition [p1]
            Slice [s0]
            Slice [s1]
            ...
            Slice [s7]
    TargetDisk [c2d0]
        Partition [p0]
            Slice [s0]
            Slice [s1]
            ...
            Slice [s7]
As you can see, the names in this case (partitions/slices) wouldn't be unique in themselves, but would only be considered unique if you include the context of their parents, i.e. c2d0/p0/s1.
I'm somewhat doubtful of that suggested taxonomy. A slice (or
partition) seems dependent on its parent device/partition, so I would
expect the names to be fully-qualified.
I don't believe that would be the case at the moment in the schema
design:
...
<target_device>
    <type>
        <ctd>
            <name>c1t0d0</name>
            <slice>
                <name>0</name>
                <action>
                    <create>
                        <size>1000</size>
                    </create>
                </action>
            </slice>
        </ctd>
    </type>
</target_device>
...
Well, I don't think we really have a final schema, but I would
certainly be looking for opportunities to make the notation more
concise; using a fully-specified slice device directly might do that.
Remember that one of the objections to usability of XML is its
perceived verbosity; I'd like to not exacerbate that unnecessarily.
I have modified the schema from the original proposal, which I agree was
too verbose. Here is a snippet of data from this new target schema:
<target>
    <target_device is_root="true">
        <type>
            <zpool name="sarahs_pool" action="create">
                <vdev>
                    <mirror>
                        <disk>
                            <ctd name="c1t0d0"></ctd>
                        </disk>
                        <disk>
                            <ctd name="c1t1d0"></ctd>
                        </disk>
                    </mirror>
                </vdev>
                <vdev>
                    <raidz>
                        <slice name="c1t2d0s0"></slice>
                        <slice name="c1t3d0s0"></slice>
As you can see we do have the fully qualified names for disks and
slices. The groupings you see, such as vdev->mirror or vdev->raidz are
there to provide the correct encapsulation of the definition for, in
this case, a zpool, that can have multiple vdevs defined and possibly
different types.
You can also specify a slice name without its parent in this schema. The flat form is an attempt to provide a flatter manifest; however, if you don't want to provide the fully qualified name, you can nest the elements instead, for example:
<disk>
    <ctd name="c1t1d0">
        <slice name="0">
This is allowed if a user wants to do it, but it isn't required.
So, I think that we have to allow for both of these naming schemes and provide the ability to get the parent of a child to derive the child's full name.
thanks,
sarah
I think that the fully qualified name is certainly fetch-able (e.g. by calling slice.get_device_name()), but I don't think it should be necessary for a child node to qualify its name in itself, as in:
Targets
    TargetDisk [c0d0]
        Partition [c0d0p0]
        Partition [c0d0p1]
            Slice [c0d0s0]
            Slice [c0d0s1]
            ...
            Slice [c0d0s7]
    TargetDisk [c2d0]
        Partition [c2d0p0]
            Slice [c2d0s0]
            Slice [c2d0s1]
            ...
            Slice [c2d0s7]
This seems like redundant information being repeated unnecessarily, when it's possible to derive it using full_name = parent.name + name ...
I suppose we could ask that the name be unique within any given child list - but I don't think we could ask for that to be the case across the complete tree of objects. This could also open up the ability to refer to children using a dictionary, which might be useful...
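A sketch of how both points might combine - names unique only among siblings (which enables a dictionary of children), with the full name derived by walking the parent chain; all class and member names here are illustrative:

```python
class DataObject:
    """Sketch: names need only be unique among siblings; the full
    name is derived by walking up the parent chain."""
    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent
        self.children = {}  # per-parent uniqueness enables dict lookup
        if parent is not None:
            if name in parent.children:
                raise ValueError("duplicate name among siblings: " + name)
            parent.children[name] = self

    def full_name(self):
        if self.parent is None:
            return self.name
        return self.parent.full_name() + "/" + self.name

root = DataObject("Targets")
disk = DataObject("c2d0", root)
part = DataObject("p0", disk)
slc = DataObject("s1", part)
print(slc.full_name())  # → Targets/c2d0/p0/s1
```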
That seems like a theoretical use, but allowing it seems to compromise an immediate, practical use of the Name, so I guess I'm still skeptical.
And entitled to be ;)
3.4.1.3 to_xml(). I like the potential choice to have a parent
generate
a tree for its children, but I'm not sure how a general child class
would know to return None if it were implemented to normally
provide its
own representation; especially if the parent would like to use the
child's to_xml() to assist in its aggregation. Should it perhaps
be the
case that to_xml() also returns a boolean that indicates whether
descent
within this object's subtree should continue? Should this also
apply to
can_handle()/from_xml() so that the behavior can be fully symmetric?
This is certainly possible to do. I'm honestly still delving into this area in more depth to see what the best solution would be.
But my thinking on it is that if it's likely that the parent object would do the XML generation to include its children, then it's most probably the case that the child wouldn't ever generate XML by itself.
I think that assumes a very static object hierarchy, and that's not an
assumption I'm all that comfortable with at this point. I'm also
imagining parent objects that might wish to reprocess the children's
xml
for readability or something, but admittedly I don't have a very good
case there to suggest right now.
Understood, but I think that until there is a specific case I don't know if I can really plan for it.
Do you really think that the hierarchy isn't going to be that static? Fine, during the development cycle I can see it being very dynamic, but in the end I would think a lot of it will be minimal, and if not, quite localized.
We originally made to_xml() work as it does here (and I've said this in another e-mail to Keith too) to avoid the requirement that every implementor of to_xml() always include a for-each-child statement by default, as a convenience - but maybe it's too convenient?
I think that it's simple enough to override this using the "generates_xml_for_children()" mechanism below, if you really want to have more control over the descent of the tree - so if you returned True here, then you are certainly free to do the descent yourself in a more managed way, thus allowing for the reprocessing of the XML from children, if desired.
I think that's close enough.
Of course, there's always some exception - I've just not thought of
one yet...
If we're to allow for such a case, it may be better to have a method like "generates_xml_for_children()" which returns a boolean - I just don't like methods that return tuples of values as an interface. So this would make it more like:
if not obj.generates_xml_for_children():
    for child in obj.children():
        ...
The default implementation would return False - and this always traverses the children.
The multi-return seems pretty Pythonic, which is why I suggested it,
but
either way would work, I guess.
Sure it is, but I just feel it's too generic, and could cause more
programming
errors in the end.
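For illustration, a sketch of how the cache's descent might honour such a method; the class and helper names are assumptions, not the settled interface:

```python
import xml.etree.ElementTree as ET

class DataObject:
    def __init__(self, tag, children=None):
        self.tag = tag
        self.children = children or []

    def generates_xml_for_children(self):
        # Default: the cache descends into the children itself.
        return False

    def to_xml(self):
        return ET.Element(self.tag)

def build_tree(obj):
    """Cache-side descent: only recurse where the object doesn't
    claim its children's XML for itself."""
    elem = obj.to_xml()
    if not obj.generates_xml_for_children():
        for child in obj.children:
            elem.append(build_tree(child))
    return elem

class Aggregator(DataObject):
    """A parent that takes over XML generation for its subtree."""
    def generates_xml_for_children(self):
        return True

    def to_xml(self):
        elem = ET.Element(self.tag)
        for child in self.children:
            # Free to reprocess the children's XML here if desired.
            elem.append(child.to_xml())
        return elem

tree = build_tree(DataObject("root", [Aggregator("agg", [DataObject("leaf")])]))
print(ET.tostring(tree).decode())
```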
Finally, can you expand on the factors that are important to
consider in
the decision between static and dynamic registration? My assumption
would be to lean strongly towards dynamic for flexibility of this
infrastructure, but I'm guessing there are factors that I'm not
considering.
I'm still thinking about this, but I think the main issue with static registration is that it means you need access to the source code to update the static registration file, which may not always be possible.
Yup.
I would certainly prefer dynamic myself, the question then is just
how dynamic
we should be.
One case I've been looking at is where, on import of a module, its __init__.py would call something like:
DataObjectCache.register_class( MyClass )
for each class that can be used to import XML in a module.
This works quite well (I have source code that tests it), but the main issue is that something needs to import the module... I'm thinking that the Application, in most cases, will already be doing this, but maybe there are cases where it doesn't...
In this case, we would need to consider something like a
"signature" for a
module which says that it's an install module with objects that can
import
XML. This would then require us to traverse the PYTHONPATH to find
such a
signature.
This latter option introduces a time penalty at start-up, but this may be offset by the flexibility it provides.
A signature that I would be thinking of is a special file like:
__install__init__.py
which we could search for; if it's found we would load that file and execute it - it would then contain the register_class() calls...
I've still to look into this in more depth, and was intending on
doing it as
part of the implementation, but maybe I should pick one now...
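A rough sketch of both halves under discussion - a register_class() hook plus a scan for the proposed __install__init__.py signature file. The scanning mechanics here are an assumption, not a settled design:

```python
import os
import sys

class DataObjectCache:
    _registered = []

    @classmethod
    def register_class(cls, klass):
        # Called from a module's __init__.py (or its signature file)
        # for each class that can import XML.
        cls._registered.append(klass)

def find_signature_files(paths=None):
    """Scan the given paths (default: sys.path) for the proposed
    '__install__init__.py' signature files; each found file would
    then be executed so it can call register_class()."""
    found = []
    for entry in paths if paths is not None else sys.path:
        if not os.path.isdir(entry):
            continue
        for root, _dirs, files in os.walk(entry):
            if "__install__init__.py" in files:
                found.append(os.path.join(root, "__install__init__.py"))
    return found

class MyClass:
    pass

DataObjectCache.register_class(MyClass)
print(DataObjectCache._registered)
```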
Seems important for your consumers to know which way to go sooner than
later.
Sure, and I'm working on a section about this...
3.4.2.1 A singleton here seems somewhat controversial to me. Why
isn't
it the application's responsibility to control this? An alternate
formulation that I think accomplishes the goals here is to have the
application provide the Engine with an instance, and the Checkpoint
objects can always get it from their associated Engine. Are there
cases
that this doesn't work for? (I didn't attempt to map this to all the
use cases so I'm not necessarily asserting it will, but it seems the
more natural solution to me so I'm wondering if you considered it).
During the prototype phase we did try something like this, having the engine as the central point for getting a reference to the DOC, Logging, etc., but it presented more problems where everything had to access the engine to get a pointer to an instance - so instead we came up with each of these being singletons in their own right, so that they could be accessed from anywhere simply by calling XXX.get_instance() - but with the one caveat that something (in this case the Application) needs to create the first instance explicitly, to ensure that ordering, etc. is correct.
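A minimal sketch of that singleton arrangement, with hypothetical method names: the application creates the first instance explicitly, and everything else calls get_instance():

```python
class DataObjectCache:
    _instance = None

    def __init__(self):
        # The application creates the first (and only) instance
        # explicitly, so initialization ordering stays under its
        # control; everything else uses get_instance().
        if DataObjectCache._instance is not None:
            raise RuntimeError("already created; use get_instance()")
        DataObjectCache._instance = self

    @classmethod
    def get_instance(cls):
        if cls._instance is None:
            raise RuntimeError("application has not created the cache yet")
        return cls._instance

app_cache = DataObjectCache()   # done once, by the application
print(DataObjectCache.get_instance() is app_cache)  # → True
```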
I guess I don't quite understand the problems it created; if anything I
would expect Engine to be a natural singleton and using it to access
the
elements that are part of its environment seems pretty obvious to me.
My feeling is that it's limiting the DataObjectCache, which seems to be
a more generic component than an installation engine. I can more
easily
imagine an application such as an authoring tool where I might want to
use two different caches and move data from one as a starting point to
another, so that's what I'm a bit stuck on here.
We tried using the Engine to access the Logger at the time, and I think that was where we first encountered issues - since the DOC also used the Logger, to get it we had to access the Engine, but then the Engine used the DOC, and as a result circular dependencies were created on doing the imports, which didn't appear to be simple to remove.
OK, that's starting to make sense :-)
3.4.2.4 Wouldn't clear() be useful in applications like the
interactive
installers where the user might go back to the first step in the
parade
of screens (might be useful as a variant of Use Case 1)? Also, I
didn't
grok what "dump( indented )" is supposed to mean?
True, clear() could have many uses...
As for dump(), it's mainly for use in development or debugging, to generate a "simple view" of the DOC at a given point in time. So you would get something like:
DataObjectCache [root]
    Targets [DISCOVERED]
        Disk [c0t0d0]
            Partition [p0]
            ....
and it uses the str() function to generate this, so an object may better represent its output if it wishes.
I didn't understand specifically what the "indent" argument was meant to do, though. No indication was given of what the developer would do with it.
Agreed, we could probably remove the indent argument.
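A sketch of how dump() might produce that view by walking the tree and using each object's str(); the class and member names here are illustrative only:

```python
class DataObject:
    def __init__(self, name, children=None):
        self.name = name
        self.children = children or []

    def __str__(self):
        # Objects may override this to better represent themselves.
        return type(self).__name__ + " [" + self.name + "]"

    def dump(self, depth=0):
        """Return an indented line per object, one level per depth."""
        lines = ["    " * depth + str(self)]
        for child in self.children:
            lines.extend(child.dump(depth + 1))
        return lines

doc = DataObject("root", [
    DataObject("DISCOVERED", [
        DataObject("c0t0d0", [DataObject("p0")]),
    ]),
])
print("\n".join(doc.dump()))
```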
page 25, first paragraph regarding writing out checkpoints to the
manifest: Seems like we need a mechanism for the application to
inform
each checkpoint of whether it's to be written out or not. Not sure
where that falls architecture-wise.
We see it as the checkpoints themselves knowing this (see above) - they simply need to return None if they are not going to generate anything.
I'm doubtful of this model yet. Say we're going to have AI dump out a
processed manifest for potential use as input to another AI; why would
any checkpoints be included in that manifest? Many, if not most, of
its
checkpoints are common with another application such as DC where the
checkpoint may well need to be dumped. So I'm again wondering how the
checkpoint knows this itself without baking in knowledge of the
containing app, which I find objectionable.
I can see where you're coming from, and maybe the correct approach is to always generate the XML for nodes that have an XML equivalent, and allow the Application to decide if it wants to pluck elements out of the generated XML - so DC might leave it in, and AI would remove it - the use of XSLTs makes sense here...
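A sketch of that "generate everything, then prune" idea; the element names are made up, and an XSLT outside the process could perform the same transformation:

```python
import xml.etree.ElementTree as ET

# Hypothetical generated manifest containing a checkpoints section.
manifest = ET.fromstring(
    "<auto_install>"
    "<target/>"
    "<checkpoints><checkpoint name='transfer'/></checkpoints>"
    "</auto_install>"
)

def prune(root, tag):
    """Remove every direct child with the given tag; an XSLT could
    do the same transformation outside the process."""
    for elem in root.findall(tag):
        root.remove(elem)
    return root

prune(manifest, "checkpoints")   # e.g. AI drops them, DC keeps them
print(ET.tostring(manifest).decode())
```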
I don't see how the application would decide what to pluck out after
the fact, so perhaps you can elaborate on how you think that might work?
Dave
_______________________________________________
caiman-discuss mailing list
[email protected]
http://mail.opensolaris.org/mailman/listinfo/caiman-discuss