Feedback below, inline.

Andi..

-------------------------------------
Defining Chandler Schemas with Python
-------------------------------------


Introduction
============

As many of you may know, I've for some time now been promoting the idea of replacing parcel XML with Python code for defining item schemas, and I created a proof-of-concept for this in the "Spike" project, found under 'internals' in the Chandler CVS.

Since the PyCon sprints, it's my understanding that there's now a broad and actionable consensus at OSAF that it is indeed desirable to move to using Python syntax in place of XML for parcels' schema definition. So, after working with Andi and Grant to get the necessary infrastructure in place within Chandler, I'd like to present my proposal for what the Python schema definitions will look like, how migration might take place, and what new possibilities for Chandler development these changes will enable.

If you haven't had a chance to look at Spike yet, you may find it helpful to read at least the "Introduction" section of this document:

http://cvs.osafoundation.org/viewcvs.cgi/internal/Spike/src/spike/schema.txt?rev=HEAD&content-type=text/vnd.viewcvs-markup

which presents a simple Python syntax for defining schemas. The actual syntax used in Chandler will be different, but the above document gives a good introduction to the concept, with lots of working examples. (In fact, the document is designed for use with Python's "doctest" module and is literally a part of Spike's unit tests. As much as is practical, I'll be using this approach for the changes to Chandler, so that the API will be documented and tested at the same time as it's developed.)

You'll notice, by the way, that the documentation doesn't talk much about Kinds, or names, paths, repository views, and parents. That's because in Spike's API, you don't need any of these things in order to create an Item. You just create the item, and until you take some action to store it, it's simply an ordinary Python object.


How it will Work
================

Here's a snippet of XML from the parcel.xml of the osaf.contentmodel package::

   <Kind itsName="ContentItem">
     <superKinds itemref="Item"/>
     <classes key="python">osaf.contentmodel.ContentModel.ContentItem</classes>
     <description>Content Item is the abstract super-kind for things like
       Contacts, Calendar Events, Tasks, Mail Messages, and Notes. Content
       Items are user-level items, which a user might file, categorize,
       share, and delete.</description>
     <Attribute itsName="body">
       <displayName>Body</displayName>
       <type itemref="Lob"/>
       <description>All Content Items may have a body to contain notes. It's
         not decided yet whether this body would instead contain the payload
         for resource items such as presentations or spreadsheets -- resource
         items haven't been nailed down yet -- but the payload may be
         different from the notes because payload needs to know MIME type,
         etc.</description>
     </Attribute>
   </Kind>


Here's the corresponding code in the proposed schema API::

   from application import schema    # not sure if this is where it will go
   from repository.schema import Types

   class ContentItem(schema.Item):
       """Base class for content items

       A content item (such as a contact, note, or photo) is a
       user-level item that a user might file, categorize, share,
       and delete.
       """

       body = schema.One(Types.Lob,
           displayName = "Body",
           doc = """\
   All Content Items may have a body to contain notes. It's not decided
   yet whether this body would instead contain the payload for resource
   items such as presentations or spreadsheets -- resource items haven't
   been nailed down yet -- but the payload may be different from the notes
   because payload needs to know MIME type, etc."""
       )


The fundamental idea here is that Python class definitions replace Kind elements, and Python property definitions replace Attribute elements. Superkinds are defined by inheritance. Parcels are Python packages. Standard Python "import" statements replace XML namespace definitions.

This has several useful consequences. First, it makes item classes independent of parcel loading, which means they're easy to unit test. You can simply create instances of items in order to run tests on them. Second, it means that content classes don't need getKind() methods and other chicanery to get access to a Kind object, just to be able to create instances. Indeed, in all the ways that matter, items will just be normal Python objects until/unless you link them with items that are already stored in the repository (at which time they will become persistent).
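For example, a unit test under the proposed API could be as simple as instantiating the class and making assertions against it. This is a hypothetical sketch (the ``ContentItem`` definition here is a plain-Python stand-in; a real test would import the actual class from its parcel package):

```python
# Stand-in for a schema-defined class; in real code you would import
# ContentItem from its parcel package instead of defining it here.
class ContentItem:
    def __init__(self, body=""):
        self.body = body

# No repository view, no Kind lookup, no parcel loading: until it is
# linked to a persisted item, the item is an ordinary Python object.
item = ContentItem(body="some notes")
assert item.body == "some notes"
```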

This means routines that create new items will no longer need to know what repository view the item is intended for. Instead, such routines can simply create an instance of the appropriate class and return it without further ado. As soon as the caller links the new item to a persisted item, the new item will be persisted as well. (This functionality will be made possible by the "null view" and "view migration" features that Andi is adding to the repository.)

A dirty little detail to also hide, in addition to 'kind', is 'parent', that is, where in the repository the item lives. While this carries little semantic value (except for Schema), it is useful for debugging and for repository maintenance tasks that cannot rely on schema for inferring structure. In your Schema API you could add the notion of 'defaultParent' to the class declarations. That default parent would then be used when an item is instantiated. Let's talk about this next week.


Code vs. Data
-------------

Sometimes when I describe the preceding, people wonder if this use of Python means that we are giving up on being "data driven", or if we will still be able to allow users to create kinds and attributes. No, we are not giving up on data-driven, and we will be just as dynamic as before.

If you're not familiar with Python's ultra-dynamic nature, it would seem at first that writing code must be less flexible or less dynamic than writing XML, but this is not at all the case. The Python code for a schema definition is just a script that creates data objects. These data objects are no different than the data objects you would create by reading XML. The only technical difference is that the Python code doesn't have to parse the XML first! (Of course, there are aesthetic differences, too.)

Note also that just because some schema is defined by writing Python classes, it doesn't stop Chandler from allowing users to create attributes or kinds. Again, if you're used to more static languages like Java or C++, it's natural to think of a class as something fixed. But Python allows you to trivially create new classes on the fly. For example::

   def create_a_class(docstring, base_class=object):
       # The nested class statement executes each time this function
       # is called, producing a brand-new class object every time.
       class aNewClass(base_class):
           __doc__ = docstring
       return aNewClass

This function returns a new, distinct class object each time it's called. Each returned class will have the name "aNewClass", but it will be a distinct class object. (And you could change its name by setting its ``__name__`` attribute, if you wanted to.)

If methods were defined in this "nested class" statement, they would have access to any parameters that were passed to ``create_a_class``, which would allow the methods to be customized for each new class created. In effect, Python is its own macro language at this level. Also note that there's no speed disadvantage here; the statements are compiled only once (when the module is compiled), no matter how many times you call the function and create new classes. They are not compiled on the fly; the statements are just the same as any other Python statements, and there is absolutely no observable distinction between the dynamically created classes and "normal" classes, because *all* Python classes are dynamically generated in exactly the same way!
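A toy variation (not part of the proposed API) makes this concrete: the method below closes over a parameter of the factory function, so each generated class gets customized behavior:

```python
def make_greeter_class(greeting):
    # The nested class statement runs anew on each call, and the
    # method closes over the `greeting` parameter passed in.
    class Greeter:
        def greet(self, name):
            return "%s, %s!" % (greeting, name)
    return Greeter

Hello = make_greeter_class("Hello")
Bonjour = make_greeter_class("Bonjour")

assert Hello().greet("world") == "Hello, world!"
assert Bonjour().greet("monde") == "Bonjour, monde!"
assert Hello is not Bonjour   # a distinct class object each call
```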

So as you can see, Python is an extremely *fluid* language, and the assumption that "code" is harder to change than data doesn't really carry over from other languages. "Hard coding" *isn't*, in other words. So, it's trivial to define fresh classes and descriptors to represent user-defined kinds and attributes, and in fact the repository already does this kind of class generation today to support multiple inheritance of kinds.
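The same fluidity is available through the built-in ``type()`` constructor, which is the general mechanism behind this sort of on-the-fly class generation. This is a generic Python illustration, not the repository's actual code:

```python
class Sharable:
    def share(self):
        return "shared"

class Taggable:
    def tag(self, label):
        return "tagged: " + label

# Create a class combining both bases on the fly -- the moral
# equivalent of generating a class for a kind with multiple superkinds.
Combined = type("Combined", (Sharable, Taggable), {})

item = Combined()
assert item.share() == "shared"
assert item.tag("todo") == "tagged: todo"
```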

What do we gain from this? Well, it won't be necessary to keep track of or look up Kinds in order to create items: just create an instance of the class. And if there's a class for every Kind that needs to be referenced "statically" in code, then you won't need to also keep track of repository paths in order to get access to a kind; just import the class and ask for its kind.

Tadaaah, I think that last paragraph sums it up very well. I do believe that the educational intro before is also very welcome. Did you bite your pen not to say 'Python is not Java' again? :)

Parcel Loading
--------------

There are no plans to change the current parcel loading arrangements; parcel.xml will remain a valid way to define schemas and instances. The only change likely to be made to parcel loading is to ensure that a parcel's Python modules are imported before trying to process instances defined in the parcel.xml. This is to ensure that the kinds are present in the repository before the instances are created. Apart from this change, however, the parcel.xml format will not be impacted.

Well, parcel loading supports circular dependencies, something Python is not too good at. So, changes in the parcel structure will have to be made. Note that that is a *good* development, not a bad one. I'm very worried about all the circularity we currently have in Chandler.

Existing parcels will be changed to use the new schema definition mechanism on an "inside out" basis. That is, superkinds will be changed before subkinds. This is because kinds defined in a parcel.xml can refer to kinds defined in a Python module, but not the other way around. So, likely the contentmodel parcel will be changed first.

Why could Python-defined kinds not depend on repository kinds? You do need the core schema, no?

There is, however, a new step that will have to be done when new kinds or attribute definitions are added to a parcel defined using Python. Each kind or attribute needs a permanent UUID assigned to it, as this UUID will be used to synchronize the Python module with the repository, and in the future it may be used to help support schema evolution. Spike has a tool that will automatically assign UUIDs for you, so that you don't have to do it by hand::

Actually, no, I've again leaned against using the same UUIDs. You can certainly do that if you want, but it doesn't buy you much. I've added the _uuid argument to the Item() constructor as you had requested, but the importing of schema items from view to view (or from null view to view) does not require the UUIDs to be the same. Schema items are matched across views first using UUIDs, then parent/child paths.

Instead, items that need to be copied on export, instead of moved, are marked with a new flag, Citem.COPYEXPORT, and their 'export' cloud is used to gather all the related items that are to be copied along. I already defined the 'export' clouds for schema items.


I've come to realize that hardcoding UUIDs would be a very leaky design decision and we'd start setting UUIDs on lots of things, not just schema items. For example, if I wanted to import an item sitting in your 'all' collection from your Chandler into mine, I sure don't want to import your 'all' collection as well. In order to match your 'all' collection with mine, assuming same UUIDs is very brittle and hardly possible to coordinate. There has to be something else that is used to match items across repositories. Lisa had written a proposal about that a year ago, and called it 'Exportable Addresses'. A quick google search locates this document on our Wiki:

http://wiki.osafoundation.org/bin/view/Chandler/ExportableAddresses

Echoing her proposal, I proposed an implementation for that using ref collection-based paths, also a year ago, which I sent to dev again yesterday.

I think that the null view where items are first created with your Schema API is just like an independent repository that persists as Python code. From this repository, items can be imported into one or more regular Chandler repositories, since your Python code would be run by all Chandler users.


For schema items, and in my current first (or second now) implementation of
item import, parent/child paths are used for matching items across
repositories when no UUID match is found. Once we start using ref
collection-based paths as exportable addresses, the new findMatch() API can
be customized to make use of them as well. (I should add that the
implementation for ref collection-based paths was completed a year ago, and
is part of the regular findPath() API).

As for schema evolution, matching UUIDs could tell you that the schemas are
not the same but you still know nothing about which is more recent and which
was derived from what. Realizing that, I also moved against relying on UUIDs
to match schema since we need to introduce some more talkative identifier
for schema matching, including at least some version number. By the way,
while implementing item import I realized that that could very well be the
starting point for a schema evolution implementation hook as well. But
that's for later.

   http://cvs.osafoundation.org/viewcvs.cgi/internal/Spike/src/spike/uuidgen.txt?rev=HEAD&content-type=text/vnd.viewcvs-markup

(Of course, it will have to be ported to work with the new Chandler schema API, because Spike doesn't currently integrate with the repository.)

If you forget to run the tool over a module whose schema has changed, and you didn't set up the UUIDs by hand, an exception will be raised when you try to create instances of the new or changed classes. There should be a reminder in the error message telling you to run the UUID generation tool to resolve the error.


API "Quick Reference" ---------------------

It is currently an open issue where the API will live. But it's going to be a module called ``schema``, such that you'll do ``from somewhere import schema``; it's just not clear yet what ``somewhere`` will be. Here are the main features of interest:

``schema.Item``
The base class for persistent items; inherit from it or a subclass. Note that your Python inheritance relationship will determine the superkind hierarchy of your newly defined kinds, so you will want to be sure that you subclass the appropriate base kind class, rather than subclassing everything directly from ``schema.Item``.

Is that 'Item' class the same as repository.item.Item.Item ?

``schema.One``
Define an attribute of "single" cardinality, optionally specifying any attribute aspects like its type and display name.


``schema.Many``
Define an attribute of "set" cardinality (once this is available in the repository), optionally specifying any attribute aspects like its type and display name.


``schema.Sequence``
Define an attribute of "list" cardinality, optionally specifying any attribute aspects like its type and display name.


``schema.Mapping``
Define an attribute of "dict" cardinality, optionally specifying any attribute aspects like its type and display name.

We've already discussed the impact of terminology changes. Just a note, for the record.

``schema.Cloud``
Define a cloud attribute. (This isn't entirely worked out yet; Spike was using a different approach to the cloud concept, so I may need some assistance from someone wise in the ways of clouds before getting a concrete API defined for this.)

There is no such thing currently; Clouds use Endpoints. An Endpoint is to a Cloud what an Attribute is to a Kind.

In order to reference types (as opposed to kinds), you'll import them from ``repository.schema.Types``. For example, ``Types.String`` to define a string attribute. For attributes that reference other kinds, you'll just import the corresponding class directly from the appropriate module.

I'm glad you're introducing a new namespace here. This is a very old issue that Katie and I could never agree on. It is now moot of course (parcel XML can be bent into that too) but I never 'fixed' it. In the old days, there were different namespaces for kinds and types but I was told to merge them, which I reluctantly did. It is time to undo that.
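To make the cardinality idea above concrete, here is a toy sketch of how ``One`` and ``Sequence`` could be modeled as ordinary Python descriptors. This is purely illustrative; the real ``schema`` module will also record aspects like type and synchronize with the repository:

```python
class One(object):
    """Toy single-cardinality attribute descriptor (illustrative only)."""
    def __init__(self, displayName=None, doc=None):
        self.displayName = displayName
        self.__doc__ = doc
    def __set_name__(self, owner, name):
        self.name = "_" + name          # per-instance storage slot
    def __get__(self, obj, objtype=None):
        if obj is None:
            return self                 # class access returns descriptor
        return getattr(obj, self.name, None)
    def __set__(self, obj, value):
        setattr(obj, self.name, value)

class Sequence(One):
    """Toy list-cardinality attribute descriptor (illustrative only)."""
    def __get__(self, obj, objtype=None):
        if obj is None:
            return self
        # Lazily create the underlying list on first access.
        return obj.__dict__.setdefault(self.name, [])

class Note(object):
    body = One(displayName="Body", doc="A note body")
    tags = Sequence(displayName="Tags")

n = Note()
n.body = "hello"
n.tags.append("inbox")
assert n.body == "hello"
assert n.tags == ["inbox"]
```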

Attribute aspects will mostly be keyword arguments to the attribute definitions. Inverse attributes for bidirectional relationships will be specified with an ``inverse`` keyword, and as in Spike they will refer to an attribute of the other class. For example::

   class ContentItem(schema.Item):
       ...
       creator = schema.One(
           displayName = "Created By",
           doc = "Link to the contact who created the item",
       )

   class Contact(ContentItem):
       itemsCreated = schema.Many(
           ContentItem,    # sequence of ContentItem
           inverse = ContentItem.creator,
           ...
       )

Notice that the inverse need only be specified on *one* side of the bidirectional relationship -- whichever side is defined last.

Probably fine for this API, but this is more restrictive than what the data model expects. The data model only matches names in a bidirectional ref, not attributes.
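A toy sketch shows how the one-sided wiring could work: the many-side descriptor registers itself as the inverse of the one-side, so that setting one side updates the other. This is purely illustrative plain Python, not the proposed implementation:

```python
class One(object):
    """Toy one-side of a bidirectional reference (illustrative only)."""
    def __set_name__(self, owner, name):
        self.name = name
    def __get__(self, obj, objtype=None):
        if obj is None:
            return self
        return obj.__dict__.get(self.name)
    def __set__(self, obj, value):
        old = obj.__dict__.get(self.name)
        obj.__dict__[self.name] = value
        inv = getattr(self, "inverse", None)
        if inv is not None:
            # Keep the other side's collection consistent.
            if old is not None and obj in old.__dict__.get(inv.name, []):
                old.__dict__[inv.name].remove(obj)
            if value is not None:
                value.__dict__.setdefault(inv.name, []).append(obj)

class Many(object):
    """Toy many-side; wires itself up as the inverse of a One."""
    def __init__(self, inverse):
        self.inverse = inverse
        inverse.inverse = self          # only one side names the inverse
    def __set_name__(self, owner, name):
        self.name = name
    def __get__(self, obj, objtype=None):
        if obj is None:
            return self
        return obj.__dict__.setdefault(self.name, [])

class ContentItem(object):
    creator = One()

class Contact(ContentItem):
    itemsCreated = Many(inverse=ContentItem.creator)

note, alice = ContentItem(), Contact()
note.creator = alice                    # set one side...
assert note in alice.itemsCreated       # ...and the other side follows
```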

In Conclusion
=============

* Python class definitions offer a compact and convenient way to specify Chandler schemas that will be easier and less error-prone to use than parcel.xml, without losing any of Chandler's current or planned flexibility.

Almost true. You're introducing two major differences/restrictions in your model. They are limited to your model, in other words, as long as they don't bleed back into the data model, they're fine. A one-to-one kind-class mapping and matching attributes instead of attribute names in bidirectional refs are new constraints you introduced here. Again, if users are fine with them and they don't bleed back, they're fine.

* parcel.xml isn't going away, and during the transition any schema components defined in parcel.xml should be able to co-exist with those defined using Python (barring any inter-dependency issues).

Yes, this is important to maintain for a while.

* Using Python-defined schema means that content items can be unit tested in isolation, without parcel loading overhead, making fast unit tests possible, enabling a test-driven approach to development of the non-UI portions of Chandler.

Yes, you're removing parcel loading (more or less) but I should point out that the null view I added two weeks ago should work like any other view from the standpoint of single-view unit tests. Many of our unit tests could be converted to using the null view, even before your schema API is ready. Performance improvements should be noticeable since commit would be completely shut out. Parcel loading costs would still be incurred of course.

It also reduces coupling between routines that currently have to ferry repository views or items around in order to be able to find kinds and set parents on newly created items.

Yes, hiding parent and kind in your API is a great benefit.

I hope that this was informative and helpful. I will be in OSAF's San Francisco offices next Monday through Thursday (April 18th-21st), so if you'd like to spend some time talking about any aspect of this proposal during those days, please let me know. Thanks!

Very much so, thank you for doing this !

Andi..

_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

Open Source Applications Foundation "Dev" mailing list
http://lists.osafoundation.org/mailman/listinfo/dev
