[Dev] DRAFT: Python Schema API proposal

Phillip J. Eby Fri, 15 Apr 2005 14:51:15 -0700

-------------------------------------
Defining Chandler Schemas with Python
-------------------------------------


Introduction
============

As many of you may know, I have for some time been promoting the idea of replacing parcel XML with Python code for defining item schemas, and I created a proof of concept for this in the "Spike" project, found under 'internal' in the Chandler CVS.

Since the PyCon sprints, it's my understanding that there's now a broad and actionable consensus at OSAF that it is indeed desirable to move to using Python syntax in place of XML for parcels' schema definition. So, after working with Andi and Grant to get the necessary infrastructure in place within Chandler, I'd like to present my proposal for what the Python schema definitions will look like, how migration might take place, and what new possibilities for Chandler development these changes will enable.

If you haven't had a chance to look at Spike yet, you may find it helpful to read at least the "Introduction" section of this document:

http://cvs.osafoundation.org/viewcvs.cgi/internal/Spike/src/spike/schema.txt?rev=HEAD&content-type=text/vnd.viewcvs-markup

which presents a simple Python syntax for defining schemas. The actual syntax used in Chandler will be different, but the above document gives a good introduction to the concept, with lots of working examples. (In fact, the document is designed for use with Python's "doctest" module and is literally a part of Spike's unit tests. As much as is practical, I'll be using this approach for the changes to Chandler, so that the API will be documented and tested at the same time as it's developed.)

You'll notice, by the way, that the documentation doesn't talk much about Kinds, or names, paths, repository views, and parents. That's because in Spike's API, you don't need any of these things in order to create an Item. You just create the item, and until you take some action to store it, it's simply an ordinary Python object.


How It Will Work
================

Here's a snippet of XML from the parcel.xml of the osaf.contentmodel package::

    <Kind itsName="ContentItem">
      <superKinds itemref="Item"/>
      <classes key="python">osaf.contentmodel.ContentModel.ContentItem</classes>
      <description>Content Item is the abstract super-kind for things like
      Contacts, Calendar Events, Tasks, Mail Messages, and Notes. Content
      Items are user-level items, which a user might file, categorize,
      share, and delete.</description>
      <Attribute itsName="body">
        <displayName>Body</displayName>
        <type itemref="Lob"/>
        <description>All Content Items may have a body to contain notes.
        It's not decided yet whether this body would instead contain the
        payload for resource items such as presentations or spreadsheets --
        resource items haven't been nailed down yet -- but the payload may
        be different from the notes because payload needs to know MIME
        type, etc.</description>
      </Attribute>

Here's the corresponding code in the proposed schema API::

    from application import schema    # not sure if this is where it will go
    from repository.schema import Types

    class ContentItem(schema.Item):
        """Base class for content items

        A content item (such as a contact, note, photo, etc.)  Content
        objects are user-level items that a user might file, categorize,
        share, and delete.
        """

        body = schema.One(
            Types.Lob,
            displayName = "Body",
            doc = """\
            All Content Items may have a body to contain notes.  It's not
            decided yet whether this body would instead contain the payload
            for resource items such as presentations or spreadsheets --
            resource items haven't been nailed down yet -- but the payload
            may be different from the notes because payload needs to know
            MIME type, etc."""
        )

The fundamental idea here is that Python class definitions replace Kind elements, and Python property definitions replace Attribute elements. Superkinds are defined by inheritance. Parcels are Python packages. Standard Python "import" statements replace XML namespace definitions.

This has several useful consequences. First, it makes item classes independent of parcel loading, which means they're easy to unit test. You can simply create instances of items in order to run tests on them. Second, it means that content classes don't need getKind() methods and other chicanery to get access to a Kind object, just to be able to create instances. Indeed, in all the ways that matter, items will just be normal Python objects until/unless you link them with items that are already stored in the repository (at which time they will become persistent).

This means routines that create new items will no longer need to know what repository view the item is intended for. Instead, such routines can simply create an instance of the appropriate class and return it without further ado. As soon as the caller links the new item to a persisted item (e.g. by setting an attribute), the new item will be persisted as well. (This functionality will be made possible by the "null view" and "view migration" features that Andi has added to the repository.)
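To make the class-as-Kind idea a little more concrete, here is a minimal sketch of how an attribute declaration can carry schema metadata while instances stay ordinary Python objects. The ``One`` and ``Item`` classes below are simplified stand-ins written for this example; they are assumptions, not the actual Chandler schema API, and they do nothing about persistence.

```python
# Hypothetical stand-ins for illustration only; not the real Chandler schema API.

class One:
    """A single-valued attribute that remembers its own schema metadata."""

    def __init__(self, type_=None, displayName=None, doc=None):
        self.type = type_
        self.displayName = displayName
        self.__doc__ = doc
        self.name = None

    def __set_name__(self, owner, name):
        # Called by Python (3.6+) when the owning class is created.
        self.name = name

    def __get__(self, obj, owner=None):
        if obj is None:
            return self  # class-level access returns the descriptor itself
        try:
            return obj.__dict__[self.name]
        except KeyError:
            raise AttributeError(self.name) from None

    def __set__(self, obj, value):
        obj.__dict__[self.name] = value


class Item:
    """Stand-in base class; instances are plain Python objects."""


class ContentItem(Item):
    """Base class for content items."""
    body = One(str, displayName="Body", doc="Notes attached to the item.")


# No parcel loading, repository view, or parent is needed to get an instance.
note = ContentItem()
note.body = "Remember the milk"
```

In a unit test, ``ContentItem.body.displayName`` gives access to the declared metadata, and ``note.body`` behaves like any other attribute.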


Code vs. Data
-------------

Sometimes when I describe the preceding, people wonder if this use of Python means that we are giving up on being "data driven", or if we will still be able to allow users to create kinds and attributes. We are not giving up on being data-driven, and Chandler will be just as dynamic as before: users will still be able to create kinds and attributes.

If you're not familiar with Python's ultra-dynamic nature, it might seem at first that writing code must be less flexible or less dynamic than writing XML, but this is not at all the case. The Python code for a schema definition is just a script that creates data objects. These data objects are no different from the data objects you would create by reading XML. The only technical difference is that the Python code doesn't have to parse the XML first! (Of course, there are aesthetic differences, too.)

Note also that just because some schema is defined by writing Python classes, it doesn't stop Chandler from allowing users to create attributes or kinds. Again, if you're used to more static languages like Java or C++, it's natural to think of a class as something fixed. But Python allows you to trivially create new classes on the fly. For example::

    def create_a_class(docstring,base_class=object):
        class aNewClass(base_class):
            __doc__ = docstring
        return aNewClass

This function returns a new, distinct class object each time it's called. Each returned class will have the name "aNewClass", but it will be a distinct class object. (And you could change its name by setting its ``__name__`` attribute, if you wanted to.)

If methods were defined in this "nested class" statement, they would have access to any parameters that were passed to ``create_a_class``, which would allow the methods to be customized for each new class created. In effect, Python is its own macro language at this level. Also note that there's no speed disadvantage here; the statements are compiled only once (when the module is compiled), no matter how many times you call the function and create new classes. They are not compiled on the fly; the statements are just the same as any other Python statements, and there is absolutely no observable distinction between the dynamically created classes and "normal" classes, because *all* Python classes are dynamically generated in exactly the same way!
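As a small illustration of the point about methods and parameters, the sketch below defines a method inside the nested class statement that uses an argument of the enclosing function. The names ``create_greeter_class``, ``Greeter``, and ``greet`` are made up for this example.

```python
def create_greeter_class(greeting, base_class=object):
    """Return a new class whose method closes over ``greeting``."""
    class Greeter(base_class):
        def greet(self, name):
            # ``greeting`` comes from the enclosing function call
            return "%s, %s!" % (greeting, name)
    return Greeter


Hello = create_greeter_class("Hello")
Howdy = create_greeter_class("Howdy")
Howdy.__name__ = "Howdy"   # rename the second class after the fact

print(Hello().greet("Andi"))
print(Howdy().greet("Grant"))

# Both classes share one compiled code object for ``greet``;
# only the closed-over value differs.
print(Hello.greet.__code__ is Howdy.greet.__code__)
```

Each call produces a distinct class with its own customized behavior, yet the body of ``greet`` is compiled only once.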

So as you can see, Python is an extremely *fluid* language, and the assumption that "code" is harder to change than data doesn't really carry over from other languages. "Hard coding" *isn't*, in other words. So, it's trivial to define fresh classes and descriptors to represent user-defined kinds and attributes, and in fact the repository already does this kind of class generation today to support multiple inheritance of kinds.
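For a sense of how a user-defined kind could become a class at run time, here is a rough sketch that builds a class purely from data using Python's built-in three-argument ``type()`` call. The ``define_kind`` helper and the ``Recipe`` example are hypothetical; this is not how the repository generates classes, only a demonstration that the language makes it easy.

```python
def define_kind(name, attribute_names, base=object):
    """Build a class at run time from plain data (hypothetical helper)."""
    namespace = {"attribute_names": tuple(attribute_names)}
    # Give every declared attribute a class-level default of None
    for attr in attribute_names:
        namespace[attr] = None
    return type(name, (base,), namespace)


# The "schema" here is just data that could have come from user input.
Recipe = define_kind("Recipe", ["title", "servings"])

soup = Recipe()
soup.title = "Minestrone"
```

The resulting ``Recipe`` is indistinguishable from a class written with a ``class`` statement: it has a name, can be subclassed, and can be instantiated.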

What do we gain from this? Well, it won't be necessary to keep track of or look up Kinds in order to create items: just create an instance of the class. And if there's a class for every Kind that needs to be referenced "statically" in code, then you won't need to also keep track of repository paths in order to get access to a kind; just import the class and ask for its kind.


Parcel Loading
--------------

There are no plans to change the current parcel loading arrangements; parcel.xml will remain a valid way to define schemas and instances. The only change likely to be made to parcel loading is to ensure that a parcel's Python modules are imported before trying to process instances defined in the parcel.xml. This is to ensure that the kinds are present in the repository before the instances are created. Apart from this change, however, the parcel.xml format should not be impacted.

Existing parcels will be changed to use the new schema definition mechanism on an "inside out" basis. That is, superkinds will be changed before subkinds. This is because kinds defined in a parcel.xml can refer to kinds defined in a Python module, but not the other way around. So, likely the contentmodel parcel will be changed first.

There is, however, a new step that will have to be done when new kinds or attribute definitions are added to a parcel defined using Python. Each kind or attribute needs a permanent UUID assigned to it, as this UUID will be used to synchronize the Python module with the repository, and in the future it may be used to help support schema evolution. Spike has a tool that will automatically assign UUIDs for you, so that you don't have to do it by hand::

    http://cvs.osafoundation.org/viewcvs.cgi/internal/Spike/src/spike/uuidgen.txt?rev=HEAD&content-type=text/vnd.viewcvs-markup

(Of course, it will have to be ported to work with the new Chandler schema API, because Spike doesn't currently integrate with the repository.)

If you forget to run the tool over a module whose schema has changed, and you didn't set up the UUIDs by hand, an exception will be raised when you try to create instances of the new or changed classes. There should be a reminder in the error message telling you to run the UUID generation tool to resolve the error.
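The check described above might behave something like the following sketch. Everything here is an assumption made for illustration: the ``schema_uuid`` attribute name, the ``MissingUUIDError`` exception, and the decision to check at instantiation time are not taken from Spike or from the planned API.

```python
import uuid


class MissingUUIDError(Exception):
    """Raised when a schema class has no UUID assigned (hypothetical)."""


class Item:
    schema_uuid = None   # hypothetical attribute name

    def __init__(self):
        cls = type(self)
        # Look only in the class's own namespace, so that a subclass
        # cannot silently inherit its parent's UUID.
        if cls.__dict__.get("schema_uuid") is None:
            raise MissingUUIDError(
                "%s has no UUID; run the UUID generation tool on module %s"
                % (cls.__name__, cls.__module__)
            )


class Note(Item):
    # An arbitrary example value standing in for a tool-assigned UUID
    schema_uuid = uuid.UUID("12345678-1234-5678-1234-567812345678")


class Task(Item):
    pass   # UUID not assigned yet


Note()   # fine: the class has its own UUID

try:
    Task()
except MissingUUIDError as exc:
    print(exc)   # the message reminds you to run the tool
```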


API "Quick Reference"
---------------------

It is currently an open issue where the API will live. But it's going to be a module called ``schema``, such that you'll do ``from somewhere import schema``; it's just not clear yet what ``somewhere`` will be. Here are the main features of interest:

``schema.Item``
    The base class for persistent items; inherit from it or a subclass. Note that your Python inheritance relationship will determine the superkind hierarchy of your newly defined kinds, so you will want to be sure that you subclass the appropriate base kind class, rather than subclassing everything directly from ``schema.Item``.

``schema.One``
    Define an attribute of "single" cardinality, optionally specifying any attribute aspects like its type and display name.

``schema.Many``
    Define an attribute of "set" cardinality (once this is available in the repository), optionally specifying any attribute aspects like its type and display name.

``schema.Sequence``
    Define an attribute of "list" cardinality, optionally specifying any attribute aspects like its type and display name.

``schema.Mapping``
    Define an attribute of "dict" cardinality, optionally specifying any attribute aspects like its type and display name.

``schema.Cloud``
    Define a cloud attribute. (This isn't entirely worked out yet; Spike was using a different approach to the cloud concept, so I may need some assistance from someone wise in the ways of clouds before getting a concrete API defined for this.)

In order to reference types (as opposed to kinds), you'll import them from ``repository.schema.Types``. For example, ``Types.String`` to define a string attribute. For attributes that reference other kinds, you'll just import the corresponding class directly from the appropriate module.

Attribute aspects will mostly be keyword arguments to the attribute definitions. Inverse attributes for bidirectional relationships will be specified with an ``inverse`` keyword, and as in Spike they will refer to an attribute of the other class. For example::

    class ContentItem(schema.Item):
        ...
        creator = schema.One(
            displayName = "Created By",
            doc = "Link to the contact who created the item",
        )

    class Contact(ContentItem):
        itemsCreated = schema.Many(
            ContentItem,    # set of ContentItem
            inverse = ContentItem.creator,
            ...
        )

Notice that the inverse need only be specified on *one* side of the bidirectional relationship -- whichever side is defined last.
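To show what "specifying the inverse on one side" can mean in practice, here is a simplified, self-contained sketch in which setting the single-valued side keeps the set-valued side in sync. These ``One`` and ``Many`` classes are stand-ins written for this example; the real ``schema.One`` and ``schema.Many`` may work quite differently, and the sketch ignores persistence entirely.

```python
# Hypothetical stand-ins; not the actual Chandler schema API.

class Many:
    """Set-valued attribute; stores a set on each instance."""

    def __init__(self, type_=None, inverse=None):
        self.type = type_
        self.inverse = inverse
        if inverse is not None:
            inverse.inverse = self   # declared once, linked on both sides

    def __set_name__(self, owner, name):
        self.name = name

    def __get__(self, obj, owner=None):
        if obj is None:
            return self
        return obj.__dict__.setdefault(self.name, set())


class One:
    """Single-valued attribute that keeps its inverse set in sync."""

    inverse = None

    def __init__(self, type_=None):
        self.type = type_

    def __set_name__(self, owner, name):
        self.name = name

    def __get__(self, obj, owner=None):
        if obj is None:
            return self
        return obj.__dict__.get(self.name)

    def __set__(self, obj, value):
        old = obj.__dict__.get(self.name)
        if old is not None and self.inverse is not None:
            self.inverse.__get__(old).discard(obj)    # unlink old target
        obj.__dict__[self.name] = value
        if value is not None and self.inverse is not None:
            self.inverse.__get__(value).add(obj)      # link new target


class ContentItem:
    creator = One()


class Contact(ContentItem):
    itemsCreated = Many(ContentItem, inverse=ContentItem.creator)


alice, bob = Contact(), Contact()
note = ContentItem()
note.creator = alice
note.creator = bob   # re-linking removes the note from alice's set
```

After the last line, ``bob.itemsCreated`` contains ``note`` and ``alice.itemsCreated`` is empty, even though only ``note.creator`` was ever assigned.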


Implementation Tasks
====================

1. Update Spike's code generator tests to use the repository's new "null view" instead of a memory repository. (DONE; this yielded a 40% speed improvement for the tests, dropping pack load time from roughly 1.3 seconds to about 0.8 seconds.)

2. Add Spike tests to prototype programmatic creation of repository Kinds and Attributes, and setting their UUIDs at construction time.

3. Test subclassing the repository's new C-based descriptor types and adding Spike-style metadata to them.

4. Implement the actual schema API and doctests in the main Chandler codebase for Kinds and Attributes. (This is pending a decision of where the API should live in the Chandler package namespace; maybe that decision can be wrapped next week while I'm in SFO.)

5. Define and implement a cloud-definition API. (This probably needs some input from persons Wise in the Ways of Clouds.)

6. Port Spike's UUID generation tool (and docs) to work with modules using the Chandler schema API.

7. Attempt a port of the ``contentmodel`` parcel using the API, possibly w/participation by others. (Note: Andi would need to have completed the repository auto-import feature before this would actually be usable in the Chandler application.)

8. Modify the parcel loading facilities to ensure that modules defining kinds are imported before loading parcel.xml files that define instances of those kinds. (This might need to be done by someone other than me; it might also require some minor changes to existing parcels or to the rules for how parcel loading is sequenced.)

9. Investigate possible synergy between the descriptor-level aspect caching that Andi wants to do for performance reasons, and the aspect setting that the schema API needs to do for schema definition reasons. (This will probably actually happen while I'm in SFO next week; it's only at the bottom of this list because it's optional in the general scheme of things.)

10. Investigate the feasibility of implementing Spike's ``schema.Relationship`` concept for Chandler, to allow creation of global attributes that don't appear in a class' static API, allowing parcels to expand/extend existing parcels.


In Conclusion
=============

* Python class definitions offer a compact and convenient way to specify Chandler schemas that will be easier and less error-prone to use than parcel.xml, without losing any of Chandler's current or planned flexibility.

* parcel.xml isn't going away, and during the transition any schema components defined in parcel.xml should be able to co-exist with those defined using Python (barring any inter-dependency issues and assuming no other issues arise).

* Using Python-defined schema means that content items can be unit tested in isolation, without parcel loading overhead. This makes fast unit tests possible and enables a test-driven approach to development of the non-UI portions of Chandler. It also reduces coupling between routines that currently have to ferry repository views or items around in order to be able to find kinds and set parents on newly created items.

I hope that this was informative and helpful. I will be in OSAF's San Francisco offices next Monday through Thursday (April 18th-21st), so if you'd like to spend some time talking about any aspect of this proposal during those days, please let me know. Thanks!

_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

Open Source Applications Foundation "Dev" mailing list
http://lists.osafoundation.org/mailman/listinfo/dev
