This email is more of a preliminary brain dump than it is a proposal. It's just to work out some of the requirements and design tradeoffs that I've been mulling over in the last few weeks, trying to get to a point where detailed API design can begin. Please feel free to jump in with questions or comments about any of it. Thanks.

Background and Requirements Overview
------------------------------------

We want to be able to dump user data from a Chandler repository into a stable external format, and be able to reload the resulting backup into the same or newer versions of Chandler. The intended uses are for backups and to support evolution of both the repository implementation and our domain models without requiring users to discard their data.

This overall strategy was chosen after a discussion during the PyCon sprints; a writeup of the analysis can be found here:

http://lists.osafoundation.org/pipermail/chandler-dev/2006-February/005301.html


By "stable external format", I mean a format that does not change significantly from one Chandler release to the next, and which allows for version detection of the format itself, as well as providing version and schema information for the parcels whose data is contained in the format.

As I see it, there are several forces to be resolved in the design, most of which compete with all of the others:

Ability to perform upgrades flexibly
In the common case, schema upgrades will consist mostly of simple additions or renames. But in uncommon cases, entire families of objects may be restructured, or data might be moved from one Kind to another.

Implementation Complexity
We don't have the resources to do original research in schema evolution techniques, nor can we afford to essentially duplicate the existing repository implementation in some other form. If possible, the actual data format shouldn't reinvent wheels either.

API Complexity
In principle, we could have a simple API by just asking each parcel to write its data to a provided file. However, this just asks each parcel author to invent their own format while revisiting the same tradeoffs. Ideally, our API should allow a parcel developer to write a minimum of code to support simple upgrades, and it should make complex upgrades possible.

Effective Support for Parcel Modularity
Chandler isn't a monolithic application, and parcels can be upgraded independently of one another. Effective dump-and-reload thus requires that parcels' data be managed in a way that allows multiple upgrades to occur during the same load operation, and that dumps occur in a way that allows each parcel's schema or version information to be recorded separately. Ideally, this modularity would extend to the data as well as the schema, so that a single parcel's data could be backed up or restored, subject to the parcel's dependencies. (Meaning that we could perhaps at some point allow upgrading a single parcel's schema without requiring a complete dump and reload.)

Performance
Dumping should be fast. Reloading shouldn't be slow for simple common cases, but it's okay to be slow if a complex schema change occurs.


Design Implications
-------------------

There are several places where these forces resolve to fairly straightforward preliminary conclusions about how the system should work:

* The external format should deal in relatively simple data structures composed of elementary values (of a relatively small number of types) arranged in records. Blobs should also be supported. (Forces: reduce implementation complexity, increase dump performance)

* The format shouldn't use nesting structures that would require temporary storage of extremely large sequences. (Forces: reduce implementation complexity, increase load performance)

* Notifications of all kinds (including onValueChanged) must be disabled during a load operation -- and that includes refresh() notifications in other views. Load operations should be globally exclusive, with no other activity taking place, which also means no repository-based UI components can be active. In other words, "Chandler as we know it" *can't be running during a load operation*.

* To allow for simple upgrades, parcels should be allowed access to the stream of records being loaded, so that they can transform them directly into records that match the parcel's current schema. (A rough sketch of how this might fit together appears after this list.)

* To allow for complex upgrades, parcels should be given "before" and "after" hooks. This allows them to use the repository itself to store working data during a complex schema change. Without this feature, it would be necessary to provide query facilities in the dump/reload system itself, increasing implementation complexity. But with this feature, simple upgrade reloads and non-upgrade reloads can be fast, and only very complex upgrades pay a penalty in performance and implementation complexity.
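
For illustration, the load side of the two points above might hang together roughly like the sketch below. Every name in it (transform_records, before_load, after_load, write_record) is hypothetical, not a proposed API; the point is only the overall shape: hooks bracket the load, and each parcel's transformer is just another generator wrapped around the record stream.

    def reload(record_stream, parcels, view, write_record):
        # Hypothetical driver-side sketch, not a proposed API.
        # 'parcels' is assumed to be in dependency order; 'write_record'
        # stands in for whatever actually stores a record in the repository.

        # "Before" hooks run first, so complex upgrades can stash
        # working data in the repository before any records arrive.
        for parcel in parcels:
            parcel.before_load(view)

        # Each parcel wraps the record stream in its own transforming
        # generator; simple upgrades just rewrite records as they pass by.
        stream = record_stream
        for parcel in parcels:
            stream = parcel.transform_records(stream)

        for record in stream:
            write_record(view, record)

        # "After" hooks run last, for cleanup and complex fixups.
        for parcel in reversed(parcels):
            parcel.after_load(view)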

But the list of things that *aren't* so neatly resolved is bigger.

For example, what dumping order should be used? The modularity requirement argues in favor of dumping on a per-parcel basis, but this means that annotated kinds will be scanned multiple times, once for each parcel that annotates the kinds in question. So dumping performance seems to argue that it would be better to just walk the entire repository and write data as you find it. (At least, if you're going to be dumping most of the repository contents, anyway.)

Reloading performance, on the other hand, seems to argue that the data should be broken up by record types, in order to avoid repeated dispatching overhead. That is, if you iterate over a sequence of identical records, you only have to look up the filtering code once, and you can even write it as a generator to avoid extra call overhead. (Of course, these performance issues could easily be dominated by the cost of writing the data to the repository.)
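
Purely as a sketch of that dispatch point (the record layout and the 'filters' mapping are made up): if records arrive grouped by type, the per-type transformer only needs to be looked up once per run of records rather than once per record:

    from itertools import groupby
    from operator import itemgetter

    def transform_stream(records, filters):
        # 'records' is an iterable of (record_type, data) pairs, arranged
        # so that identical record types arrive in runs; 'filters' maps a
        # record type to a generator function, with no entry meaning no
        # transformation is needed.  The lookup happens once per run.
        for record_type, run in groupby(records, key=itemgetter(0)):
            transform = filters.get(record_type)
            if transform is None:
                for record in run:
                    yield record              # pass through unchanged
            else:
                for record in transform(run):
                    yield record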

Hm. Actually, I've just thought of a way to get the advantages of both approaches, by writing interleaved records at dump time and reading them in an interleaved way at reload time. Making it work for iterators would be a pain, but doable. Okay, so scratch that issue off the "unresolved" list. :)

Using iterators does have another big advantage, though. If a parcel can provide an iterator to do record transformation (simple schema upgrades), the iterator can also run code at the beginning and end of the process -- which means it can also do complex upgrades that need setup and cleanup steps.

And writing these steps at the beginning and end of a loop that processes loaded records is a simple and "obvious" way to do it, without having to learn multiple APIs for upgrading the schema. The only tricky bit in the API is that we'd have to guarantee relative ordering for these transformation functions, so that the parcel author knows what order the iterators will be called in. But that's mostly an implementation detail.

So, if we could reduce every item to records, then each Item subclass would just need a 'load_records()' classmethod that iterated over an input and yielded transformed (or untransformed) records. The default implementation of this classmethod would simply yield untransformed records as long as their schema matched the current schema, and raise an error otherwise.
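
A very rough sketch of what that might look like follows. Only the 'load_records()' name comes from the paragraph above; the base class, the record layout, and the version attribute are all invented for illustration:

    class RecordLoader:
        # Hypothetical stand-in for whatever base class actually grows
        # this hook; records here are (schema_version, data) pairs.
        schema_version = 1

        @classmethod
        def load_records(cls, records):
            # Default: yield records untouched as long as they already
            # match the current schema, and complain otherwise.
            for version, data in records:
                if version != cls.schema_version:
                    raise ValueError("no upgrade from schema version %r"
                                     % (version,))
                yield version, data

    class Note(RecordLoader):
        # Example override doing a "simple upgrade": a field rename.
        schema_version = 2

        @classmethod
        def load_records(cls, records):
            for version, data in records:
                if version == 1:
                    data = dict(data)
                    data["body"] = data.pop("text")   # hypothetical rename
                    version = 2
                yield version, data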

This concept looks like it would work for an overall repository dump and reload of items by themselves. It does not address certain issues regarding many-to-many and ordered relationships (which I'll get to later), nor does it handle the problem of upgrading parcel-specific data structures (e.g. block trees and copies thereof).

The issue with parcel-specific data structures is that these are not things described by the parcel's own schema. For example, if you have a tree of blocks, they're going to be described by block framework schemas. So, there seems to be a need for a post-reload hook that gets called in a parcel to allow fixup of such structures. Perhaps the simplest way to support that would be to invoke installParcel() again, with oldVersion set to the reloaded version.

These installParcel() recalls would have to happen in an order that ensures each parcel's dependencies have already been processed before the parcel itself. That is, if parcel foo depends on parcel bar, then bar's installParcel() must be called before foo's.
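
Getting that order is just a topological sort over the parcel dependency graph. A minimal sketch, assuming the graph is available as a simple mapping from parcel name to the names it depends on:

    def parcel_install_order(depends_on):
        # 'depends_on' maps a parcel name to the names of the parcels it
        # depends on.  Yields an order in which every parcel comes after
        # all of its dependencies; a cycle raises an error, which is why
        # remaining dependency cycles have to be eliminated first.
        done, in_progress, order = set(), set(), []

        def visit(name):
            if name in done:
                return
            if name in in_progress:
                raise ValueError("dependency cycle involving %r" % (name,))
            in_progress.add(name)
            for dep in depends_on.get(name, ()):
                visit(dep)
            in_progress.remove(name)
            done.add(name)
            order.append(name)

        for name in depends_on:
            visit(name)
        return order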

(Note that this means we *must* eliminate any remaining dependency cycles between parcels in order to implement a viable dump/reload system.)


Storing Relationships
---------------------

In order to "flatten" items into simple data records, references to other objects have to be reduced to elementary types or records thereof. This means, for example, that inter-item references might need to be reduced to a UUID.

In the simplest case of one-to-one relationships, this is straightforward since it's just representing the attribute as a UUID. Even simple one-to-many relationships are easily handled by representing only the "one" side of the relationship in the records, and allowing the "many" side to be rebuilt automatically via the biref machinery when the records are reloaded.
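
As an entirely made-up illustration of what those flattened records might look like (kind names, field names, and the tuple layout are all invented):

    # One-to-one: the reference is reduced to the other item's UUID.
    address_record = ("EmailAddress", "<uuid-of-address>",
                      {"fullName": "Jane Doe",
                       "person": "<uuid-of-person>"})

    # One-to-many: only the single-valued ("one") side carries the
    # reference; the "many" side is rebuilt by the biref machinery
    # when the records are reloaded.
    message_record = ("Message", "<uuid-of-message>",
                      {"folder": "<uuid-of-folder>"})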

There is a problem with this approach, however, because all birefs are currently *ordered* collections, even in cases where the order is meaningless. I had hoped during the schema API implementation last year that we could migrate to using ``schema.Many()`` in places where an ordered collection was unnecessary, but I didn't realize at the time that "set" cardinality in the repository could not be part of a bidirectional reference, so ``schema.Many()`` has mostly gone unused.

Why is this important? Because external representation of data as a stream of records requires additional fields or records to reconstruct the sequence in a relationship. Either an additional field is required to store a key that indicates the order, or there has to be a sequence of records whose sole purpose is to represent the ordering. (This is especially complex in the case of order-preserving many-to-many relationships!)

So, a key step in getting our schema ready for dump-and-reload support is going to be examining our current use of schema.Sequence to see whether these use cases actually require order-preservation. In many cases, we are actually using indexes based on other attribute values, so the unindexed order isn't necessary and it would be redundant to write it out in a dump. We currently have about 120 Sequence attributes that would need to be reviewed.

Since we have only around 3 uses of ``schema.Many()``, I would suggest we create a new descriptor to use in these cases, to free up the use of ``schema.Many`` for ``Sequence`` attributes that don't really need to be sequences. That would leave the following as possible relationship types (ignoring mirror images):

* One - Many
* One - One
* Many - Many
* One - Sequence
* Many - Sequence
* Sequence - Sequence

One-to-many and one-to-one relationships can be represented in external form without introducing any new fields or records; records on the "one" side would simply carry the identity of the referenced object.

Most of the other relationships would require an extra series of records to be written out, each record listing the identity of the objects on either side of the link. In the One-Sequence and Many-Sequence cases, the order of the records would indicate the order of the links on the "Sequence" side of the relationship.
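
For the link-record cases, the dump might carry something like the following (again, the names and shapes are invented). For One-Sequence and Many-Sequence, the order in which the link records appear is itself the ordering information for the "Sequence" side:

    # Hypothetical link records for a Many-Sequence relationship between
    # a collection and its items; the order of the records is the order
    # of the items on the collection's ("Sequence") side.
    link_records = [
        ("Collection.items", "<uuid-of-collection>", "<uuid-of-first-item>"),
        ("Collection.items", "<uuid-of-collection>", "<uuid-of-second-item>"),
        ("Collection.items", "<uuid-of-collection>", "<uuid-of-third-item>"),
    ]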

Sequence-Sequence relationships, however, are unstable, in that there is no simple way to read and write them without including data to indicate the sequence on both sides, and then running a cleanup pass to fix up the loaded sequences. (Or by having special support from the repository to allow each side to be loaded independently without the normal automatic linking.)

Thus, if possible, I'd like to abolish Sequence-to-Sequence relationships, allowing only the other five kinds of bidirectional relationships to exist. A sequence-to-sequence relationship is inherently unstable given the nature of birefs, anyway. Even if you're adding things to one side in a particular controlled order, the other side likely won't be in a meaningful order of its own. That we have these relationships in the schema now is largely due to "many" and "sequence" being spelled the same way (i.e. ``schema.Sequence``) because the repository's current implementation doesn't offer a non-order-preserving cardinality that works as half of a biref.

This would require a schema review, but I think it's going to be important to make this distinction in the schema anyway, as it ultimately gives the repository more implementation flexibility for future performance improvements if we only have to implement sequences where sequences are really required. (Today's repository will not care, of course, so the distinction will be purely for documentation of intent and for possible simplification of the dump/reload process.)

Note that I've been making an assumption in all of the above that records would not contain any sequences or complex data structures themselves; but this assumption needn't apply to small "many" or "sequence" attributes. It could be an optimization to write such links inline, but it would require hint information of some kind in the schema to suggest which side is better to write this information out on. I'm somewhat inclined against this approach, though, because we will also have to support large sequences (e.g. from UI-level item collections) and I'd just as soon not have two different implementations.

But that's more of an API distinction; the API can represent things as independent records even if the underlying storage format interleaves the data to some extent. And conversely, the API could transform a non-interleaved format to an interleaved one, albeit at the cost of additional memory.


Wrap-up -- for now
------------------

At this point I haven't covered much actual API detail, or anything at all about the actual external format. I don't actually care much about the external format, since it's not a requirement that it be processed by other programs, and parcel writers will never see it directly. The API will only expose streams of records of elementary types, and provide a way for parcel writers to transform individual records as the streams go by, and to do pre- and post-processing on the repository contents.

There's still a lot of work to be done to take these high-level thoughts and turn them into a concrete API proposal, but at this point I'd like input and feedback on the points I've presented so far, in order to make sure that the next steps are going in the right direction.
