At 05:12 PM 10/9/2007 -0700, Andi Vajda wrote:
On Tue, 9 Oct 2007, Phillip J. Eby wrote:
At 04:12 PM 10/9/2007 -0700, Andi Vajda wrote:
On Tue, 9 Oct 2007, Phillip J. Eby wrote:
1. application-level code meddling in storage-level details
Could you give some examples ?
Any place where the application is creating collections or working
with indexes in order to get better performance than "naive"
iteration or queries would give.
I see. Creating a collection is like creating a query.
Sometimes yes and sometimes no. I'm referring to creating
collections for explicit caching or performance purposes -- what
would correspond to a materialized view in an RDBMS. In Chandler,
access to such "views" is explicit, rather than hidden at another layer.
But maybe this thread isn't about relational vs object - as I'm
afraid it is - but perhaps about better app layering ?
Primarily layering, yes. We are missing layers where there should be
layers, and have layers where we shouldn't. Also, currently, the
application is shaped around the repository and the idea of
transparent persistence, but this actually makes the domain model
less efficient and modular than it would otherwise be.
For example, we currently emulate relational features at the
application layer using stamps and annotations, but this
architecture is to some extent foreign to the repository. Where in a
relational database, an annotation or stamp row would be an
independent record, the repository considers the whole thing to be
one "item" -- which affects loading, indexing, and so on.
We could certainly refactor to use separate items for these things,
but the point is that we wouldn't have to -- if we had a logical to
physical mapping (like how Hibernate can be used in Cosmo).
The thing is, once you look at this from the app layering
perspective, the mismatch between the relatively simple things the
app is trying to do, and the very powerful generality of the
repository, becomes more apparent.
3. no indirection between the application's logical schema and
its physical storage schema
Seems incorrect. I can change the physical storage schema (core
schema or even repo format) without affecting app code. Or am I
misunderstanding something ?
Sorry, I am using the relational meaning of logical and
physical. A logical schema does not include indexes or views,
while a physical schema does. I'm also extending this to refer to
the lack of distinction between our preferred form of data as
encapsulated objects, versus the best divisions of data from a
performance point of view.
In Chandler we've had for a long time the distinction between
capital 'I' Items and lowercase 'i' items. This distinction is most
concretely realized in the dump/reload/EIM work, which is a way to
export 'I' Items. The repository, on the other hand, deals with 'i' items.
Isn't this equivalent to what you're talking about ?
Yes. The key distinction, vis-a-vis relational vs. repository, is
that we would now be adding another Python layer on top of those
that already exist. Whereas, if we used a Python ORM, we would
have just the mapping and an all-C backend that we don't have to
maintain. In fact, the mapping layer might also be maintained by
someone else, if we use one of the many O-R mappers for Python such
as SQLObject, Storm, Axiom, DejaVu, SQLAlchemy, Mother,... and
probably others I've forgotten about.
As for indexes, yes, you're correct: they're not part of the logical
schema. They're performance implementation details chosen by the
app, just as in a relational app where the app ultimately has to
know about table layout, keys, and indexes, and put kludges into
stored procedures, to make queries efficient.
We need to distinguish between "app" in the sense of "all of
Chandler" and "app" in the sense of "domain/interaction code". The
domain/interaction code should most definitely *not* know about such
things; it is the storage layer's job to specify a mapping
between the logical and physical schema, just as in EIM. (It would
be nice if we could reuse EIM for this, or have a way to
automatically map to and from EIM.)
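To make the layering concrete, here is a toy logical-to-physical mapping layer -- a sketch only, not Chandler's actual EIM or schema API code, and all names (Note, MAPPING, save) are invented for illustration. The point is that the mapping spec belongs to the storage layer, so the physical table layout can change without touching domain code:

```python
import sqlite3

# Hypothetical mapping spec owned by the storage layer: it decides
# which table and columns hold each logical type.
MAPPING = {
    "Note": {"table": "note", "columns": {"title": "title", "body": "body"}},
}

class Note:
    """Domain object -- knows nothing about tables or indexes."""
    def __init__(self, title, body):
        self.title, self.body = title, body

def save(db, obj):
    # Translate the logical object into physical rows via the spec.
    spec = MAPPING[type(obj).__name__]
    cols = spec["columns"]  # attribute name -> column name
    sql = "INSERT INTO %s (%s) VALUES (%s)" % (
        spec["table"], ", ".join(cols.values()), ", ".join("?" * len(cols)))
    db.execute(sql, [getattr(obj, attr) for attr in cols])

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE note (title TEXT, body TEXT)")
save(db, Note("groceries", "milk, eggs"))
```

Renaming a column or splitting the table would change only MAPPING, not the Note class -- which is the indirection point 3 is asking for.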
4. implementing a generic database inside another generic database
That was the goal, originally.
Not quite; having a generic database was the goal, not that it be
implemented *inside* another generic database. It is one thing to
have a BerkeleyDB persistence layer driven by the application's
dynamic schema, and another one altogether to implement a database
on top of a fixed BerkeleyDB schema.
For comparison purposes, consider OpenLDAP: it is a generic,
hierarchical, networked database implemented atop
BerkeleyDB. However, instead of having a fixed schema for storing
values, items, etc., in BerkeleyDB, its schema is dynamically
extended as attribute types and indexes are added. So the database is
*represented* in BerkeleyDB, rather than being implemented *inside* BerkeleyDB.
I think we disagree or misunderstand each other here. Or maybe I'm
simply not following you. While it's not relational, the Chandler
repository has to jump through the same hoops as OpenLDAP or MySQL
to store anything in Berkeley DB. Berkeley DB can only store
key/value pairs of byte strings in b-trees, hashes, queues, and a
fourth structure whose name escapes me at the moment.
But you don't create a new BerkeleyDB index every time another
attribute is indexed in Chandler, right? That's the difference. When
you add a new attribute or index in OpenLDAP, it in fact creates
separate BerkeleyDB-level files. And the same is true for MySQL as well.
In other words, OpenLDAP and MySQL express their dynamic schemas
using BerkeleyDB, rather than using a static BerkeleyDB schema as a
meta-level to express the dynamic schema. It could be compared to
the difference between interpretation and compilation; the
repository is in effect an "interpreted" BerkeleyDB app, where
OpenLDAP and MySQL are "compiled".
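The two strategies can be contrasted in a toy sketch -- plain Python dicts stand in for BerkeleyDB files here; this is an illustration of the idea, not the code of either system:

```python
# "Compiled" (OpenLDAP/MySQL style): adding an index creates a new
# native structure, one per index, so the storage engine's own lookup
# machinery does all the work.
compiled = {}

def compiled_add_index(name):
    compiled[name] = {}  # a separate native structure per index

compiled_add_index("displayName")
compiled["displayName"]["foo"] = "item-1"

# "Interpreted" (fixed-schema style): one static key/value store; the
# dynamic schema lives at a meta-level, encoded into composite keys
# that application code must itself decode on every access.
interpreted = {}

def interpreted_put(index_name, key, value):
    interpreted[("index", index_name, key)] = value

interpreted_put("displayName", "foo", "item-1")
```

In the first case each lookup is one probe into a purpose-built structure; in the second, every operation pays for the extra layer of key encoding and decoding.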
(By the way, ZODB once used a similar implementation strategy to the
repository for BerkeleyDB, and it had comparable performance
issues. It does a lot better with its logfile-based storage format,
which at least avoids adding in BerkeleyDB's paging overhead.)
I'm not sure what you mean by "hard compiled". Nothing stops us
from having a relational schema that's extensible by parcels, or
from doing so dynamically. In truth, the schemas we use with the
repository today are no less "hard compiled". If we at some future
time allow user-defined fields, there are still ways to represent
them within such a relatively-static schema, or to simply modify
the schema at runtime.
Once you've worked hard at extracting performance from your static
schema, so that queries and joins are not too massive, any extension
throws the whole effort into question over and over again.
We might ask our Cosmo brethren if they have found this to be the
case. Ideally, however, if you have a mapping layer like Hibernate
that lets you specify the physical model separately from the logical
model, then no application-layer code should change.
I think, however, that your concern about joins is unwarranted. In a
"table per class or annotation" mapping with lazy loading, new data
added by plugins does *not* affect query performance -- whereas, if
I understand correctly, it *does* with the repository. Queries that
are part of Chandler's static model are only going to display stuff
from base tables, with no need to do joins at all. For example, I
see no reason why the dashboard and calendar views of Chandler can't
be done with a single modest table with a few indexes. Any
extensions or annotations will be in separate tables, and the data in
them would only be needed when accessing the object in the detail
view, or doing other "full object" operations like sharing, dump/reload, etc.
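A minimal sketch of that layout, using sqlite3 as a stand-in RDBMS -- the table and column names are invented for illustration, not Chandler's schema. The dashboard query touches only the base table and its index; annotation rows live in their own table and are read only when the full item is wanted:

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    -- Base table: enough for dashboard/calendar display.
    CREATE TABLE item (uuid TEXT PRIMARY KEY, title TEXT, start_time TEXT);
    CREATE INDEX item_start ON item (start_time);
    -- Annotation data lives in its own table (one per annotation type).
    CREATE TABLE task_annotation (uuid TEXT PRIMARY KEY, done INTEGER);
""")
db.execute("INSERT INTO item VALUES ('a1', 'standup', '2007-10-09T09:00')")
db.execute("INSERT INTO task_annotation VALUES ('a1', 0)")

# Dashboard/calendar view: base table plus one index, no joins at all.
rows = db.execute(
    "SELECT uuid, title FROM item WHERE start_time >= ? ORDER BY start_time",
    ("2007-10-09",)).fetchall()

# Detail view: only now is the annotation table consulted (lazy load).
done = db.execute(
    "SELECT done FROM task_annotation WHERE uuid = ?", ("a1",)).fetchone()
```

A plugin adding its own annotation table changes neither the base table nor the dashboard query's cost.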
Yes, that's a tradeoff: it takes more time to load an individual
item in full, and less to access items in bulk. However,
proper separation of responsibilities to distinct interfaces can
often prevent the need to access individual items in the first
place. For example, if there's a mapping from any subset of the
tables to EIM records, you can generate the EIM records in bulk,
letting the RDBMS do most of the heavy lifting, rather than loading
items one at a time to generate their EIM.
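The bulk-generation idea amounts to one set-oriented query instead of a per-item loop; here is a sketch with a hypothetical schema (again sqlite3 standing in for the RDBMS, and the table names invented):

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE item (uuid TEXT PRIMARY KEY, title TEXT);
    CREATE TABLE event (uuid TEXT PRIMARY KEY, start_time TEXT);
    INSERT INTO item VALUES ('a1', 'standup'), ('a2', 'memo');
    INSERT INTO event VALUES ('a1', '2007-10-09T09:00');
""")

# One LEFT JOIN lets the RDBMS do the heavy lifting: every item comes
# back as a flat, EIM-like record, with NULL where an annotation row
# is absent -- no item objects are loaded one at a time.
records = db.execute("""
    SELECT i.uuid, i.title, e.start_time
    FROM item i LEFT JOIN event e ON e.uuid = i.uuid
    ORDER BY i.uuid
""").fetchall()
```

The per-item cost of the fuller physical split is paid only when it buys something; bulk paths stay set-oriented.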
Any plugin developer will have to understand this. This was the main
reason why we didn't choose this route five years ago. Maybe now we
don't care anymore about this aspect as much.
Five years ago, Python ORMs and embedded RDBMSs were neither so
numerous nor ubiquitous; it's not clear that this route *would* have
made sense back then.
Nowadays, if you have recent web development experience in Python,
odds are you've at least tried an ORM or two. And for simple
things, you usually don't write much SQL directly with these tools.
For example, in conversations I've had with Grant, he compared
Chandler with Mail.app and iCal.app, which have such static schemas
and can perform much better in their specific domains than the more
generic Chandler.
If that's the route we'd like to take Chandler to, fine. That should
be clearly stated.
I believe Katie has already stated it, even as far back as the
creation of the schema API. But I imagine she will clarify it again if needed.
I'm not exactly against it either, just a lot less excited about it.
It'd be a different product, albeit with a lot of the same visible
0.7/1.0 features of today, but a dead end nonetheless. Chandler would
only ever do what it's hardcoded to do (from a schema standpoint).
As with Grant, I don't understand what you mean. We "hardcode" these
things already, and were doing so before I even joined the project.
Perhaps you are referring to features the repository has, that the
application does not use? If so, which ones?
The last five years of work would be pretty much wasted, except for
their "what not to do" aspect :)
Not really; Chandler needed *some* form of persistence to get where
it is today, and the tools did not exist back then. ZODB was really
the only reasonable Python competitor at the time, IIRC.
Of course, the actual application direction had certainly changed by
2004 or early 2005, and I have been bringing up these points
intermittently ever since. So if we'd refactored sooner, we could
potentially have wasted less work in the interim. At the same time,
the application is a lot different today than in 2004 -- so it's not
clear we could have avoided wasting some work, somewhere. Preview is
an important milestone because we're saying that, "this is pretty
much where we're going", so now is a good time to consider the best
choices for the requirements.
It may not be as interesting with respect to the repository, but
there are still plenty of interesting development opportunities in
and around Chandler, as you've shown with JCC -- which I think is
cool and wish *I* had time to play with it. :)
5. implementing generic indexes inside of generic indexes
How so ? What are you thinking about ?
The skip list system is the main one I have in mind, but if I
correctly understand how versions and values are stored, then those
would be included too.
Yes, a skip list implements the structure behind repository indexes.
What are the "generic indexes" that skip lists are implemented in
that you're talking about ?
I mean that by implementing a skiplist *inside* of BerkeleyDB rather
than using a native BerkeleyDB structure, we're adding an
"interpretation" layer there.
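A toy contrast of the two approaches -- this is an illustration of the overhead being described, not the repository's actual skip-list code:

```python
import bisect

# "Native": the storage engine's own ordered structure keeps keys
# sorted; a Python list plus bisect stands in for a BerkeleyDB btree.
native = []
for key in [30, 10, 20]:
    bisect.insort(native, key)

# "Interpreted": an ordered list whose links are themselves key/value
# records inside the store, so every traversal step is another store
# lookup -- roughly the extra layer a skip list built on top of
# BerkeleyDB records adds.
store = {"head": 10, 10: 20, 20: 30, 30: None}

def in_order(store):
    out, node = [], store["head"]
    while node is not None:
        out.append(node)
        node = store[node]  # each hop is a separate store access
    return out
```

Both yield the same ordering, but the second pays a store round-trip per link, on top of whatever the store itself does internally.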
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
Open Source Applications Foundation "chandler-dev" mailing list
http://lists.osafoundation.org/mailman/listinfo/chandler-dev