At 05:12 PM 10/9/2007 -0700, Andi Vajda wrote:

On Tue, 9 Oct 2007, Phillip J. Eby wrote:

At 04:12 PM 10/9/2007 -0700, Andi Vajda wrote:

On Tue, 9 Oct 2007, Phillip J. Eby wrote:

1. application-level code meddling in storage-level details
Could you give some examples ?

Any place where the application is creating collections or working with indexes in order to achieve performance compared to "naive" iteration or queries.

I see. Creating a collection is like creating a query.

Sometimes yes and sometimes no. I'm referring to creating collections for explicit caching or performance purposes -- what would correspond to a materialized view in an RDBMS. In Chandler, access to such "views" is explicit, rather than hidden at another layer.
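To make the "materialized view" analogy concrete, here is a minimal sqlite3 sketch (the `items`/`task_cache` tables are invented for illustration): a derived collection stored as its own table is fast to read repeatedly but must be explicitly refreshed, whereas an ordinary query is recomputed on demand and is always current.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE items (id INTEGER PRIMARY KEY, kind TEXT, title TEXT)")
conn.executemany("INSERT INTO items (kind, title) VALUES (?, ?)",
                 [("task", "write report"), ("note", "ideas"), ("task", "file taxes")])

# An ordinary query: recomputed on every access, always current.
tasks = conn.execute("SELECT id, title FROM items WHERE kind = 'task'").fetchall()

# A "materialized view": the result cached as a table for fast repeated access,
# but the application must explicitly refresh it when the underlying items change.
conn.execute("CREATE TABLE task_cache AS SELECT id, title FROM items WHERE kind = 'task'")
conn.execute("INSERT INTO items (kind, title) VALUES ('task', 'new task')")

live = conn.execute("SELECT COUNT(*) FROM items WHERE kind = 'task'").fetchone()[0]
cached = conn.execute("SELECT COUNT(*) FROM task_cache").fetchone()[0]
# The cache is now stale: live == 3, cached == 2.
```

In an RDBMS the refresh would be hidden in the storage layer; in Chandler, as noted above, access to such "views" is explicit.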


But maybe this thread isn't about relational vs. object - as I'm afraid it is - but perhaps about better app layering?

Primarily layering, yes. We are missing layers where there should be layers, and have layers where we shouldn't. Also, currently, the application is shaped around the repository and the idea of transparent persistence, but this actually makes the domain model less efficient and modular than it would otherwise be.

For example, we currently emulate relational features at the application layer using stamps and annotations, but this architecture is to some extent foreign to the repository. Whereas in a relational database an annotation or stamp row would be an independent record, the repository considers the whole thing to be one "item" -- which affects loading, indexing, and so on.

We could certainly refactor to use separate items for these things, but the point is that we wouldn't have to -- if we had a logical to physical mapping (like how Hibernate can be used in Cosmo).

The thing is, once you look at this from the app layering perspective, the mismatch between the relatively simple things the app is trying to do, and the very powerful generality of the repository, becomes more apparent.


3. no indirection between the application's logical schema and its physical storage schema
Seems incorrect. I can change the physical storage schema (core schema or even repo format) without affecting app code. Or am I misunderstanding something ?

Sorry, I am using the relational meaning of logical and physical. A logical schema does not include indexes or views, while a physical schema does. I'm also extending this to refer to the lack of distinction between our preferred form of data as encapsulated objects, versus the best divisions of data from a performance point of view.

In Chandler we've had for a long time the distinction between capital 'I' Items and lowercase 'i' items. This distinction has mostly materialized with the dump/reload/EIM work, which is a way to export 'I' Items. The repository, on the other hand, deals with 'i' items. Isn't this equivalent to what you're talking about?

Yes. The key distinction vis-a-vis relational vs. repository is that we would now be adding another Python layer, in addition to those that already exist. Whereas, if we used a Python ORM, we would have just the mapping and an all-C backend that we don't have to maintain. In fact, the mapping layer might also be maintained by someone else, if we use one of the many O-R mappers for Python such as SQLObject, Storm, Axiom, DejaVu, SQLAlchemy, Mother... and probably others I've forgotten about.
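To illustrate what such a mapping layer buys, here is a hand-rolled sketch over sqlite3 of the pattern tools like SQLAlchemy automate from declarations (the `Note` class and `notes` table are hypothetical): domain code works with plain Python objects, and a separate mapper owns all knowledge of the physical schema.

```python
import sqlite3
from dataclasses import dataclass

# Domain model: plain Python, no storage knowledge at all.
@dataclass
class Note:
    id: int
    title: str

# Mapping layer: the only place that knows about tables and SQL.
# An ORM generates the equivalent of this class from schema declarations.
class NoteMapper:
    def __init__(self, conn):
        self.conn = conn
        conn.execute("CREATE TABLE IF NOT EXISTS notes (id INTEGER PRIMARY KEY, title TEXT)")

    def save(self, note):
        self.conn.execute("INSERT OR REPLACE INTO notes VALUES (?, ?)",
                          (note.id, note.title))

    def get(self, note_id):
        row = self.conn.execute("SELECT id, title FROM notes WHERE id = ?",
                                (note_id,)).fetchone()
        return Note(*row) if row else None

conn = sqlite3.connect(":memory:")
mapper = NoteMapper(conn)
mapper.save(Note(1, "agenda"))
fetched = mapper.get(1)
```

The point is that the physical schema could be reorganized entirely inside `NoteMapper` (or inside an ORM's mapping declarations) without touching the `Note` class or its callers.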


As for indexes, yes, you're correct. They're not part of the logical schema. They're performance implementation details chosen by the app, just as in a relational app, where the app ultimately has to know about table layout, keys, and indexes, and put kludges into stored procedures, to make queries efficient.

We need to distinguish between "app" in the sense of "all of Chandler" and "app" in the sense of "domain/interaction code". The domain/interaction code should most definitely *not* know about such things; it is the job of the storage layer to specify a mapping between the logical and physical schemas, just like in EIM. (It would be nice if we could reuse EIM for this, or have a way to automatically map to and from EIM.)
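A small sqlite3 sketch of that separation (the `events` table and function names are invented): domain code calls a storage-layer function, while the index -- a purely physical detail -- is declared only inside the storage layer, so adding or dropping it changes performance without changing any caller.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (id INTEGER PRIMARY KEY, start REAL, title TEXT)")

# Physical schema detail, owned by the storage layer: an index on start time.
# Callers never see it; dropping it would slow queries down but break nothing.
conn.execute("CREATE INDEX idx_events_start ON events (start)")

def events_between(lo, hi):
    """Storage API used by domain/interaction code; no index knowledge leaks out."""
    return conn.execute(
        "SELECT id, title FROM events WHERE start BETWEEN ? AND ? ORDER BY start",
        (lo, hi)).fetchall()

conn.executemany("INSERT INTO events (start, title) VALUES (?, ?)",
                 [(1.0, "standup"), (5.0, "review"), (9.0, "retro")])
result = events_between(0.0, 6.0)
```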


4. implementing a generic database inside another generic database
That was the goal, originally.

Not quite; having a generic database was the goal, not that it be implemented *inside* another generic database. It is one thing to have a BerkeleyDB persistence layer driven by the application's dynamic schema, and another one altogether to implement a database on top of a fixed BerkeleyDB schema.

For comparison purposes, consider OpenLDAP: it is a generic, hierarchical, networked database implemented atop BerkeleyDB. However, instead of having a fixed schema for storing values, items, etc., in BerkeleyDB, it is dynamically extended as attribute types and indexes are added. So the database is *represented* in BerkeleyDB, rather than being implemented *inside* BerkeleyDB.

I think we disagree or misunderstand each other here. Or maybe I'm simply not following you. While it's not relational, the Chandler repository has to go through the same hoops as OpenLDAP or MySQL to store anything in Berkeley DB. Berkeley DB can only store key/value pairs of byte strings in b-trees, hashes, queues, and a fourth structure whose name escapes me at the moment.

But the repository doesn't create a new BerkeleyDB index every time we index another attribute in Chandler, right? That's the difference. When you add a new attribute or index in OpenLDAP, it in fact creates separate BerkeleyDB-level files. And the same is true for MySQL as well.

In other words, OpenLDAP and MySQL express their dynamic schemas using BerkeleyDB, rather than using a static BerkeleyDB schema which is then used as a meta-level to express the dynamic schema. It could be compared to the difference between interpretation and compilation; the repository is in effect an "interpreted" BerkeleyDB app, whereas OpenLDAP and MySQL are "compiled".
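One way to see the "interpreted vs. compiled" distinction in miniature, with sqlite3 standing in for BerkeleyDB (table names invented): a fixed generic attribute/value table through which every application-level schema is expressed, versus expressing the application schema directly as native tables and indexes.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# "Interpreted": one fixed physical schema (a generic item/attr/value table)
# used as a meta-level to express every application schema.
conn.execute("CREATE TABLE triples (item INTEGER, attr TEXT, value TEXT)")
conn.executemany("INSERT INTO triples VALUES (?, ?, ?)",
                 [(1, "title", "standup"), (1, "start", "09:00"),
                  (2, "title", "review"), (2, "start", "14:00")])

# Reassembling one logical record means interpreting the meta-schema row by row.
row = dict(conn.execute("SELECT attr, value FROM triples WHERE item = 1").fetchall())

# "Compiled": the application schema is expressed directly in the backend,
# so the engine's native row storage and per-column indexes do the work.
conn.execute("CREATE TABLE events (id INTEGER PRIMARY KEY, title TEXT, start TEXT)")
conn.execute("CREATE INDEX idx_start ON events (start)")
conn.executemany("INSERT INTO events VALUES (?, ?, ?)",
                 [(1, "standup", "09:00"), (2, "review", "14:00")])
direct = conn.execute("SELECT title, start FROM events WHERE id = 1").fetchone()
```

In the first layout the backend's indexes can only ever index the meta-schema's columns; in the second, every application attribute is directly visible to the storage engine.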

(By the way, ZODB once used a similar implementation strategy to the repository's for BerkeleyDB, and it had comparable performance issues. It does a lot better with its logfile-based storage format, which at least doesn't add in BerkeleyDB's paging overhead.)


I'm not sure what you mean by "hard compiled". Nothing stops us from having a relational schema that's extensible by parcels, or from doing so dynamically. In truth, the schemas we use with the repository today are no less "hard compiled". If we at some future time allow user-defined fields, there are still ways to represent them within such a relatively-static schema, or to simply modify the schema at runtime.
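Modifying a relational schema at runtime is straightforward; here is a sqlite3 sketch of what a parcel-installed extension might look like (the `notes` table and the hook name are invented):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE notes (id INTEGER PRIMARY KEY, title TEXT)")
conn.execute("INSERT INTO notes (title) VALUES ('agenda')")

def install_parcel_field(conn, table, column, sqltype):
    """Hypothetical hook a parcel might call at load time to extend the schema."""
    existing = [r[1] for r in conn.execute(f"PRAGMA table_info({table})")]
    if column not in existing:
        conn.execute(f"ALTER TABLE {table} ADD COLUMN {column} {sqltype}")

# A parcel adds its own field when loaded; existing rows and code are unaffected.
install_parcel_field(conn, "notes", "priority", "INTEGER")
conn.execute("UPDATE notes SET priority = 2 WHERE id = 1")
cols = [r[1] for r in conn.execute("PRAGMA table_info(notes)")]
```

User-defined fields could be handled the same way, or with a side table per field, without disturbing the relatively static core schema.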

Once you've worked hard at extracting performance from your static schema, so that queries and joins are not too massive, any extension throws the whole effort into question over and over again.

We might ask our Cosmo brethren if they have found this to be the case. Ideally, however, if you have a mapping layer like Hibernate that lets you specify the physical model separately from the logical model, then no application-layer code should change.

I think, however, that your concern about joins is unwarranted. In a "table per class or annotation" mapping with lazy loading, plugins adding new data do *not* affect query performance -- which, if I understand correctly, *is* the case with the repository. Queries that are part of Chandler's static model are only going to display stuff from base tables, with no need to do joins at all. For example, I see no reason why the dashboard and calendar views of Chandler can't be done with a single modest table with a few indexes. Any extensions or annotations will be in separate tables, and the data in them would only be needed when accessing the object in the detail view, or when doing other "full object" operations like sharing, dump/reload, etc.
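A sketch of the "table per annotation" layout in sqlite3 (all table and column names invented): the dashboard query touches only the base table and is unaffected by how many annotation tables plugins add; an annotation is loaded lazily, only when one item's full detail is needed.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Base table: everything the dashboard/calendar views need to display.
conn.execute("CREATE TABLE items (id INTEGER PRIMARY KEY, title TEXT, starred INTEGER)")
# A plugin's annotation lives in its own table; base queries never join against it.
conn.execute("CREATE TABLE photo_annotation (item_id INTEGER PRIMARY KEY, photo_url TEXT)")

conn.executemany("INSERT INTO items VALUES (?, ?, ?)",
                 [(1, "standup", 1), (2, "review", 0)])
conn.execute("INSERT INTO photo_annotation VALUES (1, 'http://example.com/p.jpg')")

# Dashboard query: base table only, join-free, regardless of installed plugins.
dashboard = conn.execute("SELECT id, title FROM items WHERE starred = 1").fetchall()

# Detail view: lazily fetch the annotation for the one item being inspected.
detail = conn.execute(
    "SELECT photo_url FROM photo_annotation WHERE item_id = ?", (1,)).fetchone()
```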

Yes, that's a tradeoff: it takes more time to load an individual item in full, versus less time to access items in bulk. However, proper separation of responsibilities into distinct interfaces can often prevent the need to access individual items in the first place. For example, if there's a mapping from any subset of the tables to EIM records, you can generate the EIM records in bulk, letting the RDBMS do most of the heavy lifting, rather than loading items one at a time to generate their EIM.
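A sqlite3 sketch of that bulk path (the record shape and table names here are invented stand-ins for real EIM): one set-oriented query assembles a record per item in a single pass, instead of the application loading and serializing each item object individually.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE items (id INTEGER PRIMARY KEY, title TEXT)")
conn.execute("CREATE TABLE event (item_id INTEGER PRIMARY KEY, start TEXT)")
conn.executemany("INSERT INTO items VALUES (?, ?)", [(1, "standup"), (2, "notes")])
conn.execute("INSERT INTO event VALUES (1, '09:00')")

def export_records(conn):
    """Hypothetical bulk export: the database assembles every record in one query,
    so no item is ever loaded one at a time on the Python side."""
    sql = """SELECT i.id, i.title, e.start
             FROM items i LEFT JOIN event e ON e.item_id = i.id
             ORDER BY i.id"""
    return [("ItemRecord", *row) for row in conn.execute(sql)]

records = export_records(conn)
```

Items without the optional event data simply come back with NULL in that field, which matches the record-set idea of partially populated records.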


Any plugin developer will have to understand this. This was the main reason why we didn't choose this route five years ago. Maybe now we don't care anymore about this aspect as much.

Five years ago, Python ORMs and embedded RDBMSes were neither so numerous nor so ubiquitous; it's not clear that this route *would* have made sense five years ago.

Nowadays, if you have recent web development experience in Python, odds are you've at least tried an ORM or two. And for simple things, you usually don't write much SQL directly with these tools.


For example, in conversations I've had with Grant, he compared Chandler with Mail.app and iCal.app, which have such static schemas and can perform much better in their specific domains than the more generic Chandler.

If that's the route we'd like to take Chandler to, fine. That should be clearly stated.

I believe Katie has already stated it, even as far back as the creation of the schema API. But I imagine she will clarify it again if needed.


I'm not exactly against it either, just a lot less excited about it.

It'd be a different product, albeit one with a lot of the same visible 0.7/1.0 features of today, but a dead end nonetheless. Chandler would only ever do what it's hardcoded to do (from a schema standpoint).

As with Grant, I don't understand what you mean. We "hardcode" these things already, and were doing so before I even joined the project.

Perhaps you are referring to features the repository has, that the application does not use? If so, which ones?


The last five years of work would be pretty much wasted, except for their "what not to do" aspect :)

Not really; Chandler needed *some* form of persistence to get where it is today, and the tools did not exist back then. ZODB was really the only reasonable Python competitor at the time, IIRC.

Of course, the actual application direction had certainly changed by 2004 or early 2005, and I have been bringing up these points intermittently ever since. So if we'd refactored sooner, we could potentially have wasted less work in the interim. At the same time, the application is a lot different today than it was in 2004 -- so it's not clear we could have avoided wasting some work, somewhere. Preview is an important milestone because we're saying, "this is pretty much where we're going", so now is a good time to consider the best choices for the requirements.

It may not be as interesting with respect to the repository, but there are still plenty of interesting development opportunities in and around Chandler, as you've shown with JCC -- which I think is cool and wish *I* had time to play with. :)


5. implementing generic indexes inside of generic indexes
How so ? What are you thinking about ?

The skip list system is the main one I have in mind, but if I correctly understand how versions and values are stored, then those would be included too.

Yes, a skip list implements the structure behind repository indexes. What are the "generic indexes" that skip lists are implemented in that you're talking about?

I mean that by implementing a skip list *inside* BerkeleyDB, rather than using a native BerkeleyDB structure, we're adding an "interpretation" layer there.
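That extra layer can be sketched in a few lines of Python, with a dict standing in for the BerkeleyDB key/value store and the skip list simplified to a single linked level: each traversal link in the stored structure costs another store lookup, whereas a natively ordered structure (here, `bisect` over a sorted key list) answers the same query directly.

```python
import bisect

# "Interpreted" index: a linked structure whose nodes live as separate
# key/value records, so each traversal step is another store lookup.
store = {"head": 10, 10: ("a", 20), 20: ("b", 30), 30: ("c", None)}

def lookup_linked(store, key):
    node, hops = store["head"], 0
    while node is not None:
        hops += 1                      # one store access per link followed
        value, nxt = store[node]
        if node == key:
            return value, hops
        node = nxt
    return None, hops

# "Native" ordered index: the backend keeps keys sorted itself,
# the way a BerkeleyDB b-tree keeps its keys in order.
keys, values = [10, 20, 30], ["a", "b", "c"]

def lookup_native(key):
    i = bisect.bisect_left(keys, key)
    return values[i] if i < len(keys) and keys[i] == key else None

value, hops = lookup_linked(store, 30)   # three store accesses to reach key 30
native = lookup_native(30)               # one ordered lookup
```

A real skip list traverses fewer links than this flat list, of course, but every link it does follow still pays the full key/value round trip, on top of whatever paging the store itself does.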


Open Source Applications Foundation "chandler-dev" mailing list
http://lists.osafoundation.org/mailman/listinfo/chandler-dev
