At 05:12 PM 10/9/2007 -0700, Andi Vajda wrote:
On Tue, 9 Oct 2007, Phillip J. Eby wrote:
At 04:12 PM 10/9/2007 -0700, Andi Vajda wrote:
On Tue, 9 Oct 2007, Phillip J. Eby wrote:
1. application-level code meddling in storage-level details
Could you give some examples ?
Any place where the application is creating collections or working
with indexes in order to get better performance than "naive"
iteration or queries would give.
I see. Creating a collection is like creating a query.
Sometimes yes and sometimes no. I'm referring to creating
collections for explicit caching or performance purposes -- what
would correspond to a materialized view in an RDBMS. In Chandler,
access to such "views" is explicit, rather than hidden at another layer.
But maybe this thread isn't about relational vs object - as I'm
afraid it is - but perhaps about better app layering ?
Primarily layering, yes. We are missing layers where there should be
layers, and have layers where we shouldn't. Also, currently, the
application is shaped around the repository and the idea of
transparent persistence, but this actually makes the domain model
less efficient and modular than it would otherwise be.
For example, we currently emulate relational features at the
application layer using stamps and annotations, but this
architecture is to some extent foreign to the repository. Where in a
relational database, an annotation or stamp row would be an
independent record, the repository considers the whole thing to be
one "item" -- which affects loading, indexing, and so on.
We could certainly refactor to use separate items for these things,
but the point is that we wouldn't have to -- if we had a logical to
physical mapping (like how Hibernate can be used in Cosmo).
The thing is, once you look at this from the app layering
perspective, the mismatch between the relatively simple things the
app is trying to do, and the very powerful generality of the
repository, becomes more apparent.
3. no indirection between the application's logical schema and
its physical storage schema
Seems incorrect. I can change the physical storage schema (core
schema or even repo format) without affecting app code. Or am I
misunderstanding something ?
Sorry, I am using the relational meaning of logical and
physical. A logical schema does not include indexes or views,
while a physical schema does. I'm also extending this to refer to
the lack of distinction between our preferred form of data as
encapsulated objects, versus the best divisions of data from a
performance point of view.
In Chandler we've had for a long time the distinction between
capital 'I' Items and lowercase 'i' items. This distinction is most
concretely realized in the dump/reload/EIM work, which is a way to
export 'I' Items. The repository, on the other hand, deals with 'i' items.
Isn't this equivalent to what you're talking about ?
Yes. The key distinction, vis-a-vis relational vs. repository, is
that we would now be adding another Python layer on top of those
that already exist. Whereas, if we used a Python ORM, we would
have just the mapping and an all-C backend that we don't have to
maintain. In fact, the mapping layer might also be maintained by
someone else, if we use one of the many O-R mappers for Python such
as SQLObject, Storm, Axiom, DejaVu, SQLAlchemy, Mother,... and
probably others I've forgotten about.
As for indexes, yes, you're correct: they're not part of the logical
schema. They're performance implementation details chosen by the
app, just as in a relational app where the app ultimately has to
know about table layout, keys, and indexes, and put kludges into
stored procedures, to make queries efficient.
We need to distinguish between "app" in the sense of "all of
Chandler" and "app" in the sense of "domain/interaction code". The
domain/interaction code should most definitely *not* know about such
things; it is the storage layer's job to specify a mapping
between the logical and physical schema, just as in EIM. (It would
be nice if we could reuse EIM for this, or have a way to
automatically map to and from EIM.)
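To make the layering concrete, here is a toy logical-to-physical mapping layer -- a sketch only, not Chandler's actual EIM or schema API code, and all names (Note, MAPPING, save) are invented for illustration. The point is that the mapping spec belongs to the storage layer, so the physical table layout can change without touching domain code:

```python
import sqlite3

# Hypothetical mapping spec owned by the storage layer: it decides
# which table and columns hold each logical type.
MAPPING = {
    "Note": {"table": "note", "columns": {"title": "title", "body": "body"}},
}

class Note:
    """Domain object -- knows nothing about tables or indexes."""
    def __init__(self, title, body):
        self.title, self.body = title, body

def save(db, obj):
    # Translate the logical object into physical rows via the spec.
    spec = MAPPING[type(obj).__name__]
    cols = spec["columns"]  # attribute name -> column name
    sql = "INSERT INTO %s (%s) VALUES (%s)" % (
        spec["table"], ", ".join(cols.values()), ", ".join("?" * len(cols)))
    db.execute(sql, [getattr(obj, attr) for attr in cols])

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE note (title TEXT, body TEXT)")
save(db, Note("groceries", "milk, eggs"))
```

Renaming a column or splitting the table would change only MAPPING, not the Note class -- which is the indirection point 3 is asking for.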
4. implementing a generic database inside another generic database
That was the goal, originally.
Not quite; having a generic database was the goal, not that it be
implemented *inside* another generic database. It is one thing to
have a BerkeleyDB persistence layer driven by the application's
dynamic schema, and another one altogether to implement a database
on top of a fixed BerkeleyDB schema.
For comparison purposes, consider OpenLDAP: it is a generic,
hierarchical, networked database implemented atop
BerkeleyDB. However, instead of having a fixed schema for storing
values, items, etc., in BerkeleyDB, its schema is dynamically
extended as attribute types and indexes are added. So the database is
*represented* in BerkeleyDB, rather than being implemented *inside* BerkeleyDB.
I think we disagree or misunderstand each other here. Or maybe I'm
simply not following you. While it's not relational, the Chandler
repository has to jump through the same hoops as OpenLDAP or MySQL
to store anything in Berkeley DB. Berkeley DB can only store
key/value pairs of byte strings in b-trees, hashes, queues, and a
fourth structure whose name escapes me at the moment.
But you don't create a new BerkeleyDB index every time another
attribute is indexed in Chandler, right? That's the difference. When
you add a new attribute or index in OpenLDAP, it in fact creates
separate BerkeleyDB-level files. And the same is true for MySQL as well.
In other words, OpenLDAP and MySQL express their dynamic schemas
using BerkeleyDB, rather than using a static BerkeleyDB schema as a
meta-level to express the dynamic schema. It could be compared to
the difference between interpretation and compilation; the
repository is in effect an "interpreted" BerkeleyDB app, where
OpenLDAP and MySQL are "compiled".
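The two strategies can be contrasted in a toy sketch -- plain Python dicts stand in for BerkeleyDB files here; this is an illustration of the idea, not the code of either system:

```python
# "Compiled" (OpenLDAP/MySQL style): adding an index creates a new
# native structure, one per index, so the storage engine's own lookup
# machinery does all the work.
compiled = {}

def compiled_add_index(name):
    compiled[name] = {}  # a separate native structure per index

compiled_add_index("displayName")
compiled["displayName"]["foo"] = "item-1"

# "Interpreted" (fixed-schema style): one static key/value store; the
# dynamic schema lives at a meta-level, encoded into composite keys
# that application code must itself decode on every access.
interpreted = {}

def interpreted_put(index_name, key, value):
    interpreted[("index", index_name, key)] = value

interpreted_put("displayName", "foo", "item-1")
```

In the first case each lookup is one probe into a purpose-built structure; in the second, every operation pays for the extra layer of key encoding and decoding.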
(By the way, ZODB once used a similar implementation strategy to the
repository for BerkeleyDB, and it had comparable performance
issues. It does a lot better with its logfile-based storage format,
which at least avoids adding in BerkeleyDB's paging overhead.)
I'm not sure what you mean by "hard compiled". Nothing stops us
from having a relational schema that's extensible by parcels, or
from doing so dynamically. In truth, the schemas we use with the
repository today are no less "hard compiled". If we at some future
time allow user-defined fields, there are still ways to represent
them within such a relatively-static schema, or to simply modify
the schema at runtime.
Once you've worked hard at extracting performance from your static
schema, so that queries and joins are not too massive, any extension
throws the whole effort into question over and over again.
We might ask our Cosmo brethren if they have found this to be the
case. Ideally, however, if you have a mapping layer like Hibernate
that lets you specify the physical model separately from the logical
model, then no application-layer code should change.
I think, however, that your concern about joins is unwarranted. In a
"table per class or annotation" mapping with lazy loading, new data
added by plugins does *not* affect query performance -- whereas, if
I understand correctly, it *does* with the repository. Queries that
are part of Chandler's static model are only going to display stuff
from base tables, with no need to do joins at all. For example, I
see no reason why the dashboard and calendar views of Chandler can't
be done with a single modest table with a few indexes. Any
extensions or annotations will be in separate tables, and the data in
them would only be needed when accessing the object in the detail
view, or doing other "full object" operations like sharing, dump/reload, etc.
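A minimal sketch of that layout, using sqlite3 as a stand-in RDBMS -- the table and column names are invented for illustration, not Chandler's schema. The dashboard query touches only the base table and its index; annotation rows live in their own table and are read only when the full item is wanted:

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    -- Base table: enough for dashboard/calendar display.
    CREATE TABLE item (uuid TEXT PRIMARY KEY, title TEXT, start_time TEXT);
    CREATE INDEX item_start ON item (start_time);
    -- Annotation data lives in its own table (one per annotation type).
    CREATE TABLE task_annotation (uuid TEXT PRIMARY KEY, done INTEGER);
""")
db.execute("INSERT INTO item VALUES ('a1', 'standup', '2007-10-09T09:00')")
db.execute("INSERT INTO task_annotation VALUES ('a1', 0)")

# Dashboard/calendar view: base table plus one index, no joins at all.
rows = db.execute(
    "SELECT uuid, title FROM item WHERE start_time >= ? ORDER BY start_time",
    ("2007-10-09",)).fetchall()

# Detail view: only now is the annotation table consulted (lazy load).
done = db.execute(
    "SELECT done FROM task_annotation WHERE uuid = ?", ("a1",)).fetchone()
```

A plugin adding its own annotation table changes neither the base table nor the dashboard query's cost.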
Yes, that's a tradeoff: it takes more time to load an individual
item in full, and less to access items in bulk. However,
proper separation of responsibilities to distinct interfaces can
often prevent the need to access individual items in the first
place. For example, if there's a mapping from any subset of the
tables to EIM records, you can generate the EIM records in bulk,
letting the RDBMS do most of the heavy lifting, rather than loading
items one at a time to generate their EIM.
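The bulk-generation idea amounts to one set-oriented query instead of a per-item loop; here is a sketch with a hypothetical schema (again sqlite3 standing in for the RDBMS, and the table names invented):

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE item (uuid TEXT PRIMARY KEY, title TEXT);
    CREATE TABLE event (uuid TEXT PRIMARY KEY, start_time TEXT);
    INSERT INTO item VALUES ('a1', 'standup'), ('a2', 'memo');
    INSERT INTO event VALUES ('a1', '2007-10-09T09:00');
""")

# One LEFT JOIN lets the RDBMS do the heavy lifting: every item comes
# back as a flat, EIM-like record, with NULL where an annotation row
# is absent -- no item objects are loaded one at a time.
records = db.execute("""
    SELECT i.uuid, i.title, e.start_time
    FROM item i LEFT JOIN event e ON e.uuid = i.uuid
    ORDER BY i.uuid
""").fetchall()
```

The per-item cost of the fuller physical split is paid only when it buys something; bulk paths stay set-oriented.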
Any plugin developer will have to understand this. This was the main
reason why we didn't choose this route five years ago. Maybe now we
don't care anymore about this aspect as much.
Five years ago, Python ORMs and embedded RDBMSs were neither so
numerous nor ubiquitous; it's not clear that this route *would* have
made sense back then.
Nowadays, if you have recent web development experience in Python,
odds are you've at least tried an ORM or two. And for simple
things, you usually don't write much SQL directly with these tools.
For example, in conversations I've had with Grant, he compared
Chandler with Mail.app and iCal.app, which have such static schemas
and can perform much better in their specific domains than the more
generic Chandler.
If that's the route we'd like to take Chandler to, fine. That should
be clearly stated.
I believe Katie has already stated it, even as far back as the
creation of the schema API. But I imagine she will clarify it again if needed.
I'm not exactly against it either, just a lot less excited about it.
It'd be a different product, albeit with a lot of the same visible
0.7/1.0 features of today, but a dead end nonetheless. Chandler would
only ever do what it's hardcoded to do (from a schema standpoint).
As with Grant, I don't understand what you mean. We "hardcode" these
things already, and were doing so before I even joined the project.
Perhaps you are referring to features the repository has, that the
application does not use? If so, which ones?
The last five years of work would be pretty much wasted, except for
their "what not to do" aspect :)
Not really; Chandler needed *some* form of persistence to get where
it is today, and the tools did not exist back then. ZODB was really
the only reasonable Python competitor at the time, IIRC.
Of course, the actual application direction had certainly changed by
2004 or early 2005, and I have been bringing up these points
intermittently ever since. So if we'd refactored sooner, we could
potentially have wasted less work in the interim. At the same time,
the application is a lot different today than in 2004 -- so it's not
clear we could have avoided wasting some work, somewhere. Preview is
an important milestone because we're saying that, "this is pretty
much where we're going", so now is a good time to consider the best
choices for the requirements.
It may not be as interesting with respect to the repository, but
there are still plenty of interesting development opportunities in
and around Chandler, as you've shown with JCC -- which I think is
cool and wish *I* had time to play with it. :)
5. implementing generic indexes inside of generic indexes
How so ? What are you thinking about ?
The skip list system is the main one I have in mind, but if I
correctly understand how versions and values are stored, then those
would be included too.
Yes, a skip list implements the structure behind repository indexes.
What are the "generic indexes" that skip lists are implemented in
that you're talking about ?
I mean that by implementing a skiplist *inside* of BerkeleyDB rather
than using a native BerkeleyDB structure, we're adding an
"interpretation" layer there.
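A toy contrast of the two approaches -- this is an illustration of the overhead being described, not the repository's actual skip-list code:

```python
import bisect

# "Native": the storage engine's own ordered structure keeps keys
# sorted; a Python list plus bisect stands in for a BerkeleyDB btree.
native = []
for key in [30, 10, 20]:
    bisect.insort(native, key)

# "Interpreted": an ordered list whose links are themselves key/value
# records inside the store, so every traversal step is another store
# lookup -- roughly the extra layer a skip list built on top of
# BerkeleyDB records adds.
store = {"head": 10, 10: 20, 20: 30, 30: None}

def in_order(store):
    out, node = [], store["head"]
    while node is not None:
        out.append(node)
        node = store[node]  # each hop is a separate store access
    return out
```

Both yield the same ordering, but the second pays a store round-trip per link, on top of whatever the store itself does internally.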
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
Open Source Applications Foundation "chandler-dev" mailing list
http://lists.osafoundation.org/mailman/listinfo/chandler-dev