On Tue, 9 Oct 2007, Phillip J. Eby wrote:
At 04:12 PM 10/9/2007 -0700, Andi Vajda wrote:
On Tue, 9 Oct 2007, Phillip J. Eby wrote:
1. application-level code meddling in storage-level details
Could you give some examples?
Any place where the application is creating collections or working with
indexes in order to achieve acceptable performance, as compared to "naive"
iteration or queries.
I see. Creating a collection is like creating a query. In the relational
world, do you propose that the app not write queries?
On indexes I see your point, I think. An index is a query's cache, not
something one would want to expose to the app. Funnily enough, indexes only
_became_ that later: they started as a way to access collections by row number
for displaying in the UI. Later they became a query's cache (when we
implemented abstract sets and collections), and then they were also used as a
way of persisting sort order.
In any case, I don't see how this is different in a relational model.
Once you work on extracting performance from a relational app, you end up
writing hardcoded queries that embed very specific app knowledge.
But maybe this thread isn't about relational vs object - though I'm afraid it
is - but rather about better app layering?
2. lack of sufficient domain-specific query APIs
Again, could you please give an example of what you'd like?
This isn't a repository problem - it's a domain-layer problem. If the places
where we're doing #1 were at least consolidated to single points of
reference, #1 wouldn't be so bad.
I think the app has done a pretty good job at moving a lot of the index
maintenance code to a specific area. I'm thinking of the dashboard indexes
here.
3. no indirection between the application's logical schema and its
physical storage schema
Seems incorrect. I can change the physical storage schema (core schema or
even repo format) without affecting app code. Or am I misunderstanding
something?
Sorry, I am using the relational meaning of logical and physical. A logical
schema does not include indexes or views, while a physical schema does. I'm
also extending this to refer to the lack of distinction between our preferred
form of data as encapsulated objects, versus the best divisions of data from
a performance point of view.
In Chandler we've long had the distinction between capital-'I' Items and
lowercase-'i' items. This distinction has materialized most clearly in the
dump/reload/eim work, which is a way to export 'I' Items, whereas the
repository deals with 'i' items. Isn't this equivalent to what you're talking
about?
As for indexes, yes, you're correct: they're not part of the logical schema.
They're performance implementation details chosen by the app, just like in a
relational app, where the app ultimately has to know about table layout, keys,
and indexes, and put kludges into stored procedures, to make queries
efficient.
The core schema and repo format aren't a factor in this, as they're at an
even lower level than the "physical" schema I'm talking about. In the
repository today, the "physical" schema consists of whatever sets/collections
and indexes you create, which is rather analogous to creating indexes or
materialized views in an RDBMS, only without the same transparency. In an
RDBMS, if you add an index or a materialized view, it doesn't change how you
retrieve your data: it just goes faster. So you can do application specific
tuning without changing your application.
Same with the repository: it just goes faster. You don't have to change the
way you access your data once you've created indexes, except for random
row-number-based access, for which I didn't dare write the iterating APIs. But
if you look in the collection code, for iteration, membership tests, etc., it
takes the slow route if it can't find an index and the fast route if it can.
No need to change the access code at the app level to take advantage of the
indexes.
A repository index is a materialized view of a collection in relational terms.
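The transparency point can be illustrated with a minimal relational sketch
(using Python's stdlib sqlite3 module and a made-up "items" table, not
Chandler's actual storage): adding an index is purely a "physical" schema
change, and the application's query code is identical before and after.

```python
import sqlite3

# In-memory database with a hypothetical "items" table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE items (uuid TEXT, displayName TEXT)")
conn.executemany("INSERT INTO items VALUES (?, ?)",
                 [(str(i), "item-%d" % i) for i in range(1000)])

query = "SELECT uuid FROM items WHERE displayName = ?"

# Before indexing: the query works, via a full table scan.
before = conn.execute(query, ("item-42",)).fetchall()

# Tuning step: add an index. A "physical" schema change only.
conn.execute("CREATE INDEX items_by_name ON items (displayName)")

# The application code is unchanged; the same query now uses the index.
after = conn.execute(query, ("item-42",)).fetchall()

assert before == after == [("42",)]
```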
4. implementing a generic database inside another generic database
That was the goal, originally.
Not quite; having a generic database was the goal, not that it be implemented
*inside* another generic database. It is one thing to have a BerkeleyDB
persistence layer driven by the application's dynamic schema, and another one
altogether to implement a database on top of a fixed BerkeleyDB schema.
For comparison purposes, consider OpenLDAP: it is a generic, hierarchical,
networked database implemented atop BerkeleyDB. However, instead of having a
fixed schema for storing values, items, etc., in BerkeleyDB, it is
dynamically extended as attribute types and indexes are added. So the
database is *represented* in BerkeleyDB, rather than being implemented
*inside* BerkeleyDB.
I think we disagree or misunderstand each other here, or maybe I'm simply not
following you. While it's not relational, the Chandler repository has to jump
through the same hoops as OpenLDAP or MySQL to store anything in Berkeley DB.
Berkeley DB can only store key/value pairs of byte strings in b-trees, hashes,
queues, and a fourth structure whose name escapes me at the moment.
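In Python terms, that raw interface looks roughly like the stdlib dbm module
(used here only as a stand-in for Berkeley DB, since both expose a byte-string
key/value store; the "item:..." key scheme is invented for illustration):
anything richer - items, typed attributes, indexes - has to be flattened into
byte strings by the layer above.

```python
import dbm.dumb  # stdlib stand-in for a Berkeley DB byte-string store
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "store")
db = dbm.dumb.open(path, "c")

# The storage layer sees only opaque byte strings; the notion of an
# "item" with typed attributes must be serialized by the layer above.
db[b"item:42:displayName"] = b"Welcome note"
db[b"item:42:kind"] = b"Note"

assert db[b"item:42:displayName"] == b"Welcome note"
db.close()
```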
So, when I say it is implemented "inside" another database, I mean it in the
sense that the schema of the repository is not reflected in the schema of its
back-end storage, and thus cannot fully exploit the back-end's features for
maximum performance.
Can you give a specific example that would help me understand what you mean?
I'm not sure what you mean by "hard compiled". Nothing stops us from having
a relational schema that's extensible by parcels, or from doing so
dynamically. In truth, the schemas we use with the repository today are no
less "hard compiled". If we at some future time allow user-defined fields,
there are still ways to represent them within such a relatively-static
schema, or to simply modify the schema at runtime.
Once you've worked hard at extracting performance from your static schema, so
that queries and joins are not too massive, any extension throws the whole
effort into question all over again. Any plugin developer will have to
understand this. This was the main reason why we didn't choose this route five
years ago. Maybe now we no longer care as much about this aspect.
For example, in conversations I've had with Grant, he compared Chandler with
Mail.app and iCal.app, which have such static schemas and can perform much
better in their specific domains than the more generic Chandler.
If that's the route we'd like to take Chandler to, fine. That should be
clearly stated. I'm not exactly against it either, just a lot less excited
about it.
It'd be a different product, albeit one with a lot of the same visible 0.7/1.0
features of today, but a dead end nonetheless: Chandler would only ever do
what it's hardcoded to do (from a schema standpoint).
The last five years of work would be pretty much wasted, except for their
"what not to do" aspect :)
5. implementing generic indexes inside of generic indexes
How so? What are you thinking about?
The skip list system is the main one I have in mind, but if I correctly
understand how versions and values are stored, then those would be included
too.
Yes, a skip list implements the structure behind repository indexes. What are
the 'generic indexes' that you're saying skip lists are implemented inside?
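For readers unfamiliar with the structure under discussion: a skip list keeps
sorted keys reachable through layered "express lanes", giving O(log n)
expected search and insert. A minimal in-memory sketch follows; it is not the
repository's actual implementation, which persists its nodes in the backing
store.

```python
import random

class Node:
    def __init__(self, key, level):
        self.key = key
        self.forward = [None] * (level + 1)  # one pointer per level

class SkipList:
    MAX_LEVEL = 8
    P = 0.5  # probability of promoting a node one level up

    def __init__(self):
        self.head = Node(None, self.MAX_LEVEL)
        self.level = 0

    def _random_level(self):
        lvl = 0
        while random.random() < self.P and lvl < self.MAX_LEVEL:
            lvl += 1
        return lvl

    def insert(self, key):
        # Find the rightmost node before `key` at every level.
        update = [None] * (self.MAX_LEVEL + 1)
        node = self.head
        for i in range(self.level, -1, -1):
            while node.forward[i] and node.forward[i].key < key:
                node = node.forward[i]
            update[i] = node
        lvl = self._random_level()
        if lvl > self.level:
            for i in range(self.level + 1, lvl + 1):
                update[i] = self.head
            self.level = lvl
        new = Node(key, lvl)
        for i in range(lvl + 1):
            new.forward[i] = update[i].forward[i]
            update[i].forward[i] = new

    def __contains__(self, key):
        # Descend from the top level, moving right while keys are smaller.
        node = self.head
        for i in range(self.level, -1, -1):
            while node.forward[i] and node.forward[i].key < key:
                node = node.forward[i]
        node = node.forward[0]
        return node is not None and node.key == key
```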
Andi..
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
Open Source Applications Foundation "chandler-dev" mailing list
http://lists.osafoundation.org/mailman/listinfo/chandler-dev