Hi, [ I'm posting to both -user and -dev because IMO it's relevant to both ]

I'm in the middle of a trial port of a system from postgresql to couchdb. The system has a distribution/replication requirement due primarily to unreliable connectivity and limited end-to-end bandwidth (and high latency) over a geographically dispersed set of users.

I have two requirements that don't fit well with the basic couchdb model. I've developed solutions for which I'd appreciate some feedback. Also it's a request for the notification mechanism to be improved, something like _external to moved to the core, and/or an in- process erlang plugin model that could replace notification/_external- based techniques.

My solution is a generalization of the GeoCouch technique: http://vmx.cx/cgi-bin/blog/index.cgi/geocouch-geospatial-queries-with-couchdb , and it allows me to mix SQL and couchdb access, all with the couchdb model and without breaking replication.

Requirement 1: I have a User/Role/Permission model for security. Users have roles and roles have permissions, and the primary (and very common) operation is to check for a transitive (user, permission) pair, which involves a join. It would be wrong to invert the user-role relationship and put both a list of users and a list of permissions into the role document because it is the user that joins roles and the maintenance of the user-role relationship is done when editing the user.

Requirement 2: I have a metadata search mechanism that allows the user to construct arbitrary queries (in the problem domain, not SQL). The transformation of the model from relational to document-based has eliminated any joins, but the problem isn't amenable to the oft- described key-based virtual joins, and I need to avoid retrieving potentially large sets of results and doing set operations in the client because it doesn't scale.

My solution is to use the notification mechanism to maintain (multiple) SQLite databases containing the document keys I need to search over. In the URP example, I store (db, src, dest) records. I also store the last seqno, so that I can do incremental updates to SQLite.

Then, using _external, I can make queries of SQLite to get me either (user, permission) pairs using a self-join, or in the case of the arbitrary metadata queries, a list of document ids.

The primary difficulties I have with this simple model are:

1) The notification mechanism doesn't give me enough information. Currently I have to do a _all_docs_by_seq and check for deletions by attempting to get each document, which I have to do for every document in any case (unless I use transparent id's) to determine if I'm interested in it, and then get the data. I presume this technique works because deletion is actually a form of update until compaction (I copied it from GeoCouch).

** SUGGESTION ** I would prefer an update interface that gave me a stream of (old, new) document pairs, which covers add/update/delete, plus a (from, to) seqno pair. Have a 'from' seqno lets me know when I have to trigger a full rescan, which I need to do in a variety of circumstances such as configuration change.

2) The process is not as performant as it could be because of the JSON marshalling.

** SUGGESTION ** I would prefer to write the notification process in erlang, and be passed the internal representation of the pre/post documents to avoid marshalling. Same for map/reduce. I know I can modify couchdb explicitly to do this, but that commits me to git tracking/merging etc. A in-process plugin option would be better.

3) Allowing the notification and _external handlers to communicate is more work than it should be. I want to do this so that the _external handler can cache the queries and invalidate the cache in response to updates.

As an alternative, for the URP data, I could use memcached and let the notification process invalidate the cache, which would mean that the _external handler doesn't have to cache at all because the client would only hit it on a memcached miss. The ultimate extension of this would be to have no _external handler, and have the notification process not use SQLite at all and instead let the client maintain the data in memcached using normal couchdb requests to calculate the transitive closure, which would be invalidated by the notification process. Unless the client passivates this data to disk it would mean recreating it (on-demand, incrementally) each time the client process starts. This is a difficult problem with multiple clients (from a development coordination perspective), so maybe an _external process is still required, even if it doesn't use on-disk passivation.

That solution doesn't work for the generalized metadata queries, but that's not a problem.

** SUGGESTION ** I think the _external mechanism (or something better) should be integrated as a matter of priority. IMO this is a key mechanism for augmenting the map/reduce model to satisfy more use-cases.

** SUGGESTION ** A plugin mechanism should allow a plugin to both listen to notifications and respond to e.g. _external.

Antony Blakey
--------------------------
CTO, Linkuistics Pty Ltd
Ph: 0438 840 787

Human beings, who are almost unique in having the ability to learn from the experience of others, are also remarkable for their apparent disinclination to do so.
  -- Douglas Adams


Reply via email to