Hi, [ I'm posting to both -user and -dev because IMO it's relevant to
both ]
I'm in the middle of a trial port of a system from postgresql to
couchdb. The system has a distribution/replication requirement due
primarily to unreliable connectivity and limited end-to-end bandwidth
(and high latency) over a geographically dispersed set of users.
I have two requirements that don't fit well with the basic couchdb
model. I've developed solutions for which I'd appreciate some
feedback. Also it's a request for the notification mechanism to be
improved, something like _external to moved to the core, and/or an in-
process erlang plugin model that could replace notification/_external-
based techniques.
My solution is a generalization of the GeoCouch technique: http://vmx.cx/cgi-bin/blog/index.cgi/geocouch-geospatial-queries-with-couchdb
, and it allows me to mix SQL and couchdb access, all with the couchdb
model and without breaking replication.
Requirement 1: I have a User/Role/Permission model for security. Users
have roles and roles have permissions, and the primary (and very
common) operation is to check for a transitive (user, permission)
pair, which involves a join. It would be wrong to invert the user-role
relationship and put both a list of users and a list of permissions
into the role document because it is the user that joins roles and the
maintenance of the user-role relationship is done when editing the user.
Requirement 2: I have a metadata search mechanism that allows the user
to construct arbitrary queries (in the problem domain, not SQL). The
transformation of the model from relational to document-based has
eliminated any joins, but the problem isn't amenable to the oft-
described key-based virtual joins, and I need to avoid retrieving
potentially large sets of results and doing set operations in the
client because it doesn't scale.
My solution is to use the notification mechanism to maintain
(multiple) SQLite databases containing the document keys I need to
search over. In the URP example, I store (db, src, dest) records. I
also store the last seqno, so that I can do incremental updates to
SQLite.
Then, using _external, I can make queries of SQLite to get me either
(user, permission) pairs using a self-join, or in the case of the
arbitrary metadata queries, a list of document ids.
The primary difficulties I have with this simple model are:
1) The notification mechanism doesn't give me enough information.
Currently I have to do a _all_docs_by_seq and check for deletions by
attempting to get each document, which I have to do for every document
in any case (unless I use transparent id's) to determine if I'm
interested in it, and then get the data. I presume this technique
works because deletion is actually a form of update until compaction
(I copied it from GeoCouch).
** SUGGESTION ** I would prefer an update interface that gave me a
stream of (old, new) document pairs, which covers add/update/delete,
plus a (from, to) seqno pair. Have a 'from' seqno lets me know when I
have to trigger a full rescan, which I need to do in a variety of
circumstances such as configuration change.
2) The process is not as performant as it could be because of the JSON
marshalling.
** SUGGESTION ** I would prefer to write the notification process in
erlang, and be passed the internal representation of the pre/post
documents to avoid marshalling. Same for map/reduce. I know I can
modify couchdb explicitly to do this, but that commits me to git
tracking/merging etc. A in-process plugin option would be better.
3) Allowing the notification and _external handlers to communicate is
more work than it should be. I want to do this so that the _external
handler can cache the queries and invalidate the cache in response to
updates.
As an alternative, for the URP data, I could use memcached and let the
notification process invalidate the cache, which would mean that the
_external handler doesn't have to cache at all because the client
would only hit it on a memcached miss. The ultimate extension of this
would be to have no _external handler, and have the notification
process not use SQLite at all and instead let the client maintain the
data in memcached using normal couchdb requests to calculate the
transitive closure, which would be invalidated by the notification
process. Unless the client passivates this data to disk it would mean
recreating it (on-demand, incrementally) each time the client process
starts. This is a difficult problem with multiple clients (from a
development coordination perspective), so maybe an _external process
is still required, even if it doesn't use on-disk passivation.
That solution doesn't work for the generalized metadata queries, but
that's not a problem.
** SUGGESTION ** I think the _external mechanism (or something better)
should be integrated as a matter of priority. IMO this is a key
mechanism for augmenting the map/reduce model to satisfy more use-cases.
** SUGGESTION ** A plugin mechanism should allow a plugin to both
listen to notifications and respond to e.g. _external.
Antony Blakey
--------------------------
CTO, Linkuistics Pty Ltd
Ph: 0438 840 787
Human beings, who are almost unique in having the ability to learn
from the experience of others, are also remarkable for their apparent
disinclination to do so.
-- Douglas Adams