# The problem

It's a fairly common "complaint" that CouchDB's database model does not support 
fine-grained control over reads. The canonical solution is a database per user:
http://wiki.apache.org/couchdb/PerDocumentAuthorization#Database_per_user
http://stackoverflow.com/a/4731514/179583

This does not scale.

1. It complicates formerly simple backup/redundancy: now I need to make sure N 
replications stay working and N databases have correct permissions, instead of 
just one "main" database. Okay, write some scripts, deploy some cronjobs, it 
can be made to work (a sketch of the per-user bookkeeping follows this list)...

2. ...however, if data needs to be shared between users, this model *completely 
falls apart*. Bi-directional continuous filtered replication between a "hub" 
and each user database is extremely resource intensive.
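
To make that per-user bookkeeping concrete, here's a rough sketch of what 
provisioning one user looks like against CouchDB's plain HTTP API (Python + 
requests). The hub database name, the `userdb-` naming scheme, the admin 
credentials, and the `doc.owner`/`auth/per_user` filter convention are all 
placeholders for whatever an app actually uses, not anything CouchDB prescribes:

```python
import requests

COUCH = "http://admin:secret@localhost:5984"   # assumed admin credentials
HUB = "hub"                                    # shared "master" database (illustrative name)

def create_hub_filter():
    """One-time: a JS filter in the hub deciding which docs a given user may see.
    The doc.owner convention is a stand-in for a real authorization rule."""
    requests.put(f"{COUCH}/{HUB}/_design/auth", json={
        "filters": {
            "per_user": "function(doc, req) { return doc.owner === req.query.user; }"
        }
    })

def provision_user(username):
    """Everything that has to stay correct *per user* under the db-per-user pattern."""
    user_db = f"userdb-{username}"             # naming scheme is an assumption

    # 1. Create the user's database.
    requests.put(f"{COUCH}/{user_db}")

    # 2. Lock it down so only that user (and server admins) can read it.
    requests.put(f"{COUCH}/{user_db}/_security", json={
        "admins":  {"names": [], "roles": []},
        "members": {"names": [username], "roles": []},
    })

    # 3. Continuous *filtered* replication, hub -> user database.
    requests.post(f"{COUCH}/_replicator", json={
        "_id": f"{HUB}-to-{user_db}",
        "source": f"{COUCH}/{HUB}",
        "target": f"{COUCH}/{user_db}",
        "continuous": True,
        "filter": "auth/per_user",
        "query_params": {"user": username},
    })

    # 4. Continuous replication, user database -> hub, so shared edits flow back.
    requests.post(f"{COUCH}/_replicator", json={
        "_id": f"{user_db}-to-{HUB}",
        "source": f"{COUCH}/{user_db}",
        "target": f"{COUCH}/{HUB}",
        "continuous": True,
    })
```

Multiply steps 1-4 by every user, keep all those _replicator documents alive 
through restarts and migrations, and that's the N-way bookkeeping from point 1; 
the two continuous replications per user are where the overhead in point 2 
comes from.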

I naïvely followed the Best Practices and ended up with a system that can 
barely support 100 users per machine due to replication overhead. Now if I 
want to keep doing it "The Right Way", the best I can do is cobble together 
some sort of rolling replication hack.

It's apparent the real answer for CouchDB security, right now, is to hide the 
database underneath some middleware boilerplate crap running as DB root. This 
is a well-explored pattern, by which I mean the database ends up with as many 
entry points as a sewer system has grates.


# An improvement?

What if CouchDB let you define virtual databases that shared the underlying 
document data where possible, updated incrementally (when queried) rather 
than continuously, and could even be implemented internally in a fanout 
fashion?

- virtual databases would basically be part of the internal b-tree key 
hierarchy, sort of like multiple root nodes sharing the branches as much as 
possible
- sharing the underlying document data would almost halve the amount of disk 
needed versus a "master" database storing all the data, which is then copied 
out to each user's database
- updating incrementally would put less continuous memory pressure on the system
- haven't actually done the maths, so I may be missing something, but wouldn't 
fanning out changes internally from a master database through intermediate 
partitions reduce the processing load?

Basically: rather than copying a document into a master database each time a 
user updates it, and then filtering every one of M updates through N instances 
of couchjs, CouchDB could internally build a tree of combined filters. Say the 
master database filters changes into log(N) hidden partitions at the first 
level, and accepted changes trickle through only the relevant further layers. 
(In a way this is at odds with the incremental, query-time nature above; maybe 
it does make sense to pay an amortized cost on writes rather than on reads.)
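
To put rough numbers behind the "haven't actually done the maths" caveat, 
here's a tiny back-of-envelope sketch (plain Python, made-up counts) comparing 
filter evaluations for the flat scheme versus a hypothetical fanout tree, under 
the optimistic assumption that each change is relevant to only one user:

```python
import math

N = 10_000       # user databases (illustrative)
M = 1_000_000    # document updates flowing through the hub (illustrative)
b = 16           # branching factor of the hypothetical filter tree

# Flat scheme: every update gets run through every user's filter.
flat_evaluations = M * N

# Tree scheme: an update only descends into the branches it matches, so in the
# best case it is tested against b child filters per level, over log_b(N) levels.
tree_evaluations = M * b * math.ceil(math.log(N, b))

print(f"flat: {flat_evaluations:,} filter evaluations")  # 10,000,000,000
print(f"tree: {tree_evaluations:,} filter evaluations")  # 64,000,000
```

The real win obviously depends on how cleanly changes partition (widely shared 
documents would fan out into many branches), which is exactly the part I 
haven't worked out.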


# The urgency

Maybe this *particular* solution isn't really a solution, but we need one:

If replicating amongst per-user databases is the only correct way to implement 
document-level read permissions, CouchDB **NEEDS** built-in support for a 
scalable way of doing so.

There are plenty of other feature requests I could troll the list with 
regarding CouchApps. But this one is key; everything else I've been able to 
work around with a little reverse proxy here and an external process there. 
Without scalable read-level security, I see no particular raison d'être for 
Apache CouchDB: if CouchDB can't support direct HTTP access in production in 
general, then it's just another centralized database.


thanks,
-natevw
