Hi Randall, cool!  I can chime in on a couple of the questions ...

On Mar 29, 2009, at 8:59 PM, Randall Leeds wrote:

1) What's required to make CouchDB a full OTP application? Isn't it using gen_server already?

Yes, in fact CouchDB is already an OTP application using supervisors, gen_servers, and gen_events. There are situations in which it could do a better job of adhering to OTP principles, and it could probably also use some refactoring to make the partitioning code fit in easily.
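Just to sketch the shape I mean (with made-up module names, not the actual couch_* ones): a top-level supervisor owns a handful of gen_servers and a gen_event manager, and the partitioning code would ideally slot in as more children of the same tree.

%% Hypothetical sketch of the OTP structure; db_server is a stand-in
%% for a gen_server, not a real CouchDB module.
-module(couchdb_like_sup).
-behaviour(supervisor).
-export([start_link/0, init/1]).

start_link() ->
    supervisor:start_link({local, ?MODULE}, ?MODULE, []).

init([]) ->
    %% one_for_one: restart a crashed child without touching its siblings
    Server = {db_server, {db_server, start_link, []},
              permanent, 5000, worker, [db_server]},
    Events = {db_events, {gen_event, start_link, [{local, db_events}]},
              permanent, 5000, worker, dynamic},
    {ok, {{one_for_one, 10, 3600}, [Server, Events]}}.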

2) What about _all_docs and seq-num?

I presume _all_docs gets merged like any other view. _all_docs_by_seq is a different story. In the current code the sequence number is incremented by one for every update. If we want to preserve that behavior in partitioned databases we need some sort of consensus algorithm or master server to choose the next sequence number. It could easily turn into a bottleneck or single point of failure if we're not careful.

The alternatives are a) to abandon the current format for update sequences in favor of vector clocks or something more opaque, or b) to make _all_docs_by_seq a strictly node-local query. I'd prefer the former, as I think it will be useful for external indexers and the like to treat a partitioned database just like a single-server one. If we go the latter route, I think it means external indexers either have to be installed on every node or at least have to be aware of all the partitions.
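To make option a) a bit more concrete, here's the sort of thing I mean by a vector clock: one counter per node, bumped when that node applies an update. This is purely a sketch; nothing like it exists in the code today.

%% Sketch of an update sequence as a vector clock, one counter per node.
-module(vclock_sketch).
-export([new/0, increment/2, descends/2]).

new() -> [].

%% Bump this node's counter after it applies an update.
increment(Node, Clock) ->
    Count = proplists:get_value(Node, Clock, 0),
    lists:keystore(Node, 1, Clock, {Node, Count + 1}).

%% true if ClockA reflects every update that ClockB reflects.
descends(ClockA, ClockB) ->
    lists:all(fun({Node, CountB}) ->
                  proplists:get_value(Node, ClockA, 0) >= CountB
              end, ClockB).

An external indexer could then ask "give me everything my clock doesn't descend from" instead of "everything after sequence N", without caring how many partitions sit behind the database.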

One other thing that bothers me is the merge-sort required for every view lookup. In *really* large clusters it won't be good if queries for a single key in a view have to hit every partition. We could have an alternative structure where each view gets partitioned much like the document data while it's built. I worry that a view partitioned in this way may need frequent rebalancing during the build, since view keys are probably not going to be uniformly distributed. Nevertheless, I think the benefit of having many view queries hit only a small subset of nodes in the cluster is pretty huge.
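To spell out the merge I'm worried about: the coordinating node ends up doing something like this for every query, even a single-key lookup. Sketch only — rows are assumed to be {Key, Value} tuples, and real code would use the view collation order rather than Erlang term order.

%% Each partition returns its view rows already sorted by key;
%% the coordinator merges the sorted lists pairwise.
-module(view_merge_sketch).
-export([merge/1]).

merge(RowLists) ->
    lists:foldl(fun(Rows, Acc) ->
                    lists:merge(fun({KeyA, _}, {KeyB, _}) -> KeyA =< KeyB end,
                                Rows, Acc)
                end, [], RowLists).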

3) Can we agree on a proposed solution to the layout of partition nodes? I like the tree solution, as long as it is extremely flexible wrt tree depth.

I'm not sure we're ready to do that. In fact, I think we may need to implement a couple of different topologies and profile them to see what works best. The tree topology is an interesting idea, but it may turn out that passing view results up the tree is slower than just sending them directly to the final destination and having that server do the rest of the work. Off the cuff, I think trees may be a great choice for computationally intensive reduce functions, but views where the size of the data is large relative to the computation may be better off minimizing the number of copies of the data that need to be made.
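Roughly what I have in mind for the reduce case: each level rereduces its children's partial results, so no node ever touches more than its children's outputs. Again just a sketch, with a made-up ReduceFun(Values, Rereduce) shape.

%% A node in the tree is either a leaf holding view values or an inner
%% node holding child subtrees.
-module(tree_reduce_sketch).
-export([reduce_tree/2]).

%% Example ReduceFun: fun(Vs, _Rereduce) -> lists:sum(Vs) end.
reduce_tree(ReduceFun, {leaf, Values}) ->
    ReduceFun(Values, false);
reduce_tree(ReduceFun, {inner, Children}) ->
    Partials = [reduce_tree(ReduceFun, Child) || Child <- Children],
    ReduceFun(Partials, true).

For a sum or a count, each inner node ships a single number up the tree; for a map-only view the same tree mostly just adds extra copies of the rows, which is the case I'd want to measure.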

4) Should the consistent hashing algorithm map ids to leaf nodes or just to children? I lean toward children because it encapsulates knowledge about the layout of subtrees at each tree level.

If the algorithm maps to children, does that mean every document lookup has to traverse the tree? I'm not sure that's a great idea. Up to ~100 nodes I think it may be better to have all document lookups take O(1) hops. I think distributed Erlang can keep a system of that size globally connected without too much trouble.
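For comparison, mapping ids straight to the full set of nodes is about this much code, and a lookup is a single hop. Sketch only: erlang:phash2 stands in for whatever hash we'd actually use, the ring would be cached rather than rebuilt per call, and replication and rebalancing are ignored.

%% Map a doc id directly to one of N globally known nodes.
-module(ring_sketch).
-export([node_for/2]).

node_for(DocId, Nodes) ->
    Hash = erlang:phash2(DocId),
    %% Place nodes on a ring by hashing their names.
    Ring = lists:keysort(1, [{erlang:phash2(N), N} || N <- Nodes]),
    %% Pick the first node at or past the doc's position, wrapping around.
    case [Node || {Pos, Node} <- Ring, Pos >= Hash] of
        [First | _] -> First;
        []          -> element(2, hd(Ring))
    end.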

Cheers, Adam
