Hi Randall, cool!  I can chime in on a couple of the questions ...

On Mar 29, 2009, at 8:59 PM, Randall Leeds wrote:

1) What's required to make CouchDB a full OTP application? Isn't it using gen_server already?

Yes, in fact CouchDB is already an OTP application using supervisors, gen_servers, and gen_events. There are situations in which it could do a better job of adhering to OTP principles, and it could probably also use some refactoring to make the partitioning code fit in easily.
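Just to sketch the shape I mean (with made-up module names, not the actual couch_* ones): a top-level supervisor owns a handful of gen_servers and a gen_event manager, and the partitioning code would ideally slot in as more children of the same tree.

%% Hypothetical sketch of the OTP structure; db_server is a stand-in
%% for a gen_server, not a real CouchDB module.
-module(couchdb_like_sup).
-behaviour(supervisor).
-export([start_link/0, init/1]).

start_link() ->
    supervisor:start_link({local, ?MODULE}, ?MODULE, []).

init([]) ->
    %% one_for_one: restart a crashed child without touching its siblings
    Server = {db_server, {db_server, start_link, []},
              permanent, 5000, worker, [db_server]},
    Events = {db_events, {gen_event, start_link, [{local, db_events}]},
              permanent, 5000, worker, dynamic},
    {ok, {{one_for_one, 10, 3600}, [Server, Events]}}.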

2) What about _all_docs and seq-num?

I presume _all_docs gets merged like any other view. _all_docs_by_seq is a different story. In the current code the sequence number is incremented by one for every update. If we want to preserve that behavior in partitioned databases we need some sort of consensus algorithm or master server to choose the next sequence number. It could easily turn into a bottleneck or single point of failure if we're not careful.

The alternatives are a) to abandon the current format for update sequences in favor of vector clocks or something more opaque, or b) to make _all_docs_by_seq a strictly node-local query. I'd prefer the former, as I think it will be useful for external indexers and the like to treat a partitioned database just like a single-server one. If we go the latter route, I think it means external indexers either have to be installed on every node or at least have to be aware of all the partitions.
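To make option a) a bit more concrete, here's the sort of thing I mean by a vector clock: one counter per node, bumped when that node applies an update. This is purely a sketch; nothing like it exists in the code today.

%% Sketch of an update sequence as a vector clock, one counter per node.
-module(vclock_sketch).
-export([new/0, increment/2, descends/2]).

new() -> [].

%% Bump this node's counter after it applies an update.
increment(Node, Clock) ->
    Count = proplists:get_value(Node, Clock, 0),
    lists:keystore(Node, 1, Clock, {Node, Count + 1}).

%% true if ClockA reflects every update that ClockB reflects.
descends(ClockA, ClockB) ->
    lists:all(fun({Node, CountB}) ->
                  proplists:get_value(Node, ClockA, 0) >= CountB
              end, ClockB).

An external indexer could then ask "give me everything my clock doesn't descend from" instead of "everything after sequence N", without caring how many partitions sit behind the database.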

One other thing that bothers me is the merge-sort required for every view lookup. In *really* large clusters it won't be good if queries for a single key in a view have to hit every partition. We could have an alternative structure where each view gets partitioned much like the document data while it's built. I worry that a view partitioned in this way may need frequent rebalancing during the build, since view keys are probably not going to be uniformly distributed. Nevertheless, I think the benefit of having many view queries hit only a small subset of nodes in the cluster is pretty huge.
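To spell out the merge I'm worried about: the coordinating node ends up doing something like this for every query, even a single-key lookup. Sketch only — rows are assumed to be {Key, Value} tuples, and real code would use the view collation order rather than Erlang term order.

%% Each partition returns its view rows already sorted by key;
%% the coordinator merges the sorted lists pairwise.
-module(view_merge_sketch).
-export([merge/1]).

merge(RowLists) ->
    lists:foldl(fun(Rows, Acc) ->
                    lists:merge(fun({KeyA, _}, {KeyB, _}) -> KeyA =< KeyB end,
                                Rows, Acc)
                end, [], RowLists).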

3) Can we agree on a proposed solution to the layout of partition nodes? I like the tree solution, as long as it is extremely flexible wrt tree depth.

I'm not sure we're ready to do that. In fact, I think we may need to implement a couple of different topologies and profile them to see what works best. The tree topology is an interesting idea, but it may turn out that passing view results up the tree is slower than just sending them directly to the final destination and having that server do the rest of the work. Off the cuff, I think trees may be a great choice for computationally intensive reduce functions, but views where the size of the data is large relative to the computation may be better off minimizing the number of copies of the data that need to be made.
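Roughly what I have in mind for the reduce case: each level rereduces its children's partial results, so no node ever touches more than its children's outputs. Again just a sketch, with a made-up ReduceFun(Values, Rereduce) shape.

%% A node in the tree is either a leaf holding view values or an inner
%% node holding child subtrees.
-module(tree_reduce_sketch).
-export([reduce_tree/2]).

%% Example ReduceFun: fun(Vs, _Rereduce) -> lists:sum(Vs) end.
reduce_tree(ReduceFun, {leaf, Values}) ->
    ReduceFun(Values, false);
reduce_tree(ReduceFun, {inner, Children}) ->
    Partials = [reduce_tree(ReduceFun, Child) || Child <- Children],
    ReduceFun(Partials, true).

For a sum or a count, each inner node ships a single number up the tree; for a map-only view the same tree mostly just adds extra copies of the rows, which is the case I'd want to measure.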

4) Should the consistent hashing algorithm map ids to leaf nodes or just to children? I lean toward children because it encapsulates knowledge about the layout of subtrees at each tree level.

If the algorithm maps to children, does that mean every document lookup has to traverse the tree? I'm not sure that's a great idea. Up to ~100 nodes I think it may be better to have all document lookups take O(1) hops. I think distributed Erlang can keep a system of that size globally connected without too much trouble.
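For comparison, mapping ids straight to the full set of nodes is about this much code, and a lookup is a single hop. Sketch only: erlang:phash2 stands in for whatever hash we'd actually use, the ring would be cached rather than rebuilt per call, and replication and rebalancing are ignored.

%% Map a doc id directly to one of N globally known nodes.
-module(ring_sketch).
-export([node_for/2]).

node_for(DocId, Nodes) ->
    Hash = erlang:phash2(DocId),
    %% Place nodes on a ring by hashing their names.
    Ring = lists:keysort(1, [{erlang:phash2(N), N} || N <- Nodes]),
    %% Pick the first node at or past the doc's position, wrapping around.
    case [Node || {Pos, Node} <- Ring, Pos >= Hash] of
        [First | _] -> First;
        []          -> element(2, hd(Ring))
    end.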

Cheers, Adam
