On Apr 1, 2009, at 11:03 AM, Chris Anderson wrote:
2) What about _all_docs and seq-num?

I presume _all_docs gets merged like any other view. _all_docs_by_seq is a different story. In the current code the sequence number is incremented by one for every update. If we want to preserve that behavior in partitioned databases we need some sort of consensus algorithm or master server to choose the next sequence number. It could easily turn into a bottleneck or single point of failure if we're not careful.
The alternatives are to a) abandon the current format for update sequences in favor of vector clocks or something more opaque, or b) have _all_docs_by_seq be strictly a node-local query. I'd prefer the former, as I think it will be useful for e.g. external indexers to treat the partitioned database just like a single-server one. If we do the latter, I think it means that either the external indexers have to be installed on every node, or at least they have to be aware of all the partitions.
If at all possible I think we should have the entire partition group appear as a single server from the outside. One thing that comes to mind here is a question about sequence numbers. Vector clocks only guarantee a partial ordering, but I'm under the impression that currently seq numbers have a strict ordering.
Database sequence numbers are used in replication and in determining whether views need refreshing. Anything else I'm missing? Currently it seems there is no tracking of which updates actually change a view index (comment on line 588 of couch_httpd_view.erl on trunk). Improving this would be a nice win. See my answer to number (3).
The easy way to manage seq numbers is to let one node be the write master for any cluster. (The root node of any partition group could actually be a cluster, but if writes always go through a master, the master can maintain the global sequence number for the partition group.)
The problem with this approach is that the main use-case for
partitioning is when your incoming writes exceed the capacity of a
single node. By partitioning the key-space, you can get more
write-throughput.
I think Randall was saying requests just have to originate at the
master node. That master node could do nothing more than assign a
sequence number, choose a node, and proxy the request down the tree
for the heavy lifting. I bet we could get pretty good throughput, but
I still worry about this approach for availability reasons.
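As a strawman, the master really could be as small as a counter process. A minimal sketch, assuming a gen_server whose only job is handing out the next global sequence number (module and function names here are invented for illustration, not anything in trunk):

    -module(seq_master_sketch).
    -behaviour(gen_server).
    -export([start_link/0, next_seq/0]).
    -export([init/1, handle_call/3, handle_cast/2]).

    %% The write master only assigns the next global update sequence;
    %% the document write itself is proxied to a partition node.
    start_link() ->
        gen_server:start_link({local, ?MODULE}, ?MODULE, 0, []).

    next_seq() ->
        gen_server:call(?MODULE, next_seq).

    init(Seq) ->
        {ok, Seq}.

    handle_call(next_seq, _From, Seq) ->
        {reply, Seq + 1, Seq + 1}.

    handle_cast(_Msg, Seq) ->
        {noreply, Seq}.

Even so, that single process is exactly the availability concern: lose it and no partition can accept a write until a new master takes over.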
I'm not sure that an update-seq per node is such a bad thing, as it
will require any external indexers to be deployed in a 1-to-1
relationship to nodes, which automatically balances the load for the
indexer as well. With a merged seq-id, users would be encouraged to
partition CouchDB without bothering to partition indexers. Maybe this
is acceptable in some cases, but not in the general case.
So, the vector clock approach still has a per-node update sequence for
each node's local clock, it just does the best job possible of
globally ordering those per-node sequences. We could easily offer
local update sequences as well via some query string parameter. I
understand the desire to encourage partitioned indexers, but I believe
that won't always be possible. Bottom line, I think we should support
global indexing of a partitioned DB.
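To make that concrete, a rough sketch of a per-node clock kept as an orddict of {Node, LocalSeq} pairs (the module and function names are mine, just for illustration):

    -module(vclock_sketch).
    -export([increment/2, merge/2, descends/2]).

    %% Bump this node's local update sequence.
    increment(Node, Clock) ->
        orddict:update_counter(Node, 1, Clock).

    %% Combine two clocks by keeping the highest sequence seen per node.
    merge(A, B) ->
        orddict:merge(fun(_Node, SeqA, SeqB) -> max(SeqA, SeqB) end, A, B).

    %% A descends B if A has seen everything B has; clocks that don't
    %% descend each other are concurrent, which is why the global
    %% ordering is only partial.
    descends(A, B) ->
        lists:all(fun({Node, SeqB}) ->
                      case orddict:find(Node, A) of
                          {ok, SeqA} -> SeqA >= SeqB;
                          error      -> false
                      end
                  end, B).

A node's local update sequence is just its own entry in its clock, so exposing it via a query string parameter would be cheap.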
One other thing that bothers me is the merge-sort required for every view lookup. In *really* large clusters it won't be good if queries for a single key in a view have to hit each partition. We could have an alternative structure where each view gets partitioned much like the document data while it's built. I worry that a view partitioned in this way may need frequent rebalancing during the build, since view keys are probably not going to be uniformly distributed. Nevertheless, I think the benefit of having many view queries only hit a small subset of nodes in the cluster is pretty huge.
I agree that the merge-sort is something we need to look at carefully. We should never hit a node in a view query unless it has data we need. We certainly can't avoid merging altogether, but we can make an effort to do smart rebalancing later on.
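For what it's worth, the merge itself is the easy part. A sketch of a coordinator folding together the already-sorted row lists returned by each partition (names invented here, rows assumed to be {Key, DocId, Value} tuples sorted by Key):

    -module(view_merge_sketch).
    -export([merge_rows/1]).

    %% Fold the per-partition sorted row lists into one sorted list
    %% with repeated two-way merges on the view key.
    merge_rows(RowLists) ->
        lists:foldl(
          fun(Rows, Acc) ->
              lists:merge(fun(A, B) -> element(1, A) =< element(1, B) end,
                          Rows, Acc)
          end, [], RowLists).

The expensive part is that, without rebalancing or some knowledge of key ranges, every partition has to be asked for rows in the first place.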
I think rebalancing aka shuffling will turn out to be one of those devil-in-the-details things. Because any document can emit any key, if you have to rebuild part of an index due to node failure, you'd need to re-request from every other node any view rows that might fit in that range. This requires every node to know about every other node.
If view data is stored with document data, then nodes need only know
about their child nodes, in the tree structure. Recovering from
node-failure is easy: just swap in the failed node's hot-backup, and
regenerate the views on it.
I agree that the cost of merge sort will be ongoing, but I think the
simplicity of this approach at least indicates that we should take it
for the initial work. If we consider rebalancing an optimization, we
can add it later.
I think a better optimization would be to have inner nodes of the tree
lazily cache the view rows of their children. This way the computation
is spread out but the hops for popular queries can be mostly
eliminated.
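Roughly, assuming an ets table per inner node and leaving out the hard part (invalidating entries when a child re-indexes), something like:

    -module(view_cache_sketch).
    -export([new/0, rows_for/3]).

    %% Each inner node keeps a table of view rows it has already pulled
    %% from its children, so popular keys skip the extra hops.
    new() ->
        ets:new(view_row_cache, [set, public]).

    %% FetchFun is whatever call the inner node makes to gather rows
    %% for Key from its children; it only runs on a cache miss.
    rows_for(Cache, Key, FetchFun) ->
        case ets:lookup(Cache, Key) of
            [{Key, Rows}] ->
                Rows;
            [] ->
                Rows = FetchFun(Key),
                ets:insert(Cache, {Key, Rows}),
                Rows
        end.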
Ok, +1 from me. I totally agree that rebalancing can get hairy, but
hey, that's what makes this fun!
4) Should the consistent hashing algorithm map ids to leaf nodes or just to children? I lean toward children because it encapsulates knowledge about the layout of subtrees at each tree level.
If the algorithm maps to children, does that mean every document lookup has to traverse the tree? I'm not sure that's a great idea. Up to ~100 nodes I think it may be better to have all document lookups take O(1) hops. I think distributed Erlang can keep a system of that size globally connected without too much trouble.
I like the strict tree approach. I'd translate Adam's comment as:
distributed Erlang can probably handle a tree of depth=1, even with
~100 nodes.
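Just to make sure we mean the same thing by mapping to children: each node would hash a doc id onto its own children only, something like this sketch (module and names invented, eight virtual points per child picked arbitrarily):

    -module(chash_sketch).
    -export([ring/1, owner/2]).

    %% Build a sorted ring of {Point, Child} entries, with a few
    %% virtual points per child to even out the distribution.
    ring(Children) ->
        lists:sort([{erlang:phash2({Child, I}), Child}
                    || Child <- Children, I <- lists:seq(1, 8)]).

    %% A doc id belongs to the first ring point at or after its hash,
    %% wrapping around to the start of the ring.
    owner(DocId, Ring) ->
        Hash = erlang:phash2(DocId),
        case [Child || {Point, Child} <- Ring, Point >= Hash] of
            [Child | _] -> Child;
            []          -> element(2, hd(Ring))
        end.

A lookup then repeats owner/2 at each level until it reaches a leaf, which is where the tree-depth versus O(1)-hops tradeoff comes from.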
I'd like to hear more about how we implement redundancy and handle
node failures in the tree structure. In a pure consistent hashing
ring, whether globally connected (Dynamo) or not (Chord), there are
clear procedures for dealing with node failures, usually involving
storing copies of the data at adjacent nodes along the ring. Do we
have an analogue of that in the tree? I'm especially worried about
what happens when inner nodes go down.
Best, Adam