On Apr 1, 2009, at 1:59 AM, Randall Leeds wrote:
On Sun, Mar 29, 2009 at 22:12, Adam Kocoloski <[email protected]> wrote:
Hi Randall, cool! I can chime in on a couple of the questions ...
Adam, thanks for your quick reply and thorough comments. The more people chime in on this discussion, the better I can make the proposal, both in terms of its likelihood of acceptance by a mentor/Google and the value of the resulting work for the community. I will aim to post a formalized draft of the proposal on my GSoC registration page sometime tomorrow and open it up for comments. The submission deadline is Friday.
Sounds like a plan to me.
2) What about _all_docs and seq-num?
I presume _all_docs gets merged like any other view. _all_docs_by_seq is a different story. In the current code the sequence number is incremented by one for every update. If we want to preserve that behavior in partitioned databases we need some sort of consensus algorithm or master server to choose the next sequence number. It could easily turn into a bottleneck or single point of failure if we're not careful.
The alternatives are to a) abandon the current format for update sequences in favor of vector clocks or something more opaque, or b) have _all_docs_by_seq be strictly a node-local query. I'd prefer the former, as I think it will be useful for e.g. external indexers to treat the partitioned database just like a single-server one. If we do the latter, I think it means that the external indexers either have to be installed on every node, or at least have to be aware of all the partitions.
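For option (a), even something as simple as a vector of per-partition seqs packed into an opaque token might do. A rough sketch of what I mean (the module and names here are made up, not working CouchDB code):

-module(opaque_seq).
-export([pack/1, unpack/1, updates_since/2]).

%% Pack a list of {PartitionId, LocalSeq} pairs into an opaque token
%% that external clients hand back without interpreting it.
pack(PartitionSeqs) ->
    base64:encode(term_to_binary(lists:sort(PartitionSeqs))).

unpack(Token) ->
    binary_to_term(base64:decode(Token)).

%% Given a client's last-seen token and the current per-partition
%% seqs, compute the {Partition, From, To} ranges it still needs.
updates_since(Token, CurrentSeqs) ->
    Old = unpack(Token),
    [{P, proplists:get_value(P, Old, 0), NewSeq}
     || {P, NewSeq} <- CurrentSeqs,
        NewSeq > proplists:get_value(P, Old, 0)].

A client polls with its last token, we answer with everything past the remembered local seq in each partition, and it never needs to know how many partitions there are.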
If at all possible I think we should have the entire partition group appear as a single server from the outside. One thing that comes to mind here is a question about sequence numbers: vector clocks only guarantee a partial ordering, but I'm under the impression that seq numbers currently have a strict total ordering.
Yes, that's a good point. Vector clocks may not be sufficient here. On the other hand, do we absolutely need a strict ordering of events? If the purpose of these sequence numbers is to ensure that replicators and indexers don't miss any updates, can't we just interpret GET _all_docs_by_seq as "give me all the updates that *might* have happened after X"? That's a request we can answer with vector clocks: it's the set of all updates X' whose clocks are not dominated by X's, i.e. for which VC(X') <= VC(X) does not hold. Of course, it's less efficient in that we may send duplicate updates in a write-heavy scenario.
Caveat: I haven't given much thought to how we'd efficiently store old versions of the vector clock at all nodes.
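To make that filter concrete, here's a rough sketch with clocks as plain {Node, Counter} proplists (again made-up code, nothing that exists in trunk):

-module(vclock_sketch).
-export([descends/2, maybe_after/2]).

counter(Node, Clock) ->
    proplists:get_value(Node, Clock, 0).

%% descends(A, B): true when A has seen everything B has,
%% i.e. VC(B) =< VC(A) in the partial order.
descends(A, B) ->
    lists:all(fun({Node, Cb}) -> counter(Node, A) >= Cb end, B).

%% "All the updates that *might* have happened after X": keep every
%% update whose clock is NOT dominated by X's. Concurrent updates
%% pass the filter too -- those are the duplicates mentioned above.
maybe_after(XClock, Updates) ->
    [U || {_Id, UClock} = U <- Updates, not descends(XClock, UClock)].

%% For X = [{n1,2},{n2,1}]:
%%   [{n1,3},{n2,1}] is kept    (strictly after X)
%%   [{n1,1},{n2,2}] is kept    (concurrent with X -- a possible duplicate)
%%   [{n1,2}]        is dropped (already covered by X)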
Database sequence numbers are used in replication and in determining whether views need refreshing. Anything else I'm missing?
Any external indexers (couchdb-lucene, for instance) also need the sequence numbers.
Currently it seems there is no tracking of which updates actually change a view index (comment on line 588 of couch_httpd_view.erl on trunk). Improving this would be a nice win. See my answer to number (3).
That's only partially true. You're right that the ETags aren't yet smart enough to know when a view stayed the same. However, we definitely do track relationships between documents and view keys in a separate btree -- we have to if we want to expire the old view rows when a document is updated. I think we should eventually be able to leverage this information to be smarter about view ETags.
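In caricature the bookkeeping looks like this, with maps standing in for our btrees (hypothetical code, just to show the shape of it):

-module(view_backindex).
-export([update_doc/3]).

%% State = #{by_key := map of {Key, DocId} -> Value,
%%           by_id  := map of DocId -> [Key]}.
update_doc(DocId, EmittedRows, #{by_key := ByKey0, by_id := ById0}) ->
    %% 1. Look up the keys this doc emitted last time and expire them.
    OldKeys = maps:get(DocId, ById0, []),
    ByKey1 = lists:foldl(
               fun(Key, Acc) -> maps:remove({Key, DocId}, Acc) end,
               ByKey0, OldKeys),
    %% 2. Insert the fresh rows from the latest map run.
    ByKey2 = lists:foldl(
               fun({Key, Value}, Acc) -> Acc#{{Key, DocId} => Value} end,
               ByKey1, EmittedRows),
    %% 3. Record the new key list so the next update can expire these.
    NewKeys = [Key || {Key, _Value} <- EmittedRows],
    #{by_key => ByKey2, by_id => ById0#{DocId => NewKeys}}.

An update whose map run emits exactly the same rows as last time is detectable right there, which is the hook for smarter ETags.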
The easy way to manage seq numbers is to let one node be the write master for any cluster. (The root node of any partition group could actually be a cluster, but if writes always go through a master, the master can maintain the global sequence number for the partition group.)
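All I have in mind is a single counter process, something like this toy gen_server (hypothetical module, not proposed code):

-module(seq_master).
-behaviour(gen_server).
-export([start_link/0, next_seq/0]).
-export([init/1, handle_call/3, handle_cast/2]).

start_link() ->
    gen_server:start_link({local, ?MODULE}, ?MODULE, 0, []).

%% Every write in the partition group funnels through this call.
next_seq() ->
    gen_server:call(?MODULE, next_seq).

init(Seq) -> {ok, Seq}.

handle_call(next_seq, _From, Seq) ->
    {reply, Seq + 1, Seq + 1}.

handle_cast(_Msg, Seq) ->
    {noreply, Seq}.

Every write in the group serializes through that one call, which is of course the scaling and availability question.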
Yeah, this scares me a little bit. I assume by a write master you mean a node that's responsible for ordering all of the updates to a database, regardless of where those documents are actually stored on disk. I'm sure we can build a performant implementation (it's just a counter, after all), but I worry about the availability of such a system. I guess that's what supervision trees are for ... but I'd prefer to try to solve these problems in a decentralized manner if possible. My $.02.
3) Can we agree on a proposed solution to the layout of partition nodes? I like the tree solution, as long as it is extremely flexible wrt tree depth.
I'm not sure we're ready to do that. In fact, I think we may need to implement a couple of different topologies and profile them to see what works best. The tree topology is an interesting idea, but it may ...
That was a silly question. I didn't expect these questions to be easy; it should have read as a discussion prompt rather than a call for consensus.

We should probably clarify the impetus for a tree structure. Computationally intensive reduces are the primary use case, and the tree is a good way to get speedup there. In the case of a map-only view, we still need to merge and aggregate the results from each shard. This merge needs to happen somewhere, likely either at the node that's servicing the request or recursively up the tree (a sketch of such a merge is below). In either case, we agree that there's not much win if every view request has to hit every node. Therefore, I think we may need to start tracking which updates affect the view index.
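The merge itself is cheap, since each shard can return its rows already sorted by key; roughly (hypothetical module):

-module(view_merge).
-export([merge/1]).

%% Merge N sorted row lists ([{Key, Value}]) into one sorted list.
merge([]) -> [];
merge([Rows]) -> Rows;
merge(RowLists) ->
    %% Pairwise merging gives the recursive "up the tree" shape:
    %% leaves merge their own shards, parents merge their children.
    merge(merge_pairs(RowLists)).

merge_pairs([A, B | Rest]) -> [merge2(A, B) | merge_pairs(Rest)];
merge_pairs(Rest) -> Rest.

merge2([], B) -> B;
merge2(A, []) -> A;
merge2([{Ka, _} = Ra | As], [{Kb, _} = Rb | Bs]) when Ka =< Kb ->
    [Ra | merge2(As, [Rb | Bs])];
merge2(As, [Rb | Bs]) ->
    [Rb | merge2(As, Bs)].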
Good point -- caching the map-only views from leaf nodes could be a nice win for the tree structure. It hadn't clicked for me until just now.

Best,
Adam
So, we need a consistent hash implementation. I will include this in the proposal.
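As a starting point I'm picturing something like this toy ring, with virtual points per node so keys spread evenly and little data moves when membership changes (made-up module, not the final API):

-module(chash_sketch).
-export([ring/1, node_for/2]).

%% Place each node on the ring at 64 virtual points. Assumes a
%% non-empty node list.
ring(Nodes) ->
    lists:sort([{erlang:phash2({Node, I}), Node}
                || Node <- Nodes, I <- lists:seq(1, 64)]).

%% A key belongs to the first node clockwise from its hash; wrap
%% around to the start of the ring if we fall off the end. (A linear
%% scan is fine for a sketch; a real ring would binary-search.)
node_for(Key, Ring) ->
    H = erlang:phash2(Key),
    case [Node || {Point, Node} <- Ring, Point >= H] of
        [Node | _] -> Node;
        [] -> element(2, hd(Ring))
    end.

e.g. node_for(DocId, ring([a, b, c])) picks the partition for a document.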
From there, where should we go?
Thanks in advance,
Randall