On Apr 1, 2009, at 1:59 AM, Randall Leeds wrote:

On Sun, Mar 29, 2009 at 22:12, Adam Kocoloski <[email protected]> wrote:

Hi Randall, cool!  I can chime in on a couple of the questions ...


Adam, thanks for your quick reply and thorough comments. The more people who chime in on this discussion, the better I can make the proposal, both in terms of its likelihood of acceptance by a mentor/Google and the value of the resulting work for the community. I will aim to post a formalized draft of the proposal on my GSoC registration page sometime tomorrow and open it up for comments. The submission deadline is Friday.

Sounds like a plan to me.

2) What about _all_docs and seq-num?

I presume _all_docs gets merged like any other view. _all_docs_by_seq is a different story. In the current code the sequence number is incremented by one for every update. If we want to preserve that behavior in partitioned databases, we need some sort of consensus algorithm or master server to choose the next sequence number. It could easily turn into a bottleneck or single point of failure if we're not careful.

The alternatives are to a) abandon the current format for update sequences in favor of vector clocks or something more opaque, or b) make _all_docs_by_seq a strictly node-local query. I'd prefer the former, as I think it will be useful for e.g. external indexers to treat the partitioned database just like a single-server one. If we do the latter, I think it means that either the external indexers have to be installed on every node, or at least they have to be aware of all the partitions.

If at all possible, I think we should have the entire partition group appear as a single server from the outside. One thing that comes to mind here is a question about sequence numbers. Vector clocks only guarantee a partial ordering, but I'm under the impression that seq numbers currently have a strict total ordering.

Yes, that's a good point. Vector clocks may not be sufficient here. On the other hand, do we absolutely need a strict ordering of events? If the purpose of these sequence numbers is to ensure that replicators and indexers don't miss any updates, can't we just interpret GET _all_docs_by_seq as "give me all the updates that *might* have happened after X"? That's a request we can answer with vector clocks: it's the set of all updates X' whose clocks are not dominated by ours, i.e. every X' for which VC(X') <= VC(X) does not hold (and that includes updates concurrent with X). Of course, it's less efficient in that we may send duplicate updates in a write-heavy scenario.
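
For concreteness, here is a minimal sketch of the comparison I have in mind, with clocks represented as lists of {Node, Counter} pairs. The module and function names are made up for illustration; none of this is existing CouchDB code:

    -module(vc).
    -export([dominates/2, maybe_after/2]).

    %% dominates(A, B): true when clock A has seen everything B has,
    %% i.e. B happened-before (or equals) A.
    dominates(A, B) ->
        lists:all(fun({Node, Cb}) ->
                      case lists:keyfind(Node, 1, A) of
                          {Node, Ca} -> Ca >= Cb;
                          false      -> false
                      end
                  end, B).

    %% An update "might have happened after" the client's checkpoint
    %% unless the checkpoint dominates it; concurrent updates are
    %% included, which is exactly where the duplicate sends come from.
    maybe_after(VcUpdate, VcCheckpoint) ->
        not dominates(VcCheckpoint, VcUpdate).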

Caveat: I haven't given much thought to how we'd efficiently store old versions of the vector clock at all nodes.

Database sequence numbers are used in replication and in determining whether views need refreshing. Anything else I'm missing?

Any external indexers (couchdb-lucene, for instance) also need the sequence numbers.

Currently it seems there is no tracking of which updates actually change a view index (comment on line 588 of couch_httpd_view.erl on trunk). Improving this would be a nice win. See my answer to number (3).

That's only partially true. You're right that the ETags aren't yet smart enough to know when a view stayed the same. However, we definitely do track relationships between documents and view keys in a separate btree -- we have to if we want to expire the old view rows when a document is updated. I think we should eventually be able to leverage this information to be smarter about view ETags.
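
To make that concrete, here's a toy model of the back index, with plain maps standing in for the btrees and rows keyed by {Key, DocId} so that different documents can emit the same key. This is illustrative only, not the actual couch_view code:

    %% ByKey: {Key, DocId} -> Value (the view rows).
    %% ByDoc: DocId -> the keys that doc emitted when last indexed.
    update_doc_in_view(DocId, NewKVs, ByKey0, ByDoc0) ->
        %% Expire the rows emitted by the previous revision...
        OldKeys = maps:get(DocId, ByDoc0, []),
        ByKey1 = lists:foldl(fun(K, Acc) -> maps:remove({K, DocId}, Acc) end,
                             ByKey0, OldKeys),
        %% ...insert the rows for the new revision...
        ByKey2 = lists:foldl(fun({K, V}, Acc) -> maps:put({K, DocId}, V, Acc) end,
                             ByKey1, NewKVs),
        %% ...and remember which keys it emitted for next time.
        ByDoc = maps:put(DocId, [K || {K, _} <- NewKVs], ByDoc0),
        {ByKey2, ByDoc}.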

The easy way to manage seq numbers is to let one node be the write master for any cluster. (The root node of any partition group could actually be a cluster, but if writes always go through a master, the master can maintain the global sequence number for the partition group.)

Yeah, this scares me a little bit. I assume by a write master you mean a node that's responsible for ordering all of the updates to a database, regardless of where those documents are actually stored on disk. I'm sure we can build a performant implementation (it's just a counter, after all), but I worry about the availability of such a system. I guess that's what supervision trees are for ... but I'd prefer to try to solve these problems in a decentralized manner if possible. My $.02.
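
For what it's worth, "just a counter" really is only a few lines of Erlang; this hypothetical sketch (module name and API are placeholders) is the easy part, and keeping it available is the hard part:

    -module(seq_master).
    -behaviour(gen_server).
    -export([start_link/0, next/0]).
    -export([init/1, handle_call/3, handle_cast/2]).

    start_link() ->
        gen_server:start_link({local, ?MODULE}, ?MODULE, 0, []).

    %% Every write in the partition group funnels through this call,
    %% which is exactly the bottleneck and availability concern above.
    next() ->
        gen_server:call(?MODULE, next).

    init(Seq0) -> {ok, Seq0}.

    handle_call(next, _From, Seq) -> {reply, Seq + 1, Seq + 1}.

    handle_cast(_Msg, Seq) -> {noreply, Seq}.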



3) Can we agree on a proposed solution to the layout of partition nodes? I like the tree solution, as long as it is extremely flexible wrt tree depth.


I'm not sure we're ready to do that. In fact, I think we may need to implement a couple of different topologies and profile them to see what works best. The tree topology is an interesting idea, but it may ...


That was a silly question; I didn't expect these questions to be easy. It should have read as a discussion prompt rather than a call for consensus.

We should probably clarify the impetus for a tree structure. Computationally intensive reduces are the primary use case, and the tree is a good way to get speedup there. In the case of a map-only view, we still need to merge and aggregate the results from each shard. This merge needs to happen somewhere, likely either at the node that's servicing the request or recursively up the tree. In either case, we agree that there's not much win if every view request has to hit every node. Therefore, I think we may need to start tracking which updates affect the view index.
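
Assuming each shard returns its rows already sorted by key, the merge step itself is cheap. A rough sketch (the function name and row format are mine, not anything in the codebase):

    %% Each element of ShardResults is a key-sorted list of rows from
    %% one shard; lists:merge/2 keeps the accumulated result sorted.
    merge_shard_rows(ShardResults) ->
        lists:foldl(fun lists:merge/2, [], ShardResults).

The expensive part is shipping and aggregating the rows, not the merge itself, which is why avoiding a hit on every node matters.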

Good point -- caching the map-only views from leaf nodes could be a nice win for the tree structure. It hadn't clicked for me until just now.

Best,

Adam

So, we need a consistent hash implementation. I will include this in the proposal.
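
Roughly what I have in mind is a bare-bones ring, with no virtual nodes or replication factor yet, and all names are placeholders:

    -module(chash).
    -export([ring/1, owner/2]).

    %% Hash any term onto a 2^32-point ring.
    hash(Term) ->
        binary:decode_unsigned(crypto:hash(sha, term_to_binary(Term)))
            rem (1 bsl 32).

    %% Build the ring as a sorted [{Point, Node}] list.
    ring(Nodes) ->
        lists:sort([{hash(N), N} || N <- Nodes]).

    %% A doc id belongs to the first node clockwise from its hash,
    %% wrapping around to the start of the ring.
    owner(DocId, [{_, FirstNode} | _] = Ring) ->
        Point = hash(DocId),
        case [N || {P, N} <- Ring, P >= Point] of
            [Owner | _] -> Owner;
            []          -> FirstNode
        end.

The property we're after: adding or removing a node only remaps the keys between that node and its predecessor on the ring, instead of reshuffling everything.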
From there, where should we go?

Thanks in advance,
Randall
