On Apr 1, 2009, at 1:59 AM, Randall Leeds wrote:
On Sun, Mar 29, 2009 at 22:12, Adam Kocoloski <[email protected]> wrote:
Hi Randall, cool! I can chime in on a couple of the questions ...
Adam, thanks for your quick reply and thorough comments. The more people chime in on this discussion, the better I can make the proposal, both in terms of its likelihood of acceptance by a mentor/Google and the value of the resulting work for the community. I will aim to post a formalized draft of the proposal on my GSoC registration page sometime tomorrow and open it up for comments. The submission deadline is Friday.
Sounds like a plan to me.
2) What about _all_docs and seq-num?
I presume _all_docs gets merged like any other view. _all_docs_by_seq is a different story. In the current code the sequence number is incremented by one for every update. If we want to preserve that behavior in partitioned databases we need some sort of consensus algorithm or master server to choose the next sequence number. It could easily turn into a bottleneck or single point of failure if we're not careful.
The alternatives are to a) abandon the current format for update sequences in favor of vector clocks or something more opaque, or b) have _all_docs_by_seq be strictly a node-local query. I'd prefer the former, as I think it will be useful for e.g. external indexers to treat the partitioned database just like a single-server one. If we do the latter, I think it means that the external indexers either have to be installed on every node, or at least have to be aware of all the partitions.
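For option (a), even something as simple as a vector of per-partition seqs packed into an opaque token might do. A rough sketch of what I mean (the module and names here are made up, not working CouchDB code):

-module(opaque_seq).
-export([pack/1, unpack/1, updates_since/2]).

%% Pack a list of {PartitionId, LocalSeq} pairs into an opaque token
%% that external clients hand back without interpreting it.
pack(PartitionSeqs) ->
    base64:encode(term_to_binary(lists:sort(PartitionSeqs))).

unpack(Token) ->
    binary_to_term(base64:decode(Token)).

%% Given a client's last-seen token and the current per-partition
%% seqs, compute the {Partition, From, To} ranges it still needs.
updates_since(Token, CurrentSeqs) ->
    Old = unpack(Token),
    [{P, proplists:get_value(P, Old, 0), NewSeq}
     || {P, NewSeq} <- CurrentSeqs,
        NewSeq > proplists:get_value(P, Old, 0)].

A client polls with its last token, we answer with everything past the remembered local seq in each partition, and it never needs to know how many partitions there are.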
If at all possible I think we should have the entire partition group appear as a single server from the outside. One thing that comes to mind here is a question about sequence numbers: vector clocks only guarantee a partial ordering, but I'm under the impression that seq numbers currently have a strict total ordering.
Yes, that's a good point. Vector clocks may not be sufficient here. On the other hand, do we absolutely need a strict ordering of events? If the purpose of these sequence numbers is to ensure that replicators and indexers don't miss any updates, can't we just interpret GET _all_docs_by_seq as "give me all the updates that *might* have happened after X"? That's a request we can answer with vector clocks: it's the set of all updates X' whose clocks are not dominated by X's, i.e. for which VC(X') <= VC(X) does not hold. Of course, it's less efficient in that we may send duplicate updates in a write-heavy scenario.
Caveat: I haven't given much thought to how we'd efficiently store old versions of the vector clock at all nodes.
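To make that filter concrete, here's a rough sketch with clocks as plain {Node, Counter} proplists (again made-up code, nothing that exists in trunk):

-module(vclock_sketch).
-export([descends/2, maybe_after/2]).

counter(Node, Clock) ->
    proplists:get_value(Node, Clock, 0).

%% descends(A, B): true when A has seen everything B has,
%% i.e. VC(B) =< VC(A) in the partial order.
descends(A, B) ->
    lists:all(fun({Node, Cb}) -> counter(Node, A) >= Cb end, B).

%% "All the updates that *might* have happened after X": keep every
%% update whose clock is NOT dominated by X's. Concurrent updates
%% pass the filter too -- those are the duplicates mentioned above.
maybe_after(XClock, Updates) ->
    [U || {_Id, UClock} = U <- Updates, not descends(XClock, UClock)].

%% For X = [{n1,2},{n2,1}]:
%%   [{n1,3},{n2,1}] is kept    (strictly after X)
%%   [{n1,1},{n2,2}] is kept    (concurrent with X -- a possible duplicate)
%%   [{n1,2}]        is dropped (already covered by X)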
Database sequence numbers are used in replication and in determining whether views need refreshing. Anything else I'm missing?
Any external indexers (couchdb-lucene, for instance) also need the sequence numbers.
Currently it seems there is no tracking of which updates actually change a view index (comment on line 588 of couch_httpd_view.erl on trunk). Improving this would be a nice win. See my answer to number (3).
That's only partially true. You're right that the ETags aren't yet smart enough to know when a view stayed the same. However, we definitely do track relationships between documents and view keys in a separate btree -- we have to if we want to expire the old view rows when a document is updated. I think we should eventually be able to leverage this information to be smarter about view ETags.
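In caricature the bookkeeping looks like this, with maps standing in for our btrees (hypothetical code, just to show the shape of it):

-module(view_backindex).
-export([update_doc/3]).

%% State = #{by_key := map of {Key, DocId} -> Value,
%%           by_id  := map of DocId -> [Key]}.
update_doc(DocId, EmittedRows, #{by_key := ByKey0, by_id := ById0}) ->
    %% 1. Look up the keys this doc emitted last time and expire them.
    OldKeys = maps:get(DocId, ById0, []),
    ByKey1 = lists:foldl(
               fun(Key, Acc) -> maps:remove({Key, DocId}, Acc) end,
               ByKey0, OldKeys),
    %% 2. Insert the fresh rows from the latest map run.
    ByKey2 = lists:foldl(
               fun({Key, Value}, Acc) -> Acc#{{Key, DocId} => Value} end,
               ByKey1, EmittedRows),
    %% 3. Record the new key list so the next update can expire these.
    NewKeys = [Key || {Key, _Value} <- EmittedRows],
    #{by_key => ByKey2, by_id => ById0#{DocId => NewKeys}}.

An update whose map run emits exactly the same rows as last time is detectable right there, which is the hook for smarter ETags.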
The easy way to manage seq numbers is to let one node be the write master for any cluster. (The root node of any partition group could actually be a cluster, but if writes always go through a master, the master can maintain the global sequence number for the partition group.)
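All I have in mind is a single counter process, something like this toy gen_server (hypothetical module, not proposed code):

-module(seq_master).
-behaviour(gen_server).
-export([start_link/0, next_seq/0]).
-export([init/1, handle_call/3, handle_cast/2]).

start_link() ->
    gen_server:start_link({local, ?MODULE}, ?MODULE, 0, []).

%% Every write in the partition group funnels through this call.
next_seq() ->
    gen_server:call(?MODULE, next_seq).

init(Seq) -> {ok, Seq}.

handle_call(next_seq, _From, Seq) ->
    {reply, Seq + 1, Seq + 1}.

handle_cast(_Msg, Seq) ->
    {noreply, Seq}.

Every write in the group serializes through that one call, which is of course the scaling and availability question.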
Yeah, this scares me a little bit. I assume by a write master you mean a node that's responsible for ordering all of the updates to a database, regardless of where those documents are actually stored on disk. I'm sure we can build a performant implementation (it's just a counter, after all), but I worry about the availability of such a system. I guess that's what supervision trees are for ... but I'd prefer to try to solve these problems in a decentralized manner if possible. My $.02.
3) Can we agree on a proposed solution to the layout of partition nodes? I like the tree solution, as long as it is extremely flexible wrt tree depth.
I'm not sure we're ready to do that. In fact, I think we may need to implement a couple of different topologies and profile them to see what works best. The tree topology is an interesting idea, but it may ...
That was a silly question. I didn't expect these questions to be easy; it should have read as a discussion prompt rather than a call for consensus.

We should probably clarify the impetus for a tree structure. Computationally intensive reduces are the primary use case, and the tree is a good way to get speedup there. In the case of a map-only view, we still need to merge and aggregate the results from each shard. This merge needs to happen somewhere, likely either at the node that's servicing the request or recursively up the tree (a sketch of such a merge is below). In either case, we agree that there's not much win if every view request has to hit every node. Therefore, I think we may need to start tracking which updates affect the view index.
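The merge itself is cheap, since each shard can return its rows already sorted by key; roughly (hypothetical module):

-module(view_merge).
-export([merge/1]).

%% Merge N sorted row lists ([{Key, Value}]) into one sorted list.
merge([]) -> [];
merge([Rows]) -> Rows;
merge(RowLists) ->
    %% Pairwise merging gives the recursive "up the tree" shape:
    %% leaves merge their own shards, parents merge their children.
    merge(merge_pairs(RowLists)).

merge_pairs([A, B | Rest]) -> [merge2(A, B) | merge_pairs(Rest)];
merge_pairs(Rest) -> Rest.

merge2([], B) -> B;
merge2(A, []) -> A;
merge2([{Ka, _} = Ra | As], [{Kb, _} = Rb | Bs]) when Ka =< Kb ->
    [Ra | merge2(As, [Rb | Bs])];
merge2(As, [Rb | Bs]) ->
    [Rb | merge2(As, Bs)].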
Good point -- caching the map-only views from leaf nodes could be a nice win for the tree structure. It hadn't clicked for me until just now.

Best,
Adam
So, we need a consistent hash implementation. I will include this in the proposal.
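As a starting point I'm picturing something like this toy ring, with virtual points per node so keys spread evenly and little data moves when membership changes (made-up module, not the final API):

-module(chash_sketch).
-export([ring/1, node_for/2]).

%% Place each node on the ring at 64 virtual points. Assumes a
%% non-empty node list.
ring(Nodes) ->
    lists:sort([{erlang:phash2({Node, I}), Node}
                || Node <- Nodes, I <- lists:seq(1, 64)]).

%% A key belongs to the first node clockwise from its hash; wrap
%% around to the start of the ring if we fall off the end. (A linear
%% scan is fine for a sketch; a real ring would binary-search.)
node_for(Key, Ring) ->
    H = erlang:phash2(Key),
    case [Node || {Point, Node} <- Ring, Point >= H] of
        [Node | _] -> Node;
        [] -> element(2, hd(Ring))
    end.

e.g. node_for(DocId, ring([a, b, c])) picks the partition for a document.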
From there, where should we go?
Thanks in advance,
Randall