On Apr 1, 2009, at 11:58 AM, Chris Anderson wrote:

On Wed, Apr 1, 2009 at 8:37 AM, Adam Kocoloski <[email protected]> wrote:
On Apr 1, 2009, at 11:03 AM, Chris Anderson wrote:

 2) What about _all_docs and seq-num?

I presume _all_docs gets merged like any other view. _all_docs_by_seq is a different story. In the current code the sequence number is incremented by one for every update. If we want to preserve that behavior in partitioned databases, we need some sort of consensus algorithm or master server to choose the next sequence number. It could easily turn into a bottleneck or single point of failure if we're not careful.
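(For illustration only, here is a minimal sketch of what such a master could look like: a single Erlang process that hands out the next global sequence number for the partition group. The module and function names are made up, not CouchDB code, and this is exactly the kind of bottleneck / single point of failure being described.)

  %% Hypothetical sketch: one process owns the global update sequence,
  %% so every write in the partition group must round-trip through it.
  -module(seq_master_sketch).
  -behaviour(gen_server).
  -export([start_link/0, next_seq/0]).
  -export([init/1, handle_call/3, handle_cast/2]).

  start_link() ->
      gen_server:start_link({local, ?MODULE}, ?MODULE, 0, []).

  %% Called by whichever node accepts a write; returns the next global seq.
  next_seq() ->
      gen_server:call(?MODULE, next_seq).

  init(Seq) -> {ok, Seq}.

  handle_call(next_seq, _From, Seq) -> {reply, Seq + 1, Seq + 1}.

  handle_cast(_Msg, Seq) -> {noreply, Seq}.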

The alternatives are to a) abandon the current format for update sequences in favor of vector clocks or something more opaque, or b) have _all_docs_by_seq be strictly a node-local query. I'd prefer the former, as I think it will be useful for e.g. external indexers to treat the partitioned database just like a single-server one. If we do the latter, I think it means that either the external indexers have to be installed on every node, or at least they have to be aware of all the partitions.


If at all possible I think we should have the entire partition group appear as a single server from the outside. One thing that comes to mind here is a question about sequence numbers. Vector clocks only guarantee a partial ordering, but I'm under the impression that currently seq numbers have a strict ordering.
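(To make the partial-ordering point concrete, here is a rough sketch of a vector clock kept as an orddict of per-node sequences. The names are illustrative, not CouchDB's actual data structures. When descends/2 is false in both directions, the two clocks are concurrent, which is exactly the situation a single integer sequence cannot express.)

  %% Hypothetical sketch: a Clock is an orddict of {NodeName, LocalSeq}.
  -module(vclock_sketch).
  -export([increment/2, descends/2]).

  %% Bump this node's local update sequence in the clock.
  increment(Node, Clock) ->
      orddict:update_counter(Node, 1, Clock).

  %% true if ClockA has seen every update that ClockB has seen.
  descends(ClockA, ClockB) ->
      lists:all(fun({Node, SeqB}) ->
                        case orddict:find(Node, ClockA) of
                            {ok, SeqA} -> SeqA >= SeqB;
                            error      -> false
                        end
                end, orddict:to_list(ClockB)).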

Database sequence numbers are used in replication and in determining whether views need refreshing. Anything else I'm missing? Currently it seems there is no tracking of which updates actually change a view index (comment on line 588 of couch_httpd_view.erl on trunk). Improving this would be a nice win. See my answer to number (3).

The easy way to manage seq numbers is to let one node be the write master for any cluster. (The root node of any partition group could actually be a cluster, but if writes always go through a master, the master can maintain the global sequence number for the partition group.)

The problem with this approach is that the main use-case for
partitioning is when your incoming writes exceed the capacity of a
single node. By partitioning the key-space, you can get more
write-throughput.

I think Randall was saying requests just have to originate at the master node. That master node could do nothing more than assign a sequence number, choose a node, and proxy the request down the tree for the heavy lifting. I bet we could get pretty good throughput, but I still worry about this approach for availability reasons.

Yes, I agree. I think vector clocks are a good compromise. I hadn't
considered that since index updates are idempotent, we can allow a
little slop in the global clock. This makes everything much easier.


I'm not sure that an update-seq per node is such a bad thing, as it
will require any external indexers to be deployed in a 1-to-1
relationship to nodes, which automatically balances the load for the
indexer as well. With a merged seq-id, users would be encouraged to
partition CouchDB without bothering to partition indexers. Maybe this
is acceptable in some cases, but not in the general case.

So, the vector clock approach still has a per-node update sequence for each node's local clock; it just does the best job possible of globally ordering those per-node sequences. We could easily offer local update sequences as well via some query string parameter. I understand the desire to encourage partitioned indexers, but I believe that won't always be possible. Bottom line, I think we should support global indexing of a partitioned DB.
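(A rough sketch of how a merged by-seq feed might look under that scheme, with the combined "sequence" being nothing more than a clock of per-node local sequences that a client hands back to resume. All module, function, and parameter names here are hypothetical, not existing CouchDB APIs.)

  %% Hypothetical sketch: each row in the merged feed keeps its node-local
  %% sequence, so no single global counter is needed.
  -module(merged_seq_sketch).
  -export([merge_feeds/1, since/2]).

  %% Feeds :: [{Node, [{LocalSeq, DocId}]}]
  merge_feeds(Feeds) ->
      [{Node, LocalSeq, DocId} || {Node, Rows} <- Feeds,
                                  {LocalSeq, DocId} <- Rows].

  %% Resume a merged feed from a clock of per-node sequences.
  %% Clock :: orddict of {Node, LastSeenSeq}.
  since(MergedRows, Clock) ->
      [Row || {Node, LocalSeq, _DocId} = Row <- MergedRows,
              LocalSeq > seen(Node, Clock)].

  seen(Node, Clock) ->
      case orddict:find(Node, Clock) of
          {ok, Seq} -> Seq;
          error     -> 0
      end.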


I think you're right. As long as partitioned indexers are possible, I
have nothing against making global indexers as well.


I'd like to hear more about how we implement redundancy and handle node failures in the tree structure. In a pure consistent hashing ring, whether globally connected (Dynamo) or not (Chord), there are clear procedures for dealing with node failures, usually involving storing copies of the data at adjacent nodes along the ring. Do we have an analogue of that in the tree? I'm especially worried about what happens when inner nodes go down.
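(For comparison, a toy sketch of the ring-style placement referenced above: hash a doc id onto a ring of nodes and keep N copies on the nodes that follow it, so a failed node's data already lives on its neighbors. Purely illustrative; not CouchDB, Dynamo, or Chord code.)

  %% Hypothetical sketch of successor-list replica placement on a hash ring.
  -module(ring_sketch).
  -export([preference_list/3]).

  %% Returns the N nodes responsible for DocId, walking clockwise from the
  %% first ring position at or after the doc's hash.
  preference_list(DocId, Nodes, N) ->
      Ring = lists:keysort(1, [{erlang:phash2(Node), Node} || Node <- Nodes]),
      Key  = erlang:phash2(DocId),
      {Before, After} = lists:splitwith(fun({Pos, _}) -> Pos < Key end, Ring),
      [Node || {_Pos, Node} <- lists:sublist(After ++ Before, N)].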


I like to think of partitioning and redundancy as orthogonal. If each node has a hot-failover "twin", then parent nodes can track, for all of their children, the children's twins as well, and swap a twin in when a child becomes unavailable.

I'm not so hot on the Chord / Dynamo style of storing parts of
partitions on other partitions. Even just saying that is confusing.

Because physical nodes need not map directly to logical nodes, we just
need to be sure that each node's twin lives on a different physical
node (which it can share with other logical nodes).

The end result is that we can have N duplicates of the entire tree,
and even load-balance across them. It'd be a small optimization to
allow you to read from both twins and write to just one.
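(A toy sketch of the twin idea as described above: a parent keeps, for each logical child, a primary and a hot-failover twin living on different physical nodes, and routes to the twin when the primary is unreachable. Names are made up for illustration.)

  %% Hypothetical sketch: route a request for a logical child to its primary
  %% physical node, falling back to the child's twin if the primary is down.
  -module(twin_sketch).
  -export([route/2]).

  %% Children :: [{LogicalChild, PrimaryNode, TwinNode}]
  route(LogicalChild, Children) ->
      {LogicalChild, Primary, Twin} = lists:keyfind(LogicalChild, 1, Children),
      case net_adm:ping(Primary) of
          pong -> Primary;
          pang -> Twin
      end.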

Chris

So I have not poked around CouchDB for a while, but recently began to monitor the MLs again, so forgive me if this has been hashed out already... Do we have to choose one way or another on the issues discussed in this thread? Or is it a better 'core-couch' design decision to make these things pluggable, a la emacs, eclipse, trac?

i.e. if you want Chord / peer-to-peer storage, use that plugin, or if you want vector clocks, use that plugin. Or for different indexers / strategies, use the appropriate plugin. Core couch need only provide well-reasoned stubs or extension points for these plugins. Or does this decouple existing functionality and design goals too much? I could definitely see a way in this design to get to a custom Erlang term storage engine, no JSON impedance mismatch, and a native Erlang view engine.

Cheers,
BA
