On 07/02/2009, at 4:47 AM, Jim McCoy wrote:

Returning, as always, to Brewer's CAP conjecture, you can have any two
of consistency, availability, or partition-tolerance.  Scalaris went
with Paxos-based synchronization and selected consistency and
partition-tolerance.  This means that if you lose a majority of your
Scalaris nodes it becomes impossible to log a write to the system. In
CouchDB, availability and partition-tolerance were prioritized
("eventual consistency" being the pleasant term we use to play down
the fact that no consistency assurances can actually be made beyond a
best-effort attempt to fix things after the fact).
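To make that majority-quorum point concrete, here's a rough sketch - hypothetical and deliberately simplified, not Scalaris's actual protocol - of why a CP system refuses writes once a majority of nodes is gone:

    # Sketch only (not Scalaris's actual protocol): a write commits only if a
    # strict majority of nodes acknowledge it, which is why losing a majority
    # of nodes makes writes impossible under a consistency + partition-
    # tolerance design.

    def quorum_write(reachable_nodes, total_nodes):
        """Return True iff enough nodes acknowledged the write to commit it."""
        acks = len(reachable_nodes)               # pretend every reachable node acks
        if acks > total_nodes // 2:
            return True                           # majority reached: commit
        return False                              # no majority: refuse the write

    # e.g. with 5 nodes and only 2 reachable, quorum_write(["n1", "n2"], 5)
    # returns False - the system stays consistent by becoming unavailable.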

I think we're talking at cross-purposes here. I'm talking about transactions over a Couch cluster, not over an arbitrary set of peers. The proposed mods to the API are meant to let the cluster present the illusion of a single server even though it's implemented over multiple nodes. I have yet to see the definition of multi-node that has prompted this change - it's referred to alternately as multi-node and partitioned, where partitioned means data-partitioning, not node connectivity - but the talk of data-partitioning, and the fact that cluster-side code is required, gives some clues.

What distinguishes such a cluster from an arbitrary group of peers? Well, for one, there is additional server code/functionality that provides a single entry point to the cluster and presents the semantics and appearance of a single server. And the rest of the paragraph you refer to is:

Why compromise the API for something that is so ephemeral that conflict management isn't feasible? What IS feasible? View consistency? MVCC semantics? If I write a document and then read it, might I see an earlier version? What about in the view? Because if views are going to have a consistency guarantee with respect to writes, then that looks to me like distributed MVCC. Is this not 2PC? And if views aren't consistent, then why bother? Why not have a client layer that does content-directed load balancing?
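To illustrate the read-your-writes question I'm asking there, here's a minimal client-side probe against CouchDB's HTTP document API. The host, database and document id are made up, and this is only a way of observing the behaviour, not a proposal for how the cluster should implement it:

    # Sketch only: write a document, then immediately read it back and compare
    # the MVCC revision (_rev).  If the read lands on a node that hasn't seen
    # the write, the stale _rev (or a 404) exposes the missing read-your-writes
    # guarantee.
    import requests

    BASE = "http://localhost:5984/mydb"           # hypothetical cluster entry point

    def write_then_read(doc_id, body):
        put = requests.put(f"{BASE}/{doc_id}", json=body)
        put.raise_for_status()
        written_rev = put.json()["rev"]           # revision assigned by the write

        got = requests.get(f"{BASE}/{doc_id}")
        if got.status_code == 404:
            return "write not visible yet"
        current_rev = got.json()["_rev"]
        return "consistent" if current_rev == written_rev else "stale read"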

My point is that when talking about a cluster you probably already have some consistency constraints above and beyond those of an arbitrary set of peers. So don't these extra constraints already provide an environment in which an ACID operation can be offered? And if there are no extra constraints, then I think such a cluster devolves to something that could be done with a load-balancing layer.
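For what I mean by content-directed load balancing: a thin, hypothetical client-side layer that routes each document to a node chosen from its id, so no cluster-side coordination code is needed at all. The node list and hashing choice below are illustrative only:

    # Sketch only: pick a node deterministically from the document id, so every
    # client agrees on which node owns which document.  A real deployment would
    # use consistent hashing so the mapping survives node changes.
    import hashlib

    NODES = [
        "http://couch1:5984/mydb",    # hypothetical node URLs
        "http://couch2:5984/mydb",
        "http://couch3:5984/mydb",
    ]

    def node_for(doc_id):
        digest = hashlib.sha1(doc_id.encode("utf-8")).digest()
        index = int.from_bytes(digest[:4], "big") % len(NODES)
        return NODES[index]

    # e.g. node_for("invoice-2009-0042") always yields the same node URL, so
    # reads and writes for that document meet at a single server.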

I believe the "unbounded number of servers" that Damien was mentioning
was referring to the idea that with multi-node operations the activity
over which you want an ACID commit could become a very large set if
the number of participating nodes grows significantly.  This would
mean that a large system would require an ever-growing number of
messages to be passed around, which is a scalability bottleneck.

It is simply not possible to offer what you want within CouchDB unless
either availability or partition-tolerance is sacrificed.

I'm presuming, on the basis of this discussion referring to data partitioning, that multi-node Couch clusters are not partition-tolerant, as opposed to e.g. p2p replication meshes. It's the difference between these two scenarios, and the way that difference can be exploited within Couch applications, that prompts my comments about the nature of the UI presented for local (clustered) vs. distributed operations, e.g.:

Regardless, this discussion is also about whether supporting a single-node style of operation is useful, because CouchDB had an ACID _bulk_docs. IMO it's also about the danger/cost of abstracting over operational models with considerably different operational characteristics - cf. transparent distribution in object models.

and:

The use-case of someone doing a local operation, e.g. submitting a web form, is very different from resolving replication conflicts. Conflict during a local operation is a matter of application concurrency, whereas conflict during replication is driven by the overall system model. It has different temporal, administrative and UI boundaries.

In short, I think it is a mistake to try to hide the different characteristics of local (even clustered) operations and replication. You may disagree, but if the system distinguishes between these two fundamentally different things (distinguished by their partition-tolerance), you can still code as though every operation leads to conflict if you wish; if the system doesn't distinguish them, I can't take advantage of the difference. Distinguishing between the two cases allows for a wider range of uses and application models.
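To show what the distinction looks like from the application's side, here is a minimal, hypothetical sketch against CouchDB's HTTP API (the database URL and document id are made up). The point is only that a local write conflict surfaces immediately, at the moment of the operation, while a replication conflict is discovered later, on a different timescale, by asking for _conflicts:

    # Sketch only: two very different conflict experiences for the application.
    import requests

    DB = "http://localhost:5984/mydb"             # hypothetical database URL

    # 1. Local operation (e.g. a web form submit): a conflicting update is
    #    rejected synchronously with 409, so the user can be asked to retry.
    def local_update(doc_id, rev, changes):
        resp = requests.put(f"{DB}/{doc_id}", json={"_rev": rev, **changes})
        if resp.status_code == 409:
            return "conflict now: re-read the document and let the user retry"
        resp.raise_for_status()
        return "updated"

    # 2. Replication: conflicting revisions arrive silently and sit in the
    #    document until something asks about them.
    def replication_conflicts(doc_id):
        doc = requests.get(f"{DB}/{doc_id}", params={"conflicts": "true"}).json()
        return doc.get("_conflicts", [])          # losing revisions, if any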

Antony Blakey
--------------------------
CTO, Linkuistics Pty Ltd
Ph: 0438 840 787

Isn't it enough to see that a garden is beautiful without having to believe that there are fairies at the bottom of it too?
  -- Douglas Adams
