On 07/02/2009, at 4:47 AM, Jim McCoy wrote:
Returning, as always, to Brewer's CAP conjecture, you can have any two
of consistency, availability, or partition-tolerance. Scalaris went
with Paxos sync and selected consistency and partition-tolerance.
This means that if you lose a majority of your Scalaris nodes it
becomes impossible to log a write to the system. In CouchDB
availability and partition-tolerance were prioritized ("eventual
consistency" being that pleasant term we use to play down the fact
that no consistency assurances can actually be made beyond a
best-effort attempt to fix things after the fact).
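The practical consequence of that trade-off is easy to state: a Paxos-style write only commits while a strict majority of replicas is reachable. A minimal sketch of the constraint (the numbers are illustrative, not Scalaris's actual membership logic):

    # A write is accepted only if a strict majority of replicas can acknowledge it.
    def can_commit_write(cluster_size: int, reachable_nodes: int) -> bool:
        majority = cluster_size // 2 + 1
        return reachable_nodes >= majority

    # With 5 nodes: 3 reachable is enough, 2 reachable is not.
    assert can_commit_write(5, 3)
    assert not can_commit_write(5, 2)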
I think we're talking at cross-purposes here. I'm talking about
transactions over a couch cluster, not over an arbitrary set of peers.
The proposed mods to the API are to allow the cluster to present the
illusion of a single server even if it's implemented over multiple
nodes. I have yet to see the definition of multi-node that has
prompted this change; it's referred to alternately as multi-node and
partitioned, where partitioned means data-partitioning, not node
connectivity. But the talk of data-partitioning, together with the
fact that cluster-side code is required, gives some clues.
What distinguishes such a cluster from an arbitrary group of peers?
Well, for one, there is some additional server code/functionality that
allows a single entry point to the cluster, and provides the semantics
and appearance of a single server. And the rest of the paragraph you
refer to is:
Why compromise the API for something that is so ephemeral that
conflict management isn't feasible? What IS feasible? View
consistency? MVCC semantics? If I write a document and then read
it, do I maybe see an earlier document? What about in the view?
Because if views are going to have a consistency guarantee wrt.
writes, then that looks to me like distributed MVCC. Is this not
2PC? And if views aren't consistent, then why bother? Why not have
a client layer that does content-directed load balancing?
My point is that when talking about a cluster you probably already
have some consistency constraints above and beyond those of an
arbitrary set of peers. So don't these extra constraints already
provide an environment in which an ACID operation can be offered?
And if there are no extra constraints, then I think such a cluster
devolves to something that could be done using a load balancing layer.
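To make the read-your-writes question concrete, here is a rough sketch of the check an application would want to rely on (assuming the Python requests library and hypothetical node URLs; the staleness test leans on CouchDB's "generation-hash" _rev format):

    # Write to one node of the cluster, then read back - possibly served by
    # another node behind the same single-server facade.
    import requests

    WRITE_NODE = "http://node-a:5984/db"   # hypothetical endpoints
    READ_NODE = "http://node-b:5984/db"

    def write_then_read(doc_id, doc):
        resp = requests.put(f"{WRITE_NODE}/{doc_id}", json=doc)
        resp.raise_for_status()
        written_rev = resp.json()["rev"]

        read_doc = requests.get(f"{READ_NODE}/{doc_id}").json()
        read_rev = read_doc.get("_rev", "0-none")

        # A lower revision generation means the read did not see the write.
        stale = int(read_rev.split("-")[0]) < int(written_rev.split("-")[0])
        return read_doc, stale

If the cluster offers MVCC-style consistency, stale should never be True; if it doesn't, the API can't hide that from the application.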
I believe the "unbounded number of servers" that Damien was mentioning
was referring to the idea that with multi-node operations the activity
over which you want an ACID commit could become a very large set if
the number of participating nodes grows significantly. This would
mean that a large system would require an ever-growing number of
messages to be passed around, which is a scalability bottleneck.
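As a back-of-envelope illustration of that bottleneck (assuming a textbook coordinator-driven two-phase commit, roughly four messages per participant per transaction, ignoring retries and recovery):

    # Messages per transaction for 2PC: prepare, vote, decision, ack per participant.
    def two_phase_commit_messages(participants: int) -> int:
        return 4 * participants

    for n in (3, 10, 100, 1000):
        print(f"{n} participants -> ~{two_phase_commit_messages(n)} messages per commit")

The per-commit cost grows linearly with the participant set, and the coordinator has to wait on the slowest participant in each round.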
It is simply not possible to offer what you want within CouchDB unless
either availability or partition-tolerance is sacrificed.
I'm making the presumption, on the basis of this discussion referring
to data partitioning, that multi-node Couch clusters are not
partition-tolerant, as opposed to e.g. p2p replication meshes. It's
the difference between these two scenarios, and the way that
difference can be taken advantage of within Couch applications, that
prompts my comments about the nature of the UI that is presented for
local (clustered) vs. distributed operations, e.g.:
Regardless, this discussion is also about whether supporting a
single-node style of operation is useful, because CouchDB had an
ACID _bulk_docs. IMO it's also about the danger/cost of
abstracting over operational models with considerably different
operational characteristics - cf. transparent distribution in
object models.
and:
The use-case of someone doing a local operation e.g. submitting a
web form, is very different from resolving replication conflicts.
Conflict during a local operation is a matter of application
concurrency, whereas conflict during replication is driven by the
overall system model. It has different temporal, administrative and
UI boundaries.
In short, I think it is a mistake to try to hide the different
characteristics of local (even clustered) operations and
replication. You may disagree, but if the system distinguishes
between these two fundamentally different things (distinguished by
their partition-tolerance), you can still code as though every
operation leads to conflict if you wish; whereas if the system
doesn't distinguish between them, I can't take advantage of the
difference. Distinguishing between the two cases allows for a wider
range of uses and application models.
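As a sketch of what that distinction buys an application, here are the two conflict cases side by side (assuming the Python requests library, a hypothetical local database URL, and CouchDB's documented 409 / _conflicts behaviour):

    import requests

    DB = "http://localhost:5984/db"   # hypothetical

    # Case 1: local (clustered) write conflict - application concurrency.
    # A PUT with a stale _rev fails immediately with 409, so the app can
    # re-read, merge and retry within the same request/response cycle.
    resp = requests.put(f"{DB}/order-42", json={"_rev": "1-abc", "status": "paid"})
    if resp.status_code == 409:
        current = requests.get(f"{DB}/order-42").json()
        # merge with `current` and retry using current["_rev"]

    # Case 2: replication conflict - system model.
    # Both writes succeeded on different nodes; the conflict only appears
    # later as sibling revisions, with resolution on its own timescale.
    doc = requests.get(f"{DB}/order-42", params={"conflicts": "true"}).json()
    for losing_rev in doc.get("_conflicts", []):
        pass  # fetch each sibling, merge, then delete the losers

The first case belongs in request-handling code; the second belongs to whatever administrative or UI process owns conflict resolution.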
Antony Blakey
--------------------------
CTO, Linkuistics Pty Ltd
Ph: 0438 840 787
Isn't it enough to see that a garden is beautiful without having to
believe that there are fairies at the bottom of it too?
-- Douglas Adams