Antony,
Firstly, a fascinating email (and topic) and one close to my problem
space also.
My read of the consistency guarantee paper was that we could add any
combination of the four properties even to a system weaker (in the
technical sense) than CouchDB.
It seems to suffice for the client to maintain a version vector and
for operations to verify dominance, as shown in the pseudocode.
My thought was to put the client version vector in an HTTP cookie and
add some logic at each server to do the check.
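Roughly like this (a sketch only; the cookie format and the
parse_vector/dominates names are invented for illustration, not
CouchDB API):

def parse_vector(cookie_value):
    # "nodeA:42,nodeB:17" -> {"nodeA": 42, "nodeB": 17}
    return {node: int(seq) for node, seq in
            (entry.split(":") for entry in cookie_value.split(",") if entry)}

def dominates(server_vector, client_vector):
    # A server can honour the session iff it has already applied every
    # write the client has observed, for every node in the client's vector.
    return all(server_vector.get(node, 0) >= seq
               for node, seq in client_vector.items())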
For any one shard of a multi-node CouchDB deployment, version vectors
and sticky load balancing should permit clients to achieve these
session guarantees. At least, we can know when they are violated and
tell the client.
Practically, a client is mostly sticky to one replica per shard, but
can seamlessly fail over to another iff the alternate replica is up to
date. That determination seems to me (and please correct me here) to
need only a comparison of the client vector with the server's, where
each server's version is just its update sequence. No deeper
per-document vector is needed, though one may allow faster failover,
or a higher probability of finding a server that preserves the session
guarantees, than otherwise.
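Building on the sketch above, failover then reduces to scanning the
alternate replicas for one whose state dominates the session's
(pick_replica and the replica_vectors shape are, again, hypothetical):

def pick_replica(replica_vectors, client_vector):
    # replica_vectors: {"nodeB": {...}, "nodeC": {...}} per-replica state
    for name, vector in replica_vectors.items():
        if dominates(vector, client_vector):
            return name
    return None  # no replica preserves the guarantees; tell the client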
For my problem, as for yours, Bayou-style sessions over standalone
CouchDB installations look very compelling.
As an aside, I was contemplating driving replication via consistent
hashing. A single, agreed node would hold the ring, since I deploy to
a data center. Any scheme (STONITH, say) suffices to make that node
fault-tolerant.
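To make the aside concrete, the ring I mean is plain consistent
hashing, something along these lines (illustrative only, not CouchDB
code):

import hashlib
from bisect import bisect, insort

class Ring:
    def __init__(self, nodes, vnodes=64):
        self._keys, self._map = [], {}
        for node in nodes:
            for i in range(vnodes):
                h = int(hashlib.md5(f"{node}-{i}".encode()).hexdigest(), 16)
                insort(self._keys, h)
                self._map[h] = node

    def node_for(self, doc_id):
        # walk clockwise from the document's hash to the first virtual node
        h = int(hashlib.md5(doc_id.encode()).hexdigest(), 16)
        idx = bisect(self._keys, h) % len(self._keys)
        return self._map[self._keys[idx]]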
In my mind, that adds up to a Beautiful Thing. Ymmv, tip your waiter,
etc.
B.
Sent from my orbiting laser.
On 16 Feb 2009, at 02:30, Antony Blakey <[email protected]> wrote:
I've recently been considering replication models, and looking at
relevant prior art. I'd like to start a discussion about the best
replication model for CouchDB, in the hope of garnering both support
and help in implementing a replication model that provides a
stronger form of weak consistency under replication than CouchDB
currently provides. This can be done without sacrificing any of the
pre-determined goals of CouchDB.
There are two research streams that I've been following. The first
is Bayou, for which this: http://www2.parc.com/csl/projects/bayou/
is a good entry point. Bayou is somewhat more powerful than CouchDB
because it provides consistency guarantees while reading from groups
of replicas.
The second is PRACTI, which starts here: http://www.usenix.org/event/nsdi06/tech/full_papers/belaramani/belaramani_html
. The interesting thing about PRACTI from my point of view is how it
extends weak-consistency to partial replication.
There's also an interesting set of papers here: http://highscalability.com/links/weblink/83
, although most of them aren't directly applicable.
Firstly though, it's worth considering the characteristics of
CouchDB's current replication system.
The background to this issue is the CAP dilemma, described and
analysed here: http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.20.1495
The PRACTI paper summarizes this as "the CAP dilemma states that a
replication system that provides sequential Consistency cannot
simultaneously provide 100% Availability in an environment that can
be Partitioned".
CouchDB is a virtually-always-partitioned system that provides 100%
availability (at any given node). Nodes themselves are not partition
tolerant, and hence can provide arbitrary consistency guarantees,
including sequential Consistency as represented by serializable
transactions. It is intended however that CouchDB provide a cluster
architecture. Although the only extant suggestion for this presumes
partition tolerant clustering (http://horicky.blogspot.com/2008/10/couchdb-cluster.html
), this is but one model of a cluster architecture. I would argue
that this is little more than a load-balancing proxy, and that there
are alternative cluster architectures that provide significant
benefits, although this may be begging the question.
For the purposes of initial discussion, the cluster issue isn't
relevant, although it is an issue when considering isolated write
sequences, which are roughly analogous to Bayou's sessions, and are a
very useful replacement for traditional ACID transactions.
The key issue is that there are forms of consistency that, while
less than 'sequential consistency' i.e. distributed transactions,
are still useful. Specifically, Bayou provides the following:
1. Read Your Writes - read operations reflect previous writes.
2. Monotonic Reads - successive reads reflect a non-decreasing set
of writes.
3. Writes Follow Reads - writes are propagated after reads on which
they depend.
4. Monotonic Writes - writes are propagated after writes that
logically precede them.
Monotonic Writes, sometimes called write-ordering, is the specific
form of weak-consistency that interests me in the context of CouchDB.
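A rough sketch of the client-side bookkeeping these four guarantees
imply, following the Bayou session guarantees paper (the vector
representation and names below are mine, purely for illustration):

def covered_by(required, server_writes):
    # every write the session depends on must already be applied here
    return all(server_writes.get(node, 0) >= seq
               for node, seq in required.items())

class Session:
    def __init__(self):
        self.read_vector = {}   # writes our reads depended on
        self.write_vector = {}  # writes we have issued

    def read_ok(self, server_writes):
        # guarantees 1 and 2: Read Your Writes, Monotonic Reads
        return (covered_by(self.write_vector, server_writes) and
                covered_by(self.read_vector, server_writes))

    def write_ok(self, server_writes):
        # guarantees 3 and 4: Writes Follow Reads, Monotonic Writes
        return (covered_by(self.write_vector, server_writes) and
                covered_by(self.read_vector, server_writes))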
Consider two documents, A and B, with write versions indicated by
numeric suffixes e.g. A0, A1, A2 etc. A local application makes a
series of writes:
[ A0, B0, A1 ]
Couch currently replicates this as
[ A0-, B0, A1 ]
where A0- indicates that the document is replicated without its
data. The replicator chooses not to provide the data for A0-, only
noting that the revision exists. If the database is compacted
however, then the replicator no longer has any choice - the data for
A0 no longer exists.
It might seem that this doesn't matter, but because replication
isn't atomic, the replication target can, at any time and for any
length of time (possibly permanently) see an arbitrary prefix of the
replication stream, such as this:
[ B0 ]
As far as I can tell, it won't see A0- until it sees A1, although
this doesn't affect this discussion. The point is that the target
doesn't see writes in the order that they occur in the source, and
state-consistency is only reached when the replication reaches the
source write-point, which, ignoring the read/write ratio, is by no
means guaranteed in an unreliable environment.
To make this more concrete - imagine that A is a blog post and B is
a comment. It's possible for running code to see a comment without a
blog post. This isn't the end of the world in this example, but it
does complicate applications that use this data, and unnecessarily so,
as Bayou and PRACTI show. In the face of either temporary (comms) or
permanent node failure, the target may see a view of the source that
cannot be made write-order consistent. Write-order consistency is a
very useful and greatly simplifying feature for applications.
This situation is exacerbated by per-document revision stemming.
<TENTATIVE>
On the surface, the simplest solution to this is to retain and
replicate each revision of a document, in MVCC commit order. The
result of this is that every intermediate state that the target sees
during replication is consistent with the write ordering in the
source. Incremental replication thus maintains write-order
consistency, even in the face of failure.
An obvious optimisation is that this: [ ... A0, A1, A2 ... ] can be
replicated as this [ ... A2 ... ] because the intermediate writes
aren't relevant, although see my caveat.
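Going back to the simplest form for a moment, a toy replicator loop
over an in-memory changes log shows the idea; none of the data shapes
below are CouchDB's actual API:

def replicate_in_commit_order(changes, target, since=0):
    # 'changes' is the source's write log in commit (update-seq) order,
    # e.g. [(1, "A", "0", {...}), (2, "B", "0", {...}), (3, "A", "1", {...})]
    for seq, doc_id, rev, data in changes:
        if seq <= since:
            continue
        target[doc_id] = (rev, data)   # applied one at a time, in source order
        since = seq
    return since  # checkpoint: any prefix applied so far is write-order consistent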
If you allow for *local* multi-document ACID commits then you can
significantly optimise replication, with the added advantage of
being able to provide a weak-consistency equivalent to transactions.
The idea is that you can group writes into an isolation group e.g.
[ ... A1, B3, C2 ... ]
Concurrent access on the local node cannot see any intermediate
state e.g. the three writes are ACID. Note that the 'C' in ACID
doesn't mean that the write will fail if there are conflicts - you
can choose for that to be the case on a per-group-write basis on the
local node, but when it's replicated you don't have that choice - it
will commit regardless. The key property here is really Isolation,
rather than Consistency.
It's not difficult to replicate such isolation groups - you simply
wrap the writes in a start/end group in the replication stream, and
replication uses local ACID with commit-on-conflict semantics. If
the replication stream doesn't see the end group marker because of
comms failure, then the group isn't written.
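In toy form (the markers and write shapes are invented, and the dict
stands in for a real local ACID commit), the target side might look
like this:

def apply_stream(stream, target):
    # 'stream' is a flat list of writes plus ("begin",) / ("end",) markers;
    # a write is (doc_id, rev, data)
    group = None
    for item in stream:
        if item == ("begin",):
            group = []
        elif item == ("end",):
            for doc_id, rev, data in group:   # in a real node this block would
                target[doc_id] = (rev, data)  # be one local ACID commit, with
            group = None                      # commit-on-conflict semantics
        elif group is not None:
            group.append(item)                # held back until the group closes
        else:
            doc_id, rev, data = item
            target[doc_id] = (rev, data)      # ungrouped write, applied singly
    # a half-received group (no "end" marker due to comms failure) is never written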
This allows the replication process itself to be adaptively
optimised even if such isolation groups aren't exposed to the user.
Consider a replication stream:
[ ... A1, B3, C2, A2, B4, A3 ... ]
This can be replicated as:
[ ... { C2, B3-, B4, A1-, A2-, A3 } ... ]
or
[ ... { C2, B4, A3 } ... ]
where { } delimit isolation groups. Once again though, see the caveat.
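The collapsing itself is trivial: within one group only the final
write to each document needs its data. A toy version, using the same
Ax/Bx notation as above:

def collapse_group(writes):
    # writes arrive in source commit order, e.g. ["A1","B3","C2","A2","B4","A3"]
    # where the leading letter is the document id
    latest = {}
    for write in writes:
        latest[write[0]] = write     # later writes to the same doc win
    return set(latest.values())      # -> {"C2", "B4", "A3"}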
Finally, the existing compact and proposed revision stemming are
incompatible with write-ordering consistency. Bayou uses a simple
point-in-time truncation of the history e.g. linear in the db, and
when it gets a replication request that requires more history than
it has, it synchronizes the entire database. This is an issue for
availability because the target needs to be locked while the missing
history prefix is synchronised to ensure that the target doesn't see
an inconsistent write-ordering.
</TENTATIVE>
<CAVEAT>
The reason the above is tentative, is that it only considers two
peers. Multiple peers can have write dependencies caused by multiple
replications between arbitrary peers. I haven't thought through that
yet. This paper has some information on this issue in a slightly
more challenging context: http://www2.parc.com/csl/projects/bayou/pubs/sg-pdis-94/SessionGuaranteesPDIS.ps
</CAVEAT>
And that's as far as my thinking has progressed. Write-order
consistency in the face of partial replication introduces some new
requirements.
Antony Blakey
-------------
CTO, Linkuistics Pty Ltd
Ph: 0438 840 787
The fact that an opinion has been widely held is no evidence
whatever that it is not utterly absurd.
-- Bertrand Russell