Antony,
Firstly, a fascinating email (and topic) and one close to my problem
space also.
My read of the consistency guarantee paper was that we could add any
combination of the four properties even to a system weaker (in the
technical sense) than CouchDB.
It seems to suffice for the client to maintain a version vector and
for operations to verify dominance, as shown in the pseudocode.
My thought was to put the client version vector in an HTTP cookie and
add some logic at each server to do the check.
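Roughly like this (a sketch only; the cookie format and the
parse_vector/dominates names are invented for illustration, not
CouchDB API):

def parse_vector(cookie_value):
    # "nodeA:42,nodeB:17" -> {"nodeA": 42, "nodeB": 17}
    return {node: int(seq) for node, seq in
            (entry.split(":") for entry in cookie_value.split(",") if entry)}

def dominates(server_vector, client_vector):
    # A server can honour the session iff it has already applied every
    # write the client has observed, for every node in the client's vector.
    return all(server_vector.get(node, 0) >= seq
               for node, seq in client_vector.items())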
For any one shard of a multi-node CouchDB deployment, version vectors
and sticky load balancing should permit clients to achieve these
session guarantees. At least, we can know when they are violated and
tell the client.
Practically, a client is mostly sticky to one replica per shard, but
can seamlessly fail over to another iff the alternate replica is up to
date. That determination seems to me (and please correct me here) to
need only a comparison of the client vector with the server's, where
each server's version is just its update sequence. No deeper
per-document vector is needed, though one may allow faster failover,
or a higher probability of finding a server that preserves the session
guarantees, than otherwise.
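Building on the sketch above, failover then reduces to scanning the
alternate replicas for one whose state dominates the session's
(pick_replica and the replica_vectors shape are, again, hypothetical):

def pick_replica(replica_vectors, client_vector):
    # replica_vectors: {"nodeB": {...}, "nodeC": {...}} per-replica state
    for name, vector in replica_vectors.items():
        if dominates(vector, client_vector):
            return name
    return None  # no replica preserves the guarantees; tell the client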
For my problem, as for yours, Bayou-style sessions over standalone
CouchDB installations look very compelling.
As an aside, I was contemplating driving replication via consistent
hashing. A single, agreed node would hold the ring, since I deploy to
a data center. Any scheme (STONITH, say) suffices to make that node
fault-tolerant.
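To make the aside concrete, the ring I mean is plain consistent
hashing, something along these lines (illustrative only, not CouchDB
code):

import hashlib
from bisect import bisect, insort

class Ring:
    def __init__(self, nodes, vnodes=64):
        self._keys, self._map = [], {}
        for node in nodes:
            for i in range(vnodes):
                h = int(hashlib.md5(f"{node}-{i}".encode()).hexdigest(), 16)
                insort(self._keys, h)
                self._map[h] = node

    def node_for(self, doc_id):
        # walk clockwise from the document's hash to the first virtual node
        h = int(hashlib.md5(doc_id.encode()).hexdigest(), 16)
        idx = bisect(self._keys, h) % len(self._keys)
        return self._map[self._keys[idx]]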
In my mind, that adds up to a Beautiful Thing. Ymmv, tip your waiter,
etc.
B.
Sent from my orbiting laser.
On 16 Feb 2009, at 02:30, Antony Blakey <[email protected]> wrote:
I've recently been considering replication models, and looking at
relevant prior art. I'd like to start a discussion about the best
replication model for CouchDB, in the hope of garnering both support
and help in implementing a replication model that provides a
stronger form of weak consistency under replication than CouchDB
currently provides. This can be done without sacrificing any of the
pre-determined goals of CouchDB.
There are two research streams that I've been following. The first
is Bayou, for which this: http://www2.parc.com/csl/projects/bayou/
is a good entry point. Bayou is somewhat more powerful than CouchDB
because it provides consistency guarantees while reading from groups
of replicas.
The second is PRACTI, which starts here: http://www.usenix.org/event/nsdi06/tech/full_papers/belaramani/belaramani_html
. The interesting thing about PRACTI from my point of view is how it
extends weak-consistency to partial replication.
There's also an interesting set of papers here: http://highscalability.com/links/weblink/83
, although most of them aren't directly applicable.
Firstly though, it's worth considering the characteristics of
CouchDB's current replication system.
The background to this issue is the CAP dilemma, described and
analysed here: http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.20.1495
The PRACTI paper summarizes this as "the CAP dilemma states that a
replication system that provides sequential Consistency cannot
simultaneously provide 100% Availability in an environment that can
be Partitioned".
CouchDB is a virtually-always-partitioned system that provides 100%
availability (at any given node). Nodes themselves are not partition
tolerant, and hence can provide arbitrary consistency guarantees,
including sequential Consistency as represented by serializable
transactions. It is intended however that CouchDB provide a cluster
architecture. Although the only extant suggestion for this presumes
partition tolerant clustering (http://horicky.blogspot.com/2008/10/couchdb-cluster.html
), this is but one model of a cluster architecture. I would argue
that this is little more than a load-balancing proxy, and that there
are alternative cluster architectures that provide significant
benefits, although this may be begging the question.
For the purposes of initial discussion, the cluster issue isn't
relevant, although it is an issue when considering isolated write
sequences, which are roughly analogous to Bayou's sessions, and are a
very useful replacement for traditional ACID transactions.
The key issue is that there are forms of consistency that, while
less than 'sequential consistency' i.e. distributed transactions,
are still useful. Specifically, Bayou provides the following:
1. Read Your Writes - read operations reflect previous writes.
2. Monotonic Reads - successive reads reflect a non-decreasing set
of writes.
3. Writes Follow Reads - writes are propagated after reads on which
they depend.
4. Monotonic Writes - writes are propagated after writes that
logically precede them.
Monotonic Writes, sometimes called write-ordering, is the specific
form of weak-consistency that interests me in the context of CouchDB.
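A rough sketch of the client-side bookkeeping these four guarantees
imply, following the Bayou session guarantees paper (the vector
representation and names below are mine, purely for illustration):

def covered_by(required, server_writes):
    # every write the session depends on must already be applied here
    return all(server_writes.get(node, 0) >= seq
               for node, seq in required.items())

class Session:
    def __init__(self):
        self.read_vector = {}   # writes our reads depended on
        self.write_vector = {}  # writes we have issued

    def read_ok(self, server_writes):
        # guarantees 1 and 2: Read Your Writes, Monotonic Reads
        return (covered_by(self.write_vector, server_writes) and
                covered_by(self.read_vector, server_writes))

    def write_ok(self, server_writes):
        # guarantees 3 and 4: Writes Follow Reads, Monotonic Writes
        return (covered_by(self.write_vector, server_writes) and
                covered_by(self.read_vector, server_writes))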
Consider two documents, A and B, with write versions indicated by
numeric suffixes e.g. A0, A1, A2 etc. A local application makes a
series of writes:
[ A0, B0, A1 ]
Couch currently replicates this as
[ A0-, B0, A1 ]
where A0- indicates that the document is replicated without its
data. The replicator chooses not to provide the data for A0-, only
noting that the revision exists. If the database is compacted
however, then the replicator no longer has any choice - the data for
A0 no longer exists.
It might seem that this doesn't matter, but because replication
isn't atomic, the replication target can, at any time and for any
length of time (possibly permanently) see an arbitrary prefix of the
replication stream, such as this:
[ B0 ]
As far as I can tell, it won't see A0- until it sees A1, although
this doesn't affect this discussion. The point is that the target
doesn't see writes in the order that they occur in the source, and
state-consistency is only reached when the replication reaches the
source write-point, which, ignoring the read/write ratio, is by no
means guaranteed in an unreliable environment.
To make this more concrete - imagine that A is a blog post and B is
a comment. It's possible for running code to see a comment without a
blog post. This isn't the end of the world in this example, but it
does complicate applications that use this data, and unnecessarily so,
as Bayou and PRACTI show. In the face of either temporary (comms) or
permanent node failure, the target may see a view of the source that
cannot be made write-order consistent. Write-order consistency is a
very useful and greatly simplifying feature for applications.
This situation is exacerbated by per-document revision stemming.
<TENTATIVE>
On the surface, the simplest solution to this is to retain and
replicate each revision of a document, in MVCC commit order. The
result of this is that every intermediate state that the target sees
during replication is consistent with the write ordering in the
source. Incremental replication thus maintains write-order
consistency, even in the face of failure.
An obvious optimisation is that this: [ ... A0, A1, A2 ... ] can be
replicated as this [ ... A2 ... ] because the intermediate writes
aren't relevant, although see my caveat.
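Going back to the simplest form for a moment, a toy replicator loop
over an in-memory changes log shows the idea; none of the data shapes
below are CouchDB's actual API:

def replicate_in_commit_order(changes, target, since=0):
    # 'changes' is the source's write log in commit (update-seq) order,
    # e.g. [(1, "A", "0", {...}), (2, "B", "0", {...}), (3, "A", "1", {...})]
    for seq, doc_id, rev, data in changes:
        if seq <= since:
            continue
        target[doc_id] = (rev, data)   # applied one at a time, in source order
        since = seq
    return since  # checkpoint: any prefix applied so far is write-order consistent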
If you allow for *local* multi-document ACID commits then you can
significantly optimise replication, with the added advantage of
being able to provide a weak-consistency equivalent to transactions.
The idea is that you can group writes into an isolation group e.g.
[ ... A1, B3, C2 ... ]
Concurrent access on the local node cannot see any intermediate
state e.g. the three writes are ACID. Note that the 'C' in ACID
doesn't mean that the write will fail if there are conflicts - you
can choose for that to be the case on a per-group-write basis on the
local node, but when it's replicated you don't have that choice - it
will commit regardless. The key property here is really Isolation,
rather than Consistency.
It's not difficult to replicate such isolation groups - you simply
wrap the writes in a start/end group in the replication stream, and
replication uses local ACID with commit-on-conflict semantics. If
the replication stream doesn't see the end group marker because of
comms failure, then the group isn't written.
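In toy form (the markers and write shapes are invented, and the dict
stands in for a real local ACID commit), the target side might look
like this:

def apply_stream(stream, target):
    # 'stream' is a flat list of writes plus ("begin",) / ("end",) markers;
    # a write is (doc_id, rev, data)
    group = None
    for item in stream:
        if item == ("begin",):
            group = []
        elif item == ("end",):
            for doc_id, rev, data in group:   # in a real node this block would
                target[doc_id] = (rev, data)  # be one local ACID commit, with
            group = None                      # commit-on-conflict semantics
        elif group is not None:
            group.append(item)                # held back until the group closes
        else:
            doc_id, rev, data = item
            target[doc_id] = (rev, data)      # ungrouped write, applied singly
    # a half-received group (no "end" marker due to comms failure) is never written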
This allows the replication process itself to be adaptively
optimised even if such isolation groups aren't exposed to the user.
Consider a replication stream:
[ ... A1, B3, C2, A2, B4, A3 ... ]
This can be replicated as:
[ ... { C2, B3-, B4, A1-, A2-, A3 } ... ]
or
[ ... { C2, B4, A3 } ... ]
where { } delimit isolation groups. Once again though, see the caveat.
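The collapsing itself is trivial: within one group only the final
write to each document needs its data. A toy version, using the same
Ax/Bx notation as above:

def collapse_group(writes):
    # writes arrive in source commit order, e.g. ["A1","B3","C2","A2","B4","A3"]
    # where the leading letter is the document id
    latest = {}
    for write in writes:
        latest[write[0]] = write     # later writes to the same doc win
    return set(latest.values())      # -> {"C2", "B4", "A3"}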
Finally, the existing compact and proposed revision stemming are
incompatible with write-ordering consistency. Bayou uses a simple
point-in-time truncation of the history e.g. linear in the db, and
when it gets a replication request that requires more history than
it has, it synchronizes the entire database. This is an issue for
availability because the target needs to be locked while the missing
history prefix is synchronised to ensure that the target doesn't see
an inconsistent write-ordering.
</TENTATIVE>
<CAVEAT>
The reason the above is tentative, is that it only considers two
peers. Multiple peers can have write dependencies caused by multiple
replications between arbitrary peers. I haven't thought through that
yet. This paper has some information on this issue in a slightly
more challenging context: http://www2.parc.com/csl/projects/bayou/pubs/sg-pdis-94/SessionGuaranteesPDIS.ps
</CAVEAT>
And that's as far as my thinking has progressed. Write-order
consistency in the face of partial replication introduces some new
requirements.
Antony Blakey
-------------
CTO, Linkuistics Pty Ltd
Ph: 0438 840 787
The fact that an opinion has been widely held is no evidence
whatever that it is not utterly absurd.
-- Bertrand Russell