proposed replication rev history changes

Damien Katz Sat, 07 Feb 2009 13:35:30 -0800

Part of the larger replication security work (branches/rep_security) is to allow rev histories to be trimmed back to an arbitrary length. Without this, revision histories must grow and grow, since each update to a doc adds a new revision to the history. So if a document is edited 1 million times, there is a 1 million rev history that must be tracked.

But with it, unlimited edits to a document are possible with only a fixed size history. The catch is that it's possible to get spurious conflicts if the trimmed revision history for a later edit is replicated to a database without overlapping revs.

The new revs look like this: "4-3693042815". The format is pretty much arbitrary, it just needs to be a parseable representation of an integer and a second string value. The first number is the sequential revseq (shown is the 4th revision), the second is a randomly generated id (which eventually should be deterministically generated based on doc content, making idempotent updates possible and completely transparent to clients).

However, when representing a rev in Erlang it is a tuple like this: {4,<<"3693042815">>}, so we need to convert back and forth to the string format for json. Representing it as a string in json instead of a complex structure has the least impact on couchdb clients.
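
To illustrate the string/tuple conversion, here is a minimal sketch in Python rather than Erlang. The function names parse_rev and format_rev are hypothetical and are not part of CouchDB; the sketch only assumes that the revseq comes before the first "-" and that everything after it is the id.

```python
def parse_rev(rev):
    """Split a rev string such as "4-3693042815" into (revseq, id)."""
    # Split on the first "-" only, so the id part may itself contain hyphens.
    seq, sep, rev_id = rev.partition("-")
    if not sep or not rev_id or not (seq.isascii() and seq.isdigit()):
        raise ValueError(f"malformed rev: {rev!r}")
    return int(seq), rev_id


def format_rev(seq, rev_id):
    """Build the string form sent to clients in json."""
    return f"{seq}-{rev_id}"
```

Clients that treat the rev as an opaque string never need either function; only code that compares revseqs would parse it.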

This will also simplify partial replication support in the future, as we can track the rev seq of a field or attachment when it changed, and during replication only send those parts that have changed since a previous revision that is available in the target db. The main benefit being saving network IO by not sending fields and attachments that haven't changed.
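
The partial replication idea could look roughly like the following Python sketch. Nothing here exists in CouchDB: the field_seqs mapping (field name to the revseq at which that field last changed) and the function changed_since are assumptions made purely for illustration.

```python
def changed_since(doc, field_seqs, known_seq):
    """Return only the fields that changed after revseq `known_seq`.

    doc        -- mapping of field name to current value
    field_seqs -- hypothetical mapping of field name to the revseq
                  at which that field was last modified
    known_seq  -- newest revseq of this doc the target already has
    """
    return {
        name: value
        for name, value in doc.items()
        if field_seqs[name] > known_seq
    }


doc = {"title": "Hello", "body": "New text", "photo": "<large attachment>"}
field_seqs = {"title": 1, "body": 4, "photo": 2}

# The target already has revision 3, so only "body" needs to be sent.
to_send = changed_since(doc, field_seqs, known_seq=3)
```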


-Spurious Conflicts-

The issue with spurious conflicts is that if you have non-overlapping revision histories, you don't know if you have a conflict or not. CouchDB will always report that there is a conflict in that case.

Example: database A has a document with this revision history (I'm using strings for rev ids where they would normally be numbers):

Doc on DbA - ["1-foo" "2-bar" "3-baz" "4-biz"]
Doc on DbB - ["1-foo"]

Let's say the revision history on A is trimmed and it now looks like this:


Doc on DbA - ["2-bar" "3-baz" "4-biz"]

When we replicate DbA with DbB, we get a spurious conflict, because it can't tell if "4-biz" is actually a later revision of "1-foo":


Doc on DbB - winner: ["2-bar" "3-baz" "4-biz"]  conflict: ["1-foo"]

But if on DbC we still have the full history of that doc:
Doc on DbC - ["1-foo" "2-bar" "3-baz" "4-biz"]

When it replicates back with DbB, the missing part of the revision history is sent and the spurious conflict is automatically eliminated:


Doc on DbB - ["1-foo" "2-bar" "3-baz" "4-biz"]
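
The example above can be traced with a small Python sketch. It is a simplified model and not the actual CouchDB merge code: histories are plain lists ordered oldest first, and merge_histories is a hypothetical helper that returns None whenever a conflict has to be reported.

```python
def seq_of(rev):
    """Extract the integer revseq from a rev string like "4-biz"."""
    return int(rev.partition("-")[0])


def merge_histories(local, incoming):
    """Merge two linear histories, or return None to signal a conflict."""
    if not set(local) & set(incoming):
        # No shared rev: ancestry cannot be proven, so report a conflict.
        return None
    merged = {}
    for rev in local + incoming:
        # Same revseq with a different id means the histories really diverged.
        if merged.setdefault(seq_of(rev), rev) != rev:
            return None
    return [merged[seq] for seq in sorted(merged)]


db_a = ["2-bar", "3-baz", "4-biz"]           # trimmed history
db_b = ["1-foo"]
db_c = ["1-foo", "2-bar", "3-baz", "4-biz"]  # full history

spurious = merge_histories(db_b, db_a)  # None: no overlapping revs
via_c = merge_histories(db_b, db_c)     # overlaps on "1-foo"
healed = merge_histories(db_a, via_c)   # trimmed edit now has an ancestor
```

In this model the full history from DbC overlaps with both DbB's "1-foo" and the trimmed history from DbA, which is what lets the conflict disappear.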

-What Breaks-

This change won't break application code, so long as it treats the _rev field as an opaque string and isn't converting it to an integer or something.

This change *does* break replication with previous versions of CouchDB, and changes the file format. So a dump and import will be required for existing database files.

As of yet, I've not actually coded the parts that trim back the old revs. That will likely be a "max rev history" setting in the DB, but other suggestions are welcome.
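
For discussion, one way such a setting might behave is sketched below in Python. This is only an assumption about the eventual design: trim_history and its max_rev_history parameter are hypothetical names, and histories are again modeled as lists ordered oldest first.

```python
def trim_history(history, max_rev_history):
    """Keep only the newest `max_rev_history` revs of a history."""
    if max_rev_history < 1:
        # At least the current rev has to survive trimming.
        raise ValueError("max_rev_history must be at least 1")
    return history[-max_rev_history:]


full = ["1-foo", "2-bar", "3-baz", "4-biz"]
trimmed = trim_history(full, 3)
```

With a limit of 3 this reproduces the trimmed DbA history from the spurious conflict example.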


-Damien
