Part of the larger replication security work (branches/rep_security)
is to allow rev histories to be trimmed back to an arbitrary length.
Without this, revision histories must grow without bound, because
each update to a doc adds a new revision to the history. So if a
document is edited 1 million times, there is a 1 million rev history
that must be tracked. With trimming, a document can be edited an
unlimited number of times while keeping only a fixed-size history.
The catch is that spurious conflicts are possible if the trimmed
revision history for a later edit is replicated to a database that
has no overlapping revs.
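To make the idea concrete, here is a minimal Python sketch of
trimming. It assumes a history is simply a list of rev strings ordered
oldest to newest. The function name `trim_history` and the `max_revs`
parameter are hypothetical; this is an illustration, not the actual
trimming code.

```python
def trim_history(history, max_revs):
    """Keep only the newest max_revs entries of a rev history.

    history is assumed to be ordered oldest first, e.g.
    ["1-foo", "2-bar", "3-baz", "4-biz"].
    """
    if max_revs < 1:
        raise ValueError("max_revs must be at least 1")
    # A negative slice keeps the tail; shorter histories are unchanged.
    return history[-max_revs:]


full = ["1-foo", "2-bar", "3-baz", "4-biz"]
trimmed = trim_history(full, 3)
print(trimmed)  # ['2-bar', '3-baz', '4-biz']
```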
The new revs look like this: "4-3693042815". The format is pretty much
arbitrary; it just needs to be a parseable representation of an
integer and a second string value. The first number is the sequential
revseq (the one shown is the 4th revision), and the second is a
randomly generated id (which eventually should be generated
deterministically from the doc content, making idempotent updates
possible and completely transparent to clients).
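As a rough illustration of the deterministic id idea, the Python
sketch below derives the id from a hash of the document body. The
choice of SHA-256, the canonical JSON encoding, and the function name
`next_rev` are all assumptions made for the example, not a description
of what CouchDB does.

```python
import hashlib
import json


def next_rev(prev_seq, doc_body):
    """Build a rev string "<seq>-<id>" where id is a hash of the body.

    Hypothetical scheme: the body is serialized with sorted keys so
    that equal documents always hash to the same id.
    """
    canonical = json.dumps(doc_body, sort_keys=True, separators=(",", ":"))
    digest = hashlib.sha256(canonical.encode("utf-8")).hexdigest()
    return "%d-%s" % (prev_seq + 1, digest)


# The same edit applied twice yields the same rev (idempotent),
# regardless of the key order the client happened to use.
a = next_rev(3, {"title": "hello", "tags": ["x"]})
b = next_rev(3, {"tags": ["x"], "title": "hello"})
print(a == b)  # True
```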
In Erlang, however, a rev is represented as a tuple like this:
{4, <<"3693042815">>}, so we need to convert back and forth to the
string format for JSON. Representing it as a string in JSON instead of
a complex structure has the least impact on CouchDB clients.
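The conversion can be sketched in Python, with a (seq, id) pair
standing in for the Erlang tuple. The helper names `parse_rev` and
`format_rev` are hypothetical and only show the shape of the round
trip.

```python
def parse_rev(rev_str):
    """Split "4-3693042815" into (4, "3693042815")."""
    # Split on the first dash only, so the id part may contain dashes.
    seq, sep, rev_id = rev_str.partition("-")
    if not sep or not seq.isdigit() or not rev_id:
        raise ValueError("bad rev: %r" % (rev_str,))
    return int(seq), rev_id


def format_rev(rev):
    """Turn (4, "3693042815") back into "4-3693042815"."""
    seq, rev_id = rev
    return "%d-%s" % (seq, rev_id)


rev = parse_rev("4-3693042815")
print(rev)              # (4, '3693042815')
print(format_rev(rev))  # 4-3693042815
```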
This will also simplify partial replication support in the future, as
we can track the rev seq at which a field or attachment last changed,
and during replication send only those parts that have changed since a
previous revision that is available in the target db. The main benefit
is saving network IO by not sending fields and attachments that
haven't changed.
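One way to picture this, as a sketch only: store alongside each field
the rev seq at which it last changed, and send only the fields that
are newer than the latest revision the target already has. The data
layout and the name `fields_to_send` are hypothetical.

```python
def fields_to_send(doc_fields, target_seq):
    """Return the fields that changed after target_seq.

    doc_fields maps field name -> (changed_at_seq, value), where
    changed_at_seq is the rev seq of the edit that last touched it.
    target_seq is the newest rev seq of this doc known on the target.
    """
    return {
        name: value
        for name, (changed_at, value) in doc_fields.items()
        if changed_at > target_seq
    }


doc = {
    "title": (1, "Report"),
    "body": (4, "final text"),
    "photo.png": (2, b"image bytes"),
}
# The target already has revision 2, so only "body" needs to travel.
print(fields_to_send(doc, 2))  # {'body': 'final text'}
```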
-Spurious Conflicts-
The issue with spurious conflicts is that if you have non-overlapping
revision histories, you don't know whether you have a conflict or not.
CouchDB will always report a conflict in that case.
Example: say database A has a document with this revision history,
while database B has only its first revision (I'm using strings for
rev ids where they would normally be numbers):
Doc on DbA - ["1-foo" "2-bar" "3-baz" "4-biz"]
Doc on DbB - ["1-foo"]
Let's say the revision history on DbA is trimmed and it now looks like
this:
Doc on DbA - ["2-bar" "3-baz" "4-biz"]
When we replicate DbA to DbB, we get a spurious conflict, because
CouchDB can't tell whether "4-biz" is actually a later revision of
"1-foo":
Doc on DbB - winner: ["2-bar" "3-baz" "4-biz"] conflict: ["1-foo"]
But if on DbC we still have the full history of that doc:
Doc on DbC - ["1-foo" "2-bar" "3-baz" "4-biz"]
When DbC replicates to DbB, the missing part of the revision history
is sent and the spurious conflict is automatically eliminated:
Doc on DbB - ["1-foo" "2-bar" "3-baz" "4-biz"]
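The DbA/DbB/DbC example above can be replayed with a small Python
sketch. It assumes each history is a linear list of rev strings,
oldest first, and it ignores genuine branching conflicts (two
different revs with the same seq). The names `seq_of`, `stitch` and
`replicate` are hypothetical; this is not CouchDB's merge code.

```python
def seq_of(rev):
    """Return the integer revseq of a rev string like "4-biz"."""
    return int(rev.split("-", 1)[0])


def stitch(a, b):
    """Join two overlapping linear histories into one, oldest first."""
    return sorted(set(a) | set(b), key=seq_of)


def replicate(local_branches, incoming):
    """Merge an incoming history into a doc's local branches.

    Branches that share at least one rev with the incoming history
    are stitched into it. Branches with no overlap stay separate and
    show up as (possibly spurious) conflicts. The branch with the
    highest revseq is listed first as the winner.
    """
    merged = list(incoming)
    pending = [list(b) for b in local_branches]
    changed = True
    while changed:  # repeat, since stitching can create new overlaps
        changed = False
        for branch in list(pending):
            if set(branch) & set(merged):
                merged = stitch(branch, merged)
                pending.remove(branch)
                changed = True
    branches = pending + [merged]
    branches.sort(key=lambda b: seq_of(b[-1]), reverse=True)
    return branches


db_b = [["1-foo"]]
db_b = replicate(db_b, ["2-bar", "3-baz", "4-biz"])  # trimmed DbA
print(db_b)  # [['2-bar', '3-baz', '4-biz'], ['1-foo']]

db_b = replicate(db_b, ["1-foo", "2-bar", "3-baz", "4-biz"])  # DbC
print(db_b)  # [['1-foo', '2-bar', '3-baz', '4-biz']]
```

The first call leaves two branches because nothing links "1-foo" to
the trimmed history; the second call supplies the link, so both
branches collapse into one.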
-What Breaks-
This change won't break application code, as long as it treats the
_rev field as an opaque string and doesn't convert it to an integer or
something.
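For example, a client (sketched here in Python, with a hypothetical
helper `same_revision`) that only compares or echoes back the _rev
value keeps working, while one that parses it as an integer fails:

```python
def same_revision(rev_a, rev_b):
    """Compare revs as opaque strings; never parse them as numbers."""
    return rev_a == rev_b


new_rev = "4-3693042815"
print(same_revision(new_rev, "4-3693042815"))  # True

# Client code that assumed integer revs stops working:
try:
    int(new_rev)
except ValueError:
    print("not an integer")  # this branch runs
```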
This change *does* break replication with previous versions of
CouchDB. It also changes the file format, so a dump and import will be
required for existing database files.
As of yet, I've not actually coded the parts that trim back the old
revs. That will likely be a "max rev history" setting in the DB, but
other suggestions are welcome.
-Damien