Re: proposed replication rev history changes

Patrick Antivackis Sat, 07 Feb 2009 16:05:54 -0800

Thank you for the preview.

I will just think loud in order to comment your explaination.


As for background how it's working today:

version of a document : a version of a document is identified by an id and a
revision. The version of the document contains all the fileds/values as they
were at this specific revision

revision history : the list of all the revisions of the document beginning
by the most recent

Compaction of a database removes all previous versions of a document but
keep the revision history, so a revs=true for a document will return the
revision history of this document but of course I will not be able to see
what contains this document revision

Replication only replicate the last version of a document. if I replicate
baseA to an empty baseB, baseB will contains the last version of all non
deleted documents, and for each documents the full revision history. So for
each document i'm able to do a revs=true, but I am unable to see the content
of a previous version.

Spurious conflicts :  As a document contains the full list of its revision
history, the replication process is sure to be able to find if a document
with the same id on source and destination bases represent the same document
(by comparing both revision histories)

What you describe is :
1) to have the possibility, through a flag, to specify the number of
revisions that the revision history of each document will keep in order to
not make the revision history grow and grow. The drawback is the
introduction of spurious conflicts during replication as overlapping
revision for a same document id can be missing.

2) to change the revision format from a generated number to a tuple
containing the sequential number of the document revision and a generated id
(string)

On your first point, I see more inconvenients than advantages, as spurious
conflicts can appear. Except if you make "intelligent" trimming, by keeping
the first revision in the revision history. So in your example, Doc on DbA -
["1-foo" "2-bar" "3-baz" "4-biz"] would be Doc on DbA - ["1-foo"  "3-baz"
"4-biz"] after trimming.

On the second point, it's interesting as you have a sequential number fo the
revision so a count of the number of revision made (thing that you have
already today if you count the number of element in the revision history
list), but to be honest i don't really catch the interest to know that's
it's the 10000th revision, without being able to access individually each
version.

Why not going further, because what is the aim to have revision history for
version than you cannot access. So why not just keeping in the revision
history, the first revision (to prevent spurious conflicts) and the revision
corresponding to the version of the document you can access on the node.
Like this the sequential revision number brings interest as we know what
versions are missing on one node and could may be found on another node.

Aside your explainaition, I think it would be interesting, to have a
possibility  to also replicate all version of the documents, so you could
see whatever version on whatever node of your bases. In this case, a
revision history trim based on period would be interesting.

Just some loud thinkings,
Regards,

2009/2/7 Damien Katz <[email protected]>

> Part of the larger replication security work (branches/rep_security) is to
> allow rev histories to be trimmed back to an arbitrary length. Without this,
> revision histories must grow and grow, each update to a doc adds a new
> revision to the history. So if a document is edited 1 million times, there
> is a 1 million rev history that must be tracked.
>
> But with it, it allows for unlimited to edits to documents with only a
> fixed size history. The catch is it's possible to have spurious conflicts if
> the trimmed revision history for a later edit is replicated to a database
> without overlapping revs.
>
> The new revs look like this: "4-3693042815". The format is pretty much
> arbitrary, it just needs to be a parseable representation of an integer and
> second string value. The first number is the sequential revseq (shown is the
> 4th revision), the second is a randomly generated id (which eventually
> should be deterministically generated based on doc content, making
> idempotent updates possible and completely transparent to clients).
>
> However, when representing a rev in Erlang it is a tuple like this {4,
> <<"3693042815">>}, we need to convert back and forth between string format
> for json. Representing it as string in json instead of a complex structure
> has the least impact on couchdb clients.
>
> This will also simplify partial replication support in the future, as we
> can track the rev seq a field or attachment when changed, and during
> replication only send those parts that have changed since a previous
> revision that available in the target db. The main benefit being saving
> network IO by not sending fields and attachments that haven't changed.
>
> -Spurious Conflicts-
>
> The issue with spurious conflicts is if you have non-overlapping revision
> histories you don't know if you have a conflict or not. CouchDB will always
> report there is a conflict in the case.
>
> Example id database a with have document with this revision history (I'm
> using string for rev ids where it would normally be number):
> Doc on DbA - ["1-foo" "2-bar" "3-baz" "4-biz"]
> Doc on DbB - ["1-foo"]
>
> Lets say the revision history on A is trimmed and it now looks like this:
>
> Doc on DbA - ["2-bar" "3-baz" "4-biz"]
>
> When we replicate DbA with DbB, we get a spurious conflict, because it
> can't tell if "4-biz" is actually a later revision of "1-foo":
>
> Doc on DbB - winner: ["2-bar" "3-baz" "4-biz"]  conflict: ["1-foo"]
>
> But if on DbC we still have the full history of that doc:
> Doc on DbC - ["1-foo" "2-bar" "3-baz" "4-biz"]
>
> When it replicates back with DbB, the missing part of the revision history
> is sent and the spurious conflict automatically eliminated:
>
> Doc on DbB - ["1-foo" "2-bar" "3-baz" "4-biz"]
>
> -What Breaks-
>
> This change won't break application code, so long as they treat the _rev
> field as an opaque string and aren't converting it to integers or something.
>
> This change *does* break replication with previous versions of CouchDB, and
> changes the file format. So a dump and import will be required for existing
> database files.
>
> As of yet, I've not actually coded the parts that trim back the old revs.
> That will likely be a "max rev history" setting in the DB, but other
> suggestions welcome.
>
> -Damien
>

Re: proposed replication rev history changes

Reply via email to