Re: Replication with <2.0, third party servers, and Automerging conflicts

Jan Lehnardt Wed, 30 Mar 2016 11:03:58 -0700

Heya Michael,

thanks for taking the time to write this up!


Could the same* be achieved by taking _revs out of the _rev calculation?

*minus the variable digest/signature idea, which could be discussed separately.

My question would be: how often does the “same doc, but different _revs
history”-scenario happen as opposed to other conflicts?

I’m thinking: content conflicts (_digest/_signature mismatch) still have
to be handled outside of CouchDB, and whoever writes that logic can do a
content-equivalence check as a first shortcut in their conflict resolution
function. So is the added overhead (more _fields, more keeping track of
stuff, more entries in indexes, etc.) really worth it, if clients have
an easy way of doing their own autoresolve?
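
A rough sketch of such a client-side shortcut, in Python. Everything here
is an assumption for illustration: the function names are made up, the
leaves are assumed to be already fetched as plain dicts, "same content" is
taken to mean "same non-underscore fields", and the winner rule (highest
rev position, then highest rev string) is only a simplified stand-in for
CouchDB's deterministic choice.

```python
import hashlib
import json


def content_digest(doc):
    """Hash the document body, skipping metadata fields that start with '_'."""
    body = {k: v for k, v in doc.items() if not k.startswith("_")}
    # Assumption: sorted keys are "canonical enough" for a local comparison.
    canonical = json.dumps(body, sort_keys=True, separators=(",", ":"))
    return hashlib.md5(canonical.encode("utf-8")).hexdigest()


def rev_key(doc):
    """Split '4-abc' into (4, 'abc') so revisions sort by position, then hash."""
    pos, _, suffix = doc["_rev"].partition("-")
    return (int(pos), suffix)


def auto_resolve(leaves):
    """Return (winner, deletions) if all leaves have equal content, else None."""
    if len({content_digest(d) for d in leaves}) != 1:
        return None  # real content conflict: fall through to app-specific logic
    winner = max(leaves, key=rev_key)
    deletions = [
        {"_id": d["_id"], "_rev": d["_rev"], "_deleted": True}
        for d in leaves
        if d is not winner
    ]
    return winner, deletions
```

The deletion stubs could then be sent in one _bulk_docs request; a None
result means the leaves really differ and need a proper merge.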

I might be missing something obvious though!

Best
Jan
--




> On 27 Mar 2016, at 10:08, Michael Fair <[email protected]> wrote:
> 
> Greetings all!
> 
> Sorry if this one is a bit long; you really only need to read the first few
> paragraphs, the rest of this is rationalization on why it's sound/sane and
> some proposed implementation details.
> 
> I've been thinking about replication with third party servers and the
> problems of evolving binary formats and revision id algorithms and believe
> that by adding a simple optional change to the way conflicts are handled,
> eventual consistency will make most of the associated problems go away.
> 
> Basically, instead of trying to create the one true canonical JSON
> representation, add a set of "digests" to the current leaves of a document
> and follow some rules:
> 
> 1) the JSON format used between two end points is either the original JSON
> document; or a negotiated format that preserves the fidelity of the
> original JSON
> 
> 2) The deterministic algorithm for selecting which revision to use in the
> presence of conflicting branches is honored
> 
> and
> 
> 3) The following proposed enhancement for merging revisions with the same
> digest data (which centers on the concept that two documents with the same
> _id and contents are in fact the same document, and conflicting branches
> with the same JSON content should be eliminated/merged (not preserved) when
> the receiving server detects them.)
> 
> 
> When a conflicting leaf of a document is updated to have the same contents
> (as determined by a message digest of the contents) as another current
> leaf, this should be regarded as a MERGE operation between those leaves.
> 
> 
> The determined revision id between the two revisions would be returned and
> the losing leaf be automatically deleted (thereby resolving that conflict
> as the contents now match).
> 
> 
> A record stored on the deleted document to track the revision id the
> deleted document was merged into would make for nicer revision history
> graphs but is completely unnecessary.
> 
> 
> Further, when replicating and receiving _bulk_docs/all_or_nothing, if two
> documents are detected with the same _id and different revision ids but
> having the same digest, a conflict should not be created at all.  The same
> merge algorithm should apply; the determined revision id of the two would
> be kept and the other automatically marked as deleted.
> 
> 
> It's important to note that the digest here is a separate computation from
> the revision id (and could use its own algorithm). The revision id here
> could have been randomly generated.  This proposal is saying "check the
> contents (via a digest) before creating/persisting a conflict based on
> revision id".
> 
> 
> Here's the rationalization:
> 
> This keeps with the philosophy one _id, one document; it makes sense within
> the context of what people consider a document revision to be (a version of
> the document's contents); and it supports people and applications in
> resolving document conflicts in a meaningful way.
> 
> 
> I realize doing any conflict resolution is something new for the CouchDB
> code.  For added context on why this keeps with the Couch philosophy,
> here's a small snippet from the docs:
> 
> [
> 
> Here, (r4a, r3b, r3c) are the set of conflicting revisions. The way you
> resolve a conflict is to delete the leaf nodes along the other branches. So
> when you combine (r4a+r3b+r3c) into a single merged document, you would
> replace r4a and delete r3b and r3c.
> 
> ]
> 
> 
> This statement is all about bringing the document branches to a common
> place and terminating the lower branches. This proposal isn't doing
> something new by attempting to merge/resolve documents with different
> information; it's aiding what's already the defined procedure.
> 
> 
> This proposal provides an alternative method to accomplishing the same end
> result as described; you can PUT into r4a, r3b, and r3c the merged content
> (or say if r3b already has the right info, then just update r4a and r3c to
> match r3b's contents); under this proposal the result would be identical to
> updating r4a with the corrected info and deleting r3b and r3c as described.
> 
> 
> The example in the docs is all about managing a contact record.  This is a
> great example of a document that can arrive at the same end state but take
> many paths internally on local device databases.
> 
> 
> When syncing with another database, the fact the document took a different
> path isn't something to preserve a conflict over.  The contents at the time
> the document is synced are.
> 
> 
> And while it's tempting to say "you might want to track those histories
> separately" and that "just because they match contents doesn't make them
> the same branch", those assertions go against the idea of one _id, one
> document thinking.  They advocate for keeping conflicts as a lightweight
> form of revision history tracking.  If the contents of two leaves of a
> document can be shown to be the same, those branches should be merged and
> that conflict within the _id resolved.  The number of times a document was
> saved, or its values path, in a local device database is not important to
> the current state.
> 
> 
> Said another way, if it has the same _id and the same contents, excluding
> the revision id, it's the same document.  And once they're detected as the
> same document, it needs a single revision id, so the deterministic
> algorithm selects the winner.
> 
> 
> Assuming the concept is acceptable, then to avoid breaking anything
> existing, I propose this new message digest value be stored in a new string
> field called "MD5/CouchDB-2.0" as part of a new optional document object
> field called "_digests".
> 
> 
> Its value would be the same as the md5 portion of the revision id.  Adding
> "_digests"  as an object allows for different digest algorithms to be added
> to the same doc by other applications, or by future algorithms without
> breaking anything existing.
> 
> 
> It also doesn't break anything for a server to throw out the _digests field
> on a doc and not store it (it means more CPU work during replication for
> revision id conflicts but that might be desired over storing the data on
> the doc).
> 
> 
> This proposal also resolves the third party revision id problem.
> 
> 
> When a server (Couch or otherwise) receives documents and detects a
> revision id conflict, by using its own supported digest algorithms on the
> document it can detect and resolve conflicts where the JSON content is the
> same but the revision id was calculated using a different method.
> 
> 
> Because the revision id selection algorithm is deterministic, each receiving
> server will pick the same revision to keep and the same revision to delete.
> 
> 
> The only important thing is that the server's chosen digest algorithm
> generate different digests for different documents and the same digest for
> the same document.  It doesn't matter if the two servers used the same
> digest algorithm.
> 
> 
> A server may ignore this proposal and produce a conflict revision instead
> of merging the revision.  Replication with a server that does honor this
> proposal would detect and resolve those conflicts, or applications and
> humans might resolve them the way it's already done.
> 
> 
> A server should preserve all digests listed in the _digests object on the
> document, however it may preserve only its own, some of them, or throw out
> the object (as mentioned earlier).
> 
> 
> A server may populate as many digest algorithms as it wishes and knows how
> to compute.
> 
> 
> Only digests for the currently active leaves of a document need be
> preserved (and even then, only for documents that have active conflicts).
> Historical digests add no value for this purpose.
> 
> 
> A digest algorithm can be provided as part of a design document or
> map/reduce view definition to enable other servers to compute a preferred
> digest.  (doing this also helps ensure the same algorithm is used as the
> design docs/map view definitions can be replicated.)
> 
> 
> When a server receives a document that doesn't have its own algorithm
> listed in the "_digests" object it will have to compute it should a
> revision id conflict be detected.
> 
> 
> I believe this can be implemented as an Erlang plugin for existing 1.6.1
> servers.
> 
> 
> And lastly, this sets up the preferred method for resolving conflicts in
> an application: download all the existing conflicting revisions; massage
> the JSON contents; upload the same document to all the now resolved
> revisions.  This to me seems easier to code and follow than
> having the application decide which revision id is the right one to update
> and which one it should delete.  "Just update them all" is an easier
> approach. :)
> 
> 
> Thanks everyone, thoughts on the matter are obviously welcomed,
> 
> Mike
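
For reference, the receiving-side check described in the quoted proposal
could be sketched roughly like this in Python. This is not CouchDB code.
The algorithm name "MD5/example", the helper names, and the key-sorted
JSON hashing are all hypothetical, and the winner rule (highest rev
position, then highest rev string) is a simplification. The sketch also
assumes both revisions are already known to be leaves on different
branches.

```python
import hashlib
import json


def md5_digest(doc):
    """Digest of the body only; fields starting with '_' are ignored."""
    body = {k: v for k, v in doc.items() if not k.startswith("_")}
    return hashlib.md5(json.dumps(body, sort_keys=True).encode("utf-8")).hexdigest()


# Hypothetical: the digest algorithms this particular server knows about.
LOCAL_ALGORITHMS = {"MD5/example": md5_digest}


def digests_for(doc):
    """Use stored _digests where present, compute our own where missing."""
    digests = dict(doc.get("_digests", {}))
    for name, compute in LOCAL_ALGORITHMS.items():
        digests.setdefault(name, compute(doc))
    return digests


def same_content(a, b):
    """Compare only the digest algorithms both sides have a value for."""
    da, db = digests_for(a), digests_for(b)
    shared = set(da) & set(db)
    return bool(shared) and all(da[n] == db[n] for n in shared)


def rev_key(rev):
    pos, _, suffix = rev.partition("-")
    return (int(pos), suffix)


def receive(existing, incoming):
    """Decide between merging two leaves and recording a conflict."""
    if existing["_rev"] == incoming["_rev"]:
        return ("unchanged", existing["_rev"], None)
    if same_content(existing, incoming):
        keep, drop = sorted(
            [existing["_rev"], incoming["_rev"]], key=rev_key, reverse=True
        )
        return ("merge", keep, drop)  # drop would be marked as deleted
    return ("conflict", existing["_rev"], incoming["_rev"])
```

Because the local algorithm is always filled in on both sides, two servers
never need to agree on a digest algorithm for the comparison to work.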

-- 
Professional Support for Apache CouchDB:
https://neighbourhood.ie/couchdb-support/