Heya Michael, thanks for taking the time to write this up!
Could the same* be achieved by taking _revs out of the _rev calculation?

*minus the variable digest/signature idea, which could be discussed separately.

My question would be: how often does the “same doc, but different _revs history” scenario happen as opposed to other conflicts?

I’m thinking: content conflicts (_digest/_signature mismatch) still have to be handled outside of CouchDB, and whoever writes that logic can do a content-equivalence check as a first shortcut in the conflict resolution function. Given that, is the added overhead (more _fields, more keeping track of stuff, more entries in indexes etc.) really worth it, if clients have an easy way of doing their own autoresolve?

I might be missing something obvious though!

Best
Jan

--

> On 27 Mar 2016, at 10:08, Michael Fair wrote:
>
> Greetings all!
>
> Sorry if this one is a bit long; you really only need to read the first few paragraphs, the rest of this is rationalization on why it's sound/sane and some proposed implementation details.
>
> I've been thinking about replication with third party servers and the problems of evolving binary formats and revision id algorithms and believe that by adding a simple optional change to the way conflicts are handled, eventual consistency will make most of the associated problems go away.
>
> Basically, instead of trying to create the one true canonical JSON representation, add a set of "digests" to the current leaves of a document and follow some rules:
>
> 1) The JSON format used between two end points is either the original JSON document, or a negotiated format that preserves the fidelity of the original JSON
>
> 2) The deterministic algorithm for selecting which revision to use in the presence of conflicting branches is honored
>
> and
>
> 3) The following proposed enhancement for merging revisions with the same digest data (which centers on the concept that two documents with the same _id and contents are in fact the same document, and conflicting branches with the same JSON content should be eliminated/merged (not preserved) when the receiving server detects them.)
>
> When a conflicting leaf of a document is updated to have the same contents (as determined by a message digest of the contents) as another current leaf, this should be regarded as a MERGE operation between those leaves.
>
> The revision id determined to be the winner between the two revisions would be returned, and the losing leaf would be automatically deleted (thereby resolving that conflict, as the contents now match).
>
> A record stored on the deleted document to track the revision id the deleted document was merged into would make for nicer revision history graphs but is completely unnecessary.
>
> Further, when replicating and receiving _bulk_docs/all_or_nothing, if two documents are detected with the same _id and different revision ids but having the same digest, a conflict should not be created at all. The same merge algorithm should apply; the winning revision id of the two would be kept and the other automatically marked as deleted.
>
> It's important to note that the digest here is a separate computation from the revision id (and could use its own algorithm). The revision id here could have been randomly generated.
> This proposal is saying "check the contents (via a digest) before creating/persisting a conflict based on revision id".
>
> Here's the rationalization:
>
> This keeps with the philosophy of one _id, one document; it makes sense within the context of what people consider a document revision to be (a version of the document's contents); and it supports people and applications in resolving document conflicts in a meaningful way.
>
> I realize doing any conflict resolution is something new for the CouchDB code. For added context on why this keeps with the Couch philosophy, here's a small snippet from the docs:
>
> [
>
> Here, (r4a, r3b, r3c) are the set of conflicting revisions. The way you resolve a conflict is to delete the leaf nodes along the other branches. So when you combine (r4a+r3b+r3c) into a single merged document, you would replace r4a and delete r3b and r3c.
>
> ]
>
> This statement is all about bringing the document branches to a common place and terminating the lower branches. This proposal isn't doing something new by attempting to merge/resolve documents with different information; it's aiding what's already the defined procedure.
>
> This proposal provides an alternative method of accomplishing the same end result as described; you can PUT into r4a, r3b, and r3c the merged content (or say if r3b already has the right info, then just update r4a and r3c to match r3b's contents); under this proposal the result would be identical to updating r4a with the corrected info and deleting r3b and r3c as described.
>
> The example in the docs is all about managing a contact record. This is a great example of a document that can arrive at the same end state but take many paths internally on local device databases.
>
> When syncing with another database, the fact the document took a different path isn't something to preserve a conflict over. The contents at the time the document is synced are.
>
> And while it's tempting to say "you might want to track those histories separately" and that "just because they match contents doesn't make them the same branch", those assertions go against the idea of one _id, one document thinking. They advocate for keeping conflicts as a lightweight form of revision history tracking. If the contents of two leaves of a document can be shown to be the same, those branches should be merged and that conflict within the _id resolved. The number of times a document was saved, or the path its values took, in a local device database is not important to the current state.
>
> Said another way, if it has the same _id and the same contents, excluding the revision id, it's the same document. And once they're detected as the same document, it needs a single revision id, so the deterministic algorithm selects the winner.
>
> Assuming the concept is acceptable, then to avoid breaking anything existing, I propose this new message digest value be stored in a new string field called "MD5/CouchDB-2.0" as part of a new optional document object field called "_digests".
>
> Its value would be the same as the md5 portion of the revision id. Adding "_digests" as an object allows for different digest algorithms to be added to the same doc by other applications, or by future algorithms without breaking anything existing.
>
> It also doesn't break anything for a server to throw out the _digests field on a doc and not store it (it means more CPU work during replication for revision id conflicts but that might be desired over storing the data on the doc).
>
> This proposal also resolves the third party revision id problem.
>
> When a server (Couch or otherwise) receives documents and detects a revision id conflict, by using its own supported digest algorithms on the document it can detect and resolve conflicts where the JSON content is the same but the revision id was calculated using a different method.
>
> Because the revision id selection algorithm is deterministic, each receiving server will pick the same revision to keep and the same revision to delete.
>
> The only important thing is that the server's chosen digest algorithm generates different digests for different documents and the same digest for the same document. It doesn't matter if the two servers used the same digest algorithm.
>
> A server may ignore this proposal and produce a conflict revision instead of merging the revision. Replication with a server that does honor this proposal would detect and resolve those conflicts, or applications and humans might resolve them the way it's already done.
>
> A server should preserve all digests listed in the _digests object on the document; however, it may preserve only its own, some of them, or throw out the object (as mentioned earlier).
>
> A server may populate as many digest algorithms as it wishes and knows how to compute.
>
> Only digests for the currently active leaves of a document need be preserved (and even then, only for documents that have active conflicts). Historical digests add no value for this purpose.
>
> A digest algorithm can be provided as part of a design document or map/reduce view definition to enable other servers to compute a preferred digest. (Doing this also helps ensure the same algorithm is used, as the design docs/map view definitions can be replicated.)
>
> When a server receives a document that doesn't have its own algorithm listed in the "_digests" object, it will have to compute it should a revision id conflict be detected.
>
> I believe this can be implemented as an Erlang plugin for existing 1.6.1 servers.
>
> And lastly, this sets up the preferred method for resolving conflicts in an application to be: download all the existing conflicting revisions; massage the JSON contents; upload the same document to all the now resolved revisions. This to me seems easier to code and follow than having the application decide which revision id is the right one to update and which one it should delete. "Just update them all" is an easier approach. :)
>
> Thanks everyone, thoughts on the matter are obviously welcomed,
>
> Mike

--
Professional Support for Apache CouchDB: https://neighbourhood.ie/couchdb-support/
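
For reference, the merge rule discussed in this thread (and the client-side content-equivalence shortcut) could be sketched roughly as below. This is a minimal illustration in Python, not CouchDB code. The function names are hypothetical, sorted-key JSON is only a stand-in for whatever digest input a server would really use, and the winner rule (higher generation, then higher revision suffix) is a simplified approximation of the deterministic selection that ignores deleted leaves.

```python
import hashlib
import json


def content_digest(doc):
    """MD5 over the document body; underscore metadata other than _id is ignored.

    Assumption: sorted-key JSON stands in for the real digest input.
    """
    body = {k: v for k, v in doc.items() if not k.startswith("_") or k == "_id"}
    canonical = json.dumps(body, sort_keys=True, separators=(",", ":"))
    return hashlib.md5(canonical.encode("utf-8")).hexdigest()


def rev_sort_key(rev):
    """Split 'N-suffix' into (N, suffix) so revisions can be ordered."""
    generation, _, suffix = rev.partition("-")
    return (int(generation), suffix)


def merge_equal_leaves(leaves):
    """Collapse leaves whose contents match; return (kept_leaves, deleted_revs).

    Leaves with different digests are left alone: those are real content
    conflicts and still need an application or a human to resolve them.
    """
    winners = {}
    deleted = []
    for leaf in leaves:
        digest = content_digest(leaf)
        current = winners.get(digest)
        if current is None:
            winners[digest] = leaf
        elif rev_sort_key(leaf["_rev"]) > rev_sort_key(current["_rev"]):
            # The new leaf wins; the previously kept one is merged away.
            deleted.append(current["_rev"])
            winners[digest] = leaf
        else:
            deleted.append(leaf["_rev"])
    return list(winners.values()), deleted
```

Given leaves r4a and r3b with identical contents and r3c with different contents, this keeps r4a and r3c and reports r3b as the revision to delete. The remaining r4a/r3c conflict is a genuine content conflict and stays in place.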
