[
https://issues.apache.org/jira/browse/COUCHDB-988?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12971396#action_12971396
]
Adam Kocoloski commented on COUCHDB-988:
----------------------------------------
Thanks, Paul, this is a good writeup. I didn't realize we tried to patch the
Erlang VM to support arbitrarily large revision trees. A few other quick notes:
* I'm certain the constraint you listed is already imposed by the way we
generate revisions.
* BigCouch changes the notion of a sequence, but not a document revision. In
fact the patches in COUCHDB-968 all apply more or less cleanly to BigCouch.
* I'm sure you can understand that I'm reluctant to toss the current
couch_key_tree implementation out given that I just invested so many hours into
understanding and fixing it.
> Rewrite couch_key_tree.erl
> --------------------------
>
> Key: COUCHDB-988
> URL: https://issues.apache.org/jira/browse/COUCHDB-988
> Project: CouchDB
> Issue Type: Improvement
> Components: Database Core
> Reporter: Paul Joseph Davis
> Priority: Minor
>
> The current key tree module is a fairly complicated piece of code with enough
> subtlety to fill three romance novels. This ticket is a proposal to rewrite
> the current module with an algorithm that will be easier to reason about.
> Here I'll write a brief explanation of the current algorithms, and then a
> short proposal for a simpler algorithm.
> A key tree is used to represent the edit history of a document. Each node of
> the tree represents a particular version of the document. Relations between
> nodes represent the order that these edits were applied. For instance, a
> simple tree with a single linear path A->B->C means that the edit C was based
> on the version B which was in turn based on A. In a world without replication
> (and no ability to disable MVCC checks), all histories would be forced to be
> linear lists of edits due to constraints imposed by MVCC (ie, new edits must
> be based on the current version). However, we have replication, so we must
> deal with not so easy cases.
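To make the MVCC constraint concrete, here's a tiny sketch (Python purely for illustration; the class and method names are made up, not CouchDB code) of why single-node histories stay linear:

```python
# Sketch of the single-node MVCC rule: without replication, every accepted
# edit must be based on the current (leaf) revision, so history stays linear.
# All names here are illustrative, not CouchDB's actual API.

class LinearHistory:
    def __init__(self):
        self.revs = []          # oldest -> newest, e.g. ["A", "B", "C"]

    def update(self, based_on, new_rev):
        current = self.revs[-1] if self.revs else None
        if based_on != current:
            raise ValueError("conflict: edit not based on current revision")
        self.revs.append(new_rev)

h = LinearHistory()
h.update(None, "A")
h.update("A", "B")
h.update("B", "C")          # history is now A -> B -> C
# h.update("A", "X") would raise: X is not based on the current leaf C
```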
> Consider a document in state A. This doc is replicated to a second node. We
> then edit the document once on each node, leaving it in two different states, B
> and C. We now have two key trees, A->B and A->C. When we go to replicate a
> second time, the key tree must combine these two trees which gives us
> A->(B|C). For the astute reader, this is how conflicts are introduced. In
> terms of the key tree, we say that we have two leaves (B and C) that are not
> deleted. Hence, conflict. To remove a conflict, one of the edits (B or C) can
> be deleted, which results in A->(B|C->D), where D is an edit that is
> specially marked with a deleted=true flag.
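As a sketch of that merge (illustrative Python, with a {key: children} dict standing in for the key tree; none of these names come from couch_key_tree):

```python
# Sketch of how replication introduces a conflict. Trees are modeled as
# {key: children_dict}; merging unions children under shared ancestors.
# Representation and names are illustrative, not the actual implementation.

def merge(a, b):
    """Recursively union two trees keyed by revision."""
    out = dict(a)
    for key, kids in b.items():
        out[key] = merge(out.get(key, {}), kids)
    return out

def leaves(tree, deleted=frozenset()):
    """Collect live leaf revisions; multiple live leaves == conflict."""
    result = []
    for key, kids in tree.items():
        if not kids:
            if key not in deleted:
                result.append(key)
        else:
            result.extend(leaves(kids, deleted))
    return result

node1 = {"A": {"B": {}}}              # A->B on node 1
node2 = {"A": {"C": {}}}              # A->C on node 2
merged = merge(node1, node2)          # A->(B|C)
print(sorted(leaves(merged)))         # ['B', 'C']: two live leaves, a conflict

# Deleting one branch (an edit D with deleted=true on top of C) resolves it:
resolved = merge(merged, {"A": {"C": {"D": {}}}})
print(leaves(resolved, deleted={"D"}))  # ['B']: only one live leaf left
```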
> Also, of note which will help us down the line, remember that there's another
> completely different possible mode of operation. Given two couchdb instances,
> we could *create* two different docs with the same id. Now we have two
> documents with edit histories E and F. Now after replication we have edit
> history (E|F), but really, this isn't a tree. It's two roots to two different
> trees. Our algorithms still work, we just have to consider multiple trees
> when we apply them. Remember this after this next aside.
> Hopefully from that brief description of simple operations, it's fairly
> intuitive how we can end up with arbitrarily deep and branching trees due to
> distributed edits. Of note here is that this operation parallels the nature
> of Git's commit model. Each commit sha1 depends on the parent sha1 creating a
> sort of inverted Merkle tree.
> At this point I'll mention a quick point on the implementation of these
> trees. In Erlang, they are implemented as a list of nested tuples with the
> roots being the most shallow node. For instance, A->B->C could be represented
> as roughly, [{A, [{B, [{C, []}]}]}], which in English is: "a single-element
> list of a two-element tuple, where the first element is the key, and the
> second element is a single-element list of a two-element tuple....turtles."
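The depth problem falls straight out of that representation. A sketch in Python tuples (purely illustrative; this mimics the nesting described above, not the Erlang structure's real field layout):

```python
# Sketch of the nested representation: each node is (key, children_list),
# so nesting depth grows linearly with the length of the edit history.
# Illustrative Python, not the Erlang data structure.

def from_path(keys):
    """Build a nested [(key, children)] structure from a linear history."""
    tree = []
    for key in reversed(keys):          # build from the innermost node out
        tree = [(key, tree)]
    return tree

def depth(tree):
    """Recursive walk: nesting depth == length of the longest edit path."""
    if not tree:
        return 0
    return 1 + max(depth(children) for _, children in tree)

t = from_path(["A", "B", "C"])
print(t)          # [('A', [('B', [('C', [])])])]
print(depth(t))   # 3

# A recursive walk over a 10,000-edit history exhausts Python's default
# stack, mirroring the problem hit in the Erlang VM's recursive C functions:
long_tree = from_path([f"r{i}" for i in range(10_000)])
try:
    depth(long_tree)
except RecursionError:
    print("stack exhausted")
```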
> The reason I make this note on implementation is that it should be apparent
> that the depth of this data structure is going to be linearly dependent on
> the edit history of a document. As we found out quickly, Erlang also
> apparently has some core C functions that are written recursively. As
> everyone knows about recursion in C, there is a point where one more turtle
> is one too many. Or, in other words, we run out of room on the stack to
> allocate more frames.
> At this point we had a couple options. Patch Erlang (we did, it was rejected
> IIRC). Change our representation (could work, but would break some of the
> algorithmic assumptions, especially with how Erlang's SSA works). Or, add
> pruning to the revision history. This is what revision stemming is.
> Revision stemming in a nutshell means "Keep only the last N edits". But then
> it's a tree, so it's actually "keep the last N edits per leaf." The tradeoff
> here is that stemming may introduce a conflict by accident. For instance,
> given A->B on one node, replicate to a second, then edit it so that it's
> A->B->C->D. Then applying stemming with an N of 2, we end up with A->B on
> node 1, and C->D on node 2. After replication, this turns into (A->B|C->D)
> ie, we weren't able to see that C was a descendant of B. This is generally
> considered an acceptable tradeoff. You just need to make sure you replicate
> before any branch accumulates N or more new edits.
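A minimal sketch of that false-conflict scenario (illustrative Python, linear paths only; not the real stemming code):

```python
# Sketch of revision stemming on a linear path, and the false conflict it
# can introduce. Illustrative only, not the couch_key_tree implementation.

def stem(path, limit):
    """Keep only the last `limit` revisions of a linear edit path."""
    return path[-limit:]

node1 = ["A", "B"]                      # node 1 never saw the later edits
node2 = stem(["A", "B", "C", "D"], 2)   # node 2 stems with N=2
print(node2)                            # ['C', 'D']

# After replication, nothing connects B to C any more, so the histories
# can no longer be joined into one path: (A->B | C->D), a false conflict.
share_a_rev = bool(set(node1) & set(node2))
print(share_a_rev)                      # False: the trees look unrelated
```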
> The second important note about rev_stemming is that it changes our storage
> of trees. The current implementation keeps track of how many edits have been
> discarded. Ie, the revisions you see in documents have the format
> "Number-Hash", and that number is basically how deep that revision is in the
> path from the root. We
> then leverage this information when merging stemmed trees so that we know how
> deep we need to look to start matching edits.
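A sketch of how a stored start position could drive both the "Number-Hash" strings and the depth alignment during a merge (illustrative Python; single letters stand in for the real revision hashes, and these helper names are made up):

```python
# Sketch: `pos` counts how many ancestors were stemmed away before path[0].
# It preserves the "Number-Hash" numbering and tells a merge at which depth
# two stemmed paths should start matching. Illustrative only.

def rev_strings(pos, path):
    """Render "Number-Hash" style revisions for a stemmed linear path."""
    return [f"{pos + i}-{key}" for i, key in enumerate(path, start=1)]

# An unstemmed history A->B->C starts at pos=0:
print(rev_strings(0, ["A", "B", "C"]))   # ['1-A', '2-B', '3-C']

# After stemming away A, the tree is stored as pos=1 with path [B, C],
# and the revision numbers visible to clients are unchanged:
print(rev_strings(1, ["B", "C"]))        # ['2-B', '3-C']

def paths_mergeable(pos1, path1, pos2, path2):
    """Two stemmed linear paths merge iff they agree at every shared depth."""
    start = max(pos1, pos2)
    end = min(pos1 + len(path1), pos2 + len(path2))
    if start >= end:
        return False                      # no overlapping depths at all
    return all(path1[d - pos1] == path2[d - pos2] for d in range(start, end))

print(paths_mergeable(0, ["A", "B"], 1, ["B", "C", "D"]))  # True: B lines up
print(paths_mergeable(0, ["A", "B"], 2, ["C", "D"]))       # False: no overlap
```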
> One final minor piece of implementation. Each key in the tree also carries a
> value. This can either be a document body about to be written, or perhaps the
> location in the file where that revision can be read.
> The bug that was uncovered in COUCHDB-968 was that the start-of-tree position
> information was used in such a way that we ended up choosing the wrong value.
> Ie, we were choosing to re-write the same document instead of discarding it
> for the version that was already on disk. This ended up cascading through
> about 3 critical sections causing the end result of dupes in _changes and
> _all_docs after compaction.
> The fix that Adam has put together was to first fix the decision making
> process by making a specific preference for which tree's value is used for
> identical revisions. Due to more than a few subtleties this has taken a bit
> get right with all the intervening key tree code. Hence, my desire to rewrite
> it.
> The last thing I'll write up is an off-the-cuff approach that I thought of
> this morning. I don't necessarily think this is the only way to rewrite the
> key tree, but it seems easier (before coding). If anyone has ideas for other
> approaches, that's just as well, as the only goal here is to reduce the
> complexity factor.
> The current code was originally designed around the concept of rooted trees
> of edits. Ie, Git. As we added features like stemming, the actual problem has
> shifted towards considering a larger set of possibly unrelated trees. If we
> add one constraint, it is possible for us to reconsider the algorithms in
> terms of more general graph approaches. That one extra constraint:
> 1. keys (ie, revisions) must be unique within the set of trees (ie,
> single docid).
> The general algorithm for merging N trees is something like:
> 1. For each tree, create a list of parent/child edges.
> 2. Find the union of all edges.
> 3. Rebuild the tree.
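Those three steps can be sketched as follows (illustrative Python under the key-uniqueness constraint above; a twinkle-in-the-eye sketch, not a design):

```python
# Sketch of the proposed edge-based merge: flatten each tree into
# parent/child edges, union them, and rebuild. Relies on keys being unique
# across the whole set of trees. Illustrative only.

def edges(tree, parent=None, acc=None):
    """Flatten a {key: children} tree into a set of (parent, child) edges."""
    acc = set() if acc is None else acc
    for key, kids in tree.items():
        acc.add((parent, key))            # parent None marks a root
        edges(kids, key, acc)
    return acc

def rebuild(edge_set):
    """Rebuild {key: children} trees from the union of all edges."""
    nodes = {child: {} for _, child in edge_set}
    roots = {}
    for parent, child in edge_set:
        (roots if parent is None else nodes[parent])[child] = nodes[child]
    return roots

t1 = {"A": {"B": {}}}                     # A->B from one node
t2 = {"A": {"C": {"D": {}}}}              # A->C->D from the other
merged = rebuild(edges(t1) | edges(t2))
print(merged == {"A": {"B": {}, "C": {"D": {}}}})   # True: A->(B|C->D)
```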
> There'd be more to work out of course. Depth would disappear, which would
> affect revision patterns visible to the client. Though BigCouch has changed
> revisions as well. I'm sure there are probably other things to work out.
> So, that's all I've got right now. It's a bit of a twinkle in the eye, but I
> figured I'd get my thoughts on the current code down on paper so I don't have
> to re-think them in the future.