Rewrite couch_key_tree.erl
--------------------------

                 Key: COUCHDB-988
                 URL: https://issues.apache.org/jira/browse/COUCHDB-988
             Project: CouchDB
          Issue Type: Improvement
          Components: Database Core
            Reporter: Paul Joseph Davis
            Priority: Minor


The current key tree module is a fairly complicated piece of code with enough 
subtlety to fill three romance novels. This ticket is a proposal to rewrite the 
current module with an algorithm that will be more easy to reason about. Here 
I'll write a brief explanation of the current algorithms, and then a short 
proposal for a simpler algorithm.

A key tree is used to represent the edit history of a document. Each node of 
the tree represents a particular version of the document. Relations between 
nodes represent the order that these edits were applied. For instance, a simple 
tree with a single linear path A->B->C means that the edit C was based on the 
version B which was in turn based on A. In a world without replication (and no 
ability to disable MVCC checks), all histories would be forced to be linear 
lists of edits due to constraints imposed by MVCC (ie, new edits must be based 
on the current version). However, we have replication, so we must deal with not 
so easy cases.

Consider a document in state A. This doc is replicated to a second node. We 
then edit the document one each node leaving it in two different states, B and 
C. We now have two key trees, A->B and A->C. When we go to replicate a second 
time, the key tree must combine these two trees which gives us A->(B|C). For 
the astute reader, this is how conflicts are introduced. In terms of the key 
tree, we say that we have two leaves (B and C) that are not deleted. Hence, 
conflict. To remove a conflict, one of the edits (B or C) can be deleted, which 
results in, A->(B|C->D) where D is an edit that is specially marked with a 
deleted=true flag.

Also, of note which will help us down the line, remember that there's another 
completely different possible mode of operation. Given two couchdb instances, 
we could *create* two different docs with the same id. Now we have two 
documents with edit histories E and F. Now after replication we have edit 
history (E|D) but really, this isn't a tree. Its two roots to two different 
trees. Our algorithms still work, we just have to consider multiple trees when 
we apply them. Remember this after this next aside.

Hopefully from that brief description of simple operations its fairly intuitive 
how we can end up with arbitrarily deep and branching trees due to distributed 
edits. Of note here is that this operation parallels the nature of Git's commit 
model. Each commit sha1 depends on the parent sha1 creating a sort of inverted 
merkle tree.

At this point I'll mention a quick point on the implementation of these trees. 
In Erlang, they are implemented as a list of nested tuples with the roots being 
the most shallow node. For instance, A->B->C could be represented as roughly, 
[{A, [{B, [{C}]}]}] which in english: "a single element list of a two element 
tuple, where the first element is the key, and the second element is a single 
element list of a two tuple....turtles."

The reason I make this note on implementation is that it should be apparent 
that the depth of this data structure is going to be linearly dependent on the 
edit history of a document. As we found out quickly, Erlang also apparently has 
some core C functions that are written recursively. As everyone knows about 
recursion in C, there is a point where one more turtle is one too many. Or, in 
other words, we run out of room on the stack to allocate more frames.

At this point we had a couple options. Patch Erlang (we did, it was rejected 
IIRC). Change our representation (could work, but would break some of the 
algorithmic assumptions, especially with how Erlang's SSA works). Or, add 
pruning to the revision history. This is what revision stemming is.

Revision stemming in a nutshell means "Keep only the last N edits". But then 
its a tree, so its actually, "keep the last N edits per leaf." The tradeoff 
here is that stemming may introduce a conflict by accident. For instance, Given 
A->B on one node, replicate to a second, edit it so that its A->B->C->D. Then 
applying stemming with an N of 2, we end up with A->B on node 1, and C->D on 
node 2. After replication, this turns into (A->B|C->D) ie, we weren't able to 
see that C was a descendant of B. This is generally considered an acceptable 
tradeoff. You just need to make sure you replicate more often than the N 
rev_stemming edits.

The second important note about rev_stemming is that it changes our storage of 
trees. The current implementation keeps track of how many edits have been 
discarded. Ie, in revisions you see in documents of the format "Number-Hash", 
that number is basically how deep that revision is in the path from root. We 
then leverage this information when merging stemmed trees so that we know how 
deep we need to look to start matching edits.

One final minor piece of implementation. Each key in the tree also carries a 
value. This can either be a document body about to be written, or perhaps the 
location in the file where that revision can be read.

The bug that was uncovered in COUCHDB-968 was that the start-of-tree position 
information was used in such a way that we ended up choosing the wrong value. 
Ie, we were choosing to re-write the same document instead of discarding it for 
the version that was already on disk. This ended up cascading through about 3 
critical sections causing the end result of dupes in _changes and _all_docs 
after compaction.

The fix that Adam has put together was to first fix the decision making process 
by making a specific preference for which tree's value is used for identical 
revisions. Due to more than a few subtleties this has take a bit to get right 
with all the intervening key tree code. Hence, my desire to rewrite it.

The last thing I'll write up is an off the cuff approach that I thought of this 
morning. I don't necessarily think this is the only way to rewrite the key 
tree, but it seems easier (before coding). If anyone has ideas for other 
approaches that's just as well as the only goal here is to reduce the 
complexity factor.

The current code was originally designed around the concept of rooted trees of 
edits. Ie, Git. As we added features like stemming, the actual problem has 
shifted towards considering a larger set of possibly unrelated trees. If we add 
one constraint, it is possible for us to reconsider the algorithms in terms of 
more general graph approaches. That one extra constraint:

    1. keys (ie, revisions) must be unique within the set of trees (ie, single 
docid).

The general algorithm for merging N trees is something like:

    1. For each tree, create a list of parent/child edges.
    2. Find the union of all edges.
    3. Rebuild the tree.

There'd be more to work out of course. Depth would disappear, which would 
affect revision patterns visible to the client. Though BigCouch has changed 
revisions as well. I'm sure there are probably other things to work out.

So, that's all I've got right now. Its a bit of a twinkle in the eye, but I 
figured I'd get my thoughts on the current code down on paper so I don't have 
to re-think them in the future.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

Reply via email to