Rewrite couch_key_tree.erl
--------------------------
Key: COUCHDB-988
URL: https://issues.apache.org/jira/browse/COUCHDB-988
Project: CouchDB
Issue Type: Improvement
Components: Database Core
Reporter: Paul Joseph Davis
Priority: Minor
The current key tree module is a fairly complicated piece of code with enough
subtlety to fill three romance novels. This ticket is a proposal to rewrite the
current module with an algorithm that will be more easy to reason about. Here
I'll write a brief explanation of the current algorithms, and then a short
proposal for a simpler algorithm.
A key tree is used to represent the edit history of a document. Each node of
the tree represents a particular version of the document. Relations between
nodes represent the order that these edits were applied. For instance, a simple
tree with a single linear path A->B->C means that the edit C was based on the
version B which was in turn based on A. In a world without replication (and no
ability to disable MVCC checks), all histories would be forced to be linear
lists of edits due to constraints imposed by MVCC (ie, new edits must be based
on the current version). However, we have replication, so we must deal with not
so easy cases.
Consider a document in state A. This doc is replicated to a second node. We
then edit the document one each node leaving it in two different states, B and
C. We now have two key trees, A->B and A->C. When we go to replicate a second
time, the key tree must combine these two trees which gives us A->(B|C). For
the astute reader, this is how conflicts are introduced. In terms of the key
tree, we say that we have two leaves (B and C) that are not deleted. Hence,
conflict. To remove a conflict, one of the edits (B or C) can be deleted, which
results in, A->(B|C->D) where D is an edit that is specially marked with a
deleted=true flag.
Also, of note which will help us down the line, remember that there's another
completely different possible mode of operation. Given two couchdb instances,
we could *create* two different docs with the same id. Now we have two
documents with edit histories E and F. Now after replication we have edit
history (E|D) but really, this isn't a tree. Its two roots to two different
trees. Our algorithms still work, we just have to consider multiple trees when
we apply them. Remember this after this next aside.
Hopefully from that brief description of simple operations its fairly intuitive
how we can end up with arbitrarily deep and branching trees due to distributed
edits. Of note here is that this operation parallels the nature of Git's commit
model. Each commit sha1 depends on the parent sha1 creating a sort of inverted
merkle tree.
At this point I'll mention a quick point on the implementation of these trees.
In Erlang, they are implemented as a list of nested tuples with the roots being
the most shallow node. For instance, A->B->C could be represented as roughly,
[{A, [{B, [{C}]}]}] which in english: "a single element list of a two element
tuple, where the first element is the key, and the second element is a single
element list of a two tuple....turtles."
The reason I make this note on implementation is that it should be apparent
that the depth of this data structure is going to be linearly dependent on the
edit history of a document. As we found out quickly, Erlang also apparently has
some core C functions that are written recursively. As everyone knows about
recursion in C, there is a point where one more turtle is one too many. Or, in
other words, we run out of room on the stack to allocate more frames.
At this point we had a couple options. Patch Erlang (we did, it was rejected
IIRC). Change our representation (could work, but would break some of the
algorithmic assumptions, especially with how Erlang's SSA works). Or, add
pruning to the revision history. This is what revision stemming is.
Revision stemming in a nutshell means "Keep only the last N edits". But then
its a tree, so its actually, "keep the last N edits per leaf." The tradeoff
here is that stemming may introduce a conflict by accident. For instance, Given
A->B on one node, replicate to a second, edit it so that its A->B->C->D. Then
applying stemming with an N of 2, we end up with A->B on node 1, and C->D on
node 2. After replication, this turns into (A->B|C->D) ie, we weren't able to
see that C was a descendant of B. This is generally considered an acceptable
tradeoff. You just need to make sure you replicate more often than the N
rev_stemming edits.
The second important note about rev_stemming is that it changes our storage of
trees. The current implementation keeps track of how many edits have been
discarded. Ie, in revisions you see in documents of the format "Number-Hash",
that number is basically how deep that revision is in the path from root. We
then leverage this information when merging stemmed trees so that we know how
deep we need to look to start matching edits.
One final minor piece of implementation. Each key in the tree also carries a
value. This can either be a document body about to be written, or perhaps the
location in the file where that revision can be read.
The bug that was uncovered in COUCHDB-968 was that the start-of-tree position
information was used in such a way that we ended up choosing the wrong value.
Ie, we were choosing to re-write the same document instead of discarding it for
the version that was already on disk. This ended up cascading through about 3
critical sections causing the end result of dupes in _changes and _all_docs
after compaction.
The fix that Adam has put together was to first fix the decision making process
by making a specific preference for which tree's value is used for identical
revisions. Due to more than a few subtleties this has take a bit to get right
with all the intervening key tree code. Hence, my desire to rewrite it.
The last thing I'll write up is an off the cuff approach that I thought of this
morning. I don't necessarily think this is the only way to rewrite the key
tree, but it seems easier (before coding). If anyone has ideas for other
approaches that's just as well as the only goal here is to reduce the
complexity factor.
The current code was originally designed around the concept of rooted trees of
edits. Ie, Git. As we added features like stemming, the actual problem has
shifted towards considering a larger set of possibly unrelated trees. If we add
one constraint, it is possible for us to reconsider the algorithms in terms of
more general graph approaches. That one extra constraint:
1. keys (ie, revisions) must be unique within the set of trees (ie, single
docid).
The general algorithm for merging N trees is something like:
1. For each tree, create a list of parent/child edges.
2. Find the union of all edges.
3. Rebuild the tree.
There'd be more to work out of course. Depth would disappear, which would
affect revision patterns visible to the client. Though BigCouch has changed
revisions as well. I'm sure there are probably other things to work out.
So, that's all I've got right now. Its a bit of a twinkle in the eye, but I
figured I'd get my thoughts on the current code down on paper so I don't have
to re-think them in the future.
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.