[
https://issues.apache.org/jira/browse/CASSANDRA-193?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12776600#action_12776600
]
Stu Hood edited comment on CASSANDRA-193 at 11/12/09 4:52 AM:
--------------------------------------------------------------
Thanks a lot for the review Jun! I'll respond to #1 in a separate comment,
since it is a design issue that still needs a lot of discussion.
> 2
> 2.1 In MerkleTree.Node.insert, why do you increment the depth of the left
> child even when the node doesn't split?
Node.insert() is only used during split operations (perhaps it is misnamed...
but this is not quite a traditional B-Tree). The child to the left is the node
that contains the Token we were splitting on, and whenever we split a range we
increment its depth to indicate how far it is from being a complete (0,0]
range. I'll add a comment to this effect.
> 2.2 In the same function, if the node does split, where is the code to shrink
> the children list in the splitted node to half?
Node.insert() uses List.subList() and List.clear() inline to take half of its
neighbor's children.
> 2.3 In the same function, do you have to keep calling invalidate during
> insertion?
The original design assumed that the tree was going to live for a while in
memory, and be maintained between repair sessions, so the split operation is
intended to be used on a tree that might be partially valid. We might be able
to have the initial building of the tree skip this check somehow, but I don't
think the 'hash == null? hash = null;' check is too intensive.
> 3
> 3.1 In Validator.add, there is comment about generating a new range. However,
> no code does that.
The "private MerkleTree.TreeRangeIterator ranges" variable is an iterator
generated by the MerkleTree: it iterates in order over all of the ranges in the
tree that have null hashes.
> 3.2 In TreeRange.validateHelper [...] Why do you have to compute multiple
> hash values recursively?
The reasoning here is that a MerkleTree is supposed to be a sparse
representation of a 'perfect/complete' binary tree. Each leaf of the perfect
tree represents a hashed range, and each inner node represents a binary hash of
its two children. The perfect tree is of depth "hashdepth", so when
validateHelper() reaches the maximum/hashdepth, it is in one of the leaves of
the perfect tree, and rows are hashed sequentially there. <EDIT comment="Ignore
everything in in this tag">If a single call to validate() starts in a TreeRange
that is at depth == maxdepth, then what is stored in the MkT.Node is the value
of a perfect leaf, otherwise, the MkT.Node is storing the value of a perfect
inner node.</EDIT>
I'll probably copy and past this exact explanation into the next revision =x
> 4. I need some text description to really follow the Differencer code.
I'll make sure this gets in the next revision, but basically:
1. It recurses using midpoint as long as both trees have the resolution to
continue, and are not equal.
2a. If it finds a range that has only one invalid child, it adds that child
range, since that is the smallest possible invalid range contained in the
parent.
2b. Otherwise, if both children are invalid (and since it can't recurse
deeper), the parent range is entirely invalid, and recursion keeps rolling up
until 2a is met.
> 5. The Hashable class is confusing.
You're right: I definitely should have worried more about overloading the word
"hash": I'll see what I can do about this one.
> 6. The repair logic is missing in Differencer.
The repair logic is the one FIXME for this ticket: CASSANDRA-520 deals with
implementing the actual repair logic.
Thanks again for taking time out to look at this!
was (Author: stuhood):
Thanks a lot for the review Jun! I'll respond to #1 in a separate comment,
since it is a design issue that still needs a lot of discussion.
> 2
> 2.1 In MerkleTree.Node.insert, why do you increment the depth of the left
> child even when the node doesn't split?
Node.insert() is only used during split operations (perhaps it is misnamed...
but this is not quite a traditional B-Tree). The child to the left is the node
that contains the Token we were splitting on, and whenever we split a range we
increment its depth to indicate how far it is from being a complete (0,0]
range. I'll add a comment to this effect.
> 2.2 In the same function, if the node does split, where is the code to shrink
> the children list in the splitted node to half?
Node.insert() uses List.subList() and List.clear() inline to take half of its
neighbor's children.
> 2.3 In the same function, do you have to keep calling invalidate during
> insertion?
The original design assumed that the tree was going to live for a while in
memory, and be maintained between repair sessions, so the split operation is
intended to be used on a tree that might be partially valid. We might be able
to have the initial building of the tree skip this check somehow, but I don't
think the 'hash == null? hash = null;' check is too intensive.
> 3
> 3.1 In Validator.add, there is comment about generating a new range. However,
> no code does that.
The "private MerkleTree.TreeRangeIterator ranges" variable is an iterator
generated by the MerkleTree: it iterates in order over all of the ranges in the
tree that have null hashes.
> 3.2 In TreeRange.validateHelper [...] Why do you have to compute multiple
> hash values recursively?
The reasoning here is that a MerkleTree is supposed to be a sparse
representation of a 'perfect/complete' binary tree. Each leaf of the perfect
tree represents a hashed range, and each inner node represents a binary hash of
its two children. The perfect tree is of depth "hashdepth", so when
validateHelper() reaches the maximum/hashdepth, it is in one of the leaves of
the perfect tree, and rows are hashed sequentially there. If a single call to
validate() starts in a TreeRange that is at depth == maxdepth, then what is
stored in the MkT.Node is the value of a perfect leaf, otherwise, the MkT.Node
is storing the value of a perfect inner node.
I'll probably copy and past this exact explanation into the next revision =x
> 4. I need some text description to really follow the Differencer code.
I'll make sure this gets in the next revision, but basically:
1. It recurses using midpoint as long as both trees have the resolution to
continue, and are not equal.
2a. If it finds a range that has only one invalid child, it adds that child
range, since that is the smallest possible invalid range contained in the
parent.
2b. Otherwise, if both children are invalid (and since it can't recurse
deeper), the parent range is entirely invalid, and recursion keeps rolling up
until 2a is met.
> 5. The Hashable class is confusing.
You're right: I definitely should have worried more about overloading the word
"hash": I'll see what I can do about this one.
> 6. The repair logic is missing in Differencer.
The repair logic is the one FIXME for this ticket: CASSANDRA-520 deals with
implementing the actual repair logic.
Thanks again for taking time out to look at this!
> Proactive repair
> ----------------
>
> Key: CASSANDRA-193
> URL: https://issues.apache.org/jira/browse/CASSANDRA-193
> Project: Cassandra
> Issue Type: New Feature
> Components: Core
> Reporter: Jonathan Ellis
> Assignee: Stu Hood
> Fix For: 0.5
>
> Attachments: 193-1-tree-preparation.diff, 193-2-tree.diff,
> 193-3-aes-preparation.diff, 193-4-aes.diff, mktree-and-binary-tree.png
>
>
> Currently cassandra supports "read repair," i.e., lazy repair when a read is
> done. This is better than nothing but is not sufficient for some cases (e.g.
> catastrophic node failure where you need to rebuild all of a node's data on a
> new machine).
> Dynamo uses merkle trees here. This is harder for Cassandra given the CF
> data model but I suppose we could just hash the serialized CF value.
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.