[jira] Issue Comment Edited: (CASSANDRA-193) Proactive repair

Stu Hood (JIRA) Wed, 11 Nov 2009 20:54:15 -0800

    [ 
https://issues.apache.org/jira/browse/CASSANDRA-193?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12776600#action_12776600
 ]


Stu Hood edited comment on CASSANDRA-193 at 11/12/09 4:52 AM:
--------------------------------------------------------------

Thanks a lot for the review Jun! I'll respond to #1 in a separate comment, 
since it is a design issue that still needs a lot of discussion.

> 2
> 2.1 In MerkleTree.Node.insert, why do you increment the depth of the left 
> child even when the node doesn't split?
Node.insert() is only used during split operations (perhaps it is misnamed... 
but this is not quite a traditional B-Tree). The child to the left is the node 
that contains the Token we were splitting on, and whenever we split a range we 
increment its depth to indicate how far it is from being a complete (0,0] 
range. I'll add a comment to this effect.

> 2.2 In the same function, if the node does split, where is the code to shrink 
> the children list in the splitted node to half?
Node.insert() uses List.subList() and List.clear() inline to take half of its 
neighbor's children.

> 2.3 In the same function, do you have to keep calling invalidate during 
> insertion?
The original design assumed that the tree was going to live for a while in 
memory, and be maintained between repair sessions, so the split operation is 
intended to be used on a tree that might be partially valid. We might be able 
to have the initial building of the tree skip this check somehow, but I don't 
think the 'hash == null? hash = null;' check is too intensive.

> 3
> 3.1 In Validator.add, there is comment about generating a new range. However, 
> no code does that.
The "private MerkleTree.TreeRangeIterator ranges" variable is an iterator 
generated by the MerkleTree: it iterates in order over all of the ranges in the 
tree that have null hashes.

> 3.2 In TreeRange.validateHelper [...] Why do you have to compute multiple 
> hash values recursively?
The reasoning here is that a MerkleTree is supposed to be a sparse 
representation of a 'perfect/complete' binary tree. Each leaf of the perfect 
tree represents a hashed range, and each inner node represents a binary hash of 
its two children. The perfect tree is of depth "hashdepth", so when 
validateHelper() reaches the maximum/hashdepth, it is in one of the leaves of 
the perfect tree, and rows are hashed sequentially there. <EDIT comment="Ignore 
everything in in this tag">If a single call to validate() starts in a TreeRange 
that is at depth == maxdepth, then what is stored in the MkT.Node is the value 
of a perfect leaf, otherwise, the MkT.Node is storing the value of a perfect 
inner node.</EDIT>

I'll probably copy and past this exact explanation into the next revision =x

> 4. I need some text description to really follow the Differencer code.
I'll make sure this gets in the next revision, but basically:
 1. It recurses using midpoint as long as both trees have the resolution to 
continue, and are not equal.
 2a. If it finds a range that has only one invalid child, it adds that child 
range, since that is the smallest possible invalid range contained in the 
parent.
 2b. Otherwise, if both children are invalid (and since it can't recurse 
deeper), the parent range is entirely invalid, and recursion keeps rolling up 
until 2a is met.

> 5. The Hashable class is confusing.
You're right: I definitely should have worried more about overloading the word 
"hash": I'll see what I can do about this one.

> 6. The repair logic is missing in Differencer.
The repair logic is the one FIXME for this ticket: CASSANDRA-520 deals with 
implementing the actual repair logic.

Thanks again for taking time out to look at this! 

      was (Author: stuhood):
    Thanks a lot for the review Jun! I'll respond to #1 in a separate comment, 
since it is a design issue that still needs a lot of discussion.

> 2
> 2.1 In MerkleTree.Node.insert, why do you increment the depth of the left 
> child even when the node doesn't split?
Node.insert() is only used during split operations (perhaps it is misnamed... 
but this is not quite a traditional B-Tree). The child to the left is the node 
that contains the Token we were splitting on, and whenever we split a range we 
increment its depth to indicate how far it is from being a complete (0,0] 
range. I'll add a comment to this effect.

> 2.2 In the same function, if the node does split, where is the code to shrink 
> the children list in the splitted node to half?
Node.insert() uses List.subList() and List.clear() inline to take half of its 
neighbor's children.

> 2.3 In the same function, do you have to keep calling invalidate during 
> insertion?
The original design assumed that the tree was going to live for a while in 
memory, and be maintained between repair sessions, so the split operation is 
intended to be used on a tree that might be partially valid. We might be able 
to have the initial building of the tree skip this check somehow, but I don't 
think the 'hash == null? hash = null;' check is too intensive.

> 3
> 3.1 In Validator.add, there is comment about generating a new range. However, 
> no code does that.
The "private MerkleTree.TreeRangeIterator ranges" variable is an iterator 
generated by the MerkleTree: it iterates in order over all of the ranges in the 
tree that have null hashes.

> 3.2 In TreeRange.validateHelper [...] Why do you have to compute multiple 
> hash values recursively?
The reasoning here is that a MerkleTree is supposed to be a sparse 
representation of a 'perfect/complete' binary tree. Each leaf of the perfect 
tree represents a hashed range, and each inner node represents a binary hash of 
its two children. The perfect tree is of depth "hashdepth", so when 
validateHelper() reaches the maximum/hashdepth, it is in one of the leaves of 
the perfect tree, and rows are hashed sequentially there. If a single call to 
validate() starts in a TreeRange that is at depth == maxdepth, then what is 
stored in the MkT.Node is the value of a perfect leaf, otherwise, the MkT.Node 
is storing the value of a perfect inner node.

I'll probably copy and past this exact explanation into the next revision =x

> 4. I need some text description to really follow the Differencer code.
I'll make sure this gets in the next revision, but basically:
 1. It recurses using midpoint as long as both trees have the resolution to 
continue, and are not equal.
 2a. If it finds a range that has only one invalid child, it adds that child 
range, since that is the smallest possible invalid range contained in the 
parent.
 2b. Otherwise, if both children are invalid (and since it can't recurse 
deeper), the parent range is entirely invalid, and recursion keeps rolling up 
until 2a is met.

> 5. The Hashable class is confusing.
You're right: I definitely should have worried more about overloading the word 
"hash": I'll see what I can do about this one.

> 6. The repair logic is missing in Differencer.
The repair logic is the one FIXME for this ticket: CASSANDRA-520 deals with 
implementing the actual repair logic.

Thanks again for taking time out to look at this! 
  
> Proactive repair
> ----------------
>
>                 Key: CASSANDRA-193
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-193
>             Project: Cassandra
>          Issue Type: New Feature
>          Components: Core
>            Reporter: Jonathan Ellis
>            Assignee: Stu Hood
>             Fix For: 0.5
>
>         Attachments: 193-1-tree-preparation.diff, 193-2-tree.diff, 
> 193-3-aes-preparation.diff, 193-4-aes.diff, mktree-and-binary-tree.png
>
>
> Currently cassandra supports "read repair," i.e., lazy repair when a read is 
> done.  This is better than nothing but is not sufficient for some cases (e.g. 
> catastrophic node failure where you need to rebuild all of a node's data on a 
> new machine).
> Dynamo uses merkle trees here.  This is harder for Cassandra given the CF 
> data model but I suppose we could just hash the serialized CF value.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Issue Comment Edited: (CASSANDRA-193) Proactive repair

Reply via email to