[ 
https://issues.apache.org/jira/browse/CASSANDRA-193?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12781641#action_12781641
 ] 

Stu Hood edited comment on CASSANDRA-193 at 11/23/09 9:46 PM:
--------------------------------------------------------------

> see whether we lose any precision by simply using bitwise XOR to combine row 
> hashes
Doing the math on this one was simpler than actually testing it: using a 
commutative hash function like XOR means that the number of possible inputs 
goes from being a Permutation of the leaves to being a Combination of the 
leaves (since a set of leaves in any order are equal).

For MD5 you have:
(2^127)! / (2^127 - 2^16)! == number of possible permutations of 2^16 hashes of 
length 127
And for XOR:
(2^127)! / (2^16)! * (2^127 - 2^16)! == number of possible combinations of 2^16 
hashes of length 127

I wouldn't think it would be possible to notice such a small difference. 
Nonetheless, I think it is a moot point:

> Using XOR further simplifies how hash values for internal nodes are computed.
I don't think that using XOR is significantly more efficient. Because XOR is 
associative, it is possible to hash arbitrary sequential leaves together, which 
is impossible with MD5, but we never do this: all of our comparison happens on 
the boundaries defined by IPartitioner.midpoint(), so the tree structure 
containing predefined/precomputed values can contain any value required for 
comparison.

      was (Author: stuhood):
    > see whether we lose any precision by simply using bitwise XOR to combine 
row hashes
Doing the math on this one was simpler than actually testing it: using a 
commutative hash function like XOR means that the number of possible inputs 
goes from being a Permutation of the leaves to being a Combination of the 
leaves (since a set of leaves in any order are equal).

For XOR you have:
(2^127)! / (2^127 - 2^16)! == number of possible permutations of 2^16 hashes of 
length 127
And for MD5:
(2^127)! / (2^16)! * (2^127 - 2^16)! == number of possible combinations of 2^16 
hashes of length 127

I wouldn't think it would be possible to notice such a small difference. 
Nonetheless, I think it is a moot point:

> Using XOR further simplifies how hash values for internal nodes are computed.
I don't think that using XOR is significantly more efficient. Because XOR is 
associative, it is possible to hash arbitrary sequential leaves together, which 
is impossible with MD5, but we never do this: all of our comparison happens on 
the boundaries defined by IPartitioner.midpoint(), so the tree structure 
containing predefined/precomputed values can contain any value required for 
comparison.
  
> Proactive repair
> ----------------
>
>                 Key: CASSANDRA-193
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-193
>             Project: Cassandra
>          Issue Type: New Feature
>          Components: Core
>            Reporter: Jonathan Ellis
>            Assignee: Stu Hood
>             Fix For: 0.5
>
>         Attachments: 193-1-tree-preparation.diff, 
> 193-1-tree-preparation.diff, 193-2-tree.diff, 193-2-tree.diff, 
> 193-3-aes-preparation.diff, 193-3-aes-preparation.diff, 193-4-aes.diff, 
> 193-4-aes.diff, 193-5-manual-repair.diff, 193-6-inverted-filter.diff, 
> 193-6-inverted-filter.diff, 193-7-disable-caching-and-fix-minimum-token.diff, 
> 193-breakdown.txt, mktree-and-binary-tree.png
>
>
> Currently cassandra supports "read repair," i.e., lazy repair when a read is 
> done.  This is better than nothing but is not sufficient for some cases (e.g. 
> catastrophic node failure where you need to rebuild all of a node's data on a 
> new machine).
> Dynamo uses merkle trees here.  This is harder for Cassandra given the CF 
> data model but I suppose we could just hash the serialized CF value.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

Reply via email to