[ 
https://issues.apache.org/jira/browse/CASSANDRA-10728?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15011068#comment-15011068
 ] 

Yuki Morishita edited comment on CASSANDRA-10728 at 11/18/15 2:33 PM:
----------------------------------------------------------------------

A _partition key range_ is assigned to each node of the Merkle Tree, and we 
compare hashes to see whether there is a mismatch within that range. So the 
partition key is, in a sense, part of the Merkle Tree. We do this, instead of 
comparing individual partitions, to reduce the amount of data exchanged and the 
time spent comparing.
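
As a rough illustration of range-based comparison, here is a minimal Python sketch. It is not Cassandra's code: it models only the leaf level of a tree, and the names (leaf_hashes, mismatched_ranges), the integer tokens, SHA-256, and the XOR combiner are all assumptions made for the example.

```python
import hashlib

def leaf_index(token, boundaries):
    """Return the index of the (left, right] range that contains token."""
    for i in range(len(boundaries) - 1):
        if boundaries[i] < token <= boundaries[i + 1]:
            return i
    raise ValueError("token %r is outside the covered ranges" % token)

def leaf_hashes(partitions, boundaries):
    """Fold each partition's data digest into the leaf covering its token.

    Assumption: digests are combined with XOR, and only the data (not the
    key) is hashed. The key only decides which leaf the digest lands in.
    """
    leaves = [bytes(32)] * (len(boundaries) - 1)
    for token, value in partitions.items():
        i = leaf_index(token, boundaries)
        digest = hashlib.sha256(value.encode("utf-8")).digest()
        leaves[i] = bytes(x ^ y for x, y in zip(leaves[i], digest))
    return leaves

def mismatched_ranges(replica_a, replica_b, boundaries):
    """Return the (left, right] ranges whose leaf hashes differ."""
    hashes_a = leaf_hashes(replica_a, boundaries)
    hashes_b = leaf_hashes(replica_b, boundaries)
    return [
        (boundaries[i], boundaries[i + 1])
        for i in range(len(hashes_a))
        if hashes_a[i] != hashes_b[i]
    ]

# Two leaves: (0, 2] and (2, 4]. Only the partition with token 3 differs.
boundaries = [0, 2, 4]
replica_a = {1: "a", 2: "b", 3: "c"}
replica_b = {1: "a", 2: "b", 3: "x"}
out_of_sync = mismatched_ranges(replica_a, replica_b, boundaries)
```

Only the hashes per range are compared, so the replicas would need to exchange data for the range (2, 4] alone rather than for every partition.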

bq. use case where all partitions have exactly the same data and just the 
partition key matters.

Say we have partition 1 with value 'a' and partition 2 with value 'b' on one 
replica, and partition 1 with value 'b' and partition 2 with value 'a' on the 
other, all at the same timestamp. We will have an identical hash in the Merkle 
Tree node covering the partition range, say, (0, 2]. But since the timestamps on 
the replicas are the same, we cannot tell which is the correct value anyway.
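
The swapped-value case can be reproduced with a small Python sketch. This is an illustration under stated assumptions, not the repair implementation: range_hash is a hypothetical helper, SHA-256 stands in for the real digest, and the per-partition digests are assumed to be combined with XOR (an order-independent combiner). The include_key flag is likewise hypothetical and models the change the reporter proposes.

```python
import hashlib

def digest(payload):
    """Hash a string payload; SHA-256 is an assumption for this sketch."""
    return hashlib.sha256(payload.encode("utf-8")).digest()

def xor_bytes(a, b):
    """XOR two equal-length byte strings."""
    return bytes(x ^ y for x, y in zip(a, b))

def range_hash(partitions, include_key=False):
    """Combine the digests of every partition that falls in one range.

    With include_key=False only the data is hashed, as in the issue
    description. With include_key=True the key is mixed into the digest.
    """
    acc = bytes(32)
    for key, value in partitions.items():
        payload = "%s:%s" % (key, value) if include_key else value
        acc = xor_bytes(acc, digest(payload))
    return acc

# Both replicas hold partitions 1 and 2 in the range (0, 2], values swapped.
replica_a = {1: "a", 2: "b"}
replica_b = {1: "b", 2: "a"}

same_without_key = range_hash(replica_a) == range_hash(replica_b)
same_with_key = (
    range_hash(replica_a, include_key=True)
    == range_hash(replica_b, include_key=True)
)
```

Under these assumptions same_without_key is true, because the same two data digests are XORed on both replicas, while same_with_key is false. Detecting the mismatch would still not tell repair which replica holds the correct value when the timestamps are equal.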


was (Author: yukim):
_Partition key range_ is assigned to each Merkle Tree's node, and we compare 
hash to see if there is mismatch within the range. So partition key is somewhat 
part of Merkle Tree. We do this in order to reduce the size of data to exchange 
and time to compare the difference, instead of comparing individual partition.


> Hash used in repair does not include partition key
> --------------------------------------------------
>
>                 Key: CASSANDRA-10728
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-10728
>             Project: Cassandra
>          Issue Type: Bug
>            Reporter: Nadav Har'El
>            Priority: Minor
>
> When the repair code builds the Merkle Tree, it appears to be using 
> AbstractCompactedRow.update() to calculate a partition's hash. This method's 
> documentation states that it calculates a "digest with the data bytes of the 
> row (not including row key or row size).". The code itself seems to agree 
> with this comment.
> However, I believe that not including the row (actually, partition) key in 
> the hash function is a mistake: This means that if two nodes have the same 
> data but different key, repair would not notice this discrepancy. Moreover, 
> if two different keys have their data switched - or have the same data - 
> again this would not be noticed by repair. Actually running across this 
> problem in a real repair is not very likely, but I can imagine seeing it 
> easily in a hypothetical use case where all partitions have exactly the same 
> data and just the partition key matters.
> I am sorry if I'm mistaken and the partition key is actually taken into 
> account in the Merkle tree, but I tried to find evidence that it does and 
> failed. Glancing over the code, it almost seems that it does use the key: 
> Validator.add() calculates rowHash() which includes the digest (without the 
> partition key) *and* the key's token. But then, the code calls 
> MerkleTree.TreeRange.addHash() on that tuple, and that function conspicuously 
> ignores the token, and only uses the digest.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
