[ 
https://issues.apache.org/jira/browse/CASSANDRA-10728?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15011118#comment-15011118
 ] 

Nadav Har'El commented on CASSANDRA-10728:
------------------------------------------

What if one replica has partition 'a' with value 1 and some timestamp, and a 
second replica has partition 'b' with value 1 and the same timestamp - and 
'a' and 'b' happen to be close enough in their tokens to fall in the same 
Merkle tree range?

I realize this is a very unlikely case (especially considering the need for 
the timestamps to be identical, which I hadn't considered before). But it 
seems possible... For example, consider a contrived use case which uses 
Cassandra to store a large set of keys, where the value of each key is always 
set to "1" (or whatever). Now, at exactly the same time (at millisecond 
resolution, which is Cassandra's default timestamp granularity), two servers 
want to write two different keys "a" and "b" - and because of a network 
partition in the cluster, "a" ends up on one machine and "b" on another - and 
both writes carry the same timestamp (at millisecond resolution, that's not 
completely improbable).
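
A minimal sketch of the collision, assuming the behaviour described in this 
ticket (hypothetical Java, not the actual Cassandra code; partitionDigest() 
is an illustrative stand-in for the digest computed by 
AbstractCompactedRow.update()):

    import java.nio.ByteBuffer;
    import java.security.MessageDigest;
    import java.util.Arrays;

    public class KeylessDigest {
        // Stand-in for the repair digest: hashes only the data bytes
        // (timestamp + value). The partition key parameter is deliberately
        // ignored, mirroring the behaviour this ticket describes.
        static byte[] partitionDigest(String key, long timestamp, byte[] value)
                throws Exception {
            MessageDigest md = MessageDigest.getInstance("MD5");
            md.update(ByteBuffer.allocate(8).putLong(timestamp).array());
            md.update(value);
            return md.digest();
        }

        public static void main(String[] args) throws Exception {
            long ts = 1447891200000L;      // same millisecond on both writers
            byte[] value = "1".getBytes(); // same data on both replicas

            byte[] a = partitionDigest("a", ts, value); // replica 1
            byte[] b = partitionDigest("b", ts, value); // replica 2

            // Prints true: with the key left out, 'a' and 'b' contribute
            // identical hashes, so a Merkle tree range containing them can
            // still match across the two replicas.
            System.out.println(Arrays.equals(a, b));
        }
    }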

> Hash used in repair does not include partition key
> --------------------------------------------------
>
>                 Key: CASSANDRA-10728
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-10728
>             Project: Cassandra
>          Issue Type: Bug
>            Reporter: Nadav Har'El
>            Priority: Minor
>
> When the repair code builds the Merkle tree, it appears to be using 
> AbstractCompactedRow.update() to calculate a partition's hash. This method's 
> documentation states that it calculates a "digest with the data bytes of the 
> row (not including row key or row size)". The code itself seems to agree 
> with this comment.
> However, I believe that not including the row (actually, partition) key in 
> the hash is a mistake: it means that if two nodes have the same data under 
> different keys, repair would not notice the discrepancy. Moreover, if two 
> different keys have their data switched - or have identical data - again 
> this would not be noticed by repair. Actually running into this problem in 
> a real repair is not very likely, but I can imagine seeing it easily in a 
> hypothetical use case where all partitions have exactly the same data and 
> just the partition key matters.
> I am sorry if I'm mistaken and the partition key is actually taken into 
> account in the Merkle tree, but I tried to find evidence of that and 
> failed. Glancing over the code, it almost seems that the key is used: 
> Validator.add() calculates rowHash(), which includes the digest (without 
> the partition key) *and* the key's token. But then the code calls 
> MerkleTree.TreeRange.addHash() on that tuple, and that function 
> conspicuously ignores the token and only uses the digest.
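
A minimal sketch, assuming the behaviour described above, of how 
MerkleTree.TreeRange.addHash() could mix the token into the range hash so 
that the partition key participates in the comparison (RowHash, 
TreeRangeSketch, and the XOR combining below are illustrative assumptions, 
not the actual Cassandra classes or combining scheme):

    import java.nio.ByteBuffer;
    import java.security.MessageDigest;

    class RowHash {
        final long token;     // stand-in for the partition key's token
        final byte[] digest;  // digest over the partition's data bytes
        RowHash(long token, byte[] digest) {
            this.token = token;
            this.digest = digest;
        }
    }

    class TreeRangeSketch {
        private byte[] rangeHash; // running hash for this tree range

        // Re-digest the token together with the data digest, so two
        // partitions with identical data but different keys no longer
        // contribute the same hash to the range.
        void addHash(RowHash entry) throws Exception {
            MessageDigest md = MessageDigest.getInstance("MD5");
            md.update(ByteBuffer.allocate(8).putLong(entry.token).array());
            md.update(entry.digest);
            byte[] combined = md.digest();
            rangeHash = (rangeHash == null) ? combined
                                            : xor(rangeHash, combined);
        }

        private static byte[] xor(byte[] a, byte[] b) {
            byte[] out = new byte[a.length];
            for (int i = 0; i < a.length; i++) {
                out[i] = (byte) (a[i] ^ b[i]);
            }
            return out;
        }
    }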



