[ https://issues.apache.org/jira/browse/CASSANDRA-12991?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15721857#comment-15721857 ]

Benjamin Roth edited comment on CASSANDRA-12991 at 12/5/16 10:35 AM:
---------------------------------------------------------------------

Your assumption is partially correct. The race condition is that the flushing is 
not synced among nodes.
So to continue the example from above:

t = 10000: 
Repair starts, triggers validations, validation timestamp is tv = t = 10000
Node A starts validation, sstable s1 is flushed to disk.
t = 10001:
Coordinator node (e.g. node A) receives mutation m1
Mutation m1 is applied to memtable at Node A
t = 10002:
Mutation m1 arrives at Node B and is applied to memtable
t = 10003:
Node B starts validation, sstable s2 is flushed to disk

s1 does NOT contain m1
s2 DOES contain m1

This creates different Merkle trees on nodes A and B due to the differences between s1 and s2.

I didn't mean to filter out s2 but to filter out m1 when iterating over s2 
during validation compaction because m1.timestamp > tv.

A mutation's timestamp could even be < tv and the mutation could still be in s1 
but not in s2, because it could have been delayed on node B due to a full 
mutation stage queue, network delays or whatever. So it would be safer to filter 
out mutations with timestamp > (tv - write_timeout)
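
To illustrate the filtering rule I have in mind, here is a minimal, self-contained 
Java sketch. It is NOT actual Cassandra code: Cell, filterForValidation, 
validationTimestamp and writeTimeout are made-up names, and in Cassandra the check 
would have to happen while iterating the sstables during validation compaction.

import java.util.List;
import java.util.stream.Collectors;

public class ValidationFilterSketch {

    // Made-up cell representation: just a name and a write timestamp.
    record Cell(String name, long timestamp) {}

    // Drop everything written after (tv - write_timeout), so mutations that
    // may still be in flight between replicas cannot skew the Merkle trees.
    static List<Cell> filterForValidation(List<Cell> cells,
                                          long validationTimestamp,
                                          long writeTimeout) {
        long cutoff = validationTimestamp - writeTimeout;
        return cells.stream()
                    .filter(c -> c.timestamp() <= cutoff)
                    .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        long tv = 10000;          // validation timestamp from the example above
        long writeTimeout = 2;    // pretend write_timeout, for illustration only
        List<Cell> s2 = List.of(
                new Cell("old", 9000),    // present in s1 and s2
                new Cell("m1", 10001));   // only in s2 -> must be dropped
        // Only "old" survives, so node A and node B hash the same data.
        System.out.println(filterForValidation(s2, tv, writeTimeout));
    }
}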

Was this more comprehensible?



> Inter-node race condition in validation compaction
> --------------------------------------------------
>
>                 Key: CASSANDRA-12991
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-12991
>             Project: Cassandra
>          Issue Type: Improvement
>            Reporter: Benjamin Roth
>            Priority: Minor
>
> Problem:
> When a validation compaction is triggered by a repair, it may happen that, due 
> to in-flight mutations, the Merkle trees differ although the data is actually 
> consistent.
> Example:
> t = 10000: 
> Repair starts, triggers validations
> Node A starts validation
> t = 10001:
> Mutation arrives at Node A
> t = 10002:
> Mutation arrives at Node B
> t = 10003:
> Node B starts validation
> Hashes of nodes A and B will differ, but the data is consistent from the view 
> (think of it like a snapshot) at t = 10000.
> Impact:
> Unnecessary streaming happens. This may not have a big impact on low-traffic 
> CFs or partitions, but on high-traffic CFs and maybe very big partitions it 
> may have a bigger impact and is a waste of resources.
> Possible solution:
> Build hashes based upon a snapshot timestamp.
> This requires SSTables created after that timestamp to be filtered when doing 
> a validation compaction:
> - Cells with timestamp > snapshot time have to be removed
> - Tombstone range markers have to be handled
>  - Bounds have to be removed if delete timestamp > snapshot time
>  - Boundary markers have to be either changed to a bound or completely 
> removed, depending on whether the start, the end, or both are affected
> Probably this is a known behaviour. Have there been any discussions about 
> this in the past? I did not find a matching issue, so I created this one.
> I am happy about any feedback, whatsoever.


