Benjamin Roth created CASSANDRA-12991:
-----------------------------------------

             Summary: Inter-node race condition in validation compaction
                 Key: CASSANDRA-12991
                 URL: https://issues.apache.org/jira/browse/CASSANDRA-12991
             Project: Cassandra
          Issue Type: Improvement
            Reporter: Benjamin Roth
            Priority: Minor


Problem:
When a validation compaction is triggered by a repair, in-flight mutations can 
cause the Merkle trees to differ even though the data is consistent.

Example:
t = 10000: 
Repair starts validation
Node A starts validation
t = 10001:
Mutation arrives at Node A
t = 10002:
Mutation arrives at Node B
t = 10003:
Node B starts validation

The hashes of nodes A and B will differ, but the data is consistent as of a view 
(think of it as a snapshot) taken at t = 10000.
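
The timeline above can be replayed in a few lines. This is only a toy sketch, 
not Cassandra code: each node is assumed to hash whatever it has applied at the 
moment its validation starts, a plain SHA-256 digest stands in for the Merkle 
tree, and the names digest, cells_at, node_a and node_b are made up for 
illustration.

```python
import hashlib

def digest(cells):
    """Hash cells in a deterministic order (stand-in for a Merkle tree)."""
    h = hashlib.sha256()
    for key, value, write_ts in sorted(cells):
        h.update(f"{key}={value}@{write_ts};".encode())
    return h.hexdigest()

def cells_at(arrivals, validation_start):
    """Cells a node has applied by the time its validation starts."""
    return [cell for arrived, cell in arrivals if arrived <= validation_start]

# Cells are (key, value, write timestamp); values are hypothetical.
base = ("k1", "old", 9000)
mutation = ("k2", "new", 10001)

# (arrival time on the node, cell), following the example timeline.
node_a = [(0, base), (10001, mutation)]
node_b = [(0, base), (10002, mutation)]

hash_a = digest(cells_at(node_a, 10000))  # Node A validates at t = 10000
hash_b = digest(cells_at(node_b, 10003))  # Node B validates at t = 10003

mismatch = hash_a != hash_b  # True: B hashed the mutation, A did not
```

Both nodes end up holding the same two cells, so the mismatch comes only from 
the different validation start times.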

Impact:
Unnecessary streaming happens. This may not have a big impact on low-traffic CFs 
or small partitions, but on high-traffic CFs and possibly very big partitions it 
may have a bigger impact and is a waste of resources.

Possible solution:
Build hashes based upon a snapshot timestamp.
This requires SSTables created after that timestamp to be filtered when doing a 
validation compaction:
- Cells with timestamp > snapshot time have to be removed
- Range tombstone markers have to be handled
 - Bounds have to be removed if delete timestamp > snapshot time
 - Boundary markers have to be either changed to a bound or completely removed, 
depending on whether the start, the end, or both are affected
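
The filtering rules above could look roughly like this. It is a minimal sketch 
under stated assumptions, not Cassandra's actual classes: Cell, Bound and 
Boundary are simplified stand-ins for cells, range tombstone bound markers and 
boundary markers, and filter_atom / filter_partition are hypothetical helpers.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Cell:
    name: str
    value: str
    timestamp: int

@dataclass(frozen=True)
class Bound:
    position: str
    is_open: bool      # True opens a deleted range, False closes it
    deletion_ts: int

@dataclass(frozen=True)
class Boundary:
    position: str
    close_ts: int      # deletion timestamp of the range ending here
    open_ts: int       # deletion timestamp of the range starting here

def filter_atom(atom, snapshot_ts):
    """Return the atom as seen at snapshot_ts, or None if it must be dropped."""
    if isinstance(atom, Cell):
        return atom if atom.timestamp <= snapshot_ts else None
    if isinstance(atom, Bound):
        return atom if atom.deletion_ts <= snapshot_ts else None
    close_ok = atom.close_ts <= snapshot_ts
    open_ok = atom.open_ts <= snapshot_ts
    if close_ok and open_ok:
        return atom                                         # unaffected
    if close_ok:
        return Bound(atom.position, False, atom.close_ts)   # keep close side only
    if open_ok:
        return Bound(atom.position, True, atom.open_ts)     # keep open side only
    return None                                             # both sides too new

def filter_partition(atoms, snapshot_ts):
    """Apply filter_atom to every atom and drop the removed ones."""
    kept = (filter_atom(a, snapshot_ts) for a in atoms)
    return [a for a in kept if a is not None]

# Hypothetical partition content, filtered at snapshot time 10000.
atoms = [
    Cell("a", "1", 9000),
    Bound("b", True, 9500),
    Boundary("c", 9500, 10001),
    Bound("d", False, 10001),
    Cell("e", "2", 10002),
]
snapshot_view = filter_partition(atoms, 10000)
```

Here the boundary at "c" is reduced to a closing bound for the older deletion, 
while the newer closing bound at "d" and the cell "e" are dropped.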

This is probably known behaviour. Have there been any discussions about this 
in the past? I did not find a matching issue, so I created this one.

I would appreciate any feedback.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
