[ https://issues.apache.org/jira/browse/CASSANDRA-12991?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15723428#comment-15723428 ]
Benjamin Roth commented on CASSANDRA-12991:
-------------------------------------------

Absolutely right! This is why I wrote:

bq. m1.timestamp could even be < tv and still be in s1 but not in s2, because it could have been blocked on node 2 due to a full mutation stage queue, network delays or whatever. So it would be safer to filter mutations out if m1.timestamp > (tv - write_timeout)

Maybe this was confusing, so a bit more elaborate: to also avoid the kind of race condition where a mutation whose timestamp is older than the timestamp of the validation request has not yet arrived at a node, there has to be a reasonable grace period. I personally would consider, for example, write_request_timeout_in_ms a reasonable base, maybe plus a fixed period of a few seconds. If a mutation doesn't make it to a remote node within that period, it is absolutely ok to count it as a mismatch.

So we have the timestamp at which the validation was requested by the repair coordinator (tr), a timestamp for the validation compaction (tc) that filters out all mutations after it, and a grace period (gp), where roughly tc = tr - gp.

One could argue that the grace period means the most recent mutations are not included in the repair, but I'd say this is totally ok because we are talking about a few seconds, and no repair is executed within a few seconds of an outage. Normally a repair is a scheduled task, or a manual task after recovery from a failure situation that definitely takes more than a few seconds to recover from.

> Inter-node race condition in validation compaction
> --------------------------------------------------
>
>                 Key: CASSANDRA-12991
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-12991
>             Project: Cassandra
>          Issue Type: Improvement
>            Reporter: Benjamin Roth
>            Priority: Minor
>
> Problem:
> When a validation compaction is triggered by a repair, it may happen that,
> due to in-flight mutations, the merkle trees differ even though the data is
> in fact consistent.
> Example:
> t = 10000:
>   Repair starts, triggers validations
>   Node A starts validation
> t = 10001:
>   Mutation arrives at Node A
> t = 10002:
>   Mutation arrives at Node B
> t = 10003:
>   Node B starts validation
> The hashes of nodes A and B will differ, but the data is consistent as of a
> snapshot-like view at t = 10000.
>
> Impact:
> Unnecessary streaming happens. This may not have a big impact on low-traffic
> CFs or partitions, but on high-traffic CFs, and maybe on very big partitions,
> the impact may be bigger and is a waste of resources.
>
> Possible solution:
> Build hashes based upon a snapshot timestamp.
> This requires data written after that timestamp to be filtered when doing
> a validation compaction:
> - Cells with timestamp > snapshot time have to be removed
> - Tombstone range markers have to be handled:
>   - Bounds have to be removed if the delete timestamp > snapshot time
>   - Boundary markers have to be either changed to a bound or completely
>     removed, depending on whether the start and/or the end is affected
>
> Probably this is a known behaviour. Have there been any discussions about
> this in the past? I did not find a matching issue, so I created this one.
> I am happy about any feedback whatsoever.

-- This message was sent by Atlassian JIRA (v6.3.4#6332)
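To make the proposal above concrete, here is a minimal sketch of the cutoff rule tc = tr - gp and the resulting cell filter. This is purely illustrative: the class and method names (ValidationCutoff, includeInValidation) are hypothetical and not part of Cassandra's actual codebase; timestamps are microseconds, matching Cassandra's cell-timestamp convention.

```java
// Hypothetical sketch of the proposed snapshot-timestamp filtering for
// validation compaction. Not Cassandra's real API.
public class ValidationCutoff
{
    // tc: mutations with a timestamp after this are excluded from the merkle tree
    private final long cutoffMicros;

    /**
     * @param validationRequestMicros tr, when the coordinator requested validation
     * @param gracePeriodMicros       gp, e.g. write_request_timeout_in_ms * 1000
     */
    public ValidationCutoff(long validationRequestMicros, long gracePeriodMicros)
    {
        // tc = tr - gp: anything newer than tc might still be in flight
        // on another replica, so it must not influence the hashes
        this.cutoffMicros = validationRequestMicros - gracePeriodMicros;
    }

    /** A cell is hashed into the validation only if it is at or before tc. */
    public boolean includeInValidation(long cellTimestampMicros)
    {
        return cellTimestampMicros <= cutoffMicros;
    }
}
```

With tr = 10,000,000 µs and gp = 2,000,000 µs, the cutoff is 8,000,000 µs: a cell written at 7,500,000 µs is hashed on every replica, while a cell written at 9,000,000 µs is skipped everywhere, even on replicas it has already reached, so both nodes build their trees from the same snapshot view. The same comparison would apply to the deletion timestamps of tombstone bounds.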