[
https://issues.apache.org/jira/browse/CASSANDRA-12991?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15723428#comment-15723428
]
Benjamin Roth commented on CASSANDRA-12991:
-------------------------------------------
Absolutely right!
This is why I wrote:
bq. m1.timestamp could even be < tv and still be in s1 but not in s2 because it
could have been blocked on node 2 due to a full mutation stage queue, network
delays or whatever. So it would be more safe to filter mutations out if
m1.timestamp > (tv - write_timeout)
Maybe this was confusing, so to elaborate a bit:
To also avoid the kind of race condition where a mutation whose timestamp is
older than the timestamp of the validation request has not yet arrived at a
node, there has to be a reasonable grace period. I personally would consider,
for example, write_request_timeout_in_ms a reasonable base, maybe plus a fixed
period of a few seconds. If a mutation doesn't make it to a remote node within
that period, it is absolutely ok to count it as a mismatch.
So we have the timestamp at which the validation was requested by the repair
coordinator (tr), a grace period (gp), and a timestamp for the validation
compaction (tc) that filters out all mutations newer than it, where roughly
tc = tr - gp
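In other words (a minimal sketch in Python; the function names and the mutation
shape are illustrative assumptions, not Cassandra code):

```python
# Illustrative sketch only -- not Cassandra's actual implementation.
# tr: timestamp of the coordinator's validation request (ms)
# gp: grace period, e.g. write_request_timeout_in_ms (ms)

def validation_cutoff(tr_ms, grace_period_ms):
    """tc = tr - gp: mutations newer than tc are left out of the merkle tree."""
    return tr_ms - grace_period_ms

def mutations_for_validation(mutations, tc_ms):
    """Keep only mutations with timestamp <= tc, on every replica alike."""
    return [m for m in mutations if m["timestamp"] <= tc_ms]

tc = validation_cutoff(10000, 2000)  # tc = 8000
visible = mutations_for_validation(
    [{"key": "a", "timestamp": 7500},
     {"key": "b", "timestamp": 9000}],
    tc)
# only the mutation with timestamp 7500 is included in the validation
```

Since every replica derives tc from the same tr and gp, an in-flight mutation
is either counted on all replicas or on none, except for mutations delayed
longer than gp, which are legitimately reported as mismatches.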
One could argue that the grace period means the most recent mutations are not
included in the repair, but I'd say this is totally ok because we are talking
about a few seconds, and no repair is executed within a few seconds of an
outage. Normally a repair is either a scheduled task or a manual task after
recovery from a failure, and such a recovery definitely takes more than a few
seconds.
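Applied to the timeline from the issue description quoted below, a shared
cutoff makes both validations hash identical data. A minimal Python sketch
(the flat hash stands in for a real merkle tree; the cell tuples are made up
for illustration):

```python
import hashlib

def merkle_like_hash(cells, cutoff=None):
    """Stand-in for building a validation merkle tree: hash the cells,
    optionally excluding those with timestamp > cutoff."""
    visible = sorted(c for c in cells if cutoff is None or c[0] <= cutoff)
    return hashlib.sha256(repr(visible).encode()).hexdigest()

base = [(9000, "x=1")]        # data both nodes already hold
mutation = (10001, "y=2")     # in flight while the repair starts

# Without a cutoff: node A validates at t=10000 (mutation not yet arrived),
# node B validates at t=10003 (mutation applied) -> the trees differ.
assert merkle_like_hash(base) != merkle_like_hash(base + [mutation])

# With a shared cutoff tc = 10000: both nodes ignore the in-flight mutation,
# so the trees match -- the "snapshot at t=10000" view of the data.
assert merkle_like_hash(base, 10000) == merkle_like_hash(base + [mutation], 10000)
```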
> Inter-node race condition in validation compaction
> --------------------------------------------------
>
> Key: CASSANDRA-12991
> URL: https://issues.apache.org/jira/browse/CASSANDRA-12991
> Project: Cassandra
> Issue Type: Improvement
> Reporter: Benjamin Roth
> Priority: Minor
>
> Problem:
> When a validation compaction is triggered by a repair, it may happen that,
> due to in-flight mutations, the merkle trees differ even though the data is
> actually consistent.
> Example:
> t = 10000:
> Repair starts, triggers validations
> Node A starts validation
> t = 10001:
> Mutation arrives at Node A
> t = 10002:
> Mutation arrives at Node B
> t = 10003:
> Node B starts validation
> The hashes of nodes A and B will differ, but the data is consistent as of
> (think of it like a snapshot) t = 10000.
> Impact:
> Unnecessary streaming happens. This may not have a big impact on low-traffic
> CFs or partitions, but on high-traffic CFs and maybe very big partitions the
> impact is bigger, and it is a waste of resources.
> Possible solution:
> Build hashes based upon a snapshot timestamp.
> This requires SSTables created after that timestamp to be filtered when doing
> a validation compaction:
> - Cells with timestamp > snapshot time have to be removed
> - Tombstone range markers have to be handled
> - Bounds have to be removed if delete timestamp > snapshot time
> - Boundary markers have to be either changed to a bound or removed
> completely, depending on whether the start and/or end are affected
> Probably this is a known behaviour. Have there been any discussions about
> this in the past? I did not find a matching issue, so I created this one.
> I am happy about any feedback, whatsoever.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)