[
https://issues.apache.org/jira/browse/CASSANDRA-2815?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Sylvain Lebresne resolved CASSANDRA-2815.
-----------------------------------------
Resolution: Duplicate
Marking as duplicate of CASSANDRA-2816 as the patch there fixes that issue too.
> Bad timing in repair can transfer data it is not supposed to
> ------------------------------------------------------------
>
> Key: CASSANDRA-2815
> URL: https://issues.apache.org/jira/browse/CASSANDRA-2815
> Project: Cassandra
> Issue Type: Bug
> Components: Core
> Reporter: Sylvain Lebresne
> Assignee: Sylvain Lebresne
> Labels: repair
> Fix For: 0.8.2
>
>
> The core of the problem is that the sstables used to construct a merkle tree
> are not necessarily the same as the ones for which streaming is initiated.
> This is usually not a big deal: newly compacted sstables don't matter since
> the data hasn't changed. Newly flushed data (between the start of the
> validation compaction and the start of the streaming) is a little more
> problematic, even though one could argue that on average this won't be too
> much data.
> But there can be a more problematic scenario: suppose a 3-node cluster with
> RF=3, all the data on node3 is removed and then repair is started on node3.
> Also suppose the cluster has two CFs, cf1 and cf2, sharing the data evenly.
> Node3 will request the usual merkle trees; let's pretend validation
> compaction is single-threaded to keep things simple.
> What can happen is the following:
> # node3 computes its merkle trees for all requests very quickly, having no
> data. It will thus wait on the other nodes' trees (its own trees saying "I
> have no data").
> # node1 starts computing its tree for, say, cf1 (queuing the computation for
> cf2). In the meantime, node2 may start computing the tree for cf2 (queuing
> the one for cf1).
> # when node1 completes its first tree, it sends it to node3. Node3 receives
> it, compares it to its own tree and initiates the transfer of all the data
> for cf1 from node1 to itself.
> # not too long after that, node2 completes its first tree, the one for cf2,
> and sends it to node3. Based on it, the transfer of all the data for cf2
> from node2 to node3 starts.
> # An arbitrarily long time after that (validation compaction can take time),
> node2 will finish its second tree (the one for cf1) and send it to node3.
> Node3 will compare it to its own (empty) tree and decide that all the ranges
> should be repaired with node2 for cf1. The problem is that by the time that
> happens, the transfer of cf1 from node1 may already be done, or at least
> partly done. For that reason, node3 will start streaming all this data to
> node2.
> # For the same reasons, node3 may end up transferring all or part of cf2 to
> node1.
> So the problem is that even though node1 and node2 may have been perfectly
> consistent at the beginning, we will end up streaming a potentially huge
> amount of data to them.
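> To make the ordering concrete, here is a minimal, self-contained replay of
> that timeline (plain Java with made-up names, not actual Cassandra code):
> node3's comparison for cf1 against node2 uses a tree built while node3 was
> still empty, but by the time it runs, node3 already holds the data streamed
> from node1 and ships it back.
>
>     import java.util.HashSet;
>     import java.util.Set;
>
>     // Toy replay of the race above (hypothetical names, not Cassandra code).
>     // node3 starts empty; node1 and node2 both hold all of cf1's data.
>     public class RepairRaceSketch
>     {
>         public static void main(String[] args)
>         {
>             Set<String> node1Cf1 = new HashSet<>(Set.of("rowA", "rowB"));
>             Set<String> node2Cf1 = new HashSet<>(Set.of("rowA", "rowB"));
>             Set<String> node3Cf1 = new HashSet<>();          // all data lost
>
>             // node3's "tree" for cf1 is built now, against its empty state.
>             Set<String> node3TreeSnapshot = new HashSet<>(node3Cf1);
>
>             // Step 3: node1's tree arrives first; the diff triggers streaming
>             // of all of cf1 from node1 to node3.
>             node3Cf1.addAll(node1Cf1);
>
>             // Step 5: node2's tree for cf1 arrives much later. It is compared
>             // against node3's stale snapshot, so every range still looks
>             // different, and node3 streams its freshly received cf1 data back
>             // to node2 even though node2 already had it.
>             if (!node3TreeSnapshot.equals(node2Cf1))
>                 System.out.println("needless stream node3 -> node2: " + node3Cf1);
>         }
>     }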
> I think this affects both 0.7 and 0.8, though differently, since compaction
> multi-threading and the fact that 0.8 differentiates the ranges change the
> timing.
> One solution (in a way, the theoretically right solution) would be to grab
> references to the sstables we use for the validation compaction, and only
> initiate streaming on those sstables. However, in 0.8 where compaction is
> multi-threaded, this would mean retaining compacted sstables longer than
> necessary. This is also a bit complicated and in particular would require a
> network protocol change (because a streaming request message would have to
> contain some information allowing the receiver to decide which set of
> sstables to use).
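> For what that first option would roughly look like, a minimal sketch of the
> bookkeeping it implies (made-up types and names, SSTableRef just stands in
> for whatever reference-counted handle the real code would use):
>
>     import java.util.Set;
>     import java.util.UUID;
>     import java.util.concurrent.ConcurrentHashMap;
>
>     // Sketch of "pin the validated sstables": the exact set seen by the
>     // merkle tree is kept alive and is the only set streaming may touch.
>     public class PinnedValidationSketch
>     {
>         // One entry per repair session: the sstables the merkle tree saw.
>         private final ConcurrentHashMap<UUID, Set<SSTableRef>> pinned = new ConcurrentHashMap<>();
>
>         public void onValidationStart(UUID sessionId, Set<SSTableRef> sstables)
>         {
>             sstables.forEach(SSTableRef::acquire);   // keep them past compaction
>             pinned.put(sessionId, sstables);
>         }
>
>         // Streaming is restricted to the pinned set, not to whatever the CF
>         // holds right now; the session id would have to travel in the stream
>         // request, which is the protocol change mentioned above.
>         public Set<SSTableRef> sstablesToStream(UUID sessionId)
>         {
>             return pinned.get(sessionId);
>         }
>
>         public void onSessionDone(UUID sessionId)
>         {
>             Set<SSTableRef> refs = pinned.remove(sessionId);
>             if (refs != null)
>                 refs.forEach(SSTableRef::release);   // compacted sstables can now go
>         }
>
>         // Minimal stand-in for a reference-counted sstable handle.
>         public interface SSTableRef
>         {
>             void acquire();
>             void release();
>         }
>     }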
> A maybe simpler solution would be to have the node coordinating the repair
> wait for the trees from all the remotes (only for a given column family and
> range, obviously) before starting any streaming.
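> The shape of that second option, again only as a sketch with hypothetical
> names: trees for one (column family, range) are buffered until every
> expected endpoint has answered, and only then are differences computed and
> streams requested, so no transfer can race with a comparison still to come.
>
>     import java.util.Map;
>     import java.util.Set;
>     import java.util.concurrent.ConcurrentHashMap;
>
>     // Buffer all trees for one (CF, range) before starting any streaming.
>     public class WaitForAllTreesSketch
>     {
>         private final Set<String> expected;                  // endpoints we asked
>         private final Map<String, Object> received = new ConcurrentHashMap<>();
>
>         public WaitForAllTreesSketch(Set<String> expectedEndpoints)
>         {
>             this.expected = expectedEndpoints;
>         }
>
>         // tree payload left as Object here; its real type doesn't matter for
>         // the ordering argument.
>         public synchronized void onTreeReceived(String endpoint, Object tree)
>         {
>             received.put(endpoint, tree);
>             if (received.keySet().containsAll(expected))
>                 startAllStreams();                           // only now, once per CF/range
>         }
>
>         private void startAllStreams()
>         {
>             for (Map.Entry<String, Object> e : received.entrySet())
>                 System.out.println("would diff and stream with " + e.getKey());
>         }
>     }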
--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira