Bad timing in repair can transfer data it is not supposed to
------------------------------------------------------------
Key: CASSANDRA-2815
URL: https://issues.apache.org/jira/browse/CASSANDRA-2815
Project: Cassandra
Issue Type: Bug
Components: Core
Affects Versions: 0.8.0, 0.7.0
Reporter: Sylvain Lebresne
Assignee: Sylvain Lebresne
The core of the problem is that the sstables used to construct a merkle tree
are not necessarily the same as the ones from which streaming is later
initiated. This is usually not a big deal: newly compacted sstables don't
matter since the data hasn't changed. Newly flushed data (between the start of
the validation compaction and the start of the streaming) is a little more
problematic, even though one could argue that on average this won't be too
much data.
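To make the mismatch concrete, here is a small self-contained sketch (plain
Java with made-up names, not Cassandra code) of a column family whose live
sstable set is resolved once for the validation compaction and again, later,
for streaming; any flush in between ends up in the streamed set without ever
having been part of the tree.
{code:java}
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class ValidationVsStreamingSketch {
    // stand-in for a column family's live sstables
    private final Set<String> liveSSTables = new HashSet<>(List.of("sst-1", "sst-2"));

    Set<String> snapshotForValidation() {
        // the merkle tree is built from whatever is live right now
        return new HashSet<>(liveSSTables);
    }

    void flush(String newSSTable) {
        // a memtable flush between validation and streaming adds new data
        liveSSTables.add(newSSTable);
    }

    Set<String> resolveForStreaming() {
        // streaming re-resolves the live set, which may have changed
        return new HashSet<>(liveSSTables);
    }

    public static void main(String[] args) {
        ValidationVsStreamingSketch cf = new ValidationVsStreamingSketch();
        Set<String> validated = cf.snapshotForValidation();
        cf.flush("sst-3");                       // happens while trees are exchanged
        Set<String> streamed = cf.resolveForStreaming();
        streamed.removeAll(validated);
        System.out.println("streamed but never validated: " + streamed); // [sst-3]
    }
}
{code}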
But there is a more problematic scenario: suppose a 3-node cluster with RF=3,
all the data on node3 is removed and then repair is started on node3. Also
suppose the cluster has two CFs, cf1 and cf2, sharing the data evenly. Node3
will request the usual merkle trees; let's pretend validation compaction is
mono-threaded to simplify.
What can happen is the following:
# node3 computes its merkle trees for all requests very quickly, having no
data. It will thus wait on the other nodes' trees (its own trees saying "I have
no data").
# node1 starts computing its tree for, say, cf1 (queuing the computation for
cf2). In the meantime, node2 may start computing the tree for cf2 (queuing the
one for cf1).
# when node1 completes its first tree, it sends it to node3. Node3 receives
it, compares it to its own tree and initiates transfer of all the data for cf1
from node1 to itself.
# not too long after that, node2 completes its first tree, the one for cf2,
and sends it to node3. Based on it, transfer of all the data for cf2 from node2
to node3 starts.
# An arbitrarily long time after that (validation compaction can take time),
node2 will finish its second tree (the one for cf1) and send it back to node3.
Node3 will compare it to its own (empty) tree and decide that all the ranges
should be repaired with node2 for cf1. The problem is that when that happens,
the transfer of cf1 from node1 may have been done already, or at least partly
done. For that reason, node3 will start streaming all this data to node2.
# For the same reasons, node3 may end up transferring all or part of cf2 to
node1.
So the problem is that, even though at the beginning node1 and node2 may be
perfectly consistent, we will end up streaming a potentially huge amount of
data to them.
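As a toy re-enactment of the timeline above (again with invented names, not
Cassandra's internals): node3 handles each remote tree as it arrives, compares
it against the empty tree it computed at the start, and streams from whatever
sstables it holds at that moment, including data that only arrived through the
earlier incoming streams.
{code:java}
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public class RepairTimingSketch {
    // data node3 actually holds per CF; it grows as incoming streams complete
    static final Map<String, Set<String>> node3Data = new HashMap<>();
    // node3's merkle trees, frozen at the start of the repair: "I have no data"
    static final Map<String, Boolean> node3TreeEmpty = Map.of("cf1", true, "cf2", true);

    static void onRemoteTree(String remote, String cf) {
        if (node3TreeEmpty.get(cf)) {
            // the trees differ over the full range -> repair streams both ways,
            // reading from the sstables that exist *now*, not from the ones the
            // (empty) tree was computed on
            Set<String> outgoing = node3Data.getOrDefault(cf, Set.of());
            System.out.println("node3 -> " + remote + " for " + cf + ": " + outgoing);
            System.out.println(remote + " -> node3 for " + cf + ": all of " + remote + "'s data");
            node3Data.computeIfAbsent(cf, k -> new HashSet<>()).add(cf + "-data-from-" + remote);
        }
    }

    public static void main(String[] args) {
        onRemoteTree("node1", "cf1"); // step 3: node3 has nothing yet, only pulls cf1
        onRemoteTree("node2", "cf2"); // step 4: same for cf2
        onRemoteTree("node2", "cf1"); // step 5: node3 now pushes node1's cf1 data to node2
        onRemoteTree("node1", "cf2"); // step 6: and node2's cf2 data to node1
    }
}
{code}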
I think this affects both 0.7 and 0.8, though differently, because compaction
multi-threading and the fact that 0.8 differentiates the ranges make the
timing different.
One solution (in a way, the theoretically right solution) would be to grab
references to the sstables we use for the validation compaction, and only
initiate streaming on those sstables. However, in 0.8 where compaction is
multi-threaded, this would mean retaining compacted sstables longer than
necessary. This is also a bit complicated and in particular would require a
network protocol change (because a streaming request message would have to
contain some information allowing the receiving node to decide which set of
sstables to use).
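Roughly, that first option amounts to something like the following, assuming a
simple reference count on sstables (the SSTableRef type and its acquire/release
methods are invented for illustration, not Cassandra's actual API): repair pins
the exact sstables the validation compaction read, streaming later reads only
that pinned set, and a compacted-away sstable can only be deleted once repair
releases it, which is the retention cost mentioned above.
{code:java}
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

public class PinnedValidationSketch {
    static final class SSTableRef {
        final String name;
        final AtomicInteger refs = new AtomicInteger(1); // 1 = owned by the live set
        SSTableRef(String name) { this.name = name; }
        void acquire() { refs.incrementAndGet(); }
        void release() {
            // only once both the live set and the repair session let go
            // can the file actually be deleted
            if (refs.decrementAndGet() == 0) System.out.println("delete " + name);
        }
    }

    public static void main(String[] args) {
        // sstables that were alive when the merkle tree was computed
        List<SSTableRef> validated = List.of(new SSTableRef("sst-1"), new SSTableRef("sst-2"));

        // the repair session pins them for its whole duration
        validated.forEach(SSTableRef::acquire);

        // a compaction may replace them in the meantime and drop the live
        // set's reference; the pinned files stay on disk until repair is done
        validated.forEach(SSTableRef::release);

        // streaming reads only the pinned set, then unpins it
        validated.forEach(s -> System.out.println("stream " + s.name));
        validated.forEach(SSTableRef::release); // now prints "delete sst-1", "delete sst-2"
    }
}
{code}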
A maybe simpler solution could be to have the node coordinating the repair wait
for the trees of all the remotes (here, obviously, only for a given column
family and range) before starting streaming.
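A minimal sketch of that simpler alternative, assuming the repairing node knows
up front how many trees to expect for a given (column family, range) pair (all
names invented): tree responses are buffered, and differences are only computed
and streaming only started once every expected tree has arrived, so nothing can
be streamed in between two comparisons against the same local tree.
{code:java}
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

public class WaitForAllTreesSketch {
    private final Set<String> expectedReplicas;
    private final Map<String, Object> receivedTrees = new HashMap<>();

    WaitForAllTreesSketch(Set<String> expectedReplicas) {
        this.expectedReplicas = expectedReplicas;
    }

    synchronized void onTreeResponse(String replica, Object tree) {
        receivedTrees.put(replica, tree);
        // defer any streaming until every tree for this cf/range has arrived,
        // so all pairwise differences are computed against the same state
        if (receivedTrees.keySet().containsAll(expectedReplicas)) {
            computeDifferencesAndStream();
        }
    }

    private void computeDifferencesAndStream() {
        System.out.println("all trees received, starting streaming for this cf/range");
    }

    public static void main(String[] args) {
        WaitForAllTreesSketch session = new WaitForAllTreesSketch(Set.of("node1", "node2", "node3"));
        session.onTreeResponse("node3", new Object()); // the local tree
        session.onTreeResponse("node1", new Object());
        session.onTreeResponse("node2", new Object()); // last tree triggers streaming
    }
}
{code}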