[
https://issues.apache.org/jira/browse/CASSANDRA-2815?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Sylvain Lebresne resolved CASSANDRA-2815.
-----------------------------------------
Resolution: Duplicate
Marking as duplicate of CASSANDRA-2816 as the patch there fixes that issue too.
> Bad timing in repair can transfer data it is not supposed to
> ------------------------------------------------------------
>
> Key: CASSANDRA-2815
> URL: https://issues.apache.org/jira/browse/CASSANDRA-2815
> Project: Cassandra
> Issue Type: Bug
> Components: Core
> Reporter: Sylvain Lebresne
> Assignee: Sylvain Lebresne
> Labels: repair
> Fix For: 0.8.2
>
>
> The core of the problem is that the sstables used to construct a merkle tree
> are not necessarily the same as the ones for which streaming is initiated.
> This is usually not a big deal: newly compacted sstables don't matter since
> the data hasn't changed. Newly flushed data (between the start of the
> validation compaction and the start of the streaming) is a little more
> problematic, even though one could argue that on average this won't be too
> much data.
> But there can be a more problematic scenario: suppose a 3-node cluster with
> RF=3, all the data on node3 is removed and then repair is started on node3.
> Also suppose the cluster has two CFs, cf1 and cf2, sharing the data evenly.
> Node3 will request the usual merkle trees; let's pretend validation
> compaction is single-threaded to keep things simple.
> What can happen is the following:
> # node3 computes its merkle trees for all requests very quickly, having no
> data. It will thus wait on the other nodes' trees (its own trees saying "I
> have no data").
> # node1 starts computing its tree for, say, cf1 (queuing the computation for
> cf2). In the meantime, node2 may start computing the tree for cf2 (queuing
> the one for cf1).
> # when node1 completes its first tree, it sends it to node3. Node3 receives
> it, compares it to its own tree and initiates the transfer of all the data
> for cf1 from node1 to itself.
> # not too long after that, node2 completes its first tree, the one for cf2,
> and sends it to node3. Based on it, the transfer of all the data for cf2
> from node2 to node3 starts.
> # An arbitrarily long time after that (validation compaction can take time),
> node2 will finish its second tree (the one for cf1) and send it to node3.
> Node3 will compare it to its own (empty) tree and decide that all the ranges
> should be repaired with node2 for cf1. The problem is that by the time that
> happens, the transfer of cf1 from node1 may already be done, or at least
> partly done. For that reason, node3 will start streaming all this data to
> node2.
> # For the same reasons, node3 may end up transferring all or part of cf2 to
> node1.
> So the problem is that even though node1 and node2 may have been perfectly
> consistent at the beginning, we will end up streaming a potentially huge
> amount of data to them.
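> To make the ordering concrete, here is a minimal, self-contained replay of
> that timeline (plain Java with made-up names, not actual Cassandra code):
> node3's comparison for cf1 against node2 uses a tree built while node3 was
> still empty, but by the time it runs, node3 already holds the data streamed
> from node1 and ships it back.
>
>     import java.util.HashSet;
>     import java.util.Set;
>
>     // Toy replay of the race above (hypothetical names, not Cassandra code).
>     // node3 starts empty; node1 and node2 both hold all of cf1's data.
>     public class RepairRaceSketch
>     {
>         public static void main(String[] args)
>         {
>             Set<String> node1Cf1 = new HashSet<>(Set.of("rowA", "rowB"));
>             Set<String> node2Cf1 = new HashSet<>(Set.of("rowA", "rowB"));
>             Set<String> node3Cf1 = new HashSet<>();          // all data lost
>
>             // node3's "tree" for cf1 is built now, against its empty state.
>             Set<String> node3TreeSnapshot = new HashSet<>(node3Cf1);
>
>             // Step 3: node1's tree arrives first; the diff triggers streaming
>             // of all of cf1 from node1 to node3.
>             node3Cf1.addAll(node1Cf1);
>
>             // Step 5: node2's tree for cf1 arrives much later. It is compared
>             // against node3's stale snapshot, so every range still looks
>             // different, and node3 streams its freshly received cf1 data back
>             // to node2 even though node2 already had it.
>             if (!node3TreeSnapshot.equals(node2Cf1))
>                 System.out.println("needless stream node3 -> node2: " + node3Cf1);
>         }
>     }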
> I think this affects both 0.7 and 0.8, though differently, since compaction
> multi-threading and the fact that 0.8 differentiates the ranges change the
> timing.
> One solution (in a way, the theoretically right solution) would be to grab
> references to the sstables we use for the validation compaction, and only
> initiate streaming on those sstables. However, in 0.8 where compaction is
> multi-threaded, this would mean retaining compacted sstables longer than
> necessary. This is also a bit complicated and in particular would require a
> network protocol change (because a streaming request message would have to
> contain some information allowing the receiver to decide which set of
> sstables to use).
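> For what that first option would roughly look like, a minimal sketch of the
> bookkeeping it implies (made-up types and names, SSTableRef just stands in
> for whatever reference-counted handle the real code would use):
>
>     import java.util.Set;
>     import java.util.UUID;
>     import java.util.concurrent.ConcurrentHashMap;
>
>     // Sketch of "pin the validated sstables": the exact set seen by the
>     // merkle tree is kept alive and is the only set streaming may touch.
>     public class PinnedValidationSketch
>     {
>         // One entry per repair session: the sstables the merkle tree saw.
>         private final ConcurrentHashMap<UUID, Set<SSTableRef>> pinned = new ConcurrentHashMap<>();
>
>         public void onValidationStart(UUID sessionId, Set<SSTableRef> sstables)
>         {
>             sstables.forEach(SSTableRef::acquire);   // keep them past compaction
>             pinned.put(sessionId, sstables);
>         }
>
>         // Streaming is restricted to the pinned set, not to whatever the CF
>         // holds right now; the session id would have to travel in the stream
>         // request, which is the protocol change mentioned above.
>         public Set<SSTableRef> sstablesToStream(UUID sessionId)
>         {
>             return pinned.get(sessionId);
>         }
>
>         public void onSessionDone(UUID sessionId)
>         {
>             Set<SSTableRef> refs = pinned.remove(sessionId);
>             if (refs != null)
>                 refs.forEach(SSTableRef::release);   // compacted sstables can now go
>         }
>
>         // Minimal stand-in for a reference-counted sstable handle.
>         public interface SSTableRef
>         {
>             void acquire();
>             void release();
>         }
>     }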
> A maybe simpler solution would be to have the node coordinating the repair
> wait for the trees from all the remotes (only for a given column family and
> range, obviously) before starting any streaming.
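> The shape of that second option, again only as a sketch with hypothetical
> names: trees for one (column family, range) are buffered until every
> expected endpoint has answered, and only then are differences computed and
> streams requested, so no transfer can race with a comparison still to come.
>
>     import java.util.Map;
>     import java.util.Set;
>     import java.util.concurrent.ConcurrentHashMap;
>
>     // Buffer all trees for one (CF, range) before starting any streaming.
>     public class WaitForAllTreesSketch
>     {
>         private final Set<String> expected;                  // endpoints we asked
>         private final Map<String, Object> received = new ConcurrentHashMap<>();
>
>         public WaitForAllTreesSketch(Set<String> expectedEndpoints)
>         {
>             this.expected = expectedEndpoints;
>         }
>
>         // tree payload left as Object here; its real type doesn't matter for
>         // the ordering argument.
>         public synchronized void onTreeReceived(String endpoint, Object tree)
>         {
>             received.put(endpoint, tree);
>             if (received.keySet().containsAll(expected))
>                 startAllStreams();                           // only now, once per CF/range
>         }
>
>         private void startAllStreams()
>         {
>             for (Map.Entry<String, Object> e : received.entrySet())
>                 System.out.println("would diff and stream with " + e.getKey());
>         }
>     }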
--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira