[ https://issues.apache.org/jira/browse/CASSANDRA-193?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12717914#action_12717914 ]

Jonathan Ellis commented on CASSANDRA-193:
------------------------------------------

To start with the good news: one thing which may seem on the face of it to be a 
problem, isn't really.  That is, how do you get nodes replicating a given token 
range to agree where to freeze or snapshot the data set to be repaired, in the 
face of continuing updates?  The answer is, you don't; it doesn't matter.  If 
we repair a few columnfamilies that don't really need it (because one of the 
nodes was just a bit slower to process an update than the other), that's no big 
deal.  We accept that and move on.

The bad news is, I don't see a clever solution for performing broad-based 
repair against the Memtable/SSTable model similar to Merkle trees for 
Dynamo/bdb.  (Of course, that is no guarantee that none such exists. :)

There are several difficulties.  (In passing, it's worth noting that Bigtable 
sidesteps these issues by writing both commit logs and sstables to GFS, which 
takes care of durability.  Here we have to do more work in exchange for a 
simpler model and better performance on ordinary reads and writes.)

One difficulty lies in how data in one SSTable may be pre-empted by data in 
another.  Because of this, any hash-based "summary" of a row in one sstable may 
be obsoleted by a newer version of that row in another.  For some workloads, 
particularly ones in which most keys are updated infrequently, caching such a 
summary in the sstable or index file might still be useful, but it should be 
kept in mind that in the worst case these will just be wasted effort.

(I think it would be a mistake to address this by forcing a major compaction -- 
combining all sstables for the columnfamily into one -- as a prerequisite to 
repair.  Reading and rewriting _all_ the data for _each_ repair is a 
significant amount of extra I/O.)
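
To make the pre-emption problem concrete, here is a rough sketch in plain Java 
(the row representation is hypothetical, not the real SSTable classes): a hash 
cached when a row was written to one sstable no longer describes the live row 
once a later sstable updates any of its columns.

    import java.nio.charset.StandardCharsets;
    import java.security.MessageDigest;
    import java.util.Arrays;
    import java.util.Map;
    import java.util.SortedMap;
    import java.util.TreeMap;

    public class StaleRowHashDemo {
        // Hash a row's columns in sorted order (a "row" here is just column -> value).
        static byte[] hashRow(SortedMap<String, String> row) throws Exception {
            MessageDigest md = MessageDigest.getInstance("MD5");
            for (Map.Entry<String, String> e : row.entrySet()) {
                md.update(e.getKey().getBytes(StandardCharsets.UTF_8));
                md.update(e.getValue().getBytes(StandardCharsets.UTF_8));
            }
            return md.digest();
        }

        public static void main(String[] args) throws Exception {
            SortedMap<String, String> older = new TreeMap<String, String>();  // row as written in an older sstable
            older.put("name", "alice");
            older.put("city", "nyc");

            SortedMap<String, String> newer = new TreeMap<String, String>();  // later update to the same row key
            newer.put("city", "sf");

            SortedMap<String, String> merged = new TreeMap<String, String>(older);
            merged.putAll(newer);  // newer columns pre-empt older ones

            // Neither cached per-sstable hash matches the hash of the merged row:
            System.out.println(Arrays.equals(hashRow(older), hashRow(merged)));  // false
            System.out.println(Arrays.equals(hashRow(newer), hashRow(merged)));  // false
        }
    }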

Another is that token regions do not correspond 1:1 to sstables, because each 
node is responsible for N token regions -- the regions for which it is the 
primary, secondary, tertiary, etc. repository -- all intermingled in the 
SSTable files.  So any precomputation would need to be done separately N times.

Finally, we can't assume that an sstable's data, or even just its row keys, 
will fit into the heap, which limits the kind of in-memory structures we can 
build.

So, from the above, I do not think it is worth the complexity to attempt to 
cache per-row hashes or summaries of the sstable data in the sstable or index 
files.

So the approach I propose is simply to iterate through the key space on a 
per-CF basis, compute a hash for each row, and repair if there is a mismatch.  
The code to iterate keys is already there (from the compaction code) and so is 
the code to compute hashes and repair mismatches (from read repair).  I think 
it will be worth flushing the current memtable first to avoid having to take a 
read lock on it.
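
In outline, the loop would look something like the sketch below.  Every name 
here (flushMemtable, keyIterator, hashForKey, repairRow) is a placeholder 
standing in for the existing compaction-iteration and read-repair code paths, 
not an actual API:

    import java.util.Arrays;
    import java.util.Iterator;

    // Hypothetical sketch of the proposed per-CF repair loop.
    public class NaiveRepair {

        interface Replica {
            void flushMemtable(String cf);            // flush so we never read-lock the memtable
            Iterator<String> keyIterator(String cf);  // same key iteration the compaction code does
            byte[] hashForKey(String cf, String key); // hash of the serialized CF for this key
            void repairRow(String cf, String key, Replica other);  // reuse read-repair machinery
        }

        static void repairColumnFamily(String cf, Replica local, Replica remote) {
            local.flushMemtable(cf);
            Iterator<String> keys = local.keyIterator(cf);
            while (keys.hasNext()) {
                String key = keys.next();
                // one round trip per key in this naive form; batching is an enhancement
                if (!Arrays.equals(local.hashForKey(cf, key), remote.hashForKey(cf, key))) {
                    local.repairRow(cf, key, remote);
                }
            }
        }
    }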

Enhancements could include building a merkle tree from each batch of hashes to 
minimize round trips -- although unfortunately I think round trips are not 
going to be the bottleneck for Cassandra compared to the hash computation -- 
and fixing the compaction and hash computation code to iterate through columns 
in a CF rather than deserializing each ColumnFamily in its entirety.  These 
could definitely be split into separate tickets.
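
For the merkle tree piece, the idea is just to roll each batch of row hashes 
up into a single root; replicas exchange roots and only fall back to comparing 
individual rows when the roots for a batch disagree.  A minimal sketch (hash 
algorithm chosen arbitrarily):

    import java.security.MessageDigest;
    import java.util.ArrayList;
    import java.util.List;

    // Hypothetical sketch: reduce a batch of per-row hashes to one root hash.
    public class HashBatchTree {

        // Assumes at least one leaf hash in the batch.
        static byte[] root(List<byte[]> leaves) throws Exception {
            List<byte[]> level = new ArrayList<byte[]>(leaves);
            while (level.size() > 1) {
                List<byte[]> parents = new ArrayList<byte[]>();
                for (int i = 0; i < level.size(); i += 2) {
                    MessageDigest md = MessageDigest.getInstance("MD5");
                    md.update(level.get(i));
                    if (i + 1 < level.size()) {
                        md.update(level.get(i + 1));  // pair siblings; an odd node is promoted alone
                    }
                    parents.add(md.digest());
                }
                level = parents;
            }
            return level.get(0);
        }
    }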


> Proactive repair
> ----------------
>
>                 Key: CASSANDRA-193
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-193
>             Project: Cassandra
>          Issue Type: New Feature
>            Reporter: Jonathan Ellis
>             Fix For: 0.4
>
>
> Currently Cassandra supports "read repair," i.e., lazy repair when a read is 
> done.  This is better than nothing but is not sufficient for some cases (e.g. 
> catastrophic node failure where you need to rebuild all of a node's data on a 
> new machine).
> Dynamo uses merkle trees here.  This is harder for Cassandra given the CF 
> data model but I suppose we could just hash the serialized CF value.
