[
https://issues.apache.org/jira/browse/CASSANDRA-193?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12717914#action_12717914
]
Jonathan Ellis commented on CASSANDRA-193:
------------------------------------------
To start with the good news: one thing which may seem on the face of it to be a
problem, isn't really. That is, how do you get nodes replicating a given token
range to agree where to freeze or snapshot the data set to be repaired, in the
face of continuing updates? The answer is, you don't; it doesn't matter. If
we repair a few columnfamilies that don't really need it (because one of the
nodes was just a bit slower to process an update than the other), that's no big
deal. We accept that and move on.
The bad news is, I don't see a clever solution for performing broad-based
repair against the Memtable/SSTable model similar to Merkle trees for
Dynamo/bdb. (Of course, that is no guarantee that no such solution exists. :)
There are several difficulties. (In passing, it's worth noting that Bigtable
sidesteps these issues by writing both commit logs and sstables to GFS, which
takes care of durability. Here we have to do more work in exchange for a
simpler model and better performance on ordinary reads and writes.)
One difficulty lies in how data in one SSTable may be pre-empted by data in
another. Because of this, any hash-based "summary" of a row computed against
one sstable may be obsoleted by newer data for that row in another sstable.
For some workloads, particularly ones in which most keys are updated
infrequently, caching such a summary in the sstable or index file might still
be useful, but it should be kept in mind that in the worst case these
summaries will just be wasted effort.
(I think it would be a mistake to address this by forcing a major compaction --
combining all sstables for the columnfamily into one -- as a prerequisite to
repair. Reading and rewriting _all_ the data for _each_ repair is a
significant amount of extra I/O.)
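To make that first difficulty concrete, here is a minimal illustration -- not
Cassandra code, the class names are made up -- of why a per-sstable row hash
is unreliable: the row a client actually sees is the timestamp-wise merge of
its columns across all sstables, so a newer column in a different file
silently invalidates any hash cached next to an older version of the row.

    import java.util.HashMap;
    import java.util.Map;

    class ColumnVersion {
        final long timestamp;
        final byte[] value;
        ColumnVersion(long timestamp, byte[] value) {
            this.timestamp = timestamp;
            this.value = value;
        }
    }

    class RowMergeExample {
        // Merge the per-sstable column maps for one row key, keeping the
        // newest version of each column.  Only this merged result is safe
        // to hash; a hash stored with any single sstable can be stale.
        static Map<String, ColumnVersion> merge(Iterable<Map<String, ColumnVersion>> perSSTableColumns) {
            Map<String, ColumnVersion> merged = new HashMap<>();
            for (Map<String, ColumnVersion> columns : perSSTableColumns) {
                for (Map.Entry<String, ColumnVersion> e : columns.entrySet()) {
                    ColumnVersion existing = merged.get(e.getKey());
                    if (existing == null || e.getValue().timestamp > existing.timestamp)
                        merged.put(e.getKey(), e.getValue());
                }
            }
            return merged;
        }
    }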
Another is that token regions do not correspond 1:1 to sstables, because each
node is responsible for N token regions -- the regions for which it is the
primary, secondary, tertiary, etc. repository -- all intermingled in the
SSTable files. So any precomputation would need to be done separately N times.
Finally, we can't assume that sstable contents, or even just the row keys,
will fit into the heap, which limits the kinds of in-memory structures we can
build.
So from what I can see, I do not think it is worth the complexity to attempt
to cache per-row hashes or summaries of the sstable data in the sstable or
index files.
So the approach I propose is simply to iterate through the key space on a
per-CF basis, compute a hash, and repair if there is a mismatch. The code to
iterate keys is already there (for the compaction code) and so is the code to
compute hashes and repair if a mismatch is found (for read repair). I think it
will be worth flushing the current memtable first to avoid having to take a
read lock on it.
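A rough sketch of that loop, just to pin down its shape -- the ReplicaClient,
hashRow, and repairRow names below are stand-ins for the existing key
iteration, hashing, and read-repair machinery, not real classes:

    import java.security.MessageDigest;
    import java.security.NoSuchAlgorithmException;
    import java.util.Arrays;
    import java.util.Iterator;

    interface ReplicaClient {
        // Hash of the same row as computed by the replica we are comparing against.
        byte[] hashFor(String columnFamily, String key);
    }

    class NaiveRepairPass {
        void repairColumnFamily(String columnFamily,
                                Iterator<String> keys,      // compaction-style key iterator
                                ReplicaClient replica) throws NoSuchAlgorithmException {
            while (keys.hasNext()) {
                String key = keys.next();
                byte[] localHash = hashRow(columnFamily, key);
                byte[] remoteHash = replica.hashFor(columnFamily, key);
                if (!Arrays.equals(localHash, remoteHash))
                    repairRow(columnFamily, key);           // reuse the read-repair path
            }
        }

        byte[] hashRow(String columnFamily, String key) throws NoSuchAlgorithmException {
            MessageDigest digest = MessageDigest.getInstance("MD5");
            // In reality the serialized, merged row would be fed into the digest;
            // hashing only the key here keeps the sketch self-contained.
            digest.update(key.getBytes());
            return digest.digest();
        }

        void repairRow(String columnFamily, String key) {
            // Placeholder: push the newer version to whichever side is behind.
        }
    }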
Enhancements could include building a merkle tree from each batch of hashes to
minimize round trips -- although, unfortunately, I suspect round trips will not
be the bottleneck for Cassandra compared to the hash computation itself -- and
fixing the compaction and hash computation code to iterate through the columns
in a CF rather than deserializing each ColumnFamily in its entirety. These
could definitely be split into separate tickets.
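For the merkle tree enhancement, the idea would be roughly the following
(again just a sketch with made-up names): fold a batch of per-row hashes into
a binary hash tree, so replicas can exchange a single root and only descend
into subtrees whose roots disagree.

    import java.security.MessageDigest;
    import java.security.NoSuchAlgorithmException;
    import java.util.ArrayList;
    import java.util.List;

    class HashTreeSketch {
        // Repeatedly pair up hashes and hash the pairs until one root remains.
        static byte[] rootOf(List<byte[]> rowHashes) throws NoSuchAlgorithmException {
            List<byte[]> level = new ArrayList<>(rowHashes);
            while (level.size() > 1) {
                List<byte[]> next = new ArrayList<>();
                for (int i = 0; i < level.size(); i += 2) {
                    MessageDigest md = MessageDigest.getInstance("MD5");
                    md.update(level.get(i));
                    if (i + 1 < level.size())
                        md.update(level.get(i + 1));  // an unpaired trailing node is hashed alone
                    next.add(md.digest());
                }
                level = next;
            }
            return level.isEmpty() ? new byte[0] : level.get(0);
        }
    }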
> Proactive repair
> ----------------
>
> Key: CASSANDRA-193
> URL: https://issues.apache.org/jira/browse/CASSANDRA-193
> Project: Cassandra
> Issue Type: New Feature
> Reporter: Jonathan Ellis
> Fix For: 0.4
>
>
> Currently cassandra supports "read repair," i.e., lazy repair when a read is
> done. This is better than nothing but is not sufficient for some cases (e.g.
> catastrophic node failure where you need to rebuild all of a node's data on a
> new machine).
> Dynamo uses merkle trees here. This is harder for Cassandra given the CF
> data model but I suppose we could just hash the serialized CF value.