[ https://issues.apache.org/jira/browse/CASSANDRA-8911?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
C. Scott Andreas updated CASSANDRA-8911:
----------------------------------------
    Component/s: Repair

> Consider Mutation-based Repairs
> -------------------------------
>
>                 Key: CASSANDRA-8911
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-8911
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: Repair
>            Reporter: Tyler Hobbs
>            Priority: Major
>              Labels: repair
>             Fix For: 4.x
>
>
> We should consider a mutation-based repair to replace the existing streaming
> repair.  While we're at it, we could do away with a lot of the complexity
> around merkle trees.
> I have not planned this out in detail, but here's roughly what I'm thinking:
>  * Instead of building an entire merkle tree up front, just send the "leaves"
> one-by-one.  Instead of dealing with token ranges, make the leaves primary
> key ranges.  The PK ranges would need to be contiguous, so that the start of
> each range would match the end of the previous range.  (The first and last
> leaves would need to be open-ended on one end of the PK range.)  This would be
> similar to doing a read with paging.
>  * Once one page of data is read, compute a hash of it and send it to the
> other replicas along with the PK range that it covers and a row count.
>  * When the replicas receive the hash, they perform a read over the same PK
> range (using a LIMIT of the row count + 1) and compare hashes (unless the row
> counts don't match, in which case the hash comparison can be skipped).
>  * If there is a mismatch, the replica will send a mutation covering that
> page's worth of data (ignoring the row count this time) to the source node.
> Here are the advantages that I can think of:
>  * With the current repair behavior of streaming, vnode-enabled clusters may
> need to stream hundreds of small SSTables.  This results in increased
> compaction load on the receiving node.  With the mutation-based approach,
> memtables would naturally merge these.
>  * It's simple to throttle.  For example, you could give a number of rows/sec
> that should be repaired.
>  * It's easy to see what PK range has been repaired so far.  This could make
> it simpler to resume a repair that fails midway.
>  * Inconsistencies start to be repaired almost right away.
>  * Less special code (?)
>  * Wide partitions are no longer a problem.
> There are a few problems I can think of:
>  * Counters.  I don't know if this can be made safe, or if they need to be
> skipped.
>  * To support incremental repair, we need to be able to read from only
> repaired sstables.  Probably not too difficult to do.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
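
The paged hash exchange described in the issue can be illustrated with a small, self-contained sketch. This is not Cassandra code and does not use any Cassandra API. Replicas are modelled as plain dicts mapping a primary key to a (timestamp, value) pair, merging is simple last-write-wins, and every function name (page_digest, read_range, pages, check_page, apply_mutation, repair) is hypothetical. As in the description, only the source node receives mutations.

```python
import hashlib


def page_digest(rows):
    """Hash a page of (pk, (timestamp, value)) rows in primary-key order."""
    h = hashlib.sha256()
    for row in rows:
        h.update(repr(row).encode("utf-8"))
    return h.hexdigest()


def read_range(replica, start, end, limit=None):
    """Read rows with start < pk <= end in pk order; None means open-ended."""
    rows = [
        (pk, replica[pk])
        for pk in sorted(replica)
        if (start is None or pk > start) and (end is None or pk <= end)
    ]
    return rows if limit is None else rows[:limit]


def pages(replica, page_size):
    """Yield contiguous (start, end, rows) pages covering the whole key space."""
    ordered = read_range(replica, None, None)  # snapshot taken before repair
    start, i = None, 0
    while True:
        rows = ordered[i:i + page_size]
        i += page_size
        last = i >= len(ordered)
        # The last page is open-ended; the first page starts open-ended.
        end = None if last else rows[-1][0]
        yield start, end, rows
        if last:
            return
        start = end  # next range starts where this one ended


def check_page(peer, start, end, digest, row_count):
    """Peer side: return None on a match, else the peer's rows for the range."""
    # LIMIT row_count + 1 is enough to detect that the peer holds extra rows.
    local = read_range(peer, start, end, limit=row_count + 1)
    if len(local) == row_count and page_digest(local) == digest:
        return None
    # Mismatch: send everything in the range, ignoring the row count.
    return read_range(peer, start, end)


def apply_mutation(replica, rows):
    """Merge rows into replica, keeping the newest timestamp per key."""
    for pk, cell in rows:
        if pk not in replica or cell[0] > replica[pk][0]:
            replica[pk] = cell


def repair(source, peer, page_size=2):
    """Walk the source page by page and pull mismatching pages from the peer."""
    repaired_pages = 0
    for start, end, rows in pages(source, page_size):
        mutation = check_page(peer, start, end, page_digest(rows), len(rows))
        if mutation is not None:
            apply_mutation(source, mutation)
            repaired_pages += 1
    return repaired_pages
```

For example, with source = {1: (10, "a"), 2: (10, "b"), 4: (10, "d")} and a peer that additionally holds key 3 and a newer value for key 2, repair(source, peer) reports two mismatching pages and leaves the source equal to the peer; a second run reports none. The number of rows already walked also gives a natural point at which to throttle or resume, as the issue suggests.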