[ https://issues.apache.org/jira/browse/CASSANDRA-8911?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15272917#comment-15272917 ]
Paulo Motta commented on CASSANDRA-8911:
----------------------------------------

bq. not sure here, would be nice to be able to prioritise a given range for a table, say we lose an sstable for example, automatically repairing the range that sstable covered immediately.

bq. It should probably be just a single thread, one table after the other. And then maybe having a priority queue or something with ranges to repair immediately, wdyt Paulo Motta? This prio queue thing might be a bit of gold plating that we could do later.

Sounds good! While we will remove the need for coordination between inter-node repairs (if we replace traditional repairs with this in the future), we will still need some level of management/scheduling for MBR, which we should keep in mind when designing/implementing the repair scheduling/management framework of CASSANDRA-10070, so that framework can also be useful to MBR in the future. With this said, we should probably focus on core MBR functionality here and defer any auto-repair considerations to CASSANDRA-10070, so we can have a single interface for managing repairs (MBR or not).

As for incremental release of this, I agree that we should have a minimum level of confidence that this is promising (at least for some use cases) before releasing it, even if it's only experimental, but throwing it into the wild early will certainly give us important feedback that we can use to validate and improve it. Furthermore, if this proves worthy it will probably live alongside traditional repairs for a long time (maybe forever for counters/DTCS?).
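The "single thread, one table after the other, with a priority queue for urgent ranges" idea above could look roughly like the following. This is a minimal illustration only; {{MbrScheduler}}, {{RepairTask}}, and the numeric token bounds are hypothetical names for this sketch, not Cassandra classes:

```java
import java.util.concurrent.PriorityBlockingQueue;

// Hypothetical sketch of a single-threaded MBR scheduler with a priority
// queue for ranges that should be repaired immediately (e.g. the range a
// lost sstable covered). None of these names exist in Cassandra.
public class MbrScheduler {
    enum Priority { URGENT, NORMAL } // URGENT has lower ordinal, so it sorts first

    static class RepairTask implements Comparable<RepairTask> {
        final String table;
        final long rangeStart, rangeEnd; // simplified numeric token bounds
        final Priority priority;

        RepairTask(String table, long rangeStart, long rangeEnd, Priority priority) {
            this.table = table;
            this.rangeStart = rangeStart;
            this.rangeEnd = rangeEnd;
            this.priority = priority;
        }

        // Order only by priority: urgent ranges jump ahead of the
        // regular one-table-after-the-other rotation.
        @Override
        public int compareTo(RepairTask o) {
            return this.priority.compareTo(o.priority);
        }
    }

    private final PriorityBlockingQueue<RepairTask> queue = new PriorityBlockingQueue<>();

    void submit(RepairTask task) {
        queue.add(task);
    }

    // A single worker thread would call this in a loop, blocking until
    // a task is available and repairing one range at a time.
    RepairTask next() throws InterruptedException {
        return queue.take();
    }
}
```

As noted above, the priority queue part may be gold plating that can wait; a plain FIFO per-table rotation is the simpler starting point.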
Some early bikeshedding here (since it's not ready for review yet): when possible I think we should add new options to the existing repair interfaces rather than creating specific interfaces for MBR (for instance, {{nodetool repair --mutation-based}} instead of {{nodetool mutationbasedrepair}}, {{StorageService.repairAsync}} instead of {{CFS.enableMutationBasedRepair}}), since this is just a different implementation of the same functionality (unless there's something that does not fit well or is not covered by the existing interfaces), and it will also allow existing tools to try out MBR with little change.

> Consider Mutation-based Repairs
> -------------------------------
>
>                 Key: CASSANDRA-8911
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-8911
>             Project: Cassandra
>          Issue Type: Improvement
>            Reporter: Tyler Hobbs
>            Assignee: Marcus Eriksson
>             Fix For: 3.x
>
>
> We should consider a mutation-based repair to replace the existing streaming
> repair. While we're at it, we could do away with a lot of the complexity
> around merkle trees.
> I have not planned this out in detail, but here's roughly what I'm thinking:
> * Instead of building an entire merkle tree up front, just send the "leaves"
> one-by-one. Instead of dealing with token ranges, make the leaves primary
> key ranges. The PK ranges would need to be contiguous, so that the start of
> each range would match the end of the previous range. (The first and last
> leaves would need to be open-ended on one end of the PK range.) This would be
> similar to doing a read with paging.
> * Once one page of data is read, compute a hash of it and send it to the
> other replicas along with the PK range that it covers and a row count.
> * When the replicas receive the hash, they perform a read over the same PK
> range (using a LIMIT of the row count + 1) and compare hashes (unless the row
> counts don't match, in which case this can be skipped).
> * If there is a mismatch, the replica will send a mutation covering that
> page's worth of data (ignoring the row count this time) to the source node.
>
> Here are the advantages that I can think of:
> * With the current repair behavior of streaming, vnode-enabled clusters may
> need to stream hundreds of small SSTables. This results in increased
> compaction load on the receiving node. With the mutation-based approach,
> memtables would naturally merge these.
> * It's simple to throttle. For example, you could give a number of rows/sec
> that should be repaired.
> * It's easy to see what PK range has been repaired so far. This could make
> it simpler to resume a repair that fails midway.
> * Inconsistencies start to be repaired almost right away.
> * Less special code \(?\)
> * Wide partitions are no longer a problem.
>
> There are a few problems I can think of:
> * Counters. I don't know if this can be made safe, or if they need to be
> skipped.
> * To support incremental repair, we need to be able to read from only
> repaired sstables. Probably not too difficult to do.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
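The page-hash exchange described in the issue (hash one page of a contiguous PK range plus its row count, then have each replica re-read the same range with a LIMIT of row count + 1 and compare) can be sketched as follows. This is only an illustration of the protocol's comparison step; {{PageHashRepair}} and its methods are hypothetical names, not Cassandra's repair code, and rows are simplified to strings:

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch of the per-page hash comparison from CASSANDRA-8911.
// Rows are simplified to strings; a real implementation would hash the
// serialized partition data and delimit rows to avoid concatenation ambiguity.
public class PageHashRepair {

    // Coordinator side: hash one "page" of rows covering a contiguous PK range.
    // The coordinator sends this digest, the PK range, and the row count.
    static byte[] hashPage(List<String> rows) throws Exception {
        MessageDigest md = MessageDigest.getInstance("MD5");
        for (String row : rows) {
            md.update(row.getBytes(StandardCharsets.UTF_8));
        }
        return md.digest();
    }

    // Replica side: after re-reading the same PK range with
    // LIMIT coordinatorRowCount + 1, first compare row counts (a count
    // mismatch means the page differs and hashing can be skipped),
    // then compare digests.
    static boolean pageMatches(List<String> replicaRows,
                               int coordinatorRowCount,
                               byte[] coordinatorHash) throws Exception {
        if (replicaRows.size() != coordinatorRowCount) {
            return false; // differing row counts: skip the hash comparison
        }
        return Arrays.equals(hashPage(replicaRows), coordinatorHash);
    }
}
```

On a mismatch, per the last bullet above, the replica would respond with a mutation covering that page's PK range rather than streaming sstables.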