[
https://issues.apache.org/jira/browse/CASSANDRA-8911?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15253574#comment-15253574
]
Marcus Eriksson commented on CASSANDRA-8911:
--------------------------------------------
WIP branch for this is here:
https://github.com/krummas/cassandra/commits/marcuse/8911
It mostly follows what [~thobbs] outlined above:
* The repairing node pages through its local data with page size =
{{WINDOW_SIZE}}, [calculating a
hash|https://github.com/krummas/cassandra/blob/marcuse/8911/src/java/org/apache/cassandra/repair/mutation/MBRService.java#L149-L155]
for each page and sending that hash to its replicas. We figure out the
{{start}} and {{end}} "keys" (partition key + clustering) that bound the range
the remote nodes should hash and compare. The repairing node sends a
[RepairPage|https://github.com/krummas/cassandra/blob/marcuse/8911/src/java/org/apache/cassandra/repair/mutation/MBRRepairPage.java#L47-L52]
containing the needed information.
* The replicas [read
up|https://github.com/krummas/cassandra/blob/marcuse/8911/src/java/org/apache/cassandra/repair/mutation/MBRVerbHandler.java#L58]
the local data between the {{start}}/{{end}} keys above with a limit of
{{WINDOW_SIZE * 2}}
** If the hashes match, we reply that the data matched
** If we hit the limit when reading within the {{start}}/{{end}}, we consider
this a "huge" response and we handle that separately - we reply to the
repairing node that we have many rows between {{start}}/{{end}}, and the
repairing node will page back the data from that node. This can happen if the
repairing node has lost an sstable for example.
** If the hashes don't match, we reply with the data and the repairing node
will diff its data within the window with the remote data and only write the
differences to the memtable
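The exchange above can be sketched roughly as follows. This is a simplified, standalone model, not the code in the branch: rows are plain {{(key, value, timestamp)}} tuples, the hash is SHA-256 over their repr, the {{start}}/{{end}} bounds are assumed to be inclusive, the diff is a plain newest-timestamp-wins comparison, and the names {{page_hash}}, {{replica_response}} and {{rows_to_write}} are made up for illustration.

```python
import hashlib

WINDOW_SIZE = 4  # hypothetical page size, chosen only for this sketch


def page_hash(rows):
    """Hash one page of (key, value, timestamp) rows, assumed sorted by key."""
    h = hashlib.sha256()
    for row in rows:
        h.update(repr(row).encode("utf-8"))
    return h.hexdigest()


def replica_response(local_rows, start, end, remote_hash):
    """Decide what a replica replies for one repair page.

    local_rows is the replica's data sorted by key; start/end are the
    (assumed inclusive) bounds sent by the repairing node.
    """
    limit = WINDOW_SIZE * 2
    in_range = [r for r in local_rows if start <= r[0] <= end][:limit]
    if page_hash(in_range) == remote_hash:
        return ("MATCH", None)
    if len(in_range) >= limit:
        # Too many rows in the window: the repairing node pages the data back.
        return ("HUGE", None)
    return ("MISMATCH", in_range)


def rows_to_write(local_page, remote_page):
    """Repairing-node side: keep only remote rows that are missing or newer locally."""
    local = {key: ts for key, _value, ts in local_page}
    return [
        (key, value, ts)
        for key, value, ts in remote_page
        if key not in local or local[key] < ts
    ]
```

On a mismatch, the repairing node would pass the returned rows to {{rows_to_write}} together with its own page and apply only the result to the memtable.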
Regarding page cache pollution, I think we should handle these as normal reads.
First, the intention is to read through the data slowly, so we won't evict
all the 'real' pages in a short time. Second, we will read the data twice
within a very short time span if there is a mismatch, so the page cache should
reduce the impact of that.
*Discussion points:*
* How do we make repair invisible to the operator? (continuous process,
calculate how many rows/s we need to repair etc)
* Can we handle gc_grace_seconds (gcgs) in a better way with this?
* Can we avoid anticompaction? Do we really need it to be incremental?
(probably, but having a single thread page through the entire dataset should be
something the cluster can handle)
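As a rough illustration of the rows/s question in the first point, one possible calculation is sketched below. It assumes that the goal is to finish one full pass over the data within a fixed time window (for example gc_grace_seconds) and that a safety factor leaves headroom; neither assumption comes from the branch, and {{required_rows_per_second}} is a hypothetical helper.

```python
import math


def required_rows_per_second(total_rows, window_seconds, safety_factor=2.0):
    """Rows/s needed to page through total_rows once within window_seconds.

    safety_factor > 1 leaves headroom for pauses and mismatching pages;
    its default here is an arbitrary choice for the sketch.
    """
    if window_seconds <= 0:
        raise ValueError("window_seconds must be positive")
    return math.ceil(total_rows * safety_factor / window_seconds)
```

For example, 864,000,000 rows and a window of 864,000 seconds with no headroom ({{safety_factor=1.0}}) works out to 1000 rows/s.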
TODO:
* Properly handle the "huge" responses above - need to have a way for the
remote paging to return {{UnfilteredPartitionIterator}}s
* Make it incremental - currently reads all data and puts the differences in
the regular memtable. We could probably have a separate memtable that is
flushed to a repaired sstable.
* Avoid breaking DTCS etc, since all mutations go into the same memtable, the
flushed sstable will cover a big time window. Best solution to this would
probably be to make flushing write to several sstables as that would help with
other DTCS issues as well (read repair, USING TIMESTAMP)
* Write tests / benchmarks / ...
* More metrics - we can get very accurate metrics on how much data actually
differed during the repair
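For the DTCS item, the idea of flushing to several sstables could look something like the sketch below. It is only a model of the grouping step: mutations are {{(key, value, timestamp)}} tuples with timestamps in seconds (an assumption made for readability), the window size is arbitrary, and {{bucket_by_time_window}} is a hypothetical name, not an existing Cassandra function.

```python
from collections import defaultdict


def bucket_by_time_window(mutations, window_seconds):
    """Group mutations by the time window their timestamp falls into.

    Returns a dict mapping each window's start time to its mutations, ordered
    by window start, so that each group could be flushed to its own sstable.
    """
    buckets = defaultdict(list)
    for mutation in mutations:
        _key, _value, ts = mutation
        window_start = (ts // window_seconds) * window_seconds
        buckets[window_start].append(mutation)
    return dict(sorted(buckets.items()))
```

Each resulting group covers at most one window, so a flushed sstable would no longer span the whole time range of the repaired data.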
If anyone wants to try it out, there is a JMX method
{{enableMutationBasedRepair}} on the {{ColumnFamilyStoreMBean}} to enable it
(it pages through and repairs all the data once).
> Consider Mutation-based Repairs
> -------------------------------
>
> Key: CASSANDRA-8911
> URL: https://issues.apache.org/jira/browse/CASSANDRA-8911
> Project: Cassandra
> Issue Type: Improvement
> Reporter: Tyler Hobbs
> Assignee: Marcus Eriksson
> Fix For: 3.x
>
>
> We should consider a mutation-based repair to replace the existing streaming
> repair. While we're at it, we could do away with a lot of the complexity
> around merkle trees.
> I have not planned this out in detail, but here's roughly what I'm thinking:
> * Instead of building an entire merkle tree up front, just send the "leaves"
> one-by-one. Instead of dealing with token ranges, make the leaves primary
> key ranges. The PK ranges would need to be contiguous, so that the start of
> each range would match the end of the previous range. (The first and last
> leaves would need to be open-ended on one end of the PK range.) This would be
> similar to doing a read with paging.
> * Once one page of data is read, compute a hash of it and send it to the
> other replicas along with the PK range that it covers and a row count.
> * When the replicas receive the hash, they perform a read over the same PK
> range (using a LIMIT of the row count + 1) and compare hashes (unless the row
> counts don't match, in which case this can be skipped).
> * If there is a mismatch, the replica will send a mutation covering that
> page's worth of data (ignoring the row count this time) to the source node.
> Here are the advantages that I can think of:
> * With the current repair behavior of streaming, vnode-enabled clusters may
> need to stream hundreds of small SSTables. This results in increased
> compaction load on the receiving node. With the mutation-based approach, memtables
> would naturally merge these.
> * It's simple to throttle. For example, you could give a number of rows/sec
> that should be repaired.
> * It's easy to see what PK range has been repaired so far. This could make
> it simpler to resume a repair that fails midway.
> * Inconsistencies start to be repaired almost right away.
> * Less special code (?)
> * Wide partitions are no longer a problem.
> There are a few problems I can think of:
> * Counters. I don't know if this can be made safe, or if they need to be
> skipped.
> * To support incremental repair, we need to be able to read from only
> repaired sstables. Probably not too difficult to do.