[ https://issues.apache.org/jira/browse/CASSANDRA-2699?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13070528#comment-13070528 ]
Jeremiah Jordan commented on CASSANDRA-2699:
--------------------------------------------

I like the idea of not repairing stuff that has already been repaired. One problem I see with CL.ALL reads is that you won't get new keys repaired to that node until another node that has the key is repaired.

continuous incremental anti-entropy
-----------------------------------

        Key: CASSANDRA-2699
        URL: https://issues.apache.org/jira/browse/CASSANDRA-2699
    Project: Cassandra
 Issue Type: Improvement
   Reporter: Peter Schuller

Currently, repair works by periodically running "bulk" jobs that (1) perform a validating compaction, building up an in-memory Merkle tree, and (2) stream ring segments as needed according to the differences indicated by the tree.

There are some disadvantages to this approach:

* There is a trade-off between memory usage and the precision of the Merkle tree. Less precision means more data streamed relative to what is strictly required.
* Repair is a periodic "bulk" process that runs for a significant period and, although possibly rate-limited like compaction (on 0.8, or with the throttling patch backported), diverges in performance characteristics from "normal" operation of the cluster.
* The impact of imprecision can be huge on a workload that is dominated by I/O and where cache locality is critical, since you suddenly transfer lots of data to the target node.

I propose a process whereby anti-entropy is incremental and continuous over time. To avoid being seek-bound one still wants to do work in a somewhat bursty fashion, but the amount of data processed at a time could be small enough that the impact on the cluster feels much more continuous, and that the page cache lets us avoid re-reading differing data twice.

Consider a node constantly performing a repair operation for each CF. The current state of the repair process is defined by:

* A starting timestamp for the current iteration through the token range the node is responsible for.
* A "finger" indicating the position along the token ring up to which iteration has completed.

Besides being held in memory, this information could periodically (every few minutes or so) be persisted to disk.

The finger advances by the node selecting the next small "bit" of the ring, doing whatever Merkle/hashing/checksumming work is necessary on that small part, asking its neighbors to do the same, and arranging for the neighbors to send it data for mismatching ranges. The data would be sent either by way of mutations, as with read repair, or by streaming sstables. Either way, the amounts would be small and would act roughly like regular writes from the perspective of compaction.
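To make the state/finger mechanics concrete, here is a minimal sketch of what the per-CF state and advance loop could look like. Everything here (Neighbor, hashRange, streamRange, the checkpoint hook, the fixed replica list) is a hypothetical illustration under simplifying assumptions, not an existing Cassandra API:

{code:java}
import java.math.BigInteger;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Arrays;
import java.util.List;

/** Stand-in for a replica that shares the range being repaired. */
interface Neighbor
{
    byte[] hashRange(BigInteger from, BigInteger to);
    void streamRange(BigInteger from, BigInteger to);
}

public final class IncrementalRepair
{
    // Token ring modeled as [0, 2^127), as with RandomPartitioner.
    private static final BigInteger RING = BigInteger.ONE.shiftLeft(127);

    private long iterationStartedAt = System.currentTimeMillis();
    private BigInteger finger;                 // pass completed up to this token
    private final List<Neighbor> neighbors;    // simplification: fixed replica set

    public IncrementalRepair(BigInteger start, List<Neighbor> neighbors)
    {
        this.finger = start;
        this.neighbors = neighbors;
    }

    /** Advance the finger by one small chunk and repair any mismatches. */
    public void advance(BigInteger chunkWidth)
    {
        BigInteger from = finger;
        BigInteger to = from.add(chunkWidth).mod(RING);

        byte[] mine = hashRange(from, to);
        for (Neighbor n : neighbors)
        {
            // Ask each neighbor to hash the same small range; on mismatch,
            // only this chunk is streamed, so repair traffic looks like
            // ordinary small writes to the rest of the system.
            if (!Arrays.equals(mine, n.hashRange(from, to)))
                n.streamRange(from, to);
        }

        if (to.compareTo(from) <= 0)           // wrapped around: pass finished
            iterationStartedAt = System.currentTimeMillis();
        finger = to;
        maybeCheckpoint();
    }

    private byte[] hashRange(BigInteger from, BigInteger to)
    {
        // Stand-in for the validating read over [from, to); a real
        // implementation would hash the row data within the range.
        try
        {
            MessageDigest md = MessageDigest.getInstance("MD5");
            md.update(from.toByteArray());
            md.update(to.toByteArray());
            return md.digest();
        }
        catch (NoSuchAlgorithmException e)
        {
            throw new AssertionError(e);
        }
    }

    private void maybeCheckpoint()
    {
        // Every few minutes, persist (iterationStartedAt, finger) locally so
        // a restart resumes the pass instead of cancelling it.
    }
}
{code}

Persisting the finger is what would make the process restart-safe: a node resumes its pass where it left off instead of starting over.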
Some nice properties of this approach:

* It's "always on"; there are no periodic sudden effects on cluster performance.
* Restarting nodes never cancels or breaks anti-entropy.
* Huge compactions of entire CFs never clog up the compaction queue (not necessarily a non-issue even with concurrent compactions in 0.8).
* Because we're always operating on small chunks, there is never the same kind of trade-off for memory use; a Merkle tree or similar could potentially be calculated at a very detailed level. Although that precision would likely not matter much for reading from disk if we are in the page cache anyway, very high precision could be *very* useful when doing anti-entropy across data centers on slow links.

There are devils in the details, like how to select an appropriate ring segment given that you don't have knowledge of the data density on other nodes. But I feel that the overall idea/process seems very promising.
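On the segment-selection devil: one purely illustrative possibility (an assumption on my part, not something proposed above) is to adapt the chunk width from the amount of data the previous chunk turned out to contain, aiming at a roughly constant per-chunk byte budget:

{code:java}
import java.math.BigInteger;

/**
 * Purely illustrative heuristic for choosing the next chunk width:
 * halve the width after a dense chunk, double it after a sparse one.
 * Local density is only a proxy for the neighbors' density, so a real
 * scheme would also need feedback from the replicas.
 */
final class ChunkSizer
{
    private static final BigInteger MIN_WIDTH = BigInteger.valueOf(1024);

    private final BigInteger ringSize;
    private final long targetBytes;
    private BigInteger width;

    ChunkSizer(BigInteger ringSize, long targetBytes, BigInteger initialWidth)
    {
        this.ringSize = ringSize;
        this.targetBytes = targetBytes;
        this.width = initialWidth;
    }

    /** Feed back the bytes actually covered by the last chunk. */
    BigInteger next(long bytesInLastChunk)
    {
        if (bytesInLastChunk > 2 * targetBytes)
            width = width.shiftRight(1);   // dense region: shrink the step
        else if (bytesInLastChunk < targetBytes / 2)
            width = width.shiftLeft(1);    // sparse region: grow the step
        width = width.max(MIN_WIDTH).min(ringSize);
        return width;
    }
}
{code}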