[ https://issues.apache.org/jira/browse/CASSANDRA-14346?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16420504#comment-16420504 ]
Alexander Dejanovski commented on CASSANDRA-14346:
--------------------------------------------------

I really like the idea of making repair something that is coordinated by the cluster instead of being node-centric as it is today. This is how it should be implemented, and external tools should only add features on top of it; nodetool really should be doing this by default. I broadly agree with the state machine that is detailed (though I haven't spent that much time on it).

I disagree with point 6 of the document's Resiliency section, which claims that adding nodes won't impact repair: adding nodes changes the token ranges, and some of the splits will then spread across different replicas, which makes them unsuitable for repair (think of clusters with 256 vnodes per node). You either have to cancel the repair or recompute the remaining splits to move on with the job.

I would add a feature to your nodetool repairstatus command that allows listing only the currently running repairs.

I also think the approach of implementing a fully automated, seamless, continuous repair "that just works" without user intervention is unsafe in the wild; there are too many caveats. There are many different types of clusters out there, and some of them just cannot run repair without careful tuning and monitoring (if at all). The current design shows no backpressure mechanism to ensure that further sequences won't harm a cluster that is already running late on compactions (whether due to overstreaming, entropy, or simply the normal activity of the cluster).

Repairing table by table will add a lot of overhead compared to repairing a list of tables (or all of them) in a single session, unless multiple concurrent repairs on a node are allowed, which in turn makes it impossible to safely terminate a single repair. It is also unclear in the current design whether repair can be disabled for selected tables (something like "type: none").

The proposal doesn't seem to involve any change to how "nodetool repair" behaves.
Will it be changed to use the state machine and coordinate throughout the cluster?

Trying to replace external tools with built-in features has its limits, I think, and the current design gives only limited control to such external tools (be it Reaper, the DataStax repair service, Priam, or ...). To make an analogy that was seen recently on the mailing list, it's as if Cassandra implemented automatic spreading of configuration changes itself instead of relying on tools like Chef or Puppet. You'll still need global tools to manage repairs across several clusters anyway, which a Cassandra built-in feature cannot (and should not) provide.

My point is that making repair smarter and coordinated within Cassandra is a great idea and I support it 100%, but the current design makes it too automated, and the defaults could easily lead to severe performance problems without the user triggering anything. I also don't know how it could be made to work alongside user-defined repairs, as you'll need to force-terminate some sessions.

To summarize, I would put aside the scheduling features and implement the coordinated repairs by splits within Cassandra. The StorageServiceMBean should evolve to allow manually setting the number of splits per node, or to rely on a number of splits generated by Cassandra itself. It should also be possible to track progress externally by listing splits (sequences) through JMX, and to pause/resume selected repair runs.

Also, the current design should evolve to allow a single sequence to include multiple token ranges. We have a feature waiting to be merged in Reaper that groups token ranges having the same replicas, in order to reduce the overhead of vnodes. Starting with 3.0, repair jobs can be triggered with multiple token ranges that will be executed as a single session if the replicas are the same for all of them. So, to avoid having to change the data model in the future, I'd suggest storing a list of token ranges instead of just one.
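To illustrate the grouping idea, here is a minimal Python sketch (the ring layout and function names are made up for illustration, not Cassandra APIs): token ranges that share the same replica set are bucketed together so each bucket can be submitted as a single multi-range repair session, cutting per-session overhead on vnode clusters.

```python
from collections import defaultdict

def group_by_replicas(range_replicas):
    """Bucket token ranges by their replica set.

    range_replicas: dict mapping token_range -> iterable of replica nodes.
    Returns a dict mapping frozenset(replicas) -> sorted list of ranges,
    where each bucket could become one repair session.
    """
    groups = defaultdict(list)
    for token_range, replicas in range_replicas.items():
        groups[frozenset(replicas)].append(token_range)
    return {replicas: sorted(ranges) for replicas, ranges in groups.items()}

# Toy ring: two ranges happen to share the same replicas.
ranges = {
    "(0,100]":   ["n1", "n2", "n3"],
    "(100,200]": ["n2", "n3", "n4"],
    "(200,300]": ["n1", "n2", "n3"],  # same replicas as (0,100]
}
groups = group_by_replicas(ranges)
# Two sessions instead of three: (0,100] and (200,300] repair together.
```

With vnodes (e.g. 256 per node), many small ranges end up with identical replica sets, so this kind of bucketing can shrink the session count substantially; it is also why the data model should store a list of token ranges per sequence rather than a single one.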
Repair events should also be tracked in a separate table, to avoid overwriting the last event each time (one thing Reaper currently sucks at as well).

I'll go back to the document soon and add my comments there.

Cheers

> Scheduled Repair in Cassandra
> -----------------------------
>
>                 Key: CASSANDRA-14346
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-14346
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: Repair
>            Reporter: Joseph Lynch
>            Priority: Major
>              Labels: CommunityFeedbackRequested
>             Fix For: 4.0
>
>         Attachments: ScheduledRepairV1_20180327.pdf
>
>
> There have been many attempts to automate repair in Cassandra, which makes
> sense given that it is necessary to give our users eventual consistency. Most
> recently CASSANDRA-10070, CASSANDRA-8911 and CASSANDRA-13924 have all looked
> for ways to solve this problem.
> At Netflix we've built a scheduled repair service within Priam (our sidecar),
> which we spoke about last year at NGCC. Given the positive feedback at NGCC
> we focussed on getting it production ready and have now been using it in
> production to repair hundreds of clusters, tens of thousands of nodes, and
> petabytes of data for the past six months. Also based on feedback at NGCC we
> have invested effort in figuring out how to integrate this natively into
> Cassandra rather than open sourcing it as an external service (e.g. in Priam).
> As such, [~vinaykumarcse] and I would like to re-work and merge our
> implementation into Cassandra, and have created a [design
> document|https://docs.google.com/document/d/1RV4rOrG1gwlD5IljmrIq_t45rz7H3xs9GbFSEyGzEtM/edit?usp=sharing]
> showing how we plan to make it happen, including the user interface.
> As we work on the code migration from Priam to Cassandra, any feedback would
> be greatly appreciated about the interface or v1 implementation features.
> I have tried to call out in the document features which we explicitly
> consider future work (as well as a path forward to implement them in the
> future) because I would very much like to get this done before the 4.0 merge
> window closes, and to do that I think aggressively pruning scope is going to
> be a necessity.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)