[ https://issues.apache.org/jira/browse/CASSANDRA-14346?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16426773#comment-16426773 ]
Stefan Podkowinski commented on CASSANDRA-14346:
------------------------------------------------

If we keep the scope of this ticket to scheduling repairs in Cassandra, we should really talk a bit more about the different requirements users have and what using the described solution would look like in practice. There are several aspects to consider when coming up with a working repair schedule:
* number of tables (from a single table per cluster to hundreds of tables)
* priority in repairing tables (some tables should be repaired more often, others never at all)
* data size per table (large tables should not block repairs for smaller, more important ones)
* predictable cluster load (try to schedule repairs off hours)
* sustainable repair intensity (repair sessions should not leak into peak hours)
* different gc_grace periods (plan intervals for each table so we can tolerate missing a repair run)

Repair schedules that take these aspects into account require a certain flexibility and some more careful configuration. Tools such as Reaper already allow you to put together such plans. Looking at the configuration options described in the design document, I'd probably still want to use such an external tool. That would be mostly due to the use of delays instead of recurring repair times and the way you'd have to configure repairs at the table level, which probably gets a bit "messy" fast when you have a lot of tables. The lack of any reporting also makes it harder to tune these config options afterwards.

I think the intention is to keep the scope of this ticket to "integrated repair scheduling and execution", so I'll spare you any of my thoughts about how we should coordinate and execute repairs differently in a post-CASSANDRA-9143 world. But if we want to solve scheduling on top of our existing repair implementation, we have to make sure that we can compete with existing 3rd party solutions. So far, it has been suggested that we move forward incrementally.
But then we also have to think about how improvements could be implemented on top of the proposed solution. I'd assume that optimizations would be easier to implement in external tools or sidecars that communicate via an IPC interface than in a baked-in solution, which uses the yaml config and table properties and has to deal with upgrade paths. My impression is that 3rd party projects are probably also a better place to quickly iterate on these kinds of problems.

> Scheduled Repair in Cassandra
> -----------------------------
>
>                 Key: CASSANDRA-14346
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-14346
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: Repair
>            Reporter: Joseph Lynch
>            Priority: Major
>              Labels: CommunityFeedbackRequested
>             Fix For: 4.0
>
>         Attachments: ScheduledRepairV1_20180327.pdf
>
>
> There have been many attempts to automate repair in Cassandra, which makes
> sense given that it is necessary to give our users eventual consistency. Most
> recently CASSANDRA-10070, CASSANDRA-8911 and CASSANDRA-13924 have all looked
> for ways to solve this problem.
> At Netflix we've built a scheduled repair service within Priam (our sidecar),
> which we spoke about last year at NGCC. Given the positive feedback at NGCC
> we focussed on getting it production ready and have now been using it in
> production to repair hundreds of clusters, tens of thousands of nodes, and
> petabytes of data for the past six months. Also based on feedback at NGCC we
> have invested effort in figuring out how to integrate this natively into
> Cassandra rather than open sourcing it as an external service (e.g. in Priam).
> As such, [~vinaykumarcse] and I would like to re-work and merge our
> implementation into Cassandra, and have created a [design
> document|https://docs.google.com/document/d/1RV4rOrG1gwlD5IljmrIq_t45rz7H3xs9GbFSEyGzEtM/edit?usp=sharing]
> showing how we plan to make it happen, including the user interface.
> As we work on the code migration from Priam to Cassandra, any feedback would
> be greatly appreciated about the interface or v1 implementation features. I
> have tried to call out in the document features which we explicitly consider
> future work (as well as a path forward to implement them in the future)
> because I would very much like to get this done before the 4.0 merge window
> closes, and to do that I think aggressively pruning scope is going to be a
> necessity.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)