[ https://issues.apache.org/jira/browse/CASSANDRA-14346?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16427312#comment-16427312 ]
Joseph Lynch commented on CASSANDRA-14346:
------------------------------------------

[~spo...@gmail.com] Thanks for the feedback; let me try to address your concerns. If you have time, can you comment specifically in the design doc so that I can make sure we address the outstanding concerns? (Keeping track of points in a JIRA is very hard; keeping track in Google Doc comments is easier for me.)

{quote}There are several aspect to consider for coming up with a working repair schedule: number of tables (from a single table per cluster to hundreds of tables)
{quote}
I don't think this is an issue with the design. We currently use this design to repair hundreds of clusters that range from a few large tables to thousands of variously sized tables. Our distributed design makes continuous progress and gets the job done. We also provide a path forward in the document for highly [concurrent repair|https://docs.google.com/document/d/1RV4rOrG1gwlD5IljmrIq_t45rz7H3xs9GbFSEyGzEtM/edit#bookmark=id.xn6852786lv8], which further helps.

{quote}priority in repairing tables (some tables should be repaired more often, others never at all) data size per table (large table should not block repairs for smaller more important ones)
{quote}
I think cluster sharding is the better way to fix this (and I believe that on trunk you can now run multiple Cassandra clusters on the same machine because of the port refactor). You want to isolate critical workloads from non-critical workloads for lots of reasons aside from repair. I don't see any reason why multiple schedules with table filters couldn't achieve this, but I question whether that's the right level of abstraction at which to solve it (i.e. I think cluster sharding is a much better solution). Do you have any proposals for how to achieve this kind of coordination without a central coordinator? I'll think on it, but if you think it's important I encourage you to contribute to the design.
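To make the "multiple schedules with table filters" idea above concrete, here is a minimal, purely illustrative sketch of how schedules carrying table filters might assign tables to repair cadences. The names ({{RepairSchedule}}, {{tables_due}}) and the first-match-wins policy are my assumptions, not the proposed Cassandra API.

```python
# Hypothetical sketch: multiple repair schedules, each with glob-style
# table filters, assigning tables to different repair cadences.
import fnmatch
from dataclasses import dataclass


@dataclass
class RepairSchedule:
    name: str
    table_patterns: list    # glob patterns like "ks_payments.*"
    interval_hours: int     # how often matched tables should be repaired

    def matches(self, table: str) -> bool:
        return any(fnmatch.fnmatch(table, p) for p in self.table_patterns)


def tables_due(schedules, all_tables):
    """Map each table to the first schedule whose filter matches it."""
    assignment = {}
    for table in all_tables:
        for sched in schedules:
            if sched.matches(table):
                assignment[table] = sched.name
                break
    return assignment


schedules = [
    RepairSchedule("critical", ["ks_payments.*"], interval_hours=24),
    RepairSchedule("default", ["*"], interval_hours=7 * 24),
]
print(tables_due(schedules, ["ks_payments.ledger", "ks_logs.events"]))
# {'ks_payments.ledger': 'critical', 'ks_logs.events': 'default'}
```

First-match-wins makes the catch-all {{*}} schedule a natural default, with more specific schedules listed ahead of it; this is one plausible policy, not the one the design document specifies.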
{quote}predictable cluster load (try to schedule repairs off hours) sustainable repair intensity (repair sessions should not leak into peak hours)
{quote}
I address this in the [design|https://docs.google.com/document/d/1RV4rOrG1gwlD5IljmrIq_t45rz7H3xs9GbFSEyGzEtM/edit#bookmark=id.mykcdt32qw7i] and give a path forward. For what it's worth, we generally disagree with doing something less frequently because it hurts; do it more so that you actually fix it. For example, when we started running repair continuously we realized how important appropriately auto-sized subranges are to preventing impact on the cluster; now that that's fixed, we run repair continuously without any impact to the cluster.

{quote}different gc_grace periods (plan intervals for each table so we can tolerate missing a repair run)
{quote}
I also address this in the [design|https://docs.google.com/document/d/1RV4rOrG1gwlD5IljmrIq_t45rz7H3xs9GbFSEyGzEtM/edit#bookmark=id.j1wvnhf0scav]. Once repair scheduling is part of Cassandra, the {{only_purge_repaired_tombstones}} option becomes much more attractive, in my opinion.

{quote}Repair schedules, which will take these aspects into account, require a certain flexibility and some more careful configuration. Tools, such as reaper, allow you to put together such plans already. Looking at the configuration options described in the design document, I'd probably still want to use such an external tool. That would be mostly due to the use of delays instead of recurring repair times and the way you'd have to configure repairs on table level, which probably gets a bit "messy" fast when you have a lot of tables. The lack of any reporting doesn't help either to further tune these config options afterwards.
{quote}
We pretty strongly disagree that advanced scheduling is actually required.
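The subrange auto-sizing mentioned above can be sketched roughly as follows. This is an illustrative model, not the Priam/Cassandra implementation: it assumes data is uniformly distributed over the token range and splits the range so each subrange covers roughly a fixed amount of data, keeping any single repair session small. The 100 MiB target is an arbitrary example value.

```python
# Illustrative sketch of adaptive subrange sizing: split a token range into
# subranges that each cover roughly `target_bytes` of data, assuming uniform
# data distribution across the range.
import math


def split_subranges(start_token: int, end_token: int,
                    estimated_range_bytes: int,
                    target_bytes: int = 100 * 1024**2):
    """Return (start, end) token pairs, each covering ~target_bytes."""
    n = max(1, math.ceil(estimated_range_bytes / target_bytes))
    width = end_token - start_token
    bounds = [start_token + width * i // n for i in range(n + 1)]
    return list(zip(bounds, bounds[1:]))


# A 1 GiB range with a 100 MiB target yields ceil(10.24) = 11 subranges.
subs = split_subranges(0, 2**20, estimated_range_bytes=1024**3)
print(len(subs))  # 11
```

In practice the size estimate would come from per-range statistics, and a dense or skewed range would be split further; the point is that sizing subranges by data volume rather than using a fixed split count is what keeps continuous repair from impacting the cluster.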
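The gc_grace reasoning above boils down to simple arithmetic: every table must complete a full repair cycle within {{gc_grace_seconds}}, or purged tombstones can let deleted data resurrect, so an interval planned with headroom can tolerate a missed run. A hedged sketch, where the tolerance factor is my assumption for illustration:

```python
# Sketch of interval planning against gc_grace: choose a repair interval such
# that even after `missed_runs_tolerated` skipped runs, the last successful
# repair still falls within gc_grace_seconds.
def repair_interval_seconds(gc_grace_seconds: int,
                            missed_runs_tolerated: int = 1) -> int:
    return gc_grace_seconds // (missed_runs_tolerated + 1)


# Default gc_grace of 10 days, tolerating one missed run -> repair every 5 days.
print(repair_interval_seconds(10 * 86400))  # 432000
```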
[Adaptive|https://docs.google.com/document/d/1RV4rOrG1gwlD5IljmrIq_t45rz7H3xs9GbFSEyGzEtM/edit#bookmark=id.x9qx96jfyivi] subrange repair as proposed in the design, together with eventually making repair much cheaper (via incremental and continuous+incremental repair, and using FADV_DONTNEED so you don't blow the OS cache), is in our opinion a better place to put the complexity than the scheduler (since schedulers are comparatively harder). Regarding the table-by-table config, as stated [in the document|https://docs.google.com/document/d/1RV4rOrG1gwlD5IljmrIq_t45rz7H3xs9GbFSEyGzEtM/edit#bookmark=id.fhyfqyylq2p3], an explicit goal is to have almost no configuration. As a production datapoint, we have thousands of tables and I think we set overrides for maybe a few dozen of them. We've tried to produce a minimally viable integration into Cassandra so that the 90% use cases (and even the 10% huge-scale users such as us at Netflix) can have eventual consistency.

{quote}I think the intention is to keep the scope of this ticket to "integrated repair scheduling and execution", so I'll spare you any of my thoughts about how we should coordinate and execute repairs differently in a post CASSANDRA-9143 world. But if we want to solve scheduling on top of our existing repair implementation, we have to make sure that we can compete with existing 3rd party solutions. So far it was already suggested to move on incrementally. But then we also have to think about how improvements could be implemented on top of the proposed solution. I'd assume that optimizations would be easier to implement in external tools or sidecars that communicates via an IPC interface, compared to a baked in solution, which is using the yaml config, table properties, or has to deal with upgrade paths. From my impression, 3rd party projects are probably also a better place to quickly iterate on these kind of problems.
{quote}
It sounds like the rough consensus is that we can't iterate quickly in the database itself, so I'll spend some time this week adding back to the design the additional resiliency and configuration components that we took out after discussions at NGCC indicated that a sidecar probably wouldn't get merged but an integration into the database might.

> Scheduled Repair in Cassandra
> -----------------------------
>
>                 Key: CASSANDRA-14346
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-14346
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: Repair
>            Reporter: Joseph Lynch
>            Priority: Major
>              Labels: CommunityFeedbackRequested
>             Fix For: 4.0
>
>         Attachments: ScheduledRepairV1_20180327.pdf
>
>
> There have been many attempts to automate repair in Cassandra, which makes sense given that it is necessary to give our users eventual consistency. Most recently, CASSANDRA-10070, CASSANDRA-8911 and CASSANDRA-13924 have all looked for ways to solve this problem.
> At Netflix we've built a scheduled repair service within Priam (our sidecar), which we spoke about last year at NGCC. Given the positive feedback at NGCC, we focussed on getting it production ready and have now been using it in production to repair hundreds of clusters, tens of thousands of nodes, and petabytes of data for the past six months. Also based on feedback at NGCC, we have invested effort in figuring out how to integrate this natively into Cassandra rather than open sourcing it as an external service (e.g. in Priam).
> As such, [~vinaykumarcse] and I would like to re-work and merge our implementation into Cassandra, and have created a [design document|https://docs.google.com/document/d/1RV4rOrG1gwlD5IljmrIq_t45rz7H3xs9GbFSEyGzEtM/edit?usp=sharing] showing how we plan to make it happen, including the user interface.
> As we work on the code migration from Priam to Cassandra, any feedback would be greatly appreciated about the interface or v1 implementation features. I have tried to call out in the document the features which we explicitly consider future work (as well as a path forward to implement them in the future), because I would very much like to get this done before the 4.0 merge window closes, and to do that I think aggressively pruning scope is going to be a necessity.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)