[ https://issues.apache.org/jira/browse/CASSANDRA-14346?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16427312#comment-16427312 ]
Joseph Lynch commented on CASSANDRA-14346:
------------------------------------------

[~spo...@gmail.com] Thanks for the feedback; let me try to address your concerns. If you have time, can you comment specifically in the design doc so that I can make sure we address the outstanding concerns? (Keeping track of points in a JIRA is very hard; keeping track in Google Doc comments is easier for me.)

{quote}There are several aspect to consider for coming up with a working repair schedule: number of tables (from a single table per cluster to hundreds of tables)
{quote}
I don't think this is an issue with the design. We currently use this design to repair hundreds of clusters that range from a few large tables to thousands of variously sized tables. Our distributed design makes continuous progress and gets the job done. We also provide a path forward in the document for highly [concurrent repair|https://docs.google.com/document/d/1RV4rOrG1gwlD5IljmrIq_t45rz7H3xs9GbFSEyGzEtM/edit#bookmark=id.xn6852786lv8], which further helps.

{quote}priority in repairing tables (some tables should be repaired more often, others never at all) data size per table (large table should not block repairs for smaller more important ones)
{quote}
I think cluster sharding is the better way to fix this (and I believe that on trunk you can now run multiple Cassandra clusters on the same machine because of the port refactor). You want to isolate critical workloads from non-critical workloads for lots of reasons aside from repair. I don't see any reason why multiple schedules with table filters couldn't achieve this, but I question whether that's the right level of abstraction at which to solve it (i.e. I think cluster sharding is a much better solution). Do you have any proposals for how to achieve this kind of coordination without a central coordinator? I'll think on it, but if you think it's important I encourage you to contribute to the design.
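To make the "multiple schedules with table filters" idea above concrete, here is a minimal, purely illustrative sketch of how schedules carrying table filters might assign tables to repair cadences. The names ({{RepairSchedule}}, {{tables_due}}) and the first-match-wins policy are my assumptions, not the proposed Cassandra API.

```python
# Hypothetical sketch: multiple repair schedules, each with glob-style
# table filters, assigning tables to different repair cadences.
import fnmatch
from dataclasses import dataclass


@dataclass
class RepairSchedule:
    name: str
    table_patterns: list    # glob patterns like "ks_payments.*"
    interval_hours: int     # how often matched tables should be repaired

    def matches(self, table: str) -> bool:
        return any(fnmatch.fnmatch(table, p) for p in self.table_patterns)


def tables_due(schedules, all_tables):
    """Map each table to the first schedule whose filter matches it."""
    assignment = {}
    for table in all_tables:
        for sched in schedules:
            if sched.matches(table):
                assignment[table] = sched.name
                break
    return assignment


schedules = [
    RepairSchedule("critical", ["ks_payments.*"], interval_hours=24),
    RepairSchedule("default", ["*"], interval_hours=7 * 24),
]
print(tables_due(schedules, ["ks_payments.ledger", "ks_logs.events"]))
# {'ks_payments.ledger': 'critical', 'ks_logs.events': 'default'}
```

First-match-wins makes the catch-all {{*}} schedule a natural default, with more specific schedules listed ahead of it; this is one plausible policy, not the one the design document specifies.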
{quote}predictable cluster load (try to schedule repairs off hours) sustainable repair intensity (repair sessions should not leak into peak hours)
{quote}
I address this in the [design|https://docs.google.com/document/d/1RV4rOrG1gwlD5IljmrIq_t45rz7H3xs9GbFSEyGzEtM/edit#bookmark=id.mykcdt32qw7i] and give a path forward. For what it's worth, we generally disagree with doing something less frequently because it hurts; do it more so that you actually fix it. For example, when we started running repair continuously we realized how important appropriately auto-sized subranges are to preventing impact on the cluster; now that that's fixed, we run repair continuously without any impact to the cluster.

{quote}different gc_grace periods (plan intervals for each table so we can tolerate missing a repair run)
{quote}
I also address this in the [design|https://docs.google.com/document/d/1RV4rOrG1gwlD5IljmrIq_t45rz7H3xs9GbFSEyGzEtM/edit#bookmark=id.j1wvnhf0scav]. Once repair scheduling is part of Cassandra, the {{only_purge_repaired_tombstones}} option becomes much more attractive, in my opinion.

{quote}Repair schedules, which will take these aspects into account, require a certain flexibility and some more careful configuration. Tools, such as reaper, allow you to put together such plans already. Looking at the configuration options described in the design document, I'd probably still want to use such an external tool. That would be mostly due to the use of delays instead of recurring repair times and the way you'd have to configure repairs on table level, which probably gets a bit "messy" fast when you have a lot of tables. The lack of any reporting doesn't help either to further tune these config options afterwards.
{quote}
We pretty strongly disagree that advanced scheduling is actually required.
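The subrange auto-sizing mentioned above can be sketched roughly as follows. This is an illustrative model, not the Priam/Cassandra implementation: it assumes data is uniformly distributed over the token range and splits the range so each subrange covers roughly a fixed amount of data, keeping any single repair session small. The 100 MiB target is an arbitrary example value.

```python
# Illustrative sketch of adaptive subrange sizing: split a token range into
# subranges that each cover roughly `target_bytes` of data, assuming uniform
# data distribution across the range.
import math


def split_subranges(start_token: int, end_token: int,
                    estimated_range_bytes: int,
                    target_bytes: int = 100 * 1024**2):
    """Return (start, end) token pairs, each covering ~target_bytes."""
    n = max(1, math.ceil(estimated_range_bytes / target_bytes))
    width = end_token - start_token
    bounds = [start_token + width * i // n for i in range(n + 1)]
    return list(zip(bounds, bounds[1:]))


# A 1 GiB range with a 100 MiB target yields ceil(10.24) = 11 subranges.
subs = split_subranges(0, 2**20, estimated_range_bytes=1024**3)
print(len(subs))  # 11
```

In practice the size estimate would come from per-range statistics, and a dense or skewed range would be split further; the point is that sizing subranges by data volume rather than using a fixed split count is what keeps continuous repair from impacting the cluster.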
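The gc_grace reasoning above boils down to simple arithmetic: every table must complete a full repair cycle within {{gc_grace_seconds}}, or purged tombstones can let deleted data resurrect, so an interval planned with headroom can tolerate a missed run. A hedged sketch, where the tolerance factor is my assumption for illustration:

```python
# Sketch of interval planning against gc_grace: choose a repair interval such
# that even after `missed_runs_tolerated` skipped runs, the last successful
# repair still falls within gc_grace_seconds.
def repair_interval_seconds(gc_grace_seconds: int,
                            missed_runs_tolerated: int = 1) -> int:
    return gc_grace_seconds // (missed_runs_tolerated + 1)


# Default gc_grace of 10 days, tolerating one missed run -> repair every 5 days.
print(repair_interval_seconds(10 * 86400))  # 432000
```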
[Adaptive|https://docs.google.com/document/d/1RV4rOrG1gwlD5IljmrIq_t45rz7H3xs9GbFSEyGzEtM/edit#bookmark=id.x9qx96jfyivi] subrange repair as proposed in the design, together with eventually making repair much cheaper (via incremental and continuous+incremental repair, and using FADV_DONTNEED so you don't blow the OS cache), is in our opinion a better place to put the complexity than the scheduler (since schedulers are comparatively harder). Regarding the table-by-table config, as stated [in the document|https://docs.google.com/document/d/1RV4rOrG1gwlD5IljmrIq_t45rz7H3xs9GbFSEyGzEtM/edit#bookmark=id.fhyfqyylq2p3], an explicit goal is to have almost no configuration. As a production datapoint, we have thousands of tables and I think we set overrides for maybe a few dozen of them. We've tried to produce a minimally viable integration into Cassandra so that the 90% use cases (and even the 10% huge-scale users such as us at Netflix) can have eventual consistency.

{quote}I think the intention is to keep the scope of this ticket to "integrated repair scheduling and execution", so I'll spare you any of my thoughts about how we should coordinate and execute repairs differently in a post CASSANDRA-9143 world. But if we want to solve scheduling on top of our existing repair implementation, we have to make sure that we can compete with existing 3rd party solutions. So far it was already suggested to move on incrementally. But then we also have to think about how improvements could be implemented on top of the proposed solution. I'd assume that optimizations would be easier to implement in external tools or sidecars that communicates via an IPC interface, compared to a baked in solution, which is using the yaml config, table properties, or has to deal with upgrade paths. From my impression, 3rd party projects are probably also a better place to quickly iterate on these kind of problems.
{quote}
It sounds like the rough consensus is that we can't iterate quickly in the database itself, so I'll spend some time this week adding back to the design the additional resiliency and configuration components that we took out after discussions at NGCC indicated that a sidecar probably wouldn't get merged but an integration into the database might.

> Scheduled Repair in Cassandra
> -----------------------------
>
>                 Key: CASSANDRA-14346
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-14346
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: Repair
>            Reporter: Joseph Lynch
>            Priority: Major
>              Labels: CommunityFeedbackRequested
>             Fix For: 4.0
>
>         Attachments: ScheduledRepairV1_20180327.pdf
>
>
> There have been many attempts to automate repair in Cassandra, which makes sense given that it is necessary to give our users eventual consistency. Most recently, CASSANDRA-10070, CASSANDRA-8911 and CASSANDRA-13924 have all looked for ways to solve this problem.
> At Netflix we've built a scheduled repair service within Priam (our sidecar), which we spoke about last year at NGCC. Given the positive feedback at NGCC, we focussed on getting it production ready and have now been using it in production to repair hundreds of clusters, tens of thousands of nodes, and petabytes of data for the past six months. Also based on feedback at NGCC, we have invested effort in figuring out how to integrate this natively into Cassandra rather than open sourcing it as an external service (e.g. in Priam).
> As such, [~vinaykumarcse] and I would like to re-work and merge our implementation into Cassandra, and have created a [design document|https://docs.google.com/document/d/1RV4rOrG1gwlD5IljmrIq_t45rz7H3xs9GbFSEyGzEtM/edit?usp=sharing] showing how we plan to make it happen, including the user interface.
> As we work on the code migration from Priam to Cassandra, any feedback would be greatly appreciated about the interface or v1 implementation features. I have tried to call out in the document the features which we explicitly consider future work (as well as a path forward to implement them in the future), because I would very much like to get this done before the 4.0 merge window closes, and to do that I think aggressively pruning scope is going to be a necessity.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)