[jira] [Commented] (CASSANDRA-14346) Scheduled Repair in Cassandra

Joseph Lynch (JIRA) Thu, 19 Apr 2018 11:44:23 -0700

    [ 
https://issues.apache.org/jira/browse/CASSANDRA-14346?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16444598#comment-16444598
 ]


Joseph Lynch commented on CASSANDRA-14346:
------------------------------------------

[~KurtG]
{quote}I think sidecar is a better choice purely for isolation from the 
read/write path, but think that we need to fix up the interface to repair 
first. As Blake mentioned, most problems come from the fact that JMX sucks and 
managing repairs over JMX is worse. I think as part of this work (or as a first 
step) we should be better defining this interface, and making it far more 
robust.

I think we should target the initial work for 4.0 - sprucing up interfaces so 
that repair is easier to work with and making failure handling fool-proof, as 
at least we'll probably be able to reach agreement on that front in a somewhat 
timely fashion. It seems a bit optimistic to target all the scheduling for 4.0 
at this stage, but I suppose it depends how much time people want to dedicate 
to this.
{quote}
I'm not entirely sure which interfaces need to be spruced up? I think the 
existing trunk 
[methods|https://github.com/apache/cassandra/blob/8b3a60b9a7dbefeecc06bace617279612ec7092d/src/java/org/apache/cassandra/service/ActiveRepairServiceMBean.java#L28-L29]
 are sufficient for the sidecar to reconcile the repairs actually running in 
Cassandra with the repair state recorded in the database. Since the repair 
scheduler keeps work very small (targeting ~30 minute pieces of work), even if 
we get the calculations wrong we shouldn't lose very much work.
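As a rough illustration of that reconciliation step, here is a minimal sketch. It assumes the sidecar can obtain two collections of repair identifiers: those its database believes are running, and those Cassandra reports as running. The function name, the identifiers, and the result keys are all hypothetical and are not part of ActiveRepairServiceMBean.

```python
def reconcile_repairs(recorded_running, actually_running):
    """Compare repair ids recorded as running with those really running.

    Both arguments are iterables of opaque repair ids (hypothetical; how
    they are fetched from the database or over JMX is out of scope here).
    """
    recorded = set(recorded_running)
    actual = set(actually_running)
    return {
        # Recorded as running but no longer running: mark failed and reschedule.
        "lost": sorted(recorded - actual),
        # Running but unknown to the database: candidates for termination.
        "untracked": sorted(actual - recorded),
        # Both sides agree, nothing to do.
        "in_sync": sorted(recorded & actual),
    }


result = reconcile_repairs(["r1", "r2", "r3"], ["r2", "r3", "r4"])
print(result)
```

Because each piece of work is small, rescheduling everything in the "lost" bucket costs at most roughly one piece of work per lost repair.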
{quote}Also we should keep in mind CASSANDRA-14395 as there's going to be a lot 
of overlap here if we go down the sidecar route.
{quote}
Yea, I agree but don't want to block on that. If/when that ends up getting 
merged we can definitely unify the two tools. One of the nice things about HTTP 
interfaces over a known port is that you can swap out what provides them pretty 
easily.
{quote}If referring to incremental repair, wouldn't this already be the case in 
4.0? Subrange repair works with incremental repair in trunk at the moment, so 
we should already get some major benefits here. Unless I'm missing something...

In other news, for interest's sake (slightly off topic), it seems DS is trying to 
do away with traditional repair, and instead they've gone the query at CL.ALL 
route (or similar) in their new "repair" system. I don't think this is a good 
idea, but good to keep in mind how everyone is approaching the problem.
{quote}
Adaptive subrange is an existing strategy we use for 2.1 and 3.0 
(pre-incremental) where the repair scheduler splits the work into lots of small 
pieces (which can be done in parallel) so that if we lose one we can resume 
without losing too much work; essentially you never ever ever do a full range 
repair unless the dataset is small. I think incremental or continuous repair 
(read: repairing only data that is inconsistent) are complementary to this 
concept in that they provide a way to make the work take less time, generally 
speaking. If/when those techniques are production ready, I believe the design 
makes it super easy for users to switch (by changing the {{type}}).
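To make the subrange idea concrete, here is a minimal sketch of splitting a token range into pieces sized for roughly 30 minutes of work each. It is not the Priam or design-document implementation: the function names, the size estimate, and the throughput figure are assumptions for illustration, and wrap-around ranges are deliberately not handled.

```python
import math


def pieces_for_target(estimated_bytes, bytes_per_second, target_seconds=30 * 60):
    """Number of subranges needed so each takes about target_seconds.

    estimated_bytes and bytes_per_second are assumed inputs; a real
    scheduler would have to measure or estimate them.
    """
    estimated_seconds = estimated_bytes / bytes_per_second
    return max(1, math.ceil(estimated_seconds / target_seconds))


def split_range(start, end, pieces):
    """Split the token range (start, end] into contiguous subranges."""
    if pieces < 1:
        raise ValueError("pieces must be at least 1")
    if end <= start:
        raise ValueError("wrap-around ranges are not handled in this sketch")
    width = end - start
    pieces = min(pieces, width)  # never create empty subranges
    bounds = [start + (width * i) // pieces for i in range(pieces + 1)]
    return list(zip(bounds[:-1], bounds[1:]))


# Example: ~90 GB in the range at an assumed 10 MB/s of repair throughput.
n = pieces_for_target(90 * 10**9, 10 * 10**6)
subranges = split_range(0, 1000, n)
print(n, subranges)
```

Each subrange can be repaired and checkpointed independently, so a failure costs at most one piece of work rather than the whole range.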

> Scheduled Repair in Cassandra
> -----------------------------
>
>                 Key: CASSANDRA-14346
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-14346
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: Repair
>            Reporter: Joseph Lynch
>            Priority: Major
>              Labels: CommunityFeedbackRequested
>             Fix For: 4.0
>
>         Attachments: ScheduledRepairV1_20180327.pdf
>
>
> There have been many attempts to automate repair in Cassandra, which makes 
> sense given that it is necessary to give our users eventual consistency. Most 
> recently CASSANDRA-10070, CASSANDRA-8911 and CASSANDRA-13924 have all looked 
> for ways to solve this problem.
> At Netflix we've built a scheduled repair service within Priam (our sidecar), 
> which we spoke about last year at NGCC. Given the positive feedback at NGCC 
> we focussed on getting it production ready and have now been using it in 
> production to repair hundreds of clusters, tens of thousands of nodes, and 
> petabytes of data for the past six months. Also based on feedback at NGCC we 
> have invested effort in figuring out how to integrate this natively into 
> Cassandra rather than open sourcing it as an external service (e.g. in Priam).
> As such, [~vinaykumarcse] and I would like to re-work and merge our 
> implementation into Cassandra, and have created a [design 
> document|https://docs.google.com/document/d/1RV4rOrG1gwlD5IljmrIq_t45rz7H3xs9GbFSEyGzEtM/edit?usp=sharing]
>  showing how we plan to make it happen, including the user interface.
> As we work on the code migration from Priam to Cassandra, any feedback would 
> be greatly appreciated about the interface or v1 implementation features. I 
> have tried to call out in the document features which we explicitly consider 
> future work (as well as a path forward to implement them in the future) 
> because I would very much like to get this done before the 4.0 merge window 
> closes, and to do that I think aggressively pruning scope is going to be a 
> necessity.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

