[jira] [Commented] (CASSANDRA-14346) Scheduled Repair in Cassandra

Blake Eggleston (JIRA) Sat, 31 Mar 2018 10:00:46 -0700

    [ https://issues.apache.org/jira/browse/CASSANDRA-14346?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16421395#comment-16421395 ]


Blake Eggleston commented on CASSANDRA-14346:
---------------------------------------------

Just to be clear, I’m not necessarily opposed to the idea of scheduling repairs 
as part of the cassandra daemon. I’m just also not convinced that it’s the 
right way to solve this problem, and think we need to zoom out a bit and 
discuss the pros and cons of that approach compared to others before we go too 
far in any single direction.

So far, the main arguments for doing it in process vs in a hypothetical ops tool 
seem to be that 1. jmx sucks and 2. being in process makes it easier to 
determine if repairs are in a bad state and stop them if they are.

The point about jmx being crap, sure, no arguments there. However, we could do 
a much better job of how we do communication between nodetool and cassandra. 
For instance, if we returned a repair id and nodetool just polled for updates 
instead of relying on a single connection, that would solve a lot of problems.
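
To illustrate the polling idea, here is a minimal sketch in Python. It is not Cassandra's or nodetool's actual interface: `RepairService`, `start_repair`, `get_status`, `wait_for_repair`, and the status strings are all hypothetical names, and the server side is simulated in memory. The point is only that the caller holds a repair id rather than an open connection, so any single poll can fail or reconnect without losing track of the repair.

```python
import itertools


class RepairService:
    """Hypothetical stand-in for the server side; not a real Cassandra API."""

    def __init__(self, steps_to_finish=3):
        self._steps_to_finish = steps_to_finish
        self._progress = {}  # repair id -> number of status polls seen so far
        self._ids = itertools.count(1)

    def start_repair(self, keyspace):
        # Return immediately with an id instead of holding a connection open.
        repair_id = f"{keyspace}-{next(self._ids)}"
        self._progress[repair_id] = 0
        return repair_id

    def get_status(self, repair_id):
        # Simulated progress: each poll advances the repair by one step.
        if repair_id not in self._progress:
            return "UNKNOWN"
        self._progress[repair_id] += 1
        done = self._progress[repair_id] >= self._steps_to_finish
        return "COMPLETED" if done else "RUNNING"


def wait_for_repair(service, repair_id, max_polls=10, sleep=lambda: None):
    """Poll until the repair leaves RUNNING; each poll could use a fresh connection."""
    for _ in range(max_polls):
        status = service.get_status(repair_id)
        if status != "RUNNING":
            return status
        sleep()  # a real client would back off between polls
    return "TIMED_OUT"
```

A caller would run `repair_id = service.start_repair("ks1")` and then `wait_for_repair(service, repair_id)`; because the id is the only state the client needs, the client can be restarted and resume polling.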

The point about being in process making it easier to detect and react to 
failures is where I’m really not convinced. There might be some straightforward 
failures that you’d be able to pick up on, but the real problem you need to 
solve is a distributed one. Specifically, you need a way to recover when the 
repair coordinator misses the success or failure message from a remote sync 
task. If you haven’t solved that, then you’ve only solved part of the problem 
and are just guessing. That’s something you can’t solve in process, and is 
going to require some internode communication. Also, solving that problem would 
probably provide the infrastructure you need to detect and resolve failures 
that aren’t as difficult to detect.
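
A rough sketch of the recovery problem described above, in Python. It assumes a design that is not in Cassandra today: participants remember the outcome of each sync task, and the coordinator can ask them for it. `Participant`, `Coordinator`, `query_status`, and `reconcile` are hypothetical names, and the "internode request" is just a method call here.

```python
class Participant:
    """Hypothetical remote node that remembers the outcome of its sync tasks."""

    def __init__(self):
        self.outcomes = {}  # task id -> "SUCCESS" or "FAILURE"

    def finish(self, task_id, outcome):
        self.outcomes[task_id] = outcome

    def query_status(self, task_id):
        # Stands in for an internode status request.
        return self.outcomes.get(task_id, "RUNNING")


class Coordinator:
    """Hypothetical repair coordinator that can recover from a lost message."""

    def __init__(self, timeout):
        self.timeout = timeout
        self.pending = {}  # task id -> (participant, time started)
        self.results = {}  # task id -> final outcome

    def start_task(self, task_id, participant, now):
        self.pending[task_id] = (participant, now)

    def on_message(self, task_id, outcome):
        # Normal path: the success or failure message arrives.
        if self.pending.pop(task_id, None) is not None:
            self.results[task_id] = outcome

    def reconcile(self, now):
        # Recovery path: for tasks that have been silent too long,
        # ask the participant for the outcome instead of guessing.
        for task_id, (participant, started) in list(self.pending.items()):
            if now - started < self.timeout:
                continue
            status = participant.query_status(task_id)
            if status != "RUNNING":
                self.on_message(task_id, status)
```

If a participant finishes a task but the message is lost, the task stays in `pending` until `reconcile` runs past the timeout and fetches the recorded outcome. The same status query could also serve as the basis for detecting simpler failures.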

So the jmx thing is not super difficult, and arguably something we should do 
anyway. The visibility into repair state isn’t solved by being in process, and 
is really out of scope for a discussion about the best way to coordinate when 
and where repairs are run.

> Scheduled Repair in Cassandra
> -----------------------------
>
>                 Key: CASSANDRA-14346
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-14346
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: Repair
>            Reporter: Joseph Lynch
>            Priority: Major
>              Labels: CommunityFeedbackRequested
>             Fix For: 4.0
>
>         Attachments: ScheduledRepairV1_20180327.pdf
>
>
> There have been many attempts to automate repair in Cassandra, which makes 
> sense given that it is necessary to give our users eventual consistency. Most 
> recently CASSANDRA-10070, CASSANDRA-8911 and CASSANDRA-13924 have all looked 
> for ways to solve this problem.
> At Netflix we've built a scheduled repair service within Priam (our sidecar), 
> which we spoke about last year at NGCC. Given the positive feedback at NGCC 
> we focussed on getting it production ready and have now been using it in 
> production to repair hundreds of clusters, tens of thousands of nodes, and 
> petabytes of data for the past six months. Also based on feedback at NGCC we 
> have invested effort in figuring out how to integrate this natively into 
> Cassandra rather than open sourcing it as an external service (e.g. in Priam).
> As such, [~vinaykumarcse] and I would like to re-work and merge our 
> implementation into Cassandra, and have created a [design 
> document|https://docs.google.com/document/d/1RV4rOrG1gwlD5IljmrIq_t45rz7H3xs9GbFSEyGzEtM/edit?usp=sharing]
>  showing how we plan to make it happen, including the user interface.
> As we work on the code migration from Priam to Cassandra, any feedback would 
> be greatly appreciated about the interface or v1 implementation features. I 
> have tried to call out in the document features which we explicitly consider 
> future work (as well as a path forward to implement them in the future) 
> because I would very much like to get this done before the 4.0 merge window 
> closes, and to do that I think aggressively pruning scope is going to be a 
> necessity.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

