[ https://issues.apache.org/jira/browse/CASSANDRA-14346?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16420504#comment-16420504 ]

Alexander Dejanovski commented on CASSANDRA-14346:
--------------------------------------------------

I really like the idea of making repair something that is coordinated by the 
cluster instead of being node-centric as it is today.
This is how it should be implemented, and external tools should only add 
features on top of it. nodetool really should be doing this by default.
Overall I agree with the state machine that is detailed (I haven't spent that 
much time on it though...)

I disagree with point 6 of the doc's Resiliency section, which says that adding 
nodes won't impact the repair: it will change the token ranges, and some of the 
splits will then spread across different replicas, which makes them unsuitable 
for repair (think of clusters with 256 vnodes per node).
You either have to cancel the repair or recompute the remaining splits to move 
on with the job.
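
To illustrate with a sketch (the types here are hypothetical stand-ins, not 
actual Cassandra classes): any split whose current replica set no longer matches 
the one it was computed against cannot be repaired as planned once a node has 
joined.

{code:java}
// Illustrative only: Split and ReplicaResolver are hypothetical stand-ins, not
// Cassandra classes. The point is that a split planned against one replica set
// becomes unusable once a new node changes the ownership of that range.
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

public class SplitValidity
{
    public static class Split
    {
        public final long startToken;
        public final long endToken;
        public final Set<String> plannedReplicas; // replicas when the split was computed

        public Split(long startToken, long endToken, Set<String> plannedReplicas)
        {
            this.startToken = startToken;
            this.endToken = endToken;
            this.plannedReplicas = plannedReplicas;
        }
    }

    public interface ReplicaResolver
    {
        Set<String> replicasFor(long startToken, long endToken); // current topology
    }

    // Splits whose replicas changed (e.g. after a bootstrap) cannot be repaired
    // as planned: they must be recomputed, or the run cancelled.
    public static List<Split> staleSplits(List<Split> plan, ReplicaResolver current)
    {
        List<Split> stale = new ArrayList<>();
        for (Split s : plan)
        {
            if (!current.replicasFor(s.startToken, s.endToken).equals(s.plannedReplicas))
                stale.add(s);
        }
        return stale;
    }
}
{code}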

I would add a feature to your nodetool repairstatus command that allows listing 
only the currently running repairs.

Then I think the approach of implementing a fully automated, seamless, 
continuous repair "that just works" without user intervention is unsafe in the 
wild; there are too many caveats.
There are many different types of clusters out there, and some of them just 
cannot run repair without careful tuning or monitoring (if at all).
The current design shows no backpressure mechanism to ensure that running 
further sequences won't harm a cluster that is already falling behind on 
compactions, whether because of overstreaming, entropy, or just the regular 
activity of the cluster.
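
To make the backpressure point concrete, here is the kind of gate I have in 
mind before starting the next sequence. It's only a sketch: the pending 
compactions gauge is the usual metric exposed over JMX, but the threshold and 
the way it's wired in are entirely made up and would have to be configurable.

{code:java}
// Rough sketch of a backpressure gate before kicking off the next repair
// sequence. The JMX object name below is the standard pending-compactions
// metric; the threshold is an arbitrary example value.
import java.lang.management.ManagementFactory;
import javax.management.MBeanServer;
import javax.management.ObjectName;

public class RepairBackpressure
{
    private static final int MAX_PENDING_COMPACTIONS = 100; // arbitrary example threshold

    public static boolean safeToStartNextSequence() throws Exception
    {
        MBeanServer server = ManagementFactory.getPlatformMBeanServer();
        ObjectName pending = new ObjectName(
            "org.apache.cassandra.metrics:type=Compaction,name=PendingTasks");
        Number pendingTasks = (Number) server.getAttribute(pending, "Value");
        // Defer the next sequence while the node is already behind on compactions,
        // e.g. because of overstreaming from previous repair sessions.
        return pendingTasks.intValue() < MAX_PENDING_COMPACTIONS;
    }
}
{code}
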
Repairing table by table will add a lot of overhead compared to repairing a 
list of tables (or all of them) in a single session, unless multiple concurrent 
repairs per node are allowed, which in turn makes it impossible to safely 
terminate a single repair.
It is also unclear in the current design whether repair can be disabled for 
selected tables (something like "type: none").
The proposal doesn't seem to involve any change to how "nodetool repair" 
behaves. Will it be changed to use the state machine and coordinate across the 
cluster?

Trying to replace external tools with built-in features has its limits, I 
think, and the current design gives such external tools (be it Reaper, the 
DataStax repair service, Priam, or ...) only limited control.
To use an analogy seen recently on the ML, it's as if you implemented automatic 
propagation of configuration changes from within Cassandra instead of relying 
on tools like Chef or Puppet.
You'll still need global tools to manage repairs across several clusters 
anyway, which a Cassandra built-in feature cannot (and should not) provide.

My point is that making repair smarter and coordinated within Cassandra is a 
great idea and I support it 100%, but the current design makes it too automated, 
and the defaults could easily lead to severe performance problems without the 
user having triggered anything.
I also don't know how it could be made to work alongside user-defined repairs, 
as you'll need to force-terminate some sessions.

To summarize, I would put the scheduling features aside and implement 
coordinated repair by splits within Cassandra. The StorageServiceMBean should 
evolve to allow manually setting the number of splits per node, or to rely on a 
number of splits generated by Cassandra itself.
Then it should also be possible to track progress externally by listing splits 
(sequences) through JMX, and to pause/resume selected repair runs.
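
Something along these lines, purely as an illustration (hypothetical method 
names, not a proposal for the exact signatures):

{code:java}
// Hypothetical MBean additions, only to show the level of control external
// tools would need: set or inspect splits, list their state, pause/resume a run.
import java.util.List;
import java.util.Map;

public interface RepairCoordinationMBean
{
    // Override the number of splits per node, or pass <= 0 to let Cassandra decide.
    void setRepairSplitCount(String keyspace, int splitsPerNode);

    // List sequences (splits) with their current state so progress can be tracked externally.
    List<Map<String, String>> listRepairSequences(String keyspace);

    // Pause/resume a given repair run without cancelling it.
    void pauseRepairRun(int runId);
    void resumeRepairRun(int runId);
}
{code}

That would let external tools drive or monitor repairs without having to 
reimplement the split computation themselves.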

Also, the current design should evolve to allow a single sequence to include 
multiple token ranges. We have that feature waiting to be merged in Reaper: it 
groups token ranges that have the same replicas, in order to reduce the 
overhead of vnodes (sketch below).
Starting with 3.0, repair jobs can be triggered with multiple token ranges that 
will be executed as a single session if the replicas are the same for all of 
them. So, to avoid having to change the data model in the future, I'd suggest 
storing a list of token ranges instead of just one.
Repair events should also be tracked in a separate table to avoid overwriting 
the last event each time (one thing Reaper currently sucks at as well).
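
For reference, the grouping is roughly this (illustrative code, not the actual 
Reaper implementation):

{code:java}
// Sketch of the range grouping: vnode ranges sharing the same replica set are
// bundled so that each bundle can be run as a single repair session.
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class RangeGrouping
{
    public static class TokenRange
    {
        public final long start;
        public final long end;

        public TokenRange(long start, long end)
        {
            this.start = start;
            this.end = end;
        }
    }

    // Group token ranges by replica set; each resulting group is a candidate for
    // a single repair session, which drastically reduces the per-session overhead
    // with 256 vnodes per node.
    public static Map<Set<String>, List<TokenRange>> groupByReplicas(Map<TokenRange, Set<String>> replicasByRange)
    {
        Map<Set<String>, List<TokenRange>> groups = new HashMap<>();
        for (Map.Entry<TokenRange, Set<String>> e : replicasByRange.entrySet())
        {
            groups.computeIfAbsent(e.getValue(), k -> new ArrayList<>()).add(e.getKey());
        }
        return groups;
    }
}
{code}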

I'll go back to the document soon and add my comments there.

 

Cheers

> Scheduled Repair in Cassandra
> -----------------------------
>
>                 Key: CASSANDRA-14346
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-14346
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: Repair
>            Reporter: Joseph Lynch
>            Priority: Major
>              Labels: CommunityFeedbackRequested
>             Fix For: 4.0
>
>         Attachments: ScheduledRepairV1_20180327.pdf
>
>
> There have been many attempts to automate repair in Cassandra, which makes 
> sense given that it is necessary to give our users eventual consistency. Most 
> recently CASSANDRA-10070, CASSANDRA-8911 and CASSANDRA-13924 have all looked 
> for ways to solve this problem.
> At Netflix we've built a scheduled repair service within Priam (our sidecar), 
> which we spoke about last year at NGCC. Given the positive feedback at NGCC 
> we focussed on getting it production ready and have now been using it in 
> production to repair hundreds of clusters, tens of thousands of nodes, and 
> petabytes of data for the past six months. Also based on feedback at NGCC we 
> have invested effort in figuring out how to integrate this natively into 
> Cassandra rather than open sourcing it as an external service (e.g. in Priam).
> As such, [~vinaykumarcse] and I would like to re-work and merge our 
> implementation into Cassandra, and have created a [design 
> document|https://docs.google.com/document/d/1RV4rOrG1gwlD5IljmrIq_t45rz7H3xs9GbFSEyGzEtM/edit?usp=sharing]
>  showing how we plan to make it happen, including the user interface.
> As we work on the code migration from Priam to Cassandra, any feedback would 
> be greatly appreciated about the interface or v1 implementation features. I 
> have tried to call out in the document features which we explicitly consider 
> future work (as well as a path forward to implement them in the future) 
> because I would very much like to get this done before the 4.0 merge window 
> closes, and to do that I think aggressively pruning scope is going to be a 
> necessity.


