[
https://issues.apache.org/jira/browse/CASSANDRA-14346?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16420504#comment-16420504
]
Alexander Dejanovski commented on CASSANDRA-14346:
--------------------------------------------------
I really like the idea of making repair something that is coordinated by the
cluster instead of being node-centric as it is today.
This is how it should be implemented, and external tools should only add
features on top of it. nodetool really should be doing this by default.
Overall I agree with the state machine that is detailed (though I haven't
spent that much time on it...)
I disagree with point 6 of the doc's Resiliency section, which claims that
adding nodes won't impact the repair: it will change the token ranges, and
some of the splits will then spread across different replica sets, which makes
them unsuitable for repair (think of clusters with 256 vnodes per node).
You either have to cancel the repair or recompute the remaining splits to move
on with the job.
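To illustrate the problem, here is a minimal sketch of the check I have in
mind after a topology change; the types and helper names are hypothetical
assumptions on my part, not the API from the design doc:

{code:java}
import java.net.InetAddress;
import java.util.List;
import java.util.Set;

final class SplitValidator
{
    interface ReplicaSource
    {
        // Replicas owning a given token range in the *current* topology.
        Set<InetAddress> replicasFor(long leftToken, long rightToken);
    }

    // A split is still repairable only if every range it covers is owned by
    // the same replica set; a newly bootstrapped node splits ranges and
    // breaks this invariant for some of the pending splits.
    static boolean stillRepairable(List<long[]> ranges, ReplicaSource source)
    {
        Set<InetAddress> expected = null;
        for (long[] range : ranges)
        {
            Set<InetAddress> replicas = source.replicasFor(range[0], range[1]);
            if (expected == null)
                expected = replicas;
            else if (!expected.equals(replicas))
                return false; // replicas diverged: cancel or recompute the split
        }
        return expected != null;
    }
}
{code}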
I would add a feature to your nodetool repairstatus command that allows
listing only the currently running repairs.
Then, I think the approach of implementing a fully automated, seamless,
continuous repair "that just works" without user intervention is unsafe in the
wild; there are too many caveats.
There are many different types of clusters out there, and some of them simply
cannot run repair without careful tuning and monitoring (if at all).
The current design shows no backpressure mechanism to ensure that further
sequences won't harm a cluster that is already running late on compactions
(whether due to overstreaming, entropy, or just the normal activity of the
cluster).
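A sketch of the kind of gate I would expect before each new sequence; the
threshold and names are illustrative assumptions, and the pending-compactions
count would presumably come from the node's existing compaction metrics:

{code:java}
// Hypothetical backpressure check, sketched under the assumption that the
// scheduler can read the local pending compaction count before starting
// the next sequence. Threshold and names are illustrative only.
final class RepairBackpressure
{
    private static final int MAX_PENDING_COMPACTIONS = 100; // illustrative

    interface CompactionStats
    {
        // e.g. backed by the node's pending compaction tasks metric
        int pendingCompactions();
    }

    // Returns true if the next repair sequence may start; the scheduler
    // should otherwise delay and re-check instead of piling on more work.
    static boolean mayStartNextSequence(CompactionStats stats)
    {
        return stats.pendingCompactions() < MAX_PENDING_COMPACTIONS;
    }
}
{code}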
Repairing table by table will add a lot of overhead compared to repairing a
list of tables (or all of them) in a single session, unless multiple
simultaneous repairs on a node are allowed, which in turn won't permit safely
terminating a single repair.
It is also unclear in the current design whether repair can be disabled for
selected tables (with something like "type: none", for example).
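For illustration, a per-table opt-out could look like the following; the
syntax is invented for the sake of the example and is not from the design doc:

{code:sql}
-- Hypothetical syntax, for illustration only: opt a single table out of
-- scheduled repair while the rest of the keyspace keeps being repaired.
ALTER TABLE my_keyspace.my_table
    WITH repair = {'type': 'none'};
{code}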
The proposal doesn't seem to involve any change to how "nodetool repair"
behaves. Will it be changed to use the state machine and coordinate across
the cluster?
Trying to replace external tools with built-in features has its limits, I
think, and currently the design gives such external tools (be it Reaper or the
DataStax repair service or Priam or ...) only limited control.
To borrow an analogy seen recently on the ML, it's as if Cassandra implemented
automatic propagation of configuration changes itself instead of relying on
tools like Chef or Puppet.
You'll still need global tools to manage repairs across several clusters
anyway, which a Cassandra built-in feature cannot (and should not) provide.
My point is that making repair smarter and coordinated within Cassandra is a
great idea and I support it 100%, but the current design makes it too
automated, and the defaults could easily lead to severe performance problems
without the user having triggered anything.
I also don't know how it could be made to work alongside user-defined repairs,
as you'll need to force-terminate some sessions.
To summarize, I would put the scheduling features aside and implement
coordinated repair by splits within Cassandra. The StorageServiceMBean should
evolve to allow manually setting the number of splits per node, or to rely on
a number of splits generated by Cassandra itself.
It should then also be possible to track progress externally by listing splits
(sequences) through JMX, and to pause/resume selected repair runs.
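Roughly, I'm picturing something along these lines; the method names and the
exact shape of the interface are assumptions on my part, not the proposed API:

{code:java}
import java.util.List;
import java.util.Map;

// Sketch of how StorageServiceMBean (or a dedicated repair MBean) could
// evolve to give external tools control; hypothetical names throughout.
public interface RepairCoordinationMBean
{
    // Manually set the number of splits per node, or pass 0 to let
    // Cassandra compute a sensible number itself.
    void setRepairSplitsPerNode(int splits);

    // List splits (sequences) and their state, so external tools can
    // track progress over JMX.
    List<Map<String, String>> listRepairSequences(boolean runningOnly);

    // Pause/resume a selected repair run without force-terminating it.
    void pauseRepairRun(String runId);
    void resumeRepairRun(String runId);
}
{code}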
Also, the current design should evolve to allow a single sequence to include
multiple token ranges. We have that feature waiting to be merged in Reaper: it
groups token ranges that have the same replicas, in order to reduce the
overhead of vnodes.
Starting with 3.0, repair jobs can be triggered with multiple token ranges
that will be executed as a single session if the replicas are the same for all
of them. So, to avoid having to change the data model in the future, I'd
suggest storing a list of token ranges instead of just one.
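The grouping itself is straightforward; here is a simplified sketch of what
we do in Reaper (placeholder types, not the actual Reaper code):

{code:java}
import java.net.InetAddress;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.function.Function;

final class RangeGrouper
{
    // Buckets token ranges by their replica set. Each resulting bucket can
    // be repaired as a single session on 3.0+, since all its ranges share
    // the same replicas, which is why a sequence should store a *list* of
    // token ranges rather than a single one.
    static <R> Map<Set<InetAddress>, List<R>> groupByReplicas(
            List<R> ranges, Function<R, Set<InetAddress>> replicasOf)
    {
        Map<Set<InetAddress>, List<R>> groups = new HashMap<>();
        for (R range : ranges)
            groups.computeIfAbsent(replicasOf.apply(range), k -> new ArrayList<>())
                  .add(range);
        return groups;
    }
}
{code}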
Repair events should also be tracked in a separate table, to avoid overwriting
the last event each time (one thing Reaper currently sucks at as well).
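For example, an event log table keyed on the run id and clustered on the event
time; this schema is purely illustrative, not from the design doc:

{code:sql}
-- Illustrative schema only: clustering on the event timestamp keeps the
-- full event history per repair run instead of overwriting a single
-- "last event" cell.
CREATE TABLE repair_events (
    run_id     timeuuid,
    event_time timestamp,
    node       inet,
    status     text,
    message    text,
    PRIMARY KEY ((run_id), event_time)
) WITH CLUSTERING ORDER BY (event_time DESC);
{code}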
I'll go back to the document soon and add my comments there.
Cheers
> Scheduled Repair in Cassandra
> -----------------------------
>
> Key: CASSANDRA-14346
> URL: https://issues.apache.org/jira/browse/CASSANDRA-14346
> Project: Cassandra
> Issue Type: Improvement
> Components: Repair
> Reporter: Joseph Lynch
> Priority: Major
> Labels: CommunityFeedbackRequested
> Fix For: 4.0
>
> Attachments: ScheduledRepairV1_20180327.pdf
>
>
> There have been many attempts to automate repair in Cassandra, which makes
> sense given that it is necessary to give our users eventual consistency. Most
> recently CASSANDRA-10070, CASSANDRA-8911 and CASSANDRA-13924 have all looked
> for ways to solve this problem.
> At Netflix we've built a scheduled repair service within Priam (our sidecar),
> which we spoke about last year at NGCC. Given the positive feedback at NGCC
> we focussed on getting it production ready and have now been using it in
> production to repair hundreds of clusters, tens of thousands of nodes, and
> petabytes of data for the past six months. Also based on feedback at NGCC we
> have invested effort in figuring out how to integrate this natively into
> Cassandra rather than open sourcing it as an external service (e.g. in Priam).
> As such, [~vinaykumarcse] and I would like to re-work and merge our
> implementation into Cassandra, and have created a [design
> document|https://docs.google.com/document/d/1RV4rOrG1gwlD5IljmrIq_t45rz7H3xs9GbFSEyGzEtM/edit?usp=sharing]
> showing how we plan to make it happen, including the user interface.
> As we work on the code migration from Priam to Cassandra, any feedback would
> be greatly appreciated about the interface or v1 implementation features. I
> have tried to call out in the document features which we explicitly consider
> future work (as well as a path forward to implement them in the future)
> because I would very much like to get this done before the 4.0 merge window
> closes, and to do that I think aggressively pruning scope is going to be a
> necessity.