[
https://issues.apache.org/jira/browse/CASSANDRA-14346?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16423716#comment-16423716
]
Nate McCall commented on CASSANDRA-14346:
-----------------------------------------
We've never coordinated system level maintenance tasks before, and just looking
through the state machine in your design document (thanks a bunch for taking
the time to put that together) makes me nervous about the number of moving
parts (basically what [~bdeggleston] pointed out above) that we'd be
introducing.
I'm in the camp of relying on externalized coordination and control as being an
easier place to reason about what is happening in a repair session for now.
There has been so much excellent work on repair over the past year that I would
really like to see some of that 'bake in' to get people comfortable and
trusting us again before we add a dimension of complexity. I very much
appreciate that you are running a version of this in production currently, but
there is just so much that can go wrong and it's a whole new paradigm for us to
include in the code base. We just can't afford to screw this up again.
Curious about what [[email protected]] thinks here, as I agree that some of
these ideas might be much smoother to implement with CASSANDRA-12944 in place.
As Blake suggested, maybe we walk this back a bit and start from the
control-plane/event loop and approach this as part of refactoring management in
general?
> Scheduled Repair in Cassandra
> -----------------------------
>
> Key: CASSANDRA-14346
> URL: https://issues.apache.org/jira/browse/CASSANDRA-14346
> Project: Cassandra
> Issue Type: Improvement
> Components: Repair
> Reporter: Joseph Lynch
> Priority: Major
> Labels: CommunityFeedbackRequested
> Fix For: 4.0
>
> Attachments: ScheduledRepairV1_20180327.pdf
>
>
> There have been many attempts to automate repair in Cassandra, which makes
> sense given that it is necessary to give our users eventual consistency. Most
> recently CASSANDRA-10070, CASSANDRA-8911 and CASSANDRA-13924 have all looked
> for ways to solve this problem.
> At Netflix we've built a scheduled repair service within Priam (our sidecar),
> which we spoke about last year at NGCC. Given the positive feedback at NGCC
> we focussed on getting it production ready and have now been using it in
> production to repair hundreds of clusters, tens of thousands of nodes, and
> petabytes of data for the past six months. Also based on feedback at NGCC we
> have invested effort in figuring out how to integrate this natively into
> Cassandra rather than open sourcing it as an external service (e.g. in Priam).
> As such, [~vinaykumarcse] and I would like to re-work and merge our
> implementation into Cassandra, and have created a [design
> document|https://docs.google.com/document/d/1RV4rOrG1gwlD5IljmrIq_t45rz7H3xs9GbFSEyGzEtM/edit?usp=sharing]
> showing how we plan to make it happen, including the user interface.
> As we work on the code migration from Priam to Cassandra, any feedback would
> be greatly appreciated about the interface or v1 implementation features. I
> have tried to call out in the document features which we explicitly consider
> future work (as well as a path forward to implement them in the future)
> because I would very much like to get this done before the 4.0 merge window
> closes, and to do that I think aggressively pruning scope is going to be a
> necessity.