[ 
https://issues.apache.org/jira/browse/CASSANDRA-14346?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16423716#comment-16423716
 ] 

Nate McCall commented on CASSANDRA-14346:
-----------------------------------------

We've never coordinated system level maintenance tasks before, and just looking 
through the state machine in your design document (thanks a bunch for taking 
the time to put that together) makes me nervous about the number of moving 
parts (basically what [~bdeggleston] pointed out above) that we'd be 
introducing. 

I'm in the camp of relying on externalized coordination and control as being an 
easier place to reason about what is happening in a repair session for now. 
There has been so much excellent work on repair over the past year that I would 
really like to see some of that 'bake in' to get people comfortable and 
trusting us again before we add a dimension of complexity. I very much 
appreciate that you are running a version of this in production currently, but 
there is just so much that can go wrong and it's a whole new paradigm for us to 
include in the code base. We just can't afford to screw this up again. 

Curious about what [~spo...@gmail.com] thinks here, as I agree that some of 
these ideas might be much smoother to implement with CASSANDRA-12944 in place. 
As Blake suggested, maybe we walk this back a bit and start from the 
control-plane/event loop and approach this as part of refactoring management in 
general?

> Scheduled Repair in Cassandra
> -----------------------------
>
>                 Key: CASSANDRA-14346
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-14346
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: Repair
>            Reporter: Joseph Lynch
>            Priority: Major
>              Labels: CommunityFeedbackRequested
>             Fix For: 4.0
>
>         Attachments: ScheduledRepairV1_20180327.pdf
>
>
> There have been many attempts to automate repair in Cassandra, which makes 
> sense given that it is necessary to give our users eventual consistency. Most 
> recently CASSANDRA-10070, CASSANDRA-8911 and CASSANDRA-13924 have all looked 
> for ways to solve this problem.
> At Netflix we've built a scheduled repair service within Priam (our sidecar), 
> which we spoke about last year at NGCC. Given the positive feedback at NGCC 
> we focussed on getting it production ready and have now been using it in 
> production to repair hundreds of clusters, tens of thousands of nodes, and 
> petabytes of data for the past six months. Also based on feedback at NGCC we 
> have invested effort in figuring out how to integrate this natively into 
> Cassandra rather than open sourcing it as an external service (e.g. in Priam).
> As such, [~vinaykumarcse] and I would like to re-work and merge our 
> implementation into Cassandra, and have created a [design 
> document|https://docs.google.com/document/d/1RV4rOrG1gwlD5IljmrIq_t45rz7H3xs9GbFSEyGzEtM/edit?usp=sharing]
>  showing how we plan to make it happen, including the user interface.
> As we work on the code migration from Priam to Cassandra, any feedback would 
> be greatly appreciated about the interface or v1 implementation features. I 
> have tried to call out in the document features which we explicitly consider 
> future work (as well as a path forward to implement them in the future) 
> because I would very much like to get this done before the 4.0 merge window 
> closes, and to do that I think aggressively pruning scope is going to be a 
> necessity.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org
