[ https://issues.apache.org/jira/browse/CASSANDRA-14346?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16693844#comment-16693844 ]

Joseph Lynch commented on CASSANDRA-14346:
------------------------------------------

Quick update for those watching this ticket.

During the Reaper donation discussion this took a different turn, and we 
don't have a shepherd to get this in any time soon. As such, I believe the 
plan is to start collaborating in the management process ticket 
(CASSANDRA-14395) on something more basic, like health checks, instead of 
going straight to the hard problem of robust repair scheduling. Once that 
management process is built up and becomes a mature part of the project, we 
will revisit robust repair scheduling that works out of the box for everyone; 
or Reaper will have been donated / integrated by that point.

For those that need repair scheduling in production right now, I'm not sure 
what the path forward is yet. We at Netflix would like to re-work the attached 
patch and include it with our releases of Priam instead of Cassandra as a 
"distributed execution engine" for executing distributed maintenance tasks 
(repair, restart, upgrade, etc.), but I'm not sure yet what the timeline is or 
how it will be prioritized relative to the management process.

> Scheduled Repair in Cassandra
> -----------------------------
>
>                 Key: CASSANDRA-14346
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-14346
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: Repair
>            Reporter: Joseph Lynch
>            Assignee: Joseph Lynch
>            Priority: Major
>              Labels: 4.0-feature-freeze-review-requested, 
> CommunityFeedbackRequested
>             Fix For: 4.x
>
>         Attachments: ScheduledRepairV1_20180327.pdf
>
>
> There have been many attempts to automate repair in Cassandra, which makes 
> sense given that it is necessary to give our users eventual consistency. Most 
> recently CASSANDRA-10070, CASSANDRA-8911 and CASSANDRA-13924 have all looked 
> for ways to solve this problem.
> At Netflix we've built a scheduled repair service within Priam (our sidecar), 
> which we spoke about last year at NGCC. Given the positive feedback at NGCC 
> we focussed on getting it production ready and have now been using it in 
> production to repair hundreds of clusters, tens of thousands of nodes, and 
> petabytes of data for the past six months. Also based on feedback at NGCC we 
> have invested effort in figuring out how to integrate this natively into 
> Cassandra rather than open sourcing it as an external service (e.g. in Priam).
> As such, [~vinaykumarcse] and I would like to re-work and merge our 
> implementation into Cassandra, and have created a [design 
> document|https://docs.google.com/document/d/1RV4rOrG1gwlD5IljmrIq_t45rz7H3xs9GbFSEyGzEtM/edit?usp=sharing]
>  showing how we plan to make it happen, including the user interface.
> As we work on the code migration from Priam to Cassandra, any feedback about 
> the interface or v1 implementation features would be greatly appreciated. I 
> have tried to call out in the document features which we explicitly consider 
> future work (as well as a path forward to implement them in the future) 
> because I would very much like to get this done before the 4.0 merge window 
> closes, and to do that I think aggressively pruning scope is going to be a 
> necessity.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
