[jira] [Commented] (CASSANDRA-14346) Scheduled Repair in Cassandra

Blake Eggleston (JIRA) Thu, 19 Apr 2018 12:14:34 -0700

    [ 
https://issues.apache.org/jira/browse/CASSANDRA-14346?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16444654#comment-16444654
 ]


Blake Eggleston commented on CASSANDRA-14346:
---------------------------------------------

bq. I'm not entirely sure which interfaces need to be spruced up? I think the 
existing trunk methods are sufficient for the sidecar to rectify the state of 
repairs that are running in Cassandra with those in the database.

[~jolynch], those methods only work with incremental repairs; full repairs 
can't be controlled through them. Also, they only fail repairs in the sense 
that they prevent the incremental repair session (which is different from a 
parent repair session) from moving to its next state and instead short-circuit 
it to failed. Validations and streams that are in flight are still 
uncontrollable. I assume what Kurt meant is that we rely on an unbroken JMX 
connection on the repair client side.

bq. Since the repair scheduler keeps work very small (targeting ~30 minute 
pieces of work) even if we do the calculations wrong we shouldn't lose very 
much work.

Should I interpret this to mean that your scheduler breaks incremental repairs 
into small subranges?

> Scheduled Repair in Cassandra
> -----------------------------
>
>                 Key: CASSANDRA-14346
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-14346
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: Repair
>            Reporter: Joseph Lynch
>            Priority: Major
>              Labels: CommunityFeedbackRequested
>             Fix For: 4.0
>
>         Attachments: ScheduledRepairV1_20180327.pdf
>
>
> There have been many attempts to automate repair in Cassandra, which makes 
> sense given that it is necessary to give our users eventual consistency. Most 
> recently CASSANDRA-10070, CASSANDRA-8911 and CASSANDRA-13924 have all looked 
> for ways to solve this problem.
> At Netflix we've built a scheduled repair service within Priam (our sidecar), 
> which we spoke about last year at NGCC. Given the positive feedback at NGCC 
> we focussed on getting it production ready and have now been using it in 
> production to repair hundreds of clusters, tens of thousands of nodes, and 
> petabytes of data for the past six months. Also based on feedback at NGCC we 
> have invested effort in figuring out how to integrate this natively into 
> Cassandra rather than open sourcing it as an external service (e.g. in Priam).
> As such, [~vinaykumarcse] and I would like to re-work and merge our 
> implementation into Cassandra, and have created a [design 
> document|https://docs.google.com/document/d/1RV4rOrG1gwlD5IljmrIq_t45rz7H3xs9GbFSEyGzEtM/edit?usp=sharing]
>  showing how we plan to make it happen, including the user interface.
> As we work on the code migration from Priam to Cassandra, we would greatly 
> appreciate any feedback on the interface or v1 implementation features. I 
> have tried to call out in the document features which we explicitly consider 
> future work (as well as a path forward to implement them in the future) 
> because I would very much like to get this done before the 4.0 merge window 
> closes, and to do that I think aggressively pruning scope is going to be a 
> necessity.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

