[ 
https://issues.apache.org/jira/browse/CASSANDRA-14346?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16444675#comment-16444675
 ] 

Joseph Lynch commented on CASSANDRA-14346:
------------------------------------------

Ah ok, good to know. The scheduler right now just uses the 
[{{ActiveRepairService}}|https://github.com/apache/cassandra/blob/34a1d5da58fb8edcad39633084541bb4162f5ede/src/java/org/apache/cassandra/service/ActiveRepairService.java#L98]
 thread pool to see if Cassandra is running repairs (looks like this is renamed 
to {{Repair-Task}} in trunk) and 
[{{forceTerminateAllRepairSessions}}|https://github.com/apache/cassandra/blob/cb67bfc1639ded1b6937e7347ad42177ea3f24e3/src/java/org/apache/cassandra/service/StorageServiceMBean.java#L348]
 to kill them after a resume + timeout. We can do the same for trunk or enrich 
the interface to give more granular control (status by repair cmd number would 
probably be sufficient, although maybe we'd need actual uuids for repairs). 
Right now we don't support timing out individual parallel subranges (we just 
kill everything), so being able to cancel individual repairs would be a nice 
improvement (I know cancelling doesn't stop the streaming in 2.x; I'm not sure 
about trunk).
{quote}Should I interpret this to mean that your scheduler breaks incremental 
repairs into small subranges?
{quote}
We're running 2.1 in production, so we only do full range since we heard that 
incremental was very broken in 2.1. The subrange breaking is at a higher 
level of abstraction, though, so I don't see why it couldn't apply to 
incremental if we wanted. I'm not sure of the state of incremental + subrange; 
is it fixed in trunk? If so, we can definitely do the splitting for incremental 
as well. We like splitting up the token ranges into similarly sized pieces 
because it makes the timeout logic much easier to reason about (with 
long-running repairs it is very hard to tell whether they are stuck or not).
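
To make the token range splitting concrete, here is a rough sketch of cutting one range into pieces of equal token width. This is not the Priam code: the class name {{SubrangeSplitter}}, the {{Range}} holder and the {{split}} method are made up for illustration, it assumes a non-wrapping (start, end] range, and equal token width only approximates equal data size.

```java
import java.math.BigInteger;
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; not the scheduler's actual implementation.
final class SubrangeSplitter {

    /** A (start, end] token range. Assumes start < end (no wrap-around). */
    static final class Range {
        final BigInteger start;
        final BigInteger end;

        Range(BigInteger start, BigInteger end) {
            this.start = start;
            this.end = end;
        }
    }

    /**
     * Splits (start, end] into at most `pieces` contiguous subranges of
     * (nearly) equal token width. Fewer pieces are returned when the range
     * holds fewer tokens than `pieces`.
     */
    static List<Range> split(BigInteger start, BigInteger end, int pieces) {
        if (pieces < 1 || start.compareTo(end) >= 0) {
            throw new IllegalArgumentException("need pieces >= 1 and start < end");
        }
        BigInteger width = end.subtract(start);
        BigInteger n = BigInteger.valueOf(pieces);
        List<Range> out = new ArrayList<>();
        BigInteger prev = start;
        for (int i = 1; i <= pieces; i++) {
            // Boundary i sits at start + width * i / pieces; the last one equals end.
            BigInteger next = start.add(width.multiply(BigInteger.valueOf(i)).divide(n));
            if (next.compareTo(prev) > 0) { // skip empty subranges
                out.add(new Range(prev, next));
                prev = next;
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Signed 64-bit token space, used here only as an example input.
        BigInteger min = BigInteger.valueOf(Long.MIN_VALUE);
        BigInteger max = BigInteger.valueOf(Long.MAX_VALUE);
        for (Range r : split(min, max, 4)) {
            System.out.println("(" + r.start + ", " + r.end + "]");
        }
    }
}
```

With pieces of roughly equal width, a single per-subrange timeout becomes a reasonable test for a stuck repair, which is the property we care about.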

 

> Scheduled Repair in Cassandra
> -----------------------------
>
>                 Key: CASSANDRA-14346
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-14346
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: Repair
>            Reporter: Joseph Lynch
>            Priority: Major
>              Labels: CommunityFeedbackRequested
>             Fix For: 4.0
>
>         Attachments: ScheduledRepairV1_20180327.pdf
>
>
> There have been many attempts to automate repair in Cassandra, which makes 
> sense given that it is necessary to give our users eventual consistency. Most 
> recently CASSANDRA-10070, CASSANDRA-8911 and CASSANDRA-13924 have all looked 
> for ways to solve this problem.
> At Netflix we've built a scheduled repair service within Priam (our sidecar), 
> which we spoke about last year at NGCC. Given the positive feedback at NGCC 
> we focussed on getting it production ready and have now been using it in 
> production to repair hundreds of clusters, tens of thousands of nodes, and 
> petabytes of data for the past six months. Also based on feedback at NGCC we 
> have invested effort in figuring out how to integrate this natively into 
> Cassandra rather than open sourcing it as an external service (e.g. in Priam).
> As such, [~vinaykumarcse] and I would like to re-work and merge our 
> implementation into Cassandra, and have created a [design 
> document|https://docs.google.com/document/d/1RV4rOrG1gwlD5IljmrIq_t45rz7H3xs9GbFSEyGzEtM/edit?usp=sharing]
>  showing how we plan to make it happen, including the user interface.
> As we work on the code migration from Priam to Cassandra, any feedback on 
> the interface or v1 implementation features would be greatly appreciated. I 
> have tried to call out in the document features which we explicitly consider 
> future work (as well as a path forward to implement them in the future) 
> because I would very much like to get this done before the 4.0 merge window 
> closes, and to do that I think aggressively pruning scope is going to be a 
> necessity.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

