[ https://issues.apache.org/jira/browse/CASSANDRA-14346?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16421098#comment-16421098 ]

Joseph Lynch commented on CASSANDRA-14346:
------------------------------------------

[~bdeggleston]
{quote}I think the problems that exist in C* with regard to understanding the 
state of repairs and streams, and the inability to cancel them without 
restarting nodes are orthogonal to talking about the best approach to 
coordinate them. 
{quote}
I disagree. When the scheduler is inside Cassandra, it shares one process
lifecycle with the repairs and doesn't have to do IPC via JMX (which is,
honestly speaking, really bad IPC). As a concrete example, when the outside
process restarts it loses all active JMX connections and therefore loses track
of all running repairs, with no way to get them back. We'd have to implement
some kind of more robust IPC than JMX (e.g. CASSANDRA-12944) for this to ever
work well, imo. On the other hand, when the scheduler is inside the same
process we don't have to solve IPC, just inter-thread communication, which is
much easier.
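A minimal sketch of the in-process argument, assuming a hypothetical `InProcessRepairScheduler` that runs each repair as a plain `Runnable` on a `java.util.concurrent` executor (this is not Cassandra's actual repair API). Because the scheduler holds a `Future` for every repair it started, checking status or cancelling is a direct method call inside one JVM rather than a JMX notification that an external process could miss, or lose entirely on restart.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical sketch: a scheduler living in the same JVM as the repairs it runs,
// so tracking and cancelling them is inter-thread communication, not JMX.
final class InProcessRepairScheduler {
    private final ExecutorService executor = Executors.newFixedThreadPool(1);
    // Handles to every submitted repair, keyed by a caller-chosen id.
    private final Map<String, Future<?>> repairs = new ConcurrentHashMap<>();

    // Submit a unit of repair work; `work` stands in for a real repair session.
    public void submit(String id, Runnable work) {
        repairs.put(id, executor.submit(work));
    }

    // Status is read directly from the Future, so there is no notification to lose.
    // Simplification: tasks still queued behind the single worker also report RUNNING.
    public String status(String id) {
        Future<?> f = repairs.get(id);
        if (f == null) return "UNKNOWN";
        if (f.isCancelled()) return "CANCELLED"; // check first: cancelled futures are also "done"
        return f.isDone() ? "DONE" : "RUNNING";
    }

    // Cancels a queued task, or interrupts the worker thread if it is already running.
    public boolean cancel(String id) {
        Future<?> f = repairs.get(id);
        return f != null && f.cancel(true);
    }

    public void shutdown() {
        executor.shutdownNow();
    }
}
```

In this sketch `cancel(true)` only interrupts the worker thread, so a running task stops early only if it responds to interruption.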
{quote}As far as I’m aware, it’s not currently possible for a repair to 
determine if it’s taking a long time, finished with a lost notification, or 
stuck somewhere. So that’s really a limitation in the design of how cassandra 
does individual streams and repair sessions that should be solved regardless, 
and not really an argument in favor of one approach or the other.
{quote}
I definitely agree this is a big problem either way, and I think the core idea
of our proposal is to keep each unit of work small so that if we do have to
cancel or lose one it's not a big deal. Hopefully with robust incremental
repair this won't be as big an issue, because the occasional full-range repair
can just use super small subranges and not worry about streaming too many
sstables, since incremental repairs have theoretically already repaired most
of the data.
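To make "keep work small" concrete, here is a minimal sketch that splits one token range into `n` contiguous subranges, each of which could then be repaired (and cancelled or retried) independently. It assumes Murmur3Partitioner-style `long` tokens and a non-wrapping range `(left, right]`; `SubrangeSplitter` is a hypothetical helper, not an existing Cassandra class.

```java
import java.math.BigInteger;
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: split one (left, right] token range into n contiguous
// subranges so each repair unit is small and cheap to cancel or retry.
// Assumes long (Murmur3-style) tokens and a non-wrapping range (left < right).
final class SubrangeSplitter {
    static List<long[]> split(long left, long right, int n) {
        if (n <= 0 || left >= right) {
            throw new IllegalArgumentException("need n > 0 and left < right");
        }
        // BigInteger avoids overflow: right - left can exceed Long.MAX_VALUE.
        BigInteger lo = BigInteger.valueOf(left);
        BigInteger width = BigInteger.valueOf(right).subtract(lo);
        BigInteger parts = BigInteger.valueOf(n);
        List<long[]> result = new ArrayList<>();
        long start = left;
        for (int i = 1; i <= n; i++) {
            // Boundary i is left + floor(width * i / n); the last boundary equals right.
            long end = lo.add(width.multiply(BigInteger.valueOf(i)).divide(parts)).longValueExact();
            if (end > start) {
                result.add(new long[] {start, end}); // represents (start, end]
            }
            start = end; // empty subranges (when n > width) are skipped
        }
        return result;
    }
}
```

For example, `SubrangeSplitter.split(0, 100, 4)` yields `(0,25]`, `(25,50]`, `(50,75]`, `(75,100]`. `BigInteger` arithmetic is used because the width of a full-ring range does not fit in a `long`.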

> Scheduled Repair in Cassandra
> -----------------------------
>
>                 Key: CASSANDRA-14346
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-14346
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: Repair
>            Reporter: Joseph Lynch
>            Priority: Major
>              Labels: CommunityFeedbackRequested
>             Fix For: 4.0
>
>         Attachments: ScheduledRepairV1_20180327.pdf
>
>
> There have been many attempts to automate repair in Cassandra, which makes 
> sense given that it is necessary to give our users eventual consistency. Most 
> recently CASSANDRA-10070, CASSANDRA-8911 and CASSANDRA-13924 have all looked 
> for ways to solve this problem.
> At Netflix we've built a scheduled repair service within Priam (our sidecar), 
> which we spoke about last year at NGCC. Given the positive feedback at NGCC 
> we focussed on getting it production ready and have now been using it in 
> production to repair hundreds of clusters, tens of thousands of nodes, and 
> petabytes of data for the past six months. Also based on feedback at NGCC we 
> have invested effort in figuring out how to integrate this natively into 
> Cassandra rather than open sourcing it as an external service (e.g. in Priam).
> As such, [~vinaykumarcse] and I would like to re-work and merge our 
> implementation into Cassandra, and have created a [design 
> document|https://docs.google.com/document/d/1RV4rOrG1gwlD5IljmrIq_t45rz7H3xs9GbFSEyGzEtM/edit?usp=sharing]
>  showing how we plan to make it happen, including the user interface.
> As we work on the code migration from Priam to Cassandra, any feedback would 
> be greatly appreciated about the interface or v1 implementation features. I 
> have tried to call out in the document features which we explicitly consider 
> future work (as well as a path forward to implement them in the future) 
> because I would very much like to get this done before the 4.0 merge window 
> closes, and to do that I think aggressively pruning scope is going to be a 
> necessity.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
