[ 
https://issues.apache.org/jira/browse/CASSANDRA-10070?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15050440#comment-15050440
 ] 

Marcus Olsson commented on CASSANDRA-10070:
-------------------------------------------

{quote}
While it may intuitively seem like you want to kick off a repair as soon as a 
node comes back online, it can be very dangerous in a production environment.

Starting the most resource-intensive process on a node that is already 
problematic, in a cluster that is already having issues, can exacerbate the 
issue and lead to a longer outage, or degradation, than anticipated. 
{quote}
True. It should probably be an opt-in feature that the user enables, perhaps 
with a configurable delay before the repair actually starts.

{quote}
Network reliability is another aspect of this. Let's say you have 3 nodes, 
RF=3 and there is a partition dividing node A and node B. All nodes are still 
actually up, but in this case node A will start a repair on B and B will start 
a repair on A. Now 2/3 of your cluster is needlessly repairing, which can cause 
serious performance problems, especially when running a loaded cluster.
{quote}
The repairs would still be executed under the distributed lock, so only one 
node would run a repair at a time. The two nodes would, however, send the job 
information to each other in parallel.
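
As a rough illustration of that behaviour (a sketch only: the class and method 
names are made up for this example, and an in-process Semaphore stands in for 
the actual distributed lock, whose mechanism is not described here), two nodes 
can request repairs in parallel while the repairs themselves run one at a time:

```java
import java.util.concurrent.Semaphore;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical sketch: an in-process Semaphore stands in for the distributed repair lock. */
class RepairLockSketch {
    private final Semaphore repairLock = new Semaphore(1); // one permit = one repair at a time
    private final AtomicInteger running = new AtomicInteger();
    private final AtomicInteger maxConcurrent = new AtomicInteger();
    private final AtomicInteger completed = new AtomicInteger();

    /** Blocks until the lock is free, then runs the repair job while holding it. */
    void runRepair(Runnable job) throws InterruptedException {
        repairLock.acquire();
        try {
            // Record the highest number of repairs ever observed running together.
            maxConcurrent.accumulateAndGet(running.incrementAndGet(), Math::max);
            job.run();
            completed.incrementAndGet();
        } finally {
            running.decrementAndGet();
            repairLock.release();
        }
    }

    int maxConcurrent() { return maxConcurrent.get(); }

    int completed() { return completed.get(); }

    /** Nodes A and B submit repair jobs for each other in parallel, as in the partition case. */
    static RepairLockSketch simulatePartition() {
        RepairLockSketch lock = new RepairLockSketch();
        Runnable request = () -> {
            try {
                lock.runRepair(() -> sleepQuietly(50)); // pretend the repair takes 50 ms
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        };
        Thread nodeA = new Thread(request, "node-A");
        Thread nodeB = new Thread(request, "node-B");
        nodeA.start();
        nodeB.start();
        try {
            nodeA.join();
            nodeB.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return lock;
    }

    private static void sleepQuietly(long millis) {
        try {
            Thread.sleep(millis);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }
}
```

Both jobs complete, but the second one waits for the first to release the 
lock, so the load of the partition scenario is serialized rather than doubled.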

{quote}
Also:
Other times you might not want a repair automatically started:
* The cluster is in the middle of a rolling upgrade where streaming is broken 
between versions.
* Heavily loaded clusters during normal operation (some users schedule repairs 
at night to not affect performance during normal hours of operation)
* Clusters where the read-consistency is high enough to account for the hints 
beyond the window, allowing the user to schedule the repair for a time that 
makes sense for their cluster and use-case.
{quote}
* This is something that the repair scheduler should handle either way, so 
that it avoids repairing when the cluster is unable to perform the repair 
(version incompatibility, nodes being down, etc.).
* There is a plug-in point for schedule policies that can be used to decide if 
repairs should run, so it would be possible to prevent repairs due to some 
condition(s). The conditions could be based on what the user wants, be it 
maintenance windows or resource usage. It would also be possible to prevent 
normal scheduled repairs during some hours, but allow manually scheduled 
repairs at all times.
* This would be possible by making this feature optional.
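
To make the policy plug-in point above more concrete, here is a minimal 
sketch. All type and method names (RepairContext, RepairSchedulePolicy, 
ClusterHealthPolicy, MaintenanceWindowPolicy, RepairPolicies) are hypothetical 
and only illustrate the idea; they are not the interface from the patch. A 
repair runs only if every configured policy allows it, and the maintenance 
window policy lets manually scheduled repairs through at all times:

```java
import java.time.LocalTime;
import java.util.List;

/** Hypothetical snapshot of what a policy may inspect; not a Cassandra type. */
final class RepairContext {
    final boolean manuallyScheduled;
    final boolean allNodesUp;
    final boolean versionsCompatible;
    final LocalTime now;

    RepairContext(boolean manuallyScheduled, boolean allNodesUp,
                  boolean versionsCompatible, LocalTime now) {
        this.manuallyScheduled = manuallyScheduled;
        this.allNodesUp = allNodesUp;
        this.versionsCompatible = versionsCompatible;
        this.now = now;
    }
}

/** Hypothetical plug-in point: each policy may veto a repair. */
interface RepairSchedulePolicy {
    boolean allowsRepair(RepairContext context);
}

/** Vetoes any repair while nodes are down or versions are incompatible. */
final class ClusterHealthPolicy implements RepairSchedulePolicy {
    @Override
    public boolean allowsRepair(RepairContext context) {
        return context.allNodesUp && context.versionsCompatible;
    }
}

/** Restricts normal scheduled repairs to a time window; manual repairs always pass. */
final class MaintenanceWindowPolicy implements RepairSchedulePolicy {
    private final LocalTime start;
    private final LocalTime end;

    MaintenanceWindowPolicy(LocalTime start, LocalTime end) {
        this.start = start;
        this.end = end;
    }

    @Override
    public boolean allowsRepair(RepairContext context) {
        return context.manuallyScheduled || inWindow(context.now);
    }

    private boolean inWindow(LocalTime time) {
        if (start.isBefore(end)) {
            return !time.isBefore(start) && time.isBefore(end);
        }
        // The window wraps past midnight, e.g. 22:00 to 04:00.
        return !time.isBefore(start) || time.isBefore(end);
    }
}

final class RepairPolicies {
    /** A repair may run only if no policy vetoes it. */
    static boolean allAllow(List<RepairSchedulePolicy> policies, RepairContext context) {
        for (RepairSchedulePolicy policy : policies) {
            if (!policy.allowsRepair(context)) {
                return false;
            }
        }
        return true;
    }
}
```

With a 22:00 to 04:00 window, a normal scheduled repair at noon would be 
rejected while a manually scheduled one at the same time would run, provided 
the cluster health policy also allows it.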

---

{quote}
I don't know much about Cassandra internals, so one of the regular devs would 
know better, but my thought would be during a restart, somewhere it figures out 
that it needs to replay part of the commit log to rebuild memtables that hadn't 
been flushed to disk. The timestamp of the last thing in the commit log might 
be a good estimate of when the node went down, and you could compare that to 
the current time to figure out how long the node was down.

I wouldn't worry about the second case since it would be hard to get that right.
{quote}
Looking at the commitlog might be a good enough approach. I'll look into that.
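
The comparison itself could look roughly like the sketch below. The class and 
method names are hypothetical, and it assumes that the timestamp of the last 
commit log entry has already been obtained somehow and that the hint window is 
passed in by the caller (e.g. taken from max_hint_window_in_ms in 
cassandra.yaml). How to actually read the timestamp from the commit log is not 
covered here.

```java
import java.time.Duration;
import java.time.Instant;

/** Hypothetical helper: estimates downtime from the last commit log timestamp. */
final class DowntimeEstimator {
    private DowntimeEstimator() {
    }

    /** Time between the last commit log entry and now; clamped to zero on clock skew. */
    static Duration estimateDowntime(Instant lastCommitLogEntry, Instant now) {
        Duration elapsed = Duration.between(lastCommitLogEntry, now);
        return elapsed.isNegative() ? Duration.ZERO : elapsed;
    }

    /** True if the estimated downtime is longer than the hint window. */
    static boolean exceededHintWindow(Instant lastCommitLogEntry, Instant now, Duration hintWindow) {
        return estimateDowntime(lastCommitLogEntry, now).compareTo(hintWindow) > 0;
    }
}
```

For example, with a hint window of Duration.ofHours(3), a last entry written 
four hours before the restart would count as exceeding the window, while one 
written an hour before would not. Note that on a node that received few 
writes, the last entry may be considerably older than the actual shutdown, so 
this gives an upper bound on the downtime rather than an exact value.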

---

Overall I'd say that if this feature (repairs triggered when the hint window 
has been exceeded) should exist, it should probably be something that is 
enabled per table, but disabled by default.

> Automatic repair scheduling
> ---------------------------
>
>                 Key: CASSANDRA-10070
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-10070
>             Project: Cassandra
>          Issue Type: Improvement
>            Reporter: Marcus Olsson
>            Assignee: Marcus Olsson
>            Priority: Minor
>             Fix For: 3.x
>
>
> Scheduling and running repairs in a Cassandra cluster is most often a 
> required task, but this can both be hard for new users and it also requires a 
> bit of manual configuration. There are good tools out there that can be used 
> to simplify things, but wouldn't this be a good feature to have inside of 
> Cassandra? To automatically schedule and run repairs, so that when you start 
> up your cluster it basically maintains itself in terms of normal 
> anti-entropy, with the possibility for manual configuration.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
