[ https://issues.apache.org/jira/browse/CASSANDRA-10070?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15158634#comment-15158634 ]

Marcus Olsson commented on CASSANDRA-10070:
-------------------------------------------

bq. We could probably replace the single resource lock 
('RepairResource-{dc}-{i}') with global ('Global-{dc}-{i}') or mutually 
exclusive resources ('CleanupAndRepairResource-{dc}-{i}') later if necessary. 
We'll probably only need some special care during upgrades when we introduce 
these new locks, but other than that I don't see any problem that could arise 
with renaming the resources later if necessary. Do you see any issue with this 
approach?
No, that should probably work, so we can keep it as 
'RepairResource-{dc}-{i}' for now. For the upgrades we could add a release 
note that says something like "pause/stop all scheduled repairs while upgrading 
from x.y to x.z". But in that case the pause/stop feature should be implemented 
as early as possible, to avoid an upgrade scenario that forces the user to 
first upgrade to the version that introduces the pause feature before upgrading 
to the latest. Another way would be to have the "system interrupts" feature in 
place early, so that repairs would be paused during an upgrade.
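
Just to illustrate what I mean with the lock resources, the sketch below is 
roughly how I picture the naming and acquisition. The LockFactory and 
DistributedLock names are only placeholders for whatever locking mechanism we 
end up with, not existing classes:

{code}
import java.util.Optional;

// Illustrative placeholders only, not existing Cassandra classes.
interface DistributedLock extends AutoCloseable
{
}

interface LockFactory
{
    Optional<DistributedLock> tryAcquire(String resource);
}

public final class RepairLockResource
{
    // e.g. "RepairResource-DC1-1"; a later rename to "CleanupAndRepairResource-DC1-1"
    // would only change this string, plus the upgrade handling mentioned above.
    public static String resourceName(String datacenter, int index)
    {
        return String.format("RepairResource-%s-%d", datacenter, index);
    }

    // Try each repair slot in the data center until a lock is acquired,
    // or give up if all of them are busy.
    public static Optional<DistributedLock> tryLock(LockFactory lockFactory, String datacenter, int parallelism)
    {
        for (int i = 1; i <= parallelism; i++)
        {
            Optional<DistributedLock> lock = lockFactory.tryAcquire(resourceName(datacenter, i));
            if (lock.isPresent())
                return lock;
        }
        return Optional.empty();
    }
}
{code}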

bq. Created CASSANDRA-11190 for failing repairs fast and linked as a 
requirement of this ticket.
Great!

bq. No unless there is a bug. Repair messages are undroppable, and the nodes 
report the coordinator on failure.
bq. We could probably handle explicit failures in CASSANDRA-11190 making sure 
all nodes are properly informed and abort their operations in case of failures 
in any of the nodes. The timeout in this context could be helpful in case of 
hangs in streaming or validation. But I suppose that as the protocol becomes 
more mature/correct and with fail fast in place these hanging situations will 
become more rare so I'm not sure timeouts would be required if we assume there 
are no hangs. I guess we can leave them out of the initial version for 
simplicity and add them later if necessary.
I think the timeout might be good to have to prevent a hang from stopping the 
entire repair process. But I think it would only work if the repair hangs 
occasionally; otherwise the same repair would be retried until it is marked as 
a "fail". Another option is to have a "slow repair" detector that would log a 
warning if a repair session is taking too long, without aborting it in case it 
is actually repairing, and leave it up to the user to handle. Either way I'd 
say it's out of the scope of the initial version.
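
To make the "slow repair" idea concrete, the detector could be as simple as 
the sketch below (the names and threshold handling are just assumptions on my 
part, nothing from the current patch): a periodic check that only warns and 
never aborts.

{code}
import java.util.concurrent.TimeUnit;

import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

// Illustrative sketch of a "slow repair" detector: warn when a session has been
// running longer than a threshold, but leave the decision to abort to the user.
public final class SlowRepairDetector
{
    private static final Logger logger = LoggerFactory.getLogger(SlowRepairDetector.class);

    private final long warnThresholdMillis;

    public SlowRepairDetector(long warnThreshold, TimeUnit unit)
    {
        this.warnThresholdMillis = unit.toMillis(warnThreshold);
    }

    // Called periodically for each running repair session; returns true if a warning was logged.
    public boolean check(String sessionId, long sessionStartMillis)
    {
        long elapsed = System.currentTimeMillis() - sessionStartMillis;
        if (elapsed > warnThresholdMillis)
        {
            logger.warn("Repair session {} has been running for {} ms (threshold {} ms); " +
                        "it might be hanging or just repairing a lot of data",
                        sessionId, elapsed, warnThresholdMillis);
            return true;
        }
        return false;
    }
}
{code}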

---

We might also want to be able to detect when it would be impossible to repair 
the whole cluster within gc grace and report it to the user. This could happen 
for multiple reasons, like too many tables, too many nodes, too few parallel 
repairs or simply overload. I guess it would be hard to make accurate 
predictions with all of these variables, so it might be good enough to go 
through the repair history, estimate the time for a full repair cycle and 
compare it to gc grace? I think this is out of scope for the first version, 
but I thought I'd mention it here so we don't forget it.
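
As a rough idea of the estimation, it could be as simple as summing the 
durations from the repair history and comparing against gc grace, something 
like the sketch below (all names are hypothetical, just to show the 
comparison):

{code}
import java.util.List;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch: estimate whether a full repair cycle fits within gc grace,
// based only on observed repair durations rather than modeling tables/nodes/load.
public final class RepairCapacityEstimator
{
    public static boolean canRepairWithinGcGrace(List<Long> recentRepairDurationsMillis,
                                                 int parallelRepairs,
                                                 int gcGraceSeconds)
    {
        // Total time spent in the last full cycle, divided by how many repairs run in parallel.
        long totalMillis = recentRepairDurationsMillis.stream().mapToLong(Long::longValue).sum();
        long estimatedCycleMillis = totalMillis / Math.max(1, parallelRepairs);
        return estimatedCycleMillis <= TimeUnit.SECONDS.toMillis(gcGraceSeconds);
    }
}
{code}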

Should we maybe compile a list of "features that should be in the initial 
version" and an "improvements" list for future work, to make the scope clear?

> Automatic repair scheduling
> ---------------------------
>
>                 Key: CASSANDRA-10070
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-10070
>             Project: Cassandra
>          Issue Type: Improvement
>            Reporter: Marcus Olsson
>            Assignee: Marcus Olsson
>            Priority: Minor
>             Fix For: 3.x
>
>         Attachments: Distributed Repair Scheduling.doc
>
>
> Scheduling and running repairs in a Cassandra cluster is most often a 
> required task, but this can both be hard for new users and it also requires a 
> bit of manual configuration. There are good tools out there that can be used 
> to simplify things, but wouldn't this be a good feature to have inside of 
> Cassandra? To automatically schedule and run repairs, so that when you start 
> up your cluster it basically maintains itself in terms of normal 
> anti-entropy, with the possibility for manual configuration.


